ci(benchmark): compare PR against a same-runner main build#886
Conversation
The PR "Benchmark Results" comment classified benchmarks by comparing the PR run against a cached `main` baseline measured on a different runner at a different time. Cross-runner variance dwarfs each run's own min/max spread, so the range-overlap classifier flipped large numbers of unrelated benchmarks to improved/regressed (e.g. #805, a regex-only change, reported non-regex benchmarks as improved). The verdicts were not actionable. Build the PR's base commit in a new `build-main` job and benchmark that `main` build and the PR build back-to-back on the same runner, after a warm-up discard of both binaries, so deltas reflect the diff rather than infrastructure. The range-overlap classifier is unchanged; it now compares two runs measured under the same conditions. The `main`-base run is `continue-on-error` so a PR that adds a benchmark `main` cannot run is not blocked (it degrades to new). The comparison stays advisory (comment-only). - Extract the ~200-line inline comparison into scripts/benchmark-compare.js with unit tests (scripts/test-benchmark-compare.ts), wired into both cli jobs; report the measured median per-run variance as a noise floor. - Regenerate the deterministic profile diff from the same `main` build and remove the dead cross-runner benchmark/profile baseline caches. - Add ADR 0076 and update docs/build-system.md and docs/benchmarks.md. Closes #815 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe PR moves benchmark comparison to a same-runner main-versus-PR flow, adds a shared comparison script and tests, updates CI and PR workflows to produce and consume main artifacts, and refreshes related docs and ADR references. ChangesSame-runner benchmark comparison
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Comment |
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Benchmark Results430 benchmarks · PR vs same-runner Interpreted: 🟢 33 improved · 🔴 30 regressed · 367 unchanged · avg +1.1% Typical per-run noise (median variance): interpreted ±2.8%, bytecode ±2.1%. Deltas within noise overlap and read as unchanged. arraybuffer.js — Interp: 🔴 2, 12 unch. · avg -1.4% · Bytecode: 🟢 2, 🔴 1, 11 unch. · avg +2.2%
arrays.js — Interp: 19 unch. · avg +1.3% · Bytecode: 🟢 1, 🔴 2, 16 unch. · avg -0.5%
async-await.js — Interp: 6 unch. · avg +0.4% · Bytecode: 6 unch. · avg +1.4%
async-generators.js — Interp: 2 unch. · avg +5.3% · Bytecode: 🟢 1, 1 unch. · avg +0.2%
atomics.js — Interp: 6 unch. · avg +2.2% · Bytecode: 🟢 1, 🔴 1, 4 unch. · avg -1.6%
base64.js — Interp: 10 unch. · avg +0.7% · Bytecode: 🟢 1, 🔴 2, 7 unch. · avg -0.9%
classes.js — Interp: 🔴 5, 26 unch. · avg -1.3% · Bytecode: 🟢 5, 🔴 1, 25 unch. · avg +2.2%
closures.js — Interp: 🔴 1, 10 unch. · avg +0.5% · Bytecode: 11 unch. · avg +1.5%
collections.js — Interp: 🟢 2, 10 unch. · avg +1.5% · Bytecode: 12 unch. · avg -1.4%
csv.js — Interp: 🟢 1, 🔴 1, 11 unch. · avg -2.1% · Bytecode: 🟢 1, 12 unch. · avg +0.8%
destructuring.js — Interp: 🟢 1, 🔴 5, 16 unch. · avg -1.0% · Bytecode: 🟢 2, 🔴 3, 17 unch. · avg +0.0%
fibonacci.js — Interp: 8 unch. · avg +0.5% · Bytecode: 8 unch. · avg -2.5%
float16array.js — Interp: 🟢 4, 🔴 1, 27 unch. · avg -1.0% · Bytecode: 🔴 7, 25 unch. · avg -1.7%
for-of.js — Interp: 🟢 1, 6 unch. · avg +4.4% · Bytecode: 🔴 1, 6 unch. · avg -0.2%
generators.js — Interp: 🔴 1, 3 unch. · avg -1.4% · Bytecode: 4 unch. · avg +2.3%
intl.js — Interp: 6 unch. · avg -1.5% · Bytecode: 🔴 1, 5 unch. · avg -0.7%
iterators.js — Interp: 🟢 8, 34 unch. · avg +3.5% · Bytecode: 🟢 3, 🔴 6, 33 unch. · avg +0.3%
json.js — Interp: 🟢 2, 🔴 2, 16 unch. · avg +1.8% · Bytecode: 🔴 6, 14 unch. · avg -1.8%
jsx.jsx — Interp: 🔴 2, 19 unch. · avg -2.0% · Bytecode: 🟢 2, 🔴 3, 16 unch. · avg -1.2%
modules.js — Interp: 9 unch. · avg +1.5% · Bytecode: 9 unch. · avg -0.5%
numbers.js — Interp: 11 unch. · avg -1.6% · Bytecode: 🟢 3, 8 unch. · avg +5.4%
objects.js — Interp: 7 unch. · avg +2.3% · Bytecode: 🔴 1, 6 unch. · avg -2.5%
promises.js — Interp: 12 unch. · avg +0.3% · Bytecode: 🟢 3, 9 unch. · avg +4.0%
property-access.js — Interp: 🔴 1, 4 unch. · avg -3.8% · Bytecode: 5 unch. · avg +1.3%
regexp.js — Interp: 11 unch. · avg -2.0% · Bytecode: 🟢 1, 🔴 1, 9 unch. · avg +1.5%
strings.js — Interp: 🔴 1, 18 unch. · avg -0.8% · Bytecode: 🔴 3, 16 unch. · avg -1.4%
temporal.js — Interp: 🟢 1, 5 unch. · avg -0.4% · Bytecode: 6 unch. · avg -1.9%
tsv.js — Interp: 9 unch. · avg +1.0% · Bytecode: 🟢 2, 7 unch. · avg +1.3%
typed-arrays.js — Interp: 🟢 7, 15 unch. · avg +18.9% · Bytecode: 🟢 6, 🔴 3, 13 unch. · avg +20.1%
uint8array-encoding.js — Interp: 🟢 1, 🔴 6, 11 unch. · avg -14.7% · Bytecode: 🔴 4, 14 unch. · avg -8.9%
weak-collections.js — Interp: 🟢 5, 🔴 2, 8 unch. · avg +16.0% · Bytecode: 🟢 7, 8 unch. · avg +28.9%
Deterministic profile diffDeterministic profile diff: no significant changes. Measured on ubuntu-latest x64. Each PR run also builds the |
Suite TimingTest Runner (interpreted: 10,948 passed; bytecode: 10,948 passed)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Test runner worker shutdown frees thread-local heaps in bulk; that shutdown reclamation is not counted as GC collections or collected objects.
Benchmarks (interpreted: 430; bytecode: 430)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Benchmark runner performs explicit between-file collections, so collection and collected-object counts can be much higher than the test runner.
Measured on ubuntu-latest x64. |
test262 Conformance
Areas closest to 100%
Per-test deltas (+1 / -0)Newly passing (1):
Steady-state failures are non-blocking; regressions vs the cached main baseline (lower total pass count, or any PASS → non-PASS transition) fail the conformance gate. Measured on ubuntu-latest x64, bytecode mode. Areas grouped by the first two test262 path components; minimum 25 attempted tests, areas already at 100% excluded. Δ vs main compares against the most recent cached |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/pr.yml:
- Line 196: The benchmark workflow is overly gated by the build-main job,
causing benchmark and benchmark-comment to be skipped when the base build fails.
Update the pr.yml workflow so benchmark can still run and post PR-only feedback
even if build-main fails, while preserving the build-main ordering dependency
for available artifacts. Use the benchmark, benchmark-comment, and build-main
job definitions to keep the downstream graceful-fallback path in
buildBenchmarkComparison reachable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 33c9903d-14e0-4950-850a-a6a8bf9245a4
📒 Files selected for processing (9)
.github/workflows/ci.yml.github/workflows/pr.ymldocs/adr/0076-same-runner-benchmark-comparison.mddocs/adr/README.mddocs/benchmarks.mddocs/build-system.mdscripts/benchmark-compare.jsscripts/profile-diff.jsscripts/test-benchmark-compare.ts
The benchmark job hard-depended on build-main (needs: [build, build-main]), so a build-main failure (base predating the benchmarkrunner target, or a transient FPC install hiccup) skipped benchmark and, transitively, benchmark-comment — the PR got no benchmark feedback at all, even though build-pr succeeded. That made the downstream graceful-degradation path (optional main downloads + null-main handling in buildBenchmarkComparison) unreachable. Keep the build-main ordering dependency but run benchmark whenever `build` itself succeeds (if: !cancelled() && needs.build.result == 'success'), and make the main-build inputs optional: continue-on-error on the build-main download, a tolerant chmod, and if-no-files-found: ignore on the main result and profile uploads. When build-main fails the comparison now posts PR-only numbers instead of going silent. Addresses CodeRabbit review feedback on #886. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The benchmark (interpreted) leg timed out after 51 minutes on #886: the PR measured pass hung while the main pass (same workloads, same runner) had just completed in ~3 min. Locally the full interpreted suite runs in ~2 min, so this is a non-deterministic per-benchmark calibration runaway on a loaded GitHub runner, not a benchmark bug. The trigger was self-inflicted: the two-binary warm-up ran with the default 200 ms calibration (no override), making it ~a full extra pass per binary — roughly 2.5x the interpreted work and heavy runner load right before the final measured pass, which is when calibration is most likely to misfire. There was also no timeout, so a runaway burned 51 min and red-marked the PR. - Replace the two-binary warm-up with a single cheap one (PR binary, ROUNDS=1 + CALIBRATION_MS=10): still ramps the CPU and warms the shared benchmark-file cache, but light and low-load. The runner's per-benchmark warm-up (5 iters) covers binary-specific warmth, so one binary suffices. - Wrap each pass in `timeout -k 30 480` (8 min cap; normal ~3 min). A killed main pass writes no JSON and degrades to PR-only; a killed PR pass fails fast instead of spinning. - Add `timeout-minutes: 30` to the benchmark job as a hard backstop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
mainbaseline cached from a different job, on a different GitHub-hosted runner, at a different time. Cross-runner variance is systematically larger than each run's own min/max spread, so the range-overlap classifier flipped large numbers of unrelated benchmarks to 🟢 improved / 🔴 regressed — e.g. perf(regexp): memoize the decoded subject across regex VM calls #805 (regex-decode only) reportedarraybuffer.js/arrays.js, which use no regex, as broadly improved.build-mainjob builds the benchmark runner from the PR's base commit (github.event.pull_request.base.sha), and thebenchmarkjob benchmarks thatmainbuild and the PR build back-to-back on the same runner, after a warm-up discard of both binaries. The range-overlap classifier is unchanged — it now compares two runs measured under the same conditions, so the systematic between-runner offset cancels.mainbuild; the dead cross-runnerbenchmark-*-baseline-*andprofile-baseline-*caches (ci.ymlsaves +pr.ymlrestores) are removed. The separatetest262-baselinecache is untouched.github-scriptcomparison moved toscripts/benchmark-compare.jswith unit tests (scripts/test-benchmark-compare.ts) wired into bothclijobs; the comment now reports the measured median per-run variance as a noise floor.Key constraints / non-goals (confirmed via grilling):
mainin every PR run roughly doubles the benchmark job's CI time — the accepted price of a trustworthy comparison.VISION.mddeprioritizes raw throughput; test262 remains the gating regression check.main-base run iscontinue-on-errorso a PR that adds/changes a benchmarkmaincannot run is not blocked — such benchmarks degrade to 🆕 new. The wholemainbuild is optional too: ifbuild-mainfails,benchmarkstill runs (if: !cancelled() && needs.build.result == 'success') and posts PR-only numbers rather than going silent.Testing
scripts/test-benchmark-compare.ts— 17 assertions incl. the Benchmark PR comparison is unreliable — baseline is a cached run from a different runner, so deltas reflect runner variance, not the diff #815 case (same-runner overlap →unchanged, not a spurious 🟢) and the no-main degradation path.GocciaTestRunner testsand--mode=bytecode— 10944/10944 pass in both modes.actionlint— no new findings vs theorigin/mainbaseline../format.pas --checkclean. All doc checks + markdownlint green (pre-commit).docs/build-system.md,docs/benchmarks.md; corrected stale cache-save descriptions in both docs.Closes #815