frostney · frostney · Jun 27, 2026 · Jun 27, 2026 · Jun 27, 2026 · Jun 27, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -779,7 +779,10 @@ jobs:
           GOCCIA_BENCH_ROUNDS: 7
         run: ./build/GocciaBenchmarkRunner benchmarks ${{ matrix.mode == 'bytecode' && '--mode=bytecode' || '' }} --no-progress
 
-      - name: Run benchmarks and save baseline
+      # main/x86_64-linux emits JSON so the next step can validate the benchmark
+      # report shape. PR comparison no longer reads a cached baseline from here —
+      # each PR builds and benchmarks `main` on its own runner (issue #815).
+      - name: Run benchmarks (validate JSON output)
         if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux'
         env:
           GOCCIA_BENCH_CALIBRATION_MS: 100
@@ -818,13 +821,6 @@ jobs:
             sys.exit(1)
           PY
 
-      - name: Cache benchmark baseline
-        if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux'
-        uses: actions/cache/save@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5
-        with:
-          path: benchmark-${{ matrix.mode }}-results.json
-          key: benchmark-${{ matrix.mode }}-baseline-${{ github.sha }}
-
       - uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6 # v2
         if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux' && matrix.mode == 'bytecode'
 
@@ -855,13 +851,6 @@ jobs:
             --summary-output benchmark-profile-aggregate.json \
             --markdown-output benchmark-profile-aggregate.md
 
-      - name: Cache deterministic profile baseline
-        if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux' && matrix.mode == 'bytecode'
-        uses: actions/cache/save@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5
-        with:
-          path: profile-baseline/
-          key: profile-baseline-${{ github.sha }}
-
       - name: Upload benchmark profile
         if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux' && matrix.mode == 'bytecode' && hashFiles('benchmark-profile-aggregate.json') != '' && hashFiles('profile-baseline/**') != ''
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7
@@ -949,6 +938,9 @@ jobs:
       - name: Check embedded resource CLI behaviour
         run: bun run scripts/test-cli-embedded-resources.ts
 
+      - name: Check benchmark comparison logic
+        run: bun run scripts/test-benchmark-compare.ts
+
       - name: Check Windows TLS dependencies
         if: runner.os == 'Windows'
         shell: pwsh

diff --git a/.github/workflows/pr.yml b/.github/workflows/pr.yml
diff --git a/docs/adr/0076-same-runner-benchmark-comparison.md b/docs/adr/0076-same-runner-benchmark-comparison.md
@@ -0,0 +1,14 @@
+# Same-runner benchmark comparison
+
+**Date:** 2026-06-27
+**Area:** `ci` / `tooling`
+**Issue:** [#815](https://github.com/frostney/GocciaScript/issues/815)
+**Pull Request:** [#886](https://github.com/frostney/GocciaScript/pull/886)
+
+The PR "Benchmark Results" comment previously classified benchmarks as 🟢 improved / 🔴 regressed by comparing the PR's freshly-measured run against a **cached baseline measured in a different job, on a different GitHub-hosted runner, at a different time** (whatever `main` last pushed). The baseline was written by `ci.yml`'s `benchmark` job via `actions/cache/save` and restored in `pr.yml`'s `benchmark-comment` job by `restore-keys` prefix. Cross-runner/cross-time variance is systematically larger than each run's own min/max spread, so the range-overlap classifier (improved iff PR min > baseline max) flipped large numbers of benchmarks to improved/regressed even when the diff could not possibly affect them — e.g. PR [#805](https://github.com/frostney/GocciaScript/pull/805) (regex-decode only) reported `arraybuffer.js` and `arrays.js`, which use no regex, as broadly improved. The verdicts were confident but not actionable, and a real regression was indistinguishable from runner noise.
+
+We now build the PR's **base commit** (`github.event.pull_request.base.sha` — the `main` the PR targets) in a dedicated `build-main` job, and the `benchmark` job benchmarks that `main` build and the PR build **back-to-back on the same runner**, after a discarded warm-up run that sheds runner cold-start (page faults, sub-turbo clocks). The range-overlap classifier is unchanged; it now compares two runs measured under the same conditions, so the systematic between-runner offset cancels and only within-run noise remains — which is exactly what range overlap is designed to absorb. The deterministic profile diff (opcode/allocation counts) is regenerated from the same `main` build instead of a cached profile, making the whole comparison self-contained; the cross-runner `benchmark-*-baseline-*` and `profile-baseline-*` caches in `ci.yml` and their restores in `pr.yml` are removed. The ~200-line comparison previously inlined in workflow YAML moved to [`scripts/benchmark-compare.js`](../../scripts/benchmark-compare.js) with unit tests in `scripts/test-benchmark-compare.ts`, so the classifier semantics — including the #815 false-positive case — are locked by a test rather than embedded in `pr.yml`.
+
+The explicit trade-off is cost: building and benchmarking `main` inside every PR run roughly **doubles the benchmark job's CI time** (one extra `--prod` build of the benchmark runner plus a second benchmark pass per mode). That is the accepted price of a trustworthy comparison. The comparison stays **advisory (comment-only)**, not a merge-blocking gate: even same-runner measurement has a noise floor (re-running the identical `--prod` binary showed ±11–15%, up to ±59%), `VISION.md` (line 43) deprioritizes raw throughput, and test262 remains the gating regression check. The range-overlap rule absorbs that floor as "unchanged overlap," and the comment now reports the measured median per-run variance so readers can judge whether a sub-noise delta is meaningful.
+
+Alternatives considered and rejected: **normalizing the cached cross-runner baseline** to an in-run calibration reference — cheaper, but a single calibration benchmark cannot cancel per-workload-class variance (memory-bound vs CPU-bound runners scale differently), so false verdicts shrink rather than disappear; a **consistent self-hosted runner** — removes cross-runner variance structurally but adds infrastructure burden and runs untrusted PR code on owned hardware in a public repo; and **demoting timing entirely** in favour of the already-trustworthy deterministic profile diff — honest and cheap, but the deterministic diff only catches regressions that change instruction or allocation counts, not a pure wall-clock-per-op slowdown, which is part of what the gate exists to surface. Per-file interleaving of the two binaries (the issue's "ideally interleaved") is a possible future refinement if residual same-runner drift proves material; the reported noise floor makes such drift visible. See [docs/build-system.md](../build-system.md) `benchmark` job reference and [VISION.md](../../VISION.md).
diff --git a/docs/adr/README.md b/docs/adr/README.md
@@ -85,3 +85,4 @@ Durable architecture and implementation decisions for GocciaScript. New ADRs use
 - [0073 — Opt-in ShadowRealm](0073-opt-in-shadowrealm.md)
 - [0074 — Deferred bytecode call-stack frames](0074-deferred-bytecode-call-stack-frames.md)
 - [0075 — ShadowRealm full test262 conformance](0075-shadowrealm-conformance.md)
+- [0076 — Same-runner benchmark comparison](0076-same-runner-benchmark-comparison.md)
diff --git a/docs/benchmarks.md b/docs/benchmarks.md
@@ -260,10 +260,10 @@ When a benchmark has a `setup` or `teardown` function, a second line displays th
 Benchmarks run as part of the CI pipeline in both **interpreter mode** and
 **bytecode mode**. CI uses `GOCCIA_BENCH_CALIBRATION_MS=100` and
 `GOCCIA_BENCH_ROUNDS=7` for stable measurements with IQR outlier filtering. On
-pushes to `main`, the ubuntu-latest x64 runner saves JSON baselines for each
-mode (`benchmark-interpreted-results.json` and `benchmark-bytecode-results.json`)
-to the GitHub Actions cache. See [testing.md](testing.md#ci-integration) for the
-full pipeline overview.
+pushes to `main`, the ubuntu-latest x64 runner emits JSON and validates the
+report shape. PR comparison builds and benchmarks `main` on the PR's own runner
+rather than reading a cached baseline ([ADR 0076](adr/0076-same-runner-benchmark-comparison.md)).
+See [testing.md](testing.md#ci-integration) for the full pipeline overview.
 
 Main bytecode benchmark runs also retain deterministic VM profile reports. The
 GitHub Actions artifact is named `benchmark-profile` and contains:
@@ -284,13 +284,15 @@ paths are `benchmark-profiles/runs/<artifactId>/aggregate.json.gz`,
 
 ### PR Benchmark Comparison
 
-The PR workflow (`.github/workflows/pr.yml`) restores the cached baseline JSON from `main`, runs the benchmark matrix, and posts a collapsible comparison comment on the PR. Each benchmark file gets a **unified table** with both execution modes side by side:
+The PR workflow (`.github/workflows/pr.yml`) builds the PR's base commit (`main`) in a `build-main` job and benchmarks that `main` build and the PR build **back-to-back on the same runner** (after a discarded warm-up), so deltas reflect the diff rather than cross-runner variance (issue #815, [ADR 0076](adr/0076-same-runner-benchmark-comparison.md)). The comparison logic lives in [`scripts/benchmark-compare.js`](../scripts/benchmark-compare.js) (unit-tested by `scripts/test-benchmark-compare.ts`). It posts a collapsible comparison comment on the PR; each benchmark file gets a **unified table** with both execution modes side by side:
 
-- Each table row shows `| Benchmark | Interpreted | Δ | Bytecode | Δ |` with the point estimate and cached/PR min-max range in the form `10,000 ops/sec [9,500..10,500] → 9,200 ops/sec [8,700..9,700]`
-- Classification uses **range overlap** instead of a single point value: if the PR run sits fully above the baseline range it is improved, if it sits fully below it is regressed, and overlapping ranges are treated as unchanged noise
+- Each table row shows `| Benchmark | Interpreted (main → PR) | Δ | Bytecode (main → PR) | Δ |` with the point estimate and same-runner `main`/PR min-max range in the form `10,000 ops/sec [9,500..10,500] → 9,200 ops/sec [8,700..9,700]`
+- Classification uses **range overlap** instead of a single point value: if the PR run sits fully above the `main` range it is improved, if it sits fully below it is regressed, and overlapping ranges are treated as unchanged noise
 - Percentage deltas remain in the `Δ` column as secondary context, even when the classifier marks a benchmark as `~ overlap`
+- The overall summary reports the measured **median per-run variance** as a noise floor, so sub-noise deltas are read as unchanged
 - Results are **grouped by file**, each in a collapsible `<details>` section
 - Files with significant changes (improvements or regressions) are auto-expanded
 - Each file summary shows per-mode counts (e.g., `Interp: 🟢 1, 7 unch. · Bytecode: 🟢 2, 6 unch.`)
 - The **overall PR summary** shows per-mode totals on separate lines with average percentage deltas
-- 🟢 marks non-overlapping improvements, 🔴 marks non-overlapping regressions, `~ overlap` marks overlapping ranges, and 🆕 marks new benchmarks with no baseline
+- The comparison is **advisory** (comment-only, never merge-blocking)
+- 🟢 marks non-overlapping improvements, 🔴 marks non-overlapping regressions, `~ overlap` marks overlapping ranges, and 🆕 marks new benchmarks absent from the `main` build
diff --git a/docs/build-system.md b/docs/build-system.md
@@ -669,7 +669,7 @@ Runs on the full platform matrix:
 
 **`test262`** (needs build, ubuntu-latest x64 only, **non-blocking**) — Downloads the `gocciascript-x86_64-linux` build, checks out [`tc39/test262`](https://github.com/tc39/test262) at the SHA pinned in `scripts/test262-suite-sha.txt`, runs `bun scripts/run_test262_suite.ts --suite-dir test262-suite --mode=bytecode --jobs=4 --timeout-ms=20000 --output=test262-results.json`, and uploads the report as a 30-day workflow artifact. The run step uses `continue-on-error: true` because the conformance lane is allowed to carry known steady-state failures while the engine closes compatibility gaps. On main, the JSON is also stashed via `actions/cache/save` under `test262-baseline-<sha>` so the PR workflow can compute Δ vs main, and `cd website && bun run publish-test262 ../test262-results.json` publishes the compressed run report plus its UTC daily dashboard pointer to Vercel Blob when `BLOB_READ_WRITE_TOKEN` is configured. Main runs also upload the `test262-profile` artifact and publish matching aggregate/detail profile payloads under the separate `test262-profiles/` Blob namespace for weekly performance review. The one-off `cd website && bun run backfill-test262` command seeds retained artifact reports and reruns expired historical days directly into Blob. The pin is bumped weekly by `.github/workflows/test262-bump.yml`. See [docs/test262.md](test262.md) for the harness and profile report contracts.
 
-**`benchmark`** (needs build) — Runs all benchmarks on all platforms. On main (ubuntu-latest x64), saves benchmark results as JSON to `actions/cache` for PR comparison. The main bytecode benchmark lane also captures deterministic VM profile details, uploads the `benchmark-profile` artifact, and publishes aggregate/detail profile payloads under the separate `benchmark-profiles/` Blob namespace when `BLOB_READ_WRITE_TOKEN` is configured.
+**`benchmark`** (needs build) — Runs all benchmarks on all platforms. On main (ubuntu-latest x64), it additionally emits JSON and validates the report shape. PR comparison no longer reads a cached baseline from here — each PR builds and benchmarks `main` on its own runner ([ADR 0076](adr/0076-same-runner-benchmark-comparison.md)). The main bytecode benchmark lane also captures deterministic VM profile details, uploads the `benchmark-profile` artifact, and publishes aggregate/detail profile payloads under the separate `benchmark-profiles/` Blob namespace when `BLOB_READ_WRITE_TOKEN` is configured.
 
 **`cli`** (needs build) — Downloads pre-built binaries and runs CLI behavior smoke tests on all platforms via Bun: options across all apps, lexer numeric-separator rejection, parser error display, config-file loading, and app-specific features including Sandbox Runner seed baselines. Windows runs additionally assert that the loader binary does not link OpenSSL DLLs (HTTPS must use the platform TLS stack statically).
 
@@ -692,7 +692,7 @@ Runs on **ubuntu-latest x64 only** (single runner, no matrix).
 
 **`test`** (needs build) — Runs all JavaScript tests and Pascal unit tests.
 
-**`benchmark`** (needs build) — Restores the cached benchmark baseline JSON from main, runs all benchmarks with JSON output, and posts a collapsible comparison comment on the PR grouped by file. Each file section shows the cached baseline and PR `opsPerSec` point estimates side by side, with each point estimate carrying its min/max ops/sec range in brackets. Classification uses range overlap: fully above the baseline range is an improvement, fully below is a regression, and overlapping ranges are treated as unchanged noise. Percentage deltas are still shown as secondary context, and files with significant changes are auto-expanded. If no baseline exists, shows results without comparison.
+**`benchmark`** (needs build + build-main) — Compares the PR against a `main` build measured **on the same runner**. A separate `build-main` job builds the benchmark runner from the PR's base commit (`github.event.pull_request.base.sha`); the `benchmark` job then warms up the runner (discarded) and benchmarks the `main` build and the PR build back-to-back, so deltas reflect the diff rather than cross-runner variance (issue #815, [ADR 0076](adr/0076-same-runner-benchmark-comparison.md)). It posts a collapsible comparison comment grouped by file: each file section shows the same-runner `main` and PR `opsPerSec` point estimates side by side, each carrying its min/max ops/sec range in brackets. Classification (in [`scripts/benchmark-compare.js`](../scripts/benchmark-compare.js), unit-tested by `scripts/test-benchmark-compare.ts`) uses range overlap: fully above the `main` range is an improvement, fully below is a regression, and overlapping ranges are treated as unchanged noise. The comment also reports the measured median per-run variance as a noise floor. Percentage deltas are secondary context, files with significant changes are auto-expanded, and the comparison is advisory (comment-only, never merge-blocking). The deterministic profile diff (opcode/allocation counts) is regenerated from the same `main` build. If no `main` build is available, shows results without comparison.
 
 **`test262`** (needs build, **non-blocking**) — Checks out `tc39/test262` at the pinned SHA, runs `bun scripts/run_test262_suite.ts --suite-dir test262-suite --mode=bytecode --jobs=4 --timeout-ms=20000 --output=test262-results.json`, and uploads the JSON report. Failing tests do not fail the job. The downstream `test262-comment` job (`if: always()`) restores the most recent `test262-baseline-` cache entry from main, then `bun scripts/run_test262_suite.ts --comment test262-results.json <baseline>` builds the markdown body and the workflow posts/updates a comment using marker `<!-- test262-results -->`. The comment shows a per-category breakdown (built-ins, harness, intl402, language, staging) with Δ-vs-main columns when a baseline is cached, an "Areas closest to 100%" sub-table, and a collapsible per-test delta list. PR CI does not generate or upload the full-corpus test262 profile artifact.