Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 7 additions & 15 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -779,7 +779,10 @@ jobs:
GOCCIA_BENCH_ROUNDS: 7
run: ./build/GocciaBenchmarkRunner benchmarks ${{ matrix.mode == 'bytecode' && '--mode=bytecode' || '' }} --no-progress

- name: Run benchmarks and save baseline
# main/x86_64-linux emits JSON so the next step can validate the benchmark
# report shape. PR comparison no longer reads a cached baseline from here β€”
# each PR builds and benchmarks `main` on its own runner (issue #815).
- name: Run benchmarks (validate JSON output)
if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux'
env:
GOCCIA_BENCH_CALIBRATION_MS: 100
Expand Down Expand Up @@ -818,13 +821,6 @@ jobs:
sys.exit(1)
PY

- name: Cache benchmark baseline
if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux'
uses: actions/cache/save@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5
with:
path: benchmark-${{ matrix.mode }}-results.json
key: benchmark-${{ matrix.mode }}-baseline-${{ github.sha }}

- uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6 # v2
if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux' && matrix.mode == 'bytecode'

Expand Down Expand Up @@ -855,13 +851,6 @@ jobs:
--summary-output benchmark-profile-aggregate.json \
--markdown-output benchmark-profile-aggregate.md

- name: Cache deterministic profile baseline
if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux' && matrix.mode == 'bytecode'
uses: actions/cache/save@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5
with:
path: profile-baseline/
key: profile-baseline-${{ github.sha }}

- name: Upload benchmark profile
if: github.ref == 'refs/heads/main' && matrix.target == 'x86_64-linux' && matrix.mode == 'bytecode' && hashFiles('benchmark-profile-aggregate.json') != '' && hashFiles('profile-baseline/**') != ''
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7
Expand Down Expand Up @@ -949,6 +938,9 @@ jobs:
- name: Check embedded resource CLI behaviour
run: bun run scripts/test-cli-embedded-resources.ts

- name: Check benchmark comparison logic
run: bun run scripts/test-benchmark-compare.ts

- name: Check Windows TLS dependencies
if: runner.os == 'Windows'
shell: pwsh
Expand Down
397 changes: 166 additions & 231 deletions .github/workflows/pr.yml

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions docs/adr/0076-same-runner-benchmark-comparison.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Same-runner benchmark comparison

**Date:** 2026-06-27
**Area:** `ci` / `tooling`
**Issue:** [#815](https://github.com/frostney/GocciaScript/issues/815)
**Pull Request:** [#886](https://github.com/frostney/GocciaScript/pull/886)

The PR "Benchmark Results" comment previously classified benchmarks as 🟒 improved / πŸ”΄ regressed by comparing the PR's freshly-measured run against a **cached baseline measured in a different job, on a different GitHub-hosted runner, at a different time** (whatever `main` last pushed). The baseline was written by `ci.yml`'s `benchmark` job via `actions/cache/save` and restored in `pr.yml`'s `benchmark-comment` job by `restore-keys` prefix. Cross-runner/cross-time variance is systematically larger than each run's own min/max spread, so the range-overlap classifier (improved iff PR min > baseline max) flipped large numbers of benchmarks to improved/regressed even when the diff could not possibly affect them β€” e.g. PR [#805](https://github.com/frostney/GocciaScript/pull/805) (regex-decode only) reported `arraybuffer.js` and `arrays.js`, which use no regex, as broadly improved. The verdicts were confident but not actionable, and a real regression was indistinguishable from runner noise.

We now build the PR's **base commit** (`github.event.pull_request.base.sha` β€” the `main` the PR targets) in a dedicated `build-main` job, and the `benchmark` job benchmarks that `main` build and the PR build **back-to-back on the same runner**, after a discarded warm-up run that sheds runner cold-start (page faults, sub-turbo clocks). The range-overlap classifier is unchanged; it now compares two runs measured under the same conditions, so the systematic between-runner offset cancels and only within-run noise remains β€” which is exactly what range overlap is designed to absorb. The deterministic profile diff (opcode/allocation counts) is regenerated from the same `main` build instead of a cached profile, making the whole comparison self-contained; the cross-runner `benchmark-*-baseline-*` and `profile-baseline-*` caches in `ci.yml` and their restores in `pr.yml` are removed. The ~200-line comparison previously inlined in workflow YAML moved to [`scripts/benchmark-compare.js`](../../scripts/benchmark-compare.js) with unit tests in `scripts/test-benchmark-compare.ts`, so the classifier semantics β€” including the #815 false-positive case β€” are locked by a test rather than embedded in `pr.yml`.

The explicit trade-off is cost: building and benchmarking `main` inside every PR run roughly **doubles the benchmark job's CI time** (one extra `--prod` build of the benchmark runner plus a second benchmark pass per mode). That is the accepted price of a trustworthy comparison. The comparison stays **advisory (comment-only)**, not a merge-blocking gate: even same-runner measurement has a noise floor (re-running the identical `--prod` binary showed Β±11–15%, up to Β±59%), `VISION.md` (line 43) deprioritizes raw throughput, and test262 remains the gating regression check. The range-overlap rule absorbs that floor as "unchanged overlap," and the comment now reports the measured median per-run variance so readers can judge whether a sub-noise delta is meaningful.

Alternatives considered and rejected: **normalizing the cached cross-runner baseline** to an in-run calibration reference β€” cheaper, but a single calibration benchmark cannot cancel per-workload-class variance (memory-bound vs CPU-bound runners scale differently), so false verdicts shrink rather than disappear; a **consistent self-hosted runner** β€” removes cross-runner variance structurally but adds infrastructure burden and runs untrusted PR code on owned hardware in a public repo; and **demoting timing entirely** in favour of the already-trustworthy deterministic profile diff β€” honest and cheap, but the deterministic diff only catches regressions that change instruction or allocation counts, not a pure wall-clock-per-op slowdown, which is part of what the gate exists to surface. Per-file interleaving of the two binaries (the issue's "ideally interleaved") is a possible future refinement if residual same-runner drift proves material; the reported noise floor makes such drift visible. See [docs/build-system.md](../build-system.md) `benchmark` job reference and [VISION.md](../../VISION.md).
1 change: 1 addition & 0 deletions docs/adr/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,3 +85,4 @@ Durable architecture and implementation decisions for GocciaScript. New ADRs use
- [0073 β€” Opt-in ShadowRealm](0073-opt-in-shadowrealm.md)
- [0074 β€” Deferred bytecode call-stack frames](0074-deferred-bytecode-call-stack-frames.md)
- [0075 β€” ShadowRealm full test262 conformance](0075-shadowrealm-conformance.md)
- [0076 β€” Same-runner benchmark comparison](0076-same-runner-benchmark-comparison.md)
18 changes: 10 additions & 8 deletions docs/benchmarks.md
Original file line number Diff line number Diff line change
Expand Up @@ -260,10 +260,10 @@ When a benchmark has a `setup` or `teardown` function, a second line displays th
Benchmarks run as part of the CI pipeline in both **interpreter mode** and
**bytecode mode**. CI uses `GOCCIA_BENCH_CALIBRATION_MS=100` and
`GOCCIA_BENCH_ROUNDS=7` for stable measurements with IQR outlier filtering. On
pushes to `main`, the ubuntu-latest x64 runner saves JSON baselines for each
mode (`benchmark-interpreted-results.json` and `benchmark-bytecode-results.json`)
to the GitHub Actions cache. See [testing.md](testing.md#ci-integration) for the
full pipeline overview.
pushes to `main`, the ubuntu-latest x64 runner emits JSON and validates the
report shape. PR comparison builds and benchmarks `main` on the PR's own runner
rather than reading a cached baseline ([ADR 0076](adr/0076-same-runner-benchmark-comparison.md)).
See [testing.md](testing.md#ci-integration) for the full pipeline overview.

Main bytecode benchmark runs also retain deterministic VM profile reports. The
GitHub Actions artifact is named `benchmark-profile` and contains:
Expand All @@ -284,13 +284,15 @@ paths are `benchmark-profiles/runs/<artifactId>/aggregate.json.gz`,

### PR Benchmark Comparison

The PR workflow (`.github/workflows/pr.yml`) restores the cached baseline JSON from `main`, runs the benchmark matrix, and posts a collapsible comparison comment on the PR. Each benchmark file gets a **unified table** with both execution modes side by side:
The PR workflow (`.github/workflows/pr.yml`) builds the PR's base commit (`main`) in a `build-main` job and benchmarks that `main` build and the PR build **back-to-back on the same runner** (after a discarded warm-up), so deltas reflect the diff rather than cross-runner variance (issue #815, [ADR 0076](adr/0076-same-runner-benchmark-comparison.md)). The comparison logic lives in [`scripts/benchmark-compare.js`](../scripts/benchmark-compare.js) (unit-tested by `scripts/test-benchmark-compare.ts`). It posts a collapsible comparison comment on the PR; each benchmark file gets a **unified table** with both execution modes side by side:

- Each table row shows `| Benchmark | Interpreted | Ξ” | Bytecode | Ξ” |` with the point estimate and cached/PR min-max range in the form `10,000 ops/sec [9,500..10,500] β†’ 9,200 ops/sec [8,700..9,700]`
- Classification uses **range overlap** instead of a single point value: if the PR run sits fully above the baseline range it is improved, if it sits fully below it is regressed, and overlapping ranges are treated as unchanged noise
- Each table row shows `| Benchmark | Interpreted (main β†’ PR) | Ξ” | Bytecode (main β†’ PR) | Ξ” |` with the point estimate and same-runner `main`/PR min-max range in the form `10,000 ops/sec [9,500..10,500] β†’ 9,200 ops/sec [8,700..9,700]`
- Classification uses **range overlap** instead of a single point value: if the PR run sits fully above the `main` range it is improved, if it sits fully below it is regressed, and overlapping ranges are treated as unchanged noise
- Percentage deltas remain in the `Ξ”` column as secondary context, even when the classifier marks a benchmark as `~ overlap`
- The overall summary reports the measured **median per-run variance** as a noise floor, so sub-noise deltas are read as unchanged
- Results are **grouped by file**, each in a collapsible `<details>` section
- Files with significant changes (improvements or regressions) are auto-expanded
- Each file summary shows per-mode counts (e.g., `Interp: 🟒 1, 7 unch. · Bytecode: 🟒 2, 6 unch.`)
- The **overall PR summary** shows per-mode totals on separate lines with average percentage deltas
- 🟒 marks non-overlapping improvements, πŸ”΄ marks non-overlapping regressions, `~ overlap` marks overlapping ranges, and πŸ†• marks new benchmarks with no baseline
- The comparison is **advisory** (comment-only, never merge-blocking)
- 🟒 marks non-overlapping improvements, πŸ”΄ marks non-overlapping regressions, `~ overlap` marks overlapping ranges, and πŸ†• marks new benchmarks absent from the `main` build
4 changes: 2 additions & 2 deletions docs/build-system.md
Original file line number Diff line number Diff line change
Expand Up @@ -669,7 +669,7 @@ Runs on the full platform matrix:

**`test262`** (needs build, ubuntu-latest x64 only, **non-blocking**) β€” Downloads the `gocciascript-x86_64-linux` build, checks out [`tc39/test262`](https://github.com/tc39/test262) at the SHA pinned in `scripts/test262-suite-sha.txt`, runs `bun scripts/run_test262_suite.ts --suite-dir test262-suite --mode=bytecode --jobs=4 --timeout-ms=20000 --output=test262-results.json`, and uploads the report as a 30-day workflow artifact. The run step uses `continue-on-error: true` because the conformance lane is allowed to carry known steady-state failures while the engine closes compatibility gaps. On main, the JSON is also stashed via `actions/cache/save` under `test262-baseline-<sha>` so the PR workflow can compute Ξ” vs main, and `cd website && bun run publish-test262 ../test262-results.json` publishes the compressed run report plus its UTC daily dashboard pointer to Vercel Blob when `BLOB_READ_WRITE_TOKEN` is configured. Main runs also upload the `test262-profile` artifact and publish matching aggregate/detail profile payloads under the separate `test262-profiles/` Blob namespace for weekly performance review. The one-off `cd website && bun run backfill-test262` command seeds retained artifact reports and reruns expired historical days directly into Blob. The pin is bumped weekly by `.github/workflows/test262-bump.yml`. See [docs/test262.md](test262.md) for the harness and profile report contracts.

**`benchmark`** (needs build) β€” Runs all benchmarks on all platforms. On main (ubuntu-latest x64), saves benchmark results as JSON to `actions/cache` for PR comparison. The main bytecode benchmark lane also captures deterministic VM profile details, uploads the `benchmark-profile` artifact, and publishes aggregate/detail profile payloads under the separate `benchmark-profiles/` Blob namespace when `BLOB_READ_WRITE_TOKEN` is configured.
**`benchmark`** (needs build) β€” Runs all benchmarks on all platforms. On main (ubuntu-latest x64), it additionally emits JSON and validates the report shape. PR comparison no longer reads a cached baseline from here β€” each PR builds and benchmarks `main` on its own runner ([ADR 0076](adr/0076-same-runner-benchmark-comparison.md)). The main bytecode benchmark lane also captures deterministic VM profile details, uploads the `benchmark-profile` artifact, and publishes aggregate/detail profile payloads under the separate `benchmark-profiles/` Blob namespace when `BLOB_READ_WRITE_TOKEN` is configured.

**`cli`** (needs build) β€” Downloads pre-built binaries and runs CLI behavior smoke tests on all platforms via Bun: options across all apps, lexer numeric-separator rejection, parser error display, config-file loading, and app-specific features including Sandbox Runner seed baselines. Windows runs additionally assert that the loader binary does not link OpenSSL DLLs (HTTPS must use the platform TLS stack statically).

Expand All @@ -692,7 +692,7 @@ Runs on **ubuntu-latest x64 only** (single runner, no matrix).

**`test`** (needs build) β€” Runs all JavaScript tests and Pascal unit tests.

**`benchmark`** (needs build) β€” Restores the cached benchmark baseline JSON from main, runs all benchmarks with JSON output, and posts a collapsible comparison comment on the PR grouped by file. Each file section shows the cached baseline and PR `opsPerSec` point estimates side by side, with each point estimate carrying its min/max ops/sec range in brackets. Classification uses range overlap: fully above the baseline range is an improvement, fully below is a regression, and overlapping ranges are treated as unchanged noise. Percentage deltas are still shown as secondary context, and files with significant changes are auto-expanded. If no baseline exists, shows results without comparison.
**`benchmark`** (needs build + build-main) β€” Compares the PR against a `main` build measured **on the same runner**. A separate `build-main` job builds the benchmark runner from the PR's base commit (`github.event.pull_request.base.sha`); the `benchmark` job then warms up the runner (discarded) and benchmarks the `main` build and the PR build back-to-back, so deltas reflect the diff rather than cross-runner variance (issue #815, [ADR 0076](adr/0076-same-runner-benchmark-comparison.md)). It posts a collapsible comparison comment grouped by file: each file section shows the same-runner `main` and PR `opsPerSec` point estimates side by side, each carrying its min/max ops/sec range in brackets. Classification (in [`scripts/benchmark-compare.js`](../scripts/benchmark-compare.js), unit-tested by `scripts/test-benchmark-compare.ts`) uses range overlap: fully above the `main` range is an improvement, fully below is a regression, and overlapping ranges are treated as unchanged noise. The comment also reports the measured median per-run variance as a noise floor. Percentage deltas are secondary context, files with significant changes are auto-expanded, and the comparison is advisory (comment-only, never merge-blocking). The deterministic profile diff (opcode/allocation counts) is regenerated from the same `main` build. If no `main` build is available, shows results without comparison.

**`test262`** (needs build, **non-blocking**) β€” Checks out `tc39/test262` at the pinned SHA, runs `bun scripts/run_test262_suite.ts --suite-dir test262-suite --mode=bytecode --jobs=4 --timeout-ms=20000 --output=test262-results.json`, and uploads the JSON report. Failing tests do not fail the job. The downstream `test262-comment` job (`if: always()`) restores the most recent `test262-baseline-` cache entry from main, then `bun scripts/run_test262_suite.ts --comment test262-results.json <baseline>` builds the markdown body and the workflow posts/updates a comment using marker `<!-- test262-results -->`. The comment shows a per-category breakdown (built-ins, harness, intl402, language, staging) with Ξ”-vs-main columns when a baseline is cached, an "Areas closest to 100%" sub-table, and a collapsible per-test delta list. PR CI does not generate or upload the full-corpus test262 profile artifact.

Expand Down
Loading
Loading