fix(ci): calibrate gate must fail when it measures nothing (stop the false green) #272
Merged
The calibrate job passed GREEN whenever all bench apps failed to boot: aggregate-calibration.mjs emitted precision=1.0/recall=1.0 on an empty aggregate and exit(0), and the PR comment rendered ✅. In practice every bench app has been failing (health-check timeout + SurfaceMCP never started in CI), so the gate has certified success while measuring zero data — pure false assurance, the same pattern as the read-only "100%".

- aggregate: empty result now flags `calibrationRan: false` with null precision/recall (never 1.0), still exit 0 so the artifact + comment emit.
- comment: renders "❌ Calibration DID NOT RUN" instead of green ✅.
- calibrate.yml: new "Enforce calibration actually ran" step fails the job (red) when calibrationRan is false OR any wired kind violates threshold.

This makes the calibrate check RED until the bench infra is resurrected in CI (boot the 5 app servers + start SurfaceMCP/camofox). main is not branch-protected, so this is informational, not merge-blocking. A gate that lies is worse than no gate.

Verified locally: empty -> calibrationRan:false + enforce exits 1; valid report -> calibrationRan:true; comment renders the failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
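As a rough illustration of the empty-aggregate rule in this commit, here is a minimal sketch. It is not the actual contents of `aggregate-calibration.mjs`: the function name `aggregateCalibration`, the `appCount` field, and the per-app report shape (`tp`/`fp`/`fn` counts) are assumptions made for the example. Only `calibrationRan` and the null precision/recall on an empty result come from the change itself.

```javascript
// Hypothetical report shape: { app: string, tp: number, fp: number, fn: number }.
// The real script may read different fields from the bench reports.
function aggregateCalibration(reports) {
  if (reports.length === 0) {
    // Nothing was measured: say so explicitly instead of claiming 1.0 / 1.0.
    return { calibrationRan: false, appCount: 0, precision: null, recall: null };
  }

  // Sum the confusion counts across all apps that produced a report.
  const totals = reports.reduce(
    (acc, r) => ({ tp: acc.tp + r.tp, fp: acc.fp + r.fp, fn: acc.fn + r.fn }),
    { tp: 0, fp: 0, fn: 0 },
  );

  // A zero denominator means the metric is undefined, so report null, not 1.0.
  const ratio = (num, den) => (den === 0 ? null : num / den);

  return {
    calibrationRan: true,
    appCount: reports.length,
    precision: ratio(totals.tp, totals.tp + totals.fp),
    recall: ratio(totals.tp, totals.tp + totals.fn),
  };
}

// Usage: the empty case and a single (made-up) report.
console.log(aggregateCalibration([]));
console.log(aggregateCalibration([{ app: "demo", tp: 8, fp: 2, fn: 2 }]));
```

Returning `null` rather than a number forces every consumer (the comment renderer, the enforce step) to handle the "no data" case on purpose instead of reading it as a perfect score.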
❌ BugHunter Calibration DID NOT RUN | 2026-06-03

0 bench apps produced a report — the calibration gate measured nothing (no precision/recall data). This is a failure, not a pass.

Failed:

Likely cause: bench apps did not boot (health-check timeout) or SurfaceMCP was not started. See the calibrate job log.
Why

The `calibrate` CI gate has been passing green while measuring nothing. On this repo's own PRs the calibration comment reads `✅ ... precision=1 recall=1`, but the job log shows the cause: every bench app fails to boot (the CI never starts the app servers or SurfaceMCP), so the aggregate is empty — and `aggregate-calibration.mjs` deliberately emitted `precision=1.0 / recall=1.0` and `exit(0)` for the empty case "so CI doesn't go red on bench-app flake." The result: total infrastructure failure is reported as a perfect pass. Same false-assurance pattern as the read-only "100%" real-app number.

What this PR does (the honesty fix only)
Makes the gate tell the truth. It does not yet make calibration produce real data (that's the larger infra work below).
- `aggregate-calibration.mjs`: empty result now sets `calibrationRan: false` with `precision`/`recall: null` (never `1.0`). Still `exit(0)` so the artifact + comment still emit. Non-empty aggregates get `calibrationRan: true`.
- `post-calibration-comment.mjs`: renders `❌ **BugHunter Calibration DID NOT RUN** — 0 bench apps produced a report ... This is a failure, not a pass.` instead of a green ✅.
- `calibrate.yml`: new `Enforce calibration actually ran` step fails the job (red) when `calibrationRan` is false or any wired kind violates threshold. Runs after the comment so the PR still shows why.

Effect
The `calibrate` check will go red until the bench infra is resurrected in CI. `main` is not branch-protected, so this is informational, not merge-blocking. A gate that certifies success on zero data is worse than no gate.

Verified locally
- Empty aggregate: `calibrationRan: false`, `precision: null`; enforce step exits 1
- Valid report: `calibrationRan: true`, real precision flows through
- Comment: renders the `DID NOT RUN` failure block

Follow-up (NOT in this PR — larger infra)
To make `calibrate` produce real precision/recall in CI, the workflow needs to, per app: install deps, start the app's full server (the bench `boot.sh` only runs `npm run dev`, never the API server that owns `/healthz`), start a SurfaceMCP instance on 3102, and (for UI detectors) a camofox browser MCP. None of that exists in CI today. That spans this repo + `cunninghambe/BugHunter-bench` and is a multi-part effort, tracked separately.

🤖 Generated with Claude Code
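For reference, the decision made by the `Enforce calibration actually ran` step described under "What this PR does" might look roughly like the following. This is a sketch under assumptions, not the step's real script: `enforceCalibration`, the `kinds` array, and the `minPrecision`/`minRecall` field names are hypothetical, and treating a null metric on a wired kind as a violation is a choice made here rather than something the PR states.

```javascript
// Hypothetical aggregate shape (the real artifact may differ):
// {
//   calibrationRan: boolean,
//   kinds: [{ kind, precision, recall, minPrecision, minRecall }]
// }
function enforceCalibration(aggregate) {
  const failures = [];

  // Rule 1: a run that measured nothing is a failure, never a pass.
  if (aggregate.calibrationRan !== true) {
    failures.push("calibration did not run (no bench app produced a report)");
  }

  // Rule 2: every wired kind must meet its thresholds.
  // Assumption: a null metric counts as a violation.
  const below = (value, min) => value === null || value < min;
  for (const k of aggregate.kinds ?? []) {
    if (below(k.precision, k.minPrecision)) {
      failures.push(`${k.kind}: precision ${k.precision} < ${k.minPrecision}`);
    }
    if (below(k.recall, k.minRecall)) {
      failures.push(`${k.kind}: recall ${k.recall} < ${k.minRecall}`);
    }
  }

  return { exitCode: failures.length === 0 ? 0 : 1, failures };
}

// Usage: the empty-aggregate case that previously passed green.
const result = enforceCalibration({ calibrationRan: false, kinds: [] });
console.log(result.exitCode, result.failures);
```

A workflow step could assign the returned `exitCode` to `process.exitCode`. Returning the failure list instead of exiting inside the function keeps the rule easy to test locally, in the spirit of the "Verified locally" checks.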