fix(ci): calibrate gate must fail when it measures nothing (stop the false green) by cunninghambe · Pull Request #272 · cunninghambe/BugHunter

cunninghambe · 2026-06-03T14:18:48Z

Why

The calibrate CI gate has been passing green while measuring nothing. On this repo's own PRs the calibration comment reads ✅ ... precision=1 recall=1 — but the job log shows the cause:

Calibrate vibe-todo   Health check timed out after 30000ms for http://127.0.0.1:4101/healthz
Calibrate next-blog   SurfaceMCP unreachable at http://127.0.0.1:3102: fetch failed
Aggregate             All bench apps failed; emitting empty aggregate.

Every bench app fails to boot because CI never starts the app servers or SurfaceMCP, so the aggregate is empty. For that empty case, aggregate-calibration.mjs deliberately emitted precision=1.0 / recall=1.0 and exit(0) "so CI doesn't go red on bench-app flake." The result: a total infrastructure failure is reported as a perfect pass. This is the same false-assurance pattern as the read-only "100%" real-app number.

What this PR does (the honesty fix only)

Makes the gate tell the truth. It does not yet make calibration produce real data (that's the larger infra work below).

- aggregate-calibration.mjs: an empty result now sets calibrationRan: false with precision/recall: null (never 1.0). It still exits 0 so the artifact and comment are still emitted. Non-empty aggregates get calibrationRan: true.
- post-calibration-comment.mjs: renders "❌ **BugHunter Calibration DID NOT RUN** — 0 bench apps produced a report ... This is a failure, not a pass." instead of a green ✅.
- calibrate.yml: a new "Enforce calibration actually ran" step fails the job (red) when calibrationRan is false or any wired kind violates its threshold. It runs after the comment step so the PR still shows why.
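
The empty-aggregate behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of aggregate-calibration.mjs: the function name `aggregateCalibration` and the per-app report shape (`tp`, `fp`, `fn` counts) are assumptions. Only the `calibrationRan`, `precision`, and `recall` fields come from the PR description.

```javascript
// Hypothetical sketch. The report shape { tp, fp, fn } is an assumption,
// not the real format consumed by aggregate-calibration.mjs.
function aggregateCalibration(reports) {
  if (reports.length === 0) {
    // Nothing was measured: say so explicitly instead of reporting 1.0.
    return { calibrationRan: false, precision: null, recall: null, apps: 0 };
  }

  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const report of reports) {
    tp += report.tp;
    fp += report.fp;
    fn += report.fn;
  }

  // A zero denominator means "undefined", which is reported as null,
  // never as a perfect score.
  const precision = tp + fp > 0 ? tp / (tp + fp) : null;
  const recall = tp + fn > 0 ? tp / (tp + fn) : null;

  return { calibrationRan: true, precision, recall, apps: reports.length };
}
```

The point of the design is that "no data" and "perfect score" are different values, so a later step can tell them apart.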

Effect

The calibrate check will go red until the bench infra is resurrected in CI. main is not branch-protected, so this is informational, not merge-blocking. A gate that certifies success on zero data is worse than no gate.
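
As a rough sketch of the decision the enforce step makes, assuming a hypothetical `enforceCalibration` helper: the per-kind `kinds` map and the `thresholds` object are illustrative assumptions, since the PR description does not show how thresholds are stored or how the step in calibrate.yml is implemented.

```javascript
// Hypothetical sketch of the enforce decision. `aggregate.kinds` and
// `thresholds` are assumed shapes: { [kind]: { precision, recall } }.
function enforceCalibration(aggregate, thresholds) {
  // Zero data is a failure, regardless of any thresholds.
  if (!aggregate || aggregate.calibrationRan !== true) {
    return { exitCode: 1, failures: ["calibration did not run"] };
  }

  const failures = [];
  const kinds = aggregate.kinds || {};
  for (const [kind, min] of Object.entries(thresholds)) {
    const measured = kinds[kind] || {};
    for (const metric of ["precision", "recall"]) {
      const value = measured[metric];
      // A missing or null metric counts as a violation in this sketch.
      if (typeof value !== "number" || value < min[metric]) {
        failures.push(`${kind}: ${metric}=${value} is below ${min[metric]}`);
      }
    }
  }

  return { exitCode: failures.length > 0 ? 1 : 0, failures };
}
```

A CI script could pass the returned `exitCode` to `process.exit` after the comment has been posted, which keeps the explanation visible on the PR while still turning the check red.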

Verified locally

- empty aggregate → calibrationRan: false, precision: null; enforce step exits 1
- valid report → calibrationRan: true, real precision flows through
- comment formatter renders the DID NOT RUN failure block
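
The comment formatter check can be illustrated with a small sketch. The function name `renderCalibrationComment`, the `failed` array, and the wording of the passing branch are assumptions made for illustration. The failure text is abbreviated from the PR description, and this is not the actual post-calibration-comment.mjs.

```javascript
// Hypothetical sketch of a comment renderer. The `failed` array and the
// passing-branch wording are assumptions, not the real formatter output.
function renderCalibrationComment(aggregate, date) {
  if (!aggregate || aggregate.calibrationRan !== true) {
    const failed = (aggregate && aggregate.failed) || [];
    const lines = [
      `❌ **BugHunter Calibration DID NOT RUN** | ${date}`,
      "",
      "0 bench apps produced a report. This is a failure, not a pass.",
    ];
    if (failed.length > 0) {
      lines.push("", `Failed: ${failed.join(", ")}`);
    }
    return lines.join("\n");
  }

  // Placeholder passing branch: the real success format is not shown
  // in the PR description, so this is deliberately plain.
  return [
    `BugHunter Calibration | ${date}`,
    "",
    `precision=${aggregate.precision} recall=${aggregate.recall}`,
  ].join("\n");
}
```

Branching on `calibrationRan` rather than on the metric values is what keeps a null or missing metric from ever being rendered as a green result.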

Follow-up (NOT in this PR — larger infra)

To make calibrate produce real precision/recall in CI, the workflow needs to, per app: install deps, start the app's full server (the bench boot.sh only runs npm run dev, never the API server that owns /healthz), start a SurfaceMCP instance on 3102, and (for UI detectors) a camofox browser MCP. None of that exists in CI today. That spans this repo + cunninghambe/BugHunter-bench and is a multi-part effort — tracked separately.

🤖 Generated with Claude Code

The calibrate job passed GREEN whenever all bench apps failed to boot: aggregate-calibration.mjs emitted precision=1.0/recall=1.0 on an empty aggregate and exit(0), and the PR comment rendered ✅. In practice every bench app has been failing (health-check timeout + SurfaceMCP never started in CI), so the gate has certified success while measuring zero data: pure false assurance, the same pattern as the read-only "100%".

- aggregate: empty result now flags `calibrationRan: false` with null precision/recall (never 1.0), still exit 0 so the artifact + comment emit.
- comment: renders "❌ Calibration DID NOT RUN" instead of green ✅.
- calibrate.yml: new "Enforce calibration actually ran" step fails the job (red) when calibrationRan is false OR any wired kind violates threshold.

This makes the calibrate check RED until the bench infra is resurrected in CI (boot the 5 app servers + start SurfaceMCP/camofox). main is not branch-protected, so this is informational, not merge-blocking. A gate that lies is worse than no gate.

Verified locally: empty -> calibrationRan:false + enforce exits 1; valid report -> calibrationRan:true; comment renders the failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-03T14:22:03Z

❌ BugHunter Calibration DID NOT RUN | 2026-06-03

0 bench apps produced a report — the calibration gate measured nothing (no precision/recall data). This is a failure, not a pass.

Failed: calib-vibe-todo (unreadable), calib-vite-shop (unreadable), calib-vue-board (unreadable), calib-next-blog (unreadable), calib-astro-saas (unreadable)

Likely cause: bench apps did not boot (health-check timeout) or SurfaceMCP was not started. See the calibrate job log.

cunninghambe merged commit 3ace057 into main Jun 3, 2026
1 of 2 checks passed

cunninghambe deleted the fix/calibrate-gate-honest-on-zero-data branch June 3, 2026 14:33

cunninghambe mentioned this pull request Jun 3, 2026

fix(calibrate): match gold on kind+location (normalizedMessage as tiebreaker) — first valid recall TP #273

Open

cunninghambe merged 1 commit into main from fix/calibrate-gate-honest-on-zero-data
