fix(ci): calibrate gate must fail when it measures nothing (stop the false green)#272

Merged
cunninghambe merged 1 commit into main from fix/calibrate-gate-honest-on-zero-data
Jun 3, 2026

@cunninghambe

Owner

Why

The calibrate CI gate has been passing green while measuring nothing. On this repo's own PRs the calibration comment reads ✅ ... precision=1 recall=1 — but the job log shows the cause:

Calibrate vibe-todo   Health check timed out after 30000ms for http://127.0.0.1:4101/healthz
Calibrate next-blog   SurfaceMCP unreachable at http://127.0.0.1:3102: fetch failed
Aggregate             All bench apps failed; emitting empty aggregate.

Every bench app fails to boot (CI never starts the app servers or SurfaceMCP), so the aggregate is empty. For that empty case, aggregate-calibration.mjs deliberately emitted precision=1.0 / recall=1.0 and called exit(0), "so CI doesn't go red on bench-app flake." The result: a total infrastructure failure is reported as a perfect pass. This is the same false-assurance pattern as the read-only "100%" real-app number.

What this PR does (the honesty fix only)

Makes the gate tell the truth. It does not yet make calibration produce real data (that's the larger infra work below).

  • aggregate-calibration.mjs: an empty result now sets calibrationRan: false with precision and recall set to null (never 1.0). It still calls exit(0) so that the artifact and the comment are still emitted. Non-empty aggregates get calibrationRan: true.
  • post-calibration-comment.mjs: for an empty result, renders a failure block headed ❌ **BugHunter Calibration DID NOT RUN**, stating that 0 bench apps produced a report and that "This is a failure, not a pass.", instead of a green ✅.
  • calibrate.yml: a new "Enforce calibration actually ran" step fails the job (red) when calibrationRan is false or any wired kind violates its threshold. It runs after the comment step so the PR still shows why.
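For reviewers, the empty-aggregate behaviour described in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the code in aggregate-calibration.mjs: only `calibrationRan`, `precision`, and `recall` come from this PR; the function name, the `appsReported` field, the per-app report shape (`truePositives`, `falsePositives`, `falseNegatives`), and the micro-averaging are hypothetical.

```javascript
// Hypothetical sketch: the report shape and the micro-averaging are assumptions,
// not the actual contents of aggregate-calibration.mjs.
function aggregateReports(reports) {
  // Keep only reports that actually carry confusion counts.
  const usable = reports.filter(
    (r) =>
      r &&
      Number.isFinite(r.truePositives) &&
      Number.isFinite(r.falsePositives) &&
      Number.isFinite(r.falseNegatives)
  );

  if (usable.length === 0) {
    // Nothing was measured: say so explicitly instead of reporting 1.0.
    return { calibrationRan: false, appsReported: 0, precision: null, recall: null };
  }

  const sum = (key) => usable.reduce((total, r) => total + r[key], 0);
  const tp = sum("truePositives");
  const fp = sum("falsePositives");
  const fn = sum("falseNegatives");

  return {
    calibrationRan: true,
    appsReported: usable.length,
    // A zero denominator means the metric is undefined, so report null.
    precision: tp + fp === 0 ? null : tp / (tp + fp),
    recall: tp + fn === 0 ? null : tp / (tp + fn),
  };
}
```

The point of the sketch is that "no data" and "perfect score" are different values in the output, so downstream steps cannot confuse them.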

Effect

The calibrate check will go red until the bench infra is resurrected in CI. main is not branch-protected, so this is informational, not merge-blocking. A gate that certifies success on zero data is worse than no gate.
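The decision that turns the check red can be sketched as a small pure function. This is a hedged illustration, not the step in calibrate.yml: the function names, the `kinds` map on the aggregate, and the threshold shape are all hypothetical, and treating a missing or null metric as a violation is an assumption made here.

```javascript
// Hypothetical sketch of the "Enforce calibration actually ran" decision.
// The aggregate.kinds map and the thresholds shape are assumptions.
function findViolations(aggregate, thresholds) {
  if (!aggregate || aggregate.calibrationRan !== true) {
    return ["calibration did not run (0 bench apps produced a report)"];
  }
  const violations = [];
  for (const [kind, minimum] of Object.entries(thresholds)) {
    const measured = (aggregate.kinds || {})[kind] || {};
    for (const metric of ["precision", "recall"]) {
      const value = measured[metric];
      // Assumption: a missing or null metric fails, rather than silently passing.
      if (!Number.isFinite(value) || value < minimum[metric]) {
        violations.push(`${kind}: ${metric}=${value ?? "n/a"} is below ${minimum[metric]}`);
      }
    }
  }
  return violations;
}

// Maps the violations to a process exit code: 0 is green, 1 is red.
function exitCodeFor(aggregate, thresholds) {
  return findViolations(aggregate, thresholds).length === 0 ? 0 : 1;
}
```

A script wrapping this could log each violation and then set `process.exitCode` to the returned value, which keeps the comment step free to run first.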

Verified locally

  • empty aggregate → calibrationRan: false, precision: null; enforce step exits 1
  • valid report → calibrationRan: true, real precision flows through
  • comment formatter renders the DID NOT RUN failure block
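The third check above concerns the comment formatter. A minimal sketch of that branching, assuming a hypothetical `failed` list of `{ name, reason }` entries on the aggregate, might look like this. The function name, the success heading, and the abbreviated failure wording are illustrative; they are not copied from post-calibration-comment.mjs.

```javascript
// Hypothetical sketch: aggregate.failed, the success heading, and the exact
// wording are assumptions, not the real post-calibration-comment.mjs output.
function renderCalibrationComment(aggregate, date) {
  if (aggregate.calibrationRan !== true) {
    const failed = (aggregate.failed || [])
      .map((f) => `${f.name} (${f.reason})`)
      .join(", ");
    const lines = [
      `❌ **BugHunter Calibration DID NOT RUN** | ${date}`,
      "",
      "0 bench apps produced a report. This is a failure, not a pass.",
    ];
    if (failed) lines.push("", `Failed: ${failed}`);
    return lines.join("\n");
  }

  // Only a run that actually measured something gets the green heading.
  const fmt = (v) => (Number.isFinite(v) ? v.toFixed(2) : "n/a");
  return [
    `✅ **BugHunter Calibration** | ${date}`,
    "",
    `precision=${fmt(aggregate.precision)} recall=${fmt(aggregate.recall)}`,
  ].join("\n");
}
```

Keying the heading on `calibrationRan` rather than on the metric values means a null or missing metric can never be rendered under a green check mark by accident.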

Follow-up (NOT in this PR — larger infra)

To make calibrate produce real precision/recall in CI, the workflow needs to do the following for each app: install deps, start the app's full server (the bench boot.sh only runs npm run dev, never the API server that owns /healthz), start a SurfaceMCP instance on 3102, and (for UI detectors) start a camofox browser MCP. None of that exists in CI today. The work spans this repo and cunninghambe/BugHunter-bench, so it is a multi-part effort and is tracked separately.

🤖 Generated with Claude Code

The calibrate job passed GREEN whenever all bench apps failed to boot:
aggregate-calibration.mjs emitted precision=1.0/recall=1.0 on an empty
aggregate and exit(0), and the PR comment rendered ✅. In practice every
bench app has been failing (health-check timeout + SurfaceMCP never
started in CI), so the gate has certified success while measuring zero
data — pure false assurance, the same pattern as the read-only "100%".

- aggregate: empty result now flags `calibrationRan: false` with null
  precision/recall (never 1.0), still exit 0 so the artifact + comment
  emit.
- comment: renders "❌ Calibration DID NOT RUN" instead of green ✅.
- calibrate.yml: new "Enforce calibration actually ran" step fails the
  job (red) when calibrationRan is false OR any wired kind violates
  threshold.

This makes the calibrate check RED until the bench infra is resurrected
in CI (boot the 5 app servers + start SurfaceMCP/camofox). main is not
branch-protected, so this is informational, not merge-blocking. A gate
that lies is worse than no gate.

Verified locally: empty -> calibrationRan:false + enforce exits 1;
valid report -> calibrationRan:true; comment renders the failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 3, 2026

BugHunter Calibration DID NOT RUN | 2026-06-03

0 bench apps produced a report — the calibration gate measured nothing (no precision/recall data). This is a failure, not a pass.

Failed: calib-vibe-todo (unreadable), calib-vite-shop (unreadable), calib-vue-board (unreadable), calib-next-blog (unreadable), calib-astro-saas (unreadable)

Likely cause: bench apps did not boot (health-check timeout) or SurfaceMCP was not started. See the calibrate job log.

@cunninghambe cunninghambe merged commit 3ace057 into main Jun 3, 2026
1 of 2 checks passed
@cunninghambe cunninghambe deleted the fix/calibrate-gate-honest-on-zero-data branch June 3, 2026 14:33