fix(calibrate): match gold on kind+location (normalizedMessage as tiebreaker) — first valid recall TP by cunninghambe · Pull Request #273 · cunninghambe/BugHunter

cunninghambe · 2026-06-03T16:00:35Z

Why

The calibrate gold-matcher (calibrate/match.ts) required the gold's structuralMatch.normalizedMessage to be a literal substring of the cluster's signatureKey or rootCause. But the BugHunter-bench gold authors normalizedMessage as semantic labels — homepage-multiple-h1, disallow-all, no-meta-description — that never appear verbatim in any detector output. So msgMatch always failed, and every structurally-correct detection was scored as false-positive + false-negative. The matcher could not produce a true positive for most kinds even when the detector found the planted bug perfectly. This guaranteed ~0% recall regardless of detector quality.
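To make the failure concrete, here is a minimal sketch of the old substring gate as the paragraph above describes it. The `Cluster` shape, the `oldMsgMatch` name, and the sample detector values are assumptions made for illustration; the actual code in calibrate/match.ts may look different, and wildcard handling is omitted.

```typescript
// Hypothetical shape; the real cluster type in calibrate/match.ts may differ.
interface Cluster {
  signatureKey: string;
  rootCause: string;
}

// Old behaviour as described above: the gold label had to appear verbatim
// in the cluster's signatureKey or rootCause.
function oldMsgMatch(normalizedMessage: string, cluster: Cluster): boolean {
  return (
    cluster.signatureKey.includes(normalizedMessage) ||
    cluster.rootCause.includes(normalizedMessage)
  );
}

// Illustrative detector output (values invented for this example).
const sampleCluster: Cluster = {
  signatureKey: "seo_h1_missing_or_multiple|/",
  rootCause: "Page contains 2 <h1> elements",
};

// The semantic gold label never occurs in the detector strings,
// so the gate rejects a structurally correct detection.
const matched = oldMsgMatch("homepage-multiple-h1", sampleCluster); // false
```

Because the label is a description chosen by the gold author rather than a fragment of detector output, no realistic detector string would satisfy this check.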

This was the deepest of several layers of calibrate rot found while resurrecting the gate (see also #272). It meant BugHunter's recall has never been validly measured.

What

Demote normalizedMessage from a hard gate to a tiebreaker:

A single kind + normalizedLocation candidate → true positive (the message label need not align).
Multiple same-kind+location candidates → disambiguate by normalizedMessage: exactly one message-match wins; 0 or >1 → fatal ambiguity (gold is under-specified for that surface).

Location remains required (so a genuine location mismatch is still a miss), and * wildcards still work for both fields.
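The rules above can be sketched as follows. This is not the code from this PR: the type shapes, the names `matchGold`, `locationMatches`, and `messageMatches`, and the use of exact equality for kind and location are assumptions chosen to keep the example self-contained.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
interface StructuralMatch {
  kind: string;
  normalizedLocation: string; // "*" matches any location
  normalizedMessage: string; // "*" matches any message
}

interface Cluster {
  kind: string;
  location: string;
  signatureKey: string;
  rootCause: string;
}

type MatchResult =
  | { outcome: "true_positive"; cluster: Cluster }
  | { outcome: "miss" }
  | { outcome: "ambiguous"; candidates: Cluster[] };

const locationMatches = (gold: StructuralMatch, c: Cluster): boolean =>
  gold.normalizedLocation === "*" || gold.normalizedLocation === c.location;

const messageMatches = (gold: StructuralMatch, c: Cluster): boolean =>
  gold.normalizedMessage === "*" ||
  c.signatureKey.includes(gold.normalizedMessage) ||
  c.rootCause.includes(gold.normalizedMessage);

function matchGold(gold: StructuralMatch, clusters: Cluster[]): MatchResult {
  // Kind and location remain hard requirements.
  const candidates = clusters.filter(
    (c) => c.kind === gold.kind && locationMatches(gold, c),
  );
  if (candidates.length === 0) return { outcome: "miss" };

  // A single candidate wins even when the message label does not align.
  if (candidates.length === 1) {
    return { outcome: "true_positive", cluster: candidates[0] };
  }

  // Several candidates: the message acts only as a tiebreaker.
  const byMessage = candidates.filter((c) => messageMatches(gold, c));
  return byMessage.length === 1
    ? { outcome: "true_positive", cluster: byMessage[0] }
    : { outcome: "ambiguous", candidates };
}
```

Returning a tagged union keeps the three outcomes explicit, so a caller cannot silently treat an ambiguous gold entry as a miss.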

Validated end-to-end

Ran bughunter calibrate against the next-blog bench app (with a healthy SurfaceMCP anon surface, see SurfaceMCP#26):

before this fix:  tp=0 fp=4 fn=15  recall=0
after  this fix:  tp=1 fp=3 fn=14  recall=0.067
✓ true_positive: seo_h1_missing_or_multiple (gold next-blog-003, via structural)

The planted homepage-multiple-h1 bug — detected as seo_h1 on / — now correctly scores as calibrate's first valid structural true positive. (The remaining 14 misses are legitimate: the run was anon-only so auth-gated bugs were unreachable, and only 5 pages were crawled.)
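As a quick check of the reported figures, recall is TP / (TP + FN). The helper below is a sketch with hypothetical names (`Counts`, `recall`); it only reproduces the arithmetic from the calibrate output above and is not taken from the BugHunter code.

```typescript
interface Counts {
  tp: number;
  fp: number;
  fn: number;
}

// recall = TP / (TP + FN); treated as 0 here when there is no gold to find.
function recall({ tp, fn }: Counts): number {
  const total = tp + fn;
  return total === 0 ? 0 : tp / total;
}

// Counts copied from the calibrate output above.
const beforeFix: Counts = { tp: 0, fp: 4, fn: 15 };
const afterFix: Counts = { tp: 1, fp: 3, fn: 14 };

const recallBefore = recall(beforeFix); // 0
const recallAfter = recall(afterFix).toFixed(3); // "0.067", i.e. 1/15
```

The gold total is 15 in both runs; the fix moves exactly one detection from the fp and fn columns into tp.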

Tests

match.test.ts (TDD, RED→GREEN): kind+location match with a non-aligning message → TP; genuine location mismatch → FN; message disambiguates multiple candidates → TP; indistinguishable multiple candidates → ambiguity. Full src suite: 2329 passing, tsc + build clean.

🤖 Generated with Claude Code

…breaker

The calibrate gold-matcher required structuralMatch.normalizedMessage to be a literal substring of the cluster's signatureKey/rootCause. But the BugHunter-bench gold authors normalizedMessage as semantic labels ("homepage-multiple-h1", "disallow-all") that never appear in detector output, so msgMatch always failed: every structurally-correct detection was scored as false-positive + false-negative, guaranteeing ~0% recall even when detectors work perfectly.

Demote normalizedMessage to a tiebreaker: a single kind+location candidate is a true positive; normalizedMessage only disambiguates among multiple same-kind+location candidates (exactly one message-match wins; 0 or >1 is fatal ambiguity).

Validated end-to-end against the next-blog bench app: the planted homepage-multiple-h1 bug (detected as seo_h1 on "/") now correctly scores as a true positive, calibrate's first valid structural TP. next-blog recall went 0/15 -> 1/15 (the remaining misses are legitimate: anon-only run, 5-page crawl).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-03T16:04:08Z

❌ BugHunter Calibration DID NOT RUN | 2026-06-03

0 bench apps produced a report — the calibration gate measured nothing (no precision/recall data). This is a failure, not a pass.

Failed: calib-vibe-todo (unreadable), calib-vite-shop (unreadable), calib-vue-board (unreadable), calib-next-blog (unreadable), calib-astro-saas (unreadable)

Likely cause: bench apps did not boot (health-check timeout) or SurfaceMCP was not started. See the calibrate job log.

cunninghambe wants to merge 1 commit into main from fix/calibrate-matcher-message-tiebreaker
