fix(calibrate): match gold on kind+location (normalizedMessage as tiebreaker) - first valid recall TP #273

Open
cunninghambe wants to merge 1 commit into main from fix/calibrate-matcher-message-tiebreaker

@cunninghambe

Owner

Why

The calibrate gold-matcher (calibrate/match.ts) required the gold's structuralMatch.normalizedMessage to be a literal substring of the cluster's signatureKey or rootCause. But the BugHunter-bench gold authors normalizedMessage as semantic labels (homepage-multiple-h1, disallow-all, no-meta-description) that never appear verbatim in any detector output. So msgMatch always failed, and every structurally-correct detection was scored as a false positive plus a false negative. The matcher could not produce a true positive for most kinds even when the detector found the planted bug perfectly, which guaranteed ~0% recall regardless of detector quality.
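To make the failure concrete, here is a minimal sketch of the old hard gate. The gold label comes from the bench; the cluster's signatureKey and rootCause values are hypothetical, since the real detector output format is not shown here, and the actual check in calibrate/match.ts may be written differently.

```typescript
// Gold label as authored in the bench (a semantic label, not detector text).
const normalizedMessage = "homepage-multiple-h1";

// Hypothetical detector cluster: these field values are illustrative assumptions.
const cluster = {
  signatureKey: "seo_h1_missing_or_multiple|/",
  rootCause: "Page contains more than one <h1> element",
};

// Old behavior (sketch): the label had to be a literal substring of either field.
const msgMatch =
  cluster.signatureKey.includes(normalizedMessage) ||
  cluster.rootCause.includes(normalizedMessage);

// msgMatch is false: neither string contains "homepage-multiple-h1",
// so a structurally-correct detection was rejected.
```

Any label written for humans rather than copied from detector output fails this check the same way.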

This was the deepest of several layers of calibrate rot found while resurrecting the gate (see also #272). It meant BugHunter's recall has never been validly measured.

What

Demote normalizedMessage from a hard gate to a tiebreaker:

  • A single kind + normalizedLocation candidate → true positive (the message label need not align).
  • Multiple same-kind+location candidates → disambiguate by normalizedMessage: exactly one message-match wins; 0 or >1 → fatal ambiguity (gold is under-specified for that surface).

Location remains required (so a genuine location mismatch is still a miss), and * wildcards still work for both fields.
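The rule above can be sketched roughly as follows. This is an illustration under assumptions, not the code in calibrate/match.ts: the type shapes, the names matchGold and messageMatches, and the use of exact equality for location are all hypothetical.

```typescript
// Hypothetical shapes; the real types in calibrate/match.ts may differ.
interface StructuralMatch {
  kind: string;
  normalizedLocation: string; // "*" acts as a wildcard
  normalizedMessage: string; // "*" acts as a wildcard
}

interface Cluster {
  kind: string;
  location: string;
  signatureKey: string;
  rootCause: string;
}

type MatchResult =
  | { outcome: "true_positive"; cluster: Cluster }
  | { outcome: "false_negative" }
  | { outcome: "ambiguous"; candidates: Cluster[] };

// Substring check against signatureKey/rootCause, now used only as a tiebreaker.
function messageMatches(gold: StructuralMatch, c: Cluster): boolean {
  if (gold.normalizedMessage === "*") return true;
  return (
    c.signatureKey.includes(gold.normalizedMessage) ||
    c.rootCause.includes(gold.normalizedMessage)
  );
}

function matchGold(gold: StructuralMatch, clusters: Cluster[]): MatchResult {
  // Kind and location are still required (location may be a wildcard).
  const candidates = clusters.filter(
    (c) =>
      c.kind === gold.kind &&
      (gold.normalizedLocation === "*" || c.location === gold.normalizedLocation),
  );

  const first = candidates[0];
  if (first === undefined) return { outcome: "false_negative" };

  // A single kind+location candidate wins; the message need not align.
  if (candidates.length === 1) return { outcome: "true_positive", cluster: first };

  // Several candidates: exactly one message match wins, otherwise ambiguity.
  const byMessage = candidates.filter((c) => messageMatches(gold, c));
  const winner = byMessage[0];
  if (byMessage.length === 1 && winner !== undefined) {
    return { outcome: "true_positive", cluster: winner };
  }
  return { outcome: "ambiguous", candidates };
}
```

In this sketch a wildcard message with several candidates matches all of them and therefore lands in the ambiguous branch, which is consistent with treating ">1" as under-specified gold.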

Validated end-to-end

Ran bughunter calibrate against the next-blog bench app (with a healthy SurfaceMCP anon surface, see SurfaceMCP#26):

before this fix:  tp=0 fp=4 fn=15  recall=0
after  this fix:  tp=1 fp=3 fn=14  recall=0.067
✓ true_positive: seo_h1_missing_or_multiple (gold next-blog-003, via structural)

The planted homepage-multiple-h1 bug — detected as seo_h1 on / — now correctly scores as calibrate's first valid structural true positive. (The remaining 14 misses are legitimate: the run was anon-only so auth-gated bugs were unreachable, and only 5 pages were crawled.)
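For reference, the recall figures follow from the standard definition recall = tp / (tp + fn). The helper below is a small sketch for checking the arithmetic; the Counts type and recall function are illustrative names, not calibrate's own code.

```typescript
// Illustrative helper; names are assumptions, not calibrate internals.
interface Counts {
  tp: number;
  fp: number;
  fn: number;
}

// recall = tp / (tp + fn); defined as 0 when there are no gold bugs at all.
function recall(c: Counts): number {
  const total = c.tp + c.fn;
  return total === 0 ? 0 : c.tp / total;
}

const before: Counts = { tp: 0, fp: 4, fn: 15 };
const after: Counts = { tp: 1, fp: 3, fn: 14 };

const recallBefore = recall(before); // 0 (0 of 15 gold bugs matched)
const recallAfter = recall(after); // 1/15, about 0.0667, reported as 0.067
```

Both runs have tp + fn = 15, so the gold set is unchanged and the difference comes only from the matcher.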

Tests

match.test.ts (TDD, RED→GREEN): kind+location match with a non-aligning message → TP; genuine location mismatch → FN; message disambiguates multiple candidates → TP; indistinguishable multiple candidates → ambiguity. Full src suite: 2329 passing, tsc + build clean.

🤖 Generated with Claude Code

…breaker

The calibrate gold-matcher required structuralMatch.normalizedMessage to be a
literal substring of the cluster's signatureKey/rootCause. But the BugHunter-bench
gold authors normalizedMessage as semantic labels ("homepage-multiple-h1",
"disallow-all") that never appear in detector output, so msgMatch always failed —
every structurally-correct detection was scored as false-positive + false-negative,
guaranteeing ~0% recall even when detectors work perfectly.

Demote normalizedMessage to a tiebreaker: a single kind+location candidate is a
true positive; normalizedMessage only disambiguates among multiple same-kind+
location candidates (exactly one message-match wins; 0 or >1 is fatal ambiguity).

Validated end-to-end against the next-blog bench app: the planted
homepage-multiple-h1 bug (detected as seo_h1 on "/") now correctly scores as a
true positive — calibrate's first valid structural TP. next-blog recall went
0/15 -> 1/15 (the remaining misses are legitimate: anon-only run, 5-page crawl).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 3, 2026


BugHunter Calibration DID NOT RUN | 2026-06-03

0 bench apps produced a report — the calibration gate measured nothing (no precision/recall data). This is a failure, not a pass.

Failed: calib-vibe-todo (unreadable), calib-vite-shop (unreadable), calib-vue-board (unreadable), calib-next-blog (unreadable), calib-astro-saas (unreadable)

Likely cause: bench apps did not boot (health-check timeout) or SurfaceMCP was not started. See the calibrate job log.
