fix(calibrate): match gold on kind+location (normalizedMessage as tiebreaker); first valid recall TP #273
Open
cunninghambe wants to merge 1 commit into
The calibrate gold-matcher required structuralMatch.normalizedMessage to be a
literal substring of the cluster's signatureKey/rootCause. But the BugHunter-bench
gold authors normalizedMessage as semantic labels ("homepage-multiple-h1",
"disallow-all") that never appear in detector output, so msgMatch always failed —
every structurally-correct detection was scored as false-positive + false-negative,
guaranteeing ~0% recall even when detectors work perfectly.
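The failure mode above can be reproduced in a few lines. This is a minimal sketch, not the code in calibrate/match.ts: the `Cluster` and `GoldEntry` shapes are assumed from the field names mentioned here, and the `signatureKey` and `rootCause` values are invented for illustration.

```typescript
// Hypothetical shapes: field names come from the description, the types are assumed.
interface Cluster {
  kind: string;
  location: string;
  signatureKey: string;
  rootCause: string;
}

interface GoldEntry {
  kind: string;
  normalizedLocation: string;
  normalizedMessage: string;
}

// Old rule as described: the gold label must be a literal substring
// of the detector's signatureKey or rootCause.
function oldMsgMatch(gold: GoldEntry, cluster: Cluster): boolean {
  return (
    cluster.signatureKey.includes(gold.normalizedMessage) ||
    cluster.rootCause.includes(gold.normalizedMessage)
  );
}

const gold: GoldEntry = {
  kind: "seo_h1",
  normalizedLocation: "/",
  normalizedMessage: "homepage-multiple-h1", // semantic label written by the gold author
};

// Illustrative detector output; the real strings are not shown in this PR.
const cluster: Cluster = {
  kind: "seo_h1",
  location: "/",
  signatureKey: "seo_h1|/",
  rootCause: "Page has 2 <h1> elements",
};

const structurallyCorrect =
  cluster.kind === gold.kind && cluster.location === gold.normalizedLocation;
const matched = structurallyCorrect && oldMsgMatch(gold, cluster);

console.log({ structurallyCorrect, matched });
```

The detection is structurally correct, yet the substring gate rejects it because the label "homepage-multiple-h1" does not occur in either detector string.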
Demote normalizedMessage to a tiebreaker: a single kind+location candidate is a
true positive; normalizedMessage only disambiguates among multiple same-kind+
location candidates (exactly one message-match wins; 0 or >1 is fatal ambiguity).
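The rule in the paragraph above can be sketched as follows. This is an illustration under stated assumptions rather than the actual matcher: `matchGold`, `MatchResult`, and the `Cluster`/`GoldEntry` shapes are hypothetical names, fields are compared by exact equality (wildcard handling is omitted), and "fatal ambiguity" is modelled as a returned outcome because the source does not show how the real code surfaces it.

```typescript
// Hypothetical shapes: field names come from the description, the types are assumed.
interface Cluster {
  kind: string;
  location: string;
  signatureKey: string;
  rootCause: string;
}

interface GoldEntry {
  kind: string;
  normalizedLocation: string;
  normalizedMessage: string;
}

type MatchResult =
  | { outcome: "true_positive"; cluster: Cluster }
  | { outcome: "miss" }
  | { outcome: "ambiguous"; candidates: Cluster[] };

// Substring check, now used only to break ties between candidates.
function messageMatches(gold: GoldEntry, c: Cluster): boolean {
  return (
    c.signatureKey.includes(gold.normalizedMessage) ||
    c.rootCause.includes(gold.normalizedMessage)
  );
}

function matchGold(gold: GoldEntry, clusters: Cluster[]): MatchResult {
  // Step 1: kind + location is the hard requirement.
  const candidates = clusters.filter(
    (c) => c.kind === gold.kind && c.location === gold.normalizedLocation,
  );
  const [first, ...rest] = candidates;
  if (first === undefined) return { outcome: "miss" };

  // Step 2: a single candidate is a true positive; the message is not consulted.
  if (rest.length === 0) return { outcome: "true_positive", cluster: first };

  // Step 3: several candidates, so the message must single out exactly one.
  const [winner, ...others] = candidates.filter((c) => messageMatches(gold, c));
  if (winner !== undefined && others.length === 0) {
    return { outcome: "true_positive", cluster: winner };
  }
  // 0 or >1 message matches: the gold entry cannot pick a candidate.
  return { outcome: "ambiguous", candidates };
}

const gold: GoldEntry = {
  kind: "seo_h1",
  normalizedLocation: "/",
  normalizedMessage: "homepage-multiple-h1",
};
// Illustrative detector output; the real strings are not shown in this PR.
const detected: Cluster[] = [
  { kind: "seo_h1", location: "/", signatureKey: "seo_h1|/", rootCause: "Page has 2 <h1> elements" },
];

const result = matchGold(gold, detected);
console.log(result.outcome);
```

With one seo_h1 cluster on "/", the gold entry resolves to a true positive even though its label appears nowhere in the detector strings.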
Validated end-to-end against the next-blog bench app: the planted
homepage-multiple-h1 bug (detected as seo_h1 on "/") now correctly scores as a
true positive — calibrate's first valid structural TP. next-blog recall went
0/15 -> 1/15 (the remaining misses are legitimate: anon-only run, 5-page crawl).
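For reference, the recall figures quoted above follow from the usual definition, recall = TP / (TP + FN). The sketch below only restates that arithmetic with the counts given in this message; `RecallTally` and `recall` are hypothetical names, not part of calibrate.

```typescript
// Hypothetical tally type; only the counts stated in this message are used.
interface RecallTally {
  truePositives: number;
  falseNegatives: number;
}

// recall = TP / (TP + FN); defined as 0 when there is no gold at all.
function recall(t: RecallTally): number {
  const goldTotal = t.truePositives + t.falseNegatives;
  return goldTotal === 0 ? 0 : t.truePositives / goldTotal;
}

// next-blog has 15 gold entries: 0 matched before the fix, 1 after.
const oldRun: RecallTally = { truePositives: 0, falseNegatives: 15 };
const newRun: RecallTally = { truePositives: 1, falseNegatives: 14 };

const oldPct = (recall(oldRun) * 100).toFixed(1); // "0.0"
const newPct = (recall(newRun) * 100).toFixed(1); // "6.7"
console.log(`recall: ${oldPct}% -> ${newPct}%`);
```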
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
❌ BugHunter Calibration DID NOT RUN | 2026-06-03
0 bench apps produced a report, so the calibration gate measured nothing (no precision/recall data). This is a failure, not a pass.
Failed:
Likely cause: bench apps did not boot (health-check timeout) or SurfaceMCP was not started. See the calibrate job log.
Why
The calibrate gold-matcher (calibrate/match.ts) required the gold's structuralMatch.normalizedMessage to be a literal substring of the cluster's signatureKey or rootCause. But the BugHunter-bench gold authors normalizedMessage as semantic labels (homepage-multiple-h1, disallow-all, no-meta-description) that never appear verbatim in any detector output. So msgMatch always failed, and every structurally-correct detection was scored as false-positive + false-negative. The matcher could not produce a true positive for most kinds even when the detector found the planted bug perfectly. This guaranteed ~0% recall regardless of detector quality.

This was the deepest of several layers of calibrate rot found while resurrecting the gate (see also #272). It meant BugHunter's recall had never been validly measured.
What
Demote normalizedMessage from a hard gate to a tiebreaker:
- A single kind+normalizedLocation candidate → true positive (the message label need not align).
- Multiple candidates → disambiguate on normalizedMessage: exactly one message-match wins; 0 or >1 → fatal ambiguity (gold is under-specified for that surface).

Location remains required (so a genuine location mismatch is still a miss), and "*" wildcards still work for both fields.

Validated end-to-end
Ran bughunter calibrate against the next-blog bench app (with a healthy SurfaceMCP anon surface, see SurfaceMCP#26). next-blog recall went from 0/15 to 1/15.

The planted homepage-multiple-h1 bug, detected as seo_h1 on /, now correctly scores as calibrate's first valid structural true positive. (The remaining 14 misses are legitimate: the run was anon-only so auth-gated bugs were unreachable, and only 5 pages were crawled.)

Tests
match.test.ts (TDD, RED→GREEN):
- kind+location match with a non-aligning message → TP
- genuine location mismatch → FN
- message disambiguates multiple candidates → TP
- indistinguishable multiple candidates → ambiguity

Full src suite: 2329 passing, tsc + build clean.

🤖 Generated with Claude Code