First calibration run — vibe-todo
Date: 2026-05-02T19:13:48Z
BugHunter version: b8b1546
Bench commit: 69ba188bb87f303d6cb3b692e7a06bc0a7da82ef
Run duration: 49s
Top-line numbers
Overall precision: 0%
Overall recall: 0%
F1: 0%
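The three figures above are consistent with pooling counts across all kinds (micro-averaging): 0 true positives, 20 false positives, and 24 false negatives. The following is a minimal sketch of that arithmetic, not BugHunter's own code; the function name `prf` and the choice to report 0.0 when a denominator is zero are assumptions made for illustration.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from pooled counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Pooled totals from this run: TP=0, FP=20, FN=24.
print(prf(0, 20, 24))  # (0.0, 0.0, 0.0)
```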
Per-kind table
| BugKind | Gold count | Detected | TP | FP | FN | Precision | Recall |
|---|---|---|---|---|---|---|---|
| auth_bypass_via_unauthed_route | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| console_error | 3 | 0 | 0 | 0 | 3 | 1.00 | 0.00 |
| coop_coep_violation (no gold) | 0 | 6 | 0 | 6 | 0 | 0.00 | n/a |
| idor_horizontal | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| interactive_element_missing_accessible_name | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| missing_csp_header | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| missing_state_change | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| network_4xx_unexpected | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| network_5xx | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| race_condition_optimistic_revert | 3 | 0 | 0 | 0 | 3 | 1.00 | 0.00 |
| react_error | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| seo_h1_missing_or_multiple (no gold) | 0 | 1 | 0 | 1 | 0 | 0.00 | n/a |
| seo_meta_description_missing (no gold) | 0 | 6 | 0 | 6 | 0 | 0.00 | n/a |
| seo_title_duplicate_across_routes (no gold) | 0 | 1 | 0 | 1 | 0 | 0.00 | n/a |
| seo_title_missing | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| slow_inp | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| swallowed_error_empty_catch | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| unbounded_list_render | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| unhandled_exception | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| vulnerable_dependency_high (no gold) | 0 | 6 | 0 | 6 | 0 | 0.00 | n/a |
| xss_dom | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
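The per-kind rows show precision 1.00 for kinds with zero detections and no recall value for kinds without gold entries. One convention that reproduces those cells is sketched below. This is inferred from the table, not taken from the calibrator source, and the name `kind_metrics` is hypothetical.

```python
from typing import Optional


def kind_metrics(tp: int, fp: int, fn: int) -> tuple[float, Optional[float]]:
    """Per-kind precision and recall under the convention the table suggests."""
    detected = tp + fp
    gold = tp + fn
    # Assumption: with no detections there are no wrong detections,
    # so precision is reported as a vacuous 1.00.
    precision = tp / detected if detected else 1.0
    # Assumption: recall is left undefined when the kind has no gold entries.
    recall = tp / gold if gold else None
    return precision, recall


print(kind_metrics(0, 0, 3))  # console_error row: (1.0, 0.0)
print(kind_metrics(0, 6, 0))  # coop_coep_violation row: (0.0, None)
```

Under this convention a precision of 1.00 next to a recall of 0.00 signals "never fired", not "fired accurately".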
Threshold check (defaults: precision ≥ 0.85, recall ≥ 0.80)
- Kinds passing: 0
- Kinds failing: 2 (console_error, race_condition_optimistic_revert — these are the only non-low-confidence kinds with gold ≥ 3)
- Low-confidence (< 3 gold entries): 19 kinds
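The pass, fail, and low-confidence counts can be reproduced from the per-kind table with the default thresholds. The sketch below assumes a kind is low-confidence when it has fewer than 3 gold entries and otherwise passes only if both thresholds are met. The constant and function names are illustrative, not BugHunter identifiers.

```python
from collections import Counter
from typing import Optional

MIN_PRECISION, MIN_RECALL, MIN_GOLD = 0.85, 0.80, 3

# (kind, gold count, precision, recall) copied from the per-kind table.
ROWS = [
    ("auth_bypass_via_unauthed_route", 1, 1.0, 0.0),
    ("console_error", 3, 1.0, 0.0),
    ("coop_coep_violation", 0, 0.0, None),
    ("idor_horizontal", 1, 1.0, 0.0),
    ("interactive_element_missing_accessible_name", 1, 1.0, 0.0),
    ("missing_csp_header", 1, 1.0, 0.0),
    ("missing_state_change", 2, 1.0, 0.0),
    ("network_4xx_unexpected", 1, 1.0, 0.0),
    ("network_5xx", 2, 1.0, 0.0),
    ("race_condition_optimistic_revert", 3, 1.0, 0.0),
    ("react_error", 1, 1.0, 0.0),
    ("seo_h1_missing_or_multiple", 0, 0.0, None),
    ("seo_meta_description_missing", 0, 0.0, None),
    ("seo_title_duplicate_across_routes", 0, 0.0, None),
    ("seo_title_missing", 2, 1.0, 0.0),
    ("slow_inp", 1, 1.0, 0.0),
    ("swallowed_error_empty_catch", 1, 1.0, 0.0),
    ("unbounded_list_render", 2, 1.0, 0.0),
    ("unhandled_exception", 1, 1.0, 0.0),
    ("vulnerable_dependency_high", 0, 0.0, None),
    ("xss_dom", 1, 1.0, 0.0),
]


def classify(gold: int, precision: float, recall: Optional[float]) -> str:
    """Bucket one kind as low_confidence, pass, or fail."""
    if gold < MIN_GOLD:
        return "low_confidence"
    if precision >= MIN_PRECISION and recall is not None and recall >= MIN_RECALL:
        return "pass"
    return "fail"


counts = Counter(classify(gold, p, r) for _, gold, p, r in ROWS)
print(counts["pass"], counts["fail"], counts["low_confidence"])  # 0 2 19
```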
Errors
All 24 detector_fires gold entries produced false_negative. The run completed in 49s with 20 clusters emitted, but zero clusters matched any gold entry structurally or by bugIdentity. Investigation surfaced three root causes (items 1 to 3) and one observation about the false positives (item 4):
1. BugHunter calibrate reads summary.json.clusters which is always empty. The emit phase writes clusters to bugs.jsonl, not to a clusters array in summary.json. The calibrator must be patched to read bugs.jsonl instead. A fix was applied locally (readClusters now falls back to bugs.jsonl) and rebuilt — this unblocked matching so the 20 FPs were visible. This is a blocker for any V44 calibration run producing non-zero metrics.
2. No API surface tools discovered (SurfaceMCP toolCount: 0). The vibe-todo app uses an Express backend on port 4200 with a Vite proxy on 4101. SurfaceMCP is configured against the Vite surface only and extracts 0 API tools, which is expected because Vite is a build tool, not an API framework. As a result, BugHunter cannot exercise the API-driven detectors: idor_horizontal, network_5xx, network_4xx_unexpected, auth_bypass_via_unauthed_route, and race_condition_optimistic_revert (API variant). To fix this, either the bench app's .bughunter/config.json needs a second SurfaceMCP surface pointed at the Express server on port 4200, or the bench's bughunter.config.json needs to be translated into surfacemcp.config.json entries. Note that the surfacemcp.config.json included in the bench repo is in BugHunter's API-description format, not the SurfaceMCP server config format.
3. Browser test infrastructure failures (6/22). The planned 22 UI tests had 6 infrastructure failures — browser_element_not_found for button:nth-of-type(N) selectors. BugHunter navigated to / (which redirects immediately) and /login (which has only 1 button), so many planned click interactions found no DOM target. This means browser-based detectors (console_error, xss_dom, react_error, slow_inp, missing_state_change, unhandled_exception) had no successful test executions.
4. The 20 FPs appear to be real bugs that are missing from the gold standard. BugHunter detected 6 coop_coep_violation (missing COOP/COEP headers), 6 seo_meta_description_missing, 1 seo_h1_missing_or_multiple, 1 seo_title_duplicate_across_routes, and 6 vulnerable_dependency_high. These look like legitimate findings, but none is listed in gold-standard.jsonl. Each should either be added to gold or, if review shows it is a genuine false positive, suppressed.
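The fallback described in item 1 can be sketched as follows. This is an illustrative Python rendering, not the patched readClusters itself. It assumes summary.json and bugs.jsonl sit in the same run directory, that summary.json is a JSON object, and that bugs.jsonl holds one JSON object per line.

```python
import json
from pathlib import Path


def read_clusters(run_dir: Path) -> list[dict]:
    """Prefer summary.json's clusters array; fall back to bugs.jsonl."""
    summary_path = run_dir / "summary.json"
    if summary_path.exists():
        summary = json.loads(summary_path.read_text())
        clusters = summary.get("clusters") or []
        if clusters:
            return clusters

    # Fallback: the emit phase writes clusters here, one JSON object per line.
    bugs_path = run_dir / "bugs.jsonl"
    if not bugs_path.exists():
        return []
    lines = bugs_path.read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

Checking summary.json first keeps the reader compatible with any run that does populate the clusters array, while an empty array no longer hides the emitted clusters.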
First 5 false negatives with rationale and suspected runner-input gap
- console_error (vibe-todo-001): Cannot read properties of undefined reading done. Requires a browser session on /dashboard with authenticated state. Browser login succeeded but no authenticated dashboard session was established for click tests. Gap: browser tests ran unauthenticated (dashboard redirected to login, tests at / found no matching elements).
- console_error (vibe-todo-002): jwt malformed. Requires waiting 5 seconds for the JWT to expire, then clicking Add todo. Gap: browser tests don't implement time-based waits; no test case targeted "wait for token expiry".
- console_error (vibe-todo-003): Failed to fetch on settings page. Requires navigation to /settings while authenticated. Gap: the DOM walk showed /settings had 0 elements; the browser navigated there, but the authenticated state wasn't maintained (JWT expired during the 49s run).
- race_condition_optimistic_revert (vibe-todo-006): POST with title breaking-task. Requires an API surface tool to issue POST /api/todos with a specific payload. Gap: SurfaceMCP toolCount=0 means no API tool is available; BugHunter cannot construct the targeted payload without a surface tool schema.
- missing_csp_header (vibe-todo-022): No CSP header on /. BugHunter's header-probe ran and found 0 detections. The vibe-todo Express server does not set a CSP header, so the probe should have flagged it. Gap: the header-probe targets the Vite dev server on 4101, which proxies to Express. Vite may inject its own headers in development mode that satisfy the probe; needs investigation.
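One way to follow up on the missing_csp_header question is to fetch / from both ports and compare whether a Content-Security-Policy header is present. The sketch below uses only the Python standard library. It is not BugHunter's header-probe, and the helper names and the localhost host in the usage note are assumptions.

```python
import urllib.error
import urllib.request


def has_csp(headers: dict[str, str]) -> bool:
    """True if a Content-Security-Policy header is present (case-insensitive)."""
    return "content-security-policy" in {name.lower() for name in headers}


def fetch_headers(url: str, timeout: float = 5.0) -> dict[str, str]:
    """Return response headers for a GET, including on 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return dict(resp.headers.items())
    except urllib.error.HTTPError as err:
        # Error responses still carry headers worth inspecting.
        return dict(err.headers.items())


def compare_csp(urls: list[str]) -> dict[str, bool]:
    """Map each URL to whether its response sets a CSP header."""
    return {url: has_csp(fetch_headers(url)) for url in urls}
```

With both servers running, `compare_csp(["http://localhost:4101/", "http://localhost:4200/"])` would show whether the Vite proxy and the Express backend differ. A CSP header present on 4101 but absent on 4200 would support the hypothesis that Vite is adding it.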
Recommendation for README
"On vibe-todo, BugHunter detects 0/25 expected bugs (recall 0%) with 0% precision. First-run calibration exposed two infrastructure blockers: (1) calibrate reads clusters from the wrong file (fix committed), and (2) the bench app's Express API surface is not wired to SurfaceMCP, preventing API-driven detectors from firing. Once these are resolved and the corpus gold extended to cover the 20 real bugs currently flagged as FPs, meaningful precision/recall numbers will be possible."
Setup notes for reproduction
The bench app requires a .bughunter/config.json in the app directory (not generated by bughunter calibrate automatically). This must point to a running SurfaceMCP instance and browserMcpUrl. The calibrate command does not create or start SurfaceMCP — the caller must set it up. The surfacemcp.config.json file in the bench app is a BugHunter API-description file (different format), not a SurfaceMCP server config.