[calibration] vibe-todo first run #93

## First calibration run — vibe-todo

Date: 2026-05-02T19:13:48Z
BugHunter version: b8b1546c566389ca5e6d93517ba64bc4064e6606
Bench commit: 69ba188bb87f303d6cb3b692e7a06bc0a7da82ef
Run duration: 49s

### Top-line numbers
Overall precision: 0%
Overall recall: 0%
F1: 0%
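
For reference, these figures follow from the totals in the per-kind table (TP = 0, FP = 20, FN = 24). Below is a minimal Python sketch of the arithmetic, assuming the calibrator uses the standard micro-averaged definitions. `overall_metrics` is a hypothetical helper for illustration, not part of BugHunter, and returning 0 for a zero denominator is an assumption.

```python
def overall_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1); a zero denominator yields 0.0 (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Totals from this run: 0 true positives, 20 false positives, 24 false negatives.
print(overall_metrics(0, 20, 24))  # (0.0, 0.0, 0.0)
```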

### Per-kind table
| BugKind | Gold count | Detected | TP | FP | FN | Precision | Recall |
|---------|------------|----------|----|----|----|-----------|--------|
| auth_bypass_via_unauthed_route | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| console_error | 3 | 0 | 0 | 0 | 3 | 1.00 | 0.00 |
| coop_coep_violation (no gold) | 0 | 6 | 0 | 6 | 0 | 0.00 | — |
| idor_horizontal | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| interactive_element_missing_accessible_name | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| missing_csp_header | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| missing_state_change | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| network_4xx_unexpected | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| network_5xx | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| race_condition_optimistic_revert | 3 | 0 | 0 | 0 | 3 | 1.00 | 0.00 |
| react_error | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| seo_h1_missing_or_multiple (no gold) | 0 | 1 | 0 | 1 | 0 | 0.00 | — |
| seo_meta_description_missing (no gold) | 0 | 6 | 0 | 6 | 0 | 0.00 | — |
| seo_title_duplicate_across_routes (no gold) | 0 | 1 | 0 | 1 | 0 | 0.00 | — |
| seo_title_missing | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| slow_inp | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| swallowed_error_empty_catch | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| unbounded_list_render | 2 | 0 | 0 | 0 | 2 | 1.00 | 0.00 |
| unhandled_exception | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |
| vulnerable_dependency_high (no gold) | 0 | 6 | 0 | 6 | 0 | 0.00 | — |
| xss_dom | 1 | 0 | 0 | 0 | 1 | 1.00 | 0.00 |

### Threshold check (defaults: precision ≥ 0.85, recall ≥ 0.80)
- Kinds passing: 0
- Kinds failing: 2 (console_error, race_condition_optimistic_revert — these are the only non-low-confidence kinds with gold ≥ 3)
- Low-confidence (< 3 gold entries): 19 kinds
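
The bucketing above can be reproduced with a few lines of Python. This is only a sketch of the rule as described in this report (fewer than 3 gold entries means low-confidence, otherwise both thresholds must hold). The names `KindStats` and `classify` are hypothetical and do not come from BugHunter.

```python
from dataclasses import dataclass

MIN_GOLD = 3          # kinds with fewer gold entries are reported as low-confidence
MIN_PRECISION = 0.85  # default precision threshold
MIN_RECALL = 0.80     # default recall threshold


@dataclass
class KindStats:
    kind: str
    gold: int
    precision: float
    recall: float


def classify(stats: KindStats) -> str:
    """Bucket one per-kind row as 'low_confidence', 'pass', or 'fail'."""
    if stats.gold < MIN_GOLD:
        return "low_confidence"
    if stats.precision >= MIN_PRECISION and stats.recall >= MIN_RECALL:
        return "pass"
    return "fail"


# Three rows copied from the per-kind table.
rows = [
    KindStats("console_error", 3, 1.00, 0.00),
    KindStats("race_condition_optimistic_revert", 3, 1.00, 0.00),
    KindStats("network_5xx", 2, 1.00, 0.00),
]
for row in rows:
    print(row.kind, classify(row))
```

Both kinds with 3 gold entries fail on recall alone, which matches the two failing kinds listed above.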

### Errors

All 24 `detector_fires` gold entries produced `false_negative`. The run completed in 49s with 20 clusters emitted, but zero clusters matched any gold entry structurally or by bugIdentity. Investigation surfaced two root causes (items 1 and 2) and two further findings (items 3 and 4):

**1. BugHunter calibrate reads `summary.json.clusters` which is always empty.** The emit phase writes clusters to `bugs.jsonl`, not to a `clusters` array in `summary.json`. The calibrator must be patched to read `bugs.jsonl` instead. A fix was applied locally (`readClusters` now falls back to `bugs.jsonl`) and rebuilt — this unblocked matching so the 20 FPs were visible. This is a blocker for any V44 calibration run producing non-zero metrics.

**2. No API surface tools discovered (SurfaceMCP toolCount: 0).** The vibe-todo app uses an Express backend on port 4200 with a Vite proxy on 4101. SurfaceMCP is configured against the Vite surface only and extracts 0 API tools, which is expected because Vite is a build tool, not an API framework. As a result, BugHunter cannot exercise the API-driven detectors: idor_horizontal, network_5xx, network_4xx_unexpected, auth_bypass_via_unauthed_route, and race_condition_optimistic_revert (API variant). To fix, either the bench app's `.bughunter/config.json` needs a second SurfaceMCP surface pointed at the Express server on port 4200, or the bench's `bughunter.config.json` needs to be translated into `surfacemcp.config.json` entries. Note that the `surfacemcp.config.json` currently included in the bench repo is in BugHunter's API-description format, not the SurfaceMCP server config format.

**3. Browser test infrastructure failures (6/22).** The planned 22 UI tests had 6 infrastructure failures — `browser_element_not_found` for `button:nth-of-type(N)` selectors. BugHunter navigated to `/` (which redirects immediately) and `/login` (which has only 1 button), so many planned click interactions found no DOM target. This means browser-based detectors (console_error, xss_dom, react_error, slow_inp, missing_state_change, unhandled_exception) had no successful test executions.

**4. The 20 FPs appear to be real bugs that are missing from the gold standard.** BugHunter detected: 6 `coop_coep_violation` (missing COOP/COEP headers), 6 `seo_meta_description_missing`, 1 `seo_h1_missing_or_multiple`, 1 `seo_title_duplicate_across_routes`, and 6 `vulnerable_dependency_high`. These look like legitimate findings, but none is listed in `gold-standard.jsonl`. Each should be reviewed and then either added to gold or, if it turns out to be a genuine false positive, suppressed.
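
The fallback described under root cause 1 can be illustrated with a short Python sketch. It is not the actual patch to `readClusters`. It only shows the logic, under the assumptions that `summary.json` holds a JSON object, that both files sit in the same run directory, and that `bugs.jsonl` holds one JSON object per line. The function name `read_clusters` and the `run_dir` parameter are hypothetical.

```python
import json
from pathlib import Path


def read_clusters(run_dir: Path) -> list[dict]:
    """Prefer summary.json's `clusters`; fall back to bugs.jsonl when missing or empty."""
    summary_path = run_dir / "summary.json"
    if summary_path.exists():
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
        clusters = summary.get("clusters") or []
        if clusters:
            return clusters

    bugs_path = run_dir / "bugs.jsonl"
    if not bugs_path.exists():
        return []
    # JSON Lines: parse every non-blank line as its own JSON object.
    with bugs_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Keeping `summary.json` as the first choice means a later emit-phase change that does populate `clusters` would not need another calibrator patch.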

### First 5 false negatives with rationale and suspected runner-input gap

1. **console_error (vibe-todo-001)**: `Cannot read properties of undefined reading done`. Requires a browser session on `/dashboard` with authenticated state. Browser login succeeded but no authenticated dashboard session was established for click tests. Gap: browser tests ran unauthenticated (dashboard redirected to login, tests at `/` found no matching elements).

2. **console_error (vibe-todo-002)**: `jwt malformed`. Requires waiting 5 seconds for the JWT to expire then clicking Add todo. Gap: browser tests don't implement time-based waits; no test case targeted "wait for token expiry".

3. **console_error (vibe-todo-003)**: `Failed to fetch` on settings page. Requires navigation to `/settings` while authenticated. Gap: DOM walk showed `/settings` had 0 elements. The browser navigated there, but the authenticated state wasn't maintained (the JWT expired during the 49s run).

4. **race_condition_optimistic_revert (vibe-todo-006)**: POST with title `breaking-task`. Requires an API surface tool to issue `POST /api/todos` with specific payload. Gap: SurfaceMCP toolCount=0 means no API tool available; BugHunter cannot construct the targeted payload without surface tool schema.

5. **missing_csp_header (vibe-todo-022)**: No CSP header on `/`. BugHunter's header-probe ran and found 0 detections — the probe checked for CSP but the vibe-todo Express server does not set it, so this should have been detected. Gap: the header-probe targets the Vite dev server on 4101, which proxies to Express. Vite may inject its own headers in development mode that satisfy the probe; needs investigation.
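
One way to investigate the CSP gap in item 5 is to compare the response headers served on port 4101 (Vite proxy) with those served on port 4200 (Express). The Python sketch below uses only the standard library. The helper names are hypothetical, `localhost` is an assumed host, and it treats only an enforcing `Content-Security-Policy` header as "present" (a report-only header does not count).

```python
import urllib.request


def fetch_headers(url: str, timeout: float = 5.0) -> dict[str, str]:
    """GET `url` and return response headers with lower-cased names.

    Note: urlopen follows redirects and raises HTTPError on 4xx/5xx responses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return {name.lower(): value for name, value in resp.getheaders()}


def has_csp(headers: dict[str, str]) -> bool:
    """True if a non-empty Content-Security-Policy header is present (case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return bool(lowered.get("content-security-policy", "").strip())


def compare_csp(urls: list[str]) -> dict[str, bool]:
    """Map each URL to whether its response carries a CSP header."""
    return {url: has_csp(fetch_headers(url)) for url in urls}


# Example (requires the bench app to be running; host is an assumption):
# compare_csp(["http://localhost:4101/", "http://localhost:4200/"])
```

If the two ports disagree, the proxy layer is altering headers and the probe target should be reconsidered. If both report no CSP, the gap is more likely in the probe or in matching.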

### Recommendation for README
"On vibe-todo, BugHunter detects 0/25 expected bugs (recall 0%) with 0% precision. First-run calibration exposed two infrastructure blockers: (1) `calibrate` reads clusters from the wrong file (fix committed), and (2) the bench app's Express API surface is not wired to SurfaceMCP, preventing API-driven detectors from firing. Once these are resolved and the corpus gold extended to cover the 20 real bugs currently flagged as FPs, meaningful precision/recall numbers will be possible."

### Setup notes for reproduction

The bench app requires a `.bughunter/config.json` in the app directory; `bughunter calibrate` does not generate it automatically. The file must point to a running SurfaceMCP instance and set `browserMcpUrl`. The calibrate command does not create or start SurfaceMCP, so the caller must set it up beforehand. The `surfacemcp.config.json` file in the bench app is a BugHunter API-description file (a different format), not a SurfaceMCP server config.
