fix: kill SEO-on-admin + new-window/download false positives (real-app precision 13.6%→~67%)#271
Merged
Merged
Conversation
SEO detectors (h1/title/meta/canonical + duplicate-title) fired on auth-gated /admin/* pages that search engines never index. On the spoonworks real-app benchmark this was 34/35 of all seo_h1 findings — the dominant false-positive category behind the honest 13.6% precision. Add isSeoIndexable() + an `indexable` field on SeoPageInput. A page is indexable only when it declares no `noindex` and an anonymous request fetches it with a 2xx (search engines crawl unauthenticated; auth-gated routes 3xx-redirect to login). classifySeoCorpus skips per-page hygiene rules and excludes non-indexable pages from the duplicate-title cross- check. The producer in execute.ts probes anonymous reachability per scraped page before classifying. Validated against run trucp69ve6r6j9vk0i4ttkfc: 34/35 admin seo_h1 FPs suppressed, the public duplicate-title true positive preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
missing_state_change fired on working controls whose click opens a new tab or triggers a download (e.g. "Download label" -> window.open(url, '_blank')) — no in-page change happens, by design. On the spoonworks real-app benchmark these were confirmed false positives by adversarial triage. Add an openedNewWindowOrDownload signal to PostState. The producer instruments window.open before each click and inspects the clicked element's nearest anchor (target="_blank" / download attribute); classifyMissingStateChange returns null when set. Validated live in camofox: window.open -> count 1, _blank/download anchor -> detected, normal click -> neither (still fires). The two remaining FPs in this cluster (POST-with-loading, async stepper) stem from deeper executor signal gaps (networkRequests hardcoded [], domMutationCount ignores attribute/text mutations) and are documented in docs/follow-ups/missing-state-change-executor-gaps.md rather than fixed here, to avoid regressing the detector's true positives. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
✅ BugHunter Calibration | | 2026-06-03 Overall: tp=0 fp=0 fn=0 precision=1 recall=1 f1=0
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The README claimed 4/4 = 100% real-app precision on the spoonworks benchmark. An audit showed that number was a read-only artifact: that run was invoked with
--read-only, which skipped 2738/2932 tests (every mutating one), leaving only the static dependency-audit detector to fire. The "32 infra failures" attributed to a broken browser surface were allMutatingActionRejectedError: read-only mode— the browser surface was never broken.The honest full-surface measurement (authed as
owner+ anon, mutating, mcp-http, 3676 tests, all detectors incl. a11y/SEO) plus per-cluster adversarial triage (skeptic → independent refuter, 0 flips, 0 uncertain) gives:trucp69…)Per-detector:
vulnerable_dependency_high5/5 real,seo_title_duplicate_across_routes1/1 real,missing_state_change0/3,seo_h1_missing_or_multiple0/35. The false positives are concentrated in two patterns this PR fixes.What
1.
fix(seo)— skip SEO hygiene rules on non-indexable routes.SEO rules (h1/title/meta/canonical + duplicate-title) fired on auth-gated
/admin/*pages that search engines never index — 34/35 of allseo_h1findings. AddsisSeoIndexable()+ anindexablefield onSeoPageInput. A page is indexable only when it declares nonoindexand an anonymous request fetches it with a 2xx (search engines crawl unauthenticated; auth-gated routes 3xx-redirect to login).classifySeoCorpusskips per-page hygiene rules and excludes non-indexable pages from the duplicate-title cross-check; the producer inexecute.tsprobes anonymous reachability per scraped page.trucp69…: 34/35 adminseo_h1FPs suppressed; the 1 kept (/admin/login) is genuinely anonymously reachable (200), and the public duplicate-title true positive is preserved.2.
fix(state-change)— ignore clicks that open a new window or download.missing_state_changeflagged working controls whose click opens a new tab / triggers a download (e.g. "Download label" →window.open(url, '_blank')) — no in-page change happens, by design. Adds anopenedNewWindowOrDownloadsignal toPostState; the producer instrumentswindow.openbefore each click and inspects the clicked element's nearest anchor (target="_blank"/download); the classifier returnsnullwhen set.window.open→ count 1,_blank/downloadanchor → detected, normal click → neither (still fires).Deliberately deferred (documented, not silently skipped)
The other two
missing_state_changeFPs ("Create return label" POST-with-loading, "Increase quantity" async stepper) stem from deeper executor signal gaps:execute.tshardcodesnetworkRequests: []on the UI click path, anddomMutationCountcounts onlychildListmutations (not attribute/text). Fixing those touches the detector's true-positive surface, so they're documented indocs/follow-ups/missing-state-change-executor-gaps.mdrather than risked here.Impact
Projected precision on this surface: 13.6% → ~6/9 ≈ 67% (38 FPs → ~3). A full re-run to confirm the aggregate has not been done.
Verification
classify/seo.test.ts(11),classify/state-change.test.ts(+2). Red→green watched.tsc --noEmitclean; eslint clean on changed files; classify suite 316 pass; full suite 2961 pass (the 1 failure,tests/plan-probe.test.ts, fails on cleanmaintoo — a stale plan-count assertion, unrelated to this PR).Note on the 5 real
vulnerable_dependency_highAll 5 are genuine (npm-audit + GHSA confirmed) but severity-inflated — tagged
criticalwhile the applicable advisory ishigh(e.g. thecriticalNext.js advisories target 15.x, not the installed 16.2.2). A severity-calibration follow-up, not a precision issue.🤖 Generated with Claude Code