fix(error-lookup): suppress weak catalog when semantic returned useful results + content-thin chunk filter #22
Merged
critesjosh merged 2 commits into main on May 4, 2026
…k catalog when semantic ran

Companion to docsgpt's apiref-empty-chunk filter (critesjosh/docsgpt-aztec#66). Two layered changes:

1. **Client-side defense-in-depth filter** (`isUsefulSemanticChunk` in `src/tools/error-lookup.ts`). Mirrors the Python helper `_is_empty_apiref_chunk` in docsgpt's `/api/search`: drops chunks whose body, after stripping the rendered file-path heading, is empty or path-only (every line contains `/` and no whitespace). This is defense-in-depth because the MCP server can be pointed at any DocsGPT deployment via `API_URL`; a fork or older instance may not have the server-side filter, and a future ingest regression could reintroduce path-only chunks.

   Critically, legitimate signature-only chunks survive. The filter inspects content shape (whitespace presence in the remaining lines), not length: `pub fn poseidon(input: [Field; N]) -> Field` has spaces, so it never trips the path-only test.

   When all returned chunks are path-only, `lookupAztecError` now reports `semanticHealth: "no_results"` (semantically accurate: the backend ran cleanly but didn't return anything useful) rather than `"ok"` with three useless paths.

2. **Suppress weak catalog hints when semantic was useful** (`formatErrorLookupResult` in `src/utils/format.ts`). The user-reported anchoring failure: when semantic returns content-bearing chunks AND every catalog match is below the strong-match threshold, the catalog hits are pure noise; the user keeps reading them as "the primary answer" even though semantic gave us the actual answer. The new `suppressWeakCatalog` flag hides the catalog section entirely from rendered output in that case. The matches remain in `result.catalogMatches` for programmatic consumers needing every signal.

   When semantic was unhelpful (no_results / failed / version mismatch / no client) the weak catalog is KEPT, since it's the user's only signal. The "Lower-Confidence Catalog Hints" header plus a neutral "treat as low-confidence cues only" note frame it honestly.

Tests: 282/282 (was 264, +18 across error-lookup + format suites).

- `isUsefulSemanticChunk` regression cases: path-only / md-heading-only / completely empty / signature-bearing / doc-comment-bearing / multi-line path re-exports.
- `lookupAztecError` integration: all-path-only chunks → no_results, mixed chunks → only useful ones surface.
- Suppression matrix: weak + semantic-ok hides catalog; weak + every other state keeps it visible.
- Strong catalog matches always render normally regardless of semantic state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
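The `semanticHealth` downgrade described in change 1 can be pictured with a minimal TypeScript sketch. This is not the code in `src/tools/error-lookup.ts`: the `SemanticChunk` shape, the `summarizeSemantic` name, the sample chunk data, and passing the usefulness predicate as a parameter are all assumptions made for illustration.

```typescript
// Illustrative only: these types and names are assumptions, not the repo's.
// Only the two health values named in the commit message are modeled.
type SemanticHealth = "ok" | "no_results";

interface SemanticChunk {
  title: string;
  text: string;
}

interface SemanticSummary {
  health: SemanticHealth;
  chunks: SemanticChunk[];
}

// Keep only chunks the predicate accepts. If nothing survives, report
// "no_results" instead of "ok" so callers do not render bare paths.
function summarizeSemantic(
  returned: SemanticChunk[],
  isUseful: (chunk: SemanticChunk) => boolean,
): SemanticSummary {
  const chunks = returned.filter(isUseful);
  return { health: chunks.length > 0 ? "ok" : "no_results", chunks };
}

// Deliberately trivial stand-in predicate (hypothetical): a chunk counts as
// useful when its trimmed text contains any whitespace.
const hasSpaces = (chunk: SemanticChunk): boolean => /\s/.test(chunk.text.trim());

const allPaths = summarizeSemantic(
  [{ title: "utils.nr", text: "aztec-nr/utils.nr" }],
  hasSpaces,
);
const mixed = summarizeSemantic(
  [
    { title: "utils.nr", text: "aztec-nr/utils.nr" },
    { title: "poseidon", text: "pub fn poseidon(input: [Field; N]) -> Field" },
  ],
  hasSpaces,
);
console.log(allPaths.health, mixed.health, mixed.chunks.length);
```

Taking the predicate as a parameter keeps the health decision separate from the chunk-shape rules, so the sketch stays valid whichever filter is plugged in.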
Codex review feedback. Two related issues:

1. The sourceish set used `match.source` and `match.title` to detect a rendered file-path heading line. But `/api/search` rewrites `source` to a public URL (`_aztec_source_url` produces e.g. `https://github.com/.../foo.nr`), so the bare-path heading `aztec-nr/.../foo.nr` never matched the URL. The heading was never stripped, and the chunk fell through to the path-shape check, which also missed because `# foo/bar.nr` contains whitespace from the markdown marker. Result: a class of empty chunks slipped through both gates.

2. The mitigation, stripping a leading `#+ ` from each line before the path-shape predicate, makes the metadata coupling unnecessary. Drop the sourceish comparison entirely. The new helper `lineIsPathShaped` strips heading markers, then checks "contains `/` and no whitespace". Real signature lines always have whitespace (`pub fn ...`, `struct ...`, `pub use a::b;`), so they never trip the predicate.

Equivalent fix on the docsgpt side: critesjosh/docsgpt-aztec#66 gets the same shape-only simplification.

New regression test: a chunk with a `#`-prefixed heading body and a URL-rewritten source field (the exact failure mode codex described) is correctly identified as "no useful results".

283/283 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
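The shape-only predicate described in this commit could look roughly like the TypeScript sketch below. It follows the wording of the commit message (strip a leading `#+ `, then require a `/` and no whitespace) but is not copied from the repository. `isUsefulBody` is a hypothetical, simplified stand-in for `isUsefulSemanticChunk` that takes the chunk body as a plain string, and the sample paths are made up.

```typescript
// Sketch under stated assumptions; the real helpers may take a match object
// and handle more cases than shown here.

// Strip a leading markdown heading marker ("# ", "## ", ...), then test the
// shape: the line is path-shaped if it contains "/" and no whitespace.
function lineIsPathShaped(line: string): boolean {
  const stripped = line.trim().replace(/^#+\s+/, "");
  return stripped.includes("/") && !/\s/.test(stripped);
}

// Hypothetical stand-in for isUsefulSemanticChunk: a body is useful when it
// has at least one non-blank line and not every such line is path-shaped.
function isUsefulBody(body: string): boolean {
  const lines = body
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return false; // completely empty chunk
  return !lines.every(lineIsPathShaped);
}

console.log(lineIsPathShaped("# aztec-nr/foo/bar.nr")); // heading-wrapped path
console.log(lineIsPathShaped("pub fn poseidon(input: [Field; N]) -> Field"));
console.log(isUsefulBody("# aztec-nr/foo/bar.nr\naztec-nr/foo/baz.nr"));
console.log(isUsefulBody("# aztec-nr/foo/bar.nr\npub use a::b;"));
```

Because the check looks only at line shape, it needs no `source` or `title` metadata, which is the coupling this commit removes.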
🎉 This PR is included in version 1.21.1 🎉

Your semantic-release bot 📦🚀
Summary
Companion to critesjosh/docsgpt-aztec#66 (which filters content-thin apiref chunks server-side). This PR adds the matching client-side fixes.

Background
The v1.21 dogfood test reported `aztec_lookup_error("note already nullified")` as "the same bogus result". Empirical investigation showed the threshold fix from PR #20 was working (semantic was firing), but two compounding issues made the response look broken:

1. The semantic chunks returned were path-only (`note_existence_request.nr`, `utils.nr`) with no body content. The user saw an apparently-empty `## Related Documentation` section.
2. The `Contract already initialized` catalog hint stayed visible under `## Lower-Confidence Catalog Hints`. The user remembered it from v1.20 and concluded "unchanged".

Fix
Part 1: `isUsefulSemanticChunk` filter

New helper in `src/tools/error-lookup.ts`. Mirrors the Python helper `_is_empty_apiref_chunk` in docsgpt: drops chunks whose body, after stripping the rendered file-path heading, is empty or path-only. Critically, legitimate signature-only chunks survive. The filter inspects content shape (whitespace presence), not length.

Defense-in-depth because:

- The MCP server can be pointed at any DocsGPT deployment via `API_URL`. A fork or older instance may not have the server-side filter.
- A future ingest regression could reintroduce path-only chunks.

When all returned chunks are path-only, `lookupAztecError` reports `semanticHealth: "no_results"` rather than `"ok"` with three useless paths.

Part 2:
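`suppressWeakCatalog` in the formatter

New flag in `formatErrorLookupResult`. Behavior matrix:

| Catalog matches | Semantic result | Rendered catalog section |
| --- | --- | --- |
| Strong | Any | `## Known Errors` |
| Weak only | Useful chunks (`ok`) | Hidden |
| Weak only | Unhelpful (no_results / failed / version mismatch / no client) | `## Lower-Confidence Catalog Hints` |

When semantic gave us substance, the weak hint is pure noise the user keeps anchoring on, so hide it. When semantic was unhelpful, the weak hint stays visible (it's the user's only signal) with a neutral "low-confidence cues only" note.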
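The suppression rule can be written down as a small decision function. The TypeScript below is a sketch, not the logic in `src/utils/format.ts`: the `CatalogMatch` fields, the numeric `score`, the `STRONG_MATCH_THRESHOLD` value, and the `shouldSuppressWeakCatalog` function name are all hypothetical.

```typescript
// Illustrative only: field names and the threshold value are assumptions.
interface CatalogMatch {
  title: string;
  score: number; // hypothetical match confidence in [0, 1]
}

const STRONG_MATCH_THRESHOLD = 0.8; // hypothetical cut-off

// Suppress the rendered catalog section only when there are catalog matches,
// every one of them is weak, and semantic search returned useful chunks.
function shouldSuppressWeakCatalog(
  matches: CatalogMatch[],
  semanticWasUseful: boolean,
): boolean {
  if (matches.length === 0) return false; // nothing to hide
  const allWeak = matches.every((m) => m.score < STRONG_MATCH_THRESHOLD);
  return allWeak && semanticWasUseful;
}

const weakOnly: CatalogMatch[] = [{ title: "Contract already initialized", score: 0.3 }];
const withStrong: CatalogMatch[] = [{ title: "Contract already initialized", score: 0.95 }];

console.log(shouldSuppressWeakCatalog(weakOnly, true));   // → true (hidden)
console.log(shouldSuppressWeakCatalog(weakOnly, false));  // → false (kept as low-confidence hint)
console.log(shouldSuppressWeakCatalog(withStrong, true)); // → false (strong match always renders)
```

Only rendering is affected in this sketch: the `matches` array is never mutated, which mirrors the statement that programmatic consumers still see every catalog match.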
The catalog is still present in `result.catalogMatches` for programmatic consumers that need every signal; only the rendered output is filtered.

Test plan
- `npm run build` (tsc): clean
- `npx vitest run`: 282/282 (was 264; +18 new cases)
- `isUsefulSemanticChunk` regression: path-only / md-heading-only / completely empty / signature-bearing (`pub fn poseidon`) / doc-comment-bearing / `pub struct` / multi-line path re-exports
- `lookupAztecError` integration: all-path-only → `no_results`, mixed → only useful chunks surface
- `aztec_lookup_error("note already nullified")` against the updated docsgpt + this MCP version. Expected output: `## Related Documentation` with substantive chunks (post docsgpt#66 filtering); no `Contract already initialized` mention; clean message.

Companion docsgpt PR
Server-side filter: critesjosh/docsgpt-aztec#66. Order doesn't matter for shipping: either side independently improves the UX, and both together close the loop.

🤖 Generated with Claude Code