fix: extraction/resolution accuracy (BOM, retry strip, framework regex) #101
Open
andreinknv wants to merge 2 commits into
Three accuracy bugs caught by an audit pass, bundled in one PR.
### 1. UTF-8 BOM caused spurious "modified" hash mismatches
`src/extraction/index.ts` (hashContent)
A file written with a BOM by one editor and re-saved without a BOM by
another (VSCode strips by default; some Windows editors preserve it)
hashed to two different values. Sync then reported the file as modified
on every run. hashContent now strips a leading U+FEFF before hashing.
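A minimal sketch of the BOM normalization described above. The SHA-256 digest and the exact function bodies are assumptions for illustration; only the names `stripBom` and `hashContent` come from this change.

```typescript
import { createHash } from "node:crypto";

// Drop a single leading U+FEFF (byte order mark) if one is present.
function stripBom(content: string): string {
  return content.charCodeAt(0) === 0xfeff ? content.slice(1) : content;
}

// Assumption: a SHA-256 hex digest. The hash actually used by hashContent
// is not stated in this PR.
function hashContent(content: string): string {
  return createHash("sha256").update(stripBom(content), "utf8").digest("hex");
}

const withBom = "\uFEFFexport const x = 1;\n";
const withoutBom = "export const x = 1;\n";

// Both saves of the same file now hash identically.
console.log(hashContent(withBom) === hashContent(withoutBom)); // true
```

Only a BOM at offset 0 is removed; a U+FEFF elsewhere in the file is left alone and still affects the hash.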
### 2. Parse-retry comment strip was a no-op for Python, Ruby, etc.
`src/extraction/index.ts` (last-resort retry path), `src/utils.ts`
The "shrink the file by removing comment-only lines" fallback used
`/^\s*\/\//.test(line)` for every language. Files whose comment marker
is not `//` (Python `#`, Ruby `#`) had nothing stripped, so the retry
ran the same content that had already crashed and the file silently
stayed unindexed. Added a per-language LINE_COMMENT_MARKER table and a
stripCommentLinesForRetry helper used at the retry call site.
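The language-aware strip could look roughly like the sketch below. The table entries shown are a hypothetical subset, and dropping the lines outright (rather than blanking them) is an assumption based on the "shrink the file" wording.

```typescript
// Hypothetical subset of the per-language table; the real entries live in
// src/utils.ts and are not reproduced here.
const LINE_COMMENT_MARKER: Record<string, string> = {
  typescript: "//",
  javascript: "//",
  rust: "//",
  python: "#",
  ruby: "#",
};

// Remove lines that consist only of a line comment for the given language.
// Unknown languages are returned unchanged.
function stripCommentLinesForRetry(content: string, language: string): string {
  const marker = LINE_COMMENT_MARKER[language];
  if (marker === undefined) return content;
  return content
    .split("\n")
    .filter((line) => line.replace(/^\s+/, "").indexOf(marker) !== 0)
    .join("\n");
}

const py = "# header\nimport os\n  # indented note\nprint(os.name)\n";

// With the Python marker, both comment-only lines are dropped:
// "import os\nprint(os.name)\n"
console.log(JSON.stringify(stripCommentLinesForRetry(py, "python")));

// With the "//" marker (the old behaviour for every language), nothing changes.
console.log(stripCommentLinesForRetry(py, "typescript") === py); // true
```

Trailing comments on code lines are intentionally left in place; only comment-only lines are removed.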
### 3. Framework route extractors matched docstrings/comments
`src/resolution/frameworks/{python,express,laravel,rust,csharp}.ts`
`pattern.exec(content)` ran the route regex over raw file content, so a
route example in a Python docstring or a commented-out route in JS was
extracted as a real route node. AI assistants then saw phantom routes
that do not exist in the running app.
Added stripCommentsForRegex (utils.ts) which neutralizes block comments,
whole-line line comments, and (for Python) triple-quoted strings,
preserving newlines so match.index maps back to the original line
numbers. Applied at the top of every framework extractor that runs a
regex over content. Deliberately does NOT strip arbitrary string
literals, since those carry the actual route paths the regex needs.
Languages covered: js/ts/tsx/jsx, java, csharp, c/cpp, go, rust, swift,
kotlin, dart, scala, php, python, ruby, pascal.
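The newline-preserving idea can be sketched as follows. This is a simplified illustration, not the helper from `utils.ts`: it handles only Python and a generic C-family fallback, and `findRoutes` with its route regex is hypothetical.

```typescript
// Replace every character except "\n" with a space, so string offsets and
// line numbers in the result match the original content.
function blank(text: string): string {
  return text.replace(/[^\n]/g, " ");
}

// Simplified sketch: Python, plus a C-family fallback for everything else.
function stripCommentsForRegex(content: string, language: string): string {
  if (language === "python") {
    return content
      .replace(/"""[\s\S]*?"""|'''[\s\S]*?'''/g, blank) // triple-quoted strings
      .replace(/^[ \t]*#.*$/gm, blank); // whole-line # comments
  }
  return content
    .replace(/\/\*[\s\S]*?\*\//g, blank) // block comments
    .replace(/^[ \t]*\/\/.*$/gm, blank); // whole-line // comments
}

// Hypothetical route regex, used only for this example.
function findRoutes(text: string): { path: string; line: number }[] {
  const routePattern = /@app\.get\("([^"]+)"\)/g;
  const routes: { path: string; line: number }[] = [];
  let match: RegExpExecArray | null;
  while ((match = routePattern.exec(text)) !== null) {
    // 1-based line number derived from match.index.
    const line = text.slice(0, match.index).split("\n").length;
    routes.push({ path: match[1], line });
  }
  return routes;
}

const source = [
  'def handler():',
  '    """Example: @app.get("/phantom")"""',
  '    return 1',
  '# @app.get("/commented")',
  '@app.get("/real")',
  'def real(): ...',
].join("\n");

console.log(findRoutes(source).length); // 3 (two phantom routes)
console.log(findRoutes(stripCommentsForRegex(source, "python")));
// [ { path: '/real', line: 5 } ]
```

Because comment text is blanked rather than deleted, the line reported for `/real` is the same with and without stripping, and the string literal carrying the real route path is untouched.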
## Files changed
| File | Change |
|---|---|
| src/utils.ts | Add stripBom, stripCommentsForRegex, stripCommentLinesForRetry, LINE_COMMENT_MARKER table |
| src/extraction/index.ts | Strip BOM in hashContent; use language-aware retry strip |
| src/resolution/frameworks/python.ts | Strip comments before django/flask/fastapi route regex |
| src/resolution/frameworks/express.ts | Strip comments before express route regex |
| src/resolution/frameworks/laravel.ts | Strip comments before laravel route regex |
| src/resolution/frameworks/rust.ts | Strip comments before actix/rocket/axum route regex |
| src/resolution/frameworks/csharp.ts | Strip comments before aspnet route regex |
| __tests__/extraction-resolution-accuracy.test.ts | 21 regression tests |
## Test plan
- [x] npm test: 400/400 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation)
- [x] npx tsc --noEmit clean
- [x] 21 new tests covering: BOM normalization, per-language line stripping, and false-positive-prevention for every affected framework
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer caught: the minimalApiPattern loop in csharp.ts (Map{Get,Post,...}
ASP.NET Core 6+ style) was not updated when the routePatterns loop above
it was switched to use the comment-stripped content, leaving commented-out
app.MapGet calls still being extracted as real routes.
Added a regression test asserting both line-comment and block-comment
forms are skipped for minimal-API routes.
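The shape of that regression check, as a hedged sketch: the regex below is a hypothetical stand-in for `minimalApiPattern`, and the inline stripping is a simplified substitute for the shared helper. The point is that the minimal-API loop reads the stripped content, not the raw file.

```typescript
// Blank out block comments and whole-line // comments, keeping newlines so
// match.index still maps to the original line.
function stripCStyleComments(content: string): string {
  const blank = (text: string): string => text.replace(/[^\n]/g, " ");
  return content
    .replace(/\/\*[\s\S]*?\*\//g, blank)
    .replace(/^[ \t]*\/\/.*$/gm, blank);
}

// Hypothetical stand-in for the minimal-API route pattern.
function extractMinimalApiRoutes(
  content: string
): { method: string; path: string; line: number }[] {
  const stripped = stripCStyleComments(content);
  const minimalApiPattern = /\.Map(Get|Post|Put|Delete|Patch)\(\s*"([^"]+)"/g;
  const routes: { method: string; path: string; line: number }[] = [];
  let match: RegExpExecArray | null;
  while ((match = minimalApiPattern.exec(stripped)) !== null) {
    routes.push({
      method: match[1].toUpperCase(),
      path: match[2],
      line: stripped.slice(0, match.index).split("\n").length,
    });
  }
  return routes;
}

const program = [
  'var app = builder.Build();',
  '// app.MapGet("/old", () => "gone");',
  '/* app.MapPost("/draft", Handler); */',
  'app.MapGet("/health", () => "ok");',
].join("\n");

console.log(extractMinimalApiRoutes(program));
// [ { method: 'GET', path: '/health', line: 4 } ]
```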
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 27, 2026
## Summary
Three extraction/resolution accuracy bugs caught by an audit pass, bundled in one PR for review tractability.
### 1. UTF-8 BOM caused spurious "modified" hash mismatches
`src/extraction/index.ts` (`hashContent`)

A file written with a BOM by one editor and re-saved without a BOM by another (VSCode strips by default; some Windows editors preserve it) hashed to two different values. Sync then reported the file as modified on every run despite no real change. `hashContent` now strips a leading `U+FEFF` before hashing.
### 2. Parse-retry comment strip was a no-op for Python, Ruby, etc.
`src/extraction/index.ts` (last-resort retry path), `src/utils.ts`

The "shrink the file by removing comment-only lines" fallback (which fires when a file repeatedly crashes the WASM parser) used `/^\s*\/\//.test(line)` for every language. Python (`#`), Ruby (`#`), shell, etc. had nothing stripped, so the retry parsed the same content that had already crashed and the file silently stayed unindexed. Added a per-language `LINE_COMMENT_MARKER` table and a `stripCommentLinesForRetry` helper at the retry call site.
### 3. Framework route extractors matched docstrings and comments
`src/resolution/frameworks/{python,express,laravel,rust,csharp}.ts`

`pattern.exec(content)` ran the route regex over raw file content, so a route example in a Python docstring or a commented-out route in JS/Rust/C# XML-doc comment was extracted as a real `route` node. AI assistants then saw phantom routes that don't exist in the running app.

Added `stripCommentsForRegex(content, language)` in `utils.ts` which neutralizes:
- block comments (`/* ... */`) for C-family languages
- whole-line line comments (`//`, `#`)
- triple-quoted strings (`"""…"""`, `'''…'''`) for Python (the docstring carrier)
- `=begin…=end` blocks for Ruby

Newlines are preserved so `match.index` derived from the stripped content maps back to the same line number in the original. Deliberately does not strip arbitrary string literals, since those carry the real route paths the regex needs to see.

Applied at the top of every affected framework extractor (django, flask, fastapi, express, laravel, actix/rocket/axum, aspnet).
## Files changed
| File | Change |
|---|---|
| `src/utils.ts` | Add `stripBom`, `stripCommentsForRegex`, `stripCommentLinesForRetry`, `LINE_COMMENT_MARKER` |
| `src/extraction/index.ts` | Strip BOM in `hashContent`; use language-aware retry strip |
| `src/resolution/frameworks/python.ts` | Strip comments before django/flask/fastapi route regex |
| `src/resolution/frameworks/express.ts` | Strip comments before express route regex |
| `src/resolution/frameworks/laravel.ts` | Strip comments before laravel route regex |
| `src/resolution/frameworks/rust.ts` | Strip comments before actix/rocket/axum route regex |
| `src/resolution/frameworks/csharp.ts` | Strip comments before aspnet route regex |
| `__tests__/extraction-resolution-accuracy.test.ts` | 21 regression tests |
## Test plan
- [x] `npm test`: 400/400 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation; unrelated to this change)
- [x] `npx tsc --noEmit` clean

🤖 Generated with Claude Code