fix: extraction/resolution accuracy (BOM, retry strip, framework regex) #101

Open: andreinknv wants to merge 2 commits into colbymchenry:main from andreinknv:fix/extraction-resolution-accuracy

@andreinknv
Contributor

Summary

Three extraction/resolution accuracy bugs caught by an audit pass, bundled in one PR for review tractability.

1. UTF-8 BOM caused spurious "modified" hash mismatches

src/extraction/index.ts (hashContent)

A file written with a BOM by one editor and re-saved without a BOM by another (VSCode strips by default; some Windows editors preserve it) hashed to two different values. Sync then reported the file as modified on every run despite no real change. hashContent now strips a leading U+FEFF before hashing.
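The BOM fix described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the bodies of `stripBom` and `hashContent` are guesses from the description, and SHA-256 is an assumption because the PR does not name the hash algorithm.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: the real stripBom in src/utils.ts may differ.
function stripBom(content: string): string {
  // U+FEFF at index 0 is what the UTF-8 BOM (bytes EF BB BF) decodes to.
  // Only a leading occurrence is removed; U+FEFF elsewhere is left alone.
  return content.charCodeAt(0) === 0xfeff ? content.slice(1) : content;
}

// Hypothetical sketch: SHA-256 is assumed, the PR does not state the algorithm.
function hashContent(content: string): string {
  return createHash("sha256").update(stripBom(content), "utf8").digest("hex");
}
```

With this shape, `hashContent("\uFEFFconst a = 1;")` and `hashContent("const a = 1;")` produce the same digest, so a BOM-only difference no longer registers as a modification.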

2. Parse-retry comment strip was a no-op for Python, Ruby, etc.

src/extraction/index.ts (last-resort retry path), src/utils.ts

The "shrink the file by removing comment-only lines" fallback (which fires when a file repeatedly crashes the WASM parser) used /^\s*\/\//.test(line) for every language. Python (#), Ruby (#), shell, etc. had nothing stripped — the retry parsed the same content that had already crashed and the file silently stayed unindexed. Added a per-language LINE_COMMENT_MARKER table and a stripCommentLinesForRetry helper at the retry call site.
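A rough sketch of the language-aware retry strip, assuming a simple language-to-marker map. The table entries and language keys below are illustrative only; the real `LINE_COMMENT_MARKER` table and `stripCommentLinesForRetry` helper may be shaped differently.

```typescript
// Hypothetical table: the real keys and coverage in src/utils.ts may differ.
const LINE_COMMENT_MARKER: Record<string, string> = {
  javascript: "//",
  typescript: "//",
  rust: "//",
  csharp: "//",
  python: "#",
  ruby: "#",
};

// Drops comment-only lines so the retry parses a smaller file.
function stripCommentLinesForRetry(content: string, language: string): string {
  const marker = LINE_COMMENT_MARKER[language];
  // Unknown language: return the content unchanged rather than guess a marker.
  if (marker === undefined) return content;
  return content
    .split("\n")
    .filter((line) => !line.trim().startsWith(marker))
    .join("\n");
}
```

Because this is a line-prefix check, it would also drop a `#`-prefixed line inside a multi-line Python string. That seems acceptable for a last-resort retry whose goal is only to get the file through the parser, but it is a limitation of this sketch.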

3. Framework route extractors matched docstrings and comments

src/resolution/frameworks/{python,express,laravel,rust,csharp}.ts

pattern.exec(content) ran the route regex over raw file content, so a route example in a Python docstring, a commented-out route in JS or Rust, or a route inside a C# XML-doc comment was extracted as a real route node. AI assistants then saw phantom routes that don't exist in the running app.

Added stripCommentsForRegex(content, language) in utils.ts which neutralizes:

  • Block comments (/* ... */) for C-family languages
  • Whole-line single-line comments per language (//, #)
  • Triple-quoted strings ("""…""", '''…''') for Python (the docstring carrier)
  • =begin…=end blocks for Ruby

Newlines are preserved so match.index derived from the stripped content maps back to the same line number in the original. Deliberately does not strip arbitrary string literals — those carry the real route paths the regex needs to see.

Applied at the top of every affected framework extractor (django, flask, fastapi, express, laravel, actix/rocket/axum, aspnet).
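The neutralizing step might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's implementation: the language keys and the `C_FAMILY` set are hypothetical, and replacing comment text with spaces (which keeps character offsets as well as line numbers) is one possible way to satisfy the "newlines are preserved" requirement.

```typescript
// Replace every character except "\n" with a space, so line numbers
// (and, in this sketch, character offsets) stay aligned with the original.
const blank = (s: string): string => s.replace(/[^\n]/g, " ");

// Hypothetical language keys; the real helper covers more languages.
const C_FAMILY = new Set(["javascript", "typescript", "rust", "csharp", "php"]);

function stripCommentsForRegex(content: string, language: string): string {
  let out = content;
  if (C_FAMILY.has(language)) {
    out = out.replace(/\/\*[\s\S]*?\*\//g, blank); // block comments
    out = out.replace(/^[ \t]*\/\/.*$/gm, blank); // whole-line // comments
  } else if (language === "python") {
    out = out.replace(/"""[\s\S]*?"""|'''[\s\S]*?'''/g, blank); // docstrings
    out = out.replace(/^[ \t]*#.*$/gm, blank); // whole-line # comments
  } else if (language === "ruby") {
    out = out.replace(/^=begin\b[\s\S]*?^=end\b.*$/gm, blank); // =begin...=end
    out = out.replace(/^[ \t]*#.*$/gm, blank);
  }
  // Ordinary string literals are left untouched: they hold the route paths.
  return out;
}
```

One caveat of a purely regex-based approach like this sketch: a string literal containing `/*` (for example a wildcard path such as `"/api/*"`) followed later by `*/` would be misread as a block comment. Whether the real helper guards against that is not stated in the PR.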

Files changed

| File | Change |
|---|---|
| src/utils.ts | Add stripBom, stripCommentsForRegex, stripCommentLinesForRetry, LINE_COMMENT_MARKER |
| src/extraction/index.ts | Strip BOM in hashContent; use language-aware retry strip |
| src/resolution/frameworks/python.ts | Strip comments before django/flask/fastapi route regex |
| src/resolution/frameworks/express.ts | Strip comments before express route regex |
| src/resolution/frameworks/laravel.ts | Strip comments before laravel route regex |
| src/resolution/frameworks/rust.ts | Strip comments before actix/rocket/axum route regex |
| src/resolution/frameworks/csharp.ts | Strip comments before aspnet route regex |
| __tests__/extraction-resolution-accuracy.test.ts | 21 regression tests |

Test plan

  • npm test: 400/400 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation — unrelated to this change)
  • npx tsc --noEmit clean
  • 21 new tests cover: BOM normalization, per-language line stripping (including unknown-language no-op), and false-positive prevention for every affected framework

🤖 Generated with Claude Code

andreinknv and others added 2 commits April 26, 2026 12:41
…ork regex)

Three accuracy bugs caught by an audit pass, bundled in one PR.

### 1. UTF-8 BOM caused spurious "modified" hash mismatches

`src/extraction/index.ts` (hashContent)

A file written with a BOM by one editor and re-saved without a BOM by
another (VSCode strips by default; some Windows editors preserve it)
hashed to two different values. Sync then reported the file as modified
on every run. hashContent now strips a leading U+FEFF before hashing.

### 2. Parse-retry comment strip was a no-op for Python, Ruby, etc.

`src/extraction/index.ts` (last-resort retry path), `src/utils.ts`

The "shrink the file by removing comment-only lines" fallback used
`/^\s*\/\//.test(line)` for every language. Files whose comment marker
is not `//` (Python `#`, Ruby `#`) had nothing stripped, so the retry
ran the same content that had already crashed and the file silently
stayed unindexed. Added a per-language LINE_COMMENT_MARKER table and a
stripCommentLinesForRetry helper used at the retry call site.

### 3. Framework route extractors matched docstrings/comments

`src/resolution/frameworks/{python,express,laravel,rust,csharp}.ts`

`pattern.exec(content)` ran the route regex over raw file content, so a
route example in a Python docstring or a commented-out route in JS was
extracted as a real route node. AI assistants then saw phantom routes
that do not exist in the running app.

Added stripCommentsForRegex (utils.ts) which neutralizes block comments,
whole-line line comments, and (for Python) triple-quoted strings,
preserving newlines so match.index maps back to the original line
numbers. Applied at the top of every framework extractor that runs a
regex over content. Deliberately does NOT strip arbitrary string
literals, since those carry the actual route paths the regex needs.

Languages covered: js/ts/tsx/jsx, java, csharp, c/cpp, go, rust, swift,
kotlin, dart, scala, php, python, ruby, pascal.

## Files changed

| File | Change |
|---|---|
| src/utils.ts | Add stripBom, stripCommentsForRegex, stripCommentLinesForRetry, LINE_COMMENT_MARKER table |
| src/extraction/index.ts | Strip BOM in hashContent; use language-aware retry strip |
| src/resolution/frameworks/python.ts | Strip comments before django/flask/fastapi route regex |
| src/resolution/frameworks/express.ts | Strip comments before express route regex |
| src/resolution/frameworks/laravel.ts | Strip comments before laravel route regex |
| src/resolution/frameworks/rust.ts | Strip comments before actix/rocket/axum route regex |
| src/resolution/frameworks/csharp.ts | Strip comments before aspnet route regex |
| __tests__/extraction-resolution-accuracy.test.ts | 21 regression tests |

## Test plan

- [x] npm test: 400/400 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation)
- [x] npx tsc --noEmit clean
- [x] 21 new tests covering: BOM normalization, per-language line stripping, and false-positive-prevention for every affected framework

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer caught: the minimalApiPattern loop in csharp.ts (the Map{Get,Post,...}
style of ASP.NET Core 6+) was not updated when the routePatterns loop above
it was switched to the comment-stripped content, so commented-out
app.MapGet calls were still being extracted as real routes.

Added a regression test asserting both line-comment and block-comment
forms are skipped for minimal-API routes.
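
A sketch of what the minimal-API fix amounts to: every regex loop in the extractor reads the comment-neutralized text, never the raw source. The helper name, the regex, and the return shape below are hypothetical stand-ins, not the code in csharp.ts.

```typescript
// Hypothetical stand-in for the comment neutralizer: blanks C-style
// comments with spaces but keeps every newline in place.
function neutralizeCSharpComments(source: string): string {
  const blank = (s: string): string => s.replace(/[^\n]/g, " ");
  return source
    .replace(/\/\*[\s\S]*?\*\//g, blank)
    .replace(/^[ \t]*\/\/.*$/gm, blank);
}

interface Route {
  method: string;
  path: string;
  line: number;
}

function extractMinimalApiRoutes(source: string): Route[] {
  // The important part: match against the neutralized text, not `source`.
  const safe = neutralizeCSharpComments(source);
  // Hypothetical pattern; the real minimalApiPattern may differ.
  const pattern = /\.Map(Get|Post|Put|Delete|Patch)\s*\(\s*"([^"]+)"/g;
  const routes: Route[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(safe)) !== null) {
    const [, verb = "", path = ""] = match;
    // Newlines were preserved, so this is also the line in the original file.
    const line = safe.slice(0, match.index).split("\n").length;
    routes.push({ method: verb.toUpperCase(), path, line });
  }
  return routes;
}
```

Given a file whose first two lines are a `//`-commented `app.MapGet` and a `/* ... */`-wrapped `app.MapPost`, followed by a live `app.MapGet("/live", ...)` on line 3, this sketch returns only the live route and reports it on line 3.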

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>