feat(reachability): opt-in library-mode seeds the public API surface #117
gadievron wants to merge 1 commit into […]
Merge-order note (not a defect; flagging for landing order)

This stacks on #75 ("stop silent zero-seed blackout"): consider […]

The premise "default mode blacks out a no-entry-point library" is already neutralized once #75 lands: #75's zero-seed fallback returns all units unfiltered instead of blacking out. So the two mode-OFF tests here ([…]

The feature itself is unaffected and is the better fix: library-mode ON refines #75's blunt keep-all to the precise public-API-reachable subset (the […]
A pure library exposes no main/route/CLI entry point, so the structural detector finds nothing and apply_reachability_filter drops EVERY unit: the library, and any vulnerable sink it contains, is never analysed (verified: a library whose public function calls a private eval() sink scans to 0 units). The public API IS the entry surface for a library.

Adds an opt-in `library_mode` to apply_reachability_filter: when set, seed every public/exported function (`is_exported` when the parser provides it, else name-not-underscore) and let the existing forward BFS pull in their callees. The seed merge is union-only, so it can never demote a structurally-detected app entry point; turning the flag on for an app can only ADD reachable units. Default is False, so every existing caller is byte-identical.

Threaded through parse_repository -> _parse_python (the Python parse path that applies this filter; other languages compute reachability in their own pipelines and are a follow-on). The CLI --library-mode flag is a thin follow-on passthrough.

Tests (tests/test_library_mode_reachability.py): blackout-when-off, public-API-seeded-when-on (private callee reached via the edge), unreferenced-private-stays-out (precision), app-baseline-unchanged, app-mode-on-is-additive-only (adversarial: union can't subtract), and a parse_repository wiring guard. 6 passed; e2e confirmed False -> [] / True -> [public_api, _sink].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
29ab78c to 27160de
[…] route/CLI entry point isn't blacked out) lived only in the Python path (core/parser_adapter). The other six parsers run as subprocesses with their own reachability copy, so a C/JS/etc. library still collapsed: tree-sitter's C core pruned 661 -> 24 (all wasm), the public ts_parser_* API never seeded.

Centralize _library_seed_ids into utilities/agentic_enhancer.library_seed_ids (now handles both is_exported snake_case on disk and isExported camelCase from the pipelines' in-memory normalize). Thread an opt-in library_mode through all 4 parallel surfaces for each of the 6 subprocess parsers (parse dispatch -> _parse_<lang> subprocess cmd -> test_pipeline --library-mode argparse -> union library_seed_ids into entry_points before the BFS), plus scan_repository and the 'openant parse' / 'openant scan' CLI flags. Union-only: never drops a structurally detected entry point, so app scans are unaffected.

Verified end-to-end on tree-sitter C: without the flag the blackout warning fires (24/661); with --library-mode, 352 public-API seeds -> 550 reachable, the parser core (parser.c/lexer.c/stack.c/subtree.c/query.c/node.c) now analysable.

Tests: 8 new (library_seed_ids both casings + name heuristic); #117 Python path unchanged (6 green).
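The dual-casing seed selection described in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the code in `utilities/agentic_enhancer`: the shape of `functions` (a mapping from unit ID to a metadata dict with a `name` and an optional export flag) is assumed, and `library_seed_ids_sketch` is a hypothetical name chosen to avoid implying it matches the real helper.

```python
from typing import Any, Dict, Set


def library_seed_ids_sketch(functions: Dict[str, Dict[str, Any]]) -> Set[str]:
    """Return the IDs of functions treated as public API (hypothetical sketch)."""
    seeds: Set[str] = set()
    for func_id, meta in functions.items():
        if "is_exported" in meta:
            # snake_case flag, as written to disk by the parsers
            exported = bool(meta["is_exported"])
        elif "isExported" in meta:
            # camelCase flag, as produced by in-memory normalization
            exported = bool(meta["isExported"])
        else:
            # No export flag available: fall back to the name heuristic
            name = meta.get("name", "")
            exported = bool(name) and not name.startswith("_")
        if exported:
            seeds.add(func_id)
    return seeds


# Assumed example data: IDs and metadata layout are illustrative only.
functions = {
    "a.c:ts_parser_new": {"name": "ts_parser_new", "is_exported": True},
    "a.c:helper": {"name": "helper", "isExported": False},
    "m.py:public_api": {"name": "public_api"},
    "m.py:_sink": {"name": "_sink"},
}
seeds = library_seed_ids_sketch(functions)
print(sorted(seeds))
```

An explicit flag takes precedence over the name, so a non-exported function such as a C `static` helper is not seeded even though its name has no leading underscore.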
What
An opt-in `library_mode` for the reachability filter that seeds the public API surface, so a pure library is actually analysed instead of scanning to zero.

Why
Reachability seeds from structural entry points (main / route handlers / CLI / input patterns). A pure library has none of those (its public API is the entry surface), so `apply_reachability_filter` (`core/parser_adapter.py`) detects 0 entry points, the forward BFS reaches nothing, and every unit is dropped from the dataset. The library, and any vulnerable sink it contains, is never analysed.

Verified end-to-end on the real parser: a library whose public function calls a private `eval()` sink scans to `[]` (0 units). Nothing in it is examined.

How
- `apply_reachability_filter(..., library_mode=False)` + `_library_seed_ids(functions)`: when enabled, seed every public/exported function (`is_exported` when the parser provides it, which excludes C `static` etc.; otherwise name-not-`_`) and let the existing forward BFS pull in their callees.
- The seed merge is union-only (`entry_points | _library_seed_ids(...)`), so it can never demote a structurally-detected app entry point: turning the flag on for an app can only add reachable units, never remove one.
- Default is `False`, so every existing caller is byte-identical.
- Threaded through `parse_repository` → `_parse_python`.

Reachability safety
Purely additive. With the flag off, behavior is unchanged (the seeding block is skipped). With it on, the monotonic BFS over a union-only seed set guarantees the reachable set can only grow — an adversarial review confirmed it cannot degrade an app scan.
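The union-only seeding and the monotonic forward BFS described above can be sketched as follows. This is a simplified model, not the actual `apply_reachability_filter`: the real signature and data structures are not shown in this PR, so `filter_units_sketch`, `reachable_from`, the name-keyed call graph, and the name-only public heuristic are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable_from(seeds: Iterable[str], call_graph: Dict[str, List[str]]) -> Set[str]:
    """Forward BFS over caller -> callee edges, starting from the seeds."""
    seen: Set[str] = set(seeds)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for callee in call_graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def filter_units_sketch(
    names: List[str],
    call_graph: Dict[str, List[str]],
    entry_points: Set[str],
    library_mode: bool = False,
) -> List[str]:
    """Keep only units reachable from the seed set (hypothetical sketch)."""
    seeds = set(entry_points)
    if library_mode:
        # Union-only merge: public names are added, nothing is removed.
        seeds |= {n for n in names if not n.startswith("_")}
    reachable = reachable_from(seeds, call_graph)
    return [n for n in names if n in reachable]


# A pure library: no structural entry point, one public function calling a private sink.
names = ["public_api", "_sink", "_unused"]
call_graph = {"public_api": ["_sink"]}
off = filter_units_sketch(names, call_graph, entry_points=set())
on = filter_units_sketch(names, call_graph, entry_points=set(), library_mode=True)
print(off, on)

# An app with a detected entry point: turning the flag on only adds units.
app_names = ["main"] + names
app_off = filter_units_sketch(app_names, call_graph, entry_points={"main"})
app_on = filter_units_sketch(app_names, call_graph, entry_points={"main"}, library_mode=True)
```

Because the BFS only ever adds nodes to `seen`, a larger seed set can never produce a smaller reachable set, which is the property the additive-only test relies on. The unreferenced private function `_unused` stays out in both modes.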
Tests
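`tests/test_library_mode_reachability.py`, 6 passed: blackout-when-off, public-API-seeded-when-on, unreferenced-private-stays-out, app-baseline-unchanged, app-mode-on-is-additive-only, and a `parse_repository` wiring guard (which caught a real threading bug that a filter-only unit test missed).

E2E: `library_mode=False` → `[]`; `library_mode=True` → `[public_api, _sink]`.

Scope / follow-on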
Wired into the Python parse path (`_parse_python` is the only `_parse_<lang>` that applies this filter; other languages compute reachability in their own pipelines). A caller passing `library_mode=True` for a non-Python repo currently no-ops: a bounded limitation, not a degradation. The CLI `--library-mode` flag is a thin passthrough (`scanner.py` / `cli.py`) left as a follow-on. `is_exported` already exists for C/Go/JS.

Compatibility
None — new optional parameter, default off.
Coordination with open PRs
Touches `core/parser_adapter.py`, which #10 / #66 / #75 also touch, but in different regions (this adds an opt-in `library_mode` parameter + a seed helper; no overlap with their changes). No open PR adds library-mode or exported-symbol seeding; this is standalone.