feat(search): per-file diversification so top-K isn't one class's methods #107
Open
andreinknv wants to merge 2 commits into
When a query matches many symbols in a single file, current ranking
returns the matching class plus 9 of its members from the same file.
The first hit is informative; the next 9 are implementation detail
that pushes peer files (subclasses, callers, sibling modules) past the
limit. This PR caps results per file so search surfaces representative
breadth across the codebase rather than burying the user in one
class's internals.
## Empirical lift on codegraph (limit=10, default cap=3)
| Query | Before (max from one file) | After |
|---|---:|---:|
| ExtractionOrchestrator | 10/10 | 9/10 (only one file matches; backfill kicks in) |
| database | 8/10 | 3/10 |
| config | 5/10 | 3/10 |
| resolve | 4/10 | 3/10 |
| extract / parse | 3 (no regression) | 3 |
Top-1 result is preserved in every case — diversification only
reorders second-and-onward.
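The "max from one file" column above can be reproduced with a small helper. This is a minimal sketch, not the bench script itself: the `filePath` field name and the `maxFromOneFile` helper are assumptions made for illustration.

```typescript
// Hypothetical helper: counts how many of the returned results share a file
// and reports the largest such count (the "max from one file" metric).
// The `filePath` field name is an assumption about the result shape.
function maxFromOneFile(results: readonly { filePath: string }[]): number {
  const counts = new Map<string, number>();
  let max = 0;
  for (const { filePath } of results) {
    const next = (counts.get(filePath) ?? 0) + 1;
    counts.set(filePath, next);
    if (next > max) max = next;
  }
  return max;
}

// Illustrative data only: three hits from one file, one from another.
const sampleTopK = [
  { filePath: "src/a.ts" },
  { filePath: "src/a.ts" },
  { filePath: "src/a.ts" },
  { filePath: "src/b.ts" },
];
console.log(`${maxFromOneFile(sampleTopK)}/${sampleTopK.length}`); // prints "3/4"
```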
## Components
- `SearchOptions.perFileCap?: number` — default 3; 0 disables.
- `diversifyByFile(results, limit, perFileCap)` in
src/search/query-utils.ts: pure function. First pass picks at most
perFileCap per file in score order. If limit isn't yet filled,
backfills from skipped (in original score order) so we never return
fewer results than the caller requested.
- searchNodes wires it after the existing rescoring pass, when there
are more candidates than the caller's limit. Relies on the existing
5x internal over-fetch in searchNodesFTS for headroom — no new
multiplier added (multiplier-on-multiplier composition was the
reviewer's blocking concern in an earlier draft).
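The two-pass behavior described above (cap per file, then backfill) can be sketched as follows. This is an illustration under stated assumptions rather than the code in src/search/query-utils.ts: it assumes results arrive sorted by descending score, that each result exposes a `filePath` field (a hypothetical name), and that backfilled results are appended after the capped picks.

```typescript
// Sketch of per-file diversification. Assumes `results` is already in
// descending score order and that `filePath` identifies the source file
// (the field name is an assumption for this example).
function diversifyByFile<T extends { filePath: string }>(
  results: readonly T[],
  limit: number,
  perFileCap: number,
): T[] {
  if (limit <= 0) return [];
  // A cap of 0 disables diversification: return the plain top-`limit` slice.
  if (perFileCap <= 0) return results.slice(0, limit);

  const picked: T[] = [];
  const skipped: T[] = [];
  const perFileCount = new Map<string, number>();

  // First pass: take at most `perFileCap` results per file, in score order.
  for (const result of results) {
    if (picked.length >= limit) break;
    const count = perFileCount.get(result.filePath) ?? 0;
    if (count < perFileCap) {
      picked.push(result);
      perFileCount.set(result.filePath, count + 1);
    } else {
      skipped.push(result);
    }
  }

  // Backfill: if the limit is not yet filled, reuse skipped results in their
  // original score order so the caller still gets up to `limit` results.
  for (const result of skipped) {
    if (picked.length >= limit) break;
    picked.push(result);
  }
  return picked;
}

// Illustrative candidates: four hits in a.ts, one each in b.ts and c.ts.
const candidates = [
  { filePath: "a.ts", name: "A" },
  { filePath: "a.ts", name: "A.m1" },
  { filePath: "a.ts", name: "A.m2" },
  { filePath: "a.ts", name: "A.m3" },
  { filePath: "b.ts", name: "B" },
  { filePath: "c.ts", name: "C" },
];
console.log(diversifyByFile(candidates, 5, 2).map((r) => r.name));
// prints [ 'A', 'A.m1', 'B', 'C', 'A.m2' ]
```

The top-ranked result always survives the first pass, which matches the claim that only the second result onward is reordered. When only one file has many matches, the backfill loop is what restores the result count.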
## Files changed
| File | Change |
|---|---|
| src/types.ts | Add perFileCap to SearchOptions |
| src/search/query-utils.ts | Add diversifyByFile pure helper |
| src/db/queries.ts | Wire diversifyByFile into searchNodes; comment on the over-fetch composition |
| __tests__/diversify.test.ts (NEW) | 13 regression tests |
## Test plan
- [x] npm test: 393/393 pass on macOS
- [x] npx tsc --noEmit clean
- [x] Bench script confirms the lift in the table above
- [x] Independent reviewer pass before pushing; addressed:
  - Multiplier-on-multiplier (4x outer * 5x inner = 20x for large
    limits): outer multiplier removed; inner over-fetch is sufficient.
  - Within-limit reorder: documented as intentional pure-function
    behavior; integration path correctly skips when results <= limit.
  - MCP exposure of perFileCap: deferred; default 3 is the desired
    new behavior; MCP can pick it up later if users want to tune.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
searchNodes returns bm25 + kindBonus + scorePathRelevance + nameMatchBonus, which has no upper bound. The CLI then displayed that as `(score * 100).toFixed(0)%`, producing "10449%", "6553%", "2251%" for ordinary searches like `query serve` against ollama. Beyond being misleading, the value isn't comparable across queries.

Render each hit's score as a fraction of the top hit's score so the top result is always "100%" and everything below scales relative to it. A top score of 0 (the degenerate case) shows as "0%" instead of NaN.
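The relative rendering described in this commit can be sketched as below. The `formatRelativeScore` name is hypothetical, and treating negative or non-finite top scores like the zero case is an assumption that goes beyond what the commit message states.

```typescript
// Hypothetical helper: renders a hit's score as a percentage of the top
// hit's score, so the top result always reads "100%".
function formatRelativeScore(score: number, topScore: number): string {
  // Degenerate case from the commit message: a top score of 0 would make
  // score / topScore evaluate to NaN. Also guarding negative and non-finite
  // values is an assumption added for this sketch.
  if (!Number.isFinite(topScore) || topScore <= 0) return "0%";
  return `${((score / topScore) * 100).toFixed(0)}%`;
}

// Raw scores implied by the "10449%", "6553%", "2251%" output quoted above.
const rawScores = [104.49, 65.53, 22.51];
const topScore = rawScores[0];
console.log(rawScores.map((s) => formatRelativeScore(s, topScore)));
// prints [ '100%', '63%', '22%' ]
```

Because every value is divided by the same top score, the ordering of hits is unchanged; only the displayed scale differs.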
## Summary
When a query matches many symbols in a single file, current ranking returns the matching class plus 9 of its members from the same file. The first hit is informative; the next 9 are implementation detail that pushes peer files (subclasses, callers, sibling modules) past the limit. This PR caps results per file so search surfaces representative breadth across the codebase rather than burying the user in one class's internals.

## Empirical lift on codegraph (`limit=10`, default `perFileCap=3`)
| Query | Before (max from one file) | After |
|---|---:|---:|
| `ExtractionOrchestrator` | 10/10 | 9/10 (only one file matches; backfill kicks in) |
| `database` | 8/10 | 3/10 |
| `config` | 5/10 | 3/10 |
| `resolve` | 4/10 | 3/10 |
| `extract` / `parse` | 3 (no regression) | 3 |

Top-1 result is preserved in every case; diversification only reorders the second result onward.
## Components
- `SearchOptions.perFileCap?: number` in `src/types.ts`: default 3; set to 0 to disable.
- `diversifyByFile(results, limit, perFileCap)` in `src/search/query-utils.ts`: pure function. First pass picks at most `perFileCap` per file in score order. If `limit` isn't yet filled, backfills from skipped (in original score order) so we never return fewer results than the caller requested.
- Wiring in `src/db/queries.ts`: applied after the existing rescoring pass, when there are more candidates than the caller's limit. Relies on the existing 5× internal over-fetch in `searchNodesFTS` for headroom; no new multiplier added (multiplier-on-multiplier composition was the reviewer's blocking concern in an earlier draft).
- 13 regression tests covering pure-function behavior (cap, backfill, top-preservation, `perFileCap=0`, limit edges) plus integration tests against an end-to-end DB.
## Files changed
| File | Change |
|---|---|
| `src/types.ts` | Add `perFileCap` to `SearchOptions` |
| `src/search/query-utils.ts` | Add `diversifyByFile` pure helper |
| `src/db/queries.ts` | Wire `diversifyByFile` into `searchNodes`; comment on the over-fetch composition |
| `__tests__/diversify.test.ts` (NEW) | 13 regression tests |

## Why this PR (not symbol clustering)
The previous proposal was symbol clustering: grouping `User`/`UserService`/`UserController` into a "User feature." After prototyping against codegraph (which is a tool-style codebase, not entity-style), the clusters that emerged were mostly verb collisions (`get*`, `extract*`, `resolve*`), which are naming-convention noise, not features. Result diversification turned out to be the actual cure for the pain point clustering was meant to address: "search returned 10 hits, all from one class" → "search returned representatives across files."

## Test plan
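- [x] `npm test`: 393/393 pass on macOS
- [x] `npx tsc --noEmit` clean
- [x] Bench script confirms the lift in the table above
- [x] Independent reviewer pass before pushing; addressed:
  - Multiplier-on-multiplier (4× outer × 5× inner = 20× for large limits): outer multiplier removed; inner over-fetch is sufficient.
  - Within-limit reorder: documented as intentional pure-function behavior; integration path correctly skips when `results.length <= limit`.
  - MCP exposure of `perFileCap`: deferred; default 3 is the desired new behavior; MCP callers pick it up implicitly.

## Backwards compatibility note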
A caller relying on getting all 10 hits from one file via `searchNodes` will now see at most 3 (with backfill if no peers exist). The new behavior is opt-out via `perFileCap: 0`.

🤖 Generated with Claude Code