feat(search): subword tokens + Porter stemmer + stopword filter for FTS#104
Open
andreinknv wants to merge 1 commit into
Open
feat(search): subword tokens + Porter stemmer + stopword filter for FTS#104andreinknv wants to merge 1 commit into
andreinknv wants to merge 1 commit into
Conversation
The codebase no longer ships embeddings (commit 453c39d), so all search quality has to come from FTS. The maintainer's evidence in PR colbymchenry#74 documented several queries where FTS-only badly trailed semantic search because the SQLite default tokenizer treats `getParser` as a single indivisible token. Three changes that compound to fix that: 1. **Subword tokens.** New `name_subwords` column on `nodes` populated with the camel/snake split of the identifier (kept alongside the original) and indexed by FTS5 at weight 10x. A query for `parser` now finds `getParser` at the FTS layer, not just via post-hoc rescoring on the limited candidate set BM25 surfaces. 2. **Porter stemmer.** `tokenize="porter unicode61"` on the FTS table collapses morphological variants — `parser`/`parsing`/`parses` all stem to `pars` so a natural-language query matches identifier subwords and docstring prose alike. 3. **Stopword stripping.** `searchNodesFTS` now filters stopwords from the query before constructing the OR-join. Without this, words like `how` / `does` / `the` become OR'd FTS hits against any prose-bearing docstring and crowd out the actually-relevant identifier tokens. Reuses the existing `STOP_WORDS` set in src/search/query-utils.ts via a new shared `filterStopwords` helper. ## Empirical results (codegraph's own src/, 1242 nodes, 71 files) | Query | baseline rank | this PR rank | |---|---:|---:| | `ExtractionOrchestrator` | 1 | 1 | | `how does file parsing work` | NOT FOUND in 20 | 2 | | `database connection management` | 18 | 1 | | `resolves references between modules` | 19 | 2 | Mean rank: ~14 → 1.5. Concept-mode docstring re-weighting was tested as a fourth lever and rejected — it regressed `how does file parsing work` because amplifying docstring weight floods the result list with prose-keyword spam more than it lifts truly relevant prose. Not included. ## Migration v4 Existing v3 databases get migrated by: - Adding the `name_subwords` column to `nodes` (idempotent guard so a re-run after partial DDL failure doesn't fail with "duplicate column") - Dropping the old FTS table + triggers (tokenize cannot be ALTERed) - Recreating FTS without triggers - Backfilling name_subwords for every existing node - Rebuilding the FTS index in one shot via `INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')` - Recreating the triggers afterward (so they don't fire mid-backfill, which corrupted FTS5 in earlier prototype runs) ## Files changed | File | Change | |---|---| | `src/utils.ts` | Add `splitIdentifierTokens`, `buildNameSubwords` | | `src/search/query-utils.ts` | Add shared `filterStopwords` helper using existing STOP_WORDS | | `src/db/schema.sql` | Add `name_subwords` column, add it to nodes_fts, add `tokenize="porter unicode61"`, update triggers | | `src/db/migrations.ts` | Bump version to 4; add migration v4 with idempotent ALTER guard | | `src/db/queries.ts` | Populate name_subwords on insert/update; new BM25 weights; stopword filter in searchNodesFTS | | `__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts` | Update expected schema version | | `__tests__/search-quality.test.ts` | 21 regression tests including helpers, end-to-end search, full v3-to-v4 migration, and migration idempotency | ## Test plan - [x] `npm test`: 404/404 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation) - [x] `npx tsc --noEmit` clean - [x] Bench script confirms targets at colbymchenry#18, colbymchenry#19, NOT-FOUND on baseline jump to #1, #2, #2 with this PR - [x] Independent reviewer pass before pushing — addressed three findings: - merged duplicate stopword sets (now uses STOP_WORDS from query-utils.ts) - dedup tokens in buildNameSubwords (`parse` no longer stores `parse parse`) - made migration idempotent on partial-DDL re-run Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 26, 2026
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
Embeddings were removed in 453c39d (issue #87). Since then, all search quality has to come from FTS. The original PR #74 documented several queries where FTS-only badly trailed semantic search — the dominant cause is that SQLite's default tokenizer treats
getParseras one indivisible token, so a query forparsernever matches it at the FTS layer no matter how clever the post-hoc rescorer is.This PR closes most of that gap with three compounding, model-free changes.
What changed
Subword tokens. New
name_subwordscolumn onnodespopulated with the camel/snake split of the identifier (kept alongside the original) and indexed by FTS5 at weight 10×. A query forparsernow findsgetParserat the FTS layer, not just via post-hoc rescoring on the limited candidate set BM25 surfaces.Porter stemmer.
tokenize="porter unicode61"on the FTS table collapses morphological variants —parser/parsing/parsesall stem toparsso a natural-language query matches identifier subwords and docstring prose alike.Stopword stripping.
searchNodesFTSfilters stopwords from the query before the OR-join. Without this, words likehow/does/thebecome OR'd FTS hits against any prose-bearing docstring and crowd out the actually-relevant identifier tokens. Reuses the existingSTOP_WORDSset insrc/search/query-utils.tsvia a new sharedfilterStopwordshelper (no second source of truth).Empirical results
Bench: codegraph's own
src/, 1242 nodes, 71 files. Same query set as PR #74's evidence. Lower rank = better.ExtractionOrchestratorhow does file parsing work(target:getParser)database connection management(target:DatabaseConnection)resolves references between modules(target:resolveOne)Mean rank: ~14 → 1.5. Same workload, no model dependencies, no install-time changes, no Node version constraints.
What I tested and rejected
Concept-mode docstring re-weighting (swap BM25 weights to favor docstring 20× when query contains stopwords): regressed
how does file parsing workfrom #4 to #11 because amplifying docstring weight floods results with prose-keyword spam more than it lifts truly relevant prose. Not included.Migration v4
Existing v3 databases get migrated by:
name_subwordscolumn tonodes(idempotentPRAGMA table_infoguard so a re-run after partial DDL failure doesn't fail with "duplicate column")ALTERed)name_subwordsfor every existing nodeINSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')CURRENT_SCHEMA_VERSION3 → 4. Two existing version-pinned tests updated.Files changed
src/utils.tssplitIdentifierTokens,buildNameSubwords(with dedup)src/search/query-utils.tsfilterStopwordshelper using existingSTOP_WORDSsrc/db/schema.sqlname_subwordscolumn, add it tonodes_fts, addtokenize="porter unicode61", update three triggerssrc/db/migrations.tssrc/db/queries.tsname_subwordson insert/update; new BM25 weights(0,20,5,1,2,10); stopword filter insearchNodesFTS__tests__/foundation.test.ts,__tests__/pr19-improvements.test.ts__tests__/search-quality.test.ts(NEW)Test plan
npm test: 404/404 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation — unrelated)npx tsc --noEmitcleanSTOP_WORDSfromquery-utils.ts)buildNameSubwords(parseno longer stores"parse parse")ALTER TABLEno longer fails)Notes for review
schema.sql, existing installs run migration v4 on first open.🤖 Generated with Claude Code