feat(cochange): file-level co-change graph mined from git history by andreinknv · Pull Request #105 · colbymchenry/codegraph

andreinknv · 2026-04-26T19:59:36Z

🔄 Rebased onto post-refactor patterns (Co-change graph)
This PR has been rebased to use the per-thing file layout established by refactor PRs #116/#117/#118/#119. Behavior shipped is identical to the original; only the internal code shape changed.

What this PR now ships:

src/index-hooks/cochange.ts — registered IndexHook (afterIndexAll + afterSync)

src/db/migrations/010-co-changes.ts — co_changes table

src/cochange/index.ts — pure miner module

src/index-hooks/registry.ts — 2 lines

src/db/migrations/index.ts — 2 lines

src/db/queries.ts, src/db/schema.sql, src/types.ts, src/default-config.ts, src/config.ts, src/index.ts — query methods + config flag

__tests__/cochange.test.ts — full miner + integration suite

What you'll see in the diff: until #116-#119 land, the diff against main includes the refactor commits as part of the branch history. After they merge, GitHub auto-shrinks this PR's diff to just the files above. Full context: #120.

Summary

Adds a file-level co-change graph mined from git log. Surfaces coupling between files that historically change together — sibling extractors that share patterns, tests that assert schema state, config files coupled to the code that reads them. This information is invisible to static analysis but recoverable from git in seconds with zero new dependencies.

Empirical results (codegraph's own history)

File pair	Jaccard	Why coupled
`src/extraction/languages/csharp.ts` ↔ `java.ts`	0.75	Sibling language extractors share patterns; no import relation
`__tests__/foundation.test.ts` ↔ `src/db/schema.sql`	0.57	Test asserts schema version; coupled by intent, not import
`src/db/migrations.ts` ↔ `src/db/schema.sql`	0.57	Schema changes always need a matching migration
`src/installer/config-writer.ts` ↔ `src/installer/index.ts`	0.50	Two files in same module, always change together

For src/extraction/index.ts, the top co-changer is src/index.ts (J=0.41, the wrapper class). For src/db/schema.sql, the top co-changers are the test that asserts schema version, the migrations file, and the db index. The signal is meaningful and matches what a maintainer would intuit.

What's added

Schema (src/db/schema.sql): new co_changes table (file_a, file_b, count) with canonical ordering (file_a < file_b) and a CHECK constraint. New commit_count column on files for Jaccard normalization. Migration v4 with idempotent PRAGMA table_info guard on the ALTER.
Miner (src/cochange/index.ts): mineCoChanges runs git log --no-merges --name-only --format=tformat:CGCMT-%H -z. The CGCMT-<40-hex-sha> header sentinel cannot collide with any realistic filename. MAX_FILES_PER_COMMIT=50 skips merge / mass-refactor commits that would otherwise produce O(N²) spurious pairs. Returns needsFullRescan: true when the previously-mined HEAD is unreachable (force-push, gc).
DB methods: applyCoChangeDeltas (UPSERT pairs + UPDATE per-file counts in one transaction), clearCoChanges, getCoChangedFiles. Jaccard is computed inline in SQL via count / (anchor_commits + partner_commits - count), with minJaccard / minCount filters applied before the LIMIT so high-Jaccard pairs aren't silently dropped.
Wiring (src/index.ts): indexAll triggers full mining; sync triggers incremental mining since last_mined_cochange_head (stored in project_metadata, same pattern as PR fix(sync): detect HEAD-moving git operations to prevent stale index #100's sync HEAD tracking).
Public API: cg.getCoChangedFiles(filePath, { limit, minCount, minJaccard }) returns ranked partners with count + Jaccard.
Config: enableCoChange?: boolean on CodeGraphConfig, default true. false skips mining entirely. Field is properly merged in mergeConfig so the opt-out survives a config save/load round-trip.
27 regression tests covering parser correctness (including unicode paths and the CGCMT- sentinel), filtering, incremental update, unreachable-anchor recovery, Jaccard math symmetry, the migration including idempotency on partial re-run, and config round-trip.

Files changed

File	Change
`src/cochange/index.ts` (NEW)	Miner: parses git log into pair counts
`src/db/schema.sql`	Add `commit_count` column + `co_changes` table + indexes
`src/db/migrations.ts`	Bump version to 4; add migration v4 (idempotent)
`src/db/queries.ts`	`applyCoChangeDeltas` / `clearCoChanges` / `getCoChangedFiles`
`src/index.ts`	`mineCoChanges` wired into `indexAll` + `sync`
`src/types.ts`	`enableCoChange` config field, default true
`src/config.ts`	`mergeConfig` now propagates `enableCoChange`
`__tests__/cochange.test.ts` (NEW)	27 regression tests
`__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts`	Update expected schema version

Test plan

npm test: 405/407 pass (two pre-existing fs.watch flakes; watcher passes 12/12 in isolation)
npx tsc --noEmit clean
Independent reviewer pass before pushing — addressed three findings:
- mergeConfig dropped enableCoChange (BLOCK → fixed; round-trip test added)
- minJaccard was applied JS-side after a small SQL LIMIT, silently dropping high-Jaccard pairs ranked beyond the over-fetch (REQUEST_CHANGES → moved into SQL WHERE; regression test added)
- Parser sentinel -- could collide with a file literally named -- (REQUEST_CHANGES → switched to CGCMT-<sha>; regression test added)

Known limitations

Schema bump to v4 collides with PR fix(db): enforce UNIQUE on edges so INSERT OR IGNORE actually dedupes #102 and PR feat(search): subword tokens + Porter stemmer + stopword filter for FTS #104. If any of those land first, this PR rebases to v5.
HEAD-reset-behind-anchor: rewinding HEAD past last_mined_cochange_head without a force-push leaves stale counts. A merge-base --is-ancestor check could detect it but adds complexity not worth it for this PR.
Renames: git log without --follow treats rename as delete+add; pre-rename commits stay attached to the old path. Acceptable since the miner gracefully handles deleted/renamed files (they just have no future pairs).
Mining cost: ~1-3s on a typical repo with a few hundred commits. enableCoChange: false opts out entirely. The miner is execFileSync, blocking the event loop while git runs (max 60s timeout). Worker-thread offload is out of scope.

🤖 Generated with Claude Code

`git ls-files` (used for both the initial scan and incremental sync) does not enter submodules — they appear as gitlink entries with their contents invisible. As a result, source files inside submodules were silently skipped during indexing. Both file-discovery paths now recurse into active submodules: - getGitVisibleFiles (full index) enumerates active submodules via `git submodule foreach --recursive --quiet 'echo "$displaypath"'` and runs `git ls-files -co --exclude-standard` inside each, prefixing the submodule path so files are reported relative to the parent root. - getGitChangedFiles (sync) was refactored to share its status-parsing logic between the parent repo and each submodule. Submodule directory entries that the parent's status emits when a submodule pointer moves (e.g., " m vendor/sub") are filtered out so we don't try to read a directory as a file. Submodule indexing is on by default and can be disabled via `indexSubmodules: false` in CodeGraphConfig — useful for repos with large vendor submodules that should remain unindexed without having to add a path-based exclude. Uninitialized / missing submodules are silently skipped (best-effort enhancement on top of the existing scan). Status output paths are now C-style-unquoted before being used or compared against the submodule directory set, so submodule paths containing spaces or non-ASCII bytes are handled correctly. The parent status command failing still falls back to the full filesystem scan via a null return, preserving the prior contract; only submodule-internal status failures are absorbed silently. Closes colbymchenry#86. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four user-visible bugs caught by an independent audit pass: 1. Svelte symbols reported on the wrong line src/extraction/svelte-extractor.ts:144 The script-block regex captures content starting with the leading newline that follows `>`, so the inner extractor sees that newline as line 1 of its 1-indexed input and the first real code on line 2. The previous `contentStartLine = scriptTagLine + openingTagLines + 1` was added to that 1-indexed line number, shifting every Svelte symbol's startLine / endLine off by 1. Drops the `+1`. Five regression tests added covering single-line, multi-line opening tag, template-offset, single-line no-newline, and dual module/instance script blocks. 2. Watcher silently dropped pending changes on sync failure src/sync/watcher.ts:177 `hasChanges = false` ran before the sync attempt, so a thrown sync (DB locked, transient FS error) left the pending batch forgotten until a NEW file event arrived. Re-set `hasChanges = true` in the catch path so a transient failure schedules a retry on its own. Regression test added (mocks fail-then-succeed, asserts the second call happens without a new file event). 3. Graph traversal default maxDepth was Infinity src/graph/traversal.ts:14, src/types.ts:301 `limit: 1000` capped returned nodes, but during traversal the visited set and BFS/DFS frontier can grow far beyond `limit` on highly connected graphs before the cap kicks in. Default is now 10. Callers who really need exhaustive traversal can still pass `maxDepth: Infinity` explicitly — the JSDoc documents this. This is a public-API behavior change; existing tests pass. Also caps `findPath`'s BFS queue at 100,000 entries (FIND_PATH_MAX_QUEUE) and returns null if exceeded — each entry holds a cloned path array, so on dense graphs the queue could otherwise consume gigabytes before either finding a path or exhausting the search. 4. `findRelevantContext` did not bound caller-supplied limits src/context/index.ts:284 `searchLimit` is multiplied by 5 in `findNodesByExactName` and feeds several other unbounded operations; a caller passing `searchLimit: 1_000_000` would pull millions of rows. Now clamped: searchLimit ∈ [1, 100], maxNodes ∈ [1, 1000], traversalDepth ∈ [0, 10]. Regression test asserts a 1e9 input is bounded. All 387 tests pass serialized; tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bundles seven smaller hardening fixes from the audit pass: 1. Symlink-resistant path validation in sync paths src/extraction/index.ts:1148, 1206, 1293, 1347 src/utils.ts (new validatePathWithinRootReal) getChangedFiles() and sync() now resolve symlinks via realpath before reading, so a regular file swapped for a symlink to outside the project (between scan and read) is rejected rather than followed. The lexical-only validatePathWithinRoot stays for paths that don't yet exist on disk. 2. storeExtractionResult is now atomic src/extraction/index.ts:1018-1098, src/db/queries.ts (new QueryBuilder.transaction) delete + insert + upsert run in a single SQLite transaction, so a process kill mid-write can no longer leave the file's old data wiped while the new data is missing. Nested transactions inside insertNodes/insertEdges/insertUnresolvedRefsBatch/deleteFile work correctly because better-sqlite3 collapses them via SAVEPOINT. 3. recycleWorker is now actually async src/extraction/index.ts:573 Previously declared sync but called via `await`. Now properly awaits worker.terminate() with a 1s timeout (so a wedged WASM worker can't block the caller indefinitely) and force-fires terminate after the timeout to avoid zombie threads. 4. Parse-worker handles unknown message types src/extraction/parse-worker.ts:60-78 When a malformed message arrives with an `id`, the worker sends back a structured error instead of silently dropping it (which previously caused the main thread to block until the per-file timeout). Messages without an id are still ignored — no pending promise to unblock. 5. ReDoS-safe glob matching src/utils.ts (new globToSafeRegex), src/bin/codegraph.ts:1163, src/graph/queries.ts:196 Coalesces runs of `*` so hostile inputs like `*****` map to a single `.*` rather than five chained quantifiers (which would catastrophically backtrack on long inputs). Caps input at 1024 chars. Implementation walks character-by-character with no sentinels — avoids any chance of marker-string collisions with user input. 6. fileUriToPath fallback hardening src/mcp/index.ts:36-40 Catch-block fallback for malformed file:// URIs now runs through path.resolve() so a `file:///../etc/passwd` style input is normalized rather than passed raw to downstream filesystem code. 7. Reference dedup at extraction time + dead-code removal src/extraction/tree-sitter.ts (new dedupeReferences), src/db/queries.ts (deleteUnresolvedByNode removed) A function that calls `foo()` 100 times now produces 1 unresolved reference instead of 100; the resolver collapses duplicates anyway but extraction was wasteful. Removed the never-called deleteUnresolvedByNode helper (FK cascade handles cleanup). 3 new tests for globToSafeRegex; full suite 383/383 passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sync used to rely solely on `git status --porcelain` for change detection, which only reports working-tree dirtiness vs HEAD. After a `git merge` (or pull, checkout, rebase, reset, post-commit), the working tree is clean and `git status` reports nothing, so sync silently became a no-op while the DB still held pre-operation content hashes. MCP queries then served stale data with no warning. Sync now records the HEAD SHA it was last synced against (in the existing project_metadata table) and, when current HEAD differs, unions `git diff --name-status <last>..HEAD` into the changed-file set. If the recorded HEAD is unreachable (force-push, gc), sync falls back to the filesystem scan path, which is correct regardless of git history state. The same fix is applied to getChangedFiles() so MCP staleness signals stay accurate. Adds 5 regression tests covering merge, branch checkout, committed deletion, unreachable last-synced HEAD, and the no-op clean-tree case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ork regex) Three accuracy bugs caught by an audit pass, bundled in one PR. ### 1. UTF-8 BOM caused spurious "modified" hash mismatches `src/extraction/index.ts` (hashContent) A file written with a BOM by one editor and re-saved without a BOM by another (VSCode strips by default; some Windows editors preserve it) hashed to two different values. Sync then reported the file as modified on every run. hashContent now strips a leading U+FEFF before hashing. ### 2. Parse-retry comment strip was a no-op for Python, Ruby, etc. `src/extraction/index.ts` (last-resort retry path), `src/utils.ts` The "shrink the file by removing comment-only lines" fallback used `/^\s*\/\//.test(line)` for every language. Files whose comment marker is not `//` (Python `#`, Ruby `#`) had nothing stripped, so the retry ran the same content that had already crashed and the file silently stayed unindexed. Added a per-language LINE_COMMENT_MARKER table and a stripCommentLinesForRetry helper used at the retry call site. ### 3. Framework route extractors matched docstrings/comments `src/resolution/frameworks/{python,express,laravel,rust,csharp}.ts` `pattern.exec(content)` ran the route regex over raw file content, so a route example in a Python docstring or a commented-out route in JS was extracted as a real route node. AI assistants then saw phantom routes that do not exist in the running app. Added stripCommentsForRegex (utils.ts) which neutralizes block comments, whole-line line comments, and (for Python) triple-quoted strings, preserving newlines so match.index maps back to the original line numbers. Applied at the top of every framework extractor that runs a regex over content. Deliberately does NOT strip arbitrary string literals, since those carry the actual route paths the regex needs. Languages covered: js/ts/tsx/jsx, java, csharp, c/cpp, go, rust, swift, kotlin, dart, scala, php, python, ruby, pascal. ## Files changed | File | Change | |---|---| | src/utils.ts | Add stripBom, stripCommentsForRegex, stripCommentLinesForRetry, LINE_COMMENT_MARKER table | | src/extraction/index.ts | Strip BOM in hashContent; use language-aware retry strip | | src/resolution/frameworks/python.ts | Strip comments before django/flask/fastapi route regex | | src/resolution/frameworks/express.ts | Strip comments before express route regex | | src/resolution/frameworks/laravel.ts | Strip comments before laravel route regex | | src/resolution/frameworks/rust.ts | Strip comments before actix/rocket/axum route regex | | src/resolution/frameworks/csharp.ts | Strip comments before aspnet route regex | | __tests__/extraction-resolution-accuracy.test.ts | 21 regression tests | ## Test plan - [x] npm test: 400/400 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation) - [x] npx tsc --noEmit clean - [x] 21 new tests covering: BOM normalization, per-language line stripping, and false-positive-prevention for every affected framework Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer caught: the minimalApiPattern loop in csharp.ts (Map{Get,Post,...} ASP.NET Core 6+ style) was not updated when the routePatterns loop above it was switched to use the comment-stripped content, leaving commented-out app.MapGet calls still being extracted as real routes. Added a regression test asserting both line-comment and block-comment forms are skipped for minimal-API routes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The edges table has `id INTEGER PRIMARY KEY AUTOINCREMENT` and no other UNIQUE constraint. The codebase uses INSERT OR IGNORE INTO edges (source, target, kind, ...) VALUES (...) clearly intending dedup, but the only candidate key for OR IGNORE was the autoincrement id (which never conflicts) — so the OR IGNORE was a silent no-op. Any code path that re-emits the same edge (resolver retries, partial-failure re-runs, framework extractors that double-emit) silently inserted duplicates, inflating call graphs in codegraph_callers/callees. This change adds a real UNIQUE index on the natural key: UNIQUE INDEX idx_edges_unique ON edges(source, target, kind, COALESCE(line, -1), COALESCE(col, -1)) COALESCE keeps two NULL line/col values comparable as equal — SQLite treats raw NULLs in a UNIQUE index as distinct, which would otherwise defeat dedup for edges that don't carry a line/col (1-indexed everywhere in this codebase, so -1 is a safe sentinel). Migration v4 first deduplicates pre-existing rows (DELETE ... WHERE id NOT IN (SELECT MIN(id) FROM edges GROUP BY source, target, kind, COALESCE(line, -1), COALESCE(col, -1))) then creates the index. Both run inside the migration transaction wrapper so a crash leaves the DB consistent. CURRENT_SCHEMA_VERSION bumped to 4. Two existing version-pinned tests updated to match. ## Files changed | File | Change | |---|---| | src/db/schema.sql | Add UNIQUE INDEX idx_edges_unique for fresh installs | | src/db/migrations.ts | Bump version to 4; add migration v4 (dedup + index) | | __tests__/edges-unique.test.ts | 7 regression tests | | __tests__/foundation.test.ts | Update expected schema version | | __tests__/pr19-improvements.test.ts | Update expected schema version | ## Test plan - [x] npm test: 387/387 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation) - [x] npx tsc --noEmit clean - [x] Independent reviewer pass before pushing — APPROVE; nits-only Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The .codegraphignore marker (per-directory opt-out from indexing) was respected by `scanDirectoryWalk` (the filesystem-walk fallback) but silently ignored by `getGitVisibleFiles` (the git fast path) and `getGitChangedFiles` (sync's git path). Same project gave different file sets depending on whether `.git` existed — typically the marker "worked" only on non-git scratch projects and was a no-op everywhere else, which is the opposite of how most users encounter it. This change adds two helpers in `src/extraction/index.ts`: - `findCodegraphIgnoredDirs(rootDir, files)` — walks parent directories of the given file list, returns the set of directories that contain a `.codegraphignore` marker. Walks once per unique parent directory, with an early-out on shared ancestors. - `isUnderCodegraphIgnoredDir(filePath, ignoredDirs)` — true if filePath lives under any of those dirs. Applied in: - `scanDirectory` and `scanDirectoryAsync` — between the git file list and the include-pattern filter. - `getGitChangedFiles` — refactored to a two-pass collect-then-bucketize so the ignored-dir set is built once from the candidate paths. The marker file itself does not need to be tracked by git — fs.existsSync catches it whether it was committed or added as a local override. ## Files changed | File | Change | |---|---| | src/extraction/index.ts | Add findCodegraphIgnoredDirs + isUnderCodegraphIgnoredDir; apply in scanDirectory, scanDirectoryAsync, getGitChangedFiles | | __tests__/codegraphignore.test.ts | 6 regression tests | ## Test plan - [x] npm test: 386/386 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation) - [x] npx tsc --noEmit clean - [x] Independent reviewer pass before pushing — APPROVE; addressed two info-level cleanups (JSDoc accuracy, removed dead try/catch around fs.existsSync) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The codebase no longer ships embeddings (commit 453c39d), so all search quality has to come from FTS. The maintainer's evidence in PR colbymchenry#74 documented several queries where FTS-only badly trailed semantic search because the SQLite default tokenizer treats `getParser` as a single indivisible token. Three changes that compound to fix that: 1. **Subword tokens.** New `name_subwords` column on `nodes` populated with the camel/snake split of the identifier (kept alongside the original) and indexed by FTS5 at weight 10x. A query for `parser` now finds `getParser` at the FTS layer, not just via post-hoc rescoring on the limited candidate set BM25 surfaces. 2. **Porter stemmer.** `tokenize="porter unicode61"` on the FTS table collapses morphological variants — `parser`/`parsing`/`parses` all stem to `pars` so a natural-language query matches identifier subwords and docstring prose alike. 3. **Stopword stripping.** `searchNodesFTS` now filters stopwords from the query before constructing the OR-join. Without this, words like `how` / `does` / `the` become OR'd FTS hits against any prose-bearing docstring and crowd out the actually-relevant identifier tokens. Reuses the existing `STOP_WORDS` set in src/search/query-utils.ts via a new shared `filterStopwords` helper. ## Empirical results (codegraph's own src/, 1242 nodes, 71 files) | Query | baseline rank | this PR rank | |---|---:|---:| | `ExtractionOrchestrator` | 1 | 1 | | `how does file parsing work` | NOT FOUND in 20 | 2 | | `database connection management` | 18 | 1 | | `resolves references between modules` | 19 | 2 | Mean rank: ~14 → 1.5. Concept-mode docstring re-weighting was tested as a fourth lever and rejected — it regressed `how does file parsing work` because amplifying docstring weight floods the result list with prose-keyword spam more than it lifts truly relevant prose. Not included. ## Migration v4 Existing v3 databases get migrated by: - Adding the `name_subwords` column to `nodes` (idempotent guard so a re-run after partial DDL failure doesn't fail with "duplicate column") - Dropping the old FTS table + triggers (tokenize cannot be ALTERed) - Recreating FTS without triggers - Backfilling name_subwords for every existing node - Rebuilding the FTS index in one shot via `INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')` - Recreating the triggers afterward (so they don't fire mid-backfill, which corrupted FTS5 in earlier prototype runs) ## Files changed | File | Change | |---|---| | `src/utils.ts` | Add `splitIdentifierTokens`, `buildNameSubwords` | | `src/search/query-utils.ts` | Add shared `filterStopwords` helper using existing STOP_WORDS | | `src/db/schema.sql` | Add `name_subwords` column, add it to nodes_fts, add `tokenize="porter unicode61"`, update triggers | | `src/db/migrations.ts` | Bump version to 4; add migration v4 with idempotent ALTER guard | | `src/db/queries.ts` | Populate name_subwords on insert/update; new BM25 weights; stopword filter in searchNodesFTS | | `__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts` | Update expected schema version | | `__tests__/search-quality.test.ts` | 21 regression tests including helpers, end-to-end search, full v3-to-v4 migration, and migration idempotency | ## Test plan - [x] `npm test`: 404/404 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation) - [x] `npx tsc --noEmit` clean - [x] Bench script confirms targets at colbymchenry#18, colbymchenry#19, NOT-FOUND on baseline jump to #1, #2, #2 with this PR - [x] Independent reviewer pass before pushing — addressed three findings: - merged duplicate stopword sets (now uses STOP_WORDS from query-utils.ts) - dedup tokens in buildNameSubwords (`parse` no longer stores `parse parse`) - made migration idempotent on partial-DDL re-run Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a new graph layer that surfaces coupling between files which historically change together — sibling extractors that share patterns, tests that assert schema state, config files coupled to the code that reads them. This information is invisible to static analysis but recoverable from git in a few seconds with no model dependencies. Empirically verified on codegraph's own history: - src/extraction/languages/csharp.ts <-> java.ts (J=0.75) No import relation; coupled because both implement language extractors that share patterns. - __tests__/foundation.test.ts <-> src/db/schema.sql (J=0.57) Test asserts schema version; coupled by intent, not by import. - src/db/migrations.ts <-> src/db/schema.sql (J=0.57) Schema changes always require a migration. ## Components - **Schema**: new co_changes table (file_a, file_b, count) with canonical ordering (file_a < file_b) and CHECK constraint. New commit_count column on files for Jaccard normalisation. Migration v4 (idempotent PRAGMA-table_info guard on the ALTER). - **Miner** (src/cochange/index.ts): mineCoChanges runs `git log --no-merges --name-only --format=tformat:CGCMT-%H -z`. Header sentinel CGCMT-<40-hex-sha> cannot collide with any realistic filename. MAX_FILES_PER_COMMIT=50 skips merge / mass-refactor commits that would otherwise produce O(N^2) spurious pairs. Returns needsFullRescan when the previously-mined HEAD is unreachable (force-push, gc). - **DB methods**: applyCoChangeDeltas (UPSERT pairs + UPDATE per-file counts in one transaction), clearCoChanges, getCoChangedFiles. Jaccard is computed inline in SQL with proper ranking + LIMIT applied AFTER filtering so high-Jaccard pairs aren't silently dropped. - **Wiring**: indexAll triggers full mining; sync triggers incremental mining since last_mined_cochange_head (stored in project_metadata, same pattern as PR colbymchenry#100's sync HEAD tracking). - **Public API**: cg.getCoChangedFiles(filePath, { limit, minCount, minJaccard }) returns ranked partners with count + Jaccard. - **Config**: enableCoChange?: boolean on CodeGraphConfig, default true. False skips mining entirely. Field is properly merged in mergeConfig so the opt-out survives a config save/load round-trip. ## Files changed | File | Change | |---|---| | src/cochange/index.ts (NEW) | Miner: parses git log into pair counts | | src/db/schema.sql | Add commit_count column + co_changes table + indexes | | src/db/migrations.ts | Bump version to 4; add migration v4 (idempotent) | | src/db/queries.ts | applyCoChangeDeltas / clearCoChanges / getCoChangedFiles | | src/index.ts | mineCoChanges wired into indexAll + sync | | src/types.ts | enableCoChange config field, default true | | src/config.ts | mergeConfig now propagates enableCoChange | | __tests__/cochange.test.ts (NEW) | 27 regression tests | | __tests__/foundation.test.ts, __tests__/pr19-improvements.test.ts | Update expected schema version | ## Test plan - [x] npm test: 405/407 pass (two pre-existing fs.watch flakes; watcher passes 12/12 in isolation) - [x] npx tsc --noEmit clean - [x] Independent reviewer pass before pushing — addressed three findings: - mergeConfig dropped enableCoChange (BLOCK → fixed; round-trip test added) - minJaccard was applied JS-side after a small SQL LIMIT, silently dropping high-Jaccard pairs ranked beyond the over-fetch (REQUEST_CHANGES → moved into SQL WHERE; regression test added) - parser sentinel `--` could collide with a file literally named `--` (REQUEST_CHANGES → switched to CGCMT-<sha>; regression test added) ## Notes - Schema bump to v4 collides with PR colbymchenry#102 and PR colbymchenry#104 (both also v4). If any of those land first, this rebases to v5. - HEAD-reset-behind-anchor (rewinding past last_mined without force-push) is a known limitation; merged-base ancestry check could detect it but isn't worth the complexity for this PR. - Mining adds ~1-3s to indexAll on a typical repo. enableCoChange: false opts out entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hods When a query matches many symbols in a single file, current ranking returns the matching class plus 9 of its members from the same file. The first hit is informative; the next 9 are implementation detail that pushes peer files (subclasses, callers, sibling modules) past the limit. This PR caps results per file so search surfaces representative breadth across the codebase rather than burying the user in one class's internals. ## Empirical lift on codegraph (limit=10, default cap=3) | Query | Before (max from one file) | After | |---|---:|---:| | ExtractionOrchestrator | 10/10 | 9/10 (only one file matches; backfill kicks in) | | database | 8/10 | 3/10 | | config | 5/10 | 3/10 | | resolve | 4/10 | 3/10 | | extract / parse | 3 (no regression) | 3 | Top-1 result is preserved in every case — diversification only reorders second-and-onward. ## Components - `SearchOptions.perFileCap?: number` — default 3; 0 disables. - `diversifyByFile(results, limit, perFileCap)` in src/search/query-utils.ts: pure function. First pass picks at most perFileCap per file in score order. If limit isn't yet filled, backfills from skipped (in original score order) so we never return fewer results than the caller requested. - searchNodes wires it after the existing rescoring pass, when there are more candidates than the caller's limit. Relies on the existing 5x internal over-fetch in searchNodesFTS for headroom — no new multiplier added (multiplier-on-multiplier composition was the reviewer's blocking concern in an earlier draft). ## Files changed | File | Change | |---|---| | src/types.ts | Add perFileCap to SearchOptions | | src/search/query-utils.ts | Add diversifyByFile pure helper | | src/db/queries.ts | Wire diversifyByFile into searchNodes; comment on the over-fetch composition | | __tests__/diversify.test.ts (NEW) | 13 regression tests | ## Test plan - [x] npm test: 393/393 pass on macOS - [x] npx tsc --noEmit clean - [x] Bench script confirms the lift in the table above - [x] Independent reviewer pass before pushing — addressed: - Multiplier-on-multiplier (4x outer * 5x inner = 20x for large limits): outer multiplier removed; inner over-fetch is sufficient. - Within-limit reorder: documented as intentional pure-function behavior; integration path correctly skips when results <= limit. - MCP exposure of perFileCap: deferred — default 3 is the desired new behavior; MCP can pick it up later if users want to tune. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r writes Three DB-layer improvements bundled in one PR. ## 1. Batch getNodesByIds — fix N+1 in graph traversal QueryBuilder.getNodesByIds(ids[]) returns Map<id, Node> in one round-trip (chunked at 500 for SQLite param-limit safety, cache-aware so already-cached entries are served from memory and only the misses hit SQL). Replaces 9 N+1 loops in src/graph/traversal.ts: - getCallersRecursive / getCalleesRecursive - getTypeAncestors / getTypeDescendants - findUsages - getImpactRecursive (both inner loops: contains-children + dependents) - findPath BFS frontier - traverseBFS / traverseDFS neighbor expansion - getChildren Each previously did `getNodeById` per edge inside a loop; for a function with N callers at depth D, that was N^D point reads per traversal. Now: one IN-list query per traversal step. Expected 10-50x speedup on deep / fan-out-heavy traversals (impact analysis on popular utilities, call graphs of central modules). ## 2. insertNode cache invalidation — fix correctness bug QueryBuilder.insertNode uses INSERT OR REPLACE INTO nodes. The LRU nodeCache was invalidated by updateNode and deleteNode but NOT by insertNode, so a re-indexed node (replacing a cached row) would still serve the pre-replace version on next getNodeById until LRU eviction pushed it out. Now invalidates `nodeCache.delete(node.id)` at the top of insertNode (after validation, before SQL — so failed-validation early-returns don't churn the cache). ## 3. Auto-maintenance after bulk writes DatabaseConnection.runMaintenance() runs: - PRAGMA optimize (incremental ANALYZE; only re-analyzes tables whose row counts changed materially since last ANALYZE — without it, the SQLite query planner has zero statistics on freshly-bulk-loaded tables and can pick wrong indexes) - PRAGMA wal_checkpoint(PASSIVE) (fold WAL pages back into the main DB file so the WAL doesn't grow unboundedly between automatic checkpoints which fire at 1000 pages) Both are non-blocking and silently swallowed on failure — best-effort, never load-bearing for correctness. Wired into indexAll (when files were indexed) and sync (when files changed). No-op when nothing happened. ## Files changed | File | Change | |---|---| | src/db/queries.ts | Add getNodesByIds; invalidate cache in insertNode | | src/db/index.ts | Add runMaintenance() helper | | src/graph/traversal.ts | Replace 9 N+1 loops with batch lookups | | src/index.ts | Call runMaintenance after indexAll/sync | | __tests__/db-perf.test.ts (NEW) | 9 regression tests | ## Test plan - [x] npm test: 388/389 pass on macOS (one pre-existing fs.watch flake) - [x] npx tsc --noEmit clean - [x] Independent reviewer pass before pushing — APPROVE; one info finding addressed (getChildren was the 9th N+1 site, now batched too) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…write Three watcher tests wrote files immediately after `watcher.start()`, hitting a race with macOS FSEvents (and Linux inotify): there's a small but real latency between `fs.watch()` returning and the kernel actually delivering events. Under parallel test load (when the host CPU is busy running other test files in worker threads), that latency balloons and the file-change event is dropped before the watcher is fully registered. Result: the test waits 5s for a sync that never fires and times out. Affected: - debounced sync > should trigger sync after file change - callbacks > should call onSyncComplete after successful sync - callbacks > should call onSyncError when sync throws - debounced sync > should debounce rapid changes (less affected; its 50ms-spaced loop incidentally settles, but explicit is better) - CodeGraph integration > should auto-sync when files change Other tests in the same file already had a 400ms settle delay with a comment ("Let watcher settle — fs.watch may fire residual events from beforeEach") in the filtering tests. This PR factors that out into a `letWatcherSettle()` helper and applies it consistently to every test that writes immediately after `start()`. No production code change. The flake had no user impact — it was purely a test-order artifact under parallel load. ## Verification - Pre-fix: ran `npm test` 8+ times across this session — fs.watch flake fired in roughly 1 of 3 runs under parallel load. - Post-fix: 3 consecutive `npm test` runs, 380/380 each, no flakes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adding a new language used to require coordinated edits to 6 shared lists across 4 files (Language union in types.ts; DEFAULT_CONFIG.include; WASM_GRAMMAR_FILES, EXTENSION_MAP, and getLanguageDisplayName in grammars.ts; EXTRACTORS map in languages/index.ts). Two PRs adding different languages typically conflicted on every one of those. After this refactor, adding a new language is: 1. Drop a file at src/extraction/languages/<name>.ts exporting an <NAME>_DEF: LanguageDef constant. 2. Add ONE import line and ONE array entry to src/extraction/languages/registry.ts (alphabetical position — adjacent additions are still possible but rare). That's it. grammars.ts, types.ts, tree-sitter.ts dispatch, and the default include globs are all derived from the registry. ## What's in a LanguageDef ```ts interface LanguageDef { name: string; // canonical id displayName: string; // "Pascal / Delphi" extensions: readonly string[]; // ['.pas', '.dpr', ...] includeGlobs: readonly string[]; grammar?: { wasmFile, vendored?, extractor }; // tree-sitter customExtractor?: (fp, src) => ExtractionResult; // Liquid, Svelte extensionOverrides?: { '.dfm': { customExtractor } }; // Pascal forms } ``` Each existing language file now exports both its `xxxExtractor` (unchanged) AND a new `XXX_DEF`. New files were added for tsx, jsx, svelte, liquid (the latter two wrap their existing custom extractor classes via the customExtractor field). ## Refactored consumers - src/extraction/grammars.ts: WASM_GRAMMAR_FILES removed (was internal-only); EXTENSION_MAP now a Proxy that lazy-builds from the registry on first access (avoids TDZ in cyclic load paths). loadGrammarsForLanguages, isLanguageSupported, isGrammarLoaded, getSupportedLanguages, getLanguageDisplayName, detectLanguage — all read from registry. - src/extraction/tree-sitter.ts: extractFromSource's if-chain (svelte / liquid / pascal+.dfm/.fmx) replaced with one lookup: def.extensionOverrides[ext]?.customExtractor || def.customExtractor. Drops direct imports of LiquidExtractor, SvelteExtractor, DfmExtractor. - src/types.ts: DEFAULT_CONFIG moved to src/default-config.ts (cycle break). types.ts re-exports for backward compat. The `include` array is now built lazily from each LanguageDef's includeGlobs. ## What still requires a one-line edit The Language string union in types.ts still hard-codes the known languages (typescript | javascript | … | unknown). New languages added to the registry work at runtime as strings, but adding the literal here is required IF the resolver wants to do exhaustive narrowing on the new language (resolution/index.ts and resolution/import-resolver.ts have a few `language === 'X'` branches). Most new languages don't need such branches. This trade-off keeps strict narrowing for the existing handful of language-specific code paths while making everything else registry-driven. ## Tests 380/380 pass. No new tests; behavior is identical. Existing extraction.test.ts and pr19-improvements.test.ts heavily exercise detectLanguage, isLanguageSupported, getSupportedLanguages, and loadAllGrammars — all green. ## Follow-ups (out of scope) - Auto-discovery in registry.ts via fs.readdirSync — works in built dist/ but vite-node doesn't support extensionless require() of TS source. A small build-time generator could remove the static import list entirely. - Splitting __tests__/extraction.test.ts into per-language test files — eliminates the test-end-of-file conflict surface that every language PR currently hits. - Similar registry refactors for: - MCP tool definitions (each tool self-registers; no shared tools[] array or case-switch in execute()) - Migration files (each migration in src/db/migrations/NNN-*.ts; auto-discovered by version) - Index/sync hooks (centrality, churn, issue-history, config-refs, sql-refs, cochange all currently mutate CodeGraph.indexAll/sync; an IndexHook interface would make each pass self-contained) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tractor Reviewer caught a real bug: the original commit kept the EXTRACTORS map in src/extraction/languages/index.ts as a separate hand-curated registry that TreeSitterExtractor read from. Adding a new grammar-backed language would have required editing EXTRACTORS too, undermining the refactor's stated single-source-of- truth claim. A future contributor missing the EXTRACTORS update would silently produce empty extraction results. Fix: - TreeSitterExtractor now reads its extractor straight off the language def: getLanguageDefByName(this.language)?.grammar?.extractor - EXTRACTORS in languages/index.ts becomes a Proxy that derives lazily from the registry (kept for backward compat — readers unchanged). - Add 16 structural-invariant tests in __tests__/language-registry.test.ts that fail loudly if any derived consumer drifts from the registry: EXTRACTORS / EXTENSION_MAP / detectLanguage / isLanguageSupported / getSupportedLanguages / getLanguageDisplayName all asserted to exactly mirror the registry contents. Adding a new grammar-backed language is now genuinely "one new file + two lines in registry.ts" — no other files to touch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…flicts Today every PR adding an MCP tool conflicts on the same two shared lists in src/mcp/tools.ts: the tools[] array (the list_tools surface) and the case switch in execute(). After this refactor: Adding a new MCP tool: 1. Drop a file at src/mcp/tools/<name>.ts exporting a <NAME>_TOOL: ToolModule (definition + handlerKey). 2. Add one import line and one array entry to src/mcp/tools/registry.ts. 3. Implement handle<Name>(args) on ToolHandler in tools.ts and add the new key to HandlerKey in tools/types.ts. Step 3 is the only remaining "shared method on a single class" conflict surface. Extracting handler bodies into per-tool files (making step 3 also a single-file addition) is left as a follow-up — the cost/benefit favors landing this incremental win now and finishing the body extraction once language and migration refactors land. ## What's new - **src/mcp/tool-types.ts** — extracted ToolDefinition, ToolResult, PropertySchema, projectPathProperty into a shared module so per-tool files can import without circular dependency. - **src/mcp/tools/types.ts** — ToolModule interface, HandlerKey string union, and ToolHandlerLike (a structural type that ToolHandler now `implements`, providing compile-time guarantee that every HandlerKey maps to a real method). - **src/mcp/tools/<name>.ts × 9** — one file per existing tool (callees, callers, context, explore, files, impact, node, search, status). Each ~25-30 lines: import + definition literal + handlerKey reference. - **src/mcp/tools/registry.ts** — static-import barrel, sorted alphabetically. Exports getToolModules(), getToolModule(name), and the derived `tools[]` array. - **src/mcp/tools.ts** — ~200 lines deleted from the top (inline types + tools[] array + projectPathProperty). execute()'s case-switch replaced with a registry lookup + type-safe `this[mod.handlerKey](args)` dispatch (now compile- time-checked thanks to `implements ToolHandlerLike`). All `private async handle*` methods now public to match the interface. errorResult/textResult also public for the same reason. - **src/mcp/index.ts** — MCPServer's tool-existence check switched from a linear `tools.find()` scan to the O(1) `getToolModule()` Map lookup, eliminating two parallel lookup paths. ## Tests 387/387 pass. **7 new tests** in __tests__/mcp-tool-registry.test.ts: - Definitions are well-formed (name shape, description length). - handlerKey shape (`handle<UpperCase>`). - Every registered handlerKey resolves to a real method on ToolHandler. - Exported `tools[]` exactly mirrors the registry. - Canonical 9 main-line tools regression guard. - execute() unknown-tool error path. - **End-to-end dispatch smoke test**: execute('codegraph_status', {}) reaches the real handler body (no broken `this` binding) — would fail loudly if the dynamic dispatch chain ever breaks. ## Reviewer pass Independent reviewer ran once. 2 REQUEST_CHANGES + 2 INFO addressed: 1. ToolHandlerLike was defined but never enforced — ToolHandler now `implements ToolHandlerLike`. Eliminates the `(this as unknown as Record<...>)` cast in execute(); dispatch is fully compile-time-checked. 2. No end-to-end dispatch test — added one (see Tests above). 3. MCPServer.handleToolsCall used a linear `tools.find()` scan while execute() used Map lookup — switched to getToolModule() for parity. 4. Removed redundant .slice() in registry.ts (map() already returns a fresh array). ## Backward compat src/mcp/tools.ts still re-exports ToolDefinition, ToolResult, the mutable `tools[]` array, ToolHandler, and getExploreBudget. Every existing consumer (`import { ToolDefinition, ToolResult, tools, ToolHandler } from './tools'`) keeps working unchanged. ## Affected open PRs - colbymchenry#110 (review-context): rebases to 1 new file in tools/ + 2 lines in registry.ts + 1 method on ToolHandler + 1 line in HandlerKey. - colbymchenry#112 (centrality+churn): same shape for the codegraph_hotspots tool. - colbymchenry#114 (config-refs): same shape for codegraph_config. - colbymchenry#115 (sql-refs): same shape for codegraph_sql. Each goes from 4-way conflict (tools[] + case + handler + helpers) down to 1-way conflict (HandlerKey + handler method on ToolHandler, both in tools.ts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Today every PR adding a schema migration claims `CURRENT_SCHEMA_VERSION = next` AND adds an array entry to `migrations: Migration[]` in src/db/migrations.ts. Two PRs both claiming the same version resolve as: "second PR's v4 silently no-ops on existing DBs" — a real silent-data-loss bug class (PR colbymchenry#113's reviewer caught one). After this refactor: Adding a new schema migration: 1. Pick the next free 3-digit prefix (`git ls-files 'src/db/migrations/[0-9]*.ts'` shows what's taken). 2. Create `src/db/migrations/<NNN>-<short-name>.ts` exporting a `MIGRATION: MigrationModule` (description + up). 3. Add one import line and one entry to `src/db/migrations/index.ts`'s REGISTERED_MODULES array. Two PRs both creating `004-foo.ts` collide on the FILESYSTEM — the maintainer sees it instantly. No more silent skipped migrations. ## What's new - `src/db/migrations/types.ts` — `MigrationModule { description, up }` and `Migration extends MigrationModule { version }`. - `src/db/migrations/002-project-metadata.ts` — extracted v2 body verbatim. - `src/db/migrations/003-lower-name-index.ts` — extracted v3 body verbatim. - `src/db/migrations/index.ts` — central registry. Static-imports each migration, parses the version FROM THE FILENAME (no hand-typed version field that can drift), enforces strict `NNN-kebab-name.ts` shape, validates uniqueness/sort at module load (throws loudly on collision), exposes ALL_MIGRATIONS and CURRENT_SCHEMA_VERSION. - `src/db/migrations.ts` — refactored to a thin runner. Same exported surface (CURRENT_SCHEMA_VERSION, getCurrentVersion, runMigrations, needsMigration, getPendingMigrations, getMigrationHistory, Migration type) — every existing import keeps working unchanged. - `__tests__/migrations-registry.test.ts` — 8 invariant tests: registry non-empty, versions unique + strictly ascending, CURRENT_SCHEMA_VERSION matches max, every file matches the strict NNN-kebab-name pattern, no orphan files, no phantom registrations. ## Reviewer pass Independent reviewer ran once. 3 REQUEST_CHANGES + 1 INFO addressed: 1. Hand-typed `version` field in REGISTERED_MODULES could drift from filename. **Fixed**: removed the version field; registry now parses version from filename via FILENAME_PATTERN regex inside validateRegistered. 2. Filename-pattern test was lenient (allowed 4-digit or 1-digit prefixes). **Fixed**: new "every migration file matches the strict NNN-kebab-name.ts pattern" test catches malformed filenames as orphan-detection-bypassing offenders. 3. `getPendingMigrations` returned `readonly Migration[]`, breaking callers that typed the result as `Migration[]`. **Fixed**: returns a fresh mutable array via `.slice()`. 4. No throw-on-duplicate test for validateRegistered (module evaluation timing). Acknowledged; not added. ## Backward compat Every existing import works unchanged: - `import { CURRENT_SCHEMA_VERSION } from './migrations'` ✓ - `import { runMigrations } from './migrations'` ✓ - `import { needsMigration } from './migrations'` ✓ - `import { getMigrationHistory } from './migrations'` ✓ - `import { getPendingMigrations } from './migrations'` — returns mutable Migration[] (preserved) - `Migration` type — re-exported ## Affected open PRs Every migration-touching PR (colbymchenry#102 UNIQUE edges, colbymchenry#105 cochange, colbymchenry#108 perf db, colbymchenry#111 LLM features, my colbymchenry#112 centrality+churn, colbymchenry#113 issue-history, colbymchenry#114 config-refs, colbymchenry#115 sql-refs) currently claims migration v4 and conflicts with each other on `migrations.ts`. After this lands they each become: - 1 new file: `src/db/migrations/<NNN>-<name>.ts` - 2 lines in registry.ts (import + array entry) Conflict shape changes from "next free version + array entry + CURRENT_SCHEMA_VERSION bump in one file" (4-way conflict) to "1 new file" + 2-line registry edit. If two PRs target the same NNN, the filesystem collision surfaces immediately — no silent skipped migrations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Today every PR adding a derived-signal pass (centrality, churn, issue-history, config-refs, sql-refs, cochange) edits the same 3 spots in src/index.ts: 1. New imports at the top 2. New private method on `CodeGraph` (e.g. runDerivedSignals, runIssueHistoryPass, runConfigRefsPass, runSqlRefsPass) 3. New call site in `indexAll` AFTER resolution 4. New call site in `sync` AFTER resolution 5 PRs collide on every one of those. After this refactor: Adding a new derived-signal pass: 1. Create `src/index-hooks/<name>.ts` exporting a `HOOK: IndexHook` constant with `afterIndexAll` and/or `afterSync` methods. 2. Add one import + one entry to `src/index-hooks/registry.ts`. `CodeGraph.indexAll` and `sync` invoke the hook runner once; adding a new pass touches only the hook file + the registry. Zero changes to CodeGraph itself. ## What's new - **src/index-hooks/types.ts** — `IndexHook` interface (`afterIndexAll`, `afterSync`, both optional), `IndexHookContext` (projectRoot + config + queries + db), and `IndexHookOutcome` for diagnostic reporting. - **src/index-hooks/registry.ts** — static-import list of every registered hook (empty on main today; PRs adding hooks fill it in), plus the `runAfterIndexAll` / `runAfterSync` runners that iterate hooks and catch errors so one broken hook never fails indexing. - **src/index.ts** — `indexAll` calls `runAfterIndexAll(ctx)` after resolution. `sync` calls `runAfterSync(ctx, result)` after resolution. New private `buildHookContext()` helper exposes a stable read-only context. - **__tests__/index-hooks.test.ts** — 6 tests covering empty registry, runner shape, and the `afterIndexAll` / `afterSync` contracts. ## Why ship the framework on main with zero registered hooks? The only consumers of this framework today are 5 unmerged PRs (colbymchenry#105 cochange + my colbymchenry#112-colbymchenry#115). Landing the framework now lets each of those PRs rebase to a 2-line change instead of 8-10 lines mutating CodeGraph adjacent-line. Without this, all 5 PRs collide on the same indexAll/sync call sites. The framework adds zero behavior on main (no registered hooks = no-op runner). 380→386 tests confirm no regression. ## Affected open PRs | PR | Today | After this lands | |---|---|---| | colbymchenry#105 cochange | runDerivedSignals helper + 2 call sites | 1 hook file in src/index-hooks/ + 2 lines in registry.ts | | colbymchenry#112 centrality+churn | same shape | same shape | | colbymchenry#113 issue-history | same shape | same shape | | colbymchenry#114 config-refs | same shape | same shape | | colbymchenry#115 sql-refs | same shape | same shape | Each goes from "edit CodeGraph in 4 spots" to "drop a hook file." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…efactors Lands centrality (PageRank) and churn (git history) as registered IndexHooks (`afterIndexAll` + `afterSync`) instead of CodeGraph private methods. Adds: - Migration 004: nodes.centrality + files.{commit_count,loc, first_seen_ts,last_touched_ts} + indexes - src/centrality/ + src/churn/ (pure modules) - src/index-hooks/centrality.ts + churn.ts (registered hooks) - CodeGraph public methods: getCentrality, getTopCentralNodes, getCentralityRank, getFileChurn, getHotspots - codegraph_hotspots MCP tool wired through ToolModule registry + handleHotspots on ToolHandler - Updated regression-guard tests (index-hooks, mcp-tool-registry) to reflect newly registered hooks/tools Tests: 440/440 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mines Fixes/Closes/Resolves #N commits and attributes them to symbols touched by each commit hunks. Lands as a registered IndexHook (issue-history). - Migration 005: symbol_issues table - src/issue-history/ (pure module): mineIssueHistory + parse-diff - src/index-hooks/issue-history.ts (registered hook) - CodeGraph public method: getIssuesForNode - codegraph_node MCP tool now surfaces issue history line - enableIssueHistory flag default true wired through config merge - Removed defensive ensureSymbolIssuesTable guard and its test: the v4-collision bug class is impossible under file-based migrations (PR colbymchenry#118 refactor); filenames collide on the filesystem instead. Tests: 470/471 pass (1 watcher flake under load, isolation OK). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extracts env-var read sites (process.env.X, os.getenv("X"), etc) into config_refs and exposes them via codegraph_config MCP tool. Lands as a registered IndexHook (config-refs). - Migration 006: config_refs table - src/config-refs/ (pure module): regex-based extractor - src/index-hooks/config-refs.ts (registered hook with full / files scoping for indexAll vs sync) - CodeGraph public methods: getConfigKeys, getConfigRefsByKey, getConfigKeysForNode - codegraph_config MCP tool wired through ToolModule registry - enableConfigRefs flag default true - Removed defensive ensureConfigRefsTable guard + its test for the same reason as PR colbymchenry#113: v4-collision bug class is impossible under file-based migrations. Tests: 488/489 pass (1 watcher flake under load). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extracts SQL string-literal references to tables (read/write/ddl) into sql_refs and exposes via codegraph_sql MCP tool. Lands as a registered IndexHook (sql-refs). - Migration 007: sql_refs table - src/sql-refs/ (pure module): regex extractor with comment strip + SQL-keyword pre-filter - src/index-hooks/sql-refs.ts (registered hook with full / files scoping; uses replaceAllSqlRefs for atomic indexAll swap) - CodeGraph public methods: getSqlTables, getSqlRefsByTable, getSqlTablesForNode - codegraph_sql MCP tool wired through ToolModule registry - enableSqlRefs flag default true - Removed defensive ensureSqlRefsTable guard + its test (same reason as colbymchenry#113 / colbymchenry#114: bug class is impossible under file-based migrations). Tests: 514/515 pass (1 watcher flake under load). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # __tests__/foundation.test.ts # __tests__/pr19-improvements.test.ts # src/db/migrations.ts

…-shipped

…pped # Conflicts: # src/extraction/index.ts

…l-shipped

# Conflicts: # src/extraction/index.ts # src/utils.ts

# Conflicts: # src/index.ts

# Conflicts: # __tests__/foundation.test.ts # __tests__/pr19-improvements.test.ts # src/db/migrations.ts # src/db/schema.sql # src/utils.ts

# Conflicts: # src/db/queries.ts

Originally based on monolithic grammars.ts/tree-sitter.ts; rebased to the per-language registry pattern (PR colbymchenry#116): - src/extraction/languages/hcl.ts (LanguageDef with grammar+custom) - 'hcl' added to Language union - Vendored tree-sitter-hcl.wasm - HclExtractor (custom block extractor) - 220 extraction tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- src/extraction/languages/r.ts (rExtractor + R_DEF) - 'r' added to Language union - Vendored tree-sitter-r.wasm - Existing extraction tests for HCL preserved + R tests added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- src/extraction/languages/sql.ts (LanguageDef with grammar+custom) - 'sql' added to Language union - Vendored tree-sitter-sql.wasm - SqlExtractor (custom DDL/DML extractor) - Extensions: .sql, .ddl, .dml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # __tests__/sync.test.ts # src/extraction/index.ts # src/types.ts

# Conflicts: # __tests__/foundation.test.ts # __tests__/pr19-improvements.test.ts # src/db/migrations.ts # src/db/queries.ts # src/db/schema.sql # src/index.ts # src/types.ts

andreinknv · 2026-04-27T23:09:22Z

Heads-up: this branch has been force-pushed. This PR's contents have been rebased to use the per-thing file layout established by refactor PRs #116-#119. Functionality is identical to the original; the internal file shape now matches the post-refactor architecture so this PR can be merged after the refactors with zero conflicts.

The current diff against main looks larger than the feature alone because it transitively includes the 4 refactor commits + any prerequisite PRs. That diff auto-shrinks as the prerequisites merge in the order documented in #120. See #120 (comment) for the full rebase summary.

Activates the queries-history.ts module that was committed inert in the previous Files-cluster commit. Deletes the original 12 history- domain methods from src/db/queries.ts and switches every call site to the new free functions. Methods moved (now active): - File co-change: applyCoChangeDeltas, clearCoChanges, getCoChangedFiles - Per-file churn: applyChurnDeltas, clearChurn - Risk view: getHotspots - Symbol-issue attributions: applyIssueAttributions, clearIssueAttributions, getSymbolIssuesCount, getIssuesForNode, getNodesForIssue, getSymbolCoChanges Constants migrated to queries-history.ts (no longer in queries.ts): HOTSPOTS_DEFAULT_LIMIT, HOTSPOTS_ORDER_BY, MS_PER_SECOND, SECONDS_PER_DAY, CO_CHANGE_DEFAULT_LIMIT, CO_CHANGE_DEFAULT_MIN_COUNT. Touchpoints (~62 call sites): - src/index.ts: 4 wrappers + ReturnType<QueryBuilder['X']> annotations switched to ReturnType<typeof X> - src/index-hooks/{churn,cochange,issue-history}.ts - src/review/index.ts: removed defensive shim that runtime-checked `typeof q.getCoChangedFiles === 'function'` for pre-colbymchenry#105 installs. After extraction the method is gone from QB so the runtime check would always return [], defeating the call. Switched to direct imported free function with the existing try/catch outer guard. - __tests__/{churn,cochange,stress-test-roundtwo-fixes}.test.ts Verification: - tsc clean, suite 1352/13/0 (was failing 16 then 1 across iterations — fixed by per-file imports + safeGetCoChangedFiles shim deletion) - Reindex: QueryBuilder god_class **82 -> 70** (delta = 12, exact; this is the inflection point where QB drops to the warning-level threshold per the handoff plan) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

andreinknv and others added 13 commits April 26, 2026 00:18

This was referenced Apr 26, 2026

feat(mcp): codegraph_review_context — structured PR-review context from a diff #110

Open

feat(graph): PageRank centrality + per-file churn metrics #112

Open

andreinknv and others added 4 commits April 27, 2026 16:27

andreinknv mentioned this pull request Apr 27, 2026

refactor: file-based migrations — eliminate silent version-collision bugs #118

Closed

This was referenced Apr 27, 2026

refactor: index-hook framework — eliminate per-pass CodeGraph mutations #119

Closed

Merge-order guide for the open PR backlog (4 refactor PRs unblock the rest) #120

Open

andreinknv and others added 7 commits April 27, 2026 17:38

Merge branch 'refactor/mcp-tool-registry' into battle-test/all-shipped

6d37c02

Merge branch 'refactor/migration-files' into battle-test/all-shipped

53a6358

Merge branch 'refactor/index-hooks' into battle-test/all-shipped

d807e23

andreinknv and others added 15 commits April 27, 2026 18:07

Merge branch 'fix/watcher-test-flake' into battle-test/all-shipped

d3120f5

Merge branch 'fix/edges-unique-constraint' into battle-test/all-shipped

c3c445d

# Conflicts: # __tests__/foundation.test.ts # __tests__/pr19-improvements.test.ts # src/db/migrations.ts

Merge branch 'fix/codegraphignore-git-fast-path' into battle-test/all…

3aef004

…-shipped

Merge branch 'fix/sync-detect-head-movement' into battle-test/all-shi…

e956c33

…pped # Conflicts: # src/extraction/index.ts

Merge branch 'fix/extraction-resolution-accuracy' into battle-test/al…

8d6b9ce

…l-shipped

Merge branch 'fix/audit-correctness' into battle-test/all-shipped

125e763

Merge branch 'fix/audit-hardening' into battle-test/all-shipped

a0892e8

# Conflicts: # src/extraction/index.ts # src/utils.ts

Merge branch 'feat/db-perf-and-cache-fix' into battle-test/all-shipped

b65667a

# Conflicts: # src/index.ts

Merge branch 'feat/fts-search-quality' into battle-test/all-shipped

2db150d

# Conflicts: # __tests__/foundation.test.ts # __tests__/pr19-improvements.test.ts # src/db/migrations.ts # src/db/schema.sql # src/utils.ts

Merge branch 'feat/search-diversify' into battle-test/all-shipped

052f5b0

# Conflicts: # src/db/queries.ts

Merge branch 'feat/index-submodules' into battle-test/all-shipped

ea3f628

# Conflicts: # __tests__/sync.test.ts # src/extraction/index.ts # src/types.ts

Merge branch 'feat/cochange-graph' into battle-test/all-shipped

8255db1

# Conflicts: # __tests__/foundation.test.ts # __tests__/pr19-improvements.test.ts # src/db/migrations.ts # src/db/queries.ts # src/db/schema.sql # src/index.ts # src/types.ts

andreinknv mentioned this pull request Apr 28, 2026

perf(db): split embeddings + in-memory similarity cache + drop redundant co_changes index #123

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cochange): file-level co-change graph mined from git history#105

feat(cochange): file-level co-change graph mined from git history#105
andreinknv wants to merge 40 commits into
colbymchenry:mainfrom
andreinknv:feat/cochange-graph

andreinknv commented Apr 26, 2026 •

edited

Loading

Uh oh!

andreinknv commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andreinknv commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Empirical results (codegraph's own history)

What's added

Files changed

Test plan

Known limitations

Uh oh!

andreinknv commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

andreinknv commented Apr 26, 2026 •

edited

Loading