Skip to content

feat(graph): PageRank centrality + per-file churn metrics#24

Open
mschreib28 wants to merge 9 commits into
mainfrom
upstream/feat/centrality-and-churn
Open

feat(graph): PageRank centrality + per-file churn metrics#24
mschreib28 wants to merge 9 commits into
mainfrom
upstream/feat/centrality-and-churn

Conversation

@mschreib28
Copy link
Copy Markdown
Owner

🔄 Rebased onto post-refactor patterns (Centrality + churn + hotspots)\n> This PR has been rebased to use the per-thing file layout established by refactor PRs colbymchenry#116/colbymchenry#117/colbymchenry#118/colbymchenry#119. Behavior shipped is identical to the original; only the internal code shape changed.\n>\n> What this PR now ships:\n>\n> - src/index-hooks/centrality.ts + src/index-hooks/churn.ts — two registered hooks\n> - src/db/migrations/004-centrality-churn.ts — adds nodes.centrality + files.{commit_count,loc,first_seen_ts,last_touched_ts}\n> - src/mcp/tools/hotspots.tsHOTSPOTS_TOOL: ToolModule for codegraph_hotspots\n> - src/centrality/index.ts + src/churn/index.ts — pure compute modules\n> - src/index-hooks/registry.ts, src/db/migrations/index.ts, src/mcp/tools/registry.ts, src/mcp/tools/types.ts — registry entries (~2 lines each)\n> - src/db/queries.ts, src/db/schema.sql, src/types.ts, src/default-config.ts, src/config.ts, src/index.ts, src/mcp/tools.ts — query methods, config flags, public API, handler\n> - __tests__/centrality.test.ts, __tests__/churn.test.ts\n>\n> What you'll see in the diff: until colbymchenry#116-colbymchenry#119 land, the diff against main includes the refactor commits as part of the branch history. After they merge, GitHub auto-shrinks this PR's diff to just the files above. Full context: colbymchenry#120.\n>\n> ---\n\n## Summary\n\nAdds two graph-derived signals — PageRank centrality on nodes and per-file churn mined from git — and combines them into a codegraph_hotspots MCP tool that ranks files by risk = centrality × churn. Empirically validated before implementation: top hotspot on this repo's own history is src/extraction/tree-sitter.ts (47 commits × 2347 LOC), exactly the file PR colbymchenry#77 split.\n\nWhy both signals together: centrality alone surfaces architecturally-important code; churn alone surfaces volatile code. The intersection — central code that changes often — is where bugs accumulate, and that's the agent-facing query this PR ships.\n\n## What's added\n\n- Centrality (src/centrality/index.ts): PageRank over calls + references edges (excluding contains so lexical hierarchy doesn't dominate). Pure compute, dangling-mass redistribution, Float64Array. Sub-100 ms on this repo (1,688 nodes / 4,319 edges).\n- Churn (src/churn/index.ts): per-file commit count, first/last touched timestamps, LOC. Mined from git log -z. Incremental via last_mined_churn_head with needsFullRescan recovery for force-push / gc.\n- Migration v4 with PRAGMA-table_info guards — coexists cleanly with colbymchenry#105 (cochange) which also targets v4.\n- codegraph_hotspots MCP tool: ranked file list, sortBy = risk | centrality | churn, with minCommits / minCentrality filters.\n- codegraph_node annotation: centrality + 1-based rank when available.\n- Config flags: enableCentrality / enableChurn (default true) for opt-out on shallow clones / non-git checkouts.\n\n## Bench\n\n| Step | Time |\n|---|---|\n| Centrality recompute (1.7k nodes / 4.3k edges) | ~50 ms |\n| Full churn mine (98 commits) | ~200 ms |\n| Total derived-signals pass | < 300 ms |\n\nBoth passes are best-effort: any failure is logged + swallowed so indexing/sync still returns success. The data is advisory.\n\n## Sample output (from node dist/bin/codegraph.js index . && cg.getHotspots())\n\n\n1. src/extraction/tree-sitter.ts pr=0.0349 commits=47 risk=1.6399\n2. src/db/queries.ts pr=0.0539 commits=21 risk=1.1317\n3. src/index.ts pr=0.0738 commits=15 risk=1.1068\n4. src/types.ts pr=0.0699 commits=15 risk=1.0479\n5. src/mcp/tools.ts pr=0.0311 commits=23 risk=0.7156\n\n\nSide-finding from PageRank: ranks 15 / 23 in the raw output (trim, run) are resolver false-positives — overly-generic short names being conflated across files. Useful diagnostic byproduct, not a goal of this PR.\n\n## Tests\n\n408 pass total, 28 new:\n- centrality.test.ts (10) — empty / isolated / sinks / star / cycle / mass-conservation / disconnected components / unknown-node edges / damping sentinel\n- churn.test.ts (13) — non-git / commit counts / timestamps / MAX_FILES_PER_COMMIT skip / incremental mining / unreachable sha / no-op / unicode + spaces in paths / LOC newline edge cases\n- derived-signals.test.ts (5) — end-to-end pipeline hub/leaf / no-git fallback / enableCentrality:false / sync incremental / force-push recovery\n\n## Reviewer pass\n\nIndependent reviewer agent ran once. Two REQUEST_CHANGES + three INFO items addressed:\n\n- Added minCentrality to codegraph_hotspots schema (capability gap vs. underlying API).\n- Added integration test for needsFullRescan recovery path (only untested branch).\n- Documented why upsertFile deliberately omits churn columns from ON CONFLICT DO UPDATE.\n- Annotated the ?? '' belt-and-braces guard at the recovery path.\n\n## Test plan\n\n- [x] npx tsc --noEmit clean\n- [x] npm test — 408/408 passing\n- [x] End-to-end smoke on this repo: codegraph index . → centrality + churn populated, hotspots match spike-validated output\n- [x] Migration v3 → v4 round-trip verified in foundation/pr19 tests\n- [x] Reviewer agent: REQUEST_CHANGES → APPROVE after fixes\n- [ ] Reviewer's final pass after fixes (optional — happy to re-run)\n\n## Coexistence note\n\n#105 (cochange) targets the same migration v4 with its own commit_count column add. Both PRs use idempotent PRAGMA table_info guards on the column add, so whichever lands first does not block the other.\n\n🤖 Generated with Claude Code\n\n


Copied from colbymchenry/codegraph#112

andreinknv and others added 9 commits April 27, 2026 16:27
Adding a new language used to require coordinated edits to 6
shared lists across 4 files (Language union in types.ts;
DEFAULT_CONFIG.include; WASM_GRAMMAR_FILES, EXTENSION_MAP, and
getLanguageDisplayName in grammars.ts; EXTRACTORS map in
languages/index.ts). Two PRs adding different languages typically
conflicted on every one of those.

After this refactor, adding a new language is:

  1. Drop a file at src/extraction/languages/<name>.ts exporting an
     <NAME>_DEF: LanguageDef constant.
  2. Add ONE import line and ONE array entry to
     src/extraction/languages/registry.ts (alphabetical position —
     adjacent additions are still possible but rare).

That's it. grammars.ts, types.ts, tree-sitter.ts dispatch, and the
default include globs are all derived from the registry.

## What's in a LanguageDef

```ts
interface LanguageDef {
  name: string;                  // canonical id
  displayName: string;           // "Pascal / Delphi"
  extensions: readonly string[]; // ['.pas', '.dpr', ...]
  includeGlobs: readonly string[];
  grammar?: { wasmFile, vendored?, extractor };  // tree-sitter
  customExtractor?: (fp, src) => ExtractionResult;  // Liquid, Svelte
  extensionOverrides?: { '.dfm': { customExtractor } };  // Pascal forms
}
```

Each existing language file now exports both its `xxxExtractor`
(unchanged) AND a new `XXX_DEF`. New files were added for tsx, jsx,
svelte, liquid (the latter two wrap their existing custom extractor
classes via the customExtractor field).

## Refactored consumers

- src/extraction/grammars.ts: WASM_GRAMMAR_FILES removed (was
  internal-only); EXTENSION_MAP now a Proxy that lazy-builds from
  the registry on first access (avoids TDZ in cyclic load paths).
  loadGrammarsForLanguages, isLanguageSupported, isGrammarLoaded,
  getSupportedLanguages, getLanguageDisplayName, detectLanguage —
  all read from registry.
- src/extraction/tree-sitter.ts: extractFromSource's if-chain
  (svelte / liquid / pascal+.dfm/.fmx) replaced with one lookup:
  def.extensionOverrides[ext]?.customExtractor || def.customExtractor.
  Drops direct imports of LiquidExtractor, SvelteExtractor,
  DfmExtractor.
- src/types.ts: DEFAULT_CONFIG moved to src/default-config.ts (cycle
  break). types.ts re-exports for backward compat. The `include`
  array is now built lazily from each LanguageDef's includeGlobs.

## What still requires a one-line edit

The Language string union in types.ts still hard-codes the known
languages (typescript | javascript | … | unknown). New languages
added to the registry work at runtime as strings, but adding the
literal here is required IF the resolver wants to do exhaustive
narrowing on the new language (resolution/index.ts and
resolution/import-resolver.ts have a few `language === 'X'`
branches). Most new languages don't need such branches.

This trade-off keeps strict narrowing for the existing handful of
language-specific code paths while making everything else
registry-driven.

## Tests

380/380 pass. No new tests; behavior is identical. Existing
extraction.test.ts and pr19-improvements.test.ts heavily exercise
detectLanguage, isLanguageSupported, getSupportedLanguages, and
loadAllGrammars — all green.

## Follow-ups (out of scope)

- Auto-discovery in registry.ts via fs.readdirSync — works in
  built dist/ but vite-node doesn't support extensionless require()
  of TS source. A small build-time generator could remove the
  static import list entirely.
- Splitting __tests__/extraction.test.ts into per-language test
  files — eliminates the test-end-of-file conflict surface that
  every language PR currently hits.
- Similar registry refactors for:
  - MCP tool definitions (each tool self-registers; no shared
    tools[] array or case-switch in execute())
  - Migration files (each migration in src/db/migrations/NNN-*.ts;
    auto-discovered by version)
  - Index/sync hooks (centrality, churn, issue-history,
    config-refs, sql-refs, cochange all currently mutate
    CodeGraph.indexAll/sync; an IndexHook interface would make
    each pass self-contained)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tractor

Reviewer caught a real bug: the original commit kept the
EXTRACTORS map in src/extraction/languages/index.ts as a separate
hand-curated registry that TreeSitterExtractor read from. Adding
a new grammar-backed language would have required editing
EXTRACTORS too, undermining the refactor's stated single-source-of-
truth claim. A future contributor missing the EXTRACTORS update
would silently produce empty extraction results.

Fix:
- TreeSitterExtractor now reads its extractor straight off the
  language def: getLanguageDefByName(this.language)?.grammar?.extractor
- EXTRACTORS in languages/index.ts becomes a Proxy that derives
  lazily from the registry (kept for backward compat — readers
  unchanged).
- Add 16 structural-invariant tests in __tests__/language-registry.test.ts
  that fail loudly if any derived consumer drifts from the registry:
  EXTRACTORS / EXTENSION_MAP / detectLanguage / isLanguageSupported /
  getSupportedLanguages / getLanguageDisplayName all asserted to
  exactly mirror the registry contents.

Adding a new grammar-backed language is now genuinely "one new file
+ two lines in registry.ts" — no other files to touch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…flicts

Today every PR adding an MCP tool conflicts on the same two
shared lists in src/mcp/tools.ts: the tools[] array (the
list_tools surface) and the case switch in execute(). After this
refactor:

  Adding a new MCP tool:
  1. Drop a file at src/mcp/tools/<name>.ts exporting a
     <NAME>_TOOL: ToolModule (definition + handlerKey).
  2. Add one import line and one array entry to
     src/mcp/tools/registry.ts.
  3. Implement handle<Name>(args) on ToolHandler in tools.ts and
     add the new key to HandlerKey in tools/types.ts.

Step 3 is the only remaining "shared method on a single class"
conflict surface. Extracting handler bodies into per-tool files
(making step 3 also a single-file addition) is left as a
follow-up — the cost/benefit favors landing this incremental win
now and finishing the body extraction once language and migration
refactors land.

## What's new

- **src/mcp/tool-types.ts** — extracted ToolDefinition, ToolResult,
  PropertySchema, projectPathProperty into a shared module so
  per-tool files can import without circular dependency.
- **src/mcp/tools/types.ts** — ToolModule interface, HandlerKey
  string union, and ToolHandlerLike (a structural type that
  ToolHandler now `implements`, providing compile-time guarantee
  that every HandlerKey maps to a real method).
- **src/mcp/tools/<name>.ts × 9** — one file per existing tool
  (callees, callers, context, explore, files, impact, node, search,
  status). Each ~25-30 lines: import + definition literal +
  handlerKey reference.
- **src/mcp/tools/registry.ts** — static-import barrel, sorted
  alphabetically. Exports getToolModules(), getToolModule(name),
  and the derived `tools[]` array.
- **src/mcp/tools.ts** — ~200 lines deleted from the top
  (inline types + tools[] array + projectPathProperty).
  execute()'s case-switch replaced with a registry lookup +
  type-safe `this[mod.handlerKey](args)` dispatch (now compile-
  time-checked thanks to `implements ToolHandlerLike`).
  All `private async handle*` methods now public to match the
  interface. errorResult/textResult also public for the same reason.
- **src/mcp/index.ts** — MCPServer's tool-existence check switched
  from a linear `tools.find()` scan to the O(1) `getToolModule()`
  Map lookup, eliminating two parallel lookup paths.

## Tests

387/387 pass. **7 new tests** in __tests__/mcp-tool-registry.test.ts:
- Definitions are well-formed (name shape, description length).
- handlerKey shape (`handle<UpperCase>`).
- Every registered handlerKey resolves to a real method on
  ToolHandler.
- Exported `tools[]` exactly mirrors the registry.
- Canonical 9 main-line tools regression guard.
- execute() unknown-tool error path.
- **End-to-end dispatch smoke test**: execute('codegraph_status', {})
  reaches the real handler body (no broken `this` binding) — would
  fail loudly if the dynamic dispatch chain ever breaks.

## Reviewer pass

Independent reviewer ran once. 2 REQUEST_CHANGES + 2 INFO addressed:

1. ToolHandlerLike was defined but never enforced —
   ToolHandler now `implements ToolHandlerLike`. Eliminates the
   `(this as unknown as Record<...>)` cast in execute(); dispatch
   is fully compile-time-checked.
2. No end-to-end dispatch test — added one (see Tests above).
3. MCPServer.handleToolsCall used a linear `tools.find()` scan
   while execute() used Map lookup — switched to getToolModule()
   for parity.
4. Removed redundant .slice() in registry.ts (map() already
   returns a fresh array).

## Backward compat

src/mcp/tools.ts still re-exports ToolDefinition, ToolResult, the
mutable `tools[]` array, ToolHandler, and getExploreBudget. Every
existing consumer (`import { ToolDefinition, ToolResult, tools,
ToolHandler } from './tools'`) keeps working unchanged.

## Affected open PRs

- colbymchenry#110 (review-context): rebases to 1 new file in tools/ + 2
  lines in registry.ts + 1 method on ToolHandler + 1 line in
  HandlerKey.
- colbymchenry#112 (centrality+churn): same shape for the codegraph_hotspots
  tool.
- colbymchenry#114 (config-refs): same shape for codegraph_config.
- colbymchenry#115 (sql-refs): same shape for codegraph_sql.

Each goes from 4-way conflict (tools[] + case + handler + helpers)
down to 1-way conflict (HandlerKey + handler method on ToolHandler,
both in tools.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today every PR adding a schema migration claims
`CURRENT_SCHEMA_VERSION = next` AND adds an array entry to
`migrations: Migration[]` in src/db/migrations.ts. Two PRs both
claiming the same version resolve as: "second PR's v4 silently
no-ops on existing DBs" — a real silent-data-loss bug class
(PR colbymchenry#113's reviewer caught one).

After this refactor:

  Adding a new schema migration:
  1. Pick the next free 3-digit prefix (`git ls-files
     'src/db/migrations/[0-9]*.ts'` shows what's taken).
  2. Create `src/db/migrations/<NNN>-<short-name>.ts` exporting a
     `MIGRATION: MigrationModule` (description + up).
  3. Add one import line and one entry to
     `src/db/migrations/index.ts`'s REGISTERED_MODULES array.

Two PRs both creating `004-foo.ts` collide on the FILESYSTEM —
the maintainer sees it instantly. No more silent skipped
migrations.

## What's new

- `src/db/migrations/types.ts` — `MigrationModule { description,
  up }` and `Migration extends MigrationModule { version }`.
- `src/db/migrations/002-project-metadata.ts` — extracted v2
  body verbatim.
- `src/db/migrations/003-lower-name-index.ts` — extracted v3
  body verbatim.
- `src/db/migrations/index.ts` — central registry. Static-imports
  each migration, parses the version FROM THE FILENAME (no
  hand-typed version field that can drift), enforces strict
  `NNN-kebab-name.ts` shape, validates uniqueness/sort at module
  load (throws loudly on collision), exposes ALL_MIGRATIONS and
  CURRENT_SCHEMA_VERSION.
- `src/db/migrations.ts` — refactored to a thin runner. Same
  exported surface (CURRENT_SCHEMA_VERSION, getCurrentVersion,
  runMigrations, needsMigration, getPendingMigrations,
  getMigrationHistory, Migration type) — every existing import
  keeps working unchanged.
- `__tests__/migrations-registry.test.ts` — 8 invariant tests:
  registry non-empty, versions unique + strictly ascending,
  CURRENT_SCHEMA_VERSION matches max, every file matches the
  strict NNN-kebab-name pattern, no orphan files, no phantom
  registrations.

## Reviewer pass

Independent reviewer ran once. 3 REQUEST_CHANGES + 1 INFO addressed:

1. Hand-typed `version` field in REGISTERED_MODULES could drift
   from filename. **Fixed**: removed the version field; registry
   now parses version from filename via FILENAME_PATTERN regex
   inside validateRegistered.
2. Filename-pattern test was lenient (allowed 4-digit or 1-digit
   prefixes). **Fixed**: new "every migration file matches the
   strict NNN-kebab-name.ts pattern" test catches malformed
   filenames as orphan-detection-bypassing offenders.
3. `getPendingMigrations` returned `readonly Migration[]`,
   breaking callers that typed the result as `Migration[]`.
   **Fixed**: returns a fresh mutable array via `.slice()`.
4. No throw-on-duplicate test for validateRegistered (module
   evaluation timing). Acknowledged; not added.

## Backward compat

Every existing import works unchanged:
- `import { CURRENT_SCHEMA_VERSION } from './migrations'` ✓
- `import { runMigrations } from './migrations'` ✓
- `import { needsMigration } from './migrations'` ✓
- `import { getMigrationHistory } from './migrations'` ✓
- `import { getPendingMigrations } from './migrations'` — returns
   mutable Migration[] (preserved)
- `Migration` type — re-exported

## Affected open PRs

Every migration-touching PR (colbymchenry#102 UNIQUE edges, colbymchenry#105 cochange,
colbymchenry#108 perf db, colbymchenry#111 LLM features, my colbymchenry#112 centrality+churn, colbymchenry#113
issue-history, colbymchenry#114 config-refs, colbymchenry#115 sql-refs) currently
claims migration v4 and conflicts with each other on
`migrations.ts`. After this lands they each become:
- 1 new file: `src/db/migrations/<NNN>-<name>.ts`
- 2 lines in registry.ts (import + array entry)

Conflict shape changes from "next free version + array entry +
CURRENT_SCHEMA_VERSION bump in one file" (4-way conflict) to "1
new file" + 2-line registry edit. If two PRs target the same
NNN, the filesystem collision surfaces immediately — no silent
skipped migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today every PR adding a derived-signal pass (centrality, churn,
issue-history, config-refs, sql-refs, cochange) edits the same
3 spots in src/index.ts:

  1. New imports at the top
  2. New private method on `CodeGraph` (e.g. runDerivedSignals,
     runIssueHistoryPass, runConfigRefsPass, runSqlRefsPass)
  3. New call site in `indexAll` AFTER resolution
  4. New call site in `sync` AFTER resolution

5 PRs collide on every one of those.

After this refactor:

  Adding a new derived-signal pass:
  1. Create `src/index-hooks/<name>.ts` exporting a
     `HOOK: IndexHook` constant with `afterIndexAll` and/or
     `afterSync` methods.
  2. Add one import + one entry to
     `src/index-hooks/registry.ts`.

`CodeGraph.indexAll` and `sync` invoke the hook runner once;
adding a new pass touches only the hook file + the registry.
Zero changes to CodeGraph itself.

## What's new

- **src/index-hooks/types.ts** — `IndexHook` interface
  (`afterIndexAll`, `afterSync`, both optional), `IndexHookContext`
  (projectRoot + config + queries + db), and
  `IndexHookOutcome` for diagnostic reporting.
- **src/index-hooks/registry.ts** — static-import list of every
  registered hook (empty on main today; PRs adding hooks fill it
  in), plus the `runAfterIndexAll` / `runAfterSync` runners that
  iterate hooks and catch errors so one broken hook never fails
  indexing.
- **src/index.ts** — `indexAll` calls `runAfterIndexAll(ctx)`
  after resolution. `sync` calls `runAfterSync(ctx, result)`
  after resolution. New private `buildHookContext()` helper
  exposes a stable read-only context.
- **__tests__/index-hooks.test.ts** — 6 tests covering empty
  registry, runner shape, and the `afterIndexAll` / `afterSync`
  contracts.

## Why ship the framework on main with zero registered hooks?

The only consumers of this framework today are 5 unmerged PRs
(colbymchenry#105 cochange + my colbymchenry#112-colbymchenry#115). Landing the framework now lets
each of those PRs rebase to a 2-line change instead of 8-10
lines mutating CodeGraph adjacent-line. Without this, all 5 PRs
collide on the same indexAll/sync call sites.

The framework adds zero behavior on main (no registered hooks =
no-op runner). 380→386 tests confirm no regression.

## Affected open PRs

| PR | Today | After this lands |
|---|---|---|
| colbymchenry#105 cochange | runDerivedSignals helper + 2 call sites | 1 hook file in src/index-hooks/ + 2 lines in registry.ts |
| colbymchenry#112 centrality+churn | same shape | same shape |
| colbymchenry#113 issue-history | same shape | same shape |
| colbymchenry#114 config-refs | same shape | same shape |
| colbymchenry#115 sql-refs | same shape | same shape |

Each goes from "edit CodeGraph in 4 spots" to "drop a hook file."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efactors

Lands centrality (PageRank) and churn (git history) as registered
IndexHooks (`afterIndexAll` + `afterSync`) instead of CodeGraph
private methods. Adds:

- Migration 004: nodes.centrality + files.{commit_count,loc,
  first_seen_ts,last_touched_ts} + indexes
- src/centrality/ + src/churn/ (pure modules)
- src/index-hooks/centrality.ts + churn.ts (registered hooks)
- CodeGraph public methods: getCentrality, getTopCentralNodes,
  getCentralityRank, getFileChurn, getHotspots
- codegraph_hotspots MCP tool wired through ToolModule registry
  + handleHotspots on ToolHandler
- Updated regression-guard tests (index-hooks, mcp-tool-registry)
  to reflect newly registered hooks/tools

Tests: 440/440 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants