
perf(db): split embeddings + in-memory similarity cache + drop redundant co_changes index #2

Closed
andreinknv wants to merge 1 commit into
pr-111-rebasedfrom
perf/db-optimizations-llm

Conversation

@andreinknv (Owner)

Summary

Three measurable wins on the LLM-tier data layer, validated by spike before implementing each:

| Change | Win | Spike file |
| --- | --- | --- |
| F2: drop `idx_co_changes_a` | Covered by the `(file_a, file_b)` PK (analogous to PR #1) | (n/a) |
| G: split embeddings into `symbol_embeddings` table | 3.22× faster summary-only scans | `scripts/spikes/spike-embedding-split.mjs` |
| H: in-memory `EmbeddingCache` for similarity search | 4.4× faster top-K cosine | `scripts/spikes/spike-embedding-split.mjs` |

Note: This PR sits on top of colbymchenry#111 (LLM symbol summaries). The spike script reproducer covers G and H; F2 follows the same left-prefix-scan-covers-narrow-index pattern as PR #1 (perf/drop-redundant-edge-indexes).

Empirical validation

Run it yourself: node scripts/spikes/spike-embedding-split.mjs. Output on a 50K-summary / 768d-embedding synthetic DB:

--- Spike G: storage layout (inline vs split) ---

  inline DB: 196.5 MB
  split  DB: 204.0 MB

  Test: scan summaries by role (common path)
  inline: 46ms avg over 50 queries
  split : 14ms avg over 50 queries
  Δ summary-only: split is 3.22× faster

  Test: scan summaries WITH embedding (rare path)
  inline (single table)   : 71ms avg over 50 queries
  split  (join required)  : 80ms avg over 50 queries
  Δ summary+embedding: 1.12× cost penalty for split

--- Spike H: in-memory embedding cache ---

  cold (per-query SQLite fetch + decode): 104ms avg over 20 queries
  warm (in-memory Float32Array matrix)  : 24ms avg over 20 queries
  Δ similarity search: 4.4× speedup with in-memory cache

The 1.12× cost on summary+embedding scans is dwarfed by the 3.22× win on summary-only scans, which dominate by ~50× in real usage (FTS-anchor lookups, role filters, freshness checks all read summaries-without-embeddings).

What changes

F2 — Migration 015: drop idx_co_changes_a

  • co_changes has PRIMARY KEY (file_a, file_b), which automatically creates a B-tree leading on file_a. SQLite covers WHERE file_a = ? lookups via that PK index — the standalone idx_co_changes_a was redundant.
  • idx_co_changes_b (on file_b alone) is kept because the PK leads with file_a, so it cannot serve WHERE file_b = ? lookups.
  • Fresh-DB schema (src/db/schema.sql) updated to skip idx_co_changes_a and dedupe a pre-existing duplicate of idx_co_changes_b.
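The left-prefix argument above can be illustrated without SQLite: a composite index on `(file_a, file_b)` keeps rows sorted by `file_a` first, so every row for one `file_a` sits in a contiguous run reachable by binary search, while rows for one `file_b` are scattered. A minimal sketch (data and helper invented here for illustration):

```typescript
// Illustration only: rows as a composite index on (file_a, file_b) would
// store them -- sorted by file_a first, then file_b.
type Row = { file_a: string; file_b: string };

const index: Row[] = [
  { file_a: "a.ts", file_b: "b.ts" },
  { file_a: "a.ts", file_b: "c.ts" },
  { file_a: "b.ts", file_b: "a.ts" },
  { file_a: "c.ts", file_b: "a.ts" },
];

// Left-prefix scan: binary-search the first row whose file_a matches,
// then walk forward while the prefix still matches. This is why the PK
// already serves WHERE file_a = ? but cannot serve WHERE file_b = ?.
function prefixScan(rows: Row[], key: string): Row[] {
  let lo = 0, hi = rows.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (rows[mid].file_a < key) lo = mid + 1; else hi = mid;
  }
  const out: Row[] = [];
  for (let i = lo; i < rows.length && rows[i].file_a === key; i++) out.push(rows[i]);
  return out;
}
```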

G — Migration 016: split embeddings into symbol_embeddings

CREATE TABLE symbol_embeddings (
    node_id TEXT PRIMARY KEY,
    embedding BLOB NOT NULL,
    embedding_model TEXT NOT NULL,
    FOREIGN KEY (node_id) REFERENCES symbol_summaries(node_id) ON DELETE CASCADE
);
INSERT OR IGNORE INTO symbol_embeddings (node_id, embedding, embedding_model)
  SELECT node_id, embedding, embedding_model
  FROM symbol_summaries
  WHERE embedding IS NOT NULL AND embedding_model IS NOT NULL;
DROP INDEX IF EXISTS idx_summaries_embedding_model;
ALTER TABLE symbol_summaries DROP COLUMN embedding;
ALTER TABLE symbol_summaries DROP COLUMN embedding_model;
  • Requires SQLite 3.35+ for ALTER TABLE DROP COLUMN. Both better-sqlite3 and node-sqlite3-wasm ship with newer versions, so this is safe.
  • queries.ts methods (getEmbeddableSummaries, getAllEmbeddings, upsertSymbolEmbedding) updated to use the new table.
  • clear() and clearCoChanges() extended to wipe both tables (the FK cascade would handle it, but explicit is safer if foreign-key enforcement gets disabled).

H — EmbeddingCache in src/llm/embeddings.ts

export class EmbeddingCache {
  get(fetcher: EmbeddingFetcher, model: string): CachedEmbeddings;
  invalidate(): void;
}
  • Decodes every embedding into a flat Float32Array matrix once per (model, generation).
  • New topKByCosineMatrix(query, matrix, ids, dim, k) operates directly on the flat layout.
  • Owned by the CodeGraph instance.
  • Invalidated on:
    • indexAll end (when filesIndexed > 0)
    • sync end (when files added/modified/removed)
    • After embedAllSummaries (when generated > 0)
    • clear() (table was emptied)
  • Mismatched-dim rows are skipped on rebuild — produces a packed matrix with no sparse holes.
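The flat-layout idea behind `topKByCosineMatrix` can be sketched as follows (this is a simplified stand-in, not the PR's implementation): row `i` of the matrix occupies `[i*dim, (i+1)*dim)` in one `Float32Array`, so a similarity pass is a single tight loop with no per-row BLOB decode.

```typescript
// Sketch: top-K cosine similarity over a packed Float32Array matrix.
function topKByCosineFlat(
  query: Float32Array,
  matrix: Float32Array,
  ids: string[],
  dim: number,
  k: number,
): { id: string; score: number }[] {
  let qNorm = 0;
  for (let d = 0; d < dim; d++) qNorm += query[d] * query[d];
  qNorm = Math.sqrt(qNorm) || 1;

  const scored: { id: string; score: number }[] = [];
  for (let i = 0; i < ids.length; i++) {
    const base = i * dim; // row i starts here in the flat layout
    let dot = 0, norm = 0;
    for (let d = 0; d < dim; d++) {
      const v = matrix[base + d];
      dot += query[d] * v;
      norm += v * v;
    }
    scored.push({ id: ids[i], score: dot / (qNorm * (Math.sqrt(norm) || 1)) });
  }
  return scored.sort((a, b) => b.score - a.score).slice(0, k);
}
```

Because `ids[i]` must line up with row `i`, skipping mismatched-dim rows during rebuild (rather than leaving holes) is what keeps this loop correct.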

Independent review

Reviewed by an independent reviewer agent (read-only, fresh context). Surfaced 3 issues, all addressed in the same diff:

  1. CodeGraph.clear() was missing a cache invalidation — fixed: added this.embeddingCache.invalidate().
  2. Duplicate idx_co_changes_b in schema.sql (pre-existing, harmless due to IF NOT EXISTS) — fixed: deduped.
  3. EmbeddingCache left sparse undefined holes in ids on dim-mismatch — fixed: rewrote to push aligned entries instead of pre-allocating, with a regression test.

Test plan

  • npx tsc --noEmit clean
  • npx vitest run: 794/794 tests pass
  • New tests: __tests__/migrations-015-016.test.ts (upgrade + fresh-DB paths for both migrations)
  • New tests in __tests__/embeddings.test.ts:
    • topKByCosineMatrix matches topKByCosine on the same data
    • EmbeddingCache hit/miss/invalidate
    • Cache returns empty result without re-fetching
    • Cache skips dim-mismatched rows
  • Spike script (scripts/spikes/spike-embedding-split.mjs) reproduces the headline numbers

Files changed

| File | Change |
| --- | --- |
| `src/db/migrations/015-prune-co-changes-index.ts` | New: drop `idx_co_changes_a` |
| `src/db/migrations/016-split-symbol-embeddings.ts` | New: split embeddings into dedicated table |
| `src/db/migrations/index.ts` | Register both new migrations |
| `src/db/schema.sql` | Fresh-DB schema reflects new layout; dedupe `idx_co_changes_b` |
| `src/db/queries.ts` | `getAllEmbeddings`, `upsertSymbolEmbedding`, `getEmbeddableSummaries` use `symbol_embeddings` |
| `src/llm/embeddings.ts` | `EmbeddingCache` + `topKByCosineMatrix` |
| `src/index.ts` | Cache integrated into `searchHybrid` and `findSimilar`; invalidations wired |
| `__tests__/migrations-015-016.test.ts` | New: focused migration tests |
| `__tests__/embeddings.test.ts` | New: cache + matrix tests |
| `__tests__/foundation.test.ts` | Updated schema-version expectation |
| `__tests__/pr19-improvements.test.ts` | Updated schema-version expectation |
| `scripts/spikes/spike-embedding-split.mjs` | New: G + H reproducer |

🤖 Generated with Claude Code

…emory similarity cache

Three independently measurable wins on the LLM-tier data layer,
all validated up-front via spike before implementation
(scripts/spikes/spike-embedding-split.mjs).

F2 - drop idx_co_changes_a (migration 015)
  The (file_a, file_b) PRIMARY KEY index already covers
  WHERE file_a = ? via SQLite left-prefix scan, so the narrow
  idx_co_changes_a was dead weight. idx_co_changes_b (on file_b
  alone) is kept because the PK leads with file_a.

G - split embeddings into a dedicated table (migration 016)
  Moves the 768-dim Float32 BLOB out of symbol_summaries into a
  new symbol_embeddings table with FK + ON DELETE CASCADE.
  Spike measurement on a 50K-summary synthetic DB:
    - Summary-only scan (common path): 3.22x faster (46ms -> 14ms)
    - Summary+embedding scan (rare path): 1.12x cost penalty
    - DB size: ~4% larger (separate page chain)
  Net positive: the common path dominates real usage.

H - in-memory EmbeddingCache for similarity search
  EmbeddingCache decodes every embedding into a flat Float32Array
  matrix once and reuses it across queries. topKByCosineMatrix
  operates directly on the flat layout. Cache is invalidated on
  indexAll, sync, embedAllSummaries (when generated > 0), and
  clear() - anywhere new vectors land or the table is emptied.
    - Cold (per-query SQLite fetch + decode): 104ms avg
    - Warm (in-memory matrix): 24ms avg
    - 4.4x speedup with cache

Also fixes a pre-existing schema.sql inconsistency where
idx_co_changes_b was declared twice (harmless thanks to
IF NOT EXISTS, but confusing).

Test coverage:
  - __tests__/migrations-015-016.test.ts: upgrade-path and
    fresh-DB behavior for both new migrations.
  - __tests__/embeddings.test.ts: topKByCosineMatrix matches
    topKByCosine; EmbeddingCache hit/miss/invalidate/dim-mismatch.

794/794 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@andreinknv (Owner, Author)

Superseded by upstream PR colbymchenry#123

@andreinknv andreinknv closed this Apr 28, 2026
andreinknv added a commit that referenced this pull request Apr 30, 2026
Two fixes surfaced while verifying b6cef74 against the live ollama and
codegraph indexes:

1. buildReviewContext spread clobbered DEFAULTS with undefined.
   { ...DEFAULTS, ...options } with options.maxCoChangeWarnings set to
   undefined (the shape MCP forwards when the caller did not override)
   set opts.maxCoChangeWarnings to undefined. undefined > 0 is false,
   so the co-change loop was skipped entirely — coChangeWarnings: [] on
   every default review_context call. Same shape silently disabled the
   jaccard threshold and uncapped callers/callees. Replaced the spread
   with a manual merge that ignores undefined values.

2. searchHybrid only diversified the strict no-embeddings-backend path.
   When embeddings are configured but the cache is empty (the common
   summarize:false state, which both project configs use), the call
   fell through to ftsResults.slice(0, limit) without diversification —
   letting six New constructors flood a 12-result budget. Same issue on
   the embed-call-failed and empty-vec-result paths. Routed all three
   FTS-only fallbacks through diversifyByName.
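The spread-clobbering bug in fix #1 above reduces to a one-liner: `{ ...DEFAULTS, ...options }` lets an explicitly-present `undefined` overwrite a default. A minimal sketch of the manual-merge fix (option names simplified; only `maxCoChangeWarnings` and the jaccard threshold are taken from the commit text):

```typescript
interface ReviewOptions { maxCoChangeWarnings?: number; jaccardThreshold?: number }

const DEFAULTS: Required<ReviewOptions> = { maxCoChangeWarnings: 5, jaccardThreshold: 0.3 };

// Manual merge that ignores undefined values, so an MCP-forwarded
// { maxCoChangeWarnings: undefined } no longer disables the defaults.
function mergeOptions(options: ReviewOptions): Required<ReviewOptions> {
  const merged = { ...DEFAULTS };
  for (const key of Object.keys(options) as (keyof ReviewOptions)[]) {
    const value = options[key];
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}
```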

Regression tests in stress-test-roundtwo-fixes.test.ts exercise both
paths through the real surface (MCP-shape options for #1, configured-
but-unpopulated embeddings for #2). Suite: 1074 / 13 skip / 0 fail.
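The intent of routing FTS-only fallbacks through `diversifyByName` (fix #2 above) can be sketched as a per-name cap on results. The cap value and shapes below are invented for illustration; the real `diversifyByName` may differ:

```typescript
interface Hit { name: string; file: string }

// Cap how many results share one symbol name, so e.g. six identical
// `New` constructors cannot flood a 12-result budget.
function diversifyByName(hits: Hit[], limit: number, perName = 2): Hit[] {
  const seen = new Map<string, number>();
  const out: Hit[] = [];
  for (const hit of hits) {
    const count = seen.get(hit.name) ?? 0;
    if (count >= perName) continue; // over the per-name cap: skip
    seen.set(hit.name, count + 1);
    out.push(hit);
    if (out.length === limit) break;
  }
  return out;
}
```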

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request Apr 30, 2026
End-to-end measurement of the cache-preservation work uncovered a
real cost/quality tradeoff: bulk summary work runs 2.6x faster on a
local MLX-served model (Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ via
mlx-openai-server) than on claude-bridge -> Haiku, with comparable
quality. But the same Qwen3-Coder model confidently confabulates
WRONG answers on synthesis-heavy ask queries — got the PRAGMA
foreign_keys direction reversed and cited the wrong migration
filename when asked about the cache-preservation mechanism. Sonnet
via claude-bridge gave precisely correct answers, but takes 12-40s.

The pre-existing schema (chat.askModel: string) only allowed swapping
the model id within the SAME provider/endpoint. To route bulk and
ask through entirely DIFFERENT providers, this PR adds a top-level
`askChat` block.

## Changes

### Schema (src/types.ts, src/llm/client.ts)

New optional `llm.askChat` block mirroring `chat`:

    "chat":    { "provider": "openai-compat",
                 "endpoint": "http://localhost:8081/v1",
                 "model": "qwen3-coder" },
    "askChat": { "provider": "claude-bridge",
                 "model": "claude-sonnet-4-6" }

`LlmEndpointConfig.askChat?: ChatProviderConfig | null` parallels the
existing `chat` slot. `normalizeEndpointConfig` returns the new field
alongside chat + embeddings.

### Resolver (src/llm/provider.ts)

New `resolveAskChat()` mirrors `resolveChat()` but never reads legacy
flat fields — askChat is opt-in only. Wired into `resolveLlmProviders`
so the resulting `ResolvedLlm` carries both `chat` and `askChat` slots.

New `getAskModel()` helper with cascade:
- askChat.model -> chat.askModel -> chat.model -> chatModel (legacy)
This is the right primitive for any caller that wants to know which
model id will actually answer an ask/dead-code call.
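The cascade reads naturally as a chain of nullish fallbacks. A sketch under simplified config shapes (the real `ResolvedLlm` types are richer):

```typescript
interface LlmConfig {
  askChat?: { model?: string } | null;
  chat?: { model?: string; askModel?: string } | null;
  chatModel?: string; // legacy flat field
}

// askChat.model -> chat.askModel -> chat.model -> chatModel (legacy)
function getAskModel(cfg: LlmConfig): string | undefined {
  return (
    cfg.askChat?.model ??
    cfg.chat?.askModel ??
    cfg.chat?.model ??
    cfg.chatModel
  );
}
```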

### Client routing (src/llm/client.ts)

`LlmClient` now lazy-loads TWO backends instead of one:
- `chatBackend` — for bulk work (summaries, classifier, dir summaries).
- `askChatBackend` — for ask + dead-code judge, only when askChat is
  configured.

`LlmClient.chat(messages, { useAskModel: true })` checks `askChatCfg`:
- If set → dispatch to ask backend with `useAskModel: false` (the ask
  backend's own `model` field is already the right model; no re-swap
  needed within that backend).
- If unset → dispatch to chat backend with `useAskModel` passed
  through (legacy single-provider behaviour: same backend, swap model
  id via `chat.askModel`).
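The dispatch rule above can be condensed to a few lines. A sketch with backends reduced to plain functions (the real `LlmClient` lazy-loads and constructs them):

```typescript
// A backend takes useAskModel and returns the model id it would call.
type Backend = (useAskModel: boolean) => string;

function route(
  chatBackend: Backend,
  askChatBackend: Backend | null,
  useAskModel: boolean,
): string {
  // Split-provider: the ask backend's own `model` is already right,
  // so no re-swap inside that backend (useAskModel = false).
  if (useAskModel && askChatBackend) return askChatBackend(false);
  // Legacy single-provider: same backend, swap model id via chat.askModel.
  return chatBackend(useAskModel);
}
```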

`instantiateBackend` factored out from `getChatBackend` so both paths
share construction logic.

`isReachable()` now also probes the askChat backend when configured —
returns false if EITHER chat OR askChat fails, so status output never
claims reachable when ask calls would throw at runtime (e.g.
claude-bridge binary missing on the ask side).

### Display + gating (src/bin/codegraph.ts, src/mcp/tools.ts)

- `codegraph status` shows a separate `Ask model:` line when ask
  routes differently from chat. Provider hint in parentheses
  (`Ask model: claude-sonnet-4-6 (claude-bridge)`) appears only in
  true split-provider setups, not for single-provider askModel
  overrides.
- `codegraph ask` CLI trailer now uses getAskModel for the displayed
  model id so the printed `model X` matches what actually generated
  the answer.
- handleAsk and handleDeadCode in the MCP tools now gate on
  getAskModel rather than getChatModel, with error messages updated
  to mention the askChat block as an alternative configuration path.
  Pre-fix, a config with chat=null but askChat configured (rare but
  legitimate) would have failed the gate even though ask was
  perfectly configured.

### Tests (__tests__/llm.test.ts)

Two new tests:

1. `split provider: useAskModel routes to askChat backend when
   configured` — uses TWO fake servers, asserts bulk hits chat server
   only and ask hits askChat server only. Captures and asserts the
   `model` field in the request body of each, guarding against a
   regression where routing is right but the model id sent is wrong.

2. `legacy single-provider: useAskModel stays on chat backend, swaps
   model id` — asserts no-askChat config preserves prior behaviour
   (single backend handles both calls, model id swaps).

`FakeServer` extended with `lastChatBody` capture so tests can assert
which model id reached which server.

## Live validation

Tested on the codegraph self-repo with chat -> MLX/Qwen3-Coder and
askChat -> claude-bridge/Sonnet. Same ask question that Qwen3-Coder
confabulated about now returns Sonnet's precisely correct answer:
- Pre-split: 6.9s, wrong on PRAGMA foreign_keys direction, cited
  migration 014 (unrelated).
- Post-split: 39.5s, precisely correct including line numbers
  (`summarizer.ts:189-197`, `queries.ts:2125`) and the right
  migration filename (022-add-content-hash-index).

## Backwards compatibility

Existing configs with just `chat: { ..., askModel: "..." }` continue
to work unchanged — askChat is optional. Test #2 covers this path.

## Reviewer trail

Two passes. Pass 1: REQUEST_CHANGES + 4 findings.
- (1) handleAsk/handleDeadCode used getChatModel — addressed,
  switched to getAskModel with updated error messages.
- (2) status display omitted ask model — addressed, added
  conditional Ask model line in split-provider setup.
- (3) isReachable didn't probe askChat — addressed.
- (4) split-provider test didn't assert request-body model id —
  addressed.

Pass 2: APPROVE + 2 info findings (stale test comment, redundant
provider hint for single-provider askModel override) — both
addressed.

Suite: 1089 / 13 skip / 0 fail (was 1087).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 1, 2026
… docs

Three reviewer-flagged improvements on 6d0e7a2 (the WASM-visibility
commit). All informational items from that review — no functional
regression, no scope drift, suite stays at 1117/13/0.

## #1 — backend is now per-DatabaseConnection, not a process global

`createDatabase` previously set a module-level `activeBackend` and
exposed it via `getActiveBackend()`. In the MCP cross-project path
(handleStatus called against `projectPath` opens a SECOND DB via
openSync) the global reflected whichever DB was opened most recently,
not necessarily the one whose stats were just rendered. Benign in
practice (every DB in a process resolves the same backend) but
structurally imprecise.

Refactor:
- `createDatabase(dbPath)` now returns `{db, backend}` instead of a
  bare `SqliteDatabase`. Caller stores both.
- `DatabaseConnection` carries a `private backend: SqliteBackend` and
  exposes `getBackend()`.
- `CodeGraph.getBackend()` delegates — that's the public surface.
- CLI `codegraph status` and MCP `handleStatus` both call
  `cg.getBackend()` instead of the global. The global is removed.

Two pre-existing tests (`migrations-015-016`, `migrations-022`) that
called `createDatabase` directly now destructure `{db: adapter}`.

## #2 — fix recipe deduplicated across the two code surfaces

The `xcode-select` / `npm rebuild` / `npm install --save` recipe
appeared inline in both `buildWasmFallbackBanner` (sqlite-adapter.ts)
and the MCP `handleStatus` formatter (mcp/tools.ts). New
`WASM_FALLBACK_FIX_RECIPE` constant in sqlite-adapter.ts is the
single source for the one-line summary; the MCP formatter
interpolates it. The banner formats the same content multi-line for
the stderr surface. README is intentionally separate (different
audience, different rendering).

## #3 — README troubleshooting now covers Linux

Section title renamed "Indexing is slow on macOS / WASM fallback" ->
"Indexing is slow / WASM fallback active". New code block lists fix
steps for macOS, Debian/Ubuntu, RHEL/Fedora, and the cross-platform
`npm install --save` escape hatch. The banner stderr block also
gained the Linux equivalent for symmetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 1, 2026
Both items came back APPROVE-with-info on the prior review and are
pure cleanup — no behavior change, no API surface change.

## #1 — RHEL/Fedora step in WASM_FALLBACK_FIX_RECIPE constant

The constant in `src/db/sqlite-adapter.ts` is documented as the
"single source of truth" for the fix recipe shown in the MCP
`Backend:` line, but it only listed macOS + Debian/Ubuntu paths. The
multi-line `buildWasmFallbackBanner` and the README both already
include the `yum groupinstall "Development Tools"` step for
RHEL/Fedora; the constant was the lone surface missing it. Now
appended so MCP-displayed guidance matches the other two surfaces
on every supported platform.

## #2 — hoist inline `import('./db').SqliteBackend` to top-level

`CodeGraph.getBackend()` was the only method in `src/index.ts` using
the inline-import type form. Adding `SqliteBackend` to the existing
`import { DatabaseConnection, getDatabasePath } from './db'` keeps
the file consistent. No circular-import risk since `./db` already
re-exports the type from `./db/sqlite-adapter`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 1, 2026
… server-config flags

Tooling-gap backlog (codegraph/docs/codegraph-tooling-gaps.md) closed:
  #1 freshness severity bucket — `classifyFreshness` with fresh|recent|stale|very_stale
  #2 allowStale flag — opt-in bypass for the heavy-drift gate, registry-injected schema
  #3 module format in status — `module-format.ts` parses package.json + tsconfig (JSONC-safe)
  #4 codegraph_imports tool + import-classifier — file/directory/bare/unresolvable filters
  #5 dynamic imports — extractor catches `import('…')` + `require('…')`, incl. template_string
  #6 build-context refs — new `build_context_refs` table for `__dirname` / `import.meta.*`
  #7 files.is_test flag — column populated by glob; surfaced in status as `(N test)`
  colbymchenry#11 summarize-also-embeds (discovered while dogfooding) — `cg.summarizeAll()` chains
       `embedAllSummaries`; new `cg.embedAll()` for embed-only path; CLI `codegraph embed`
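The severity buckets in #1 can be sketched as below. Thresholds are invented here (the real `classifyFreshness` inputs and cutoffs live in the PR); the one documented constraint, from a later reviewer fix, is that very_stale requires the index to actually be out of sync:

```typescript
type Freshness = "fresh" | "recent" | "stale" | "very_stale";

// Hypothetical bucket logic: an in-sync index can only be fresh or
// recent, however old it is; very_stale additionally requires isStale.
function classifyFreshness(ageDays: number, isStale: boolean): Freshness {
  if (!isStale) return ageDays <= 1 ? "fresh" : "recent";
  return ageDays > 30 ? "very_stale" : "stale";
}
```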

CLI/MCP alignment (5/32 → 33+/35):
  - 13 new CLI commands via `runViaMCP` shim: callers, callees, impact, node, similar,
    biomarkers, imports, help-tools, explore, hotspots, dead-code, config-refs, sql-refs,
    module-summary, role, coverage-query, pending-summaries, save-summaries, review-context
  - 7 new MCP tools: codegraph_imports, codegraph_embed, codegraph_summarize, codegraph_sync,
    codegraph_reindex, codegraph_coverage_ingest, codegraph_init, codegraph_uninit,
    codegraph_unlock, codegraph_affected

MCP server-level operator config (`codegraph serve --mcp`):
  - --no-write-tools / --allow-stale-default / --disable-tool (sandboxing)
  - --llm-endpoint / --llm-chat-model / --llm-ask-model / --llm-embedding-model /
    --llm-api-key (operator LLM config; per-project config wins on conflict)
  - New CODEGRAPH_LLM_* env vars wired through `mergeLlmEnv` in resolveLlmProviders

Architectural cleanups:
  - `bypassFreshnessGate` and `isWriteTool` declarative flags on ToolModule (replaces
    growing string-comparison chain in execute())
  - `withAllowStale` registry injection only on tools that DO see the gate
  - DRY of inline copy-paste in 3 hooks → `src/index-hooks/enclosing.ts`
  - `LlmClient.isEmbeddingReachable` for split-provider correctness
  - SyncResult `lockContention` flag → handleSync emits distinct retryable message
  - `clearStructural` deletes from build_context_refs (was orphan-leaking on --force)
  - cli:dev npm script + tsx CLI fixed (web-tree-sitter `import type` for type-only refs)

Migrations:
  023-files-is-test.ts — add `files.is_test`
  024-build-context-refs.ts — add `build_context_refs` table

Reviewer rounds: 11 total, all REQUEST_CHANGES addressed inline. Notable fixes:
  - JSONC URL strip via state machine (was eating `https://` tails)
  - classifyFreshness very_stale now requires isStale (in-sync-but-old → recent)
  - Dynamic imports also match template_string nodes
  - process.exit deferred until after finally cleanup in runViaMCP
  - --same-language / --different-language mutual exclusion guard
  - help-tools CLI bypasses isInitialized (works without a project)
  - handleUninit sweeps projectCache by getProjectRoot (no dangling alias leaks)
  - handleAffected errors instead of silently dropping unsupported glob filters
  - mergeLlmEnv preserves precedence: legacy flat config wins over env-synthesised block

Suite: 1268 passing, 1 expected red (colbymchenry#8 — undecided), 13 skipped, 1 todo, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
upsertSymbolSummary and upsertSymbolEmbedding fired SqliteError:
FOREIGN KEY constraint failed when a long-running write — the
summarizer mid-LLM-call, the embedder mid-batch, the MCP server
mid-tool — held a node id across a sync that deleted the symbol.
Repro: capture id, delete the source file, run sync, attempt the
upsert. Live race for the MCP path because the freshness gate
auto-syncs before each tool but in-flight LLM/embedding requests
straddle that boundary.

The fix
-------
Atomic SQL guard at the upsert layer. Each upsert switches from
`INSERT ... VALUES` to `INSERT ... SELECT ... WHERE EXISTS (...)`
gated on the FK target — `nodes` for the summary, `symbol_summaries`
for the embedding (the actual FK target; nodes-cascade keeps the
two in sync). When the parent is gone the SELECT yields zero rows,
the INSERT does nothing, ON CONFLICT remains the re-upsert path
when the parent IS present. Single-statement, race-free even
cross-process — there's no SELECT-then-INSERT window for a
concurrent delete to slip into.
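The semantics of the guard (the real fix is a single `INSERT ... SELECT ... WHERE EXISTS` SQL statement; this in-memory analogue only models its observable behavior) come down to: write only if the FK parent still exists, and report whether a write happened.

```typescript
// In-memory analogue of the WHERE EXISTS guard.
function guardedUpsert(
  parents: Set<string>,      // stands in for the FK target table
  table: Map<string, string>, // stands in for symbol_summaries / symbol_embeddings
  nodeId: string,
  value: string,
): boolean {
  if (!parents.has(nodeId)) return false; // parent deleted mid-flight: no-op
  table.set(nodeId, value);               // insert, or ON CONFLICT-style replace
  return true;                            // caller can keep counters accurate
}
```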

Both helpers now return `boolean` (true wrote, false skipped) so
callers can keep counters accurate. Summarizer + embedder no longer
increment `generated`/`cacheHits` on stale-skip; the existing
`skipped = candidates - generated - errors - cacheHits` derivation
absorbs them naturally without polluting `errors` (which is reserved
for real LLM/network failures). saveAgentSummaries reports a
mid-write disappearance via the existing skipped/errors trail.

Tests
-----
__tests__/fk-stale-handle.test.ts — 4 cases: stale-id summary upsert
no-ops (was FK-failing), stale-id embedding upsert no-ops, normal
write still works for both. Confirmed the first test fails with
SqliteError: FOREIGN KEY constraint failed on the unfixed code by
temp-reverting the WHERE EXISTS clause.

Suite: 1365/13/0 (was 1361/13/0; +4 new). tsc clean.

Reviewer pass found two issues, both addressed before commit:
- agent-bridge.ts was discarding the new boolean — counters could
  overcount on cross-process race; now captures + reports skip.
- The first draft incremented `errors` on stale-skip; reviewer
  flagged the semantic overload (errors is for real failures).
  Resolved by not incrementing any outcome counter — the derived
  `skipped` metric already covers it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
Second commit of the trace-logging arc. Wraps every
toolHandler.execute() call in a TraceLogger.log() so the call
flows into the mcp_tool_calls table that landed in 027139b.
The viewer's Agent-trace tab (commit C) reads it back.

Lifecycle:
- TraceLogger is created lazily once `cg` is open (via
  ensureTraceLogger() called on each dispatch). Skipped entirely
  when --no-write-tools is passed: the spirit of that flag is "no
  DB writes", and trace logging is a write path.
- log() is contractually best-effort (DB failures swallowed at
  debug). The dispatch site additionally wraps it in try/finally
  so even a contract violation can never strand the tool result —
  sendResult always fires.

Reviewer-memo gates passed:
- #1 docstring rot: ensureTraceLogger and the dispatch comments
  match the implementation.
- #2 best-effort claim is upheld at both layers (logger try/catch
  + caller try/finally).
- #5 N/A — this commit doesn't add tables.

The 3-line wiring isn't separately tested; TraceLogger has 10
round-trip tests covering every behavior the wiring composes,
and the wiring has no branches. Suite 1422/34/0 unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 3, 2026
codegraph_node now always emits **Lines: N** when the indexed range is
known, and a new detail: 'preview' | 'full' arg controls how a fetched
body is rendered. Default preview truncates bodies >40 lines to the
first 30 lines plus a tail marker that names the override; full
preserves the prior verbatim behavior.
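The preview rule stated above is small enough to sketch directly (the tail-marker wording here is an assumption; only the 40-line threshold, 30-line prefix, and detail arg come from the commit):

```typescript
// detail: 'preview' truncates bodies over 40 lines to the first 30 lines
// plus a tail marker naming the override; 'full' is verbatim.
function renderBody(body: string, detail: "preview" | "full"): string {
  const lines = body.split("\n");
  if (detail === "full" || lines.length <= 40) return body;
  const tail = `… (${lines.length - 30} more lines; pass detail: 'full' for all ${lines.length})`;
  return [...lines.slice(0, 30), tail].join("\n");
}
```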

Lets agents skip a code: true round-trip when **Lines:** alone is
enough, and bounds the response when fetching a large symbol —
backlog #2 (smart source-snippet truncation). CLI mirror gains
--detail to match.

Reviewer-flagged: tail-marker total now prefers locFromRange over
the body's split count so the two numbers always agree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 4, 2026
Resumes the QueryBuilder → per-domain queries-*.ts migration that was
partially done across earlier waves. This batch handles the centrality
domain.

Extracted to `src/db/queries-centrality.ts`:
- `applyCentralityScores(qb, scores)` — wraps the per-row UPDATE in a
  single transaction + clears the node cache.
- `clearCentrality(qb)` — UPDATE nodes SET centrality = NULL + clears
  cache.

Both follow the established pattern (free function taking `qb`; uses
`qb.db` directly + `qb.clearCache()` for cache invalidation).

Removed entirely (per reviewer-memo §7 / dead exports):
- `getTopNodesByCentrality(opts)` — added speculatively, zero in-tree
  callers.
- `getCentralityRank(nodeId)` — same. Both can be re-added with a
  concrete caller in the same diff if a future feature needs them.

Caller updated:
- `src/index-hooks/centrality.ts` switches from
  `ctx.queries.applyCentralityScores(...)` / `ctx.queries.clearCentrality()`
  to the imported free functions.

QueryBuilder shrinks by 4 methods. Suite 1634/34/0 unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 4, 2026
Continues the QueryBuilder → per-domain queries-*.ts migration.
Metadata domain (4 methods, ~25 call sites across 9 files) ships in
this commit.

Extracted to `src/db/queries-metadata.ts`:
- `getMetadata(qb, key)` — read project_metadata
- `setMetadata(qb, key, value)` — upsert
- `getAllMetadata(qb)` — full snapshot
- `getStaleArtifactsCount(qb)` — derived rollup of summary/embedding/
  finding rows whose source_content_hash is behind the file's current
  content_hash. Used by the freshness gate.

All four are pure DB ops on `project_metadata` / joined views — no
per-instance caches, no shared statement slots — so the extraction is
mechanical.

Caller updates (9 src files + 6 test files):
- src/freshness.ts (2 sites)
- src/biomarkers/index.ts (2 sites)
- src/mcp/tools/status.ts (1 site)
- src/viewer/server.ts (2 sites)
- src/index-hooks/{churn,centrality,cochange,issue-history}.ts (12 sites)
- src/extraction/index.ts (7 sites)
- __tests__/{cochange,freshness,freshness-stress,freshness-v2-stress,
  issue-history,churn}.test.ts (8 sites)

QueryBuilder shrinks by 4 methods. No shims kept (the migration goal
is to actually shrink the class, not paper it over). Suite 1634/34/0
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…l failure

Stress test surfaced 224 nodes in the live index reporting as
"embed errors". Investigation revealed only 7 of those had inputs
exceeding the user's `llama-server --batch-size 512` cap — the
other 217 were collateral damage. When a batch of 32 contained one
or two over-length inputs, the WHOLE batch failed and every row
in it was marked `errored`. So a 7-row capability miss inflated
into a 224-row "errors" report and 217 healthy rows never got
embedded.

Fix: when `client.embed(batch)` fails with an "input too large /
exceeds batch size / context length" server error, fall back to
per-row retry. Each row is embedded individually; rows that still
fail get counted as `skipped` (NEW counter, separate from
`errors`) since they're a server-capability miss the user can fix
by increasing the embed server's --batch-size, not a pipeline
failure. Rows that succeed in the per-row pass embed normally
through `upsertSymbolEmbedding`, including the new
`summary_hash_at_embed` value from migration 035.

Result on the live codegraph index:
- Before:  0 generated / 224 errors / 0 skipped (ambiguous)
- After:   355 generated / 0 errors / 7 skipped (precise)

Also updated counter accuracy per the reviewer-memo's recurring
scrutiny area #2: "errors" no longer bumps for server-capability
misses; the new `skipped` counter covers those, and the CLI / MCP
embed-tool output surfaces an actionable hint ("increase
--batch-size on the server to embed these"). EmbedResult /
EmbedResultRow types gain `skipped: number`.

`isInputTooLargeError` regex matches the llama.cpp variant
(`too large to process`) plus generic shapes (`exceeds batch
size`, `input length exceeds`, `context length`) so other
server backends with similar caps trip the same fallback.

+1 test in embed-all-nodes.test.ts: the new size-capped fake
server returns the exact llama.cpp error shape; the test asserts
`errors=0`, `skipped>0`, `generated>0`, and
`candidates == generated + skipped`. Suite 1727/34/0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…arch arc)

Three additions inspired by the 2026-05-08 code-KG paper sweep
(CodeRAG, GraphCodeAgent, GraphGen4Code, Maarleveld GNN survey,
CodexGraph, FalkorDB). Each verified absent before implementation;
all opt-in to keep default structural traversals unchanged.

#1 — In-graph similar_to edges (similarity as graph hops)
  - New EdgeKind similar_to (confidence=INFERRED, metadata.score)
  - buildSimilarToEdges reuses findSimilarViaVec; delete+insert are
    wrapped in db.transaction so the replacement is atomic
  - Migration 036 partial-index on edges(source,kind) WHERE
    kind=similar_to — guarded by sqlite_master existence check for
    the pre-016 hand-rolled migration tests
  - Surfaced as codegraph_admin({action: build-similarity-edges})
    and CLI codegraph admin build-similarity-edges --k --min-score
  - EXCLUDED_EDGE_KINDS keeps it out of default traversals; explicit
    edgeKinds bypass the filter

#2 — mode=intent search over symbol_summaries
  - FTS5 virtual table summary_fts (porter unicode61, mirrors nodes_fts)
    + INSERT/UPDATE/DELETE triggers on symbol_summaries
  - Migration 037 + parallel schema.sql entry for fresh-init path
  - bm25-ranked, optional kind/language/pathFilter filters
  - pathFilter LIKE uses canonical backslash-escape pattern matching
    queries-findings.ts:227-232 (no _/% injection)
  - Refuse-when-empty error points at codegraph summarize
  - FTS5 query parse errors caught and re-surfaced as a clear
    syntax-error message
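The backslash-escape pattern for `pathFilter` can be sketched as follows; the helper names are hypothetical, but the escaping convention matches the one referenced from queries-findings.ts:

```typescript
// Escape the escape char first, then the SQL LIKE wildcards, so a raw
// path containing `_` or `%` can't act as a wildcard injection.
function escapeLikePattern(raw: string): string {
  return raw.replace(/\\/g, "\\\\").replace(/[%_]/g, (ch) => "\\" + ch);
}

// Used with: WHERE file_path LIKE ? ESCAPE '\'
function buildPathFilterParam(pathFilter: string): string {
  return `%${escapeLikePattern(pathFilter)}%`;
}
```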

#3 — Intra-procedural def_use edges (TS/JS/TSX/JSX)
  - New EdgeKind def_use as self-loops on function/method nodes;
    metadata carries name, defLine, useLines
  - Scope-bounded extractor in src/extraction/def-use.ts, called
    from both tsExtractFunction and tsExtractMethod
  - Skips parameters (function inputs), fields (covered by
    field_access), nested-scope vars (belong to inner function set)
  - EXCLUDED_EDGE_KINDS opt-in; no traversal helper assumes
    source != target

Schema-version assertions bumped 35 → 37 in foundation.test.ts and
pr19-improvements.test.ts.

Suite 1742/0/34 (was 1729 baseline, +13 new tests). Three reviewer
rounds: round 1 caught LIKE-escape, atomicity, field rename, FTS5
try/catch; round 2 caught a JSDoc rot from the rename and a
contradictory test assertion; round 3 APPROVE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…ge-level diff (eval arc)

Three additions with pre-set hypotheses + post-impl measurement, per
the user's "evaluation and measurement how much useful it will
actually be" brief. Pre-impl thresholds were committed before code so
post numbers couldn't be motivated. One feature deferred with honest
documented reasoning.

#1 Selective parse-cache invalidation
  - clearParseCache(qb, language?) returns deleted-row count
  - CLI: --clear-parse-cache [language] (boolean OR optional value)
  - MCP: clearParseCache: boolean + clearParseCacheLanguage: string
    (typed schema doesn't allow oneOf — split into two args; language
     wins when both set)
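The boolean-OR-optional-value flag shape can be sketched like this (hypothetical hand-rolled parser; the real CLI has its own argument handling):

```typescript
// `--clear-parse-cache` alone means "clear everything";
// `--clear-parse-cache typescript` means "clear one language".
type ClearSpec = { clear: boolean; language?: string };

function parseClearParseCache(argv: string[]): ClearSpec {
  const i = argv.indexOf("--clear-parse-cache");
  if (i === -1) return { clear: false };
  const next = argv[i + 1];
  // A following token is the language unless it is another flag.
  if (next !== undefined && !next.startsWith("--")) {
    return { clear: true, language: next };
  }
  return { clear: true };
}
```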

  Hypothesis (pre-set): >=3x wall-clock speedup, <30s absolute
  Measured (codegraph repo, 463/498 = 93% TS in parse cache):
    - TS-only clear: 3.70s wall / 3.77s user-CPU
    - Full clear:    3.87s wall / 8.13s user-CPU
    - Wall ratio: 1.04x (parallelism masks the work delta)
    - User-CPU ratio: 2.16x more work for full clear
  Verdict: speed threshold NOT MET on monoglot testbed. Real value
  here is correctness (targeted invalidation when an extractor
  changes). On polyglot repos at 50% target-lang ratio, expected ~2x
  wall-clock speedup.

#1.5 Docstring source for mode='intent' (user follow-up: "make intent richer")
  - Migration 038: docstring_fts FTS5 over nodes.docstring + INS/UPD/DEL
    triggers (with WHEN docstring IS NOT NULL AND != ''); schema.sql parity
    for fresh-init path; pragma_table_info guard for pre-016 hand-rolled
    migration test setups
  - _search-intent.ts queries BOTH summary_fts AND docstring_fts,
    UNIONs by node_id keeping best rank, surfaces 'via summary' /
    'via docstring' provenance label per result
  - Empty-corpus check fixed: FTS5 external-content COUNT(*) reads
    from the source table, not the actual indexed rows — switched to
    direct content count

  Hypothesis (pre-set): >=30% recall increase
  Measured (20 hand-picked intent queries, codegraph corpus):
    - Summary-only hits: 22
    - Docstring-only hits: 34
    - Combined unique-node hits: 56 (2.55x = 155% improvement)
  Verdict: well above threshold. Best ROI of this arc — docstrings
  cover 26% of nodes vs summaries' 18%, AND describe intent verbatim
  in JSDoc / Python docstrings / Go comments.

#2 Edge-level diff in compare_to_ref
  - EdgeDelta / EdgesDelta types
  - diffEdgeLists keyed by a stable
    (srcQualName::srcKind=>tgtQualName::tgtKind::edgeKind) key so line
    shifts don't surface as spurious changes
  - Cross-file edges out of scope (compareToRef is per-file)
  - Opt-in via includeEdges: true (CLI: --include-edges)
  - Renderer surfaces source -> target node IDs (round-1 reviewer
    finding: discarded data; fixed)
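The stable-key diff can be sketched as below, with types simplified to the key fields (the real EdgeDelta carries more):

```typescript
// Keying on qualified names + kinds (no line numbers) means a pure line
// shift produces identical keys and therefore no spurious delta.
type EdgeRec = {
  srcQualName: string; srcKind: string;
  tgtQualName: string; tgtKind: string;
  edgeKind: string;
};

const edgeKey = (e: EdgeRec) =>
  `${e.srcQualName}::${e.srcKind}=>${e.tgtQualName}::${e.tgtKind}::${e.edgeKind}`;

function diffEdgeLists(before: EdgeRec[], after: EdgeRec[]) {
  const beforeKeys = new Set(before.map(edgeKey));
  const afterKeys = new Set(after.map(edgeKey));
  return {
    added: after.filter((e) => !beforeKeys.has(edgeKey(e))),
    removed: before.filter((e) => !afterKeys.has(edgeKey(e))),
  };
}
```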

  Hypothesis (pre-set): >=30% additional info (>=20% loose)
  Measured (HEAD vs HEAD~3 on this branch):
    - Node changes: 21+11+308 = 340
    - Edge changes (NEW signal): 83+11 = 94
    - Files surfaced: 22 -> 27 (+5 visible only via edge changes)
    - Information gain: 94/340 = 27.6%
  Verdict: >=20% threshold MET; just below >=30% strict. The 5
  newly-surfaced files (pure-edge changes) are the qualitative win.

#3 Stack-graphs cross-language resolver — DEFERRED
  Survey of the codegraph corpus: monoglot TypeScript. child_process
  invocations are ~30 git execFileSync calls; no Python/Ruby/Go
  spawn targets. Dynamic imports are NPM packages. string_imports
  table is dominated by test fixtures. Conclusion: this corpus
  lacks the ground-truth cross-language references needed to
  measure a scope-graph rule meaningfully. Building infrastructure
  without testable signal would be speculative abstraction (CLAUDE.md
  anti-pattern). Stays on the borrow-ideas backlog as the
  long-horizon item; not blocked, just not session-feasible without
  a polyglot testbed.

Schema-version assertions bumped 37 -> 38 in foundation.test.ts and
pr19-improvements.test.ts.

Suite 1746/0/34 (was 1742; +4 new tests). Two reviewer rounds: round
1 caught the edge-delta formatter discarding source/target IDs;
round 2 APPROVE. Info-level note tracked for later: summary_fts_au
trigger could mirror the docstring_fts_au SELECT-WHERE guard pattern
for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 8, 2026
…ulk codegraph_at_range

Three additive surface extensions to cut investigation round-trips:

#1 codegraph_node accepts `symbols: string[]` (up to 20) alongside
   the existing single-symbol `symbol`. Duplicate inputs that resolve
   to the same node are merged. Saves N round-trips when checking a
   list of suspect symbols.
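The duplicate-merge can be sketched as a dedup on resolved node id (resolver and card shape are stand-ins here):

```typescript
// Two input symbols that resolve to the same node yield one card;
// unresolvable symbols are dropped; input is capped at `max`.
type NodeCard = { nodeId: string; symbol: string };

function resolveMany(
  symbols: string[],
  resolve: (s: string) => NodeCard | undefined,
  max = 20,
): NodeCard[] {
  const seen = new Map<string, NodeCard>();
  for (const sym of symbols.slice(0, max)) {
    const card = resolve(sym);
    if (card && !seen.has(card.nodeId)) seen.set(card.nodeId, card);
  }
  return [...seen.values()];
}
```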

#2 codegraph_node accepts four new inline-expansion flags
   (`includeCallers` / `includeCallees` / `includeFindings` /
   `includeTests`). When set, the response folds in the corresponding
   tool's answer under each card, capped per-section to keep token
   pressure low (10 callers, 10 callees, 5 findings, 5 test files).
   Collapses 3-5 round-trips into one for "tell me everything about X"
   patterns; the dedicated tools remain available for the full lists.

#3 codegraph_at_range accepts `ranges: Array<{file, startLine, endLine}>`
   (up to 100) alongside the single-range form. Output renders one
   subsection per range so the agent can map results back to specific
   diff hunks. PR review with N hunks goes from N+1 calls to 1.

All three paths are additive — the legacy single-input shapes are
preserved verbatim. Backward-compat is locked in by the existing tests
plus 19 new ones (8 for node multi/expansions, 5 for bulk at-range,
+6 from refactoring). Docs updated in CLAUDE.md, README.md, and the
server-instructions playbook.

Suite 1772/0/34. No schema migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
New module src/llm/commit-intent.ts. Classifies a commit message
into one of 8 buckets: feat / fix / refactor / perf / test / docs /
chore / unknown. Heuristic-first, no ML model.

Public API:
- classifyCommitMessage(message) → { intent, score, reason }
- 7 conventional-commits prefixes (feat:, fix:, refactor:, perf:,
  test:, docs:, chore: / build: / ci:) → score 0.95
- Keyword cues for unprefixed messages (add/implement → feat,
  fix/resolve → fix, refactor/rename → refactor, etc.) → score 0.6
- Body-cue fallbacks ("Closes #N" → fix, "BREAKING CHANGE:" → feat)
- Default unknown when nothing matches
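The priority ordering above can be condensed into a sketch (cue lists abbreviated; the shipped module covers more keywords and documents the ordering inline):

```typescript
type Intent =
  | "feat" | "fix" | "refactor" | "perf" | "test" | "docs" | "chore" | "unknown";
type Classification = { intent: Intent; score: number; reason: string };

// `[(:]` after the word keeps "Fix the token..." from matching `fix:`.
const PREFIX = /^(feat|fix|refactor|perf|test|docs|chore|build|ci)[(:]/i;
const KEYWORDS: Array<[RegExp, Intent]> = [
  [/\b(add|implement)\b/i, "feat"],
  [/\b(fix|resolve)\b/i, "fix"],
  [/\b(refactor|rename)\b/i, "refactor"],
];

function classifyCommitMessage(message: string): Classification {
  const subject = message.split("\n")[0];
  const m = PREFIX.exec(subject);
  if (m) {
    const p = m[1].toLowerCase();
    const intent = (p === "build" || p === "ci" ? "chore" : p) as Intent;
    return { intent, score: 0.95, reason: `conventional prefix ${p}` };
  }
  for (const [re, intent] of KEYWORDS) {
    if (re.test(subject)) return { intent, score: 0.6, reason: "keyword cue" };
  }
  if (/^BREAKING CHANGE:/m.test(message))
    return { intent: "feat", score: 0.6, reason: "body cue" };
  if (/^Closes #\d+/m.test(message))
    return { intent: "fix", score: 0.6, reason: "body cue" };
  return { intent: "unknown", score: 0, reason: "no cue matched" };
}
```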

45 vitest cases. Friction notes: prefix character class tightened to
`[\(:]` so "Fix the token..." doesn't match the `fix:` prefix rule;
priority ordering documented inline.

Integration into co_changes / codegraph_history / codegraph_hotspots
to surface intent breakdown lands separately (tracked as Stage 7 #2
follow-up — needs a small migration to add intent column to co_changes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 9, 2026
…#2)

Wires the heuristic commit-intent classifier (shipped in d12ee01) into
a SHA-keyed persistence layer:

- Migration 045: new `commit_intents (sha PK, intent, score, seen_at)`
  table with idx_commit_intents_intent for per-intent queries.
- schema.sql parity (recurring scrutiny pattern #5).
- Schema-version test assertions bumped 44 → 45.
- src/db/queries-commit-intents.ts (new): recordCommitIntent /
  recordCommitIntents (batch upsert with txn) / getCommitIntent /
  getCommitIntents (multi-SHA Map fetch) / aggregateIntentBreakdown
  (returns Record<intent, count> for codegraph_history-style folds) /
  clearCommitIntents (cochange full-rescan path).
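The breakdown fold has this shape (in the codebase it reads from `commit_intents`; here a plain array stands in for the rows):

```typescript
// Folds per-SHA intent rows into a Record<intent, count> suitable for
// the codegraph_history-style breakdown described above.
type IntentRow = { sha: string; intent: string; score: number };

function aggregateIntentBreakdown(rows: IntentRow[]): Record<string, number> {
  const out: Record<string, number> = {};
  for (const row of rows) out[row.intent] = (out[row.intent] ?? 0) + 1;
  return out;
}
```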

Mining-side integration deferred — `mineCoChanges` doesn't currently
emit subjects (only SHAs + file pairs). Adding a subject-capture pass
is tracked as a Stage 7 #2 follow-up.

Suite remains 2374/0/34. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
…ge 7 #2)

Closes the Stage 7 #2 loop end-to-end: subjects flow from git-log
through to classified intents in commit_intents.

- src/cochange/index.ts: git-log format extended to
  `tformat:CGCMT-%H%x09%s` (TAB-separated SHA + subject); parser
  captures subjects per commit; mineCoChanges return shape gains a
  `subjects: Map<sha, subject>` field.
- src/index-hooks/cochange.ts: applyResults now classifies every
  freshly-mined subject via classifyCommitMessage and batch-upserts
  to commit_intents. Full-rescan path also clearCommitIntents to drop
  stale SHAs after a force-push / rebase.
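The subject-capture parse over the `tformat:CGCMT-%H%x09%s` output can be sketched like this (parser name hypothetical; file lines between commit markers are simply skipped):

```typescript
// Each commit emits one `CGCMT-<sha>\t<subject>` line (%x09 is TAB);
// everything else in the git-log output is ignored here.
function parseSubjects(gitLogOutput: string): Map<string, string> {
  const subjects = new Map<string, string>();
  for (const line of gitLogOutput.split("\n")) {
    if (!line.startsWith("CGCMT-")) continue;
    const tab = line.indexOf("\t");
    if (tab === -1) continue;
    subjects.set(line.slice("CGCMT-".length, tab), line.slice(tab + 1));
  }
  return subjects;
}
```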

Heuristic-only classifier — no LLM call, runs inline with mining.
The persistence layer (Stage 7 #2 foundation, 6af1924) stays
unchanged; this commit just produces the input rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andreinknv added a commit that referenced this pull request May 10, 2026
…Stage 7 #2/#3 follow-ups)

Two new local-NLI surfaces. Both use the existing bart-large-mnli
zero-shot classifier; both keep the heuristic / rule-based path
on the happy case and only consult the model when the deterministic
rules can't reach a confident answer. Token cost vs the chat
backend: ~0 per call.

# C — commit-intent NLI fallback

`classifyCommitMessageWithFallback(message, classifier?)` runs the
existing heuristic first; when it returns `'unknown'` (~30% of
commits in messy histories: "wip", "stuff", "more"), feeds the
subject into a 7-hypothesis NLI classifier:

  feat     → "a new feature or capability"
  fix      → "a bug fix or error correction"
  refactor → "code restructuring or cleanup without behaviour change"
  perf     → "a performance improvement or optimisation"
  test     → "tests, specs, or test infrastructure"
  docs     → "documentation, comments, or readme"
  chore    → "dependency bumps, build config, or routine chores"

Confidence floor 0.45 — below that we keep `'unknown'` rather than
ascribe a low-confidence label (avoids polluting commit_intents
with junk on truly-opaque commits). Wired into the cochange index
hook: classifier is constructed from `localRoleLlm` config when
present; otherwise the hook stays heuristic-only and
behaviour-identical to before.

The original sync `classifyCommitMessage` is unchanged — existing
callers + tests continue to use the heuristic path.
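The fallback wiring and the 0.45 floor can be sketched with an injected classifier (interface and hypothesis map simplified; only two of the seven hypotheses shown):

```typescript
type Intent =
  | "feat" | "fix" | "refactor" | "perf" | "test" | "docs" | "chore" | "unknown";

interface ZeroShot {
  classifyLabels(
    text: string,
    labels: string[],
  ): Promise<{ label: string; score: number }>;
}

const HYPOTHESES: Record<string, Intent> = {
  "a new feature or capability": "feat",
  "a bug fix or error correction": "fix",
  // ...remaining five hypotheses elided
};

async function classifyWithFallback(
  message: string,
  heuristic: (m: string) => { intent: Intent; score: number },
  classifier?: ZeroShot,
  floor = 0.45,
): Promise<{ intent: Intent; score: number }> {
  const h = heuristic(message);
  if (h.intent !== "unknown" || !classifier) return h; // heuristic wins
  const top = await classifier.classifyLabels(message, Object.keys(HYPOTHESES));
  if (top.score < floor) return h; // keep 'unknown' over a junk label
  return { intent: HYPOTHESES[top.label] ?? "unknown", score: top.score };
}
```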

# D — new structured change-kind classifier

`classifyChangeKind({classifier, beforeBody, afterBody, name, kind})`
in `src/llm/change-kind.ts`. Distinct from the existing
generative `summarizeChange` (which produces prose via the chat
backend): this produces a STRUCTURED label suitable for grouping /
filtering / metrics:

  addition | removal | modification | refactor
  | signature_change | behavioral_change | doc_only | unknown

Rule-based dispatch handles the trivial cases (empty before →
addition, empty after → removal, identical → unknown) without an
NLI call. The remaining cases consult bart-large-mnli with 4
prose hypotheses against the diff. Same 0.45 confidence floor;
sub-threshold tops fall through to `'modification'`.
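The rule-based fast path looks roughly like this (sketch only; the NLI branch is stubbed as a `null` return since it needs the model, same shape as the commit-intent fallback):

```typescript
type ChangeKind =
  | "addition" | "removal" | "modification" | "refactor"
  | "signature_change" | "behavioral_change" | "doc_only" | "unknown";

// Trivial cases resolve without an NLI call; `null` means "consult the
// zero-shot classifier with the diff".
function ruleBasedChangeKind(beforeBody: string, afterBody: string): ChangeKind | null {
  if (!beforeBody && afterBody) return "addition";
  if (beforeBody && !afterBody) return "removal";
  if (beforeBody === afterBody) return "unknown";
  return null;
}
```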

# Supporting refactors

- `LocalRoleClassifier.classifyLabels(text, labels)` — new method
  that runs zero-shot against an arbitrary caller-supplied label
  set (vs the existing `classify()` which is hardcoded to the
  7-class role taxonomy). The classifier wraps the same pipeline,
  so both surfaces share one model load.

- `IndexHook` afterIndexAll/afterSync are already async-aware in
  the registry; the cochange hook had been synchronous. Made
  `applyResults`/`applyFullRescan`/`applyIncremental`/`refresh`
  async so the NLI fallback can await per-commit. Behaviour
  unchanged when `localRoleLlm` is unset.

# Tests

+ `__tests__/commit-intent-fallback.test.ts` — 13 cases covering
  short-circuit, NLI dispatch, low-confidence floor, error
  degradation, all 7 intent labels.
+ `__tests__/change-kind.test.ts` — 12 cases covering rule-based
  dispatch, NLI dispatch, low-confidence floor, error paths.
+ `__tests__/local-role-classifier.test.ts` — 7 new cases for
  `classifyLabels` (custom label set, short-circuit, abort,
  unknown labels NOT coerced to ROLE_LABELS).

# Why now

Web research (HF API + WebSearch agent) confirmed that for the
in-process transformers.js model surface, five small specialized
models is the realistic ceiling: no code-tuned summarizer or
embedder ships in transformers.js-loadable form today.
remaining win for the existing surface was unifying everything that
used to chat-call onto the one bart-large-mnli load — both C and D
fit that shape.
andreinknv added a commit that referenced this pull request May 10, 2026
…L rebuild script

Three deliverables from the gap-B speed/size investment session:

1. `scripts/spikes/qwen-coder-bench.mjs` — multi-dtype + multi-device
   bench harness for Qwen2.5-Coder-0.5B-Instruct loaded in-process
   via @huggingface/transformers. Probes the locally-converted
   ONNX (fp32 / q8 / etc.) on the same 5 curated codegraph snippets,
   reports per-dtype median/p95 latency + side-by-side outputs.

2. `src/llm/local-causal-llm-client.ts` — new client class that
   wraps the `text-generation` pipeline + chat templating for
   small instruction-tuned causal LMs. Sibling to LocalSummaryClient
   (encoder-decoder seq2seq) and RerankerClient (cross-encoder).
   Default model intentionally empty — operators must supply a
   re-exported HF id (the recipe is in the qwen-coder-probe.mjs
   commit message). Not yet wired into the summarizer phase; lands
   alongside the integration commit.

3. `scripts/spikes/coreml-rebuild.sh` — recipe to rebuild
   onnxruntime-node from source with --use_coreml so transformers.js
   gets `device:'coreml'` on Apple Silicon. The pre-built npm
   binary doesn't bundle CoreML EP. Vendoring the rebuilt .node
   file into node_modules is the deployment path.

# INT8 quantize bench result (today's deliverable)

ORT dynamic INT8 quantize via `optimum-cli onnxruntime quantize
--arm64 --per_channel` produces `model_quantized.onnx` (~605MB,
~4× smaller than the 2.3GB fp32 model). Bench against the fp32 baseline:

| Config       | Cold load | Median per-call | Quality |
|--------------|-----------|-----------------|---------|
| cpu / fp32   | 2707ms    | 271ms           | clean   |
| cpu / q8     |  733ms    | 273ms           | regressed |

Two findings:

- **q8 is not faster per-call.** Only cold-load time and disk footprint
  shrink (~4× each). NEON fp32 is well-optimised in the onnxruntime CPU
  EP; dynamic INT8 reduces memory but doesn't speed up compute on this path.

- **q8 regresses instruction-following.** sum() produces a
  re-generation of the source code instead of a summary;
  classifyCommitMessage produces "To summarize the provided code,
  I'll break it down..." instead of the requested 1-line.
  Classic small-instruct-LLM failure mode under dynamic
  quantization — precision loss in attention layers degrades
  adherence to system-prompt formatting constraints.

Decision: **do not ship q8**. Keep fp32 as the production path
despite the 2.3GB footprint; quality is what matters. Real speedup
needs CoreML EP (next task #2).

# Probe script update

`scripts/spikes/qwen-coder-probe.mjs` got a tighter system prompt
(`OUTPUT FORMAT: ONE LINE...`) and `max_new_tokens=30` cap that
together produce clean 1-liners on fp32:

  - sum                    → "Sum an array of numbers."
  - FileLoader.load        → "Load file from path"
  - buildSimilarToEdges    → "Build Similar To Edges"
  - parseDaRecord          → "Parse DA record from a line and update coverage."
  - classifyCommitMessage  → "Classify commit message"

Same prompt + cap on q8 fails (model ignores OUTPUT FORMAT).

# Out of scope (next task)

CoreML EP rebuild — `scripts/spikes/coreml-rebuild.sh` is the
recipe; expected ~3-5× speedup over CPU fp32 on Apple Silicon
once the binding is vendored. Documented but not run in this
commit.

# Footprint

5 files added/modified, ~600 LOC:
- 2 spike scripts (bench, mlx-vs-onnx)
- 1 rebuild script (coreml)
- 1 client class (LocalCausalLlmClient — wired in a follow-up commit)
- 1 probe script update (prompt tightening)