feat(graph): enrich context graph with symbol nodes and semantic edges #9
dubscode wants to merge 1 commit into feat/incremental-indexing-pipeline
Add symbol extraction and canonical symbol IDs for TS/JS files, persist symbol/file graph nodes with normalized edge types (defines, references, imports, calls), and gate rollout via symbol enrichment config. Integrate graph enrichment into indexing, extend retrieval graph-hint traversal with graph_node citations, and add migration + validation tests for stability, directionality, compatibility, and volume thresholds. Archive OpenSpec change deeper-context-graph-enrichment and sync the new context-graph-enrichment spec into main specs.
Pull request overview
Adds symbol-level enrichment to the existing context graph so indexing can persist symbol nodes + semantic edges and retrieval can leverage graph hints while keeping file-level retrieval behavior intact.
Changes:
- Add graph symbol extraction + persistence during indexing (feature-flagged).
- Extend hybrid retrieval to traverse graph hints, boost relevant file paths, and return graph-node citations.
- Add DB migration constraints/indexes plus a new Vitest suite and accompanying OpenSpec documentation/archive updates.
Reviewed changes
Copilot reviewed 13 out of 17 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/graph-enrichment.test.ts | New tests covering extraction stability, edge presence/directionality checks, retrieval behavior, and perf/volume thresholds. |
| src/db/migrations/0003_context_graph_enrichment.sql | Adds check constraints for node/edge enums and indexes for graph tables. |
| src/context/retrieval/hybrid.ts | Adds graph-hint traversal to retrieval, path boosting, and graph-node citations. |
| src/context/indexer/incremental.ts | Adds symbolEnrichmentEnabled option to incremental indexing API. |
| src/context/indexer/full-index.ts | Adds symbolEnrichmentEnabled option to full indexing API. |
| src/context/indexer/file-index.ts | Runs extraction/persistence + cleanup for graph enrichment during file upsert/delete. |
| src/context/graph/types.ts | Defines node/edge/symbol types and canonical symbol ID format. |
| src/context/graph/persist.ts | Implements persistence + deletion of graph enrichment per file (nodes + edges). |
| src/context/graph/extract.ts | Implements initial TS/JS regex-based symbol extraction and semantic edge generation. |
| src/context/graph/config.ts | Adds env-var gate for enabling symbol enrichment. |
| openspec/specs/context-graph-enrichment/spec.md | New canonical spec describing requirements for graph enrichment. |
| openspec/changes/deeper-context-graph-enrichment/tasks.md | Removes old change tasks doc (archived). |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/tasks.md | Archived tasks checklist marked completed. |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/specs/context-graph-enrichment/spec.md | Archived requirements excerpt. |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/proposal.md | Archived proposal for enrichment work. |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/design.md | Archived design doc (goals/decisions/risks). |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/.openspec.yaml | Adds archived change metadata. |
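The PR description gives the canonical symbol ID shape as `<repo>::<path>::<kind>::<name>::<range-hash>`. A minimal sketch of how such an ID could be assembled follows. The `SymbolKind` values, the `SymbolRange` shape, and the FNV-1a-style hash are assumptions made for illustration; the actual definitions in `src/context/graph/types.ts` are not shown in this PR page.

```typescript
// Hypothetical symbol kinds; the real enum is not visible in the review.
type SymbolKind = "function" | "class" | "interface" | "variable";

// Hypothetical source range, expressed as character offsets.
interface SymbolRange {
  start: number;
  end: number;
}

// FNV-1a-style 32-bit hash over UTF-16 code units, used here only to keep
// the sketch dependency-free. The PR may use a different hash entirely.
function hashRange(range: SymbolRange): string {
  const input = `${range.start}-${range.end}`;
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Joins the five segments with "::" in the order given by the PR description.
function buildSymbolId(
  repo: string,
  path: string,
  kind: SymbolKind,
  name: string,
  range: SymbolRange,
): string {
  return [repo, path, kind, name, hashRange(range)].join("::");
}
```

Because every segment is derived from stable inputs, re-indexing an unchanged file yields the same ID, which is the property the extraction stability tests appear to target.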
```ts
trigger?: IncrementalTrigger;
embedProvider?: ProviderAdapter;
embeddingModel?: string;
symbolEnrichmentEnabled?: boolean;
```
runIncrementalIndex now accepts symbolEnrichmentEnabled, but when it falls back to runFullIndex the flag is not forwarded. This makes behavior depend on whether incremental fallback occurs, which is surprising and can disable enrichment unintentionally. Pass symbolEnrichmentEnabled through to the runFullIndex call in the fallback path.
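The fix requested in this comment can be sketched as follows. The option and result shapes, the `forceFallback` switch, and the `*Sketch` function names are hypothetical stand-ins; the real `runIncrementalIndex` and `runFullIndex` are presumably asynchronous and take more options than shown here.

```typescript
// Hypothetical option shape; only the flag under discussion is modeled.
interface IndexOptions {
  repoRoot: string;
  symbolEnrichmentEnabled?: boolean;
}

interface IndexResult {
  mode: "incremental" | "full";
  symbolEnrichmentEnabled: boolean;
}

// Stand-in for runFullIndex: reports which flag value it received.
function runFullIndexSketch(options: IndexOptions): IndexResult {
  return {
    mode: "full",
    symbolEnrichmentEnabled: options.symbolEnrichmentEnabled ?? false,
  };
}

// Stand-in for runIncrementalIndex. `forceFallback` exists only so the
// fallback path can be exercised in this sketch.
function runIncrementalIndexSketch(
  options: IndexOptions & { forceFallback?: boolean },
): IndexResult {
  if (options.forceFallback) {
    // Forward the flag so enrichment does not depend on which path runs.
    return runFullIndexSketch({
      repoRoot: options.repoRoot,
      symbolEnrichmentEnabled: options.symbolEnrichmentEnabled,
    });
  }
  return {
    mode: "incremental",
    symbolEnrichmentEnabled: options.symbolEnrichmentEnabled ?? false,
  };
}
```

With the flag forwarded, both paths report the same enrichment setting for the same caller input.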
```ts
await persistGraphEnrichmentForFile({
  db: input.db,
  repoRoot: input.repoRoot,
  path: relativePath,
  extraction,
});
for (const diagnostic of extraction.diagnostics) {
  console.info(`[indexer:graph] ${diagnostic}`);
```
When symbol enrichment is enabled, extraction/persistence is attempted for every indexed file and any diagnostics are logged. For non-JS/TS files this will always emit unsupported-language:* and still write a file node, which can create a lot of DB churn and noisy logs on mixed-language repos. Consider short-circuiting (e.g., check canExtractSymbols(relativePath) or extraction.symbols.length > 0) and/or downgrade/aggregate diagnostics logging.
```diff
-await persistGraphEnrichmentForFile({
-  db: input.db,
-  repoRoot: input.repoRoot,
-  path: relativePath,
-  extraction,
-});
-for (const diagnostic of extraction.diagnostics) {
-  console.info(`[indexer:graph] ${diagnostic}`);
+// Short-circuit for files with no extracted symbols to avoid unnecessary
+// persistence and noisy diagnostics (e.g. unsupported-language:*).
+if (!extraction.symbols || extraction.symbols.length === 0) {
+  return {
+    status: 'indexed',
+    fileStatus,
+    chunksInserted: chunks.length,
+    chunksDeleted,
+  };
+}
+await persistGraphEnrichmentForFile({
+  db: input.db,
+  repoRoot: input.repoRoot,
+  path: relativePath,
+  extraction,
+});
+const diagnostics = extraction.diagnostics ?? [];
+for (const diagnostic of diagnostics) {
+  const message = String(diagnostic);
+  if (message.startsWith('unsupported-language:')) {
+    continue;
+  }
+  console.info(`[indexer:graph] ${message}`);
```
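The comment also mentions two alternatives to filtering on `extraction.symbols`: gating on the file path before extraction, and aggregating diagnostics instead of logging each one. A rough sketch of both is below. `canExtractSymbolsSketch` stands in for the `canExtractSymbols` helper the comment names as an example, the extension list is an assumption, and diagnostics are assumed to be strings of the form `kind:detail` as in `unsupported-language:*`.

```typescript
// Assumed set of extensions the TS/JS extractor handles.
const SUPPORTED_EXTENSIONS = [".ts", ".tsx", ".js", ".jsx", ".mjs", ".cjs"];

// Cheap path check that can run before any extraction work is attempted.
function canExtractSymbolsSketch(relativePath: string): boolean {
  const lower = relativePath.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Groups diagnostics by the text before the first ":" and counts them,
// so one summary line per kind can be logged per indexing run.
function summarizeDiagnostics(diagnostics: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const diagnostic of diagnostics) {
    const separator = diagnostic.indexOf(":");
    const kind = separator === -1 ? diagnostic : diagnostic.slice(0, separator);
    counts.set(kind, (counts.get(kind) ?? 0) + 1);
  }
  return counts;
}
```

Gating on the path avoids running the extractor at all for unsupported files, whereas the suggested change above still runs extraction and only skips persistence.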
```markdown
# context-graph-enrichment Specification

## Purpose
TBD - created by archiving change deeper-context-graph-enrichment. Update Purpose after archive.
```
The spec Purpose section is left as "TBD". Since this is now the canonical spec under openspec/specs/, it should state the actual purpose of context graph enrichment (at least a 1–2 sentence summary) rather than referencing an archived change.
```diff
-TBD - created by archiving change deeper-context-graph-enrichment. Update Purpose after archive.
+This specification defines how the context graph is enriched with symbol-level nodes and normalized semantic relationship edges derived from source code. The goal is to enable precise retrieval, impact analysis, and other graph-based queries while preserving compatibility with existing file-level graph consumers.
```
```ts
expect(edge.sourceKey).toContain('::file');
expect(edge.targetKey).not.toBe(edge.sourceKey);
```
The directionality assertions are very broad: the test currently requires all edges to have sourceKey containing ::file. That doesn't actually validate directionality per edge type, and it also conflicts with the spec language that references/calls edges may originate from a symbol context. Consider tightening this to assert directionality per edge type (e.g., defines should be file -> symbol) rather than hard-coding file as the source for every edge.
```diff
-expect(edge.sourceKey).toContain('::file');
-expect(edge.targetKey).not.toBe(edge.sourceKey);
+switch (edge.type) {
+  case 'defines':
+  case 'imports':
+    // File-level edges should originate from a file context.
+    expect(edge.sourceKey).toContain('::file');
+    expect(edge.targetKey).not.toBe(edge.sourceKey);
+    break;
+  case 'references':
+  case 'calls':
+    // Reference/call edges may originate from a symbol context.
+    expect(edge.sourceKey).toContain('::symbol');
+    expect(edge.targetKey).not.toBe(edge.sourceKey);
+    break;
+  default:
+    // At minimum, no edge should be self-referential.
+    expect(edge.targetKey).not.toBe(edge.sourceKey);
+    break;
+}
```
```sql
WHERE n.payload->>'repoRoot' = $1
  AND (
    n.node_key = ANY($2::text[])
    OR n.payload->>'name' = ANY($2::text[])
    OR n.payload->>'path' = ANY($2::text[])
  )
```
traverseGraphHints filters context_nodes by n.payload->>'repoRoot' and then by payload->>'name' / payload->>'path'. There are indexes on node_type and node_key, but none on these JSONB fields, so this query will devolve into a full scan as the graph grows (especially with symbol nodes). Consider adding a functional index (e.g., on (payload->>'repoRoot') and/or (payload->>'name'), (payload->>'path')) or promoting these fields to dedicated columns for efficient lookups.
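One way to act on this comment is a follow-up migration that adds expression indexes on the JSONB fields the query filters on. The sketch below holds the statements as TypeScript strings, on the assumption that a migration could be applied from code. The index names are hypothetical, and whether the planner actually uses these indexes for the `OR` branches would need to be confirmed with `EXPLAIN` against realistic data.

```typescript
// Hypothetical follow-up migration statements (PostgreSQL expression indexes).
// The extra parentheses around the expression are required by PostgreSQL
// when indexing an expression rather than a plain column.
const functionalIndexStatements: string[] = [
  "CREATE INDEX IF NOT EXISTS idx_context_nodes_payload_repo_root " +
    "ON context_nodes ((payload->>'repoRoot'));",
  "CREATE INDEX IF NOT EXISTS idx_context_nodes_payload_name " +
    "ON context_nodes ((payload->>'name'));",
  "CREATE INDEX IF NOT EXISTS idx_context_nodes_payload_path " +
    "ON context_nodes ((payload->>'path'));",
];

// Joins the statements into a single script, one statement per line.
function buildIndexMigration(statements: string[]): string {
  return statements.join("\n");
}
```

Promoting `repoRoot`, `name`, and `path` to dedicated columns, the other option the comment raises, would avoid expression indexes altogether at the cost of a schema change.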
```ts
'SELECT id FROM context_nodes WHERE node_key LIKE $1',
[`${prefix}%`]
```
persistGraphEnrichmentForFile scopes nodes using node_key LIKE $1 with a prefix derived from repoRoot/path. In SQL LIKE, % and _ in the prefix are wildcards, so repos/paths containing those characters can unintentionally match and delete other nodes/edges. Use a non-wildcard prefix match (e.g., LEFT(node_key, LENGTH($1)) = $1 with the raw prefix, or escape %/_ and add an explicit ESCAPE clause).
```diff
-'SELECT id FROM context_nodes WHERE node_key LIKE $1',
-[`${prefix}%`]
+'SELECT id FROM context_nodes WHERE LEFT(node_key, LENGTH($1)) = $1',
+[prefix]
```
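The comment's other option, escaping the prefix and adding an explicit `ESCAPE` clause, could look roughly like the sketch below. The helper names are hypothetical and backslash is chosen as the escape character by assumption. The query text mirrors the one in the diff.

```typescript
// Escapes LIKE metacharacters (% and _) and the escape character itself,
// so the prefix is matched literally.
function escapeLikePrefix(prefix: string): string {
  return prefix.replace(/[\\%_]/g, (match) => `\\${match}`);
}

// Builds a parameterized prefix query. Only the trailing % acts as a wildcard.
function buildPrefixQuery(prefix: string): { text: string; values: string[] } {
  return {
    text: "SELECT id FROM context_nodes WHERE node_key LIKE $1 ESCAPE '\\'",
    values: [`${escapeLikePrefix(prefix)}%`],
  };
}
```

Keeping `LIKE` with an escaped prefix may allow an index on `node_key` to be used for prefix matching, depending on collation and operator class, while the `LEFT(...) = $1` form in the suggestion is simpler but wraps the column in a function call.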
```ts
'SELECT id FROM context_nodes WHERE node_key LIKE $1',
[`${prefix}%`]
```
deleteGraphEnrichmentForFile uses node_key LIKE $1 with a prefix built from repoRoot/path. Because _/% are wildcards in LIKE, this can delete graph enrichment for other files if the repo path contains those characters. Prefer a deterministic prefix match (or properly escape the prefix + specify an ESCAPE clause).
```diff
-'SELECT id FROM context_nodes WHERE node_key LIKE $1',
-[`${prefix}%`]
+'SELECT id FROM context_nodes WHERE LEFT(node_key, LENGTH($1)) = $1',
+[prefix]
```
```sql
CHECK (edge_type IN ('defines', 'references', 'imports', 'calls'));

CREATE INDEX IF NOT EXISTS idx_context_nodes_node_type ON context_nodes(node_type);
CREATE INDEX IF NOT EXISTS idx_context_nodes_node_key ON context_nodes(node_key);
```
context_nodes.node_key is already UNIQUE (created in 0001_init.sql), which implicitly creates an index. Creating an additional idx_context_nodes_node_key index is redundant and adds extra write/storage overhead. Consider removing this index (or replacing the UNIQUE constraint/index if you intended a different index shape).
```diff
-CREATE INDEX IF NOT EXISTS idx_context_nodes_node_key ON context_nodes(node_key);
```
Summary
This PR adds symbol-level context graph enrichment and semantic relationship edges, then exposes those relationships to retrieval surfaces without breaking existing file-level behavior.
What changed
- Normalized edge types: `defines`, `references`, `imports`, `calls`
- Canonical symbol ID format: `<repo>::<path>::<kind>::<name>::<range-hash>`
- Rollout gate: `DUBSBOT_ENABLE_SYMBOL_ENRICHMENT=1`
- Retrieval returns `graph_node` citations while preserving file/chunk retrieval contract.
- Migration: `src/db/migrations/0003_context_graph_enrichment.sql`
- Archived `deeper-context-graph-enrichment` and synced the capability spec: `openspec/specs/context-graph-enrichment/spec.md`

Verification
- `pnpm checks`
- `pnpm test`
- `pnpm typecheck`
- `pnpm lint`
- `pnpm build`
- `tests/graph-enrichment.test.ts` for:

Stack note
This PR is stacked on top of #8 (`feat/incremental-indexing-pipeline`).
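The rollout gate named in the description, `DUBSBOT_ENABLE_SYMBOL_ENRICHMENT=1`, might be read along these lines. This is a sketch under assumptions: the function names are hypothetical, only the exact value `"1"` is treated as enabled, and the precedence of an explicit `symbolEnrichmentEnabled` option over the environment is a guess rather than something the PR states.

```typescript
// Minimal environment shape so the sketch does not depend on Node typings.
type EnvLike = Record<string, string | undefined>;

// Returns true only when the variable is set to exactly "1" (assumed rule).
function isSymbolEnrichmentEnabled(env: EnvLike): boolean {
  return env["DUBSBOT_ENABLE_SYMBOL_ENRICHMENT"] === "1";
}

// Resolves the effective setting: an explicit option wins when provided,
// otherwise the environment gate decides (assumed precedence).
function resolveSymbolEnrichment(
  option: boolean | undefined,
  env: EnvLike,
): boolean {
  return option ?? isSymbolEnrichmentEnabled(env);
}
```

In a Node process the caller would pass `process.env` as the `env` argument. Taking the environment as a parameter keeps the gate easy to exercise in tests.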