feat(graph): enrich context graph with symbol nodes and semantic edges #9

Open
dubscode wants to merge 1 commit into feat/incremental-indexing-pipeline from feat/context-graph-enrichment

dubscode commented Mar 4, 2026

Summary

This PR adds symbol-level context graph enrichment and semantic relationship edges, then exposes those relationships to retrieval surfaces without breaking existing file-level behavior.

What changed

  • Added graph enrichment model/types for symbol nodes and normalized edge enums:
    • defines
    • references
    • imports
    • calls
  • Added deterministic canonical symbol ID generation:
    • <repo>::<path>::<kind>::<name>::<range-hash>
  • Added feature/config gate for rollout:
    • DUBSBOT_ENABLE_SYMBOL_ENRICHMENT=1
  • Implemented TS/JS symbol extraction with best-effort diagnostics and partial-failure resilience.
  • Implemented graph persistence and cleanup for enriched file/symbol nodes and semantic edges.
  • Integrated enrichment into indexing flow via shared index helpers.
  • Extended hybrid retrieval to traverse graph hints and include graph_node citations while preserving file/chunk retrieval contract.
  • Added migration for graph enrichment constraints and indexes:
    • src/db/migrations/0003_context_graph_enrichment.sql
  • Archived OpenSpec change deeper-context-graph-enrichment and synced the capability spec:
    • openspec/specs/context-graph-enrichment/spec.md
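
To make the canonical ID format listed above concrete, here is a minimal sketch of how an ID of the shape `<repo>::<path>::<kind>::<name>::<range-hash>` could be built deterministically. The `SymbolKind` values, the `SourceRange` shape, and the use of 32-bit FNV-1a for the range hash are all assumptions for illustration; the PR description does not state which kinds exist or which hash is used.

```typescript
// Hypothetical symbol kinds; the PR description does not enumerate them.
type SymbolKind = 'function' | 'class' | 'interface' | 'type' | 'variable';

// Hypothetical range shape used only to derive the <range-hash> segment.
interface SourceRange {
  startLine: number;
  startColumn: number;
  endLine: number;
  endColumn: number;
}

// 32-bit FNV-1a over an ASCII string, returned as 8 lowercase hex digits.
// Stand-in only: the actual hash behind <range-hash> is not stated in the PR.
function fnv1a32(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ('00000000' + hash.toString(16)).slice(-8);
}

// Joins the five segments with "::". The same inputs always give the same ID.
function canonicalSymbolId(
  repo: string,
  path: string,
  kind: SymbolKind,
  name: string,
  range: SourceRange,
): string {
  const rangeKey = `${range.startLine}:${range.startColumn}-${range.endLine}:${range.endColumn}`;
  return [repo, path, kind, name, fnv1a32(rangeKey)].join('::');
}
```

Because the ID depends only on its inputs, re-indexing an unchanged file would reproduce the same node keys, which is the stability property the tests in this PR are described as checking.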

Verification

  • pnpm checks
    • pnpm test
    • pnpm typecheck
    • pnpm lint
    • pnpm build
  • Added dedicated test coverage in tests/graph-enrichment.test.ts for:
    • canonical ID stability and extraction determinism
    • required edge presence + directionality
    • retrieval compatibility and graph-node citation exposure
    • enrichment runtime/volume acceptance thresholds

Stack note

This PR is stacked on top of #8 (feat/incremental-indexing-pipeline).


🥞 DubStack

Add symbol extraction and canonical symbol IDs for TS/JS files, persist
symbol/file graph nodes with normalized edge types (defines, references,
imports, calls), and gate rollout via symbol enrichment config.

Integrate graph enrichment into indexing, extend retrieval graph-hint
traversal with graph_node citations, and add migration + validation tests
for stability, directionality, compatibility, and volume thresholds.

Archive OpenSpec change deeper-context-graph-enrichment and sync the new
context-graph-enrichment spec into main specs.
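
As an illustration of the rollout gate, the sketch below reads `DUBSBOT_ENABLE_SYMBOL_ENRICHMENT` from an environment map and lets an explicit `symbolEnrichmentEnabled` option take precedence. The function names and the precedence rule are assumptions, as is treating only the exact value `1` as enabled; the PR description shows `=1` but does not say how other values are handled.

```typescript
// Environment passed in as a plain map so the helper stays easy to test.
type EnvMap = Record<string, string | undefined>;

// Hypothetical helper: treats only the exact value "1" as enabled.
function isSymbolEnrichmentEnabled(env: EnvMap): boolean {
  return env['DUBSBOT_ENABLE_SYMBOL_ENRICHMENT'] === '1';
}

// Hypothetical resolver: an explicit option wins, otherwise fall back to the env gate.
function resolveSymbolEnrichment(explicit: boolean | undefined, env: EnvMap): boolean {
  return explicit ?? isSymbolEnrichmentEnabled(env);
}
```

In a Node.js process the map would typically be `process.env`, for example `resolveSymbolEnrichment(options.symbolEnrichmentEnabled, process.env)`.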

github-actions bot commented Mar 4, 2026

PR Checks Summary

Copilot AI left a comment

Pull request overview

Adds symbol-level enrichment to the existing context graph so indexing can persist symbol nodes + semantic edges and retrieval can leverage graph hints while keeping file-level retrieval behavior intact.

Changes:

  • Add graph symbol extraction + persistence during indexing (feature-flagged).
  • Extend hybrid retrieval to traverse graph hints, boost relevant file paths, and return graph-node citations.
  • Add DB migration constraints/indexes plus a new Vitest suite and accompanying OpenSpec documentation/archive updates.

Reviewed changes

Copilot reviewed 13 out of 17 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/graph-enrichment.test.ts | New tests covering extraction stability, edge presence/directionality checks, retrieval behavior, and perf/volume thresholds. |
| src/db/migrations/0003_context_graph_enrichment.sql | Adds check constraints for node/edge enums and indexes for graph tables. |
| src/context/retrieval/hybrid.ts | Adds graph-hint traversal to retrieval, path boosting, and graph-node citations. |
| src/context/indexer/incremental.ts | Adds symbolEnrichmentEnabled option to incremental indexing API. |
| src/context/indexer/full-index.ts | Adds symbolEnrichmentEnabled option to full indexing API. |
| src/context/indexer/file-index.ts | Runs extraction/persistence + cleanup for graph enrichment during file upsert/delete. |
| src/context/graph/types.ts | Defines node/edge/symbol types and canonical symbol ID format. |
| src/context/graph/persist.ts | Implements persistence + deletion of graph enrichment per file (nodes + edges). |
| src/context/graph/extract.ts | Implements initial TS/JS regex-based symbol extraction and semantic edge generation. |
| src/context/graph/config.ts | Adds env-var gate for enabling symbol enrichment. |
| openspec/specs/context-graph-enrichment/spec.md | New canonical spec describing requirements for graph enrichment. |
| openspec/changes/deeper-context-graph-enrichment/tasks.md | Removes old change tasks doc (archived). |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/tasks.md | Archived tasks checklist marked completed. |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/specs/context-graph-enrichment/spec.md | Archived requirements excerpt. |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/proposal.md | Archived proposal for enrichment work. |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/design.md | Archived design doc (goals/decisions/risks). |
| openspec/changes/archive/2026-03-04-deeper-context-graph-enrichment/.openspec.yaml | Adds archived change metadata. |

trigger?: IncrementalTrigger;
embedProvider?: ProviderAdapter;
embeddingModel?: string;
symbolEnrichmentEnabled?: boolean;

Copilot AI Mar 4, 2026

runIncrementalIndex now accepts symbolEnrichmentEnabled, but when it falls back to runFullIndex the flag is not forwarded. This makes behavior depend on whether incremental fallback occurs, which is surprising and can disable enrichment unintentionally. Pass symbolEnrichmentEnabled through to the runFullIndex call in the fallback path.
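
A minimal sketch of the fix described in this comment is shown below. The function names, the trimmed option and result shapes, and the `needsFullFallback` parameter are hypothetical stand-ins for `runIncrementalIndex`, `runFullIndex`, and whatever condition triggers the real fallback; the stubs are synchronous for brevity, while the real functions are presumably async.

```typescript
// Hypothetical, trimmed-down shapes; only symbolEnrichmentEnabled comes from the PR.
interface IndexOptions {
  repoRoot: string;
  symbolEnrichmentEnabled?: boolean;
}

interface IndexResult {
  mode: 'incremental' | 'full';
  enrichmentRan: boolean;
}

// Stand-in for runFullIndex.
function runFullIndexSketch(options: IndexOptions): IndexResult {
  return { mode: 'full', enrichmentRan: options.symbolEnrichmentEnabled === true };
}

// Stand-in for runIncrementalIndex. `needsFullFallback` replaces the real
// fallback condition, which is not shown in this PR excerpt.
function runIncrementalIndexSketch(options: IndexOptions, needsFullFallback: boolean): IndexResult {
  if (needsFullFallback) {
    return runFullIndexSketch({
      repoRoot: options.repoRoot,
      // Forward the flag so the fallback behaves like the incremental path.
      symbolEnrichmentEnabled: options.symbolEnrichmentEnabled ?? false,
    });
  }
  return { mode: 'incremental', enrichmentRan: options.symbolEnrichmentEnabled === true };
}
```

With the flag forwarded, the enrichment outcome no longer depends on which indexing path happens to run.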

Copilot uses AI. Check for mistakes.
Comment on lines +188 to +195
await persistGraphEnrichmentForFile({
  db: input.db,
  repoRoot: input.repoRoot,
  path: relativePath,
  extraction,
});
for (const diagnostic of extraction.diagnostics) {
  console.info(`[indexer:graph] ${diagnostic}`);

Copilot AI Mar 4, 2026

When symbol enrichment is enabled, extraction/persistence is attempted for every indexed file and any diagnostics are logged. For non-JS/TS files this will always emit unsupported-language:* and still write a file node, which can create a lot of DB churn and noisy logs on mixed-language repos. Consider short-circuiting (e.g., check canExtractSymbols(relativePath) or extraction.symbols.length > 0) and/or downgrade/aggregate diagnostics logging.

Suggested change
- await persistGraphEnrichmentForFile({
-   db: input.db,
-   repoRoot: input.repoRoot,
-   path: relativePath,
-   extraction,
- });
- for (const diagnostic of extraction.diagnostics) {
-   console.info(`[indexer:graph] ${diagnostic}`);
+ // Short-circuit for files with no extracted symbols to avoid unnecessary
+ // persistence and noisy diagnostics (e.g. unsupported-language:*).
+ if (!extraction.symbols || extraction.symbols.length === 0) {
+   return {
+     status: 'indexed',
+     fileStatus,
+     chunksInserted: chunks.length,
+     chunksDeleted,
+   };
+ }
+ await persistGraphEnrichmentForFile({
+   db: input.db,
+   repoRoot: input.repoRoot,
+   path: relativePath,
+   extraction,
+ });
+ const diagnostics = extraction.diagnostics ?? [];
+ for (const diagnostic of diagnostics) {
+   const message = String(diagnostic);
+   if (message.startsWith('unsupported-language:')) {
+     continue;
+   }
+   console.info(`[indexer:graph] ${message}`);
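
The other two options named in the comment, a `canExtractSymbols(relativePath)` pre-check and aggregated diagnostics, could look roughly like the sketch below. Both helpers are hypothetical: the extension list is an assumption (the PR only says "TS/JS"), and the only diagnostic format taken from the source is the `unsupported-language:*` prefix.

```typescript
// Assumed set of TS/JS extensions; the PR only says "TS/JS".
const SYMBOL_EXTENSIONS = ['.ts', '.tsx', '.js', '.jsx', '.mjs', '.cjs'];

// Hypothetical pre-check named after the canExtractSymbols idea in the comment.
function canExtractSymbols(relativePath: string): boolean {
  const lower = relativePath.toLowerCase();
  return SYMBOL_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Collapses diagnostics such as "unsupported-language:md" into one count per
// prefix (the text before the first colon), so a run can log a single summary.
function summarizeDiagnostics(diagnostics: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const diagnostic of diagnostics) {
    const separator = diagnostic.indexOf(':');
    const prefix = separator === -1 ? diagnostic : diagnostic.slice(0, separator);
    counts.set(prefix, (counts.get(prefix) ?? 0) + 1);
  }
  return counts;
}
```

Checking the path before extraction would skip both the file-node write and the diagnostic for unsupported files, while the summary keeps some visibility into how many files were skipped.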

# context-graph-enrichment Specification

## Purpose
TBD - created by archiving change deeper-context-graph-enrichment. Update Purpose after archive.

Copilot AI Mar 4, 2026

The spec Purpose section is left as "TBD". Since this is now the canonical spec under openspec/specs/, it should state the actual purpose of context graph enrichment (at least a 1–2 sentence summary) rather than referencing an archived change.

Suggested change
- TBD - created by archiving change deeper-context-graph-enrichment. Update Purpose after archive.
+ This specification defines how the context graph is enriched with symbol-level nodes and normalized semantic relationship edges derived from source code. The goal is to enable precise retrieval, impact analysis, and other graph-based queries while preserving compatibility with existing file-level graph consumers.

Comment on lines +105 to +106
expect(edge.sourceKey).toContain('::file');
expect(edge.targetKey).not.toBe(edge.sourceKey);

Copilot AI Mar 4, 2026

The directionality assertions are very broad: the test currently requires all edges to have sourceKey containing ::file. That doesn't actually validate directionality per edge type, and it also conflicts with the spec language that references/calls edges may originate from a symbol context. Consider tightening this to assert directionality per edge type (e.g., defines should be file -> symbol) rather than hard-coding file as the source for every edge.

Suggested change
- expect(edge.sourceKey).toContain('::file');
- expect(edge.targetKey).not.toBe(edge.sourceKey);
+ switch (edge.type) {
+   case 'defines':
+   case 'imports':
+     // File-level edges should originate from a file context.
+     expect(edge.sourceKey).toContain('::file');
+     expect(edge.targetKey).not.toBe(edge.sourceKey);
+     break;
+   case 'references':
+   case 'calls':
+     // Reference/call edges may originate from a symbol context.
+     expect(edge.sourceKey).toContain('::symbol');
+     expect(edge.targetKey).not.toBe(edge.sourceKey);
+     break;
+   default:
+     // At minimum, no edge should be self-referential.
+     expect(edge.targetKey).not.toBe(edge.sourceKey);
+     break;
+ }

Comment on lines +198 to +203
WHERE n.payload->>'repoRoot' = $1
  AND (
    n.node_key = ANY($2::text[])
    OR n.payload->>'name' = ANY($2::text[])
    OR n.payload->>'path' = ANY($2::text[])
  )

Copilot AI Mar 4, 2026

traverseGraphHints filters context_nodes by n.payload->>'repoRoot' and then by payload->>'name' / payload->>'path'. There are indexes on node_type and node_key, but none on these JSONB fields, so this query will devolve into a full scan as the graph grows (especially with symbol nodes). Consider adding a functional index (e.g., on (payload->>'repoRoot') and/or (payload->>'name'), (payload->>'path')) or promoting these fields to dedicated columns for efficient lookups.

Comment on lines +17 to +18
'SELECT id FROM context_nodes WHERE node_key LIKE $1',
[`${prefix}%`]

Copilot AI Mar 4, 2026

persistGraphEnrichmentForFile scopes nodes using node_key LIKE $1 with a prefix derived from repoRoot/path. In SQL LIKE, % and _ in the prefix are wildcards, so repos/paths containing those characters can unintentionally match and delete other nodes/edges. Use a non-wildcard prefix match (e.g., LEFT(node_key, LENGTH($1)) = $1 with the raw prefix, or escape %/_ and add an explicit ESCAPE clause).

Suggested change
- 'SELECT id FROM context_nodes WHERE node_key LIKE $1',
- [`${prefix}%`]
+ 'SELECT id FROM context_nodes WHERE LEFT(node_key, LENGTH($1)) = $1',
+ [prefix]
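
The escape-based alternative mentioned in the comment could be sketched as follows. `escapeLikePrefix` and `buildPrefixQuery` are hypothetical helpers, not functions from this PR; the sketch assumes a PostgreSQL-style parameterized query with backslash as the explicit `ESCAPE` character.

```typescript
// Escapes the LIKE metacharacters (% and _) and the escape character itself,
// so the prefix is matched literally. A replacer function is used so that "$"
// in the input is never interpreted as a replacement pattern.
function escapeLikePrefix(rawPrefix: string): string {
  return rawPrefix.replace(/[\\%_]/g, (ch) => '\\' + ch);
}

// Hypothetical query builder: returns the SQL text and its bound values.
// Only the trailing % appended here acts as a wildcard.
function buildPrefixQuery(prefix: string): { text: string; values: string[] } {
  return {
    text: "SELECT id FROM context_nodes WHERE node_key LIKE $1 ESCAPE '\\'",
    values: [escapeLikePrefix(prefix) + '%'],
  };
}
```

For a prefix such as `my_repo::src/50%.ts::`, the bound value becomes `my\_repo::src/50\%.ts::%`, so sibling keys like `myXrepo::...` no longer match. PostgreSQL's `starts_with(string, prefix)` function (available since version 11) is another wildcard-free option.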

Comment on lines +103 to +104
'SELECT id FROM context_nodes WHERE node_key LIKE $1',
[`${prefix}%`]

Copilot AI Mar 4, 2026

deleteGraphEnrichmentForFile uses node_key LIKE $1 with a prefix built from repoRoot/path. Because _/% are wildcards in LIKE, this can delete graph enrichment for other files if the repo path contains those characters. Prefer a deterministic prefix match (or properly escape the prefix + specify an ESCAPE clause).

Suggested change
- 'SELECT id FROM context_nodes WHERE node_key LIKE $1',
- [`${prefix}%`]
+ 'SELECT id FROM context_nodes WHERE LEFT(node_key, LENGTH($1)) = $1',
+ [prefix]

CHECK (edge_type IN ('defines', 'references', 'imports', 'calls'));

CREATE INDEX IF NOT EXISTS idx_context_nodes_node_type ON context_nodes(node_type);
CREATE INDEX IF NOT EXISTS idx_context_nodes_node_key ON context_nodes(node_key);

Copilot AI Mar 4, 2026

context_nodes.node_key is already UNIQUE (created in 0001_init.sql), which implicitly creates an index. Creating an additional idx_context_nodes_node_key index is redundant and adds extra write/storage overhead. Consider removing this index (or replacing the UNIQUE constraint/index if you intended a different index shape).

Suggested change
- CREATE INDEX IF NOT EXISTS idx_context_nodes_node_key ON context_nodes(node_key);
