Multi-layer node deduplication system#29

Merged
marcelsamyn merged 2 commits into main from feat/dedup-system
Apr 11, 2026

Conversation

@marcelsamyn (Owner)
Summary

  • Prevention layer: exact-match check on (userId, nodeType, canonicalLabel) before inserting new nodes — stops ~80% of duplicates at the source
  • Dedup sweep: deterministic job that finds all exact-label duplicates via GROUP BY + HAVING, merges them (rewire edges/source links, delete duplicates) — cleans existing data, no LLM needed
  • Schema hardening: canonicalLabel column on nodeMetadata with index, backfilled from existing labels via migration
  • Better extraction context: all Person nodes included in LLM extraction context (not just embedding-similar), deduped and capped at 150 nodes

The dedup sweep runs automatically after every conversation/document ingestion and before LLM-based cleanup. Also exposed as POST /cleanup/dedup-sweep.
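The grouping step of the dedup sweep can be sketched in memory (a hedged sketch: the `NodeRow` shape and field names are assumptions based on the schema described above, and the real job does this in SQL via GROUP BY + HAVING rather than in application code):

```typescript
interface NodeRow {
  id: string;
  userId: string;
  nodeType: string;
  canonicalLabel: string;
}

// Group nodes by the dedup key (userId, nodeType, canonicalLabel) and keep
// only groups with more than one member — the in-memory analogue of
// GROUP BY ... HAVING count(*) > 1.
function findDuplicateGroups(rows: NodeRow[]): NodeRow[][] {
  const groups = new Map<string, NodeRow[]>();
  for (const row of rows) {
    const key = `${row.userId}\u0000${row.nodeType}\u0000${row.canonicalLabel}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(row);
    groups.set(key, bucket);
  }
  return [...groups.values()].filter((group) => group.length > 1);
}
```

Each group would then be merged as the summary describes: edges and source links rewired to one surviving node, the rest deleted.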

Test plan

  • normalizeLabel unit tests (6 tests passing)
  • All 23 existing tests passing
  • No new type errors introduced (all errors pre-existing)
  • Verify dedup sweep endpoint with real data: POST /cleanup/dedup-sweep { "userId": "user_..." }
  • Ingest a conversation mentioning someone already in the graph — verify no duplicate Person node created
  • Run cleanup-graph job — verify dedup sweep runs first

🤖 Generated with Claude Code


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a deterministic deduplication mechanism for graph nodes by adding a canonical_label field to the node_metadata table. Key changes include a database migration for the new column and index, a normalization utility that handles whitespace and casing, and a new background job (dedup-sweep) to merge existing duplicates. Feedback focuses on ensuring consistency between the SQL backfill logic and the application-level normalization, as well as optimizing node insertion by batching database queries to avoid N+1 performance issues.

Comment thread drizzle/0008_worthless_bullseye.sql Outdated
@@ -0,0 +1,4 @@
```sql
ALTER TABLE "node_metadata" ADD COLUMN "canonical_label" text;--> statement-breakpoint
-- Backfill canonical_label from existing labels
UPDATE "node_metadata" SET "canonical_label" = lower(trim("label")) WHERE "label" IS NOT NULL;--> statement-breakpoint
```


high

The SQL backfill logic lower(trim("label")) does not collapse multiple spaces, which is inconsistent with the normalizeLabel function used in the application code (replace(/\s+/g, " ")). This discrepancy will cause the exact-match deduplication to fail for nodes that contain multiple consecutive spaces in their labels. Using regexp_replace with the 'g' flag ensures consistency with the application-level normalization.

```sql
UPDATE "node_metadata" SET "canonical_label" = regexp_replace(lower(trim("label")), '\s+', ' ', 'g') WHERE "label" IS NOT NULL;--> statement-breakpoint
```
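The mismatch the reviewer flags is easy to demonstrate. Below is a hedged sketch: `normalizeLabel` is reconstructed from the regex quoted in the comment, not copied from the PR, and `sqlBackfillOriginal` mirrors what `lower(trim("label"))` alone would produce:

```typescript
// App-level normalization as described in the review:
// trim, lowercase, collapse runs of whitespace to a single space.
const normalizeLabel = (label: string): string =>
  label.trim().toLowerCase().replace(/\s+/g, " ");

// What the original SQL backfill lower(trim("label")) produces:
// trimmed and lowercased, but internal whitespace left untouched.
const sqlBackfillOriginal = (label: string): string =>
  label.trim().toLowerCase();

const label = "Ada   Lovelace";
normalizeLabel(label);      // "ada lovelace"
sqlBackfillOriginal(label); // "ada   lovelace" — the exact-match lookup would miss
```

Any label with consecutive internal spaces would get a different canonical form from the backfill than from new inserts, so prevention and sweep would silently disagree on which nodes are duplicates.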

Comment thread src/lib/extract-graph.ts Outdated
Comment on lines +336 to +347
```typescript
const [existingNode] = await db
  .select({ id: nodes.id, label: nodeMetadata.label })
  .from(nodes)
  .innerJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
  .where(
    and(
      eq(nodes.userId, userId),
      eq(nodes.nodeType, llmNode.type),
      eq(nodeMetadata.canonicalLabel, canonical),
    ),
  )
  .limit(1);
```


medium

Performing a database query for each node returned by the LLM inside a loop is inefficient (the N+1 query problem). Consider batching this check by collecting all canonical labels from the LLM output and performing a single query using inArray before the loop to identify existing nodes in bulk.
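The batched alternative could look roughly like this (a hedged sketch: `fetchExisting` and `findExistingByLabel` are hypothetical names, with `fetchExisting` standing in for a single drizzle `inArray` query scoped to `(userId, nodeType)`):

```typescript
interface ExistingNode {
  id: string;
  canonicalLabel: string;
}

// One lookup for the whole batch instead of one query per LLM node.
// `fetchExisting` is a stand-in for a single
// SELECT ... WHERE canonical_label IN (...) query.
async function findExistingByLabel(
  canonicalLabels: string[],
  fetchExisting: (labels: string[]) => Promise<ExistingNode[]>,
): Promise<Map<string, string>> {
  const unique = [...new Set(canonicalLabels)];
  const rows = await fetchExisting(unique);
  // Map canonical label -> existing node id for O(1) checks in the loop.
  return new Map(rows.map((r): [string, string] => [r.canonicalLabel, r.id]));
}
```

Inside the insertion loop, `existing.get(canonical)` then replaces the per-node `select ... limit(1)` round trip.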

Addresses duplicate node proliferation (especially Person nodes) through
four complementary layers:

1. Prevention: exact-match check on (userId, nodeType, canonicalLabel)
   before inserting new nodes in extractGraph
2. Cleanup: deterministic dedup sweep job that finds and merges all
   exact-label duplicates via SQL GROUP BY + HAVING
3. Schema: canonicalLabel column on nodeMetadata with index for fast
   lookups, backfilled from existing labels
4. Context: extraction now includes all Person nodes (not just
   embedding-similar), capped at 150 nodes total

The dedup sweep runs automatically after every conversation/document
ingestion and before LLM-based cleanup. Also exposed as POST
/cleanup/dedup-sweep endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@marcelsamyn marcelsamyn merged commit 331c097 into main Apr 11, 2026
1 check passed
@marcelsamyn marcelsamyn deleted the feat/dedup-system branch April 11, 2026 17:54
