Multi-layer node deduplication system by marcelsamyn · Pull Request #29 · marcelsamyn/assistant-memory

marcelsamyn · 2026-04-11T16:49:27Z

Summary

Prevention layer: exact-match check on (userId, nodeType, canonicalLabel) before inserting new nodes — stops ~80% of duplicates at the source
Dedup sweep: deterministic job that finds all exact-label duplicates via GROUP BY + HAVING, merges them (rewire edges/source links, delete duplicates) — cleans existing data, no LLM needed
Schema hardening: canonicalLabel column on nodeMetadata with index, backfilled from existing labels via migration
Better extraction context: all Person nodes included in LLM extraction context (not just embedding-similar), deduped and capped at 150 nodes

The dedup sweep runs automatically after every conversation/document ingestion and before LLM-based cleanup. Also exposed as POST /cleanup/dedup-sweep.

Test plan

normalizeLabel unit tests (6 tests passing)
All 23 existing tests passing
No new type errors introduced (all errors pre-existing)
Verify dedup sweep endpoint with real data: POST /cleanup/dedup-sweep { "userId": "user_..." }
Ingest a conversation mentioning someone already in the graph — verify no duplicate Person node created
Run cleanup-graph job — verify dedup sweep runs first

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces a deterministic deduplication mechanism for graph nodes by adding a canonical_label field to the node_metadata table. Key changes include a database migration for the new column and index, a normalization utility that handles whitespace and casing, and a new background job (dedup-sweep) to merge existing duplicates. Feedback focuses on ensuring consistency between the SQL backfill logic and the application-level normalization, as well as optimizing node insertion by batching database queries to avoid N+1 performance issues.

gemini-code-assist · 2026-04-11T16:50:50Z

@@ -0,0 +1,4 @@
+ALTER TABLE "node_metadata" ADD COLUMN "canonical_label" text;--> statement-breakpoint
+-- Backfill canonical_label from existing labels
+UPDATE "node_metadata" SET "canonical_label" = lower(trim("label")) WHERE "label" IS NOT NULL;--> statement-breakpoint


The SQL backfill logic lower(trim("label")) does not collapse multiple spaces, which is inconsistent with the normalizeLabel function used in the application code (replace(/\s+/g, " ")). This discrepancy will cause the exact-match deduplication to fail for nodes that contain multiple consecutive spaces in their labels. Using regexp_replace with the 'g' flag ensures consistency with the application-level normalization.

UPDATE "node_metadata" SET "canonical_label" = regexp_replace(lower(trim("label")), '\s+', ' ', 'g') WHERE "label" IS NOT NULL;--> statement-breakpoint

gemini-code-assist · 2026-04-11T16:50:50Z

+    const [existingNode] = await db
+      .select({ id: nodes.id, label: nodeMetadata.label })
+      .from(nodes)
+      .innerJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id))
+      .where(
+        and(
+          eq(nodes.userId, userId),
+          eq(nodes.nodeType, llmNode.type),
+          eq(nodeMetadata.canonicalLabel, canonical),
+        ),
+      )
+      .limit(1);


Performing a database query for each node returned by the LLM inside a loop is inefficient (N-query problem). Consider batching this check by collecting all canonical labels from the LLM output and performing a single query using inArray before the loop to identify existing nodes in bulk.

Addresses duplicate node proliferation (especially Person nodes) through four complementary layers: 1. Prevention: exact-match check on (userId, nodeType, canonicalLabel) before inserting new nodes in extractGraph 2. Cleanup: deterministic dedup sweep job that finds and merges all exact-label duplicates via SQL GROUP BY + HAVING 3. Schema: canonicalLabel column on nodeMetadata with index for fast lookups, backfilled from existing labels 4. Context: extraction now includes all Person nodes (not just embedding-similar), capped at 150 nodes total The dedup sweep runs automatically after every conversation/document ingestion and before LLM-based cleanup. Also exposed as POST /cleanup/dedup-sweep endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Apr 11, 2026

View reviewed changes

marcelsamyn force-pushed the feat/dedup-system branch from 698db8d to 1178cfd Compare April 11, 2026 17:39

🎨 style: fix prettier formatting

de17039

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

marcelsamyn merged commit 331c097 into main Apr 11, 2026
1 check passed

marcelsamyn deleted the feat/dedup-system branch April 11, 2026 17:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-layer node deduplication system#29

Multi-layer node deduplication system#29
marcelsamyn merged 2 commits intomainfrom
feat/dedup-system

marcelsamyn commented Apr 11, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 11, 2026

Uh oh!

gemini-code-assist Bot Apr 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marcelsamyn commented Apr 11, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant