Multi-layer node deduplication system#29
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a deterministic deduplication mechanism for graph nodes by adding a canonical_label field to the node_metadata table. Key changes include a database migration for the new column and index, a normalization utility that handles whitespace and casing, and a new background job (dedup-sweep) to merge existing duplicates. Feedback focuses on ensuring consistency between the SQL backfill logic and the application-level normalization, as well as optimizing node insertion by batching database queries to avoid N+1 performance issues.
| @@ -0,0 +1,4 @@ | |||
| ALTER TABLE "node_metadata" ADD COLUMN "canonical_label" text;--> statement-breakpoint | |||
| -- Backfill canonical_label from existing labels | |||
| UPDATE "node_metadata" SET "canonical_label" = lower(trim("label")) WHERE "label" IS NOT NULL;--> statement-breakpoint | |||
There was a problem hiding this comment.
The SQL backfill logic lower(trim("label")) does not collapse multiple spaces, which is inconsistent with the normalizeLabel function used in the application code (replace(/\s+/g, " ")). This discrepancy will cause the exact-match deduplication to fail for nodes that contain multiple consecutive spaces in their labels. Using regexp_replace with the 'g' flag ensures consistency with the application-level normalization.
UPDATE "node_metadata" SET "canonical_label" = regexp_replace(lower(trim("label")), '\s+', ' ', 'g') WHERE "label" IS NOT NULL;--> statement-breakpoint| const [existingNode] = await db | ||
| .select({ id: nodes.id, label: nodeMetadata.label }) | ||
| .from(nodes) | ||
| .innerJoin(nodeMetadata, eq(nodeMetadata.nodeId, nodes.id)) | ||
| .where( | ||
| and( | ||
| eq(nodes.userId, userId), | ||
| eq(nodes.nodeType, llmNode.type), | ||
| eq(nodeMetadata.canonicalLabel, canonical), | ||
| ), | ||
| ) | ||
| .limit(1); |
There was a problem hiding this comment.
Addresses duplicate node proliferation (especially Person nodes) through four complementary layers: 1. Prevention: exact-match check on (userId, nodeType, canonicalLabel) before inserting new nodes in extractGraph 2. Cleanup: deterministic dedup sweep job that finds and merges all exact-label duplicates via SQL GROUP BY + HAVING 3. Schema: canonicalLabel column on nodeMetadata with index for fast lookups, backfilled from existing labels 4. Context: extraction now includes all Person nodes (not just embedding-similar), capped at 150 nodes total The dedup sweep runs automatically after every conversation/document ingestion and before LLM-based cleanup. Also exposed as POST /cleanup/dedup-sweep endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
698db8d to
1178cfd
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
(userId, nodeType, canonicalLabel)before inserting new nodes — stops ~80% of duplicates at the sourceGROUP BY+HAVING, merges them (rewire edges/source links, delete duplicates) — cleans existing data, no LLM neededcanonicalLabelcolumn onnodeMetadatawith index, backfilled from existing labels via migrationThe dedup sweep runs automatically after every conversation/document ingestion and before LLM-based cleanup. Also exposed as
POST /cleanup/dedup-sweep.Test plan
normalizeLabelunit tests (6 tests passing)POST /cleanup/dedup-sweep { "userId": "user_..." }🤖 Generated with Claude Code