Skip to content

feat: add recall dedup — non-destructive dedup with provenance-aware survivor selection#60

Merged
edheltzel merged 2 commits into
mainfrom
feat/45-dedup
Jun 11, 2026
Merged

feat: add recall dedup — non-destructive dedup with provenance-aware survivor selection#60
edheltzel merged 2 commits into
mainfrom
feat/45-dedup

Conversation

@edheltzel

Copy link
Copy Markdown
Owner

Closes #45

What

Adds recall dedup: duplicate detection and marking across the five memory tables, built on the provenance column (#42), shared chunking (#41), and the export surface (#43).

Safety model (binding tracker principles)

  • Dry-run by default — mutations require --execute.
  • Non-destructive by default--execute writes lineage rows to the new dedup_lineage table (survivor table/id, duplicate table/id, reason, similarity, status, created_at, audit detail JSON); duplicate records stay intact and are hidden from search. --delete is the destructive opt-in, requires --execute, and the CLI recommends recall export --backup first.
  • Within-table (and within-project) only — cross-table duplicate candidates are report-only, never acted on.
  • Survivor priority user_authored > verbatim > extracted > derived > unknown from PROVENANCE_VALUES in src/types/index.ts (single source of truth); ties break by richness (normalized length), importance, recency, then lowest id (total deterministic order).

Detection

  • Exact/normalized: sha256 over normalized text (lowercase, quotes stripped, whitespace collapsed), grouped per table + project. Records under 20 significant chars are never candidates (protects short messages like "ok"). Only active decisions participate, so a superseded decision can never out-survive and hide an active one.
  • Semantic: pairwise cosine over stored embeddings — no embedding-service call needed; skip is clearly reported when no embeddings exist (or --no-semantic). Conservative default threshold 0.95 (--threshold to tune); below-threshold pairs are never merged, and pairs are processed strongest-first with no transitive chaining — every marked duplicate has a direct similarity ≥ threshold to its survivor.

Search integration

Marked duplicates are hidden by default from every path — FTS5 search(), recall semantic, recall hybrid, and the MCP hybridSearch vector pass — via a shared notMarkedDuplicateSql() fragment. recall search --include-duplicates shows them again.

Schema

Notes for review

  • Destructive mode + FKs: with foreign_keys=ON, hard-deleting a message referenced by an LoA range (or an LoA entry referenced as parent) would abort the transaction. Such duplicates are downgraded to marked and reported as FK-protected instead.
  • Input-scaled SQL: scans use keyset pagination (fixed binds); destructive IN-list deletes go through chunked() per the src/lib/chunk.ts audit-note convention.
  • Latent bug fix: blobToEmbedding called readFloatLE on bun:sqlite BLOBs, which are Uint8Array (no such method) — first exercised by dedup's tests; fixed at the single source of truth.
  • Issue Add focused property-based tests for import, chunking, export, and dedup invariants #44 phase 2: decision logic (normalization, survivor selection, grouping, semantic pairing) is pure exported functions in src/lib/dedup.ts, ready for property suites — the suites themselves are intentionally not included here.

Testing

  • bun run lint clean; bun run build clean; full suite 574 pass / 0 fail.
  • New tests: tests/lib/dedup.test.ts (survivor order + tie-breaks, normalization, grouping, semantic pairing) and tests/commands/dedup.test.ts (dry-run vs execute, exact detection, semantic-skip reporting, threshold behavior with synthetic embeddings, idempotence, cross-table report-only, search hide/include, lineage persistence, destructive delete + FK protection).
  • Manual CLI smoke: init → seed → dry-run → execute → search hide/include → --no-semantic--delete guard.

Docs updated: docs/cli-reference.md (Dedup section + search flag), docs/architecture.md (table + migration), CHANGELOG.md.

…survivor selection

Dry-run by default; --execute marks duplicates in the new dedup_lineage
table (schema migration 9→10) without touching the records; --delete is
the destructive opt-in (recall export --backup recommended first).

- Exact/normalized detection within table + project; cross-table
  candidates are report-only
- Semantic detection over stored embeddings (pairwise cosine, no
  embedding service call; conservative 0.95 default threshold; skip is
  reported when no embeddings exist; no transitive chaining)
- Survivor priority user_authored > verbatim > extracted > derived >
  unknown (PROVENANCE_VALUES), then richness, importance, recency, id
- Marked duplicates hidden from all search paths (FTS5, semantic,
  hybrid, MCP) unless --include-duplicates is passed
- dedup_lineage included in recall export for a portable audit trail
- Destructive deletes go through chunked() per the chunk.ts audit note;
  FK-referenced duplicates (LoA message ranges, LoA parents) are kept
  as marked instead of failing the transaction
- Fix latent blobToEmbedding crash on bun:sqlite Uint8Array blobs

Closes #45
@edheltzel

Copy link
Copy Markdown
Owner Author

Review — PR #60 recall dedup (head 7c34cd8)

Reviewed the full diff, verified every binding tracker principle against the code and the live CLI (scratch DBs seeded with exact, semantic, cross-table, cross-project, too-short, superseded-decision, and FK-pinned duplicates), ran five specialist review passes, and confirmed CI green on the head commit plus a local bun run lint / bun run build / bun test574 pass / 0 fail.

Most of the safety model holds up exactly as advertised — dry-run is provably write-free (byte-identical .dump before/after), --execute preserves every record, the --delete guard works, survivor priority comes from PROVENANCE_VALUES, hiding works across FTS5 / recall semantic / recall hybrid / MCP vector pass, migration 9→10 matches the 3→4 idiom, and the chunk.ts bind-count conventions are followed. But there are two blockers, one of them a real violation of a binding principle.


🛑 Blocking 1 — Transitive chaining is possible; destructive mode can delete content whose only surviving neighbor is below threshold

The header of src/lib/dedup.ts promises "Pairs are never chained transitively: every marked duplicate has a direct similarity >= threshold to its survivor." The greedy guard in planDedup only checks plannedDuplicateKeys — records already planned as duplicates. It never checks whether a pair member is already a planned survivor. A survivor can therefore be re-marked by a later, weaker pair in the same run.

Reproduced on a scratch DB (threshold 0.95): three breadcrumbs with stored embeddings at cos(A,B)=0.99, cos(B,C)=0.96, cos(A,C)≈0.91:

breadcrumbs: scanned 3, semantic pairs 2, mark 2
  #1 → survivor #2 [semantic @ 0.990]
  #2 → survivor #3 [semantic @ 0.960]     ← #1's survivor is itself marked

The same hole applies across passes: an exact-pass survivor can be re-marked by the semantic pass.

Fix is small: track planned survivors (e.g. a plannedSurvivorKeys set populated in both passes) and skip any semantic pair whose member is already a planned survivor — strongest-first then guarantees one-hop-only lineage. Please add the A/B/C three-vector regression test (the construction above is ~15 lines) so the invariant is pinned.

🛑 Blocking 2 — src/lib/dedup.ts contains raw NUL bytes and is binary to git

The groupKey separators at src/lib/dedup.ts:132 and :162 are literal U+0000 bytes embedded in the template strings (byte offsets 5357, 5376, 6335). Git classifies the PR's central file as binary (Bin 0 -> 22314 bytes): it cannot be diffed, blamed, or reviewed on GitHub, and editors render the separator invisibly, so a future "whitespace fix" would silently change grouping semantics.

Fix: write the separator as the escape sequence \x00 (or \u0000) in the template literal — identical runtime string, plain-text file — or switch to a printable delimiter that cannot appear in a sha256 hex key (e.g. |). Add a word to the groupKey comment naming the separator choice.


Important (fix here or file follow-ups before merge)

  1. Mark-then-delete escalation is a no-op (verified live): after recall dedup --execute, a subsequent --execute --delete deletes 0 records — previously marked duplicates are excluded from scans and there is no marked → deleted upgrade path. The docs list the two commands as escalating steps without saying this. Either document it explicitly in docs/cli-reference.md ("--delete only affects duplicates found in that run; already-marked records stay marked") or decide the upgrade path now. The output does show "already marked" counts, so it's visible — but it defeats the natural mark→review→delete workflow users will expect.
  2. Un-migrated DB ⇒ silently empty search. getDb() never runs DDL; only recall init creates dedup_lineage. On a DB that missed init (update.sh --no-migrate, manual git pull && bun run build, restored older DB), every per-table FTS query now fails on the NOT EXISTS subquery, the per-table catch swallows it into lastSearchErrors — which has zero production callers — and recall search / MCP memory_search return a clean "No results found." A memory product looking empty with no evidence is the worst failure shape. Related: the PR moved a DB query inside MCP hybridSearch's pre-existing bare catch {} (and embeddingsAvailable = true is set before the query), so DB errors there now masquerade as "embedding service unavailable." Cheap asks: surface lastSearchErrors when results are empty, add a dedup_lineage existence check to recall doctor, and narrow the MCP catch. Supported upgrade path (update.shrecall init) is safe, so follow-up issues are acceptable — but please file them.
  3. Test gaps on promised surfaces: the docs promise hiding on keyword/semantic/hybrid, but only FTS5 search() is tested — embeddingsWhere() (both runSemanticSearch/runHybridSearch call sites) and the MCP vector-path filter have zero coverage (I verified them manually against a scratch DB; they work today). Also untested: a mixed destructive run (some FK-protected, some deleted — verified manually: {deleted: 3, fkProtected: 1}, PRAGMA foreign_key_check clean), and the chain regression from Blocking 1.

Minor / suggestions (non-blocking)

  • LineageEntry.similarity: number | null is never null (exact ⇒ 1.0) — narrow the type or document; the null branch in the CLI renderer is dead code.
  • Threshold (0, 1] validation lives only in the CLI; add the one-line guard to planDedup to protect future lib callers. Also src/index.ts help text says "(0-1)" while validation accepts (0, 1].
  • Migration 9→10 comment says the DDL handles "dedup_lineage and its indexes" before migrations — the indexes are in CREATE_INDEXES, which runs after applyMigrations (deliberately — see connection.ts:90-92). One-word fix so future migration authors don't assume the unique index exists mid-migration.
  • docs/cli-reference.md: state explicitly that --include-duplicates exists only on recall search; semantic/hybrid have no escape hatch (correct behavior, just say it).
  • fkProtectedIds is a hand-maintained mirror of the schema's inbound FKs (currently complete — verified against every FOREIGN KEY clause). Add a schema test asserting FKs referencing DEDUP_TABLES match the hardcoded set, so a future FK fails in CI instead of aborting a user's destructive transaction.
  • applyDedupPlan failures (e.g. UNIQUE collision from a concurrent run) roll back correctly but surface as a raw stack after the "EXECUTE" banner — wrap with "transaction rolled back, no changes written."
  • embeddingsWhere() in embed.ts preserves the pre-existing source_table = '${table}' interpolation (verified identical on main — not new exposure). Follow-up: validate against the table allowlist like runDedup does.
  • normalizeText docstring's "log-level dedup" names nothing in the hooks; point at the write-time decision/error-pattern dedup in RecallExtract.ts and note the deliberate DRY carve-out.
  • recall recent (and MCP memory_recall) still show marked duplicates — consistent with the documented "search paths" scope; fine, just noting the line.

Rulings on the four flagged judgment calls

  1. Active-decisions-only participation — correct. search() already filters decisions to status='active', so deduping superseded copies would add lineage noise with zero search benefit, and a superseded row must never out-survive and hide an active one. Verified live: active + superseded same-text pair → scanned 1, marked 0.
  2. 20-char minimum — accept. Measured on normalized text, protects legitimately repeating acknowledgments, is reported per table ("N too short") so it's observable in dry-run, and its failure direction is benign (record stays visible). Verified live.
  3. FK-referenced duplicates downgraded to marked — correct. PRAGMA foreign_keys = ON is real (connection.ts:59,80), the inbound-FK inventory is complete today (loa_entries → messages ranges, loa_entries → loa_entries parent; decisions.superseded_by is not an FK), and aborting the whole transaction would make destructive mode unusable on any DB with LoA ranges. The duplicate ends up hidden either way, and the CLI reports it ("kept as marked — referenced by LoA lineage"). Verified live with a mixed run; foreign_key_check clean. The schema-drift test above is the one hardening ask.
  4. blobToEmbedding Uint8Array fix — correct, minimal, and required (not scope creep). Verified empirically: bun:sqlite returns plain Uint8Array for BLOBs (Buffer.isBuffer false, no readFloatLE), so the old code throws TypeError on any real DB read — recall dedup's loadEmbeddings is the first code path with test coverage that reads embeddings back from SQLite, which is exactly why it surfaced here. The zero-copy Buffer.from(blob.buffer, blob.byteOffset, blob.byteLength) wrap correctly handles views, and fixing it at the single source of truth is the DRY-mandated move. Nine lines, no callers touched.

Verification record

  • CI: macOS primary ✅ and Ubuntu compatibility smoke ✅ on 7c34cd8.
  • Local (worktree at head): bun run lint ✅, bun run build ✅, bun test → 574 pass / 0 fail.
  • Live CLI on scratch DBs: dry-run write-free (dump-diff identical) ✅ · --execute preserves all records ✅ · --delete without --execute rejected (exit 1) ✅ · search hide + --include-duplicates ✅ · semantic/hybrid/MCP vector SQL excludes marked rows ✅ · idempotent re-run ✅ · destructive mixed FK run ✅ · transitive-chain repro ❌ (Blocking 1) · mark-then-delete escalation no-op ⚠️ (Important 1).

Verdict: REQUEST CHANGES — Blocking 1 (one-set fix + regression test) and Blocking 2 (escape-sequence separators) are both small; everything else is solid work with an unusually well-tested safety model, and I expect this to be an easy re-approve.

…ary dedup.ts

Resolves both blockers from the PR #60 review (comment 4676730890):

- planDedup now tracks plannedSurvivorKeys in both passes and skips any
  semantic pair whose member is already a planned survivor, so lineage
  stays one hop deep — no transitive chaining within or across passes.
  Regression test pins the reviewer's A-B-C repro (cos 0.99/0.96/0.91
  at the default 0.95 threshold).
- The raw NUL groupKey separators are now \x00 escape sequences, so git
  diffs src/lib/dedup.ts as text; the groupKey comment names the
  separator choice.
@edheltzel

Copy link
Copy Markdown
Owner Author

Re-review — fix delta 7c34cd8167f1b8

Independently verified the fix commit in a detached worktree at 167f1b8. Both blockers from the original review are genuinely resolved.

🟢 Blocking 1 — transitive chaining: resolved

  • Code: plannedSurvivorKeys is populated in the exact pass (dedup.ts:476) and the semantic pass (:532), and any semantic pair with a planned-survivor member is skipped (:526). The guard is conservative in the safe direction — a planned survivor can neither be re-marked nor survive a second weaker pair — so lineage is provably one hop. The one-hop guarantee depends on pair ordering, and findSemanticPairs does sort strongest-first with deterministic tie-breaks (:224-230).
  • Live repro (destructive mode): re-ran the exact A/B/C scenario from the original review on a scratch DB — cos(A,B)=0.99, cos(B,C)=0.96, cos(A,C)≈0.91, threshold 0.95, --execute --delete:
    breadcrumbs: scanned 3, exact groups 0, semantic pairs 1, delete 1, unchanged 2
      #1 → survivor #2 [semantic @ 0.990]
    Deleted 1 duplicate(s).
    
    End state: breadcrumbs Add SessionStart hook for automatic memory loading #2 and refactor(hooks): wire quality gate library and add decision status filtering #3 remain; one lineage row (#1 → #2, semantic, 0.99, deleted). The B/C pair at 0.96 is skipped because Add SessionStart hook for automatic memory loading #2 already survives. The buggy build deleted both create npm package for eady install #1 and Add SessionStart hook for automatic memory loading #2 on this data.
  • Regression test: the new test pins this exact scenario plus a general no-survivor-is-a-duplicate invariant. Verified red-first empirically: with 7c34cd8's dedup.ts swapped in, the test fails (18 pass / 1 fail); at 167f1b8 it passes.

🟢 Blocking 2 — NUL bytes: resolved

  • Zero NUL bytes in src/lib/dedup.ts at 167f1b8 (byte scan); file(1) reports plain UTF-8 text.
  • PR diff vs main now shows src/lib/dedup.ts | 629 +++… — line counts, not Bin. (The delta diff 7c34cd8..167f1b8 still displays Bin — expected, since the old blob is binary; the PR-facing diff is the one that matters and it is text.)
  • Separators are now \x00 escape sequences — runtime-identical strings — and the groupKey doc comment names the separator choice, as asked.

Scope & CI

  • Exactly two files touched (src/lib/dedup.ts, tests/commands/dedup.test.ts +31): no scope creep; the three Important items and all minor suggestions from the original review remain untouched, as expected for follow-up material.
  • CI green on 167f1b8: macOS primary ✅, Ubuntu compatibility smoke ✅. Local at head: bun run lint ✅, bun run build ✅, bun test575 pass / 0 fail.

Outstanding (non-blocking, please file at merge)

The original review asked for follow-up issues on Important 1–3 (mark→delete escalation semantics/docs, un-migrated-DB silently-empty search surfacing, test coverage for semantic/hybrid/MCP hiding). No such issues exist yet — they should be filed when this merges so they don't evaporate.

Verdict: APPROVE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add non-destructive dedup with provenance-aware survivor selection

1 participant