…survivor selection
Dry-run by default; --execute marks duplicates in the new dedup_lineage
table (schema migration 9→10) without touching the records; --delete is
the destructive opt-in (recall export --backup recommended first).
- Exact/normalized detection within table + project; cross-table
candidates are report-only
- Semantic detection over stored embeddings (pairwise cosine, no
embedding service call; conservative 0.95 default threshold; skip is
reported when no embeddings exist; no transitive chaining)
- Survivor priority user_authored > verbatim > extracted > derived >
unknown (PROVENANCE_VALUES), then richness, importance, recency, id
- Marked duplicates hidden from all search paths (FTS5, semantic,
hybrid, MCP) unless --include-duplicates is passed
- dedup_lineage included in recall export for a portable audit trail
- Destructive deletes go through chunked() per the chunk.ts audit note;
FK-referenced duplicates (LoA message ranges, LoA parents) are kept
as marked instead of failing the transaction
- Fix latent blobToEmbedding crash on bun:sqlite Uint8Array blobs
Closes #45
Closes #45
What
Adds
recall dedup: duplicate detection and marking across the five memory tables, built on the provenance column (#42), shared chunking (#41), and the export surface (#43).Safety model (binding tracker principles)
--execute.--executewrites lineage rows to the newdedup_lineagetable (survivor table/id, duplicate table/id, reason, similarity, status, created_at, audit detail JSON); duplicate records stay intact and are hidden from search.--deleteis the destructive opt-in, requires--execute, and the CLI recommendsrecall export --backupfirst.user_authored > verbatim > extracted > derived > unknownfromPROVENANCE_VALUESinsrc/types/index.ts(single source of truth); ties break by richness (normalized length), importance, recency, then lowest id (total deterministic order).Detection
activedecisions participate, so a superseded decision can never out-survive and hide an active one.--no-semantic). Conservative default threshold 0.95 (--thresholdto tune); below-threshold pairs are never merged, and pairs are processed strongest-first with no transitive chaining — every marked duplicate has a direct similarity ≥ threshold to its survivor.Search integration
Marked duplicates are hidden by default from every path — FTS5
search(),recall semantic,recall hybrid, and the MCPhybridSearchvector pass — via a sharednotMarkedDuplicateSql()fragment.recall search --include-duplicatesshows them again.Schema
dedup_lineagetable + partial unique index on(duplicate_table, duplicate_id) WHERE status='marked'(idempotence guarantee) — DDL inschema.ts, no-op migration 9→10 per the migration-3→4 precedent.dedup_lineageadded toEXPORT_TABLESso lineage is portable/auditable (Add mem export with JSON, Markdown, SQL dump, and SQLite backup formats #43 export surface).Notes for review
foreign_keys=ON, hard-deleting a message referenced by an LoA range (or an LoA entry referenced as parent) would abort the transaction. Such duplicates are downgraded tomarkedand reported as FK-protected instead.IN-list deletes go throughchunked()per thesrc/lib/chunk.tsaudit-note convention.blobToEmbeddingcalledreadFloatLEonbun:sqliteBLOBs, which areUint8Array(no such method) — first exercised by dedup's tests; fixed at the single source of truth.src/lib/dedup.ts, ready for property suites — the suites themselves are intentionally not included here.Testing
bun run lintclean;bun run buildclean; full suite 574 pass / 0 fail.tests/lib/dedup.test.ts(survivor order + tie-breaks, normalization, grouping, semantic pairing) andtests/commands/dedup.test.ts(dry-run vs execute, exact detection, semantic-skip reporting, threshold behavior with synthetic embeddings, idempotence, cross-table report-only, search hide/include, lineage persistence, destructive delete + FK protection).--no-semantic→--deleteguard.Docs updated:
docs/cli-reference.md(Dedup section + search flag),docs/architecture.md(table + migration),CHANGELOG.md.