edheltzel · edheltzel · Jun 11, 2026 · Jun 11, 2026 · Jun 11, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,18 @@ while MCP tool names (`memory_search`, `memory_add`, etc.) remain stable.
 
 ### Added
 
+- **`recall dedup`** — non-destructive dedup with provenance-aware survivor
+  selection (#45): dry-run by default, `--execute` marks duplicates in the new
+  `dedup_lineage` table (schema migration 9→10) without touching the records,
+  `--delete` is the destructive opt-in (take `recall export --backup` first).
+  Detection combines normalized-text matching with semantic matching over
+  stored embeddings (conservative 0.95 default threshold, skip reported when
+  embeddings are unavailable). Survivor priority is `user_authored > verbatim
+  > extracted > derived > unknown`, then richness, importance, recency.
+  Within-table only; cross-table candidates are report-only. Marked
+  duplicates are hidden from every search path unless
+  `recall search --include-duplicates` is passed, and lineage rows are
+  included in `recall export`.
 - **`recall export`** — portable and disaster-recovery exports (#43): JSON,
   Markdown, SQL dump, and SQLite (`VACUUM INTO`) formats with a manifest
   (counts + provenance counts including explicit `unknown`), a stdout/file/

diff --git a/docs/architecture.md b/docs/architecture.md
@@ -70,6 +70,7 @@ both.
 | telos | Purpose framework entries (optional) | Yes |
 | documents | Imported standalone markdown documents (optional) | Yes |
 | embeddings | Vector embeddings for semantic search (768-dim, nomic-embed-text) | N/A |
+| dedup_lineage | Duplicate lineage audit trail from `recall dedup` (survivor, duplicate, reason, similarity, status) | No |
 
 All FTS5-indexed tables have automatic sync triggers.
 
@@ -88,6 +89,15 @@ rows stay `NULL` (unknown) until classified with
 `recall provenance backfill`, which only acts on deterministic write-path
 evidence and never guesses.
 
+The `dedup_lineage` table was added in schema migration 9→10. `recall dedup`
+marks duplicate records non-destructively by writing lineage rows here
+(survivor table/id, duplicate table/id, reason, similarity, status); marked
+duplicates stay in their source tables but are hidden from search unless
+`--include-duplicates` is passed. Survivor selection follows provenance order
+(`user_authored > verbatim > extracted > derived > unknown`), then richness,
+importance, and recency. Dedup acts within a table only; cross-table
+candidates are report-only.
+
 ## Tiered RecallStart (v0.7.0+)
 
 The `RecallStart` hook injects two tiers at the top of every session:

diff --git a/docs/cli-reference.md b/docs/cli-reference.md
@@ -17,6 +17,7 @@ recall search "query" -t decisions      # Hard-filter to decisions only
 recall search "query" --bias-type decisions # Prefer decisions, still show other matching tables
 recall search "query" -p myproject      # Filter by project
 recall search "query" --show-provenance # Show provenance for every result
+recall search "query" --include-duplicates # Include records marked by recall dedup
 recall semantic "query"                 # Semantic search (explicit)
 recall hybrid "query"                   # Hybrid search (explicit)
 ```
@@ -46,6 +47,8 @@ FTS5 supports boolean operators and prefix matching:
 
 By default, search output stays quiet about [Record Provenance](#record-provenance) when a record carries a known value, and visibly flags records whose provenance is unknown (legacy rows that predate the provenance column). Pass `--show-provenance` to display the provenance of every result.
 
+Records marked as duplicates by [`recall dedup`](#dedup) are hidden from every search path (keyword, semantic, hybrid) by default — the records and their lineage remain in the database. Pass `--include-duplicates` to show them.
+
 ---
 
 ## Capture
@@ -269,7 +272,7 @@ Formats:
 
 - **json / markdown** — app-level export of the durable memory tables
   (`sessions`, `messages`, `decisions`, `learnings`, `breadcrumbs`,
-  `loa_entries`). Every row of a provenance-bearing table carries an explicit
+  `loa_entries`, `dedup_lineage`). Every row of a provenance-bearing table carries an explicit
   `provenance` field; legacy `NULL` provenance is exported as the literal
   `unknown` — never omitted, never guessed (see Record Provenance above).
   Embeddings are excluded.
@@ -300,6 +303,46 @@ included.
 overwrites an existing file (a `-N` suffix is added on collision), and prints
 the output path.
 
+## Dedup
+
+Detect and mark duplicate memory records without erasing evidence or lineage.
+
+```bash
+recall dedup                            # Dry-run report (default — writes nothing)
+recall dedup --execute                  # Mark duplicates (non-destructive)
+recall dedup --execute --delete         # Destructive opt-in: hard-delete duplicates
+recall dedup -t breadcrumbs             # Scope to one table
+recall dedup -p myproject               # Scope to one project
+recall dedup --threshold 0.98           # Stricter semantic matching (default 0.95)
+recall dedup --no-semantic              # Exact/normalized text pass only
+```
+
+Safety model:
+
+- **Dry-run by default.** Mutations require `--execute`.
+- **Non-destructive by default.** `--execute` writes lineage rows to the
+  `dedup_lineage` table (survivor, duplicate, reason, similarity, status,
+  timestamp); the duplicate records themselves stay intact and are merely
+  hidden from search. `--delete` is the destructive opt-in and requires
+  `--execute` — run `recall export --backup` first.
+- **Within-table only.** Dedup never merges across tables (or across
+  projects). Cross-table duplicate candidates are report-only.
+- **Survivor priority** is `user_authored > verbatim > extracted > derived >
+  unknown` ([Record Provenance](#record-provenance)); ties break by richness
+  (longer normalized text), importance, recency, then lowest id.
+- **Detection** combines exact/normalized text matching with semantic
+  matching over stored embeddings (no embedding service call needed). The
+  semantic pass is skipped — and reported as skipped — when no embeddings
+  exist; records are never merged below the configured `--threshold`
+  (conservative default: 0.95 cosine similarity). Records with fewer than 20
+  significant characters are never candidates.
+- **Lifecycle-aware.** Only `active` decisions participate; superseded and
+  reverted decisions are managed by the decision lifecycle, not dedup.
+
+Marked duplicates are hidden from all search paths by default; see
+`recall search --include-duplicates`. Lineage is included in
+`recall export`, so the audit trail is portable.
+
 ## Admin
 
 ```bash

diff --git a/src/commands/dedup.ts b/src/commands/dedup.ts
@@ -0,0 +1,145 @@
+// recall dedup command (issue #45).
+//
+// Dry-run by default. --execute marks duplicates (non-destructive: records
+// stay intact, hidden from search via dedup_lineage). --delete is the
+// destructive opt-in and requires --execute; take a `recall export --backup`
+// first. Core logic lives in src/lib/dedup.ts.
+
+import { getDb } from '../db/connection.js';
+import {
+  applyDedupPlan,
+  DEDUP_TABLES,
+  DEFAULT_SEMANTIC_THRESHOLD,
+  planDedup,
+  type ApplyResult,
+  type DedupPlan,
+} from '../lib/dedup.js';
+import type { ProvenanceTable } from '../types/index.js';
+
+export interface DedupOptions {
+  execute?: boolean;
+  delete?: boolean;
+  table?: string;
+  project?: string;
+  threshold?: number;
+  semantic?: boolean;
+}
+
+export interface DedupRunResult {
+  plan: DedupPlan;
+  applied: ApplyResult | null;
+}
+
+export function runDedup(options: DedupOptions = {}): DedupRunResult | undefined {
+  const execute = options.execute ?? false;
+  const destructive = options.delete ?? false;
+
+  if (destructive && !execute) {
+    console.error('--delete requires --execute. Dry-run never deletes.');
+    process.exitCode = 1;
+    return undefined;
+  }
+
+  const target = options.table ?? 'all';
+  if (target !== 'all' && !(DEDUP_TABLES as readonly string[]).includes(target)) {
+    console.error(
+      `Invalid --table "${target}". Valid tables: ${DEDUP_TABLES.join(', ')}, all.`
+    );
+    process.exitCode = 1;
+    return undefined;
+  }
+
+  const threshold = options.threshold ?? DEFAULT_SEMANTIC_THRESHOLD;
+  if (!Number.isFinite(threshold) || threshold <= 0 || threshold > 1) {
+    console.error(`Invalid --threshold "${options.threshold}". Expected a number in (0, 1].`);
+    process.exitCode = 1;
+    return undefined;
+  }
+
+  const db = getDb();
+  const plan = planDedup(db, {
+    tables: target === 'all' ? undefined : [target as ProvenanceTable],
+    project: options.project,
+    threshold,
+    semantic: options.semantic,
+  });
+
+  const mode = !execute
+    ? '[DRY RUN — no changes written]'
+    : destructive
+      ? '[EXECUTE + DELETE — destructive: duplicates will be removed]'
+      : '[EXECUTE — marking duplicates, non-destructive]';
+  console.log(`${mode}\n`);
+
+  if (destructive) {
+    console.log("Recommended: run 'recall export --backup' before destructive dedup.\n");
+  }
+
+  const verb = execute ? (destructive ? 'delete' : 'mark') : 'would mark';
+  let totalPlanned = 0;
+  for (const report of plan.tables) {
+    totalPlanned += report.planned.length;
+    const unchanged = report.scanned - report.planned.length;
+    const skipped: string[] = [];
+    if (report.alreadyMarked > 0) skipped.push(`${report.alreadyMarked} already marked`);
+    if (report.tooShort > 0) skipped.push(`${report.tooShort} too short`);
+    const skippedNote = skipped.length > 0 ? ` (${skipped.join(', ')})` : '';
+    console.log(
+      `${report.table}: scanned ${report.scanned}, exact groups ${report.exactGroups}, ` +
+      `semantic pairs ${report.semanticPairs}, ${verb} ${report.planned.length}, ` +
+      `unchanged ${unchanged}${skippedNote}`
+    );
+    for (const entry of report.planned.slice(0, 3)) {
+      const sim = entry.similarity !== null ? ` @ ${entry.similarity.toFixed(3)}` : '';
+      console.log(
+        `  #${entry.duplicate_id} → survivor #${entry.survivor_id} [${entry.reason}${sim}]`
+      );
+    }
+    if (report.planned.length > 3) {
+      console.log(`  ...and ${report.planned.length - 3} more`);
+    }
+  }
+  console.log('');
+
+  if (plan.semanticSkipped) {
+    console.log(`Semantic pass: skipped — ${plan.semanticSkipped}`);
+  } else {
+    console.log(`Semantic pass: threshold ${plan.threshold}`);
+  }
+
+  const crossText = plan.crossTable.textMatches;
+  console.log(
+    `Cross-table (report-only, never acted on): ${crossText.length} text match group(s), ` +
+    `${plan.crossTable.semanticPairs} semantic pair(s)`
+  );
+  for (const match of crossText.slice(0, 5)) {
+    const members = match.members.map(m => `${m.table}#${m.id}`).join(' ↔ ');
+    const projectTag = match.project ? ` [${match.project}]` : '';
+    console.log(`  ${members}${projectTag}`);
+  }
+  if (crossText.length > 5) {
+    console.log(`  ...and ${crossText.length - 5} more`);
+  }
+  console.log('');
+
+  if (!execute) {
+    if (totalPlanned > 0) {
+      console.log('Re-run with --execute to mark duplicates (non-destructive).');
+      console.log("Marked duplicates are hidden from search; use 'recall search --include-duplicates' to see them.");
+    } else {
+      console.log('No duplicates found.');
+    }
+    return { plan, applied: null };
+  }
+
+  const applied = applyDedupPlan(db, plan, { destructive });
+  if (destructive) {
+    const fkNote = applied.fkProtected > 0
+      ? ` (${applied.fkProtected} kept as marked — referenced by LoA lineage)`
+      : '';
+    console.log(`Deleted ${applied.deleted} duplicate(s)${fkNote}.`);
+  } else {
+    console.log(`Marked ${applied.marked} duplicate(s). Records remain intact and recoverable.`);
+  }
+  return { plan, applied };
+}
diff --git a/src/commands/embed.ts b/src/commands/embed.ts
@@ -2,8 +2,17 @@
 
 import { getDb } from '../db/connection.js';
 import { embed, embeddingToBlob, blobToEmbedding, cosineSimilarity, checkEmbeddingService, reciprocalRankFusion, EMBEDDING_MODEL } from '../lib/embeddings.js';
+import { notMarkedDuplicateSql } from '../lib/dedup.js';
 import { search as ftsSearch } from '../lib/memory.js';
 
+// Marked duplicates (recall dedup, issue #45) keep their embeddings but are
+// hidden from the semantic search paths, matching the FTS5 default.
+function embeddingsWhere(table?: string): string {
+  const conditions = [notMarkedDuplicateSql('source_table', 'source_id')];
+  if (table) conditions.push(`source_table = '${table}'`);
+  return `WHERE ${conditions.join(' AND ')}`;
+}
+
 interface EmbedOptions {
   table?: 'loa' | 'decisions' | 'messages' | 'learnings';
   limit?: number;
@@ -164,11 +173,10 @@ export async function runSemanticSearch(query: string, options: { table?: string
   const queryEmbedding = queryResult.embedding;
 
   // Get all embeddings (for now, brute force - will optimize later)
-  const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : '';
   const embeddings = db.prepare(`
     SELECT id, source_table, source_id, embedding
     FROM embeddings
-    ${tableFilter}
+    ${embeddingsWhere(options.table)}
   `).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>;
 
   if (embeddings.length === 0) {
@@ -304,11 +312,10 @@ export async function runHybridSearch(query: string, options: { table?: string;
   const queryEmbedding = queryResult.embedding;
 
   // Get embeddings from database
-  const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : '';
   const embeddings = db.prepare(`
     SELECT id, source_table, source_id, embedding
     FROM embeddings
-    ${tableFilter}
+    ${embeddingsWhere(options.table)}
   `).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>;
 
   // Calculate similarities

diff --git a/src/commands/search.ts b/src/commands/search.ts
@@ -8,6 +8,7 @@ interface SearchOptions {
   biasType?: string;
   limit?: number;
   showProvenance?: boolean;
+  includeDuplicates?: boolean;
 }
 
 export function runSearch(query: string, options: SearchOptions): void {
@@ -23,7 +24,8 @@ export function runSearch(query: string, options: SearchOptions): void {
     project: options.project,
     table: options.table,
     biasType: options.biasType as SearchTable | undefined,
-    limit: options.limit || 20
+    limit: options.limit || 20,
+    includeDuplicates: options.includeDuplicates
   });
 
   if (results.length === 0) {

diff --git a/src/db/migrations.ts b/src/db/migrations.ts
@@ -198,6 +198,12 @@ export const MIGRATIONS: Migration[] = [
       }
     }
   },
+
+  // Migration 9 → 10: Dedup lineage table (issue #45).
+  // No-op — dedup_lineage and its indexes are brand new, handled by the
+  // CREATE TABLE IF NOT EXISTS DDL that runs before migrations (same
+  // precedent as migration 3 → 4 for the extraction tables).
+  (_db) => {},
 ];
 
 // ---------------------------------------------------------------------------

diff --git a/src/db/schema.ts b/src/db/schema.ts
@@ -181,6 +181,24 @@ CREATE TABLE IF NOT EXISTS procedures (
   times_observed INTEGER DEFAULT 2,
   confidence TEXT DEFAULT 'medium' CHECK (confidence IN ('high', 'medium', 'low'))
 );
+
+-- Dedup lineage (issue #45): persistent audit trail of duplicate marking.
+-- Non-destructive by default — a 'marked' row hides the duplicate from search
+-- while the underlying record stays intact. 'deleted' records a destructive
+-- opt-in removal. 'reverted' is reserved vocabulary for a future unmark path
+-- (CHECK constraints cannot be widened later without a table rebuild).
+CREATE TABLE IF NOT EXISTS dedup_lineage (
+  id INTEGER PRIMARY KEY AUTOINCREMENT,
+  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
+  survivor_table TEXT NOT NULL,
+  survivor_id INTEGER NOT NULL,
+  duplicate_table TEXT NOT NULL,
+  duplicate_id INTEGER NOT NULL,
+  reason TEXT NOT NULL CHECK (reason IN ('exact', 'semantic')),
+  similarity REAL,
+  status TEXT NOT NULL DEFAULT 'marked' CHECK (status IN ('marked', 'deleted', 'reverted')),
+  detail TEXT
+);
 `;
 
 export const CREATE_INDEXES = `
@@ -231,6 +249,14 @@ CREATE INDEX IF NOT EXISTS idx_documents_created ON documents(created_at);
 
 -- Extraction session indexes
 CREATE INDEX IF NOT EXISTS idx_extraction_sessions_ts ON extraction_sessions(timestamp DESC);
+
+-- Dedup lineage indexes: the partial unique index guarantees a record can be
+-- an actively marked duplicate at most once (idempotence); the survivor index
+-- supports lineage audits.
+CREATE UNIQUE INDEX IF NOT EXISTS idx_dedup_lineage_duplicate
+  ON dedup_lineage(duplicate_table, duplicate_id) WHERE status = 'marked';
+CREATE INDEX IF NOT EXISTS idx_dedup_lineage_survivor
+  ON dedup_lineage(survivor_table, survivor_id);
 `;
 
 export const CREATE_FTS = `