diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4a11335..abf0eda 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,18 @@ while MCP tool names (`memory_search`, `memory_add`, etc.) remain stable.
### Added
+- **`recall dedup`** — non-destructive dedup with provenance-aware survivor
+ selection (#45): dry-run by default, `--execute` marks duplicates in the new
+ `dedup_lineage` table (schema migration 9→10) without touching the records,
+ `--delete` is the destructive opt-in (take `recall export --backup` first).
+ Detection combines normalized-text matching with semantic matching over
+ stored embeddings (conservative 0.95 default threshold, skip reported when
+ embeddings are unavailable). Survivor priority is `user_authored > verbatim
+ > extracted > derived > unknown`, then richness, importance, recency.
+ Within-table only; cross-table candidates are report-only. Marked
+ duplicates are hidden from every search path unless
+ `recall search --include-duplicates` is passed, and lineage rows are
+ included in `recall export`.
- **`recall export`** — portable and disaster-recovery exports (#43): JSON,
Markdown, SQL dump, and SQLite (`VACUUM INTO`) formats with a manifest
(counts + provenance counts including explicit `unknown`), a stdout/file/
diff --git a/docs/architecture.md b/docs/architecture.md
index a7d859e..d84e960 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -70,6 +70,7 @@ both.
| telos | Purpose framework entries (optional) | Yes |
| documents | Imported standalone markdown documents (optional) | Yes |
| embeddings | Vector embeddings for semantic search (768-dim, nomic-embed-text) | N/A |
+| dedup_lineage | Duplicate lineage audit trail from `recall dedup` (survivor, duplicate, reason, similarity, status) | No |
All FTS5-indexed tables have automatic sync triggers.
@@ -88,6 +89,15 @@ rows stay `NULL` (unknown) until classified with
`recall provenance backfill`, which only acts on deterministic write-path
evidence and never guesses.
+The `dedup_lineage` table was added in schema migration 9→10. `recall dedup`
+marks duplicate records non-destructively by writing lineage rows here
+(survivor table/id, duplicate table/id, reason, similarity, status); marked
+duplicates stay in their source tables but are hidden from search unless
+`--include-duplicates` is passed. Survivor selection follows provenance order
+(`user_authored > verbatim > extracted > derived > unknown`), then richness,
+importance, and recency. Dedup acts within a table only; cross-table
+candidates are report-only.
+
## Tiered RecallStart (v0.7.0+)
The `RecallStart` hook injects two tiers at the top of every session:
diff --git a/docs/cli-reference.md b/docs/cli-reference.md
index c3d38ff..671de42 100644
--- a/docs/cli-reference.md
+++ b/docs/cli-reference.md
@@ -17,6 +17,7 @@ recall search "query" -t decisions # Hard-filter to decisions only
recall search "query" --bias-type decisions # Prefer decisions, still show other matching tables
recall search "query" -p myproject # Filter by project
recall search "query" --show-provenance # Show provenance for every result
+recall search "query" --include-duplicates # Include records marked by recall dedup
recall semantic "query" # Semantic search (explicit)
recall hybrid "query" # Hybrid search (explicit)
```
@@ -46,6 +47,8 @@ FTS5 supports boolean operators and prefix matching:
By default, search output stays quiet about [Record Provenance](#record-provenance) when a record carries a known value, and visibly flags records whose provenance is unknown (legacy rows that predate the provenance column). Pass `--show-provenance` to display the provenance of every result.
+Records marked as duplicates by [`recall dedup`](#dedup) are hidden from every search path (keyword, semantic, hybrid) by default — the records and their lineage remain in the database. Pass `--include-duplicates` to show them.
+
---
## Capture
@@ -269,7 +272,7 @@ Formats:
- **json / markdown** — app-level export of the durable memory tables
(`sessions`, `messages`, `decisions`, `learnings`, `breadcrumbs`,
- `loa_entries`). Every row of a provenance-bearing table carries an explicit
+ `loa_entries`, `dedup_lineage`). Every row of a provenance-bearing table carries an explicit
`provenance` field; legacy `NULL` provenance is exported as the literal
`unknown` — never omitted, never guessed (see Record Provenance above).
Embeddings are excluded.
@@ -300,6 +303,46 @@ included.
overwrites an existing file (a `-N` suffix is added on collision), and prints
the output path.
+## Dedup
+
+Detect and mark duplicate memory records without erasing evidence or lineage.
+
+```bash
+recall dedup # Dry-run report (default — writes nothing)
+recall dedup --execute # Mark duplicates (non-destructive)
+recall dedup --execute --delete # Destructive opt-in: hard-delete duplicates
+recall dedup -t breadcrumbs # Scope to one table
+recall dedup -p myproject # Scope to one project
+recall dedup --threshold 0.98 # Stricter semantic matching (default 0.95)
+recall dedup --no-semantic # Exact/normalized text pass only
+```
+
+Safety model:
+
+- **Dry-run by default.** Mutations require `--execute`.
+- **Non-destructive by default.** `--execute` writes lineage rows to the
+ `dedup_lineage` table (survivor, duplicate, reason, similarity, status,
+ timestamp); the duplicate records themselves stay intact and are merely
+ hidden from search. `--delete` is the destructive opt-in and requires
+ `--execute` — run `recall export --backup` first.
+- **Within-table only.** Dedup never merges across tables (or across
+ projects). Cross-table duplicate candidates are report-only.
+- **Survivor priority** is `user_authored > verbatim > extracted > derived >
+ unknown` ([Record Provenance](#record-provenance)); ties break by richness
+ (longer normalized text), importance, recency, then lowest id.
+- **Detection** combines exact/normalized text matching with semantic
+ matching over stored embeddings (no embedding service call needed). The
+ semantic pass is skipped — and reported as skipped — when no embeddings
+ exist; records are never merged below the configured `--threshold`
+ (conservative default: 0.95 cosine similarity). Records with fewer than 20
+ significant characters are never candidates.
+- **Lifecycle-aware.** Only `active` decisions participate; superseded and
+ reverted decisions are managed by the decision lifecycle, not dedup.
+
+Marked duplicates are hidden from all search paths by default; see
+`recall search --include-duplicates`. Lineage is included in
+`recall export`, so the audit trail is portable.
+
## Admin
```bash
diff --git a/src/commands/dedup.ts b/src/commands/dedup.ts
new file mode 100644
index 0000000..7c865bf
--- /dev/null
+++ b/src/commands/dedup.ts
@@ -0,0 +1,145 @@
+// recall dedup command (issue #45).
+//
+// Dry-run by default. --execute marks duplicates (non-destructive: records
+// stay intact, hidden from search via dedup_lineage). --delete is the
+// destructive opt-in and requires --execute; take a `recall export --backup`
+// first. Core logic lives in src/lib/dedup.ts.
+
+import { getDb } from '../db/connection.js';
+import {
+ applyDedupPlan,
+ DEDUP_TABLES,
+ DEFAULT_SEMANTIC_THRESHOLD,
+ planDedup,
+ type ApplyResult,
+ type DedupPlan,
+} from '../lib/dedup.js';
+import type { ProvenanceTable } from '../types/index.js';
+
+export interface DedupOptions {
+ execute?: boolean;
+ delete?: boolean;
+ table?: string;
+ project?: string;
+ threshold?: number;
+ semantic?: boolean;
+}
+
+export interface DedupRunResult {
+ plan: DedupPlan;
+ applied: ApplyResult | null;
+}
+
+export function runDedup(options: DedupOptions = {}): DedupRunResult | undefined {
+ const execute = options.execute ?? false;
+ const destructive = options.delete ?? false;
+
+ if (destructive && !execute) {
+ console.error('--delete requires --execute. Dry-run never deletes.');
+ process.exitCode = 1;
+ return undefined;
+ }
+
+ const target = options.table ?? 'all';
+ if (target !== 'all' && !(DEDUP_TABLES as readonly string[]).includes(target)) {
+ console.error(
+ `Invalid --table "${target}". Valid tables: ${DEDUP_TABLES.join(', ')}, all.`
+ );
+ process.exitCode = 1;
+ return undefined;
+ }
+
+ const threshold = options.threshold ?? DEFAULT_SEMANTIC_THRESHOLD;
+ if (!Number.isFinite(threshold) || threshold <= 0 || threshold > 1) {
+ console.error(`Invalid --threshold "${options.threshold}". Expected a number in (0, 1].`);
+ process.exitCode = 1;
+ return undefined;
+ }
+
+ const db = getDb();
+ const plan = planDedup(db, {
+ tables: target === 'all' ? undefined : [target as ProvenanceTable],
+ project: options.project,
+ threshold,
+ semantic: options.semantic,
+ });
+
+ const mode = !execute
+ ? '[DRY RUN — no changes written]'
+ : destructive
+ ? '[EXECUTE + DELETE — destructive: duplicates will be removed]'
+ : '[EXECUTE — marking duplicates, non-destructive]';
+ console.log(`${mode}\n`);
+
+ if (destructive) {
+ console.log("Recommended: run 'recall export --backup' before destructive dedup.\n");
+ }
+
+ const verb = execute ? (destructive ? 'delete' : 'mark') : 'would mark';
+ let totalPlanned = 0;
+ for (const report of plan.tables) {
+ totalPlanned += report.planned.length;
+ const unchanged = report.scanned - report.planned.length;
+ const skipped: string[] = [];
+ if (report.alreadyMarked > 0) skipped.push(`${report.alreadyMarked} already marked`);
+ if (report.tooShort > 0) skipped.push(`${report.tooShort} too short`);
+ const skippedNote = skipped.length > 0 ? ` (${skipped.join(', ')})` : '';
+ console.log(
+ `${report.table}: scanned ${report.scanned}, exact groups ${report.exactGroups}, ` +
+ `semantic pairs ${report.semanticPairs}, ${verb} ${report.planned.length}, ` +
+ `unchanged ${unchanged}${skippedNote}`
+ );
+ for (const entry of report.planned.slice(0, 3)) {
+ const sim = entry.similarity !== null ? ` @ ${entry.similarity.toFixed(3)}` : '';
+ console.log(
+ ` #${entry.duplicate_id} → survivor #${entry.survivor_id} [${entry.reason}${sim}]`
+ );
+ }
+ if (report.planned.length > 3) {
+ console.log(` ...and ${report.planned.length - 3} more`);
+ }
+ }
+ console.log('');
+
+ if (plan.semanticSkipped) {
+ console.log(`Semantic pass: skipped — ${plan.semanticSkipped}`);
+ } else {
+ console.log(`Semantic pass: threshold ${plan.threshold}`);
+ }
+
+ const crossText = plan.crossTable.textMatches;
+ console.log(
+ `Cross-table (report-only, never acted on): ${crossText.length} text match group(s), ` +
+ `${plan.crossTable.semanticPairs} semantic pair(s)`
+ );
+ for (const match of crossText.slice(0, 5)) {
+ const members = match.members.map(m => `${m.table}#${m.id}`).join(' ↔ ');
+ const projectTag = match.project ? ` [${match.project}]` : '';
+ console.log(` ${members}${projectTag}`);
+ }
+ if (crossText.length > 5) {
+ console.log(` ...and ${crossText.length - 5} more`);
+ }
+ console.log('');
+
+ if (!execute) {
+ if (totalPlanned > 0) {
+ console.log('Re-run with --execute to mark duplicates (non-destructive).');
+ console.log("Marked duplicates are hidden from search; use 'recall search --include-duplicates' to see them.");
+ } else {
+ console.log('No duplicates found.');
+ }
+ return { plan, applied: null };
+ }
+
+ const applied = applyDedupPlan(db, plan, { destructive });
+ if (destructive) {
+ const fkNote = applied.fkProtected > 0
+ ? ` (${applied.fkProtected} kept as marked — referenced by LoA lineage)`
+ : '';
+ console.log(`Deleted ${applied.deleted} duplicate(s)${fkNote}.`);
+ } else {
+ console.log(`Marked ${applied.marked} duplicate(s). Records remain intact and recoverable.`);
+ }
+ return { plan, applied };
+}
diff --git a/src/commands/embed.ts b/src/commands/embed.ts
index c9ae4a0..66bd623 100644
--- a/src/commands/embed.ts
+++ b/src/commands/embed.ts
@@ -2,8 +2,17 @@
import { getDb } from '../db/connection.js';
import { embed, embeddingToBlob, blobToEmbedding, cosineSimilarity, checkEmbeddingService, reciprocalRankFusion, EMBEDDING_MODEL } from '../lib/embeddings.js';
+import { notMarkedDuplicateSql } from '../lib/dedup.js';
import { search as ftsSearch } from '../lib/memory.js';
+// Marked duplicates (recall dedup, issue #45) keep their embeddings but are
+// hidden from the semantic search paths, matching the FTS5 default.
+function embeddingsWhere(table?: string): string {
+ const conditions = [notMarkedDuplicateSql('source_table', 'source_id')];
+ if (table) conditions.push(`source_table = '${table}'`);
+ return `WHERE ${conditions.join(' AND ')}`;
+}
+
interface EmbedOptions {
table?: 'loa' | 'decisions' | 'messages' | 'learnings';
limit?: number;
@@ -164,11 +173,10 @@ export async function runSemanticSearch(query: string, options: { table?: string
const queryEmbedding = queryResult.embedding;
// Get all embeddings (for now, brute force - will optimize later)
- const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : '';
const embeddings = db.prepare(`
SELECT id, source_table, source_id, embedding
FROM embeddings
- ${tableFilter}
+ ${embeddingsWhere(options.table)}
`).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>;
if (embeddings.length === 0) {
@@ -304,11 +312,10 @@ export async function runHybridSearch(query: string, options: { table?: string;
const queryEmbedding = queryResult.embedding;
// Get embeddings from database
- const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : '';
const embeddings = db.prepare(`
SELECT id, source_table, source_id, embedding
FROM embeddings
- ${tableFilter}
+ ${embeddingsWhere(options.table)}
`).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>;
// Calculate similarities
diff --git a/src/commands/search.ts b/src/commands/search.ts
index 2c13579..9c782a7 100644
--- a/src/commands/search.ts
+++ b/src/commands/search.ts
@@ -8,6 +8,7 @@ interface SearchOptions {
biasType?: string;
limit?: number;
showProvenance?: boolean;
+ includeDuplicates?: boolean;
}
export function runSearch(query: string, options: SearchOptions): void {
@@ -23,7 +24,8 @@ export function runSearch(query: string, options: SearchOptions): void {
project: options.project,
table: options.table,
biasType: options.biasType as SearchTable | undefined,
- limit: options.limit || 20
+ limit: options.limit || 20,
+ includeDuplicates: options.includeDuplicates
});
if (results.length === 0) {
diff --git a/src/db/migrations.ts b/src/db/migrations.ts
index 415e1a4..e5c4cbe 100644
--- a/src/db/migrations.ts
+++ b/src/db/migrations.ts
@@ -198,6 +198,12 @@ export const MIGRATIONS: Migration[] = [
}
}
},
+
+ // Migration 9 → 10: Dedup lineage table (issue #45).
+ // No-op — dedup_lineage and its indexes are brand new, handled by the
+ // CREATE TABLE IF NOT EXISTS DDL that runs before migrations (same
+ // precedent as migration 3 → 4 for the extraction tables).
+ (_db) => {},
];
// ---------------------------------------------------------------------------
diff --git a/src/db/schema.ts b/src/db/schema.ts
index 7f60912..269e0b9 100644
--- a/src/db/schema.ts
+++ b/src/db/schema.ts
@@ -181,6 +181,24 @@ CREATE TABLE IF NOT EXISTS procedures (
times_observed INTEGER DEFAULT 2,
confidence TEXT DEFAULT 'medium' CHECK (confidence IN ('high', 'medium', 'low'))
);
+
+-- Dedup lineage (issue #45): persistent audit trail of duplicate marking.
+-- Non-destructive by default — a 'marked' row hides the duplicate from search
+-- while the underlying record stays intact. 'deleted' records a destructive
+-- opt-in removal. 'reverted' is reserved vocabulary for a future unmark path
+-- (CHECK constraints cannot be widened later without a table rebuild).
+CREATE TABLE IF NOT EXISTS dedup_lineage (
+ id INTEGER PRIMARY KEY AUTOINCREMENT,
+ created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
+ survivor_table TEXT NOT NULL,
+ survivor_id INTEGER NOT NULL,
+ duplicate_table TEXT NOT NULL,
+ duplicate_id INTEGER NOT NULL,
+ reason TEXT NOT NULL CHECK (reason IN ('exact', 'semantic')),
+ similarity REAL,
+ status TEXT NOT NULL DEFAULT 'marked' CHECK (status IN ('marked', 'deleted', 'reverted')),
+ detail TEXT
+);
`;
export const CREATE_INDEXES = `
@@ -231,6 +249,14 @@ CREATE INDEX IF NOT EXISTS idx_documents_created ON documents(created_at);
-- Extraction session indexes
CREATE INDEX IF NOT EXISTS idx_extraction_sessions_ts ON extraction_sessions(timestamp DESC);
+
+-- Dedup lineage indexes: the partial unique index guarantees a record can be
+-- an actively marked duplicate at most once (idempotence); the survivor index
+-- supports lineage audits.
+CREATE UNIQUE INDEX IF NOT EXISTS idx_dedup_lineage_duplicate
+ ON dedup_lineage(duplicate_table, duplicate_id) WHERE status = 'marked';
+CREATE INDEX IF NOT EXISTS idx_dedup_lineage_survivor
+ ON dedup_lineage(survivor_table, survivor_id);
`;
export const CREATE_FTS = `
diff --git a/src/index.ts b/src/index.ts
index 4040e17..a2e2aaf 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -30,6 +30,8 @@ import { runOnboard } from './commands/onboard.js';
import { runMigrate } from './commands/migrate.js';
import { runPath } from './commands/path.js';
import { runExport } from './commands/export.js';
+import { runDedup } from './commands/dedup.js';
+import { DEFAULT_SEMANTIC_THRESHOLD } from './lib/dedup.js';
import { closeDb } from './db/connection.js';
const program = new Command();
@@ -180,13 +182,15 @@ program
.option('--bias-type
', 'Softly boost one table without filtering others (messages, loa, decisions, learnings, breadcrumbs)')
.option('-l, --limit ', 'Max results', '20')
.option('--show-provenance', 'Show provenance for every result (default: only unknown provenance is flagged)')
+ .option('--include-duplicates', 'Include records marked as duplicates by recall dedup (hidden by default)')
.action((query, options) => {
runSearch(query, {
project: options.project,
table: options.table,
biasType: options.biasType,
limit: parseInt(options.limit, 10),
- showProvenance: options.showProvenance
+ showProvenance: options.showProvenance,
+ includeDuplicates: options.includeDuplicates
});
closeDb();
});
@@ -658,6 +662,30 @@ program
closeDb();
});
+// recall dedup — non-destructive duplicate detection (issue #45)
+// Dry-run by default; --execute marks duplicates in dedup_lineage (records
+// stay intact, hidden from search); --delete is the destructive opt-in.
+program
+ .command('dedup')
+ .description('Detect and mark duplicate memory records (dry-run by default; non-destructive)')
+ .option('--execute', 'Apply the plan: mark duplicates (default is dry-run)')
+ .option('--delete', "Destructive opt-in: hard-delete duplicates instead of marking (requires --execute; run 'recall export --backup' first)")
+ .option('-t, --table ', 'Target table: messages, decisions, learnings, breadcrumbs, loa_entries, all', 'all')
+ .option('-p, --project ', 'Scope to one project')
+ .option('--threshold ', `Semantic similarity threshold (0-1)`, String(DEFAULT_SEMANTIC_THRESHOLD))
+ .option('--no-semantic', 'Skip the semantic (embeddings) pass')
+ .action((options) => {
+ runDedup({
+ execute: options.execute,
+ delete: options.delete,
+ table: options.table,
+ project: options.project,
+ threshold: parseFloat(options.threshold),
+ semantic: options.semantic
+ });
+ closeDb();
+ });
+
// Default command: recall → hybrid search (Phase 3: best of both worlds)
program
.arguments('[query]')
@@ -668,7 +696,7 @@ program
.option('-k, --keyword', 'Use keyword search only (FTS5)')
.option('-v, --vector', 'Use vector search only (semantic)')
.action(async (query, options) => {
- if (query && !['init', 'add', 'search', 'recent', 'show', 'stats', 'import', 'import-conversations', 'loa', 'telos', 'docs', 'dump', 'embed', 'semantic', 'hybrid', 'doctor', 'importance', 'provenance', 'pin', 'unpin', 'decision', 'prune', 'cluster', 'import-legacy', 'benchmark', 'onboard', 'migrate', 'path', 'export'].includes(query)) {
+ if (query && !['init', 'add', 'search', 'recent', 'show', 'stats', 'import', 'import-conversations', 'loa', 'telos', 'docs', 'dump', 'embed', 'semantic', 'hybrid', 'doctor', 'importance', 'provenance', 'pin', 'unpin', 'decision', 'prune', 'cluster', 'import-legacy', 'benchmark', 'onboard', 'migrate', 'path', 'export', 'dedup'].includes(query)) {
if (options.keyword) {
// FTS5 only
runSearch(query, {
diff --git a/src/lib/dedup.ts b/src/lib/dedup.ts
new file mode 100644
index 0000000..fbfce6e
--- /dev/null
+++ b/src/lib/dedup.ts
@@ -0,0 +1,629 @@
+// recall dedup — core logic (issue #45).
+//
+// Non-destructive dedup with provenance-aware survivor selection. The decision
+// logic (normalization, survivor ordering, grouping, semantic pairing) is kept
+// as pure exported functions over plain data so it stays unit-testable
+// (issue #44 phase 2 adds property tests over these functions).
+//
+// Safety model:
+// - Dry-run by default; mutations require an explicit execute step.
+// - Default action marks duplicates in dedup_lineage — records stay intact.
+// Hard deletion is a separate destructive opt-in.
+// - Dedup acts within a table (and within a project) only. Cross-table
+// duplicate candidates are report-only.
+// - Survivor priority: user_authored > verbatim > extracted > derived >
+// unknown (PROVENANCE_VALUES in types/index.ts is the single source of
+// truth). Ties break by richness (normalized length), importance, recency,
+// then lowest id for determinism.
+// - Semantic detection compares stored embeddings pairwise — no embedding
+// service call is needed. Pairs are never chained transitively: every
+// marked duplicate has a direct similarity >= threshold to its survivor.
+//
+// Bind-count note (see src/lib/chunk.ts): scans use keyset pagination
+// (`WHERE id > ? LIMIT ?`, fixed binds). Destructive deletion builds
+// input-scaled `IN (?,...)` lists and therefore goes through chunked().
+
+import { createHash } from 'crypto';
+import { Database } from 'bun:sqlite';
+import { chunked, SQLITE_SAFE_CHUNK_SIZE } from './chunk.js';
+import { blobToEmbedding, cosineSimilarity } from './embeddings.js';
+import {
+ PROVENANCE_TABLES,
+ PROVENANCE_VALUES,
+ type Provenance,
+ type ProvenanceTable,
+} from '../types/index.js';
+
+/** Tables dedup operates on — the provenance-bearing memory tables. */
+export const DEDUP_TABLES = PROVENANCE_TABLES;
+
+/**
+ * Conservative default for semantic matching. At 0.95 cosine similarity on
+ * nomic-embed-text vectors, two records are near-verbatim restatements;
+ * anything below stays untouched (never auto-merge below the threshold).
+ */
+export const DEFAULT_SEMANTIC_THRESHOLD = 0.95;
+
+/**
+ * Records whose normalized text is shorter than this are never dedup
+ * candidates. Short acknowledgments ("ok", "thanks", "yes") repeat
+ * legitimately across sessions — mass-marking them would be noise.
+ */
+export const MIN_DEDUP_TEXT_LENGTH = 20;
+
+export type DedupReason = 'exact' | 'semantic';
+export type LineageStatus = 'marked' | 'deleted' | 'reverted';
+
+export interface DedupCandidate {
+ table: ProvenanceTable;
+ id: number;
+ project: string | null;
+ provenance: Provenance | null;
+ importance: number;
+ created_at: string;
+ /** sha256 of the normalized dedup text — exact-match grouping key. */
+ key: string;
+ /** Length of the normalized text — the richness tie-breaker. */
+ textLength: number;
+}
+
+export interface LineageEntry {
+ survivor_table: ProvenanceTable;
+ survivor_id: number;
+ duplicate_table: ProvenanceTable;
+ duplicate_id: number;
+ reason: DedupReason;
+ similarity: number | null;
+ /** JSON audit detail (project, survivor/duplicate provenance). */
+ detail: string;
+}
+
+// ---------------------------------------------------------------------------
+// Pure functions — text normalization and survivor selection
+// ---------------------------------------------------------------------------
+
+/**
+ * Canonical text normalization for duplicate detection: lowercase, strip
+ * quote characters, collapse whitespace. Same convention the extraction
+ * hooks use for their log-level dedup.
+ */
+export function normalizeText(s: string): string {
+ return s.toLowerCase().replace(/['"]/g, '').replace(/\s+/g, ' ').trim();
+}
+
+/**
+ * Survivor priority rank — lower wins. Order comes from PROVENANCE_VALUES
+ * (user_authored > verbatim > extracted > derived); unknown (NULL) ranks
+ * last. Note PROVENANCE_VALUES is already declared in survivor order.
+ */
+export function provenanceRank(p: Provenance | null | undefined): number {
+ if (!p) return PROVENANCE_VALUES.length;
+ const idx = PROVENANCE_VALUES.indexOf(p);
+ return idx === -1 ? PROVENANCE_VALUES.length : idx;
+}
+
+/**
+ * Comparator placing the survivor first: provenance rank, then richness
+ * (longer normalized text), importance, recency (newer created_at), and
+ * finally lowest id so ordering is total and deterministic.
+ */
+export function compareForSurvivor(a: DedupCandidate, b: DedupCandidate): number {
+ const rank = provenanceRank(a.provenance) - provenanceRank(b.provenance);
+ if (rank !== 0) return rank;
+ if (a.textLength !== b.textLength) return b.textLength - a.textLength;
+ const importance = (b.importance ?? 5) - (a.importance ?? 5);
+ if (importance !== 0) return importance;
+ // ISO and SQLite CURRENT_TIMESTAMP formats both sort lexicographically.
+ if (a.created_at !== b.created_at) return a.created_at > b.created_at ? -1 : 1;
+ return a.id - b.id;
+}
+
+export function selectSurvivor(group: DedupCandidate[]): {
+ survivor: DedupCandidate;
+ duplicates: DedupCandidate[];
+} {
+ if (group.length === 0) throw new Error('selectSurvivor: empty group');
+ const sorted = [...group].sort(compareForSurvivor);
+ return { survivor: sorted[0], duplicates: sorted.slice(1) };
+}
+
+/**
+ * Grouping key: same table, same project, same normalized text. Fields are
+ * joined with the NUL escape `\x00` — a separator that cannot occur in any
+ * field, so boundaries can never be forged by content.
+ */
+function groupKey(c: DedupCandidate): string {
+ return `${c.table}\x00${c.project ?? ''}\x00${c.key}`;
+}
+
+/**
+ * Exact (normalized-text) duplicate groups within a table and project.
+ * Returns only groups with two or more members.
+ */
+export function findExactGroups(candidates: DedupCandidate[]): DedupCandidate[][] {
+ const byKey = new Map();
+ for (const c of candidates) {
+ const k = groupKey(c);
+ const group = byKey.get(k);
+ if (group) group.push(c);
+ else byKey.set(k, [c]);
+ }
+ return [...byKey.values()].filter(g => g.length >= 2);
+}
+
+export interface CrossTableMatch {
+ project: string | null;
+ members: Array<{ table: ProvenanceTable; id: number }>;
+}
+
+/**
+ * Report-only: normalized-text matches spanning two or more tables within
+ * the same project. Dedup never acts across tables.
+ */
+export function findCrossTableMatches(candidates: DedupCandidate[]): CrossTableMatch[] {
+ const byKey = new Map();
+ for (const c of candidates) {
+ const k = `${c.project ?? ''}\x00${c.key}`;
+ const group = byKey.get(k);
+ if (group) group.push(c);
+ else byKey.set(k, [c]);
+ }
+ const matches: CrossTableMatch[] = [];
+ for (const group of byKey.values()) {
+ const tables = new Set(group.map(c => c.table));
+ if (tables.size < 2) continue;
+ matches.push({
+ project: group[0].project,
+ members: group.map(c => ({ table: c.table, id: c.id })),
+ });
+ }
+ return matches;
+}
+
+export interface EmbeddedCandidate {
+ candidate: DedupCandidate;
+ embedding: number[];
+}
+
+export interface SemanticPair {
+ a: DedupCandidate;
+ b: DedupCandidate;
+ similarity: number;
+ sameTable: boolean;
+}
+
+/**
+ * Pairwise cosine similarity over stored embeddings. Returns every pair at
+ * or above the threshold, split into actionable (same table + project) and
+ * cross-table report-only pairs. Pairs within one table but across projects
+ * are ignored — dedup scope is per-project, matching the exact pass.
+ *
+ * O(n^2) over embedded records — acceptable at personal-memory scale and the
+ * same brute-force shape the cluster command and hybrid search already use.
+ */
+export function findSemanticPairs(
+ embedded: EmbeddedCandidate[],
+ threshold: number
+): SemanticPair[] {
+ const pairs: SemanticPair[] = [];
+ for (let i = 0; i < embedded.length; i++) {
+ for (let j = i + 1; j < embedded.length; j++) {
+ const x = embedded[i];
+ const y = embedded[j];
+ const sameTable = x.candidate.table === y.candidate.table;
+ if (sameTable && x.candidate.project !== y.candidate.project) continue;
+ // Exact duplicates are the exact pass's job; identical keys here would
+ // double-report the same records under a second reason.
+ if (x.candidate.key === y.candidate.key) continue;
+ const similarity = cosineSimilarity(x.embedding, y.embedding);
+ if (similarity >= threshold) {
+ pairs.push({ a: x.candidate, b: y.candidate, similarity, sameTable });
+ }
+ }
+ }
+ // Deterministic processing order: strongest pairs first.
+ pairs.sort((p, q) =>
+ q.similarity - p.similarity ||
+ p.a.table.localeCompare(q.a.table) ||
+ p.a.id - q.a.id ||
+ p.b.id - q.b.id
+ );
+ return pairs;
+}
+
+// ---------------------------------------------------------------------------
+// SQL fragments shared with the search paths
+// ---------------------------------------------------------------------------
+
+/**
+ * SQL fragment excluding records marked as duplicates. `tableExpr` and
+ * `idExpr` are SQL expressions (a quoted literal like `'messages'` or a
+ * column reference like `e.source_table`) — never user input.
+ */
+export function notMarkedDuplicateSql(tableExpr: string, idExpr: string): string {
+ return `NOT EXISTS (
+ SELECT 1 FROM dedup_lineage dl
+ WHERE dl.duplicate_table = ${tableExpr}
+ AND dl.duplicate_id = ${idExpr}
+ AND dl.status = 'marked'
+ )`;
+}
+
+// ---------------------------------------------------------------------------
+// Database plumbing — scans, lineage, deletion
+// ---------------------------------------------------------------------------
+
+interface TableScanConfig {
+ textColumns: string[];
+ createdAtColumn: string;
+ extraWhere?: string;
+}
+
+// Which columns constitute "the text" of a record for duplicate detection.
+// Decisions are scanned at status='active' only: supersede/revert already
+// manage lifecycle copies, and a superseded row must never win survivorship
+// over (and thereby hide) an active decision.
+const TABLE_SCAN_CONFIG: Record = {
+ messages: { textColumns: ['content'], createdAtColumn: 'timestamp' },
+ decisions: { textColumns: ['decision'], createdAtColumn: 'created_at', extraWhere: "status = 'active'" },
+ learnings: { textColumns: ['problem', 'solution'], createdAtColumn: 'created_at' },
+ breadcrumbs: { textColumns: ['content'], createdAtColumn: 'created_at' },
+ loa_entries: { textColumns: ['title', 'fabric_extract'], createdAtColumn: 'created_at' },
+};
+
+export interface ScanResult {
+ candidates: DedupCandidate[];
+ scanned: number;
+ tooShort: number;
+}
+
+/**
+ * Read one table's dedup candidates in bounded batches via keyset pagination
+ * (fixed binds per statement regardless of table size). Rows below
+ * MIN_DEDUP_TEXT_LENGTH are counted but excluded.
+ */
+export function scanCandidates(
+ db: Database,
+ table: ProvenanceTable,
+ project?: string,
+ batchSize: number = SQLITE_SAFE_CHUNK_SIZE
+): ScanResult {
+ const config = TABLE_SCAN_CONFIG[table];
+ const where = [
+ 'id > ?',
+ ...(config.extraWhere ? [config.extraWhere] : []),
+ ...(project !== undefined ? ['project = ?'] : []),
+ ].join(' AND ');
+ const stmt = db.prepare(`
+ SELECT id, project, provenance, importance,
+ ${config.createdAtColumn} AS created_at,
+ ${config.textColumns.join(', ')}
+ FROM ${table}
+ WHERE ${where}
+ ORDER BY id
+ LIMIT ?
+ `);
+
+ const candidates: DedupCandidate[] = [];
+ let scanned = 0;
+ let tooShort = 0;
+ let lastId = 0;
+ for (;;) {
+ const params = project !== undefined ? [lastId, project, batchSize] : [lastId, batchSize];
+ const batch = stmt.all(...params) as Array>;
+ if (batch.length === 0) break;
+ for (const row of batch) {
+ scanned++;
+ const text = config.textColumns
+ .map(c => row[c])
+ .filter(v => v !== null && v !== undefined && String(v).length > 0)
+ .join('\n');
+ const normalized = normalizeText(text);
+ if (normalized.length < MIN_DEDUP_TEXT_LENGTH) {
+ tooShort++;
+ continue;
+ }
+ candidates.push({
+ table,
+ id: row.id as number,
+ project: (row.project as string | null) ?? null,
+ provenance: (row.provenance as Provenance | null) ?? null,
+ importance: (row.importance as number | null) ?? 5,
+ created_at: String(row.created_at),
+ key: createHash('sha256').update(normalized).digest('hex'),
+ textLength: normalized.length,
+ });
+ }
+ lastId = batch[batch.length - 1].id as number;
+ }
+ return { candidates, scanned, tooShort };
+}
+
+/** (table, id) pairs currently marked as duplicates — excluded from scans. */
+export function loadMarkedDuplicates(db: Database): Set {
+ const rows = db.prepare(
+ `SELECT duplicate_table, duplicate_id FROM dedup_lineage WHERE status = 'marked'`
+ ).all() as Array<{ duplicate_table: string; duplicate_id: number }>;
+ return new Set(rows.map(r => `${r.duplicate_table}:${r.duplicate_id}`));
+}
+
+/** Stored embeddings for one table, keyed by source row id. */
+export function loadEmbeddings(db: Database, table: ProvenanceTable): Map {
+ const rows = db.prepare(
+ `SELECT source_id, embedding FROM embeddings WHERE source_table = ?`
+ ).all(table) as Array<{ source_id: number; embedding: Buffer }>;
+ const map = new Map();
+ for (const row of rows) {
+ map.set(row.source_id, blobToEmbedding(row.embedding));
+ }
+ return map;
+}
+
+/**
+ * Ids in `table` that other rows reference via foreign keys (loa_entries
+ * message ranges, loa parent links). With foreign_keys=ON these cannot be
+ * hard-deleted; destructive mode downgrades them to 'marked' instead of
+ * failing the whole transaction.
+ */
+export function fkProtectedIds(db: Database, table: ProvenanceTable): Set {
+ const ids = new Set();
+ if (table === 'messages') {
+ const rows = db.prepare(
+ `SELECT message_range_start AS s, message_range_end AS e FROM loa_entries
+ WHERE message_range_start IS NOT NULL OR message_range_end IS NOT NULL`
+ ).all() as Array<{ s: number | null; e: number | null }>;
+ for (const row of rows) {
+ if (row.s !== null) ids.add(row.s);
+ if (row.e !== null) ids.add(row.e);
+ }
+ } else if (table === 'loa_entries') {
+ const rows = db.prepare(
+ `SELECT DISTINCT parent_loa_id AS p FROM loa_entries WHERE parent_loa_id IS NOT NULL`
+ ).all() as Array<{ p: number }>;
+ for (const row of rows) ids.add(row.p);
+ }
+ return ids;
+}
+
+// ---------------------------------------------------------------------------
+// Plan and apply
+// ---------------------------------------------------------------------------
+
+export interface DedupTableReport {
+ table: ProvenanceTable;
+ scanned: number;
+ tooShort: number;
+ alreadyMarked: number;
+ exactGroups: number;
+ semanticPairs: number;
+ planned: LineageEntry[];
+}
+
+export interface DedupPlan {
+ tables: DedupTableReport[];
+ threshold: number;
+ /** Reason the semantic pass did not run, or null if it ran. */
+ semanticSkipped: string | null;
+ crossTable: {
+ textMatches: CrossTableMatch[];
+ semanticPairs: number;
+ };
+}
+
+export interface PlanDedupOptions {
+ tables?: ProvenanceTable[];
+ project?: string;
+ threshold?: number;
+ semantic?: boolean;
+}
+
+function lineageDetail(survivor: DedupCandidate, duplicate: DedupCandidate): string {
+ return JSON.stringify({
+ project: duplicate.project,
+ survivor_provenance: survivor.provenance ?? 'unknown',
+ duplicate_provenance: duplicate.provenance ?? 'unknown',
+ });
+}
+
+function toLineage(
+ survivor: DedupCandidate,
+ duplicate: DedupCandidate,
+ reason: DedupReason,
+ similarity: number | null
+): LineageEntry {
+ return {
+ survivor_table: survivor.table,
+ survivor_id: survivor.id,
+ duplicate_table: duplicate.table,
+ duplicate_id: duplicate.id,
+ reason,
+ similarity,
+ detail: lineageDetail(survivor, duplicate),
+ };
+}
+
+/**
+ * Compute the full dedup plan without writing anything. The same plan is
+ * what a dry run reports and what an execute run applies.
+ */
+export function planDedup(db: Database, options: PlanDedupOptions = {}): DedupPlan {
+ const tables = options.tables ?? [...DEDUP_TABLES];
+ const threshold = options.threshold ?? DEFAULT_SEMANTIC_THRESHOLD;
+ const semantic = options.semantic ?? true;
+ const marked = loadMarkedDuplicates(db);
+
+ const reports = new Map();
+ const eligible = new Map();
+
+ // Exact pass — within table + project.
+ const plannedDuplicateKeys = new Set();
+ const plannedSurvivorKeys = new Set();
+ for (const table of tables) {
+ const scan = scanCandidates(db, table, options.project);
+ const fresh = scan.candidates.filter(c => !marked.has(`${c.table}:${c.id}`));
+ const report: DedupTableReport = {
+ table,
+ scanned: scan.scanned,
+ tooShort: scan.tooShort,
+ alreadyMarked: scan.candidates.length - fresh.length,
+ exactGroups: 0,
+ semanticPairs: 0,
+ planned: [],
+ };
+ for (const group of findExactGroups(fresh)) {
+ report.exactGroups++;
+ const { survivor, duplicates } = selectSurvivor(group);
+ plannedSurvivorKeys.add(`${survivor.table}:${survivor.id}`);
+ for (const dup of duplicates) {
+ report.planned.push(toLineage(survivor, dup, 'exact', 1.0));
+ plannedDuplicateKeys.add(`${dup.table}:${dup.id}`);
+ }
+ }
+ reports.set(table, report);
+ eligible.set(table, fresh);
+ }
+
+ // Records still standing after the exact pass (survivors + unmatched).
+ const remaining = [...eligible.values()].flat()
+ .filter(c => !plannedDuplicateKeys.has(`${c.table}:${c.id}`));
+
+ // Cross-table normalized-text matches — report-only.
+ const crossTableText = findCrossTableMatches(remaining);
+
+ // Semantic pass — stored embeddings only; no embedding service required.
+ let semanticSkipped: string | null = null;
+ let crossTableSemantic = 0;
+ if (!semantic) {
+ semanticSkipped = 'disabled (--no-semantic)';
+ } else {
+ const embedded: EmbeddedCandidate[] = [];
+ for (const table of tables) {
+ const embeddings = loadEmbeddings(db, table);
+ if (embeddings.size === 0) continue;
+ for (const candidate of remaining) {
+ if (candidate.table !== table) continue;
+ const embedding = embeddings.get(candidate.id);
+ if (embedding) embedded.push({ candidate, embedding });
+ }
+ }
+ if (embedded.length === 0) {
+ semanticSkipped =
+ 'no stored embeddings for the scanned records (run `recall embed backfill` to enable semantic detection)';
+ } else {
+ const pairs = findSemanticPairs(embedded, threshold);
+ for (const pair of pairs) {
+ if (!pair.sameTable) {
+ crossTableSemantic++;
+ continue;
+ }
+ const aKey = `${pair.a.table}:${pair.a.id}`;
+ const bKey = `${pair.b.table}:${pair.b.id}`;
+ // Greedy, strongest-first: a record planned as a duplicate can
+ // neither survive nor be re-marked, and a record planned as a
+ // survivor (in either pass) can never be re-marked by a weaker
+ // pair — lineage stays one hop deep, no transitive chaining.
+ if (plannedDuplicateKeys.has(aKey) || plannedDuplicateKeys.has(bKey)) continue;
+ if (plannedSurvivorKeys.has(aKey) || plannedSurvivorKeys.has(bKey)) continue;
+ const { survivor, duplicates } = selectSurvivor([pair.a, pair.b]);
+ const report = reports.get(pair.a.table)!;
+ report.semanticPairs++;
+ report.planned.push(toLineage(survivor, duplicates[0], 'semantic', pair.similarity));
+ plannedDuplicateKeys.add(`${duplicates[0].table}:${duplicates[0].id}`);
+ plannedSurvivorKeys.add(`${survivor.table}:${survivor.id}`);
+ }
+ }
+ }
+
+ return {
+ tables: tables.map(t => reports.get(t)!),
+ threshold,
+ semanticSkipped,
+ crossTable: { textMatches: crossTableText, semanticPairs: crossTableSemantic },
+ };
+}
+
+export interface ApplyResult {
+ marked: number;
+ deleted: number;
+ /** Duplicates kept as 'marked' in destructive mode because of FK references. */
+ fkProtected: number;
+}
+
+/**
+ * Apply a plan: insert lineage rows ('marked' by default). With
+ * `destructive`, duplicate rows are hard-deleted (with their embeddings) and
+ * lineage status is 'deleted' — except FK-referenced rows, which stay marked.
+ * Runs in one transaction; failures roll back everything.
+ */
+export function applyDedupPlan(
+ db: Database,
+ plan: DedupPlan,
+ options: { destructive?: boolean } = {}
+): ApplyResult {
+ const destructive = options.destructive ?? false;
+ const entries = plan.tables.flatMap(t => t.planned);
+ const result: ApplyResult = { marked: 0, deleted: 0, fkProtected: 0 };
+ if (entries.length === 0) return result;
+
+ const insert = db.prepare(`
+ INSERT INTO dedup_lineage
+ (survivor_table, survivor_id, duplicate_table, duplicate_id, reason, similarity, status, detail)
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+ `);
+
+ // FK lookups are per-table constants within this transaction; cache them so
+ // a large plan doesn't re-query loa_entries per duplicate row.
+ const fkCache = new Map>();
+ const fkProtected = (table: ProvenanceTable): Set => {
+ let cached = fkCache.get(table);
+ if (!cached) {
+ cached = fkProtectedIds(db, table);
+ fkCache.set(table, cached);
+ }
+ return cached;
+ };
+
+ const apply = db.transaction(() => {
+ const toDelete = new Map();
+ for (const entry of entries) {
+ let status: LineageStatus = 'marked';
+ if (destructive) {
+ const protectedIds = fkProtected(entry.duplicate_table);
+ if (protectedIds.has(entry.duplicate_id)) {
+ result.fkProtected++;
+ } else {
+ status = 'deleted';
+ const list = toDelete.get(entry.duplicate_table);
+ if (list) list.push(entry.duplicate_id);
+ else toDelete.set(entry.duplicate_table, [entry.duplicate_id]);
+ }
+ }
+ insert.run(
+ entry.survivor_table,
+ entry.survivor_id,
+ entry.duplicate_table,
+ entry.duplicate_id,
+ entry.reason,
+ entry.similarity,
+ status,
+ entry.detail
+ );
+ if (status === 'marked') result.marked++;
+ else result.deleted++;
+ }
+
+ // Input-scaled IN lists — chunked() per the src/lib/chunk.ts audit note.
+ for (const [table, ids] of toDelete) {
+ for (const chunk of chunked(ids)) {
+ const placeholders = chunk.map(() => '?').join(', ');
+ db.prepare(`DELETE FROM ${table} WHERE id IN (${placeholders})`).run(...chunk);
+ db.prepare(
+ `DELETE FROM embeddings WHERE source_table = ? AND source_id IN (${placeholders})`
+ ).run(table, ...chunk);
+ }
+ }
+ });
+ apply();
+
+ return result;
+}
diff --git a/src/lib/embeddings.ts b/src/lib/embeddings.ts
index 8dc22f4..a46632b 100644
--- a/src/lib/embeddings.ts
+++ b/src/lib/embeddings.ts
@@ -84,12 +84,17 @@ export function embeddingToBlob(embedding: number[]): Buffer {
}
/**
- * Convert SQLite BLOB back to embedding array
+ * Convert SQLite BLOB back to embedding array.
+ * bun:sqlite returns BLOB columns as Uint8Array (not Buffer) — wrap without
+ * copying so readFloatLE is available either way.
*/
-export function blobToEmbedding(blob: Buffer): number[] {
+export function blobToEmbedding(blob: Buffer | Uint8Array): number[] {
+ const buf = Buffer.isBuffer(blob)
+ ? blob
+ : Buffer.from(blob.buffer, blob.byteOffset, blob.byteLength);
const embedding: number[] = [];
- for (let i = 0; i < blob.length; i += 4) {
- embedding.push(blob.readFloatLE(i));
+ for (let i = 0; i < buf.length; i += 4) {
+ embedding.push(buf.readFloatLE(i));
}
return embedding;
}
diff --git a/src/lib/export.ts b/src/lib/export.ts
index 2a072bb..a97573d 100644
--- a/src/lib/export.ts
+++ b/src/lib/export.ts
@@ -25,7 +25,11 @@ import { SQLITE_SAFE_CHUNK_SIZE } from './chunk.js';
import { getMigrationVersion } from '../db/migrations.js';
import { VERSION } from '../version.js';
-/** Durable memory tables included in app-level (JSON/Markdown/SQL) exports. */
+/**
+ * Durable tables included in app-level (JSON/Markdown/SQL) exports: the
+ * memory tables plus dedup_lineage (issue #45), so duplicate lineage stays
+ * portable and auditable alongside the records it describes.
+ */
export const EXPORT_TABLES = [
'sessions',
'messages',
@@ -33,6 +37,7 @@ export const EXPORT_TABLES = [
'learnings',
'breadcrumbs',
'loa_entries',
+ 'dedup_lineage',
] as const;
export type ExportTable = typeof EXPORT_TABLES[number];
diff --git a/src/lib/memory.ts b/src/lib/memory.ts
index eb89ce3..4ad246c 100644
--- a/src/lib/memory.ts
+++ b/src/lib/memory.ts
@@ -2,6 +2,7 @@
import { getDb, getDbPath } from '../db/connection.js';
import { existsSync, statSync } from 'fs';
+import { notMarkedDuplicateSql } from './dedup.js';
import type { Session, Message, Decision, Learning, Breadcrumb, LoaEntry, Stats, SearchResult, Provenance } from '../types/index.js';
// ============ Sessions ============
@@ -271,6 +272,15 @@ export interface MemorySearchOptions {
table?: string;
limit?: number;
biasType?: SearchTable;
+ /** Include records marked as duplicates by `recall dedup` (hidden by default). */
+ includeDuplicates?: boolean;
+}
+
+// Marked duplicates (issue #45) are hidden from search by default — the
+// lineage row in dedup_lineage keeps them recoverable and auditable.
+function duplicateFilter(options: MemorySearchOptions | undefined, physicalTable: string, idExpr: string): string {
+ if (options?.includeDuplicates) return '';
+ return `AND ${notMarkedDuplicateSql(`'${physicalTable}'`, idExpr)}`;
}
const TYPE_BIAS_RANK_MULTIPLIER = 4;
@@ -313,6 +323,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu
FROM messages_fts f
JOIN messages m ON m.id = f.rowid
WHERE messages_fts MATCH ?
+ ${duplicateFilter(options, 'messages', 'm.id')}
${options?.project ? 'AND m.project = ?' : ''}
ORDER BY f.rank
LIMIT ?
@@ -325,6 +336,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu
JOIN decisions d ON d.id = f.rowid
WHERE decisions_fts MATCH ?
AND d.status = 'active'
+ ${duplicateFilter(options, 'decisions', 'd.id')}
${options?.project ? 'AND d.project = ?' : ''}
ORDER BY CASE d.confidence WHEN 'high' THEN 0 WHEN 'medium' THEN 1 WHEN 'low' THEN 2 ELSE 1 END, f.rank
LIMIT ?
@@ -336,6 +348,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu
FROM learnings_fts f
JOIN learnings l ON l.id = f.rowid
WHERE learnings_fts MATCH ?
+ ${duplicateFilter(options, 'learnings', 'l.id')}
${options?.project ? 'AND l.project = ?' : ''}
ORDER BY f.rank
LIMIT ?
@@ -347,6 +360,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu
FROM breadcrumbs_fts f
JOIN breadcrumbs b ON b.id = f.rowid
WHERE breadcrumbs_fts MATCH ?
+ ${duplicateFilter(options, 'breadcrumbs', 'b.id')}
${options?.project ? 'AND b.project = ?' : ''}
ORDER BY f.rank
LIMIT ?
@@ -358,6 +372,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu
FROM loa_fts f
JOIN loa_entries l ON l.id = f.rowid
WHERE loa_fts MATCH ?
+ ${duplicateFilter(options, 'loa_entries', 'l.id')}
${options?.project ? 'AND l.project = ?' : ''}
ORDER BY f.rank
LIMIT ?
diff --git a/src/mcp-server.ts b/src/mcp-server.ts
index 958d001..242085f 100644
--- a/src/mcp-server.ts
+++ b/src/mcp-server.ts
@@ -66,6 +66,7 @@ import {
reciprocalRankFusion,
checkEmbeddingService,
} from "./lib/embeddings.js";
+import { notMarkedDuplicateSql } from "./lib/dedup.js";
import type { Provenance } from "./types/index.js";
import { existsSync } from "fs";
@@ -118,9 +119,12 @@ async function hybridSearch(
const queryResult = await embed(query);
const queryEmbedding = queryResult.embedding;
+ // Marked duplicates (recall dedup, issue #45) keep their embeddings
+ // but are hidden from the vector path, matching the FTS5 default.
const embeddings = db
.prepare(`
SELECT source_table, source_id, embedding FROM embeddings
+ WHERE ${notMarkedDuplicateSql("source_table", "source_id")}
`)
.all() as Array<{
source_table: string;
diff --git a/tests/commands/dedup.test.ts b/tests/commands/dedup.test.ts
new file mode 100644
index 0000000..5c309b0
--- /dev/null
+++ b/tests/commands/dedup.test.ts
@@ -0,0 +1,345 @@
+// recall dedup — issue #45 acceptance criteria.
+//
+// Behavior under test:
+// - dry-run by default: reports the plan, writes nothing
+// - --execute marks duplicates in dedup_lineage; records stay intact
+// - exact/normalized duplicate detection within table + project
+// - survivor priority user_authored > verbatim > extracted > derived > unknown
+// - semantic pass uses stored embeddings only; skipped (and reported) when
+// none exist; threshold respected, below-threshold never merged
+// - idempotence: a second execute finds nothing new
+// - cross-table candidates are report-only
+// - search hides marked duplicates by default; --include-duplicates shows them
+// - lineage rows persist full audit detail
+// - --delete (destructive opt-in) removes rows + embeddings; FK-referenced
+// duplicates are kept as marked instead of failing the transaction
+
+import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
+import { setupTestDb, teardownTestDb } from '../helpers/setup';
+import { runDedup } from '../../src/commands/dedup';
+import { embeddingToBlob } from '../../src/lib/embeddings';
+import { getDb } from '../../src/db/connection';
+import { search } from '../../src/lib/memory';
+import {
+ createSession,
+ addMessage,
+ addDecision,
+ addBreadcrumb,
+ createLoaEntry,
+ supersedeDecision,
+} from '../../src/lib/memory';
+
+const originalLog = console.log;
+const originalError = console.error;
+const originalExitCode = process.exitCode;
+
+beforeEach(() => {
+ setupTestDb();
+ console.log = () => {};
+ console.error = () => {};
+});
+
+afterEach(() => {
+ console.log = originalLog;
+ console.error = originalError;
+ process.exitCode = originalExitCode;
+ teardownTestDb();
+});
+
+// Long enough to clear MIN_DEDUP_TEXT_LENGTH after normalization.
+const CRUMB = 'Always use bun for every script in this repository, never npm.';
+const OTHER = 'A completely different breadcrumb about the release process.';
+
+function lineageRows(): Array> {
+ return getDb().prepare('SELECT * FROM dedup_lineage ORDER BY id').all() as Array>;
+}
+
+function insertEmbedding(table: string, id: number, vector: number[]): void {
+ getDb().prepare(
+ `INSERT OR REPLACE INTO embeddings (source_table, source_id, model, dimensions, embedding)
+ VALUES (?, ?, 'test', ?, ?)`
+ ).run(table, id, vector.length, embeddingToBlob(vector));
+}
+
+describe('dry-run vs execute', () => {
+ test('dry-run reports the plan and writes nothing', () => {
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ addBreadcrumb({ content: OTHER, importance: 5 });
+
+ const result = runDedup({})!;
+ const crumbs = result.plan.tables.find(t => t.table === 'breadcrumbs')!;
+ expect(crumbs.exactGroups).toBe(1);
+ expect(crumbs.planned.length).toBe(1);
+ expect(result.applied).toBeNull();
+ expect(lineageRows().length).toBe(0);
+ expect(search('breadcrumb OR bun').length).toBe(3);
+ });
+
+ test('--execute marks duplicates non-destructively and search hides them', () => {
+ const id1 = addBreadcrumb({ content: CRUMB, importance: 5 });
+ const id2 = addBreadcrumb({ content: CRUMB, importance: 5 });
+
+ const result = runDedup({ execute: true })!;
+ expect(result.applied?.marked).toBe(1);
+
+ // Non-destructive: both records still exist.
+ const count = getDb().prepare('SELECT COUNT(*) AS c FROM breadcrumbs').get() as { c: number };
+ expect(count.c).toBe(2);
+
+ // Search hides the marked duplicate by default...
+ const hidden = search('bun');
+ expect(hidden.length).toBe(1);
+ // ...and shows it again with the explicit include option.
+ const shown = search('bun', { includeDuplicates: true });
+ expect(shown.length).toBe(2);
+ expect(shown.map(r => r.id).sort()).toEqual([id1, id2].sort());
+ });
+
+ test('--delete without --execute is rejected', () => {
+ expect(runDedup({ delete: true })).toBeUndefined();
+ expect(process.exitCode).toBe(1);
+ });
+});
+
+describe('exact detection and survivor selection', () => {
+ test('normalized variants dedupe; provenance picks the survivor', () => {
+ const extracted = addBreadcrumb({ content: CRUMB, importance: 9, provenance: 'extracted' });
+ const authored = addBreadcrumb({ content: CRUMB.toUpperCase(), importance: 1, provenance: 'user_authored' });
+ const legacy = addBreadcrumb({ content: ` ${CRUMB} `, importance: 9 });
+
+ runDedup({ execute: true });
+
+ const rows = lineageRows();
+ expect(rows.length).toBe(2);
+ for (const row of rows) {
+ expect(row.survivor_id).toBe(authored);
+ expect(row.survivor_table).toBe('breadcrumbs');
+ expect(row.reason).toBe('exact');
+ expect(row.status).toBe('marked');
+ }
+ expect(rows.map(r => r.duplicate_id).sort()).toEqual([extracted, legacy].sort());
+ });
+
+ test('identical text in different projects is not deduped', () => {
+ addBreadcrumb({ content: CRUMB, importance: 5, project: 'alpha' });
+ addBreadcrumb({ content: CRUMB, importance: 5, project: 'beta' });
+ const result = runDedup({ execute: true })!;
+ expect(result.applied?.marked).toBe(0);
+ });
+
+ test('short records are never candidates', () => {
+ addBreadcrumb({ content: 'ok', importance: 5 });
+ addBreadcrumb({ content: 'ok', importance: 5 });
+ const result = runDedup({})!;
+ const crumbs = result.plan.tables.find(t => t.table === 'breadcrumbs')!;
+ expect(crumbs.tooShort).toBe(2);
+ expect(crumbs.planned.length).toBe(0);
+ });
+
+ test('only active decisions participate', () => {
+ const text = 'Adopt SQLite WAL mode for every database connection in Recall.';
+ const stale = addDecision({ decision: text, status: 'active' });
+ supersedeDecision(stale);
+ addDecision({ decision: text, status: 'active' });
+
+ const result = runDedup({ execute: true })!;
+ expect(result.applied?.marked).toBe(0);
+ });
+
+ test('--table scopes the run', () => {
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ createSession({ session_id: 's1', started_at: '2026-01-01T00:00:00Z' });
+ addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:01Z', role: 'user', content: CRUMB });
+ addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:02Z', role: 'user', content: CRUMB });
+
+ const result = runDedup({ execute: true, table: 'messages' })!;
+ expect(result.plan.tables.length).toBe(1);
+ expect(result.applied?.marked).toBe(1);
+ expect(lineageRows().every(r => r.duplicate_table === 'messages')).toBe(true);
+ });
+
+ test('invalid --table and --threshold are rejected', () => {
+ expect(runDedup({ table: 'sessions' })).toBeUndefined();
+ expect(process.exitCode).toBe(1);
+ process.exitCode = 0;
+ expect(runDedup({ threshold: 1.5 })).toBeUndefined();
+ expect(process.exitCode).toBe(1);
+ });
+});
+
+describe('semantic pass', () => {
+ test('skipped with a clear reason when no embeddings exist', () => {
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ addBreadcrumb({ content: OTHER, importance: 5 });
+ const result = runDedup({})!;
+ expect(result.plan.semanticSkipped).toContain('embed');
+ });
+
+ test('skipped when disabled via --no-semantic', () => {
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ const result = runDedup({ semantic: false })!;
+ expect(result.plan.semanticSkipped).toContain('--no-semantic');
+ });
+
+ test('marks pairs at the threshold, never below it', () => {
+ const near = Math.sqrt(1 - 0.97 * 0.97);
+ const a = addBreadcrumb({ content: CRUMB, importance: 5 });
+ const b = addBreadcrumb({ content: OTHER, importance: 5 });
+ insertEmbedding('breadcrumbs', a, [1, 0, 0]);
+ insertEmbedding('breadcrumbs', b, [0.97, near, 0]);
+
+ // Above the default 0.95 threshold → marked, with similarity recorded.
+ const marked = runDedup({ execute: true })!;
+ expect(marked.applied?.marked).toBe(1);
+ const row = lineageRows()[0];
+ expect(row.reason).toBe('semantic');
+ expect(row.similarity as number).toBeCloseTo(0.97, 3);
+
+ // Survivor is the richer record (longer normalized text wins the tie).
+ const longer = CRUMB.length >= OTHER.length ? a : b;
+ expect(row.survivor_id).toBe(longer);
+ });
+
+ test('a stricter threshold leaves the same pair untouched', () => {
+ const near = Math.sqrt(1 - 0.97 * 0.97);
+ const a = addBreadcrumb({ content: CRUMB, importance: 5 });
+ const b = addBreadcrumb({ content: OTHER, importance: 5 });
+ insertEmbedding('breadcrumbs', a, [1, 0, 0]);
+ insertEmbedding('breadcrumbs', b, [0.97, near, 0]);
+
+ const result = runDedup({ execute: true, threshold: 0.99 })!;
+ expect(result.applied?.marked).toBe(0);
+ expect(lineageRows().length).toBe(0);
+ });
+
+ test('a planned survivor is never re-marked by a weaker pair (no transitive chaining)', () => {
+ // PR #60 review repro: cos(A,B)=0.99, cos(B,C)=0.96, cos(A,C)≈0.91 with
+ // the default 0.95 threshold. Ascending text length pins the survivor
+ // orientation: B survives the strongest pair, so without the survivor
+ // guard the weaker B/C pair re-marks B — leaving A's only visible
+ // neighbor at 0.91, below the threshold.
+ const a = addBreadcrumb({ content: 'Review queue triage happens before standup.', importance: 5 });
+ const b = addBreadcrumb({ content: 'Review queue triage always happens before the standup.', importance: 5 });
+ const c = addBreadcrumb({ content: 'Review queue triage must always happen before the morning standup.', importance: 5 });
+ insertEmbedding('breadcrumbs', a, [0.99, Math.sqrt(1 - 0.99 * 0.99), 0]);
+ insertEmbedding('breadcrumbs', b, [1, 0, 0]);
+ insertEmbedding('breadcrumbs', c, [0.96, -Math.sqrt(1 - 0.96 * 0.96), 0]);
+
+ const result = runDedup({ execute: true })!;
+
+ // Only the strongest pair is marked; B/C is skipped because B already
+ // survives A.
+ expect(result.applied?.marked).toBe(1);
+ const rows = lineageRows();
+ expect(rows.length).toBe(1);
+ expect(rows[0].duplicate_id).toBe(a);
+ expect(rows[0].survivor_id).toBe(b);
+ expect(rows[0].similarity as number).toBeCloseTo(0.99, 3);
+
+ // One-hop lineage invariant: no survivor is itself marked as a duplicate.
+ const duplicates = new Set(rows.map(r => `${r.duplicate_table}:${r.duplicate_id}`));
+ for (const row of rows) {
+ expect(duplicates.has(`${row.survivor_table}:${row.survivor_id}`)).toBe(false);
+ }
+ });
+});
+
+describe('idempotence', () => {
+ test('a second execute finds nothing new', () => {
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+
+ const first = runDedup({ execute: true })!;
+ expect(first.applied?.marked).toBe(2);
+ const after = lineageRows().length;
+
+ const second = runDedup({ execute: true })!;
+ expect(second.applied?.marked).toBe(0);
+ const crumbs = second.plan.tables.find(t => t.table === 'breadcrumbs')!;
+ expect(crumbs.alreadyMarked).toBe(2);
+ expect(lineageRows().length).toBe(after);
+ });
+});
+
+describe('cross-table candidates are report-only', () => {
+ test('reported, never marked', () => {
+ createSession({ session_id: 's1', started_at: '2026-01-01T00:00:00Z' });
+ addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:01Z', role: 'user', content: CRUMB });
+ addBreadcrumb({ content: CRUMB, importance: 5 });
+
+ const result = runDedup({ execute: true })!;
+ expect(result.plan.crossTable.textMatches.length).toBe(1);
+ expect(result.applied?.marked).toBe(0);
+ expect(lineageRows().length).toBe(0);
+ // Both records remain searchable.
+ expect(search('bun').length).toBe(2);
+ });
+});
+
+describe('lineage persistence', () => {
+ test('rows carry full audit detail', () => {
+ const survivor = addBreadcrumb({ content: CRUMB, importance: 5, provenance: 'user_authored', project: 'demo' });
+ const dup = addBreadcrumb({ content: CRUMB, importance: 5, provenance: 'extracted', project: 'demo' });
+
+ runDedup({ execute: true });
+
+ const row = lineageRows()[0];
+ expect(row.survivor_table).toBe('breadcrumbs');
+ expect(row.survivor_id).toBe(survivor);
+ expect(row.duplicate_table).toBe('breadcrumbs');
+ expect(row.duplicate_id).toBe(dup);
+ expect(row.reason).toBe('exact');
+ expect(row.similarity).toBe(1);
+ expect(row.status).toBe('marked');
+ expect(typeof row.created_at).toBe('string');
+ const detail = JSON.parse(row.detail as string);
+ expect(detail.survivor_provenance).toBe('user_authored');
+ expect(detail.duplicate_provenance).toBe('extracted');
+ expect(detail.project).toBe('demo');
+ });
+});
+
+describe('destructive opt-in (--execute --delete)', () => {
+ test('deletes duplicates and their embeddings; lineage records the deletion', () => {
+ const survivor = addBreadcrumb({ content: CRUMB, importance: 5, provenance: 'user_authored' });
+ const dup = addBreadcrumb({ content: CRUMB, importance: 5 });
+ insertEmbedding('breadcrumbs', dup, [1, 0, 0]);
+
+ const result = runDedup({ execute: true, delete: true })!;
+ expect(result.applied?.deleted).toBe(1);
+
+ const db = getDb();
+ const remaining = db.prepare('SELECT id FROM breadcrumbs').all() as Array<{ id: number }>;
+ expect(remaining.map(r => r.id)).toEqual([survivor]);
+ const embeddings = db.prepare('SELECT COUNT(*) AS c FROM embeddings').get() as { c: number };
+ expect(embeddings.c).toBe(0);
+ expect(lineageRows()[0].status).toBe('deleted');
+ });
+
+ test('FK-referenced duplicates are kept as marked instead of failing', () => {
+ createSession({ session_id: 's1', started_at: '2026-01-01T00:00:00Z' });
+ const survivor = addMessage({ session_id: 's1', timestamp: '2026-01-02T00:00:00Z', role: 'user', content: CRUMB });
+ const referenced = addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:00Z', role: 'user', content: CRUMB });
+ // The older message loses survivorship (recency tie-break) but is pinned
+ // by an LoA message range — hard-deleting it would violate the FK.
+ createLoaEntry({
+ title: 'entry', fabric_extract: 'body',
+ message_range_start: referenced, message_range_end: referenced,
+ });
+
+ const result = runDedup({ execute: true, delete: true })!;
+ expect(result.applied?.deleted).toBe(0);
+ expect(result.applied?.fkProtected).toBe(1);
+
+ const row = lineageRows()[0];
+ expect(row.duplicate_id).toBe(referenced);
+ expect(row.survivor_id).toBe(survivor);
+ expect(row.status).toBe('marked');
+ // The record still exists, just hidden.
+ const msg = getDb().prepare('SELECT id FROM messages WHERE id = ?').get(referenced);
+ expect(msg).toBeDefined();
+ });
+});
diff --git a/tests/commands/export.test.ts b/tests/commands/export.test.ts
index b7e0274..799c1f7 100644
--- a/tests/commands/export.test.ts
+++ b/tests/commands/export.test.ts
@@ -134,6 +134,7 @@ describe('manifest', () => {
learnings: 1,
breadcrumbs: 1,
loa_entries: 1,
+ dedup_lineage: 0,
});
expect(manifest.provenance_counts.messages).toEqual({ unknown: 1, verbatim: 1 });
expect(manifest.provenance_counts.decisions).toEqual({ unknown: 1, user_authored: 1 });
@@ -199,6 +200,7 @@ describe('SQL dump', () => {
learnings: 1,
breadcrumbs: 1,
loa_entries: 1,
+ dedup_lineage: 0,
});
// Legacy NULL restores as NULL; known values survive verbatim
diff --git a/tests/db/migrations.test.ts b/tests/db/migrations.test.ts
index 8a52d55..62added 100644
--- a/tests/db/migrations.test.ts
+++ b/tests/db/migrations.test.ts
@@ -188,7 +188,8 @@ describe('MIGRATIONS array', () => {
test('has expected number of migrations', () => {
// 7 → 8: importance column on messages/decisions/learnings/loa_entries (Sprint #4)
// 8 → 9: provenance column on all five memory tables (issue #42)
- expect(MIGRATIONS.length).toBe(9);
+ // 9 → 10: dedup_lineage table (issue #45)
+ expect(MIGRATIONS.length).toBe(10);
});
test('all entries are functions', () => {
diff --git a/tests/lib/dedup.test.ts b/tests/lib/dedup.test.ts
new file mode 100644
index 0000000..e6ff9dd
--- /dev/null
+++ b/tests/lib/dedup.test.ts
@@ -0,0 +1,200 @@
+// recall dedup — pure logic (issue #45).
+//
+// Unit tests over the pure exported functions: normalization, survivor
+// selection (provenance > richness > importance > recency > id), exact
+// grouping, cross-table matching, and semantic pairing. Property-based
+// suites over these functions are issue #44 phase 2 — not here.
+
+import { describe, test, expect } from 'bun:test';
+import {
+ compareForSurvivor,
+ DEFAULT_SEMANTIC_THRESHOLD,
+ findCrossTableMatches,
+ findExactGroups,
+ findSemanticPairs,
+ normalizeText,
+ provenanceRank,
+ selectSurvivor,
+ type DedupCandidate,
+} from '../../src/lib/dedup';
+import { PROVENANCE_VALUES } from '../../src/types/index';
+
+function candidate(overrides: Partial = {}): DedupCandidate {
+ return {
+ table: 'breadcrumbs',
+ id: 1,
+ project: null,
+ provenance: null,
+ importance: 5,
+ created_at: '2026-01-01 00:00:00',
+ key: 'k',
+ textLength: 50,
+ ...overrides,
+ };
+}
+
+describe('normalizeText', () => {
+ test('lowercases, strips quotes, collapses whitespace', () => {
+ expect(normalizeText(` Use "Bun" for\n'everything' `)).toBe('use bun for everything');
+ });
+
+ test('identical meaning with different casing/spacing normalizes equal', () => {
+ expect(normalizeText('Ship THE feature')).toBe(normalizeText("ship the 'feature'"));
+ });
+});
+
+describe('provenanceRank', () => {
+ test('orders user_authored > verbatim > extracted > derived > unknown', () => {
+ const ranks = [...PROVENANCE_VALUES, null].map(p => provenanceRank(p));
+ expect(ranks).toEqual([...ranks].sort((a, b) => a - b));
+ expect(provenanceRank('user_authored')).toBeLessThan(provenanceRank('verbatim'));
+ expect(provenanceRank('verbatim')).toBeLessThan(provenanceRank('extracted'));
+ expect(provenanceRank('extracted')).toBeLessThan(provenanceRank('derived'));
+ expect(provenanceRank('derived')).toBeLessThan(provenanceRank(null));
+ });
+});
+
+describe('selectSurvivor', () => {
+ test('provenance dominates richness, importance, and recency', () => {
+ const weakButAuthored = candidate({ id: 1, provenance: 'user_authored', textLength: 10, importance: 1, created_at: '2020-01-01 00:00:00' });
+ const richButExtracted = candidate({ id: 2, provenance: 'extracted', textLength: 9999, importance: 10, created_at: '2026-01-01 00:00:00' });
+ const { survivor, duplicates } = selectSurvivor([richButExtracted, weakButAuthored]);
+ expect(survivor.id).toBe(1);
+ expect(duplicates.map(d => d.id)).toEqual([2]);
+ });
+
+ test('richness breaks provenance ties', () => {
+ const short = candidate({ id: 1, textLength: 10 });
+ const long = candidate({ id: 2, textLength: 200 });
+ expect(selectSurvivor([short, long]).survivor.id).toBe(2);
+ });
+
+ test('importance breaks richness ties', () => {
+ const low = candidate({ id: 1, importance: 3 });
+ const high = candidate({ id: 2, importance: 8 });
+ expect(selectSurvivor([low, high]).survivor.id).toBe(2);
+ });
+
+ test('recency breaks importance ties', () => {
+ const older = candidate({ id: 1, created_at: '2025-01-01 00:00:00' });
+ const newer = candidate({ id: 2, created_at: '2026-01-01 00:00:00' });
+ expect(selectSurvivor([older, newer]).survivor.id).toBe(2);
+ });
+
+ test('lowest id is the final deterministic tie-break', () => {
+ const a = candidate({ id: 7 });
+ const b = candidate({ id: 3 });
+ expect(selectSurvivor([a, b]).survivor.id).toBe(3);
+ // Total order: comparator never reports equality for distinct ids.
+ expect(compareForSurvivor(a, b)).toBeGreaterThan(0);
+ });
+
+ test('throws on an empty group', () => {
+ expect(() => selectSurvivor([])).toThrow();
+ });
+});
+
+describe('findExactGroups', () => {
+ test('groups same table + project + key; singletons excluded', () => {
+ const groups = findExactGroups([
+ candidate({ id: 1, key: 'a' }),
+ candidate({ id: 2, key: 'a' }),
+ candidate({ id: 3, key: 'b' }),
+ ]);
+ expect(groups.length).toBe(1);
+ expect(groups[0].map(c => c.id).sort()).toEqual([1, 2]);
+ });
+
+ test('identical text in different projects is not grouped', () => {
+ const groups = findExactGroups([
+ candidate({ id: 1, key: 'a', project: 'alpha' }),
+ candidate({ id: 2, key: 'a', project: 'beta' }),
+ ]);
+ expect(groups.length).toBe(0);
+ });
+});
+
+describe('findCrossTableMatches', () => {
+ test('reports same text across tables, ignores single-table matches', () => {
+ const matches = findCrossTableMatches([
+ candidate({ id: 1, key: 'a', table: 'breadcrumbs' }),
+ candidate({ id: 2, key: 'a', table: 'messages' }),
+ candidate({ id: 3, key: 'b', table: 'breadcrumbs' }),
+ candidate({ id: 4, key: 'b', table: 'breadcrumbs' }),
+ ]);
+ expect(matches.length).toBe(1);
+ expect(matches[0].members.map(m => m.table).sort()).toEqual(['breadcrumbs', 'messages']);
+ });
+});
+
+describe('findSemanticPairs', () => {
+ const unit = (x: number, y: number, z: number) => [x, y, z];
+
+ test('pairs at/above threshold; orthogonal vectors never pair', () => {
+ const near = Math.sqrt(1 - 0.97 * 0.97);
+ const pairs = findSemanticPairs(
+ [
+ { candidate: candidate({ id: 1, key: 'a' }), embedding: unit(1, 0, 0) },
+ { candidate: candidate({ id: 2, key: 'b' }), embedding: unit(0.97, near, 0) },
+ { candidate: candidate({ id: 3, key: 'c' }), embedding: unit(0, 0, 1) },
+ ],
+ DEFAULT_SEMANTIC_THRESHOLD
+ );
+ expect(pairs.length).toBe(1);
+ expect([pairs[0].a.id, pairs[0].b.id].sort()).toEqual([1, 2]);
+ expect(pairs[0].similarity).toBeCloseTo(0.97, 5);
+ });
+
+ test('below-threshold pairs are never produced', () => {
+ const near = Math.sqrt(1 - 0.97 * 0.97);
+ const pairs = findSemanticPairs(
+ [
+ { candidate: candidate({ id: 1, key: 'a' }), embedding: unit(1, 0, 0) },
+ { candidate: candidate({ id: 2, key: 'b' }), embedding: unit(0.97, near, 0) },
+ ],
+ 0.99
+ );
+ expect(pairs.length).toBe(0);
+ });
+
+ test('same-table pairs across projects are ignored; cross-table pairs are flagged', () => {
+ const pairs = findSemanticPairs(
+ [
+ { candidate: candidate({ id: 1, key: 'a', project: 'alpha' }), embedding: unit(1, 0, 0) },
+ { candidate: candidate({ id: 2, key: 'b', project: 'beta' }), embedding: unit(1, 0, 0) },
+ { candidate: candidate({ id: 3, key: 'c', table: 'messages', project: 'alpha' }), embedding: unit(1, 0, 0) },
+ ],
+ 0.95
+ );
+ // 1↔2: same table, different projects — ignored.
+ // 1↔3 and 2↔3: cross-table — report-only pairs.
+ expect(pairs.every(p => !p.sameTable)).toBe(true);
+ expect(pairs.length).toBe(2);
+ });
+
+ test('identical normalized keys are left to the exact pass', () => {
+ const pairs = findSemanticPairs(
+ [
+ { candidate: candidate({ id: 1, key: 'same' }), embedding: unit(1, 0, 0) },
+ { candidate: candidate({ id: 2, key: 'same' }), embedding: unit(1, 0, 0) },
+ ],
+ 0.95
+ );
+ expect(pairs.length).toBe(0);
+ });
+
+ test('results are sorted strongest-first', () => {
+ const mid = Math.sqrt(1 - 0.96 * 0.96);
+ const near = Math.sqrt(1 - 0.99 * 0.99);
+ const pairs = findSemanticPairs(
+ [
+ { candidate: candidate({ id: 1, key: 'a' }), embedding: unit(1, 0, 0) },
+ { candidate: candidate({ id: 2, key: 'b' }), embedding: unit(0.99, near, 0) },
+ { candidate: candidate({ id: 3, key: 'c' }), embedding: unit(0.96, mid, 0) },
+ ],
+ 0.95
+ );
+ const sims = pairs.map(p => p.similarity);
+ expect(sims).toEqual([...sims].sort((a, b) => b - a));
+ });
+});