diff --git a/CHANGELOG.md b/CHANGELOG.md index 4a11335..abf0eda 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,18 @@ while MCP tool names (`memory_search`, `memory_add`, etc.) remain stable. ### Added +- **`recall dedup`** — non-destructive dedup with provenance-aware survivor + selection (#45): dry-run by default, `--execute` marks duplicates in the new + `dedup_lineage` table (schema migration 9→10) without touching the records, + `--delete` is the destructive opt-in (take `recall export --backup` first). + Detection combines normalized-text matching with semantic matching over + stored embeddings (conservative 0.95 default threshold, skip reported when + embeddings are unavailable). Survivor priority is `user_authored > verbatim + > extracted > derived > unknown`, then richness, importance, recency. + Within-table only; cross-table candidates are report-only. Marked + duplicates are hidden from every search path unless + `recall search --include-duplicates` is passed, and lineage rows are + included in `recall export`. - **`recall export`** — portable and disaster-recovery exports (#43): JSON, Markdown, SQL dump, and SQLite (`VACUUM INTO`) formats with a manifest (counts + provenance counts including explicit `unknown`), a stdout/file/ diff --git a/docs/architecture.md b/docs/architecture.md index a7d859e..d84e960 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -70,6 +70,7 @@ both. | telos | Purpose framework entries (optional) | Yes | | documents | Imported standalone markdown documents (optional) | Yes | | embeddings | Vector embeddings for semantic search (768-dim, nomic-embed-text) | N/A | +| dedup_lineage | Duplicate lineage audit trail from `recall dedup` (survivor, duplicate, reason, similarity, status) | No | All FTS5-indexed tables have automatic sync triggers. @@ -88,6 +89,15 @@ rows stay `NULL` (unknown) until classified with `recall provenance backfill`, which only acts on deterministic write-path evidence and never guesses. +The `dedup_lineage` table was added in schema migration 9→10. `recall dedup` +marks duplicate records non-destructively by writing lineage rows here +(survivor table/id, duplicate table/id, reason, similarity, status); marked +duplicates stay in their source tables but are hidden from search unless +`--include-duplicates` is passed. Survivor selection follows provenance order +(`user_authored > verbatim > extracted > derived > unknown`), then richness, +importance, and recency. Dedup acts within a table only; cross-table +candidates are report-only. + ## Tiered RecallStart (v0.7.0+) The `RecallStart` hook injects two tiers at the top of every session: diff --git a/docs/cli-reference.md b/docs/cli-reference.md index c3d38ff..671de42 100644 --- a/docs/cli-reference.md +++ b/docs/cli-reference.md @@ -17,6 +17,7 @@ recall search "query" -t decisions # Hard-filter to decisions only recall search "query" --bias-type decisions # Prefer decisions, still show other matching tables recall search "query" -p myproject # Filter by project recall search "query" --show-provenance # Show provenance for every result +recall search "query" --include-duplicates # Include records marked by recall dedup recall semantic "query" # Semantic search (explicit) recall hybrid "query" # Hybrid search (explicit) ``` @@ -46,6 +47,8 @@ FTS5 supports boolean operators and prefix matching: By default, search output stays quiet about [Record Provenance](#record-provenance) when a record carries a known value, and visibly flags records whose provenance is unknown (legacy rows that predate the provenance column). Pass `--show-provenance` to display the provenance of every result. +Records marked as duplicates by [`recall dedup`](#dedup) are hidden from every search path (keyword, semantic, hybrid) by default — the records and their lineage remain in the database. Pass `--include-duplicates` to show them. + --- ## Capture @@ -269,7 +272,7 @@ Formats: - **json / markdown** — app-level export of the durable memory tables (`sessions`, `messages`, `decisions`, `learnings`, `breadcrumbs`, - `loa_entries`). Every row of a provenance-bearing table carries an explicit + `loa_entries`, `dedup_lineage`). Every row of a provenance-bearing table carries an explicit `provenance` field; legacy `NULL` provenance is exported as the literal `unknown` — never omitted, never guessed (see Record Provenance above). Embeddings are excluded. @@ -300,6 +303,46 @@ included. overwrites an existing file (a `-N` suffix is added on collision), and prints the output path. +## Dedup + +Detect and mark duplicate memory records without erasing evidence or lineage. + +```bash +recall dedup # Dry-run report (default — writes nothing) +recall dedup --execute # Mark duplicates (non-destructive) +recall dedup --execute --delete # Destructive opt-in: hard-delete duplicates +recall dedup -t breadcrumbs # Scope to one table +recall dedup -p myproject # Scope to one project +recall dedup --threshold 0.98 # Stricter semantic matching (default 0.95) +recall dedup --no-semantic # Exact/normalized text pass only +``` + +Safety model: + +- **Dry-run by default.** Mutations require `--execute`. +- **Non-destructive by default.** `--execute` writes lineage rows to the + `dedup_lineage` table (survivor, duplicate, reason, similarity, status, + timestamp); the duplicate records themselves stay intact and are merely + hidden from search. `--delete` is the destructive opt-in and requires + `--execute` — run `recall export --backup` first. +- **Within-table only.** Dedup never merges across tables (or across + projects). Cross-table duplicate candidates are report-only. +- **Survivor priority** is `user_authored > verbatim > extracted > derived > + unknown` ([Record Provenance](#record-provenance)); ties break by richness + (longer normalized text), importance, recency, then lowest id. +- **Detection** combines exact/normalized text matching with semantic + matching over stored embeddings (no embedding service call needed). The + semantic pass is skipped — and reported as skipped — when no embeddings + exist; records are never merged below the configured `--threshold` + (conservative default: 0.95 cosine similarity). Records with fewer than 20 + significant characters are never candidates. +- **Lifecycle-aware.** Only `active` decisions participate; superseded and + reverted decisions are managed by the decision lifecycle, not dedup. + +Marked duplicates are hidden from all search paths by default; see +`recall search --include-duplicates`. Lineage is included in +`recall export`, so the audit trail is portable. + ## Admin ```bash diff --git a/src/commands/dedup.ts b/src/commands/dedup.ts new file mode 100644 index 0000000..7c865bf --- /dev/null +++ b/src/commands/dedup.ts @@ -0,0 +1,145 @@ +// recall dedup command (issue #45). +// +// Dry-run by default. --execute marks duplicates (non-destructive: records +// stay intact, hidden from search via dedup_lineage). --delete is the +// destructive opt-in and requires --execute; take a `recall export --backup` +// first. Core logic lives in src/lib/dedup.ts. + +import { getDb } from '../db/connection.js'; +import { + applyDedupPlan, + DEDUP_TABLES, + DEFAULT_SEMANTIC_THRESHOLD, + planDedup, + type ApplyResult, + type DedupPlan, +} from '../lib/dedup.js'; +import type { ProvenanceTable } from '../types/index.js'; + +export interface DedupOptions { + execute?: boolean; + delete?: boolean; + table?: string; + project?: string; + threshold?: number; + semantic?: boolean; +} + +export interface DedupRunResult { + plan: DedupPlan; + applied: ApplyResult | null; +} + +export function runDedup(options: DedupOptions = {}): DedupRunResult | undefined { + const execute = options.execute ?? false; + const destructive = options.delete ?? false; + + if (destructive && !execute) { + console.error('--delete requires --execute. Dry-run never deletes.'); + process.exitCode = 1; + return undefined; + } + + const target = options.table ?? 'all'; + if (target !== 'all' && !(DEDUP_TABLES as readonly string[]).includes(target)) { + console.error( + `Invalid --table "${target}". Valid tables: ${DEDUP_TABLES.join(', ')}, all.` + ); + process.exitCode = 1; + return undefined; + } + + const threshold = options.threshold ?? DEFAULT_SEMANTIC_THRESHOLD; + if (!Number.isFinite(threshold) || threshold <= 0 || threshold > 1) { + console.error(`Invalid --threshold "${options.threshold}". Expected a number in (0, 1].`); + process.exitCode = 1; + return undefined; + } + + const db = getDb(); + const plan = planDedup(db, { + tables: target === 'all' ? undefined : [target as ProvenanceTable], + project: options.project, + threshold, + semantic: options.semantic, + }); + + const mode = !execute + ? '[DRY RUN — no changes written]' + : destructive + ? '[EXECUTE + DELETE — destructive: duplicates will be removed]' + : '[EXECUTE — marking duplicates, non-destructive]'; + console.log(`${mode}\n`); + + if (destructive) { + console.log("Recommended: run 'recall export --backup' before destructive dedup.\n"); + } + + const verb = execute ? (destructive ? 'delete' : 'mark') : 'would mark'; + let totalPlanned = 0; + for (const report of plan.tables) { + totalPlanned += report.planned.length; + const unchanged = report.scanned - report.planned.length; + const skipped: string[] = []; + if (report.alreadyMarked > 0) skipped.push(`${report.alreadyMarked} already marked`); + if (report.tooShort > 0) skipped.push(`${report.tooShort} too short`); + const skippedNote = skipped.length > 0 ? ` (${skipped.join(', ')})` : ''; + console.log( + `${report.table}: scanned ${report.scanned}, exact groups ${report.exactGroups}, ` + + `semantic pairs ${report.semanticPairs}, ${verb} ${report.planned.length}, ` + + `unchanged ${unchanged}${skippedNote}` + ); + for (const entry of report.planned.slice(0, 3)) { + const sim = entry.similarity !== null ? ` @ ${entry.similarity.toFixed(3)}` : ''; + console.log( + ` #${entry.duplicate_id} → survivor #${entry.survivor_id} [${entry.reason}${sim}]` + ); + } + if (report.planned.length > 3) { + console.log(` ...and ${report.planned.length - 3} more`); + } + } + console.log(''); + + if (plan.semanticSkipped) { + console.log(`Semantic pass: skipped — ${plan.semanticSkipped}`); + } else { + console.log(`Semantic pass: threshold ${plan.threshold}`); + } + + const crossText = plan.crossTable.textMatches; + console.log( + `Cross-table (report-only, never acted on): ${crossText.length} text match group(s), ` + + `${plan.crossTable.semanticPairs} semantic pair(s)` + ); + for (const match of crossText.slice(0, 5)) { + const members = match.members.map(m => `${m.table}#${m.id}`).join(' ↔ '); + const projectTag = match.project ? ` [${match.project}]` : ''; + console.log(` ${members}${projectTag}`); + } + if (crossText.length > 5) { + console.log(` ...and ${crossText.length - 5} more`); + } + console.log(''); + + if (!execute) { + if (totalPlanned > 0) { + console.log('Re-run with --execute to mark duplicates (non-destructive).'); + console.log("Marked duplicates are hidden from search; use 'recall search --include-duplicates' to see them."); + } else { + console.log('No duplicates found.'); + } + return { plan, applied: null }; + } + + const applied = applyDedupPlan(db, plan, { destructive }); + if (destructive) { + const fkNote = applied.fkProtected > 0 + ? ` (${applied.fkProtected} kept as marked — referenced by LoA lineage)` + : ''; + console.log(`Deleted ${applied.deleted} duplicate(s)${fkNote}.`); + } else { + console.log(`Marked ${applied.marked} duplicate(s). Records remain intact and recoverable.`); + } + return { plan, applied }; +} diff --git a/src/commands/embed.ts b/src/commands/embed.ts index c9ae4a0..66bd623 100644 --- a/src/commands/embed.ts +++ b/src/commands/embed.ts @@ -2,8 +2,17 @@ import { getDb } from '../db/connection.js'; import { embed, embeddingToBlob, blobToEmbedding, cosineSimilarity, checkEmbeddingService, reciprocalRankFusion, EMBEDDING_MODEL } from '../lib/embeddings.js'; +import { notMarkedDuplicateSql } from '../lib/dedup.js'; import { search as ftsSearch } from '../lib/memory.js'; +// Marked duplicates (recall dedup, issue #45) keep their embeddings but are +// hidden from the semantic search paths, matching the FTS5 default. +function embeddingsWhere(table?: string): string { + const conditions = [notMarkedDuplicateSql('source_table', 'source_id')]; + if (table) conditions.push(`source_table = '${table}'`); + return `WHERE ${conditions.join(' AND ')}`; +} + interface EmbedOptions { table?: 'loa' | 'decisions' | 'messages' | 'learnings'; limit?: number; @@ -164,11 +173,10 @@ export async function runSemanticSearch(query: string, options: { table?: string const queryEmbedding = queryResult.embedding; // Get all embeddings (for now, brute force - will optimize later) - const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : ''; const embeddings = db.prepare(` SELECT id, source_table, source_id, embedding FROM embeddings - ${tableFilter} + ${embeddingsWhere(options.table)} `).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>; if (embeddings.length === 0) { @@ -304,11 +312,10 @@ export async function runHybridSearch(query: string, options: { table?: string; const queryEmbedding = queryResult.embedding; // Get embeddings from database - const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : ''; const embeddings = db.prepare(` SELECT id, source_table, source_id, embedding FROM embeddings - ${tableFilter} + ${embeddingsWhere(options.table)} `).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>; // Calculate similarities diff --git a/src/commands/search.ts b/src/commands/search.ts index 2c13579..9c782a7 100644 --- a/src/commands/search.ts +++ b/src/commands/search.ts @@ -8,6 +8,7 @@ interface SearchOptions { biasType?: string; limit?: number; showProvenance?: boolean; + includeDuplicates?: boolean; } export function runSearch(query: string, options: SearchOptions): void { @@ -23,7 +24,8 @@ export function runSearch(query: string, options: SearchOptions): void { project: options.project, table: options.table, biasType: options.biasType as SearchTable | undefined, - limit: options.limit || 20 + limit: options.limit || 20, + includeDuplicates: options.includeDuplicates }); if (results.length === 0) { diff --git a/src/db/migrations.ts b/src/db/migrations.ts index 415e1a4..e5c4cbe 100644 --- a/src/db/migrations.ts +++ b/src/db/migrations.ts @@ -198,6 +198,12 @@ export const MIGRATIONS: Migration[] = [ } } }, + + // Migration 9 → 10: Dedup lineage table (issue #45). + // No-op — dedup_lineage and its indexes are brand new, handled by the + // CREATE TABLE IF NOT EXISTS DDL that runs before migrations (same + // precedent as migration 3 → 4 for the extraction tables). + (_db) => {}, ]; // --------------------------------------------------------------------------- diff --git a/src/db/schema.ts b/src/db/schema.ts index 7f60912..269e0b9 100644 --- a/src/db/schema.ts +++ b/src/db/schema.ts @@ -181,6 +181,24 @@ CREATE TABLE IF NOT EXISTS procedures ( times_observed INTEGER DEFAULT 2, confidence TEXT DEFAULT 'medium' CHECK (confidence IN ('high', 'medium', 'low')) ); + +-- Dedup lineage (issue #45): persistent audit trail of duplicate marking. +-- Non-destructive by default — a 'marked' row hides the duplicate from search +-- while the underlying record stays intact. 'deleted' records a destructive +-- opt-in removal. 'reverted' is reserved vocabulary for a future unmark path +-- (CHECK constraints cannot be widened later without a table rebuild). +CREATE TABLE IF NOT EXISTS dedup_lineage ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + created_at DATETIME DEFAULT CURRENT_TIMESTAMP, + survivor_table TEXT NOT NULL, + survivor_id INTEGER NOT NULL, + duplicate_table TEXT NOT NULL, + duplicate_id INTEGER NOT NULL, + reason TEXT NOT NULL CHECK (reason IN ('exact', 'semantic')), + similarity REAL, + status TEXT NOT NULL DEFAULT 'marked' CHECK (status IN ('marked', 'deleted', 'reverted')), + detail TEXT +); `; export const CREATE_INDEXES = ` @@ -231,6 +249,14 @@ CREATE INDEX IF NOT EXISTS idx_documents_created ON documents(created_at); -- Extraction session indexes CREATE INDEX IF NOT EXISTS idx_extraction_sessions_ts ON extraction_sessions(timestamp DESC); + +-- Dedup lineage indexes: the partial unique index guarantees a record can be +-- an actively marked duplicate at most once (idempotence); the survivor index +-- supports lineage audits. +CREATE UNIQUE INDEX IF NOT EXISTS idx_dedup_lineage_duplicate + ON dedup_lineage(duplicate_table, duplicate_id) WHERE status = 'marked'; +CREATE INDEX IF NOT EXISTS idx_dedup_lineage_survivor + ON dedup_lineage(survivor_table, survivor_id); `; export const CREATE_FTS = ` diff --git a/src/index.ts b/src/index.ts index 4040e17..a2e2aaf 100644 --- a/src/index.ts +++ b/src/index.ts @@ -30,6 +30,8 @@ import { runOnboard } from './commands/onboard.js'; import { runMigrate } from './commands/migrate.js'; import { runPath } from './commands/path.js'; import { runExport } from './commands/export.js'; +import { runDedup } from './commands/dedup.js'; +import { DEFAULT_SEMANTIC_THRESHOLD } from './lib/dedup.js'; import { closeDb } from './db/connection.js'; const program = new Command(); @@ -180,13 +182,15 @@ program .option('--bias-type ', 'Softly boost one table without filtering others (messages, loa, decisions, learnings, breadcrumbs)') .option('-l, --limit ', 'Max results', '20') .option('--show-provenance', 'Show provenance for every result (default: only unknown provenance is flagged)') + .option('--include-duplicates', 'Include records marked as duplicates by recall dedup (hidden by default)') .action((query, options) => { runSearch(query, { project: options.project, table: options.table, biasType: options.biasType, limit: parseInt(options.limit, 10), - showProvenance: options.showProvenance + showProvenance: options.showProvenance, + includeDuplicates: options.includeDuplicates }); closeDb(); }); @@ -658,6 +662,30 @@ program closeDb(); }); +// recall dedup — non-destructive duplicate detection (issue #45) +// Dry-run by default; --execute marks duplicates in dedup_lineage (records +// stay intact, hidden from search); --delete is the destructive opt-in. +program + .command('dedup') + .description('Detect and mark duplicate memory records (dry-run by default; non-destructive)') + .option('--execute', 'Apply the plan: mark duplicates (default is dry-run)') + .option('--delete', "Destructive opt-in: hard-delete duplicates instead of marking (requires --execute; run 'recall export --backup' first)") + .option('-t, --table
', 'Target table: messages, decisions, learnings, breadcrumbs, loa_entries, all', 'all') + .option('-p, --project ', 'Scope to one project') + .option('--threshold ', `Semantic similarity threshold (0-1)`, String(DEFAULT_SEMANTIC_THRESHOLD)) + .option('--no-semantic', 'Skip the semantic (embeddings) pass') + .action((options) => { + runDedup({ + execute: options.execute, + delete: options.delete, + table: options.table, + project: options.project, + threshold: parseFloat(options.threshold), + semantic: options.semantic + }); + closeDb(); + }); + // Default command: recall → hybrid search (Phase 3: best of both worlds) program .arguments('[query]') @@ -668,7 +696,7 @@ program .option('-k, --keyword', 'Use keyword search only (FTS5)') .option('-v, --vector', 'Use vector search only (semantic)') .action(async (query, options) => { - if (query && !['init', 'add', 'search', 'recent', 'show', 'stats', 'import', 'import-conversations', 'loa', 'telos', 'docs', 'dump', 'embed', 'semantic', 'hybrid', 'doctor', 'importance', 'provenance', 'pin', 'unpin', 'decision', 'prune', 'cluster', 'import-legacy', 'benchmark', 'onboard', 'migrate', 'path', 'export'].includes(query)) { + if (query && !['init', 'add', 'search', 'recent', 'show', 'stats', 'import', 'import-conversations', 'loa', 'telos', 'docs', 'dump', 'embed', 'semantic', 'hybrid', 'doctor', 'importance', 'provenance', 'pin', 'unpin', 'decision', 'prune', 'cluster', 'import-legacy', 'benchmark', 'onboard', 'migrate', 'path', 'export', 'dedup'].includes(query)) { if (options.keyword) { // FTS5 only runSearch(query, { diff --git a/src/lib/dedup.ts b/src/lib/dedup.ts new file mode 100644 index 0000000..fbfce6e --- /dev/null +++ b/src/lib/dedup.ts @@ -0,0 +1,629 @@ +// recall dedup — core logic (issue #45). +// +// Non-destructive dedup with provenance-aware survivor selection. The decision +// logic (normalization, survivor ordering, grouping, semantic pairing) is kept +// as pure exported functions over plain data so it stays unit-testable +// (issue #44 phase 2 adds property tests over these functions). +// +// Safety model: +// - Dry-run by default; mutations require an explicit execute step. +// - Default action marks duplicates in dedup_lineage — records stay intact. +// Hard deletion is a separate destructive opt-in. +// - Dedup acts within a table (and within a project) only. Cross-table +// duplicate candidates are report-only. +// - Survivor priority: user_authored > verbatim > extracted > derived > +// unknown (PROVENANCE_VALUES in types/index.ts is the single source of +// truth). Ties break by richness (normalized length), importance, recency, +// then lowest id for determinism. +// - Semantic detection compares stored embeddings pairwise — no embedding +// service call is needed. Pairs are never chained transitively: every +// marked duplicate has a direct similarity >= threshold to its survivor. +// +// Bind-count note (see src/lib/chunk.ts): scans use keyset pagination +// (`WHERE id > ? LIMIT ?`, fixed binds). Destructive deletion builds +// input-scaled `IN (?,...)` lists and therefore goes through chunked(). + +import { createHash } from 'crypto'; +import { Database } from 'bun:sqlite'; +import { chunked, SQLITE_SAFE_CHUNK_SIZE } from './chunk.js'; +import { blobToEmbedding, cosineSimilarity } from './embeddings.js'; +import { + PROVENANCE_TABLES, + PROVENANCE_VALUES, + type Provenance, + type ProvenanceTable, +} from '../types/index.js'; + +/** Tables dedup operates on — the provenance-bearing memory tables. */ +export const DEDUP_TABLES = PROVENANCE_TABLES; + +/** + * Conservative default for semantic matching. At 0.95 cosine similarity on + * nomic-embed-text vectors, two records are near-verbatim restatements; + * anything below stays untouched (never auto-merge below the threshold). + */ +export const DEFAULT_SEMANTIC_THRESHOLD = 0.95; + +/** + * Records whose normalized text is shorter than this are never dedup + * candidates. Short acknowledgments ("ok", "thanks", "yes") repeat + * legitimately across sessions — mass-marking them would be noise. + */ +export const MIN_DEDUP_TEXT_LENGTH = 20; + +export type DedupReason = 'exact' | 'semantic'; +export type LineageStatus = 'marked' | 'deleted' | 'reverted'; + +export interface DedupCandidate { + table: ProvenanceTable; + id: number; + project: string | null; + provenance: Provenance | null; + importance: number; + created_at: string; + /** sha256 of the normalized dedup text — exact-match grouping key. */ + key: string; + /** Length of the normalized text — the richness tie-breaker. */ + textLength: number; +} + +export interface LineageEntry { + survivor_table: ProvenanceTable; + survivor_id: number; + duplicate_table: ProvenanceTable; + duplicate_id: number; + reason: DedupReason; + similarity: number | null; + /** JSON audit detail (project, survivor/duplicate provenance). */ + detail: string; +} + +// --------------------------------------------------------------------------- +// Pure functions — text normalization and survivor selection +// --------------------------------------------------------------------------- + +/** + * Canonical text normalization for duplicate detection: lowercase, strip + * quote characters, collapse whitespace. Same convention the extraction + * hooks use for their log-level dedup. + */ +export function normalizeText(s: string): string { + return s.toLowerCase().replace(/['"]/g, '').replace(/\s+/g, ' ').trim(); +} + +/** + * Survivor priority rank — lower wins. Order comes from PROVENANCE_VALUES + * (user_authored > verbatim > extracted > derived); unknown (NULL) ranks + * last. Note PROVENANCE_VALUES is already declared in survivor order. + */ +export function provenanceRank(p: Provenance | null | undefined): number { + if (!p) return PROVENANCE_VALUES.length; + const idx = PROVENANCE_VALUES.indexOf(p); + return idx === -1 ? PROVENANCE_VALUES.length : idx; +} + +/** + * Comparator placing the survivor first: provenance rank, then richness + * (longer normalized text), importance, recency (newer created_at), and + * finally lowest id so ordering is total and deterministic. + */ +export function compareForSurvivor(a: DedupCandidate, b: DedupCandidate): number { + const rank = provenanceRank(a.provenance) - provenanceRank(b.provenance); + if (rank !== 0) return rank; + if (a.textLength !== b.textLength) return b.textLength - a.textLength; + const importance = (b.importance ?? 5) - (a.importance ?? 5); + if (importance !== 0) return importance; + // ISO and SQLite CURRENT_TIMESTAMP formats both sort lexicographically. + if (a.created_at !== b.created_at) return a.created_at > b.created_at ? -1 : 1; + return a.id - b.id; +} + +export function selectSurvivor(group: DedupCandidate[]): { + survivor: DedupCandidate; + duplicates: DedupCandidate[]; +} { + if (group.length === 0) throw new Error('selectSurvivor: empty group'); + const sorted = [...group].sort(compareForSurvivor); + return { survivor: sorted[0], duplicates: sorted.slice(1) }; +} + +/** + * Grouping key: same table, same project, same normalized text. Fields are + * joined with the NUL escape `\x00` — a separator that cannot occur in any + * field, so boundaries can never be forged by content. + */ +function groupKey(c: DedupCandidate): string { + return `${c.table}\x00${c.project ?? ''}\x00${c.key}`; +} + +/** + * Exact (normalized-text) duplicate groups within a table and project. + * Returns only groups with two or more members. + */ +export function findExactGroups(candidates: DedupCandidate[]): DedupCandidate[][] { + const byKey = new Map(); + for (const c of candidates) { + const k = groupKey(c); + const group = byKey.get(k); + if (group) group.push(c); + else byKey.set(k, [c]); + } + return [...byKey.values()].filter(g => g.length >= 2); +} + +export interface CrossTableMatch { + project: string | null; + members: Array<{ table: ProvenanceTable; id: number }>; +} + +/** + * Report-only: normalized-text matches spanning two or more tables within + * the same project. Dedup never acts across tables. + */ +export function findCrossTableMatches(candidates: DedupCandidate[]): CrossTableMatch[] { + const byKey = new Map(); + for (const c of candidates) { + const k = `${c.project ?? ''}\x00${c.key}`; + const group = byKey.get(k); + if (group) group.push(c); + else byKey.set(k, [c]); + } + const matches: CrossTableMatch[] = []; + for (const group of byKey.values()) { + const tables = new Set(group.map(c => c.table)); + if (tables.size < 2) continue; + matches.push({ + project: group[0].project, + members: group.map(c => ({ table: c.table, id: c.id })), + }); + } + return matches; +} + +export interface EmbeddedCandidate { + candidate: DedupCandidate; + embedding: number[]; +} + +export interface SemanticPair { + a: DedupCandidate; + b: DedupCandidate; + similarity: number; + sameTable: boolean; +} + +/** + * Pairwise cosine similarity over stored embeddings. Returns every pair at + * or above the threshold, split into actionable (same table + project) and + * cross-table report-only pairs. Pairs within one table but across projects + * are ignored — dedup scope is per-project, matching the exact pass. + * + * O(n^2) over embedded records — acceptable at personal-memory scale and the + * same brute-force shape the cluster command and hybrid search already use. + */ +export function findSemanticPairs( + embedded: EmbeddedCandidate[], + threshold: number +): SemanticPair[] { + const pairs: SemanticPair[] = []; + for (let i = 0; i < embedded.length; i++) { + for (let j = i + 1; j < embedded.length; j++) { + const x = embedded[i]; + const y = embedded[j]; + const sameTable = x.candidate.table === y.candidate.table; + if (sameTable && x.candidate.project !== y.candidate.project) continue; + // Exact duplicates are the exact pass's job; identical keys here would + // double-report the same records under a second reason. + if (x.candidate.key === y.candidate.key) continue; + const similarity = cosineSimilarity(x.embedding, y.embedding); + if (similarity >= threshold) { + pairs.push({ a: x.candidate, b: y.candidate, similarity, sameTable }); + } + } + } + // Deterministic processing order: strongest pairs first. + pairs.sort((p, q) => + q.similarity - p.similarity || + p.a.table.localeCompare(q.a.table) || + p.a.id - q.a.id || + p.b.id - q.b.id + ); + return pairs; +} + +// --------------------------------------------------------------------------- +// SQL fragments shared with the search paths +// --------------------------------------------------------------------------- + +/** + * SQL fragment excluding records marked as duplicates. `tableExpr` and + * `idExpr` are SQL expressions (a quoted literal like `'messages'` or a + * column reference like `e.source_table`) — never user input. + */ +export function notMarkedDuplicateSql(tableExpr: string, idExpr: string): string { + return `NOT EXISTS ( + SELECT 1 FROM dedup_lineage dl + WHERE dl.duplicate_table = ${tableExpr} + AND dl.duplicate_id = ${idExpr} + AND dl.status = 'marked' + )`; +} + +// --------------------------------------------------------------------------- +// Database plumbing — scans, lineage, deletion +// --------------------------------------------------------------------------- + +interface TableScanConfig { + textColumns: string[]; + createdAtColumn: string; + extraWhere?: string; +} + +// Which columns constitute "the text" of a record for duplicate detection. +// Decisions are scanned at status='active' only: supersede/revert already +// manage lifecycle copies, and a superseded row must never win survivorship +// over (and thereby hide) an active decision. +const TABLE_SCAN_CONFIG: Record = { + messages: { textColumns: ['content'], createdAtColumn: 'timestamp' }, + decisions: { textColumns: ['decision'], createdAtColumn: 'created_at', extraWhere: "status = 'active'" }, + learnings: { textColumns: ['problem', 'solution'], createdAtColumn: 'created_at' }, + breadcrumbs: { textColumns: ['content'], createdAtColumn: 'created_at' }, + loa_entries: { textColumns: ['title', 'fabric_extract'], createdAtColumn: 'created_at' }, +}; + +export interface ScanResult { + candidates: DedupCandidate[]; + scanned: number; + tooShort: number; +} + +/** + * Read one table's dedup candidates in bounded batches via keyset pagination + * (fixed binds per statement regardless of table size). Rows below + * MIN_DEDUP_TEXT_LENGTH are counted but excluded. + */ +export function scanCandidates( + db: Database, + table: ProvenanceTable, + project?: string, + batchSize: number = SQLITE_SAFE_CHUNK_SIZE +): ScanResult { + const config = TABLE_SCAN_CONFIG[table]; + const where = [ + 'id > ?', + ...(config.extraWhere ? [config.extraWhere] : []), + ...(project !== undefined ? ['project = ?'] : []), + ].join(' AND '); + const stmt = db.prepare(` + SELECT id, project, provenance, importance, + ${config.createdAtColumn} AS created_at, + ${config.textColumns.join(', ')} + FROM ${table} + WHERE ${where} + ORDER BY id + LIMIT ? + `); + + const candidates: DedupCandidate[] = []; + let scanned = 0; + let tooShort = 0; + let lastId = 0; + for (;;) { + const params = project !== undefined ? [lastId, project, batchSize] : [lastId, batchSize]; + const batch = stmt.all(...params) as Array>; + if (batch.length === 0) break; + for (const row of batch) { + scanned++; + const text = config.textColumns + .map(c => row[c]) + .filter(v => v !== null && v !== undefined && String(v).length > 0) + .join('\n'); + const normalized = normalizeText(text); + if (normalized.length < MIN_DEDUP_TEXT_LENGTH) { + tooShort++; + continue; + } + candidates.push({ + table, + id: row.id as number, + project: (row.project as string | null) ?? null, + provenance: (row.provenance as Provenance | null) ?? null, + importance: (row.importance as number | null) ?? 5, + created_at: String(row.created_at), + key: createHash('sha256').update(normalized).digest('hex'), + textLength: normalized.length, + }); + } + lastId = batch[batch.length - 1].id as number; + } + return { candidates, scanned, tooShort }; +} + +/** (table, id) pairs currently marked as duplicates — excluded from scans. */ +export function loadMarkedDuplicates(db: Database): Set { + const rows = db.prepare( + `SELECT duplicate_table, duplicate_id FROM dedup_lineage WHERE status = 'marked'` + ).all() as Array<{ duplicate_table: string; duplicate_id: number }>; + return new Set(rows.map(r => `${r.duplicate_table}:${r.duplicate_id}`)); +} + +/** Stored embeddings for one table, keyed by source row id. */ +export function loadEmbeddings(db: Database, table: ProvenanceTable): Map { + const rows = db.prepare( + `SELECT source_id, embedding FROM embeddings WHERE source_table = ?` + ).all(table) as Array<{ source_id: number; embedding: Buffer }>; + const map = new Map(); + for (const row of rows) { + map.set(row.source_id, blobToEmbedding(row.embedding)); + } + return map; +} + +/** + * Ids in `table` that other rows reference via foreign keys (loa_entries + * message ranges, loa parent links). With foreign_keys=ON these cannot be + * hard-deleted; destructive mode downgrades them to 'marked' instead of + * failing the whole transaction. + */ +export function fkProtectedIds(db: Database, table: ProvenanceTable): Set { + const ids = new Set(); + if (table === 'messages') { + const rows = db.prepare( + `SELECT message_range_start AS s, message_range_end AS e FROM loa_entries + WHERE message_range_start IS NOT NULL OR message_range_end IS NOT NULL` + ).all() as Array<{ s: number | null; e: number | null }>; + for (const row of rows) { + if (row.s !== null) ids.add(row.s); + if (row.e !== null) ids.add(row.e); + } + } else if (table === 'loa_entries') { + const rows = db.prepare( + `SELECT DISTINCT parent_loa_id AS p FROM loa_entries WHERE parent_loa_id IS NOT NULL` + ).all() as Array<{ p: number }>; + for (const row of rows) ids.add(row.p); + } + return ids; +} + +// --------------------------------------------------------------------------- +// Plan and apply +// --------------------------------------------------------------------------- + +export interface DedupTableReport { + table: ProvenanceTable; + scanned: number; + tooShort: number; + alreadyMarked: number; + exactGroups: number; + semanticPairs: number; + planned: LineageEntry[]; +} + +export interface DedupPlan { + tables: DedupTableReport[]; + threshold: number; + /** Reason the semantic pass did not run, or null if it ran. */ + semanticSkipped: string | null; + crossTable: { + textMatches: CrossTableMatch[]; + semanticPairs: number; + }; +} + +export interface PlanDedupOptions { + tables?: ProvenanceTable[]; + project?: string; + threshold?: number; + semantic?: boolean; +} + +function lineageDetail(survivor: DedupCandidate, duplicate: DedupCandidate): string { + return JSON.stringify({ + project: duplicate.project, + survivor_provenance: survivor.provenance ?? 'unknown', + duplicate_provenance: duplicate.provenance ?? 'unknown', + }); +} + +function toLineage( + survivor: DedupCandidate, + duplicate: DedupCandidate, + reason: DedupReason, + similarity: number | null +): LineageEntry { + return { + survivor_table: survivor.table, + survivor_id: survivor.id, + duplicate_table: duplicate.table, + duplicate_id: duplicate.id, + reason, + similarity, + detail: lineageDetail(survivor, duplicate), + }; +} + +/** + * Compute the full dedup plan without writing anything. The same plan is + * what a dry run reports and what an execute run applies. + */ +export function planDedup(db: Database, options: PlanDedupOptions = {}): DedupPlan { + const tables = options.tables ?? [...DEDUP_TABLES]; + const threshold = options.threshold ?? DEFAULT_SEMANTIC_THRESHOLD; + const semantic = options.semantic ?? true; + const marked = loadMarkedDuplicates(db); + + const reports = new Map(); + const eligible = new Map(); + + // Exact pass — within table + project. + const plannedDuplicateKeys = new Set(); + const plannedSurvivorKeys = new Set(); + for (const table of tables) { + const scan = scanCandidates(db, table, options.project); + const fresh = scan.candidates.filter(c => !marked.has(`${c.table}:${c.id}`)); + const report: DedupTableReport = { + table, + scanned: scan.scanned, + tooShort: scan.tooShort, + alreadyMarked: scan.candidates.length - fresh.length, + exactGroups: 0, + semanticPairs: 0, + planned: [], + }; + for (const group of findExactGroups(fresh)) { + report.exactGroups++; + const { survivor, duplicates } = selectSurvivor(group); + plannedSurvivorKeys.add(`${survivor.table}:${survivor.id}`); + for (const dup of duplicates) { + report.planned.push(toLineage(survivor, dup, 'exact', 1.0)); + plannedDuplicateKeys.add(`${dup.table}:${dup.id}`); + } + } + reports.set(table, report); + eligible.set(table, fresh); + } + + // Records still standing after the exact pass (survivors + unmatched). + const remaining = [...eligible.values()].flat() + .filter(c => !plannedDuplicateKeys.has(`${c.table}:${c.id}`)); + + // Cross-table normalized-text matches — report-only. + const crossTableText = findCrossTableMatches(remaining); + + // Semantic pass — stored embeddings only; no embedding service required. + let semanticSkipped: string | null = null; + let crossTableSemantic = 0; + if (!semantic) { + semanticSkipped = 'disabled (--no-semantic)'; + } else { + const embedded: EmbeddedCandidate[] = []; + for (const table of tables) { + const embeddings = loadEmbeddings(db, table); + if (embeddings.size === 0) continue; + for (const candidate of remaining) { + if (candidate.table !== table) continue; + const embedding = embeddings.get(candidate.id); + if (embedding) embedded.push({ candidate, embedding }); + } + } + if (embedded.length === 0) { + semanticSkipped = + 'no stored embeddings for the scanned records (run `recall embed backfill` to enable semantic detection)'; + } else { + const pairs = findSemanticPairs(embedded, threshold); + for (const pair of pairs) { + if (!pair.sameTable) { + crossTableSemantic++; + continue; + } + const aKey = `${pair.a.table}:${pair.a.id}`; + const bKey = `${pair.b.table}:${pair.b.id}`; + // Greedy, strongest-first: a record planned as a duplicate can + // neither survive nor be re-marked, and a record planned as a + // survivor (in either pass) can never be re-marked by a weaker + // pair — lineage stays one hop deep, no transitive chaining. + if (plannedDuplicateKeys.has(aKey) || plannedDuplicateKeys.has(bKey)) continue; + if (plannedSurvivorKeys.has(aKey) || plannedSurvivorKeys.has(bKey)) continue; + const { survivor, duplicates } = selectSurvivor([pair.a, pair.b]); + const report = reports.get(pair.a.table)!; + report.semanticPairs++; + report.planned.push(toLineage(survivor, duplicates[0], 'semantic', pair.similarity)); + plannedDuplicateKeys.add(`${duplicates[0].table}:${duplicates[0].id}`); + plannedSurvivorKeys.add(`${survivor.table}:${survivor.id}`); + } + } + } + + return { + tables: tables.map(t => reports.get(t)!), + threshold, + semanticSkipped, + crossTable: { textMatches: crossTableText, semanticPairs: crossTableSemantic }, + }; +} + +export interface ApplyResult { + marked: number; + deleted: number; + /** Duplicates kept as 'marked' in destructive mode because of FK references. */ + fkProtected: number; +} + +/** + * Apply a plan: insert lineage rows ('marked' by default). With + * `destructive`, duplicate rows are hard-deleted (with their embeddings) and + * lineage status is 'deleted' — except FK-referenced rows, which stay marked. + * Runs in one transaction; failures roll back everything. + */ +export function applyDedupPlan( + db: Database, + plan: DedupPlan, + options: { destructive?: boolean } = {} +): ApplyResult { + const destructive = options.destructive ?? false; + const entries = plan.tables.flatMap(t => t.planned); + const result: ApplyResult = { marked: 0, deleted: 0, fkProtected: 0 }; + if (entries.length === 0) return result; + + const insert = db.prepare(` + INSERT INTO dedup_lineage + (survivor_table, survivor_id, duplicate_table, duplicate_id, reason, similarity, status, detail) + VALUES (?, ?, ?, ?, ?, ?, ?, ?) + `); + + // FK lookups are per-table constants within this transaction; cache them so + // a large plan doesn't re-query loa_entries per duplicate row. + const fkCache = new Map>(); + const fkProtected = (table: ProvenanceTable): Set => { + let cached = fkCache.get(table); + if (!cached) { + cached = fkProtectedIds(db, table); + fkCache.set(table, cached); + } + return cached; + }; + + const apply = db.transaction(() => { + const toDelete = new Map(); + for (const entry of entries) { + let status: LineageStatus = 'marked'; + if (destructive) { + const protectedIds = fkProtected(entry.duplicate_table); + if (protectedIds.has(entry.duplicate_id)) { + result.fkProtected++; + } else { + status = 'deleted'; + const list = toDelete.get(entry.duplicate_table); + if (list) list.push(entry.duplicate_id); + else toDelete.set(entry.duplicate_table, [entry.duplicate_id]); + } + } + insert.run( + entry.survivor_table, + entry.survivor_id, + entry.duplicate_table, + entry.duplicate_id, + entry.reason, + entry.similarity, + status, + entry.detail + ); + if (status === 'marked') result.marked++; + else result.deleted++; + } + + // Input-scaled IN lists — chunked() per the src/lib/chunk.ts audit note. + for (const [table, ids] of toDelete) { + for (const chunk of chunked(ids)) { + const placeholders = chunk.map(() => '?').join(', '); + db.prepare(`DELETE FROM ${table} WHERE id IN (${placeholders})`).run(...chunk); + db.prepare( + `DELETE FROM embeddings WHERE source_table = ? AND source_id IN (${placeholders})` + ).run(table, ...chunk); + } + } + }); + apply(); + + return result; +} diff --git a/src/lib/embeddings.ts b/src/lib/embeddings.ts index 8dc22f4..a46632b 100644 --- a/src/lib/embeddings.ts +++ b/src/lib/embeddings.ts @@ -84,12 +84,17 @@ export function embeddingToBlob(embedding: number[]): Buffer { } /** - * Convert SQLite BLOB back to embedding array + * Convert SQLite BLOB back to embedding array. + * bun:sqlite returns BLOB columns as Uint8Array (not Buffer) — wrap without + * copying so readFloatLE is available either way. */ -export function blobToEmbedding(blob: Buffer): number[] { +export function blobToEmbedding(blob: Buffer | Uint8Array): number[] { + const buf = Buffer.isBuffer(blob) + ? blob + : Buffer.from(blob.buffer, blob.byteOffset, blob.byteLength); const embedding: number[] = []; - for (let i = 0; i < blob.length; i += 4) { - embedding.push(blob.readFloatLE(i)); + for (let i = 0; i < buf.length; i += 4) { + embedding.push(buf.readFloatLE(i)); } return embedding; } diff --git a/src/lib/export.ts b/src/lib/export.ts index 2a072bb..a97573d 100644 --- a/src/lib/export.ts +++ b/src/lib/export.ts @@ -25,7 +25,11 @@ import { SQLITE_SAFE_CHUNK_SIZE } from './chunk.js'; import { getMigrationVersion } from '../db/migrations.js'; import { VERSION } from '../version.js'; -/** Durable memory tables included in app-level (JSON/Markdown/SQL) exports. */ +/** + * Durable tables included in app-level (JSON/Markdown/SQL) exports: the + * memory tables plus dedup_lineage (issue #45), so duplicate lineage stays + * portable and auditable alongside the records it describes. + */ export const EXPORT_TABLES = [ 'sessions', 'messages', @@ -33,6 +37,7 @@ export const EXPORT_TABLES = [ 'learnings', 'breadcrumbs', 'loa_entries', + 'dedup_lineage', ] as const; export type ExportTable = typeof EXPORT_TABLES[number]; diff --git a/src/lib/memory.ts b/src/lib/memory.ts index eb89ce3..4ad246c 100644 --- a/src/lib/memory.ts +++ b/src/lib/memory.ts @@ -2,6 +2,7 @@ import { getDb, getDbPath } from '../db/connection.js'; import { existsSync, statSync } from 'fs'; +import { notMarkedDuplicateSql } from './dedup.js'; import type { Session, Message, Decision, Learning, Breadcrumb, LoaEntry, Stats, SearchResult, Provenance } from '../types/index.js'; // ============ Sessions ============ @@ -271,6 +272,15 @@ export interface MemorySearchOptions { table?: string; limit?: number; biasType?: SearchTable; + /** Include records marked as duplicates by `recall dedup` (hidden by default). */ + includeDuplicates?: boolean; +} + +// Marked duplicates (issue #45) are hidden from search by default — the +// lineage row in dedup_lineage keeps them recoverable and auditable. +function duplicateFilter(options: MemorySearchOptions | undefined, physicalTable: string, idExpr: string): string { + if (options?.includeDuplicates) return ''; + return `AND ${notMarkedDuplicateSql(`'${physicalTable}'`, idExpr)}`; } const TYPE_BIAS_RANK_MULTIPLIER = 4; @@ -313,6 +323,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu FROM messages_fts f JOIN messages m ON m.id = f.rowid WHERE messages_fts MATCH ? + ${duplicateFilter(options, 'messages', 'm.id')} ${options?.project ? 'AND m.project = ?' : ''} ORDER BY f.rank LIMIT ? @@ -325,6 +336,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu JOIN decisions d ON d.id = f.rowid WHERE decisions_fts MATCH ? AND d.status = 'active' + ${duplicateFilter(options, 'decisions', 'd.id')} ${options?.project ? 'AND d.project = ?' : ''} ORDER BY CASE d.confidence WHEN 'high' THEN 0 WHEN 'medium' THEN 1 WHEN 'low' THEN 2 ELSE 1 END, f.rank LIMIT ? @@ -336,6 +348,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu FROM learnings_fts f JOIN learnings l ON l.id = f.rowid WHERE learnings_fts MATCH ? + ${duplicateFilter(options, 'learnings', 'l.id')} ${options?.project ? 'AND l.project = ?' : ''} ORDER BY f.rank LIMIT ? @@ -347,6 +360,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu FROM breadcrumbs_fts f JOIN breadcrumbs b ON b.id = f.rowid WHERE breadcrumbs_fts MATCH ? + ${duplicateFilter(options, 'breadcrumbs', 'b.id')} ${options?.project ? 'AND b.project = ?' : ''} ORDER BY f.rank LIMIT ? @@ -358,6 +372,7 @@ export function search(query: string, options?: MemorySearchOptions): SearchResu FROM loa_fts f JOIN loa_entries l ON l.id = f.rowid WHERE loa_fts MATCH ? + ${duplicateFilter(options, 'loa_entries', 'l.id')} ${options?.project ? 'AND l.project = ?' : ''} ORDER BY f.rank LIMIT ? diff --git a/src/mcp-server.ts b/src/mcp-server.ts index 958d001..242085f 100644 --- a/src/mcp-server.ts +++ b/src/mcp-server.ts @@ -66,6 +66,7 @@ import { reciprocalRankFusion, checkEmbeddingService, } from "./lib/embeddings.js"; +import { notMarkedDuplicateSql } from "./lib/dedup.js"; import type { Provenance } from "./types/index.js"; import { existsSync } from "fs"; @@ -118,9 +119,12 @@ async function hybridSearch( const queryResult = await embed(query); const queryEmbedding = queryResult.embedding; + // Marked duplicates (recall dedup, issue #45) keep their embeddings + // but are hidden from the vector path, matching the FTS5 default. const embeddings = db .prepare(` SELECT source_table, source_id, embedding FROM embeddings + WHERE ${notMarkedDuplicateSql("source_table", "source_id")} `) .all() as Array<{ source_table: string; diff --git a/tests/commands/dedup.test.ts b/tests/commands/dedup.test.ts new file mode 100644 index 0000000..5c309b0 --- /dev/null +++ b/tests/commands/dedup.test.ts @@ -0,0 +1,345 @@ +// recall dedup — issue #45 acceptance criteria. +// +// Behavior under test: +// - dry-run by default: reports the plan, writes nothing +// - --execute marks duplicates in dedup_lineage; records stay intact +// - exact/normalized duplicate detection within table + project +// - survivor priority user_authored > verbatim > extracted > derived > unknown +// - semantic pass uses stored embeddings only; skipped (and reported) when +// none exist; threshold respected, below-threshold never merged +// - idempotence: a second execute finds nothing new +// - cross-table candidates are report-only +// - search hides marked duplicates by default; --include-duplicates shows them +// - lineage rows persist full audit detail +// - --delete (destructive opt-in) removes rows + embeddings; FK-referenced +// duplicates are kept as marked instead of failing the transaction + +import { describe, test, expect, beforeEach, afterEach } from 'bun:test'; +import { setupTestDb, teardownTestDb } from '../helpers/setup'; +import { runDedup } from '../../src/commands/dedup'; +import { embeddingToBlob } from '../../src/lib/embeddings'; +import { getDb } from '../../src/db/connection'; +import { search } from '../../src/lib/memory'; +import { + createSession, + addMessage, + addDecision, + addBreadcrumb, + createLoaEntry, + supersedeDecision, +} from '../../src/lib/memory'; + +const originalLog = console.log; +const originalError = console.error; +const originalExitCode = process.exitCode; + +beforeEach(() => { + setupTestDb(); + console.log = () => {}; + console.error = () => {}; +}); + +afterEach(() => { + console.log = originalLog; + console.error = originalError; + process.exitCode = originalExitCode; + teardownTestDb(); +}); + +// Long enough to clear MIN_DEDUP_TEXT_LENGTH after normalization. +const CRUMB = 'Always use bun for every script in this repository, never npm.'; +const OTHER = 'A completely different breadcrumb about the release process.'; + +function lineageRows(): Array> { + return getDb().prepare('SELECT * FROM dedup_lineage ORDER BY id').all() as Array>; +} + +function insertEmbedding(table: string, id: number, vector: number[]): void { + getDb().prepare( + `INSERT OR REPLACE INTO embeddings (source_table, source_id, model, dimensions, embedding) + VALUES (?, ?, 'test', ?, ?)` + ).run(table, id, vector.length, embeddingToBlob(vector)); +} + +describe('dry-run vs execute', () => { + test('dry-run reports the plan and writes nothing', () => { + addBreadcrumb({ content: CRUMB, importance: 5 }); + addBreadcrumb({ content: CRUMB, importance: 5 }); + addBreadcrumb({ content: OTHER, importance: 5 }); + + const result = runDedup({})!; + const crumbs = result.plan.tables.find(t => t.table === 'breadcrumbs')!; + expect(crumbs.exactGroups).toBe(1); + expect(crumbs.planned.length).toBe(1); + expect(result.applied).toBeNull(); + expect(lineageRows().length).toBe(0); + expect(search('breadcrumb OR bun').length).toBe(3); + }); + + test('--execute marks duplicates non-destructively and search hides them', () => { + const id1 = addBreadcrumb({ content: CRUMB, importance: 5 }); + const id2 = addBreadcrumb({ content: CRUMB, importance: 5 }); + + const result = runDedup({ execute: true })!; + expect(result.applied?.marked).toBe(1); + + // Non-destructive: both records still exist. + const count = getDb().prepare('SELECT COUNT(*) AS c FROM breadcrumbs').get() as { c: number }; + expect(count.c).toBe(2); + + // Search hides the marked duplicate by default... + const hidden = search('bun'); + expect(hidden.length).toBe(1); + // ...and shows it again with the explicit include option. + const shown = search('bun', { includeDuplicates: true }); + expect(shown.length).toBe(2); + expect(shown.map(r => r.id).sort()).toEqual([id1, id2].sort()); + }); + + test('--delete without --execute is rejected', () => { + expect(runDedup({ delete: true })).toBeUndefined(); + expect(process.exitCode).toBe(1); + }); +}); + +describe('exact detection and survivor selection', () => { + test('normalized variants dedupe; provenance picks the survivor', () => { + const extracted = addBreadcrumb({ content: CRUMB, importance: 9, provenance: 'extracted' }); + const authored = addBreadcrumb({ content: CRUMB.toUpperCase(), importance: 1, provenance: 'user_authored' }); + const legacy = addBreadcrumb({ content: ` ${CRUMB} `, importance: 9 }); + + runDedup({ execute: true }); + + const rows = lineageRows(); + expect(rows.length).toBe(2); + for (const row of rows) { + expect(row.survivor_id).toBe(authored); + expect(row.survivor_table).toBe('breadcrumbs'); + expect(row.reason).toBe('exact'); + expect(row.status).toBe('marked'); + } + expect(rows.map(r => r.duplicate_id).sort()).toEqual([extracted, legacy].sort()); + }); + + test('identical text in different projects is not deduped', () => { + addBreadcrumb({ content: CRUMB, importance: 5, project: 'alpha' }); + addBreadcrumb({ content: CRUMB, importance: 5, project: 'beta' }); + const result = runDedup({ execute: true })!; + expect(result.applied?.marked).toBe(0); + }); + + test('short records are never candidates', () => { + addBreadcrumb({ content: 'ok', importance: 5 }); + addBreadcrumb({ content: 'ok', importance: 5 }); + const result = runDedup({})!; + const crumbs = result.plan.tables.find(t => t.table === 'breadcrumbs')!; + expect(crumbs.tooShort).toBe(2); + expect(crumbs.planned.length).toBe(0); + }); + + test('only active decisions participate', () => { + const text = 'Adopt SQLite WAL mode for every database connection in Recall.'; + const stale = addDecision({ decision: text, status: 'active' }); + supersedeDecision(stale); + addDecision({ decision: text, status: 'active' }); + + const result = runDedup({ execute: true })!; + expect(result.applied?.marked).toBe(0); + }); + + test('--table scopes the run', () => { + addBreadcrumb({ content: CRUMB, importance: 5 }); + addBreadcrumb({ content: CRUMB, importance: 5 }); + createSession({ session_id: 's1', started_at: '2026-01-01T00:00:00Z' }); + addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:01Z', role: 'user', content: CRUMB }); + addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:02Z', role: 'user', content: CRUMB }); + + const result = runDedup({ execute: true, table: 'messages' })!; + expect(result.plan.tables.length).toBe(1); + expect(result.applied?.marked).toBe(1); + expect(lineageRows().every(r => r.duplicate_table === 'messages')).toBe(true); + }); + + test('invalid --table and --threshold are rejected', () => { + expect(runDedup({ table: 'sessions' })).toBeUndefined(); + expect(process.exitCode).toBe(1); + process.exitCode = 0; + expect(runDedup({ threshold: 1.5 })).toBeUndefined(); + expect(process.exitCode).toBe(1); + }); +}); + +describe('semantic pass', () => { + test('skipped with a clear reason when no embeddings exist', () => { + addBreadcrumb({ content: CRUMB, importance: 5 }); + addBreadcrumb({ content: OTHER, importance: 5 }); + const result = runDedup({})!; + expect(result.plan.semanticSkipped).toContain('embed'); + }); + + test('skipped when disabled via --no-semantic', () => { + addBreadcrumb({ content: CRUMB, importance: 5 }); + const result = runDedup({ semantic: false })!; + expect(result.plan.semanticSkipped).toContain('--no-semantic'); + }); + + test('marks pairs at the threshold, never below it', () => { + const near = Math.sqrt(1 - 0.97 * 0.97); + const a = addBreadcrumb({ content: CRUMB, importance: 5 }); + const b = addBreadcrumb({ content: OTHER, importance: 5 }); + insertEmbedding('breadcrumbs', a, [1, 0, 0]); + insertEmbedding('breadcrumbs', b, [0.97, near, 0]); + + // Above the default 0.95 threshold → marked, with similarity recorded. + const marked = runDedup({ execute: true })!; + expect(marked.applied?.marked).toBe(1); + const row = lineageRows()[0]; + expect(row.reason).toBe('semantic'); + expect(row.similarity as number).toBeCloseTo(0.97, 3); + + // Survivor is the richer record (longer normalized text wins the tie). + const longer = CRUMB.length >= OTHER.length ? a : b; + expect(row.survivor_id).toBe(longer); + }); + + test('a stricter threshold leaves the same pair untouched', () => { + const near = Math.sqrt(1 - 0.97 * 0.97); + const a = addBreadcrumb({ content: CRUMB, importance: 5 }); + const b = addBreadcrumb({ content: OTHER, importance: 5 }); + insertEmbedding('breadcrumbs', a, [1, 0, 0]); + insertEmbedding('breadcrumbs', b, [0.97, near, 0]); + + const result = runDedup({ execute: true, threshold: 0.99 })!; + expect(result.applied?.marked).toBe(0); + expect(lineageRows().length).toBe(0); + }); + + test('a planned survivor is never re-marked by a weaker pair (no transitive chaining)', () => { + // PR #60 review repro: cos(A,B)=0.99, cos(B,C)=0.96, cos(A,C)≈0.91 with + // the default 0.95 threshold. Ascending text length pins the survivor + // orientation: B survives the strongest pair, so without the survivor + // guard the weaker B/C pair re-marks B — leaving A's only visible + // neighbor at 0.91, below the threshold. + const a = addBreadcrumb({ content: 'Review queue triage happens before standup.', importance: 5 }); + const b = addBreadcrumb({ content: 'Review queue triage always happens before the standup.', importance: 5 }); + const c = addBreadcrumb({ content: 'Review queue triage must always happen before the morning standup.', importance: 5 }); + insertEmbedding('breadcrumbs', a, [0.99, Math.sqrt(1 - 0.99 * 0.99), 0]); + insertEmbedding('breadcrumbs', b, [1, 0, 0]); + insertEmbedding('breadcrumbs', c, [0.96, -Math.sqrt(1 - 0.96 * 0.96), 0]); + + const result = runDedup({ execute: true })!; + + // Only the strongest pair is marked; B/C is skipped because B already + // survives A. + expect(result.applied?.marked).toBe(1); + const rows = lineageRows(); + expect(rows.length).toBe(1); + expect(rows[0].duplicate_id).toBe(a); + expect(rows[0].survivor_id).toBe(b); + expect(rows[0].similarity as number).toBeCloseTo(0.99, 3); + + // One-hop lineage invariant: no survivor is itself marked as a duplicate. + const duplicates = new Set(rows.map(r => `${r.duplicate_table}:${r.duplicate_id}`)); + for (const row of rows) { + expect(duplicates.has(`${row.survivor_table}:${row.survivor_id}`)).toBe(false); + } + }); +}); + +describe('idempotence', () => { + test('a second execute finds nothing new', () => { + addBreadcrumb({ content: CRUMB, importance: 5 }); + addBreadcrumb({ content: CRUMB, importance: 5 }); + addBreadcrumb({ content: CRUMB, importance: 5 }); + + const first = runDedup({ execute: true })!; + expect(first.applied?.marked).toBe(2); + const after = lineageRows().length; + + const second = runDedup({ execute: true })!; + expect(second.applied?.marked).toBe(0); + const crumbs = second.plan.tables.find(t => t.table === 'breadcrumbs')!; + expect(crumbs.alreadyMarked).toBe(2); + expect(lineageRows().length).toBe(after); + }); +}); + +describe('cross-table candidates are report-only', () => { + test('reported, never marked', () => { + createSession({ session_id: 's1', started_at: '2026-01-01T00:00:00Z' }); + addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:01Z', role: 'user', content: CRUMB }); + addBreadcrumb({ content: CRUMB, importance: 5 }); + + const result = runDedup({ execute: true })!; + expect(result.plan.crossTable.textMatches.length).toBe(1); + expect(result.applied?.marked).toBe(0); + expect(lineageRows().length).toBe(0); + // Both records remain searchable. + expect(search('bun').length).toBe(2); + }); +}); + +describe('lineage persistence', () => { + test('rows carry full audit detail', () => { + const survivor = addBreadcrumb({ content: CRUMB, importance: 5, provenance: 'user_authored', project: 'demo' }); + const dup = addBreadcrumb({ content: CRUMB, importance: 5, provenance: 'extracted', project: 'demo' }); + + runDedup({ execute: true }); + + const row = lineageRows()[0]; + expect(row.survivor_table).toBe('breadcrumbs'); + expect(row.survivor_id).toBe(survivor); + expect(row.duplicate_table).toBe('breadcrumbs'); + expect(row.duplicate_id).toBe(dup); + expect(row.reason).toBe('exact'); + expect(row.similarity).toBe(1); + expect(row.status).toBe('marked'); + expect(typeof row.created_at).toBe('string'); + const detail = JSON.parse(row.detail as string); + expect(detail.survivor_provenance).toBe('user_authored'); + expect(detail.duplicate_provenance).toBe('extracted'); + expect(detail.project).toBe('demo'); + }); +}); + +describe('destructive opt-in (--execute --delete)', () => { + test('deletes duplicates and their embeddings; lineage records the deletion', () => { + const survivor = addBreadcrumb({ content: CRUMB, importance: 5, provenance: 'user_authored' }); + const dup = addBreadcrumb({ content: CRUMB, importance: 5 }); + insertEmbedding('breadcrumbs', dup, [1, 0, 0]); + + const result = runDedup({ execute: true, delete: true })!; + expect(result.applied?.deleted).toBe(1); + + const db = getDb(); + const remaining = db.prepare('SELECT id FROM breadcrumbs').all() as Array<{ id: number }>; + expect(remaining.map(r => r.id)).toEqual([survivor]); + const embeddings = db.prepare('SELECT COUNT(*) AS c FROM embeddings').get() as { c: number }; + expect(embeddings.c).toBe(0); + expect(lineageRows()[0].status).toBe('deleted'); + }); + + test('FK-referenced duplicates are kept as marked instead of failing', () => { + createSession({ session_id: 's1', started_at: '2026-01-01T00:00:00Z' }); + const survivor = addMessage({ session_id: 's1', timestamp: '2026-01-02T00:00:00Z', role: 'user', content: CRUMB }); + const referenced = addMessage({ session_id: 's1', timestamp: '2026-01-01T00:00:00Z', role: 'user', content: CRUMB }); + // The older message loses survivorship (recency tie-break) but is pinned + // by an LoA message range — hard-deleting it would violate the FK. + createLoaEntry({ + title: 'entry', fabric_extract: 'body', + message_range_start: referenced, message_range_end: referenced, + }); + + const result = runDedup({ execute: true, delete: true })!; + expect(result.applied?.deleted).toBe(0); + expect(result.applied?.fkProtected).toBe(1); + + const row = lineageRows()[0]; + expect(row.duplicate_id).toBe(referenced); + expect(row.survivor_id).toBe(survivor); + expect(row.status).toBe('marked'); + // The record still exists, just hidden. + const msg = getDb().prepare('SELECT id FROM messages WHERE id = ?').get(referenced); + expect(msg).toBeDefined(); + }); +}); diff --git a/tests/commands/export.test.ts b/tests/commands/export.test.ts index b7e0274..799c1f7 100644 --- a/tests/commands/export.test.ts +++ b/tests/commands/export.test.ts @@ -134,6 +134,7 @@ describe('manifest', () => { learnings: 1, breadcrumbs: 1, loa_entries: 1, + dedup_lineage: 0, }); expect(manifest.provenance_counts.messages).toEqual({ unknown: 1, verbatim: 1 }); expect(manifest.provenance_counts.decisions).toEqual({ unknown: 1, user_authored: 1 }); @@ -199,6 +200,7 @@ describe('SQL dump', () => { learnings: 1, breadcrumbs: 1, loa_entries: 1, + dedup_lineage: 0, }); // Legacy NULL restores as NULL; known values survive verbatim diff --git a/tests/db/migrations.test.ts b/tests/db/migrations.test.ts index 8a52d55..62added 100644 --- a/tests/db/migrations.test.ts +++ b/tests/db/migrations.test.ts @@ -188,7 +188,8 @@ describe('MIGRATIONS array', () => { test('has expected number of migrations', () => { // 7 → 8: importance column on messages/decisions/learnings/loa_entries (Sprint #4) // 8 → 9: provenance column on all five memory tables (issue #42) - expect(MIGRATIONS.length).toBe(9); + // 9 → 10: dedup_lineage table (issue #45) + expect(MIGRATIONS.length).toBe(10); }); test('all entries are functions', () => { diff --git a/tests/lib/dedup.test.ts b/tests/lib/dedup.test.ts new file mode 100644 index 0000000..e6ff9dd --- /dev/null +++ b/tests/lib/dedup.test.ts @@ -0,0 +1,200 @@ +// recall dedup — pure logic (issue #45). +// +// Unit tests over the pure exported functions: normalization, survivor +// selection (provenance > richness > importance > recency > id), exact +// grouping, cross-table matching, and semantic pairing. Property-based +// suites over these functions are issue #44 phase 2 — not here. + +import { describe, test, expect } from 'bun:test'; +import { + compareForSurvivor, + DEFAULT_SEMANTIC_THRESHOLD, + findCrossTableMatches, + findExactGroups, + findSemanticPairs, + normalizeText, + provenanceRank, + selectSurvivor, + type DedupCandidate, +} from '../../src/lib/dedup'; +import { PROVENANCE_VALUES } from '../../src/types/index'; + +function candidate(overrides: Partial = {}): DedupCandidate { + return { + table: 'breadcrumbs', + id: 1, + project: null, + provenance: null, + importance: 5, + created_at: '2026-01-01 00:00:00', + key: 'k', + textLength: 50, + ...overrides, + }; +} + +describe('normalizeText', () => { + test('lowercases, strips quotes, collapses whitespace', () => { + expect(normalizeText(` Use "Bun" for\n'everything' `)).toBe('use bun for everything'); + }); + + test('identical meaning with different casing/spacing normalizes equal', () => { + expect(normalizeText('Ship THE feature')).toBe(normalizeText("ship the 'feature'")); + }); +}); + +describe('provenanceRank', () => { + test('orders user_authored > verbatim > extracted > derived > unknown', () => { + const ranks = [...PROVENANCE_VALUES, null].map(p => provenanceRank(p)); + expect(ranks).toEqual([...ranks].sort((a, b) => a - b)); + expect(provenanceRank('user_authored')).toBeLessThan(provenanceRank('verbatim')); + expect(provenanceRank('verbatim')).toBeLessThan(provenanceRank('extracted')); + expect(provenanceRank('extracted')).toBeLessThan(provenanceRank('derived')); + expect(provenanceRank('derived')).toBeLessThan(provenanceRank(null)); + }); +}); + +describe('selectSurvivor', () => { + test('provenance dominates richness, importance, and recency', () => { + const weakButAuthored = candidate({ id: 1, provenance: 'user_authored', textLength: 10, importance: 1, created_at: '2020-01-01 00:00:00' }); + const richButExtracted = candidate({ id: 2, provenance: 'extracted', textLength: 9999, importance: 10, created_at: '2026-01-01 00:00:00' }); + const { survivor, duplicates } = selectSurvivor([richButExtracted, weakButAuthored]); + expect(survivor.id).toBe(1); + expect(duplicates.map(d => d.id)).toEqual([2]); + }); + + test('richness breaks provenance ties', () => { + const short = candidate({ id: 1, textLength: 10 }); + const long = candidate({ id: 2, textLength: 200 }); + expect(selectSurvivor([short, long]).survivor.id).toBe(2); + }); + + test('importance breaks richness ties', () => { + const low = candidate({ id: 1, importance: 3 }); + const high = candidate({ id: 2, importance: 8 }); + expect(selectSurvivor([low, high]).survivor.id).toBe(2); + }); + + test('recency breaks importance ties', () => { + const older = candidate({ id: 1, created_at: '2025-01-01 00:00:00' }); + const newer = candidate({ id: 2, created_at: '2026-01-01 00:00:00' }); + expect(selectSurvivor([older, newer]).survivor.id).toBe(2); + }); + + test('lowest id is the final deterministic tie-break', () => { + const a = candidate({ id: 7 }); + const b = candidate({ id: 3 }); + expect(selectSurvivor([a, b]).survivor.id).toBe(3); + // Total order: comparator never reports equality for distinct ids. + expect(compareForSurvivor(a, b)).toBeGreaterThan(0); + }); + + test('throws on an empty group', () => { + expect(() => selectSurvivor([])).toThrow(); + }); +}); + +describe('findExactGroups', () => { + test('groups same table + project + key; singletons excluded', () => { + const groups = findExactGroups([ + candidate({ id: 1, key: 'a' }), + candidate({ id: 2, key: 'a' }), + candidate({ id: 3, key: 'b' }), + ]); + expect(groups.length).toBe(1); + expect(groups[0].map(c => c.id).sort()).toEqual([1, 2]); + }); + + test('identical text in different projects is not grouped', () => { + const groups = findExactGroups([ + candidate({ id: 1, key: 'a', project: 'alpha' }), + candidate({ id: 2, key: 'a', project: 'beta' }), + ]); + expect(groups.length).toBe(0); + }); +}); + +describe('findCrossTableMatches', () => { + test('reports same text across tables, ignores single-table matches', () => { + const matches = findCrossTableMatches([ + candidate({ id: 1, key: 'a', table: 'breadcrumbs' }), + candidate({ id: 2, key: 'a', table: 'messages' }), + candidate({ id: 3, key: 'b', table: 'breadcrumbs' }), + candidate({ id: 4, key: 'b', table: 'breadcrumbs' }), + ]); + expect(matches.length).toBe(1); + expect(matches[0].members.map(m => m.table).sort()).toEqual(['breadcrumbs', 'messages']); + }); +}); + +describe('findSemanticPairs', () => { + const unit = (x: number, y: number, z: number) => [x, y, z]; + + test('pairs at/above threshold; orthogonal vectors never pair', () => { + const near = Math.sqrt(1 - 0.97 * 0.97); + const pairs = findSemanticPairs( + [ + { candidate: candidate({ id: 1, key: 'a' }), embedding: unit(1, 0, 0) }, + { candidate: candidate({ id: 2, key: 'b' }), embedding: unit(0.97, near, 0) }, + { candidate: candidate({ id: 3, key: 'c' }), embedding: unit(0, 0, 1) }, + ], + DEFAULT_SEMANTIC_THRESHOLD + ); + expect(pairs.length).toBe(1); + expect([pairs[0].a.id, pairs[0].b.id].sort()).toEqual([1, 2]); + expect(pairs[0].similarity).toBeCloseTo(0.97, 5); + }); + + test('below-threshold pairs are never produced', () => { + const near = Math.sqrt(1 - 0.97 * 0.97); + const pairs = findSemanticPairs( + [ + { candidate: candidate({ id: 1, key: 'a' }), embedding: unit(1, 0, 0) }, + { candidate: candidate({ id: 2, key: 'b' }), embedding: unit(0.97, near, 0) }, + ], + 0.99 + ); + expect(pairs.length).toBe(0); + }); + + test('same-table pairs across projects are ignored; cross-table pairs are flagged', () => { + const pairs = findSemanticPairs( + [ + { candidate: candidate({ id: 1, key: 'a', project: 'alpha' }), embedding: unit(1, 0, 0) }, + { candidate: candidate({ id: 2, key: 'b', project: 'beta' }), embedding: unit(1, 0, 0) }, + { candidate: candidate({ id: 3, key: 'c', table: 'messages', project: 'alpha' }), embedding: unit(1, 0, 0) }, + ], + 0.95 + ); + // 1↔2: same table, different projects — ignored. + // 1↔3 and 2↔3: cross-table — report-only pairs. + expect(pairs.every(p => !p.sameTable)).toBe(true); + expect(pairs.length).toBe(2); + }); + + test('identical normalized keys are left to the exact pass', () => { + const pairs = findSemanticPairs( + [ + { candidate: candidate({ id: 1, key: 'same' }), embedding: unit(1, 0, 0) }, + { candidate: candidate({ id: 2, key: 'same' }), embedding: unit(1, 0, 0) }, + ], + 0.95 + ); + expect(pairs.length).toBe(0); + }); + + test('results are sorted strongest-first', () => { + const mid = Math.sqrt(1 - 0.96 * 0.96); + const near = Math.sqrt(1 - 0.99 * 0.99); + const pairs = findSemanticPairs( + [ + { candidate: candidate({ id: 1, key: 'a' }), embedding: unit(1, 0, 0) }, + { candidate: candidate({ id: 2, key: 'b' }), embedding: unit(0.99, near, 0) }, + { candidate: candidate({ id: 3, key: 'c' }), embedding: unit(0.96, mid, 0) }, + ], + 0.95 + ); + const sims = pairs.map(p => p.similarity); + expect(sims).toEqual([...sims].sort((a, b) => b - a)); + }); +});