12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,18 @@ while MCP tool names (`memory_search`, `memory_add`, etc.) remain stable.

### Added

- **`recall dedup`** — non-destructive dedup with provenance-aware survivor
selection (#45): dry-run by default, `--execute` marks duplicates in the new
`dedup_lineage` table (schema migration 9→10) without touching the records,
`--delete` is the destructive opt-in (take `recall export --backup` first).
Detection combines normalized-text matching with semantic matching over
stored embeddings (conservative 0.95 default threshold; the semantic pass
is skipped, and reported as skipped, when embeddings are unavailable).
Survivor priority is `user_authored > verbatim > extracted > derived >
unknown`, then richness, importance, recency.
Within-table only; cross-table candidates are report-only. Marked
duplicates are hidden from every search path unless
`recall search --include-duplicates` is passed, and lineage rows are
included in `recall export`.
- **`recall export`** — portable and disaster-recovery exports (#43): JSON,
Markdown, SQL dump, and SQLite (`VACUUM INTO`) formats with a manifest
(counts + provenance counts including explicit `unknown`), a stdout/file/
10 changes: 10 additions & 0 deletions docs/architecture.md
@@ -70,6 +70,7 @@ both.
| telos | Purpose framework entries (optional) | Yes |
| documents | Imported standalone markdown documents (optional) | Yes |
| embeddings | Vector embeddings for semantic search (768-dim, nomic-embed-text) | N/A |
| dedup_lineage | Duplicate lineage audit trail from `recall dedup` (survivor, duplicate, reason, similarity, status) | No |

All FTS5-indexed tables have automatic sync triggers.

@@ -88,6 +89,15 @@ rows stay `NULL` (unknown) until classified with
`recall provenance backfill`, which only acts on deterministic write-path
evidence and never guesses.

The `dedup_lineage` table was added in schema migration 9→10. `recall dedup`
marks duplicate records non-destructively by writing lineage rows here
(survivor table/id, duplicate table/id, reason, similarity, status); marked
duplicates stay in their source tables but are hidden from search unless
`--include-duplicates` is passed. Survivor selection follows provenance order
(`user_authored > verbatim > extracted > derived > unknown`), then richness,
importance, and recency. Dedup acts within a table only; cross-table
candidates are report-only.
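
The survivor ordering described above can be sketched as a comparator. This is an illustration only, not the code in `src/lib/dedup.ts`: the `Candidate` shape, its field names, and the reading of "richness" as normalized text length and "recency" as "newer wins" are assumptions made for the example.

```typescript
// Hypothetical record shape; the project's real types may differ.
type Provenance = 'user_authored' | 'verbatim' | 'extracted' | 'derived' | null;

interface Candidate {
  id: number;
  provenance: Provenance;   // null stands for unknown (legacy rows)
  normalizedLength: number; // assumed richness proxy: longer normalized text wins
  importance: number;       // assumed: higher wins
  createdAt: number;        // assumed: epoch milliseconds, newer wins
}

// Lower rank is preferred: user_authored > verbatim > extracted > derived > unknown.
const PROVENANCE_RANK: Record<NonNullable<Provenance>, number> = {
  user_authored: 0,
  verbatim: 1,
  extracted: 2,
  derived: 3,
};
const UNKNOWN_RANK = 4;

function rank(p: Provenance): number {
  return p === null ? UNKNOWN_RANK : PROVENANCE_RANK[p];
}

// Returns a negative number when `a` is the better survivor.
function compareSurvivors(a: Candidate, b: Candidate): number {
  return (
    rank(a.provenance) - rank(b.provenance) ||
    b.normalizedLength - a.normalizedLength ||
    b.importance - a.importance ||
    b.createdAt - a.createdAt ||
    a.id - b.id // final tie-break: lowest id
  );
}

// Picks the survivor of one duplicate group; throws on an empty group.
function pickSurvivor(group: Candidate[]): Candidate {
  return group.reduce((best, c) => (compareSurvivors(c, best) < 0 ? c : best));
}
```

Chaining the comparisons with `||` keeps the priority order readable: a later criterion is consulted only when every earlier one ties.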

## Tiered RecallStart (v0.7.0+)

The `RecallStart` hook injects two tiers at the top of every session:
45 changes: 44 additions & 1 deletion docs/cli-reference.md
@@ -17,6 +17,7 @@ recall search "query" -t decisions # Hard-filter to decisions only
recall search "query" --bias-type decisions # Prefer decisions, still show other matching tables
recall search "query" -p myproject # Filter by project
recall search "query" --show-provenance # Show provenance for every result
recall search "query" --include-duplicates # Include records marked by recall dedup
recall semantic "query" # Semantic search (explicit)
recall hybrid "query" # Hybrid search (explicit)
```
@@ -46,6 +47,8 @@ FTS5 supports boolean operators and prefix matching:

By default, search output stays quiet about [Record Provenance](#record-provenance) when a record carries a known value, and visibly flags records whose provenance is unknown (legacy rows that predate the provenance column). Pass `--show-provenance` to display the provenance of every result.

Records marked as duplicates by [`recall dedup`](#dedup) are hidden from every search path (keyword, semantic, hybrid) by default — the records and their lineage remain in the database. Pass `--include-duplicates` to show them.
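
One way the default hiding could be expressed is a `NOT EXISTS` filter against `dedup_lineage`. The sketch below is hypothetical: `notMarkedDuplicateSql` here is a stand-in written for illustration (the real helper in `src/lib/dedup.ts` is not shown in this change), and `buildEmbeddingsWhere` with its `includeDuplicates` parameter is an invented name. Table and column names come from the schema in this PR. The table filter is bound as a parameter rather than interpolated into the SQL string.

```typescript
// Hypothetical stand-in for the helper the search code imports.
// `tableExpr` and `idExpr` must be trusted column expressions, never user input.
function notMarkedDuplicateSql(tableExpr: string, idExpr: string): string {
  return (
    `NOT EXISTS (SELECT 1 FROM dedup_lineage dl ` +
    `WHERE dl.duplicate_table = ${tableExpr} ` +
    `AND dl.duplicate_id = ${idExpr} ` +
    `AND dl.status = 'marked')`
  );
}

interface WhereClause {
  sql: string;      // empty string when there is nothing to filter on
  params: string[]; // values to bind to the `?` placeholders, in order
}

// Assumed helper: builds the WHERE clause for a query over `embeddings`.
function buildEmbeddingsWhere(table?: string, includeDuplicates = false): WhereClause {
  const conditions: string[] = [];
  const params: string[] = [];
  if (!includeDuplicates) {
    conditions.push(notMarkedDuplicateSql('source_table', 'source_id'));
  }
  if (table) {
    conditions.push('source_table = ?'); // bound, not interpolated
    params.push(table);
  }
  const sql = conditions.length > 0 ? `WHERE ${conditions.join(' AND ')}` : '';
  return { sql, params };
}
```

Only rows with `status = 'marked'` hide a record, which matches the partial unique index on `(duplicate_table, duplicate_id)`.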

---

## Capture
@@ -269,7 +272,7 @@ Formats:

- **json / markdown** — app-level export of the durable memory tables
(`sessions`, `messages`, `decisions`, `learnings`, `breadcrumbs`,
-`loa_entries`). Every row of a provenance-bearing table carries an explicit
+`loa_entries`, `dedup_lineage`). Every row of a provenance-bearing table carries an explicit
`provenance` field; legacy `NULL` provenance is exported as the literal
`unknown` — never omitted, never guessed (see Record Provenance above).
Embeddings are excluded.
@@ -300,6 +303,46 @@ included.
overwrites an existing file (a `-N` suffix is added on collision), and prints
the output path.

## Dedup

Detect and mark duplicate memory records without erasing evidence or lineage.

```bash
recall dedup # Dry-run report (default — writes nothing)
recall dedup --execute # Mark duplicates (non-destructive)
recall dedup --execute --delete # Destructive opt-in: hard-delete duplicates
recall dedup -t breadcrumbs # Scope to one table
recall dedup -p myproject # Scope to one project
recall dedup --threshold 0.98 # Stricter semantic matching (default 0.95)
recall dedup --no-semantic # Exact/normalized text pass only
```

Safety model:

- **Dry-run by default.** Mutations require `--execute`.
- **Non-destructive by default.** `--execute` writes lineage rows to the
`dedup_lineage` table (survivor, duplicate, reason, similarity, status,
timestamp); the duplicate records themselves stay intact and are merely
hidden from search. `--delete` is the destructive opt-in and requires
`--execute` — run `recall export --backup` first.
- **Within-table only.** Dedup never merges across tables (or across
projects). Cross-table duplicate candidates are report-only.
- **Survivor priority** is `user_authored > verbatim > extracted > derived >
unknown` ([Record Provenance](#record-provenance)); ties break by richness
(longer normalized text), importance, recency, then lowest id.
- **Detection** combines exact/normalized text matching with semantic
matching over stored embeddings (no embedding service call needed). The
semantic pass is skipped — and reported as skipped — when no embeddings
exist; records are never merged below the configured `--threshold`
(conservative default: 0.95 cosine similarity). Records with fewer than 20
significant characters are never candidates.
- **Lifecycle-aware.** Only `active` decisions participate; superseded and
reverted decisions are managed by the decision lifecycle, not dedup.

Marked duplicates are hidden from all search paths by default; see
`recall search --include-duplicates`. Lineage is included in
`recall export`, so the audit trail is portable.
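
The detection rules in the safety model can be illustrated for a single pair of records. This is a minimal sketch under stated assumptions, not the project's detector: the normalization (lowercase, collapsed whitespace), the definition of "significant characters" as ASCII letters and digits, and every name below are chosen for the example. Only the 0.95 default, the 20-character floor, and the `exact` / `semantic` reason values come from the text and schema above.

```typescript
interface Item {
  text: string;
  embedding?: number[]; // stored embedding, absent when none exists
}

type MatchReason = 'exact' | 'semantic' | null;

// Assumed normalization: lowercase, collapse whitespace runs, trim.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Assumed definition of "significant characters": ASCII letters and digits.
function significantLength(text: string): number {
  return normalizeText(text).replace(/[^a-z0-9]/g, '').length;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('embeddings must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Classifies one pair: exact text match first, then the semantic pass.
function classifyPair(a: Item, b: Item, threshold = 0.95): MatchReason {
  // Records with fewer than 20 significant characters are never candidates.
  if (significantLength(a.text) < 20 || significantLength(b.text) < 20) return null;
  if (normalizeText(a.text) === normalizeText(b.text)) return 'exact';
  // Without stored embeddings the semantic pass is skipped for this pair.
  if (!a.embedding || !b.embedding) return null;
  return cosineSimilarity(a.embedding, b.embedding) >= threshold ? 'semantic' : null;
}
```

Running the cheap text comparison before the vector comparison means identical records are caught even when no embeddings exist.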

## Admin

```bash
145 changes: 145 additions & 0 deletions src/commands/dedup.ts
@@ -0,0 +1,145 @@
// recall dedup command (issue #45).
//
// Dry-run by default. --execute marks duplicates (non-destructive: records
// stay intact, hidden from search via dedup_lineage). --delete is the
// destructive opt-in and requires --execute; take a `recall export --backup`
// first. Core logic lives in src/lib/dedup.ts.

import { getDb } from '../db/connection.js';
import {
  applyDedupPlan,
  DEDUP_TABLES,
  DEFAULT_SEMANTIC_THRESHOLD,
  planDedup,
  type ApplyResult,
  type DedupPlan,
} from '../lib/dedup.js';
import type { ProvenanceTable } from '../types/index.js';

export interface DedupOptions {
  execute?: boolean;
  delete?: boolean;
  table?: string;
  project?: string;
  threshold?: number;
  semantic?: boolean;
}

export interface DedupRunResult {
  plan: DedupPlan;
  applied: ApplyResult | null;
}

export function runDedup(options: DedupOptions = {}): DedupRunResult | undefined {
  const execute = options.execute ?? false;
  const destructive = options.delete ?? false;

  if (destructive && !execute) {
    console.error('--delete requires --execute. Dry-run never deletes.');
    process.exitCode = 1;
    return undefined;
  }

  const target = options.table ?? 'all';
  if (target !== 'all' && !(DEDUP_TABLES as readonly string[]).includes(target)) {
    console.error(
      `Invalid --table "${target}". Valid tables: ${DEDUP_TABLES.join(', ')}, all.`
    );
    process.exitCode = 1;
    return undefined;
  }

  const threshold = options.threshold ?? DEFAULT_SEMANTIC_THRESHOLD;
  if (!Number.isFinite(threshold) || threshold <= 0 || threshold > 1) {
    console.error(`Invalid --threshold "${options.threshold}". Expected a number in (0, 1].`);
    process.exitCode = 1;
    return undefined;
  }

  const db = getDb();
  const plan = planDedup(db, {
    tables: target === 'all' ? undefined : [target as ProvenanceTable],
    project: options.project,
    threshold,
    semantic: options.semantic,
  });

  const mode = !execute
    ? '[DRY RUN — no changes written]'
    : destructive
      ? '[EXECUTE + DELETE — destructive: duplicates will be removed]'
      : '[EXECUTE — marking duplicates, non-destructive]';
  console.log(`${mode}\n`);

  if (destructive) {
    console.log("Recommended: run 'recall export --backup' before destructive dedup.\n");
  }

  const verb = execute ? (destructive ? 'delete' : 'mark') : 'would mark';
  let totalPlanned = 0;
  for (const report of plan.tables) {
    totalPlanned += report.planned.length;
    const unchanged = report.scanned - report.planned.length;
    const skipped: string[] = [];
    if (report.alreadyMarked > 0) skipped.push(`${report.alreadyMarked} already marked`);
    if (report.tooShort > 0) skipped.push(`${report.tooShort} too short`);
    const skippedNote = skipped.length > 0 ? ` (${skipped.join(', ')})` : '';
    console.log(
      `${report.table}: scanned ${report.scanned}, exact groups ${report.exactGroups}, ` +
      `semantic pairs ${report.semanticPairs}, ${verb} ${report.planned.length}, ` +
      `unchanged ${unchanged}${skippedNote}`
    );
    for (const entry of report.planned.slice(0, 3)) {
      const sim = entry.similarity !== null ? ` @ ${entry.similarity.toFixed(3)}` : '';
      console.log(
        ` #${entry.duplicate_id} → survivor #${entry.survivor_id} [${entry.reason}${sim}]`
      );
    }
    if (report.planned.length > 3) {
      console.log(` ...and ${report.planned.length - 3} more`);
    }
  }
  console.log('');

  if (plan.semanticSkipped) {
    console.log(`Semantic pass: skipped — ${plan.semanticSkipped}`);
  } else {
    console.log(`Semantic pass: threshold ${plan.threshold}`);
  }

  const crossText = plan.crossTable.textMatches;
  console.log(
    `Cross-table (report-only, never acted on): ${crossText.length} text match group(s), ` +
    `${plan.crossTable.semanticPairs} semantic pair(s)`
  );
  for (const match of crossText.slice(0, 5)) {
    const members = match.members.map(m => `${m.table}#${m.id}`).join(' ↔ ');
    const projectTag = match.project ? ` [${match.project}]` : '';
    console.log(` ${members}${projectTag}`);
  }
  if (crossText.length > 5) {
    console.log(` ...and ${crossText.length - 5} more`);
  }
  console.log('');

  if (!execute) {
    if (totalPlanned > 0) {
      console.log('Re-run with --execute to mark duplicates (non-destructive).');
      console.log("Marked duplicates are hidden from search; use 'recall search --include-duplicates' to see them.");
    } else {
      console.log('No duplicates found.');
    }
    return { plan, applied: null };
  }

  const applied = applyDedupPlan(db, plan, { destructive });
  if (destructive) {
    const fkNote = applied.fkProtected > 0
      ? ` (${applied.fkProtected} kept as marked — referenced by LoA lineage)`
      : '';
    console.log(`Deleted ${applied.deleted} duplicate(s)${fkNote}.`);
  } else {
    console.log(`Marked ${applied.marked} duplicate(s). Records remain intact and recoverable.`);
  }
  return { plan, applied };
}
15 changes: 11 additions & 4 deletions src/commands/embed.ts
@@ -2,8 +2,17 @@

import { getDb } from '../db/connection.js';
import { embed, embeddingToBlob, blobToEmbedding, cosineSimilarity, checkEmbeddingService, reciprocalRankFusion, EMBEDDING_MODEL } from '../lib/embeddings.js';
import { notMarkedDuplicateSql } from '../lib/dedup.js';
import { search as ftsSearch } from '../lib/memory.js';

// Marked duplicates (recall dedup, issue #45) keep their embeddings but are
// hidden from the semantic search paths, matching the FTS5 default.
function embeddingsWhere(table?: string): string {
  const conditions = [notMarkedDuplicateSql('source_table', 'source_id')];
  if (table) conditions.push(`source_table = '${table}'`);
  return `WHERE ${conditions.join(' AND ')}`;
}

interface EmbedOptions {
table?: 'loa' | 'decisions' | 'messages' | 'learnings';
limit?: number;
@@ -164,11 +173,10 @@ export async function runSemanticSearch(query: string, options: { table?: string
const queryEmbedding = queryResult.embedding;

  // Get all embeddings (for now, brute force - will optimize later)
-  const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : '';
  const embeddings = db.prepare(`
    SELECT id, source_table, source_id, embedding
    FROM embeddings
-    ${tableFilter}
+    ${embeddingsWhere(options.table)}
  `).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>;

if (embeddings.length === 0) {
@@ -304,11 +312,10 @@ export async function runHybridSearch(query: string, options: { table?: string;
const queryEmbedding = queryResult.embedding;

  // Get embeddings from database
-  const tableFilter = options.table ? `WHERE source_table = '${options.table}'` : '';
  const embeddings = db.prepare(`
    SELECT id, source_table, source_id, embedding
    FROM embeddings
-    ${tableFilter}
+    ${embeddingsWhere(options.table)}
  `).all() as Array<{ id: number; source_table: string; source_id: number; embedding: Buffer }>;

// Calculate similarities
4 changes: 3 additions & 1 deletion src/commands/search.ts
@@ -8,6 +8,7 @@ interface SearchOptions {
biasType?: string;
limit?: number;
showProvenance?: boolean;
includeDuplicates?: boolean;
}

export function runSearch(query: string, options: SearchOptions): void {
@@ -23,7 +24,8 @@ export function runSearch(query: string, options: SearchOptions): void {
    project: options.project,
    table: options.table,
    biasType: options.biasType as SearchTable | undefined,
-    limit: options.limit || 20
+    limit: options.limit || 20,
+    includeDuplicates: options.includeDuplicates
  });

if (results.length === 0) {
6 changes: 6 additions & 0 deletions src/db/migrations.ts
@@ -198,6 +198,12 @@ export const MIGRATIONS: Migration[] = [
}
}
},

// Migration 9 → 10: Dedup lineage table (issue #45).
// No-op — dedup_lineage and its indexes are brand new, handled by the
// CREATE TABLE IF NOT EXISTS DDL that runs before migrations (same
// precedent as migration 3 → 4 for the extraction tables).
(_db) => {},
];

// ---------------------------------------------------------------------------
26 changes: 26 additions & 0 deletions src/db/schema.ts
@@ -181,6 +181,24 @@ CREATE TABLE IF NOT EXISTS procedures (
times_observed INTEGER DEFAULT 2,
confidence TEXT DEFAULT 'medium' CHECK (confidence IN ('high', 'medium', 'low'))
);

-- Dedup lineage (issue #45): persistent audit trail of duplicate marking.
-- Non-destructive by default — a 'marked' row hides the duplicate from search
-- while the underlying record stays intact. 'deleted' records a destructive
-- opt-in removal. 'reverted' is reserved vocabulary for a future unmark path
-- (CHECK constraints cannot be widened later without a table rebuild).
CREATE TABLE IF NOT EXISTS dedup_lineage (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  survivor_table TEXT NOT NULL,
  survivor_id INTEGER NOT NULL,
  duplicate_table TEXT NOT NULL,
  duplicate_id INTEGER NOT NULL,
  reason TEXT NOT NULL CHECK (reason IN ('exact', 'semantic')),
  similarity REAL,
  status TEXT NOT NULL DEFAULT 'marked' CHECK (status IN ('marked', 'deleted', 'reverted')),
  detail TEXT
);
`;

export const CREATE_INDEXES = `
@@ -231,6 +249,14 @@ CREATE INDEX IF NOT EXISTS idx_documents_created ON documents(created_at);

-- Extraction session indexes
CREATE INDEX IF NOT EXISTS idx_extraction_sessions_ts ON extraction_sessions(timestamp DESC);

-- Dedup lineage indexes: the partial unique index guarantees a record can be
-- an actively marked duplicate at most once (idempotence); the survivor index
-- supports lineage audits.
CREATE UNIQUE INDEX IF NOT EXISTS idx_dedup_lineage_duplicate
  ON dedup_lineage(duplicate_table, duplicate_id) WHERE status = 'marked';
CREATE INDEX IF NOT EXISTS idx_dedup_lineage_survivor
  ON dedup_lineage(survivor_table, survivor_id);
`;

export const CREATE_FTS = `