dedup: cross-run re-marking of a prior survivor breaks the visible-survivor safety property (transitive chains across runs)

## Context

PR #60's review established a binding safety property for dedup ([Blocking 1](https://github.com/edheltzel/Recall/pull/60#issuecomment-4676730890)): every hidden (marked) record must have a **visible** survivor with direct similarity ≥ threshold — and in destructive mode, no record may be hard-deleted when its similarity to the only surviving record is below the configured threshold. The fix (`plannedSurvivorKeys`, `src/lib/dedup.ts:526`) enforces this **within a single plan**: a planned survivor can never be re-marked by a weaker pair in the same run, so lineage is one hop deep per run.

## Gap

The guard's state is per-run; nothing persists survivorship across runs. `planDedup` excludes previously **marked duplicates** from scans (`loadMarkedDuplicates`), but prior **survivors** re-enter every later run as ordinary candidates:

1. Run 1: B marked duplicate of A (`sim(B,A) ≥ t`). A stays visible — invariant holds.
2. Run 2: A (unmarked) pairs with new record C at `sim(A,C) ≥ t`; the comparator picks C → A marked duplicate of C.
3. Now B's recorded survivor A is itself hidden. B's only visible representative is C — and `sim(B,C)` was never measured and can be `< t` (two ~18° hops at threshold 0.95 compose to ~0.81).

- **Marking mode:** B is hidden from search while its only visible neighbor is below threshold — the user-facing property from PR #60's Blocking 1, violated across runs.
- **Destructive mode (worse):** run 1 `--execute --delete` removes B; run 2 removes A; only C survives. B was permanently deleted with no surviving record at ≥ threshold — the precise harm Blocking 1 cited. (Destructive mode is double-opt-in, which mitigates.)

The per-row invariant in the `dedup.ts` header ("every marked duplicate has a direct similarity ≥ threshold to its survivor") still holds — each lineage row was measured directly at mark time. It is the **visible-survivor** property that decays: `dedup_lineage` can accumulate marked→marked survivor chains.

Exact-pass chains are benign (exact duplicates share normalized text, so `sim(B,C) ≈ sim(A,C) ≥ t`). The hazard is semantic→semantic across runs.

## Where this surfaced

PR #61's property suite documents this honestly: its one-hop property is scoped *within a plan*, and its repeated-runs property asserts auditability (unique active marks, record preservation, no re-marking of marked records) rather than cross-run one-hop, because the implementation permits the scenario above (`tests/lib/dedup.property.test.ts:13-18`). The PR #61 review ruled that the scope note accurately describes implementation behavior, but the cross-run behavior itself is not consistent with the safety property PR #60's blocker was grounded in — hence this issue.

## Options

a. **Sticky survivors** — load survivor keys from `dedup_lineage` (mirror of `loadMarkedDuplicates`) and exclude prior survivors from being re-marked in later runs. Simple; conservative in the safe direction. Tradeoff: survivorship becomes order-dependent across runs (an early survivor can never later be consolidated under a genuinely better record).
b. **Chain compression at apply** — when marking A→C, re-point existing `B→A` lineage rows to `B→C`. Automatically safe for exact rows (same normalized text); for semantic rows requires measuring `sim(B,C)` and skipping (or keeping A visible) when below threshold.
c. **Document as accepted limitation** — at minimum state it in `docs/cli-reference.md`'s safety model, and have destructive mode refuse (or warn) when a planned duplicate is itself a recorded survivor in `dedup_lineage`.

Recommend (a) or (b), plus a cross-run property test pinning whichever invariant is chosen — PR #61's suite is the natural home (its scope note already marks the spot).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

dedup: cross-run re-marking of a prior survivor breaks the visible-survivor safety property (transitive chains across runs) #63

Context

Gap

Where this surfaced

Options

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

dedup: cross-run re-marking of a prior survivor breaks the visible-survivor safety property (transitive chains across runs) #63

Description

Context

Gap

Where this surfaced

Options

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions