Skip to content

dedup: cross-run re-marking of a prior survivor breaks the visible-survivor safety property (transitive chains across runs) #63

@edheltzel

Description

@edheltzel

Context

PR #60's review established a binding safety property for dedup (Blocking 1): every hidden (marked) record must have a visible survivor with direct similarity ≥ threshold — and in destructive mode, no record may be hard-deleted when its similarity to the only surviving record is below the configured threshold. The fix (plannedSurvivorKeys, src/lib/dedup.ts:526) enforces this within a single plan: a planned survivor can never be re-marked by a weaker pair in the same run, so lineage is one hop deep per run.

Gap

The guard's state is per-run; nothing persists survivorship across runs. planDedup excludes previously marked duplicates from scans (loadMarkedDuplicates), but prior survivors re-enter every later run as ordinary candidates:

  1. Run 1: B marked duplicate of A (sim(B,A) ≥ t). A stays visible — invariant holds.
  2. Run 2: A (unmarked) pairs with new record C at sim(A,C) ≥ t; the comparator picks C → A marked duplicate of C.
  3. Now B's recorded survivor A is itself hidden. B's only visible representative is C — and sim(B,C) was never measured and can be < t (two ~18° hops at threshold 0.95 compose to ~0.81).
  • Marking mode: B is hidden from search while its only visible neighbor is below threshold — the user-facing property from PR feat: add recall dedup — non-destructive dedup with provenance-aware survivor selection #60's Blocking 1, violated across runs.
  • Destructive mode (worse): run 1 --execute --delete removes B; run 2 removes A; only C survives. B was permanently deleted with no surviving record at ≥ threshold — the precise harm Blocking 1 cited. (Destructive mode is double-opt-in, which mitigates.)

The per-row invariant in the dedup.ts header ("every marked duplicate has a direct similarity ≥ threshold to its survivor") still holds — each lineage row was measured directly at mark time. It is the visible-survivor property that decays: dedup_lineage can accumulate marked→marked survivor chains.

Exact-pass chains are benign (exact duplicates share normalized text, so sim(B,C) ≈ sim(A,C) ≥ t). The hazard is semantic→semantic across runs.

Where this surfaced

PR #61's property suite documents this honestly: its one-hop property is scoped within a plan, and its repeated-runs property asserts auditability (unique active marks, record preservation, no re-marking of marked records) rather than cross-run one-hop, because the implementation permits the scenario above (tests/lib/dedup.property.test.ts:13-18). The PR #61 review ruled that the scope note accurately describes implementation behavior, but the cross-run behavior itself is not consistent with the safety property PR #60's blocker was grounded in — hence this issue.

Options

a. Sticky survivors — load survivor keys from dedup_lineage (mirror of loadMarkedDuplicates) and exclude prior survivors from being re-marked in later runs. Simple; conservative in the safe direction. Tradeoff: survivorship becomes order-dependent across runs (an early survivor can never later be consolidated under a genuinely better record).
b. Chain compression at apply — when marking A→C, re-point existing B→A lineage rows to B→C. Automatically safe for exact rows (same normalized text); for semantic rows requires measuring sim(B,C) and skipping (or keeping A visible) when below threshold.
c. Document as accepted limitation — at minimum state it in docs/cli-reference.md's safety model, and have destructive mode refuse (or warn) when a planned duplicate is itself a recorded survivor in dedup_lineage.

Recommend (a) or (b), plus a cross-run property test pinning whichever invariant is chosen — PR #61's suite is the natural home (its scope note already marks the spot).

Metadata

Metadata

Assignees

No one assigned

    Labels

    needs:humanRequires human implementationtype:bugSomething isn't working

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions