You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PR #60's review established a binding safety property for dedup (Blocking 1): every hidden (marked) record must have a visible survivor with direct similarity ≥ threshold — and in destructive mode, no record may be hard-deleted when its similarity to the only surviving record is below the configured threshold. The fix (plannedSurvivorKeys, src/lib/dedup.ts:526) enforces this within a single plan: a planned survivor can never be re-marked by a weaker pair in the same run, so lineage is one hop deep per run.
Gap
The guard's state is per-run; nothing persists survivorship across runs. planDedup excludes previously marked duplicates from scans (loadMarkedDuplicates), but prior survivors re-enter every later run as ordinary candidates:
Run 1: B marked duplicate of A (sim(B,A) ≥ t). A stays visible — invariant holds.
Run 2: A (unmarked) pairs with new record C at sim(A,C) ≥ t; the comparator picks C → A marked duplicate of C.
Now B's recorded survivor A is itself hidden. B's only visible representative is C — and sim(B,C) was never measured and can be < t (two ~18° hops at threshold 0.95 compose to ~0.81).
Destructive mode (worse): run 1 --execute --delete removes B; run 2 removes A; only C survives. B was permanently deleted with no surviving record at ≥ threshold — the precise harm Blocking 1 cited. (Destructive mode is double-opt-in, which mitigates.)
The per-row invariant in the dedup.ts header ("every marked duplicate has a direct similarity ≥ threshold to its survivor") still holds — each lineage row was measured directly at mark time. It is the visible-survivor property that decays: dedup_lineage can accumulate marked→marked survivor chains.
Exact-pass chains are benign (exact duplicates share normalized text, so sim(B,C) ≈ sim(A,C) ≥ t). The hazard is semantic→semantic across runs.
Where this surfaced
PR #61's property suite documents this honestly: its one-hop property is scoped within a plan, and its repeated-runs property asserts auditability (unique active marks, record preservation, no re-marking of marked records) rather than cross-run one-hop, because the implementation permits the scenario above (tests/lib/dedup.property.test.ts:13-18). The PR #61 review ruled that the scope note accurately describes implementation behavior, but the cross-run behavior itself is not consistent with the safety property PR #60's blocker was grounded in — hence this issue.
Options
a. Sticky survivors — load survivor keys from dedup_lineage (mirror of loadMarkedDuplicates) and exclude prior survivors from being re-marked in later runs. Simple; conservative in the safe direction. Tradeoff: survivorship becomes order-dependent across runs (an early survivor can never later be consolidated under a genuinely better record).
b. Chain compression at apply — when marking A→C, re-point existing B→A lineage rows to B→C. Automatically safe for exact rows (same normalized text); for semantic rows requires measuring sim(B,C) and skipping (or keeping A visible) when below threshold.
c. Document as accepted limitation — at minimum state it in docs/cli-reference.md's safety model, and have destructive mode refuse (or warn) when a planned duplicate is itself a recorded survivor in dedup_lineage.
Recommend (a) or (b), plus a cross-run property test pinning whichever invariant is chosen — PR #61's suite is the natural home (its scope note already marks the spot).
Context
PR #60's review established a binding safety property for dedup (Blocking 1): every hidden (marked) record must have a visible survivor with direct similarity ≥ threshold — and in destructive mode, no record may be hard-deleted when its similarity to the only surviving record is below the configured threshold. The fix (
plannedSurvivorKeys,src/lib/dedup.ts:526) enforces this within a single plan: a planned survivor can never be re-marked by a weaker pair in the same run, so lineage is one hop deep per run.Gap
The guard's state is per-run; nothing persists survivorship across runs.
planDedupexcludes previously marked duplicates from scans (loadMarkedDuplicates), but prior survivors re-enter every later run as ordinary candidates:sim(B,A) ≥ t). A stays visible — invariant holds.sim(A,C) ≥ t; the comparator picks C → A marked duplicate of C.sim(B,C)was never measured and can be< t(two ~18° hops at threshold 0.95 compose to ~0.81).--execute --deleteremoves B; run 2 removes A; only C survives. B was permanently deleted with no surviving record at ≥ threshold — the precise harm Blocking 1 cited. (Destructive mode is double-opt-in, which mitigates.)The per-row invariant in the
dedup.tsheader ("every marked duplicate has a direct similarity ≥ threshold to its survivor") still holds — each lineage row was measured directly at mark time. It is the visible-survivor property that decays:dedup_lineagecan accumulate marked→marked survivor chains.Exact-pass chains are benign (exact duplicates share normalized text, so
sim(B,C) ≈ sim(A,C) ≥ t). The hazard is semantic→semantic across runs.Where this surfaced
PR #61's property suite documents this honestly: its one-hop property is scoped within a plan, and its repeated-runs property asserts auditability (unique active marks, record preservation, no re-marking of marked records) rather than cross-run one-hop, because the implementation permits the scenario above (
tests/lib/dedup.property.test.ts:13-18). The PR #61 review ruled that the scope note accurately describes implementation behavior, but the cross-run behavior itself is not consistent with the safety property PR #60's blocker was grounded in — hence this issue.Options
a. Sticky survivors — load survivor keys from
dedup_lineage(mirror ofloadMarkedDuplicates) and exclude prior survivors from being re-marked in later runs. Simple; conservative in the safe direction. Tradeoff: survivorship becomes order-dependent across runs (an early survivor can never later be consolidated under a genuinely better record).b. Chain compression at apply — when marking A→C, re-point existing
B→Alineage rows toB→C. Automatically safe for exact rows (same normalized text); for semantic rows requires measuringsim(B,C)and skipping (or keeping A visible) when below threshold.c. Document as accepted limitation — at minimum state it in
docs/cli-reference.md's safety model, and have destructive mode refuse (or warn) when a planned duplicate is itself a recorded survivor indedup_lineage.Recommend (a) or (b), plus a cross-run property test pinning whichever invariant is chosen — PR #61's suite is the natural home (its scope note already marks the spot).