Skip to content

fix(recovery): converge roll-forward on concurrent manifest advance#296

Open
ragnorc wants to merge 3 commits into
mainfrom
correctness-by-design-fix
Open

fix(recovery): converge roll-forward on concurrent manifest advance#296
ragnorc wants to merge 3 commits into
mainfrom
correctness-by-design-fix

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

What & why

The open-time recovery sweep's roll-forward published a manifest CAS at the sidecar's pinned expected_version; under a concurrent writer that advanced the manifest past it during the classify→publish window, the CAS failed with ExpectedVersionMismatch and aborted the whole Omnigraph::open (iss-schema-apply-reopen-recovery-race). A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15), so on that conflict the sweep now re-reads and either records the already-achieved roll-forward (converge) or defers to the next pass — never failing the open. The PR also hardens the failpoint test harness (one ScopedFailPoint with with_callback, #[serial] on every failpoint test, a deterministic Rendezvous replacing fixed sleeps, and a compile-checked failpoints::names catalog enforced by a source-walk guard) — which is what makes the deterministic red→green regression possible — and gates a test-only TableStore::append_batch behind #[cfg(test)].

Backing issue / RFC

  • Fixes an accepted issue: tracked internally as iss-schema-apply-reopen-recovery-race (maintainer internal process per the template note)

Checklist

  • Change is focused (one logical change) — the recovery fix; the harness hardening and the cfg(test) gate are the enabling/supporting refactors, each in its own commit
  • Tests added/updated for behavior changes — a deterministic red→green failpoint regression (open_sweep_roll_forward_converges_when_manifest_advances_concurrently)
  • Public docs updated if user-facing surface changed — N/A (no user-facing surface change); dev docs updated (invariants.md known-gap, testing.md)
  • Reviewed against docs/dev/invariants.md — no Hard Invariant weakened; the fix instantiates invariants 7 & 15 (idempotent physical reconciliation; fail loudly only on a genuine logical conflict)

Notes for reviewers

  • rule-12 red→green is split across two commits: check out 01d726c0 to reproduce ExpectedVersionMismatch { expected: 1, actual: 2 }; 496d9bf7 turns it green.
  • The convergence proxy (current_version >= observed lance_head) is sound under the heal-first invariant — every manifest-advancing writer heals pending sidecars first, or optimize refuses on an unrecovered graph — documented at sidecar_intent_satisfied.
  • Scope: closes only the roll-forward face. The roll-back twin (iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and still needs a cross-process lease — the known-gap is updated accordingly.
  • Follow-up considered, not done: have the publisher return the fresh latest_per_table it already computed at conflict detection, so the converge path avoids a re-open / second TOCTOU.

Note

High Risk
Changes core manifest recovery and open-time reconciliation on CAS races; incorrect convergence logic could mask real conflicts or leave stale sidecars.

Overview
Fixes iss-schema-apply-reopen-recovery-race: when open-time recovery roll-forward loses a manifest CAS because another writer advanced the manifest during classify→publish, the sweep no longer fails Omnigraph::open with ExpectedVersionMismatch. It re-reads the manifest and either converges (audit + delete sidecar if the sidecar’s Lance intent is already satisfied) or defers the sidecar for a later pass. A new failpoint recovery.before_roll_forward_publish pins that window for tests.

Supporting changes: compile-checked failpoints::names catalogs (engine + cluster) with a source-walk guard; one shared ScopedFailPoint (including with_callback); #[serial] on failpoint integration tests; Rendezvous for deterministic multi-thread tests instead of sleeps; cluster tests use engine ScopedFailPoint; TableStore::append_batch gated to #[cfg(test)]. Dev docs updated for convergence vs remaining roll-back / cross-process gaps.

Reviewed by Cursor Bugbot for commit 496d9bf. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

Fixes iss-schema-apply-reopen-recovery-race: when the open-time recovery sweep's roll-forward publish loses a manifest CAS to a concurrent writer during the classify→publish window, it now re-reads the live manifest and either converges (records audit + deletes the sidecar idempotently) or defers to the next open, rather than aborting Omnigraph::open with ExpectedVersionMismatch. The correctness argument rests on the heal-first invariant — any writer that advances a table's manifest version must have healed all pending sidecars first, so current_version >= lance_head implies the sidecar's intent was already published.

  • Core fix: converge_or_defer_roll_forward + sidecar_intent_satisfied in recovery.rs; new failpoint RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH pins the TOCTOU window for the deterministic regression test.
  • Harness hardening: compile-checked failpoints::names catalog, ScopedFailPoint::with_callback, #[serial] on all failpoint tests, Rendezvous replacing fixed sleeps, source-walk failpoint_names_guard, cluster tests reuse engine's ScopedFailPoint.
  • TableStore::append_batch gated to #[cfg(test)], enforcing no new engine call sites by construction.

Confidence Score: 4/5

The convergence logic is well-argued and the soundness proof is documented inline; the two findings are contained to audit semantics and a test-hygiene gap in the new guard.

The core converge_or_defer_roll_forward path correctly re-reads the manifest and gates convergence on the heal-first invariant. The audit duplication issue means operators can see two RolledForward entries for one recovery event when two concurrent opens race, but the graph state itself is consistent. The multi-line park_first call bypasses the new names-guard for one site. Both are non-blocking quality issues.

recovery.rs (converge path + audit deduplication) and tests/failpoints.rs (multi-line park_first literal)

Important Files Changed

Filename Overview
crates/omnigraph/src/db/manifest/recovery.rs Core fix: adds converge_or_defer_roll_forward to handle ExpectedVersionMismatch during open-time roll-forward publish; new sidecar_intent_satisfied proxy relies on heal-first invariant; process_sidecar signature widened to Arc; potential duplicate audit record on concurrent-open convergence.
crates/omnigraph/src/failpoints.rs Adds compile-checked names catalog, consolidates ScopedFailPoint (with with_callback) into the engine crate, removes the duplicate from omnigraph-cluster.
crates/omnigraph/tests/failpoints.rs All tests gain #[serial]; Rendezvous replaces sleeps/loops; new regression test added; one park_first call retains a bare string literal that bypasses the names guard.
crates/omnigraph/tests/helpers/failpoint.rs New Rendezvous helper: park-first-arrival callback with bounded 30s spin, async wait_until_reached with 12s timeout, release, and RAII-cleanup via _failpoint drop.
crates/omnigraph/tests/failpoint_names_guard.rs New source-walk guard enforcing names catalog usage; same-line pattern matching misses multi-line call sites like the one in inline_delete_conflict_writes_sidecar_before_rejecting.
crates/omnigraph/src/table_store.rs append_batch gated to #[cfg(test)]; enforces no new engine call sites by construction and silences the dead-code warning.
crates/omnigraph-cluster/src/failpoints.rs Removes duplicate ScopedFailPoint; cluster tests now use omnigraph::failpoints::ScopedFailPoint via the engine's failpoints feature dep.
docs/dev/invariants.md Known Gaps updated: roll-forward convergence closed; roll-back twin still requires cross-process lease.

Fix All in Claude Code

Reviews (1): Last reviewed commit: "fix(recovery): converge roll-forward whe..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

ragnorc added 3 commits June 21, 2026 14:42
…(test)

The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.
…ward CAS race

Hardens the test infrastructure around the process-global `fail` registry, and
adds a deterministic red repro for the open-time recovery sweep's roll-forward
CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next
commit — this commit is intentionally red (rule 12: red→green visible in log).

Harness:
- One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate
  is removed and cluster tests reuse the engine type via `omnigraph/failpoints`.
- `#[serial]` on every failpoint test (the registry is process-global, so shared
  names interfere under parallelism); `serial_test` added to cluster dev-deps.
- `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release)
  replaces fixed-`sleep` cross-thread coordination; the three concurrent tests
  now rendezvous deterministically. The reached flag doubles as a fired-assert.
- Compile-checked `failpoints::names` catalog (engine + cluster); every call
  site references a const, and `failpoint_names_guard.rs` enforces "no string
  literal names" by source-walk, so a typo is a build error not a silent no-fire.

Red repro:
- New `recovery.before_roll_forward_publish` failpoint at the sweep's
  classify -> publish-CAS window (the only injection point there).
- `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two
  concurrent open-sweeps race one pending sidecar; the sweep parked at the
  failpoint loses its publish CAS to the other and fails the open with
  `ExpectedVersionMismatch`. FAILS at this commit by design.
…rrently

The open-time recovery sweep classified a pending sidecar as RolledPastExpected,
then published a manifest CAS at the sidecar's pinned expected_version. Under a
concurrent writer that advanced the manifest past expected during the
classify -> publish window, the CAS failed with ExpectedVersionMismatch and
`?`-propagated, failing the whole Omnigraph::open.
iss-schema-apply-reopen-recovery-race.

A roll-forward's postcondition is "the manifest reflects the sidecar's committed
Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an
ExpectedVersionMismatch, re-read the live manifest and check whether the
sidecar's intent is already satisfied (every pinned table at a version >= the
one we observed and tried to publish; added tables registered; tombstones gone
— sound under the heal-first invariant, documented at the check). If satisfied,
this is convergence: record the RolledForward audit + delete the sidecar
idempotently. If only partway, defer to the next pass. Either way the open no
longer fails. Other errors still propagate; a genuine logical conflict
resurfaces via the classifier's InvariantViolation.

Turns the red repro from the previous commit green. The roll-BACK twin
(iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and
still needs a cross-process lease — the known-gap is updated accordingly.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.

// then lands deterministically before the delete resumes, so the
// delete's manifest CAS is guaranteed stale — no retry loop, no sleep.
let rv = helpers::failpoint::Rendezvous::park_first(
"mutation.delete_node_pre_primary_delete",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failpoint guard literal violation

Medium Severity

The new Rendezvous::park_first call passes a bare string for mutation.delete_node_pre_primary_delete instead of names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE. The added failpoint_names_use_the_compile_checked_catalog guard treats any park_first(" line in tests/failpoints.rs as a violation, so this commit should fail that test in CI.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.

RecoveryKind::RolledForward,
outcomes,
)
.await?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate recovery audit on converge

Low Severity

When a roll-forward publish loses a CAS but the manifest already reflects the sidecar goal, converge_or_defer_roll_forward always appends another RolledForward audit. If another open sweep already rolled forward, recorded audit, and deleted the sidecar, the losing sweep still writes a second audit row for the same operation_id.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 496d9bf757

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

.unwrap_or(pin.post_commit_pin),
})
.collect();
record_audit(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check audit idempotency before converging

When two recovery sweeps race the same sidecar, the sweep that loses the manifest CAS now reaches this convergence path after the winner has already published, recorded the RolledForward audit, and deleted the sidecar. This unconditional record_audit appends a second recovery audit/commit for the same operation_id (the new regression test exercises exactly this race but only checks readability), so operators will see duplicate recoveries and helpers that expect one audit row would fail. Re-check whether the sidecar/audit still needs cleanup before recording here.

Useful? React with 👍 / 👎.

Comment on lines +1368 to +1413
async fn converge_or_defer_roll_forward(
root_uri: &str,
storage: &std::sync::Arc<dyn StorageAdapter>,
sidecar: &RecoverySidecar,
states: &[ClassifiedTable],
conflict: OmniError,
) -> Result<bool> {
let fresh = fresh_snapshot_for_sidecar(root_uri, storage, sidecar).await?;
if !sidecar_intent_satisfied(&fresh, sidecar, states) {
warn!(
operation_id = sidecar.operation_id.as_str(),
writer_kind = ?sidecar.writer_kind,
"recovery: roll-forward publish lost a CAS and the manifest has not \
yet reached the sidecar's goal; deferring to the next pass \
(conflict: {conflict})"
);
return Ok(false);
}
warn!(
operation_id = sidecar.operation_id.as_str(),
writer_kind = ?sidecar.writer_kind,
"recovery: roll-forward publish lost a CAS to a concurrent writer that \
already reached the goal; converging (RolledForward audit + delete)"
);
let outcomes: Vec<TableOutcome> = sidecar
.tables
.iter()
.map(|pin| TableOutcome {
table_key: pin.table_key.clone(),
from_version: pin.expected_version,
to_version: fresh
.entry(&pin.table_key)
.map(|e| e.table_version)
.unwrap_or(pin.post_commit_pin),
})
.collect();
record_audit(
root_uri,
sidecar,
fresh.version(),
RecoveryKind::RolledForward,
outcomes,
)
.await?;
delete_sidecar_by_operation_id(root_uri, storage.as_ref(), &sidecar.operation_id).await?;
Ok(true)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicate audit row when two concurrent opens both converge the same sidecar

converge_or_defer_roll_forward calls record_audit(…, RecoveryKind::RolledForward, …) unconditionally when sidecar_intent_satisfied is true. But in the concurrent-open scenario covered by the new test, the open that WON the CAS already appended a RolledForward row in the normal roll_forward_allrecord_audit path. The open that lost the CAS and ended up here appends a second row for the same operation_id. The result is two RolledForward entries in _graph_commit_recoveries.lance for one recovery event — visible in omnigraph commit list --filter actor=omnigraph:recovery and any audit-consuming tooling. The test does not assert on the audit record count, so this goes undetected. Consider recording with a distinct RecoveryKind (e.g. ConvergedForward) or skipping the audit entry when the sidecar is already absent, to keep the audit table append-idempotent across concurrent sweeps.

Fix in Claude Code

Comment on lines +960 to 962
let rv = helpers::failpoint::Rendezvous::park_first(
"mutation.delete_node_pre_primary_delete",
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Bare string literal bypasses the failpoint_names_guard. The guard pattern "park_first(\"" checks whether park_first( and " appear on the same line; splitting the call across two lines silently sidesteps it. MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE is already in the names catalog — use it here so the guard actually covers this site.

Suggested change
let rv = helpers::failpoint::Rendezvous::park_first(
"mutation.delete_node_pre_primary_delete",
);
let rv = helpers::failpoint::Rendezvous::park_first(
names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE,
);

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant