fix(recovery): converge roll-forward on concurrent manifest advance#296
fix(recovery): converge roll-forward on concurrent manifest advance#296ragnorc wants to merge 3 commits into
Conversation
…(test)
The inherent append_batch is used only by in-source recovery test setup, but
the non-test lib build (cfg(test) off) cannot see those callers and emitted a
dead_code warning. Gating the method #[cfg(test)] silences the false positive
and enforces its own doc contract ("no new engine call sites") by construction
— engine code physically cannot call a cfg(test) method.
…ward CAS race Hardens the test infrastructure around the process-global `fail` registry, and adds a deterministic red repro for the open-time recovery sweep's roll-forward CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next commit — this commit is intentionally red (rule 12: red→green visible in log). Harness: - One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate is removed and cluster tests reuse the engine type via `omnigraph/failpoints`. - `#[serial]` on every failpoint test (the registry is process-global, so shared names interfere under parallelism); `serial_test` added to cluster dev-deps. - `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release) replaces fixed-`sleep` cross-thread coordination; the three concurrent tests now rendezvous deterministically. The reached flag doubles as a fired-assert. - Compile-checked `failpoints::names` catalog (engine + cluster); every call site references a const, and `failpoint_names_guard.rs` enforces "no string literal names" by source-walk, so a typo is a build error not a silent no-fire. Red repro: - New `recovery.before_roll_forward_publish` failpoint at the sweep's classify -> publish-CAS window (the only injection point there). - `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one pending sidecar; the sweep parked at the failpoint loses its publish CAS to the other and fails the open with `ExpectedVersionMismatch`. FAILS at this commit by design.
…rrently The open-time recovery sweep classified a pending sidecar as RolledPastExpected, then published a manifest CAS at the sidecar's pinned expected_version. Under a concurrent writer that advanced the manifest past expected during the classify -> publish window, the CAS failed with ExpectedVersionMismatch and `?`-propagated, failing the whole Omnigraph::open. iss-schema-apply-reopen-recovery-race. A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an ExpectedVersionMismatch, re-read the live manifest and check whether the sidecar's intent is already satisfied (every pinned table at a version >= the one we observed and tried to publish; added tables registered; tombstones gone — sound under the heal-first invariant, documented at the check). If satisfied, this is convergence: record the RolledForward audit + delete the sidecar idempotently. If only partway, defer to the next pass. Either way the open no longer fails. Other errors still propagate; a genuine logical conflict resurfaces via the classifier's InvariantViolation. Turns the red repro from the previous commit green. The roll-BACK twin (iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and still needs a cross-process lease — the known-gap is updated accordingly.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.
| // then lands deterministically before the delete resumes, so the | ||
| // delete's manifest CAS is guaranteed stale — no retry loop, no sleep. | ||
| let rv = helpers::failpoint::Rendezvous::park_first( | ||
| "mutation.delete_node_pre_primary_delete", |
There was a problem hiding this comment.
Failpoint guard literal violation
Medium Severity
The new Rendezvous::park_first call passes a bare string for mutation.delete_node_pre_primary_delete instead of names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE. The added failpoint_names_use_the_compile_checked_catalog guard treats any park_first(" line in tests/failpoints.rs as a violation, so this commit should fail that test in CI.
Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.
| RecoveryKind::RolledForward, | ||
| outcomes, | ||
| ) | ||
| .await?; |
There was a problem hiding this comment.
Duplicate recovery audit on converge
Low Severity
When a roll-forward publish loses a CAS but the manifest already reflects the sidecar goal, converge_or_defer_roll_forward always appends another RolledForward audit. If another open sweep already rolled forward, recorded audit, and deleted the sidecar, the losing sweep still writes a second audit row for the same operation_id.
Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 496d9bf757
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .unwrap_or(pin.post_commit_pin), | ||
| }) | ||
| .collect(); | ||
| record_audit( |
There was a problem hiding this comment.
Check audit idempotency before converging
When two recovery sweeps race the same sidecar, the sweep that loses the manifest CAS now reaches this convergence path after the winner has already published, recorded the RolledForward audit, and deleted the sidecar. This unconditional record_audit appends a second recovery audit/commit for the same operation_id (the new regression test exercises exactly this race but only checks readability), so operators will see duplicate recoveries and helpers that expect one audit row would fail. Re-check whether the sidecar/audit still needs cleanup before recording here.
Useful? React with 👍 / 👎.
| async fn converge_or_defer_roll_forward( | ||
| root_uri: &str, | ||
| storage: &std::sync::Arc<dyn StorageAdapter>, | ||
| sidecar: &RecoverySidecar, | ||
| states: &[ClassifiedTable], | ||
| conflict: OmniError, | ||
| ) -> Result<bool> { | ||
| let fresh = fresh_snapshot_for_sidecar(root_uri, storage, sidecar).await?; | ||
| if !sidecar_intent_satisfied(&fresh, sidecar, states) { | ||
| warn!( | ||
| operation_id = sidecar.operation_id.as_str(), | ||
| writer_kind = ?sidecar.writer_kind, | ||
| "recovery: roll-forward publish lost a CAS and the manifest has not \ | ||
| yet reached the sidecar's goal; deferring to the next pass \ | ||
| (conflict: {conflict})" | ||
| ); | ||
| return Ok(false); | ||
| } | ||
| warn!( | ||
| operation_id = sidecar.operation_id.as_str(), | ||
| writer_kind = ?sidecar.writer_kind, | ||
| "recovery: roll-forward publish lost a CAS to a concurrent writer that \ | ||
| already reached the goal; converging (RolledForward audit + delete)" | ||
| ); | ||
| let outcomes: Vec<TableOutcome> = sidecar | ||
| .tables | ||
| .iter() | ||
| .map(|pin| TableOutcome { | ||
| table_key: pin.table_key.clone(), | ||
| from_version: pin.expected_version, | ||
| to_version: fresh | ||
| .entry(&pin.table_key) | ||
| .map(|e| e.table_version) | ||
| .unwrap_or(pin.post_commit_pin), | ||
| }) | ||
| .collect(); | ||
| record_audit( | ||
| root_uri, | ||
| sidecar, | ||
| fresh.version(), | ||
| RecoveryKind::RolledForward, | ||
| outcomes, | ||
| ) | ||
| .await?; | ||
| delete_sidecar_by_operation_id(root_uri, storage.as_ref(), &sidecar.operation_id).await?; | ||
| Ok(true) |
There was a problem hiding this comment.
Duplicate audit row when two concurrent opens both converge the same sidecar
converge_or_defer_roll_forward calls record_audit(…, RecoveryKind::RolledForward, …) unconditionally when sidecar_intent_satisfied is true. But in the concurrent-open scenario covered by the new test, the open that WON the CAS already appended a RolledForward row in the normal roll_forward_all → record_audit path. The open that lost the CAS and ended up here appends a second row for the same operation_id. The result is two RolledForward entries in _graph_commit_recoveries.lance for one recovery event — visible in omnigraph commit list --filter actor=omnigraph:recovery and any audit-consuming tooling. The test does not assert on the audit record count, so this goes undetected. Consider recording with a distinct RecoveryKind (e.g. ConvergedForward) or skipping the audit entry when the sidecar is already absent, to keep the audit table append-idempotent across concurrent sweeps.
| let rv = helpers::failpoint::Rendezvous::park_first( | ||
| "mutation.delete_node_pre_primary_delete", | ||
| ); |
There was a problem hiding this comment.
Bare string literal bypasses the
failpoint_names_guard. The guard pattern "park_first(\"" checks whether park_first( and " appear on the same line; splitting the call across two lines silently sidesteps it. MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE is already in the names catalog — use it here so the guard actually covers this site.
| let rv = helpers::failpoint::Rendezvous::park_first( | |
| "mutation.delete_node_pre_primary_delete", | |
| ); | |
| let rv = helpers::failpoint::Rendezvous::park_first( | |
| names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE, | |
| ); |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


What & why
The open-time recovery sweep's roll-forward published a manifest CAS at the sidecar's pinned
expected_version; under a concurrent writer that advanced the manifest past it during the classify→publish window, the CAS failed withExpectedVersionMismatchand aborted the wholeOmnigraph::open(iss-schema-apply-reopen-recovery-race). A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15), so on that conflict the sweep now re-reads and either records the already-achieved roll-forward (converge) or defers to the next pass — never failing the open. The PR also hardens the failpoint test harness (oneScopedFailPointwithwith_callback,#[serial]on every failpoint test, a deterministicRendezvousreplacing fixed sleeps, and a compile-checkedfailpoints::namescatalog enforced by a source-walk guard) — which is what makes the deterministic red→green regression possible — and gates a test-onlyTableStore::append_batchbehind#[cfg(test)].Backing issue / RFC
iss-schema-apply-reopen-recovery-race(maintainer internal process per the template note)Checklist
cfg(test)gate are the enabling/supporting refactors, each in its own commitopen_sweep_roll_forward_converges_when_manifest_advances_concurrently)invariants.mdknown-gap,testing.md)Notes for reviewers
01d726c0to reproduceExpectedVersionMismatch { expected: 1, actual: 2 };496d9bf7turns it green.current_version >= observed lance_head) is sound under the heal-first invariant — every manifest-advancing writer heals pending sidecars first, oroptimizerefuses on an unrecovered graph — documented atsidecar_intent_satisfied.iss-recovery-sweep-live-writer-rollback) is destructive (LanceRestore) and still needs a cross-process lease — the known-gap is updated accordingly.latest_per_tableit already computed at conflict detection, so the converge path avoids a re-open / second TOCTOU.Note
High Risk
Changes core manifest recovery and open-time reconciliation on CAS races; incorrect convergence logic could mask real conflicts or leave stale sidecars.
Overview
Fixes
iss-schema-apply-reopen-recovery-race: when open-time recovery roll-forward loses a manifest CAS because another writer advanced the manifest during classify→publish, the sweep no longer failsOmnigraph::openwithExpectedVersionMismatch. It re-reads the manifest and either converges (audit + delete sidecar if the sidecar’s Lance intent is already satisfied) or defers the sidecar for a later pass. A new failpointrecovery.before_roll_forward_publishpins that window for tests.Supporting changes: compile-checked
failpoints::namescatalogs (engine + cluster) with a source-walk guard; one sharedScopedFailPoint(includingwith_callback);#[serial]on failpoint integration tests;Rendezvousfor deterministic multi-thread tests instead of sleeps; cluster tests use engineScopedFailPoint;TableStore::append_batchgated to#[cfg(test)]. Dev docs updated for convergence vs remaining roll-back / cross-process gaps.Reviewed by Cursor Bugbot for commit 496d9bf. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
Fixes
iss-schema-apply-reopen-recovery-race: when the open-time recovery sweep's roll-forward publish loses a manifest CAS to a concurrent writer during the classify→publish window, it now re-reads the live manifest and either converges (records audit + deletes the sidecar idempotently) or defers to the next open, rather than abortingOmnigraph::openwithExpectedVersionMismatch. The correctness argument rests on the heal-first invariant — any writer that advances a table's manifest version must have healed all pending sidecars first, socurrent_version >= lance_headimplies the sidecar's intent was already published.converge_or_defer_roll_forward+sidecar_intent_satisfiedinrecovery.rs; new failpointRECOVERY_BEFORE_ROLL_FORWARD_PUBLISHpins the TOCTOU window for the deterministic regression test.failpoints::namescatalog,ScopedFailPoint::with_callback,#[serial]on all failpoint tests,Rendezvousreplacing fixed sleeps, source-walkfailpoint_names_guard, cluster tests reuse engine'sScopedFailPoint.TableStore::append_batchgated to#[cfg(test)], enforcing no new engine call sites by construction.Confidence Score: 4/5
The convergence logic is well-argued and the soundness proof is documented inline; the two findings are contained to audit semantics and a test-hygiene gap in the new guard.
The core converge_or_defer_roll_forward path correctly re-reads the manifest and gates convergence on the heal-first invariant. The audit duplication issue means operators can see two RolledForward entries for one recovery event when two concurrent opens race, but the graph state itself is consistent. The multi-line park_first call bypasses the new names-guard for one site. Both are non-blocking quality issues.
recovery.rs (converge path + audit deduplication) and tests/failpoints.rs (multi-line park_first literal)
Important Files Changed
Reviews (1): Last reviewed commit: "fix(recovery): converge roll-forward whe..." | Re-trigger Greptile