fix(recovery): converge roll-forward on concurrent manifest advance by ragnorc · Pull Request #296 · ModernRelay/omnigraph

ragnorc · 2026-06-21T14:19:03Z

What & why

The open-time recovery sweep's roll-forward published a manifest CAS at the sidecar's pinned expected_version; under a concurrent writer that advanced the manifest past it during the classify→publish window, the CAS failed with ExpectedVersionMismatch and aborted the whole Omnigraph::open (iss-schema-apply-reopen-recovery-race). A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15), so on that conflict the sweep now re-reads and either records the already-achieved roll-forward (converge) or defers to the next pass — never failing the open. The PR also hardens the failpoint test harness (one ScopedFailPoint with with_callback, #[serial] on every failpoint test, a deterministic Rendezvous replacing fixed sleeps, and a compile-checked failpoints::names catalog enforced by a source-walk guard) — which is what makes the deterministic red→green regression possible — and gates a test-only TableStore::append_batch behind #[cfg(test)].

Backing issue / RFC

Fixes an accepted issue: tracked internally as iss-schema-apply-reopen-recovery-race (maintainer internal process per the template note)

Checklist

Change is focused (one logical change) — the recovery fix; the harness hardening and the cfg(test) gate are the enabling/supporting refactors, each in its own commit
Tests added/updated for behavior changes — a deterministic red→green failpoint regression (open_sweep_roll_forward_converges_when_manifest_advances_concurrently)
Public docs updated if user-facing surface changed — N/A (no user-facing surface change); dev docs updated (invariants.md known-gap, testing.md)
Reviewed against docs/dev/invariants.md — no Hard Invariant weakened; the fix instantiates invariants 7 & 15 (idempotent physical reconciliation; fail loudly only on a genuine logical conflict)

Notes for reviewers

rule-12 red→green is split across two commits: check out 01d726c0 to reproduce ExpectedVersionMismatch { expected: 1, actual: 2 }; 496d9bf7 turns it green.
The convergence proxy (current_version >= observed lance_head) is sound under the heal-first invariant — every manifest-advancing writer heals pending sidecars first, or optimize refuses on an unrecovered graph — documented at sidecar_intent_satisfied.
Scope: closes only the roll-forward face. The roll-back twin (iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and still needs a cross-process lease — the known-gap is updated accordingly.
Follow-up considered, not done: have the publisher return the fresh latest_per_table it already computed at conflict detection, so the converge path avoids a re-open / second TOCTOU.

Note

High Risk
Changes core manifest recovery and open-time reconciliation on CAS races; incorrect convergence logic could mask real conflicts or leave stale sidecars.

Overview
Fixes iss-schema-apply-reopen-recovery-race: when open-time recovery roll-forward loses a manifest CAS because another writer advanced the manifest during classify→publish, the sweep no longer fails Omnigraph::open with ExpectedVersionMismatch. It re-reads the manifest and either converges (audit + delete sidecar if the sidecar’s Lance intent is already satisfied) or defers the sidecar for a later pass. A new failpoint recovery.before_roll_forward_publish pins that window for tests.

Supporting changes: compile-checked failpoints::names catalogs (engine + cluster) with a source-walk guard; one shared ScopedFailPoint (including with_callback); #[serial] on failpoint integration tests; Rendezvous for deterministic multi-thread tests instead of sleeps; cluster tests use engine ScopedFailPoint; TableStore::append_batch gated to #[cfg(test)]. Dev docs updated for convergence vs remaining roll-back / cross-process gaps.

^{Reviewed by Cursor Bugbot for commit 496d9bf. Bugbot is set up for automated code reviews on this repo. Configure here.}

Greptile Summary

Fixes iss-schema-apply-reopen-recovery-race: when the open-time recovery sweep's roll-forward publish loses a manifest CAS to a concurrent writer during the classify→publish window, it now re-reads the live manifest and either converges (records audit + deletes the sidecar idempotently) or defers to the next open, rather than aborting Omnigraph::open with ExpectedVersionMismatch. The correctness argument rests on the heal-first invariant — any writer that advances a table's manifest version must have healed all pending sidecars first, so current_version >= lance_head implies the sidecar's intent was already published.

Core fix: converge_or_defer_roll_forward + sidecar_intent_satisfied in recovery.rs; new failpoint RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH pins the TOCTOU window for the deterministic regression test.
Harness hardening: compile-checked failpoints::names catalog, ScopedFailPoint::with_callback, #[serial] on all failpoint tests, Rendezvous replacing fixed sleeps, source-walk failpoint_names_guard, cluster tests reuse engine's ScopedFailPoint.
TableStore::append_batch gated to #[cfg(test)], enforcing no new engine call sites by construction.

Confidence Score: 4/5

The convergence logic is well-argued and the soundness proof is documented inline; the two findings are contained to audit semantics and a test-hygiene gap in the new guard.

The core converge_or_defer_roll_forward path correctly re-reads the manifest and gates convergence on the heal-first invariant. The audit duplication issue means operators can see two RolledForward entries for one recovery event when two concurrent opens race, but the graph state itself is consistent. The multi-line park_first call bypasses the new names-guard for one site. Both are non-blocking quality issues.

recovery.rs (converge path + audit deduplication) and tests/failpoints.rs (multi-line park_first literal)

Important Files Changed

Filename	Overview
crates/omnigraph/src/db/manifest/recovery.rs	Core fix: adds converge_or_defer_roll_forward to handle ExpectedVersionMismatch during open-time roll-forward publish; new sidecar_intent_satisfied proxy relies on heal-first invariant; process_sidecar signature widened to Arc; potential duplicate audit record on concurrent-open convergence.
crates/omnigraph/src/failpoints.rs	Adds compile-checked names catalog, consolidates ScopedFailPoint (with with_callback) into the engine crate, removes the duplicate from omnigraph-cluster.
crates/omnigraph/tests/failpoints.rs	All tests gain #[serial]; Rendezvous replaces sleeps/loops; new regression test added; one park_first call retains a bare string literal that bypasses the names guard.
crates/omnigraph/tests/helpers/failpoint.rs	New Rendezvous helper: park-first-arrival callback with bounded 30s spin, async wait_until_reached with 12s timeout, release, and RAII-cleanup via _failpoint drop.
crates/omnigraph/tests/failpoint_names_guard.rs	New source-walk guard enforcing names catalog usage; same-line pattern matching misses multi-line call sites like the one in inline_delete_conflict_writes_sidecar_before_rejecting.
crates/omnigraph/src/table_store.rs	append_batch gated to #[cfg(test)]; enforces no new engine call sites by construction and silences the dead-code warning.
crates/omnigraph-cluster/src/failpoints.rs	Removes duplicate ScopedFailPoint; cluster tests now use omnigraph::failpoints::ScopedFailPoint via the engine's failpoints feature dep.
docs/dev/invariants.md	Known Gaps updated: roll-forward convergence closed; roll-back twin still requires cross-process lease.

_{Reviews (1): Last reviewed commit: "fix(recovery): converge roll-forward whe..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

…(test) The inherent append_batch is used only by in-source recovery test setup, but the non-test lib build (cfg(test) off) cannot see those callers and emitted a dead_code warning. Gating the method #[cfg(test)] silences the false positive and enforces its own doc contract ("no new engine call sites") by construction — engine code physically cannot call a cfg(test) method.

…ward CAS race Hardens the test infrastructure around the process-global `fail` registry, and adds a deterministic red repro for the open-time recovery sweep's roll-forward CAS race (iss-schema-apply-reopen-recovery-race). The fix lands in the next commit — this commit is intentionally red (rule 12: red→green visible in log). Harness: - One `ScopedFailPoint` (engine) gaining `with_callback`; the cluster duplicate is removed and cluster tests reuse the engine type via `omnigraph/failpoints`. - `#[serial]` on every failpoint test (the registry is process-global, so shared names interfere under parallelism); `serial_test` added to cluster dev-deps. - `helpers::failpoint::Rendezvous` (park-first / wait-until-reached / release) replaces fixed-`sleep` cross-thread coordination; the three concurrent tests now rendezvous deterministically. The reached flag doubles as a fired-assert. - Compile-checked `failpoints::names` catalog (engine + cluster); every call site references a const, and `failpoint_names_guard.rs` enforces "no string literal names" by source-walk, so a typo is a build error not a silent no-fire. Red repro: - New `recovery.before_roll_forward_publish` failpoint at the sweep's classify -> publish-CAS window (the only injection point there). - `open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one pending sidecar; the sweep parked at the failpoint loses its publish CAS to the other and fails the open with `ExpectedVersionMismatch`. FAILS at this commit by design.

…rrently The open-time recovery sweep classified a pending sidecar as RolledPastExpected, then published a manifest CAS at the sidecar's pinned expected_version. Under a concurrent writer that advanced the manifest past expected during the classify -> publish window, the CAS failed with ExpectedVersionMismatch and `?`-propagated, failing the whole Omnigraph::open. iss-schema-apply-reopen-recovery-race. A roll-forward's postcondition is "the manifest reflects the sidecar's committed Lance state", not "this sweep won the CAS" (invariants 7 & 15). On an ExpectedVersionMismatch, re-read the live manifest and check whether the sidecar's intent is already satisfied (every pinned table at a version >= the one we observed and tried to publish; added tables registered; tombstones gone — sound under the heal-first invariant, documented at the check). If satisfied, this is convergence: record the RolledForward audit + delete the sidecar idempotently. If only partway, defer to the next pass. Either way the open no longer fails. Other errors still propagate; a genuine logical conflict resurfaces via the classifier's InvariantViolation. Turns the red repro from the previous commit green. The roll-BACK twin (iss-recovery-sweep-live-writer-rollback) is destructive (Lance Restore) and still needs a cross-process lease — the known-gap is updated accordingly.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.}

cursor · 2026-06-21T14:21:35Z

+        // then lands deterministically before the delete resumes, so the
+        // delete's manifest CAS is guaranteed stale — no retry loop, no sleep.
+        let rv = helpers::failpoint::Rendezvous::park_first(
+            "mutation.delete_node_pre_primary_delete",


Failpoint guard literal violation

Medium Severity

The new Rendezvous::park_first call passes a bare string for mutation.delete_node_pre_primary_delete instead of names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE. The added failpoint_names_use_the_compile_checked_catalog guard treats any park_first(" line in tests/failpoints.rs as a violation, so this commit should fail that test in CI.

^{Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.}

cursor · 2026-06-21T14:21:35Z

+        RecoveryKind::RolledForward,
+        outcomes,
+    )
+    .await?;


Duplicate recovery audit on converge

Low Severity

When a roll-forward publish loses a CAS but the manifest already reflects the sidecar goal, converge_or_defer_roll_forward always appends another RolledForward audit. If another open sweep already rolled forward, recorded audit, and deleted the sidecar, the losing sweep still writes a second audit row for the same operation_id.

^{Reviewed by Cursor Bugbot for commit 496d9bf. Configure here.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 496d9bf757

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-21T14:23:58Z

+                .unwrap_or(pin.post_commit_pin),
+        })
+        .collect();
+    record_audit(


Check audit idempotency before converging

When two recovery sweeps race the same sidecar, the sweep that loses the manifest CAS now reaches this convergence path after the winner has already published, recorded the RolledForward audit, and deleted the sidecar. This unconditional record_audit appends a second recovery audit/commit for the same operation_id (the new regression test exercises exactly this race but only checks readability), so operators will see duplicate recoveries and helpers that expect one audit row would fail. Re-check whether the sidecar/audit still needs cleanup before recording here.

Useful? React with 👍 / 👎.

greptile-apps · 2026-06-21T14:44:42Z

+async fn converge_or_defer_roll_forward(
+    root_uri: &str,
+    storage: &std::sync::Arc<dyn StorageAdapter>,
+    sidecar: &RecoverySidecar,
+    states: &[ClassifiedTable],
+    conflict: OmniError,
+) -> Result<bool> {
+    let fresh = fresh_snapshot_for_sidecar(root_uri, storage, sidecar).await?;
+    if !sidecar_intent_satisfied(&fresh, sidecar, states) {
+        warn!(
+            operation_id = sidecar.operation_id.as_str(),
+            writer_kind = ?sidecar.writer_kind,
+            "recovery: roll-forward publish lost a CAS and the manifest has not \
+             yet reached the sidecar's goal; deferring to the next pass \
+             (conflict: {conflict})"
+        );
+        return Ok(false);
+    }
+    warn!(
+        operation_id = sidecar.operation_id.as_str(),
+        writer_kind = ?sidecar.writer_kind,
+        "recovery: roll-forward publish lost a CAS to a concurrent writer that \
+         already reached the goal; converging (RolledForward audit + delete)"
+    );
+    let outcomes: Vec<TableOutcome> = sidecar
+        .tables
+        .iter()
+        .map(|pin| TableOutcome {
+            table_key: pin.table_key.clone(),
+            from_version: pin.expected_version,
+            to_version: fresh
+                .entry(&pin.table_key)
+                .map(|e| e.table_version)
+                .unwrap_or(pin.post_commit_pin),
+        })
+        .collect();
+    record_audit(
+        root_uri,
+        sidecar,
+        fresh.version(),
+        RecoveryKind::RolledForward,
+        outcomes,
+    )
+    .await?;
+    delete_sidecar_by_operation_id(root_uri, storage.as_ref(), &sidecar.operation_id).await?;
+    Ok(true)


Duplicate audit row when two concurrent opens both converge the same sidecar

converge_or_defer_roll_forward calls record_audit(…, RecoveryKind::RolledForward, …) unconditionally when sidecar_intent_satisfied is true. But in the concurrent-open scenario covered by the new test, the open that WON the CAS already appended a RolledForward row in the normal roll_forward_all → record_audit path. The open that lost the CAS and ended up here appends a second row for the same operation_id. The result is two RolledForward entries in _graph_commit_recoveries.lance for one recovery event — visible in omnigraph commit list --filter actor=omnigraph:recovery and any audit-consuming tooling. The test does not assert on the audit record count, so this goes undetected. Consider recording with a distinct RecoveryKind (e.g. ConvergedForward) or skipping the audit entry when the sidecar is already absent, to keep the audit table append-idempotent across concurrent sweeps.

greptile-apps · 2026-06-21T14:44:43Z

+        let rv = helpers::failpoint::Rendezvous::park_first(
+            "mutation.delete_node_pre_primary_delete",
        );


Bare string literal bypasses the failpoint_names_guard. The guard pattern "park_first(\"" checks whether park_first( and " appear on the same line; splitting the call across two lines silently sidesteps it. MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE is already in the names catalog — use it here so the guard actually covers this site.

Suggested change

let rv = helpers::failpoint::Rendezvous::park_first(

"mutation.delete_node_pre_primary_delete",

);

let rv = helpers::failpoint::Rendezvous::park_first(

names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE,

);

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

ragnorc added 3 commits June 21, 2026 14:42

cursor Bot reviewed Jun 21, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 21, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(recovery): converge roll-forward on concurrent manifest advance#296

fix(recovery): converge roll-forward on concurrent manifest advance#296
ragnorc wants to merge 3 commits into
mainfrom
correctness-by-design-fix

ragnorc commented Jun 21, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 21, 2026

Uh oh!

cursor Bot Jun 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 21, 2026

Uh oh!

greptile-apps Bot Jun 21, 2026

Uh oh!

greptile-apps Bot Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ragnorc commented Jun 21, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

Backing issue / RFC

Checklist

Notes for reviewers

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 21, 2026

Choose a reason for hiding this comment

Failpoint guard literal violation

Uh oh!

cursor Bot Jun 21, 2026

Choose a reason for hiding this comment

Duplicate recovery audit on converge

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ragnorc commented Jun 21, 2026 •

edited by greptile-apps Bot

Loading