Skip to content

fix(engine): optimize survives a cross-process write race on the same table#297

Merged
ragnorc merged 5 commits into
mainfrom
fix-optimize-concurrency-race
Jun 22, 2026
Merged

fix(engine): optimize survives a cross-process write race on the same table#297
ragnorc merged 5 commits into
mainfrom
fix-optimize-concurrency-race

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

What & why

A prod-discovered correctness bug (R2, while running #291's optimize): a direct CLI optimize racing a served-server write on the same table fails, with two signatures:

  • Rewrite … preempted by concurrent transaction Update at version N — optimize's Lance compact_files commit losing the OCC race.
  • stale view of 'node:Media': expected manifest table version 14 but current is 15 — optimize's manifest-publish equality CAS losing.

Root cause (validated in code + Lance 7.0.0 source): the data-table optimize path (optimize_one_table) trusted the in-process write queue and skipped a retry — but a CLI optimizer and the server are separate processes with separate WriteQueueManagers, so the queue serializes nothing between them. Lance cannot absorb the conflict either: commit_compaction commits a fixed Rewrite via apply_commit direct, there is no compaction RetryExecutor, and a semantic RetryableCommitConflict escapes commit_transaction's retry loop (io/commit.rs:979). So compaction fundamentally needs reopen + replan — which the internal-table path already did and the data path did not.

This is the maintenance op-class face of the same process-boundary flaw as the logical-class corruption in RFC §6.5. It needs the opposite fix, because compaction commutes (content-preserving): not strict OCC + a lock, but Lance rebase + reopen/replan + a monotonic manifest fast-forward. No writer_epoch.

The fix

Unify both compaction paths on one reopen+replan shape, with a two-level retry matching the two failure points:

  • Outer loop (reopen+replan): a genuine Lance Rewrite-vs-Update/Delete same-fragment overlap means our compaction did not commit — reopen at the new HEAD and re-plan. Lance rebases the common disjoint case (a concurrent insert/update/delete on other fragments) for free, so this fires only on real overlap.
  • Inner loop (Phase C, monotonic publish): the manifest advanced between our compaction and our publish. The compaction is committed at Lance HEAD N, so we must NOT reopen (that would trip the HEAD>manifest drift guard on our own work). Re-read the current manifest version C: if C ≥ N the manifest already includes our compaction (versions are linear) → no-op; else fast-forward to N. Monotonic, never the strict equality CAS that manufactured the conflict.

The Optimize recovery sidecar is written once and reused across attempts (every Phase-B commit is content-preserving, so recovery rolls the observed HEAD forward or safely rolls the compaction back). The in-process queue is kept — now an in-process contention reducer, not the cross-process guard. Shares COMPACTION_RETRY_BUDGET + is_retryable_lance_conflict with the internal path; adds is_retryable_manifest_conflict for the publish loop.

Design note (liability)

The lower-liability shape: the monotonic decision is kept local to optimize.rs and reuses the existing equality-CAS publish unchanged — no new publisher mode, no signature ripple to load/mutate/recovery/repair (they keep strict equality). One engine file changed.

This is the first correct instance of the maintenance-class commit model (RFC §6.6), not a parallel band-aid: when the planned step-5 PublishPlan unification lands, it relocates this (a TableAction::Rewrite carries the same commit model into the single publish authority) rather than reinventing it — the two regression tests are the contract that stays green across that refactor.

Tests (test-first, red→green)

  • optimize_survives_concurrent_insert_advancing_manifest
  • optimize_survives_concurrent_delete_before_compaction

Both reproduce the prod manifest-CAS signature via a two-handle (cross-process-simulated) race + the optimize.post_phase_b_pre_manifest_commit / new optimize.before_compact failpoints. RED in the first commit, GREEN after the fix.

Verification

  • cargo test --workspace --locked — green (65 suites, 0 failed).
  • maintenance / composite_flow / full failpoints (recovery) suites — no regressions.

Out of scope (tracked separately)

  • Persistent node:Media drift_needs_repair (a narrower index-op recovery deferral).
  • Blob compaction (node:Artifact) — upstream Lance blob-v2 blocker.
  • The full writer unification + writer_epoch for the logical class — RFC step 5.

Note

Medium Risk
Changes core optimize manifest publish and Lance retry behavior on a live maintenance path; regression tests and sidecar recovery mitigate, but incorrect drift classification could still strand HEAD/manifest gaps until repair.

Overview
Fixes a prod bug where CLI optimize and a served write in another process could fail on the same table: Lance Rewrite conflicts or a manifest publish equality CAS (expected X but current Y) because the in-process write queue does not serialize across processes.

optimize_one_table now uses a two-level retry: an outer reopen-and-replan loop on retryable Lance conflicts during Phase B (strip, compact, reindex, index build); an inner Phase C loop that monotonic fast-forwards the manifest (no-op when current >= compacted version, else CAS against fresh current) instead of pinning to the pre-compaction version. A head_advanced flag keeps own Phase-B HEAD-ahead-of-manifest drift from being treated as external drift (sidecar retained, publish instead of skipped_for_drift). The Optimize sidecar is written once and reused across reopen attempts. COMPACTION_RETRY_BUDGET is shared with the internal-table path; is_retryable_manifest_conflict and current_manifest_version support the publish loop.

Tests add cross-process-simulated races (concurrent insert/delete before compact), own-HEAD drift on reindex retry, and #[serial(optimize)] on related failpoint tests. RFC-013 §6.6 documents maintenance vs logical op-class commit models.

Reviewed by Cursor Bugbot for commit e9d16a2. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

Fixes a prod correctness bug where optimize (CLI) racing a served writer on the same table could fail with either a Lance Rewrite-vs-Update OCC loss or a manifest equality-CAS mismatch — both caused by the in-process write queue not serialising across processes.

  • optimize_one_table gains a two-level retry: an outer reopen+replan loop for genuine Lance Rewrite overlap conflicts, and an inner Phase-C monotonic fast-forward replacing the strict equality-CAS publish (no-op when manifest ≥ N, else CAS against freshly read current). The recovery sidecar is written once and reused across attempts; head_advanced prevents the drift guard from misclassifying our own prior Phase-B commits as external drift.
  • Constant rename INTERNAL_COMPACTION_RETRY_BUDGET → COMPACTION_RETRY_BUDGET unifies the budget across both compaction paths; adds is_retryable_manifest_conflict and current_manifest_version helpers.
  • Three regression tests cover the concurrent-insert race, concurrent-delete race, and own-HEAD drift misclassification, each reproducing the prod failure signatures as RED before the fix and GREEN after.

Confidence Score: 5/5

Safe to merge; the change is mechanically narrow (one engine file, one test file, one RFC doc), all three new tests reproduce the prod failure signatures with explicit RED→GREEN proof, and the previous two blocking review threads have been fully addressed in this head commit.

The root cause is accurately identified and closed by design: commutative maintenance ops now use a monotonic fast-forward instead of an equality CAS, so a concurrent writer that builds on the compaction causes a convergent no-op rather than a spurious error. The outer reopen+replan loop mirrors the already-correct internal-table path. The head_advanced flag correctly distinguishes our own prior Phase-B commits from external drift, preventing the sidecar from being prematurely deleted on retry. The post-budget convergence check (re-read current before surfacing an error) plugs the Phase-C exhaustion gap from the prior thread. Recovery semantics are preserved: the sidecar is written before any HEAD-advancing op and deleted after successful publish, and its loose-match classifier handles the stale expected_version across retries as documented.

No files require special attention; optimize.rs change is self-contained and the affected code paths are all exercised by the new failpoint tests.

Important Files Changed

Filename Overview
crates/omnigraph/src/db/omnigraph/optimize.rs Core fix: adds outer reopen+replan loop and inner monotonic Phase-C fast-forward to optimize_one_table, replacing the single-shot equality-CAS publish; renames INTERNAL_COMPACTION_RETRY_BUDGET → COMPACTION_RETRY_BUDGET; adds is_retryable_manifest_conflict and current_manifest_version helpers; fixes post-budget exhaustion path to check convergence before surfacing error. Well-reasoned and correct by design.
crates/omnigraph/tests/failpoints.rs Adds three regression tests (concurrent insert, concurrent delete, own-HEAD drift misclassification) plus wait_for_sidecar polling helper; adds #[serial(optimize)] to existing optimize test. Test logic is sound and covers the two key failure signatures from the prod incident.
docs/dev/rfc-013-write-path-latency.md Adds §6.6 documenting the two op-class distinction (maintenance vs logical write), the root cause, the fix, and its relationship to the planned step-5 unification; adds a table row marking the maintenance-class commit model as LANDED. Clear and accurate.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as CLI optimize (separate process)
    participant SRV as Served writer (same table)
    participant Lance as Lance dataset
    participant Manifest as __manifest

    CLI->>CLI: acquire in-process write queue
    CLI->>Lance: "open_dataset_head_for_write (HEAD=V₀)"
    CLI->>Manifest: "fresh_snapshot → expected_version=V₀"
    CLI->>CLI: "Phase A — write sidecar (expected=V₀, head_advanced=false)"
    CLI->>CLI: failpoint: optimize.before_compact (pause)
    SRV->>Lance: "commit insert/delete → HEAD=V₀+1"
    SRV->>Manifest: publish V₀+1
    CLI->>CLI: failpoint released
    CLI->>Lance: "compact_files (on ds@V₀, Lance rebases over V₀+1)"
    Lance-->>CLI: Ok(metrics) — HEAD now V₀+3 (reserve+rewrite)
    CLI->>CLI: "head_advanced=true"
    CLI->>Lance: optimize_indices, build_indices
    CLI->>Manifest: "Phase C: read current=V₀+1"
    Note over CLI,Manifest: current(V₀+1) < state.version(V₀+3) → fast-forward
    CLI->>Manifest: "CAS expected=V₀+1 → publish V₀+3"
    Manifest-->>CLI: Ok
    CLI->>CLI: Phase D — delete sidecar
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant CLI as CLI optimize (separate process)
    participant SRV as Served writer (same table)
    participant Lance as Lance dataset
    participant Manifest as __manifest

    CLI->>CLI: acquire in-process write queue
    CLI->>Lance: "open_dataset_head_for_write (HEAD=V₀)"
    CLI->>Manifest: "fresh_snapshot → expected_version=V₀"
    CLI->>CLI: "Phase A — write sidecar (expected=V₀, head_advanced=false)"
    CLI->>CLI: failpoint: optimize.before_compact (pause)
    SRV->>Lance: "commit insert/delete → HEAD=V₀+1"
    SRV->>Manifest: publish V₀+1
    CLI->>CLI: failpoint released
    CLI->>Lance: "compact_files (on ds@V₀, Lance rebases over V₀+1)"
    Lance-->>CLI: Ok(metrics) — HEAD now V₀+3 (reserve+rewrite)
    CLI->>CLI: "head_advanced=true"
    CLI->>Lance: optimize_indices, build_indices
    CLI->>Manifest: "Phase C: read current=V₀+1"
    Note over CLI,Manifest: current(V₀+1) < state.version(V₀+3) → fast-forward
    CLI->>Manifest: "CAS expected=V₀+1 → publish V₀+3"
    Manifest-->>CLI: Ok
    CLI->>CLI: Phase D — delete sidecar
Loading

Reviews (3): Last reviewed commit: "fix(engine): optimize publish converges ..." | Re-trigger Greptile

ragnorc added 3 commits June 22, 2026 10:53
Two regression tests for the prod bug: a direct `optimize` process racing a
served write on the same table fails, because the in-process write queue does
not serialize across processes and the data-table optimize path has no retry.

- optimize_survives_concurrent_insert_advancing_manifest: a concurrent insert
  advances the manifest while optimize is paused between compact and publish;
  optimize's equality-CAS publish then fails "expected X but current Y".
- optimize_survives_concurrent_delete_before_compaction: a concurrent delete
  commits before optimize compacts; Lance rebases the compaction past it
  cleanly, so optimize again fails the publish CAS (the genuine Lance
  Rewrite-vs-Rewrite overlap is rarer and shares the internal path's retry).

Both fail today with ExpectedVersionMismatch. Adds the `optimize.before_compact`
failpoint seam + a wait_for_sidecar helper; serializes the optimize failpoint
tests (shared failpoint name). The fix lands next.
… table

The data-table optimize path trusted the in-process write queue and skipped a
retry, so a CLI `optimize` racing a served write (separate processes = separate
queues) failed: either the Lance Rewrite lost ("preempted by concurrent Update")
or the manifest publish lost the strict equality CAS ("expected X but current Y").

Unify both compaction paths on the internal path's reopen+replan shape, with a
two-level retry that matches the two failure points:

- Outer loop (reopen+replan): a genuine Lance Rewrite-vs-Update/Delete same-
  fragment conflict means our compaction did not commit — reopen at the new HEAD
  and re-plan. Lance rebases the common disjoint case (a concurrent insert/delete
  on other fragments) for free, so this fires only on real overlap.
- Inner loop (Phase C, monotonic publish): the manifest advanced between our
  compaction and our publish. The compaction is already committed at Lance HEAD N,
  so we must NOT reopen (that trips the HEAD>manifest drift guard on our own work).
  Re-read the current manifest version C: if C >= N the manifest already includes
  our compaction (versions are linear) — no-op; else fast-forward to N. Monotonic,
  not the strict equality CAS that manufactured the conflict.

The Phase-A sidecar is written once and reused across reopen attempts (every
Phase-B commit is content-preserving, so recovery rolls the observed HEAD forward
or safely rolls the compaction back). The in-process queue is kept — it is now an
in-process contention reducer, not the cross-process correctness guard. Shares the
COMPACTION_RETRY_BUDGET constant + is_retryable_lance_conflict with the internal
path; adds is_retryable_manifest_conflict for the publish loop. No writer_epoch.

Turns the prior commit's two race tests green.
…rite race

§6.6 records the maintenance vs logical op-class distinction (maintenance commutes
→ Lance rebase + reopen/replan + monotonic manifest fast-forward, no writer_epoch;
logical → strict cross-process OCC + epoch) and the prod optimize-vs-served-write
race that motivated it, now landed. Adds the matching mechanic row to §4.2.
Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 903ede2f95

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +524 to +525
Err(e) if attempt < COMPACTION_RETRY_BUDGET && is_retryable_lance_conflict(&e) => {
continue;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve the sidecar when retrying after a Phase-B commit

When optimize_indices hits a retryable Lance conflict after compact_files has already succeeded, this continue re-enters the outer loop with Lance HEAD advanced by our own compaction while the manifest is still at the old version. The next iteration then takes the lance_head_version > expected_version drift branch, deletes the recovery sidecar as if the drift were external, and returns skipped_for_drift, leaving an uncovered HEAD>manifest gap that can no longer be recovered automatically. This also applies to upgraded graphs if the auto-cleanup config scrub commits before a later retryable conflict, so the retry path needs to distinguish/publish or roll back its own Phase-B commits instead of looping into the generic drift skip.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same finding as the cursor High at optimize.rs:443 — confirmed and fixed in 56d004e0. The retry now tracks head_advanced; once one of our Phase-B ops (auto_cleanup strip / compact_files) has committed, a reopened HEAD > manifest is treated as our own sidecar-covered work (the monotonic Phase-C publish fast-forwards it), not external drift, so the sidecar is not deleted and skipped_for_drift is not returned. Regression test optimize_retry_does_not_misclassify_own_head_drift injects exactly the "compact commits, reindex conflicts" sequence you describe.

Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs Outdated
Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs Outdated
Comment thread crates/omnigraph/tests/failpoints.rs
Review catch on the cross-process optimize fix: the outer retry loop re-ran the
`lance_head > manifest` drift guard every iteration. After a partial Phase-B commit
(the auto_cleanup strip or compaction commits, then a later op hits a retryable
conflict), the reopened attempt saw HEAD ahead of the manifest — from OUR own
sidecar-covered work, not an external writer — and deleted the sidecar + returned
`skipped_for_drift`, stranding uncovered drift that then needs `repair`.

Track `head_advanced` (did one of our Phase-B ops already commit). The drift guard
now fires only when `!head_advanced` (genuine pre-existing external drift); once we
have advanced HEAD, a reopened HEAD>manifest is our work that the monotonic publish
fast-forwards. The no-op early-return likewise publishes prior committed work instead
of dropping it when `head_advanced`.

Regression test `optimize_retry_does_not_misclassify_own_head_drift` injects one
retryable reindex conflict after the compaction commits (new `optimize.inject_
reindex_conflict` seam); red→green verified by negative control (reverting the gate
reproduces `skipped_for_drift: Some(DriftNeedsRepair)`).

Also de-flake `optimize_survives_concurrent_insert_advancing_manifest`: pause at
`before_compact` (not post-compact) so the concurrent insert lands while HEAD==
manifest — otherwise it could race optimize's committed-but-unpublished compaction
and hit the write-path "HEAD ahead of manifest" guard.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 56d004e. Configure here.

Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs
Review catch (greptile): the monotonic Phase-C publish loop returned an error on its
final iteration's retryable manifest conflict, even though that conflict can itself
mean a concurrent writer published a version that already includes our (content-
preserving) compaction — i.e. the postcondition ("the manifest reflects our
compaction") is already met. Recovery covered it (no data loss), but the operator
saw a spurious error and had to re-run.

Restructure the loop to re-read `current` on every retryable conflict and, on budget
exhaustion, do a final `current >= state.version` convergence check before surfacing
the error — the §6.6 "postcondition is the state, not winning the CAS" principle.
Factor the repeated current-version read into `current_manifest_version`.
@ragnorc ragnorc merged commit 6d4606a into main Jun 22, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant