Skip to content

write-path cost gate + opener bypass#288

Merged
ragnorc merged 10 commits into
mainfrom
rfc-013-write-path-latency
Jun 20, 2026
Merged

write-path cost gate + opener bypass#288
ragnorc merged 10 commits into
mainfrom
rfc-013-write-path-latency

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

What & why

Write-path table opens routed through Lance's DatasetBuilder::from_namespace, whose describe_table opened the whole dataset just to return a location and then re-resolved the latest version — an O(commit-history depth) double latest-resolution per table open that missed Lance's O(1) version-hint fast path. On an object store this dominated write/branch latency (~70%, measured; RFC-013 §2.4) and got worse as a graph aged. Reads already bypassed this (PR #268); this completes the same open-by-location migration on the write side.

This PR lands RFC-013 step 1 (the cost gate + shared harness) and step 3a (the opener bypass — the dominant fix).

What's in this PR

perf(engine) — opener bypass + namespace retirement

  • TableStore::open_dataset_head_for_write now delegates to the direct opener (Dataset::open by URI + checkout_branch, routed through the tracked opener so cost tests can count it; a no-op in production). ensure_expected_version still validates head == pinned for strict ops.
  • With both reads (perf(engine): remove the per-query metadata re-derivation tax on warm reads #268) and now writes bypassing it, nothing in production routes through the per-table Lance namespace. The dead open chain (load_table_from_namespace, open_table_head_for_write) is deleted and the StagedTableNamespace contract apparatus is retired to a #[cfg(test)] mod (kept only to validate the LanceNamespace contract). __manifest commit coordination (GraphNamespacePublisher) is a separate component and is unaffected.

test(engine) — cost-budget gate on a shared, path-classifying harness

  • New tests/helpers/cost.rs: a store-agnostic harness (IoCounts/StagedCounts, measure/measure_with_staged, assert_flat/assert_grows, local_graph/s3_graph) — the single home for the IO-counting plumbing. Its data-table wrapper is a PrefixCounter that attributes each read to the opener term (_versions//.manifest) vs the scan term (data//*.lance), so cost tests isolate the opener directly.
  • write_cost.rs (local, every-PR): the internal-table scan term flat in depth (a RED #[ignore]'d LOCK — step 2's red→green target) + green guards (bounded data writes; a per-write read-op ceiling; a keyed insert routes through stage_merge_insert once) + a validator proving the opener/scan split (opener flat, scan grows).
  • write_cost_s3.rs (bucket-gated, RustFS CI): asserts data_opener_reads flat across depth — the proof of the step-3a win on a real object store. Wired into the rustfs_integration job + path filter.
  • warm_read_cost.rs migrated onto the shared harness (removing its duplicate local probes()).

docs — RFC-013 §9 marks step 1 / 3a landed and the namespace retirement; §7.1/§6/§11/§5.5/§10 correct the cross-table write-skew analysis (a prototype proved the scoped fix is a no-op without a shared graph_head contention row → coupled to Phase 7; concurrent-face vs sequential-face split). testing.md documents the harness + the local-vs-S3 backend split.

Measured impact (RFC §2.4 [M]; the S3 gate verifies it in CI)

per-write depth slope depth-80ish edge write
Before O(depth): data + internal ~1618–1720 ops
This PR (step 3a) data term flat; internal still ~+5/depth ~593 ops (2.7×)
After step 2 (+3b) flat ~198 ops (~8.7×), flat
  • Data-table opener: 31 + 12·depth → flat 4. The key shape change: per-write cost no longer grows with the data table's version history.
  • Object-store deployments only (local FS resolves latest with one read_dir either way); the win grows with graph history and with the number of tables a write touches (load/merge/branch amplify it).

Test status

  • cargo test --workspace --locked green locally (writes, branch-merge, fork-on-first-write, overwrite, recovery — the paths the namespace open was suspected to be load-bearing on; the suite refutes it).
  • write_cost_s3 verified green on RustFSdata_opener_reads flat across commit depth on a real object store (a prior run, pre-fix, failed here with opener grew 20 → 50, empirically confirming the fix is load-bearing).
  • PR-gating checks (aws-feature, agents-md, container, Greptile, Cursor Bugbot) all pass. The full Test Workspace + RustFS S3 suites are PR-skipped by deliberate CI design and run post-merge on main.

Follow-on PRs (separate, after this merges)

  • Step 2 — internal-table compaction (__manifest/_graph_commits into all_table_keys), which un-ignores write_cost.rs's internal-table LOCK and delivers the remaining ~3× / full flatness. Must land with the Q8 boundary watermark (Lance version-CAS is cleanup-resurrection-vulnerable, RFC §12 Q8).
  • Step 3b — WriteTxn (capture-once + pinned-version opens + handle reuse + shared Session), retiring the flat-46 schema-read constant.
  • §7.1 write-skew — concurrent face (read-set + shared graph_head, coupled to Phase 7) and sequential face (inbound-RI validation on node removal) — two separate fixes.
  • Step 1c — an always-on prod storage.ops span metric (the test gate is the durable guard).

See docs/dev/rfc-013-write-path-latency.md for the full design and sequencing.


Note

Medium Risk
Changes the hot write open path and removes production namespace opens; mitigated by full workspace tests, S3 cost gate in CI, and retained test-only namespace contract coverage.

Overview
This PR lands RFC-013 step 3a: write-path table opens no longer go through DatasetBuilder::from_namespace (which did O(commit-depth) latest-version resolution). TableStore::open_dataset_head_for_write now delegates to a direct open_dataset_head (Dataset::open by URI + branch checkout), routed through open_dataset_tracked so cost tests can count opens (no-op when probes are unset in production).

With reads already opening by location, production no longer uses the per-table Lance namespace. The dead chain (load_table_from_namespace, open_table_head_for_write) is removed; StagedTableNamespace and related manifest namespace wiring move behind #[cfg(test)] for contract tests only. GraphNamespacePublisher for __manifest is unchanged.

Testing: New tests/helpers/cost.rs centralizes measure / IoCounts / PrefixCounter (opener vs scan reads). write_cost.rs runs on local FS (green guards + a RED #[ignore] internal-table flatness lock for step 2). write_cost_s3.rs asserts data_opener_reads stay flat on S3. warm_read_cost.rs drops duplicate probe wiring. CI path filters and the RustFS job run write_cost_s3.

Docs: RFC-013 §9 marks steps 1/3a landed; testing.md and index.md document the harness and backend split.

Reviewed by Cursor Bugbot for commit c9183b3. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

Completes RFC-013 steps 1 and 3a: the write path's O(depth) DatasetBuilder::from_namespace open is replaced by a direct Dataset::open-by-URI call (O(1)), and the Lance namespace layer is retired to #[cfg(test)]-only contract validation since nothing in production routes through it anymore.

  • table_store.rs: open_dataset_head_for_write now calls open_dataset_head (tracked direct open) instead of the namespace builder; table_key is retained for signature stability but silently discarded.
  • namespace.rs / manifest.rs: load_table_from_namespace and open_table_head_for_write deleted; entire namespace module gated to #[cfg(test)], leaving GraphNamespacePublisher as the only production path through the manifest layer.
  • Test harness: PrefixCounter splits data-table reads into opener vs scan terms; write_cost.rs runs every-PR on local FS; write_cost_s3.rs asserts data_opener_reads flat across depth on a real object store — the step 3a acceptance test.

Confidence Score: 5/5

Safe to merge — the write-path change is a drop-in open strategy swap backed by a passing test suite and an S3 cost gate confirming the fix is load-bearing.

The core change replaces the namespace-builder open with a direct URI open that already existed and was proven on the read side; the namespace retirement is mechanical and consistently gated at the module level. The test harness provides both local every-PR guards and a real-S3 acceptance test.

No files require special attention.

Important Files Changed

Filename Overview
crates/omnigraph/src/table_store.rs Core fix: open_dataset_head_for_write now delegates to open_dataset_head (direct URI, O(1)) instead of the lance-namespace builder. table_key retained as a dead param for signature stability.
crates/omnigraph/src/db/manifest/namespace.rs Dead load_table_from_namespace and open_table_head_for_write functions removed; per-item #[cfg(test)] guards dropped since the whole module is now #[cfg(test)] in manifest.rs. Logically consistent.
crates/omnigraph/tests/helpers/cost.rs New shared cost-test harness: PrefixCounter wrapper splits reads into opener vs scan terms; measure/assert_flat/assert_grows/local_graph/s3_graph give all cost tests a single vocabulary. S3 init/seed failures now panic when bucket is configured.
crates/omnigraph/tests/write_cost.rs New local write cost gate: green guards (ceiling, bounded writes, merge-insert routing) + opener/scan split validator + a #[ignore]'d RED LOCK for the step-2 internal-table flatness target.
crates/omnigraph/tests/write_cost_s3.rs S3 opener gate: asserts data_opener_reads flat across depth [10, 50] — the RFC-013 step 3a acceptance test. Skips when bucket unset; panics on misconfigured bucket.
.github/workflows/ci.yml Adds write_cost_s3.rs and table_store.rs/instrumentation.rs to the RustFS CI path filter and wires in the new S3 cost-gate step.

Reviews (6): Last reviewed commit: "Merge branch 'main' into rfc-013-write-p..." | Re-trigger Greptile

ragnorc added 5 commits June 19, 2026 17:24
…e builder

Write opens routed through DatasetBuilder::from_namespace, whose describe_table
opened the whole dataset just to return a location and then re-resolved the
latest version — an O(commit-depth) double latest-resolution per table open that
missed Lance's O(1) version-hint fast path. On an object store this dominated
write latency (~70%, RFC-013 section 2.4).

TableStore::open_dataset_head_for_write now delegates to the direct opener
(open_dataset_head: Dataset::open by URI + checkout_branch, routed through the
tracked opener so cost tests can count it; a no-op in production). The manifest
already holds every sub-table's location, so the namespace catalog lookup was
redundant; ensure_expected_version still validates head == pinned for strict ops.
This completes PR #268's open-by-location migration on the write side.

With both reads (PR #268) and now writes bypassing it, nothing in production
routes through the per-table Lance namespace. The dead open chain
(load_table_from_namespace, open_table_head_for_write) is deleted and the
StagedTableNamespace contract apparatus is gated #[cfg(test)], mirroring the
already-test-only read namespace; __manifest commit coordination
(GraphNamespacePublisher) is a separate component and is unaffected.

See docs/dev/rfc-013-write-path-latency.md sections 2.4 and 9 (step 3a).
Adds tests/helpers/cost.rs, a store-agnostic cost harness (IoCounts/StagedCounts,
measure/measure_with_staged, assert_flat, local_graph/s3_graph) that the read-side
warm_read_cost.rs, write_cost.rs, and write_cost_s3.rs share, so the IOTracker /
task-local plumbing lives in exactly one place instead of duplicated per test.

write_cost.rs (local, every-PR) gates the internal-table scan term flat in
commit-history depth (a RED #[ignore]'d LOCK, the acceptance for bringing the
internal tables into compaction) plus green guards: a single insert's data writes
are bounded, a per-write read-op ceiling fails the moment a round-trip is added,
and a keyed insert routes through stage_merge_insert once with no stage_append or
vector-index build.

write_cost_s3.rs (bucket-gated, rustfs CI) gates the data-table opener term flat
across depth — the object-store-RPC phenomenon local FS cannot reproduce, and the
red->green proof of the opener bypass. Wired into the rustfs_integration CI job
and its path filter.

Guards the "hot-path cost is bounded by work, not history" invariant on writes.
See docs/dev/rfc-013-write-path-latency.md section 5.1, docs/dev/testing.md.
…t map

- Section 9: mark step 1 (gate + harness) and step 3a (opener bypass) landed;
  record the per-table namespace retirement to test-only and the corrected
  measurement note (the opener win is S3-only; the local data-table growth is the
  merge-insert/RI fragment scan, a compaction term, not the opener).
- Sections 7.1/6/11/5.5/10: correct the cross-table write-skew analysis after a
  prototype proved the scoped expected-set fix is a no-op against the per-object_id
  manifest (disjoint writers never share a row, so Lance never conflicts, the
  publisher never retries, and the expected check is a non-atomic pre-check
  evaluated once against stale state). The fix needs a shared contention row
  (Phase-7 graph_head / a minimal head row / commit-time re-validation), so it is
  coupled to that row, not standalone; that contention is load-bearing for
  correctness, not a drawback. Split the concurrent face (read-set + head) from the
  sequential face (inbound-RI validation on node removal) -- two different fixes.
- testing.md: add write_cost.rs / helpers/cost.rs / write_cost_s3.rs to the test
  map; document the local-vs-S3 backend split; extend the cost-budget checklist
  item to the write/open path and point at the shared harness.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65195c9903

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/omnigraph/tests/write_cost_s3.rs Outdated

// The data-table opener is O(1) after step 3a (direct-by-URI). Slack absorbs
// object-store variance; the pre-3a builder grew this ~+12/depth (RFC §2.4 [M]).
assert_flat(&curve, |c| c.data_reads, 8, "S3 data-table opener");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate opener IO before asserting flatness

With this insert_person/commit_many fixture, data_reads is not opener-only: the table wrapper is attached to the Dataset used by stage_merge_insert, so reads from scanning unoptimized fragments are included as history grows. In the S3 RustFS gate this assertion can fail (or misattribute fragment-scan growth to the opener) even when the direct opener is flat; isolate _versions/open IO, compact the fixture, or use an append-only operation if the intent is to gate only the opener term.

Useful? React with 👍 / 👎.

Comment thread crates/omnigraph/tests/helpers/cost.rs Outdated
Comment on lines +145 to +146
let mut db = Omnigraph::init(&uri, TEST_SCHEMA).await.ok()?;
load_jsonl(&mut db, TEST_DATA, LoadMode::Overwrite).await.ok()?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail configured S3 setup errors instead of skipping

When OMNIGRAPH_S3_TEST_BUCKET is set (the RustFS CI job this PR wires up), these .ok()? conversions turn any real Omnigraph::init or seed-load failure into None, so write_cost_s3 logs a skip and passes without exercising the new gate. The existing s3_storage pattern only skips when the bucket env var is absent and then unwraps setup failures; this helper should likewise return a Result or panic once the bucket is configured.

Useful? React with 👍 / 👎.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a966271. Configure here.

Comment thread crates/omnigraph/tests/helpers/cost.rs
ragnorc and others added 4 commits June 20, 2026 11:18
… setup errors

Addresses two PR review findings on the bucket-gated write_cost_s3 gate:

- The data-table opener was not isolated: `data_reads` also counts the
  merge-insert/RI scan, which reads O(fragment-count) and so grows with history
  for a different reason (compaction's domain, not the opener) -- the same term
  that made the local data-table count grow. The flat assertion would false-RED
  or misattribute scan growth to the opener on rustfs. Fix: compact (db.optimize)
  before each measurement so the table holds ~1 fragment, bounding the scan and
  leaving the opener's latest-version resolution as the only history-varying term.
  Compaction preserves version history, so the opener still faces a deep
  _versions/ chain -- the thing under test.

- s3_graph used `.ok()?`, so when OMNIGRAPH_S3_TEST_BUCKET was set but the store
  was down/misconfigured, init/seed failures collapsed to None and the gate
  skipped + passed vacuously. Fix: skip only when the bucket env var is absent;
  once it is set, init/seed failures panic (mirrors tests/s3_storage.rs).
…ct-by-design)

Replaces the fixture-bounded isolation (compact-before-measure) from the prior
commit with the root fix: a path-classifying ObjectStore wrapper (PrefixCounter)
that attributes each data-table read to the opener term (_versions/.manifest) vs
the scan term (data/*.lance). IoCounts now exposes data_opener_reads /
data_scan_reads, so write_cost_s3 asserts the opener flat *directly* -- no
compaction or fixture massaging, and the assertion measures the opener, not the
conflated total. Closes the "harness conflates two IO terms" class: any cost test
(read or write) can now isolate the opener.

PrefixCounter implements only the object_store 0.13 core ObjectStore methods; the
convenience surface (get/put/head/...) routes through get_opts/put_opts via
ObjectStoreExt's blanket impl, so every read/write is still counted.

Validated locally (every-PR) by write_cost::data_table_reads_split_into_flat_opener_
and_growing_scan: opener stays flat (7 -> 3) while scan grows (11 -> 91) and
opener + scan == data_total exactly -- proving the classifier and confirming the
local data-table growth is the fragment scan, not the opener. warm_read_cost
(12 tests) stays green under the shared-harness change.
…st) noise

Branch self-review (no behavior change) — pay down three liabilities the
write-path work left:

- warm_read_cost.rs kept its own probes() (three IOTrackers + a QueryIoProbes +
  a probe counter) and read raw .stats().read_iops — duplicating the shared
  helpers::cost harness this branch introduced. Migrated all 12 tests onto
  measure()/IoCounts; deleted the local probes(). (This also makes IoCounts'
  version_probes field used rather than dead.)

- insert_cost was copy-pasted verbatim into write_cost.rs and write_cost_s3.rs.
  Hoisted to helpers::cost::measure_insert so the measured write is defined once.

- The per-table Lance namespace (namespace.rs) became entirely test-only after
  step 3a, but was gated with ~22 per-item #[cfg(test)] attributes. Collapsed to
  a single `#[cfg(test)] mod namespace;` and stripped the per-item attributes;
  merged the import groups the gating had split.

Verified: lib in-source 162 passed; write_cost 4 + warm_read_cost 12 passed;
forbidden_apis passed.
@ragnorc ragnorc changed the title RFC-013: write-path cost gate + opener bypass (step 1 + step 3a) write-path cost gate + opener bypass Jun 20, 2026
@ragnorc ragnorc merged commit f6d2cc0 into main Jun 20, 2026
8 checks passed
ragnorc added a commit that referenced this pull request Jun 21, 2026
…#291)

* feat(engine): compact the internal __manifest/_graph_commits tables in optimize

`optimize` iterated node/edge catalog tables only, so the two internal system
tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and
were never compacted -- making every write's metadata scan O(fragments), which
grows forever on a long-lived graph (RFC-013 step 2).

`optimize_all_tables` now also compacts both internal tables via a new
`compact_internal_table`. They are not catalog-tracked (readers open them at
their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`:
compact in place, no manifest publish (nothing to publish to), no recovery
sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no
optimize_indices (they carry no Lance index, only object_id's unenforced-PK
metadata). No application lock: Lance's compact_files auto-retries its Rewrite
against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append
is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a
coordinator refresh afterwards makes the warm handle observe the compacted HEAD.

Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits
into __manifest -- a one-call throwaway for the full interim win; __manifest
compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup
(version GC) of the internal tables is deliberately NOT included here: it needs
the Q8 cleanup-resurrection watermark first (deferred).

maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds
optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar,
graph coherent for reads + strict writes after).

* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance)

`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance
gate staged in PR #288. With internal-table compaction landed, a write's
__manifest/_graph_commits scan is flat in commit-history depth on a compacted
graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the
pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before
measuring and runs green every-PR.

* docs: RFC-013 step 2 internal-table compaction landed

- invariants.md: close the compaction half of the read-path-rederivation known
  gap (optimize now compacts the internal tables; cleanup half still deferred).
- maintenance.md: optimize covers __manifest/_graph_commits (no publish, no
  sidecar); not yet in cleanup.
- rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8
  watermark, deferred — debated; MTT-overlap + hot-path liability).
- testing.md: the internal-table LOCK is now green every-PR.

* fix(engine): guard absent _graph_commits + always compact internal tables

Addresses PR #291 review findings:

- Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction,
  but a graph can validly have none (the coordinator opens it as `Option`, gated on
  `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the
  absent table errored and failed the whole optimize. Guard the `_graph_commits`
  compaction with the same `storage_adapter().exists()` check the coordinator uses;
  `__manifest` always exists so it stays unguarded. Regression test
  `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish
  recreates the table before the guard).

- Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table
  compaction for a schema with no node/edge types. Removed it so the internal
  tables are compacted regardless of the data-table set.

- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default
  `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override,
  so on a graph storing an *on* auto_cleanup config the commit would fire version
  GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are
  safe; the only exposure is pre-fix upgraded graphs, identical to the existing
  data-table optimize path, with step 2b's watermark as the comprehensive guard.
  Added a comment in `compact_internal_table` recording this.

* fix(engine): retry publish on RetryableCommitConflict (compaction vs publish)

Step 2 compacts `__manifest` with no app-level lock (Lance OCC arbitrates,
validated against LanceDB + the lance-7.0.0 conflict resolver). compact_files'
`Operation::Rewrite` auto-retries 20x (CommitConfig default num_retries=20), so a
live publish usually wins the race and the compaction rebases. But the publish
runs its merge-insert with conflict_retries(0) = one rebase attempt; if the
compaction commits first AND the merge touched a fragment the Rewrite rewrote,
Lance preempts the publish with `Error::RetryableCommitConflict` — a DIFFERENT
variant from the row-level `TooMuchWriteContention` the publisher already retries.
Left unhandled, that surfaces a transient error to the caller, i.e. a maintenance
compaction (physical op) failing a live write (logical op) — invariant 7.

Map `LanceError::RetryableCommitConflict` to a new
`ManifestConflictDetails::RetryableCommitConflict` and treat it as retryable in the
publisher's outer loop (reload fresh state + re-merge), alongside
RowLevelCasContention. `ExpectedVersionMismatch` still propagates (a genuine
expectation break must not be blindly retried). This also hardens multi-process
concurrent writers generally, not just compaction.

Normal publishes are insert-only (new object_ids -> new fragments, disjoint from
rewritten old ones), so the conflict is rare; the guard covers the
same-fragment-update edge and multi-process writers. Unit tests in publisher.rs
pin the mapping + the retry-predicate contract.

* revert: publisher RetryableCommitConflict handling (it was the wrong side)

Reverts d138902. Validated against lance-7.0.0: the publisher's merge-insert runs
with conflict_retries(0), and execute_with_retry converts an exhausted retryable
commit conflict to TooMuchWriteContention before the caller sees it
(write/retry.rs ~95-130). So map_lance_publish_error NEVER receives
RetryableCommitConflict from merge_rows — it receives TooMuchWriteContention, which
the publisher already maps to RowLevelCasContention and retries. The reverted
mapping was therefore dead on the real path and its unit test was synthetic.

The actual exposure is the *compaction* side: compact_files -> commit_compaction ->
apply_commit directly (no execute_with_retry), so a Rewrite-vs-Merge check_txn
conflict propagates raw and optimize can fail on a live graph. That is fixed
app-side in compact_internal_table in the following commit.

* fix(engine): make internal-table compaction correct by construction

Address three findings from review of the step-2 internal-table compaction:

- Non-destructive by construction: before compacting an internal table,
  strip any stored `lance.auto_cleanup.*` config off it. `compact_files`
  commits with a default `CommitConfig` (skip_auto_cleanup=false) and
  `CompactionOptions` exposes no override, so on a graph created by an older
  binary (on-by-default GC hook) the compaction commit would fire Lance's
  auto-cleanup and silently prune `__manifest`-pinned versions. Current
  binaries store no such config; the strip is the upgrade-path safety net so
  `optimize` can never GC versions.

- App-level compaction retry: `compact_files` does NOT auto-retry a semantic
  conflict against a concurrent live writer (Rewrite vs Update/Merge/Delete
  propagates raw from apply_commit; Lance prescribes app-rerun). Wrap the
  internal-table compaction in a bounded retry loop that reopens fresh and
  replans on a retryable Lance conflict, so a maintenance compaction (a
  physical op) never fails a live write (a logical op) — invariant 7.

- Compact all three internal tables, not two: `_graph_commit_actors` grows
  one fragment per commit on the authenticated write path, the same O(depth)
  scan as `__manifest`/`_graph_commits`. Drive the sweep from one
  source-of-truth list with per-table existence guards (the two commit-graph
  tables are optional). Make `graph_commit_actors_uri` pub(crate).

Tests: the `internal_table_scans_are_flat_in_history` LOCK now runs the
authenticated (actorful) write path so it covers `_graph_commit_actors` via
the shared commit-graph IO wrapper (new `commit_many_as`/`measure_insert_as`
helpers); `optimize_clears_stale_auto_cleanup_and_preserves_versions` pins
the non-destructive guarantee (config cleared + no version GC); a unit test
pins the retryable-conflict classifier; the empty-graph stats count is 7
(the actor table is created at init).

* docs: internal-table compaction covers all 3 tables, non-destructive, retried

Sync the RFC-013 step-2a section and the maintenance guide with the
correctness-by-design refinements:

- optimize compacts `__manifest`, `_graph_commits`, AND `_graph_commit_actors`
  (the actor table grows on the authenticated write path).
- optimize is non-destructive by construction — it never GCs versions, and
  strips stale `lance.auto_cleanup.*` config so an upgraded graph's commit-time
  GC hook cannot fire during compaction.
- internal-table compaction rebases and retries against concurrent live
  writers rather than failing the operator's optimize or the live write.
- the cost LOCK is the authenticated-path acceptance test.

* fix(engine): refresh coordinator after a config-strip with no compaction work

`compact_internal_table` returns early when `plan_compaction` finds no work,
but `clear_stale_auto_cleanup_config` may have already committed a config-strip
that advanced Lance HEAD. The early return skipped the coordinator refresh that
the successful-compaction path performs, leaving warm `__manifest`/commit-graph
handles pinned to the pre-strip version until the next read's version probe
healed them. No correctness bug (the probe self-heals, and a stale-handle write
would retry via publisher CAS), but the refresh makes coherence deterministic
rather than probe-dependent. Refresh iff the config-strip actually committed.

* docs(engine): correct compact_internal_table doc — compact_files does not auto-retry

The function doc claimed "Lance's compact_files auto-retries its Operation::Rewrite
against any concurrent writer" — wrong, and contradicting the is_retryable_lance_conflict
doc just below it and the explicit retry loop that exists precisely because compact_files
does NOT auto-retry semantic conflicts (Rewrite vs Update/Merge/Delete propagates raw
through apply_commit). Also move the orphaned description from above the retry-budget
const onto the function, and include the third internal table.

* test(engine): optimize must clear stale auto_cleanup on DATA tables too (red)

Regression test for a destructive bug on the data-table optimize path: on an
upgraded graph whose node/edge table still carries pre-v7 lance.auto_cleanup.*
config, `optimize`'s compact_files/optimize_indices commits fire Lance's version
GC and prune __manifest-pinned data-table versions. Mirrors the internal-table
auto_cleanup test on a Person table (force-repair realigns the config-induced
drift so optimize doesn't skip the table). Red against the current code: the
data-table path does not strip the config. The fix lands in the next commit.

* fix(engine): clear stale auto_cleanup on the data-table optimize path too

The auto_cleanup scrub previously only protected the internal tables; the
data-table path (optimize_one_table) ran compact_files/optimize_indices with a
default CommitConfig (skip_auto_cleanup=false) and no override, so on an upgraded
graph those commits could fire Lance's version-GC hook and prune __manifest-pinned
node/edge versions — making the "non-destructive" contract false for data tables.
Strip the config before the HEAD-advancing commits, capturing version_before first
so the strip's own commit still triggers the Phase-C manifest publish (no uncovered
drift). No retry loop needed: the data-table path holds the per-table write queue.
Covered by the existing Optimize recovery sidecar. Turns the prior commit's test green.

Also: switch clear_stale_auto_cleanup_config off the deprecated delete_config_keys
to update_config(None values), and correct two now-inaccurate doc comments —
compaction is "one or more content-preserving commits" (compact_files can emit a
ReserveFragments before the Rewrite), not "a single atomic commit"; the sidecar-free
property rests on content-preservation + read-at-HEAD, not single-commit atomicity.

* docs: optimize is non-destructive on all tables; correct atomicity/retry claims

- non-destructive guarantee now spans data + internal tables (the auto_cleanup
  strip runs on both paths), not just the internal ones.
- "single atomic Lance commit" was inaccurate: compaction can emit a
  ReserveFragments commit before the Rewrite; the no-sidecar property rests on
  content-preservation + read-at-HEAD, not single-commit atomicity.
- "retries rather than failing" softened to the truth: a *bounded* retry on the
  internal path; sustained contention surfaces a loud conflict error (bounded +
  observable, not an infinite loop). The data path holds the per-table queue and
  never contends.
ragnorc added a commit that referenced this pull request Jun 21, 2026
…rency-cap correction + Lance-metadata comparison (#292)

* feat(engine): compact the internal __manifest/_graph_commits tables in optimize

`optimize` iterated node/edge catalog tables only, so the two internal system
tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and
were never compacted -- making every write's metadata scan O(fragments), which
grows forever on a long-lived graph (RFC-013 step 2).

`optimize_all_tables` now also compacts both internal tables via a new
`compact_internal_table`. They are not catalog-tracked (readers open them at
their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`:
compact in place, no manifest publish (nothing to publish to), no recovery
sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no
optimize_indices (they carry no Lance index, only object_id's unenforced-PK
metadata). No application lock: Lance's compact_files auto-retries its Rewrite
against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append
is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a
coordinator refresh afterwards makes the warm handle observe the compacted HEAD.

Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits
into __manifest -- a one-call throwaway for the full interim win; __manifest
compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup
(version GC) of the internal tables is deliberately NOT included here: it needs
the Q8 cleanup-resurrection watermark first (deferred).

maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds
optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar,
graph coherent for reads + strict writes after).

* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance)

`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance
gate staged in PR #288. With internal-table compaction landed, a write's
__manifest/_graph_commits scan is flat in commit-history depth on a compacted
graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the
pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before
measuring and runs green every-PR.

* docs: RFC-013 step 2 internal-table compaction landed

- invariants.md: close the compaction half of the read-path-rederivation known
  gap (optimize now compacts the internal tables; cleanup half still deferred).
- maintenance.md: optimize covers __manifest/_graph_commits (no publish, no
  sidecar); not yet in cleanup.
- rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8
  watermark, deferred — debated; MTT-overlap + hot-path liability).
- testing.md: the internal-table LOCK is now green every-PR.

* fix(engine): guard absent _graph_commits + always compact internal tables

Addresses PR #291 review findings:

- Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction,
  but a graph can validly have none (the coordinator opens it as `Option`, gated on
  `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the
  absent table errored and failed the whole optimize. Guard the `_graph_commits`
  compaction with the same `storage_adapter().exists()` check the coordinator uses;
  `__manifest` always exists so it stays unguarded. Regression test
  `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish
  recreates the table before the guard).

- Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table
  compaction for a schema with no node/edge types. Removed it so the internal
  tables are compacted regardless of the data-table set.

- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default
  `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override,
  so on a graph storing an *on* auto_cleanup config the commit would fire version
  GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are
  safe; the only exposure is pre-fix upgraded graphs, identical to the existing
  data-table optimize path, with step 2b's watermark as the comprehensive guard.
  Added a comment in `compact_internal_table` recording this.

* docs(rfc-013): serial-hop correction — wall-clock is the ~110-hop backbone, not op count

Latency-slope measurement on the deployed edge binary (f6d2cc0, steps 1+3a
landed; rustfs + per-op latency proxy, depth 1..85) shows wall-clock is set by a
~110-hop SERIAL backbone that is depth-invariant. Total ops grow +~7/depth but
PARALLELIZE (parallelism 1->6), so the depth term adds little wall-clock.

- New §0(c): the serial-hop vs total-op finding + branch-op backbones
  (create ~77, delete ~87, branch-write ~258/1777-ops/21s floor = fork-on-first-write).
- §2.4: correct the '1720->198 ops => 258s->30s' op-count->wall-clock conversion.
- §5.1: promote serial-hop/num_stages to the PRIMARY latency LOCK; op-count
  flatness demoted to a cost/compute-floor gate.
- §9 step 2: reprioritized as Phase-7 prerequisite + compute-floor/space, NOT
  the wall-clock fix; step 3b (parallel capture-once WriteTxn) is the headline
  latency lever; branch-write moved under step 3b + fork seam.
- Summary: serial-backbone correction up front.

Vindicates the §3/§4.1 design; corrects the op-count latency framing.

* docs(rfc-013): concurrency-cap correction + Lance-metadata comparison

Fold in two measured findings from the deployed edge binary (f6d2cc0) on
rustfs behind a latency+concurrency proxy:

- §0(d): concurrency-cap A/B. Under unlimited concurrency the internal-table
  scan parallelizes (backbone ~110); under an R2-realistic cap (8) it serializes
  and an UNCOMPACTED graph runs away (per-write ops 1273->3505, wall 6->16s),
  while #291's internal compaction cuts it ~6x and bounds it (137->1 frag). The
  latency model is (serial_hops + ops/effective_concurrency)*RTT + compute.
- Reframe step 2 across Summary/§2.4/§9: NOT de-ranked — on R2 (capped) it is a
  primary latency lever + the anti-runaway fix + Phase-7 prereq. The earlier
  'step 2 is parallel, irrelevant to latency' was an unlimited-concurrency
  artifact. Deployed f6d2cc0 optimize is node/edge-only; #291 (undeployed) is
  the prod win.
- §5.1: the cost-gate ThrottledStore must cap concurrency AND inject latency;
  assert serial_hops flat AND ops flat in history.
- §2.3 + §8: Lance/LanceDB comparison from 7.0.0 source — Lance metadata is a
  single-file per-version manifest read O(1) (latest_version_hint), pruned by
  default; omnigraph's __manifest-as-Lance-dataset scan is self-inflicted by the
  cross-table-atomicity choice. Adds explicit defense of Lance-dataset __manifest
  (MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path).

Design (§3/§4.1) unchanged and vindicated; corrections are measurement framing,
step sizing, and one design-choice that was implicit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant