(feat): compact the internal manifest/commit-graph tables in optimize by ragnorc · Pull Request #291 · ModernRelay/omnigraph

ragnorc · 2026-06-20T15:29:42Z

What & why

optimize compacted node/edge data tables only — the two internal system tables __manifest (visibility authority) and _graph_commits (lineage DAG) were never compacted (all_table_keys is catalog-keys-only). They accumulate one fragment per commit, so every write's metadata scan of them is O(fragments) and grows forever on a long-lived graph. That was the residual internal-table term after PR #288 flattened the data-table opener, and the RED #[ignore]'d internal_table_scans_are_flat_in_history LOCK that #288 staged.

This PR brings both internal tables into db.optimize().

Changes

compact_internal_table (new, in optimize.rs) — a deliberately simpler path than optimize_one_table. The internal tables aren't catalog-tracked (readers open them at their latest Lance HEAD), so compaction just advances that HEAD: no manifest publish (nothing to publish to), no recovery sidecar (a single atomic Lance compact_files commit — no HEAD-before-publish gap), no optimize_indices (they carry no Lance index, only object_id's unenforced-PK metadata). optimize_all_tables runs it for __manifest and _graph_commits after the data tables.
Concurrency: no application lock — validated against the canonical Lance consumer. LanceDB holds no lock around compaction; Lance's compact_files auto-retries its Operation::Rewrite against any concurrent writer (Rewrite vs Append is compatible, vs Update/merge-insert a retryable same-fragment conflict Lance rebases — lance/io/commit.rs, conflict_resolver.rs). A coordinator refresh after makes the warm __manifest/_graph_commits handle observe the compacted HEAD. (Within one process optimize &self and mutate &mut self can't overlap anyway; multi-process is the existing one-winner-CAS exposure.)
Both tables compacted, even though Phase 7 (iss-991) will later fold _graph_commits into __manifest — a one-call throwaway for the full interim win. __manifest compaction is also the hard prerequisite for Phase 7 (its graph_head CAS contention is only acceptable once compaction bounds the publisher's load_publish_state scan).

Measured result (the LOCK, now green)

internal_table_scans_are_flat_in_history (un-ignored): on a compacted graph a write's internal-table scan is flat in commit-history depth:

	depth 10	depth 100
`__manifest` reads	4	2
`_graph_commits` reads	7	3

vs the pre-step-2 RED curve (__manifest 34→214, _graph_commits 29→207).

Scope — compaction only (cleanup deferred)

This is step 2a. The cleanup (version-GC) half — step 2b — is deliberately not here: cleanup is the version-deleting op that hits the Q8 cleanup-resurrection hole (Lance's version CAS has no monotonic guard, validated in lance/dataset/transaction.rs/cleanup.rs), so it must land with a durable monotonic watermark (a Lance boundary tag). Deferred because it touches the read/open path, is the part most likely made redundant by Lance MTT, and only buys the secondary version-count/space term — whereas 2a delivers the dominant per-write scan win with zero resurrection risk. So the internal _versions/ chain still grows until 2b; run optimize on a cadence to keep per-write metadata scans flat. See docs/dev/rfc-013-write-path-latency.md §9.

Tests

cargo test --workspace --locked green (65 suites). (One pre-existing flaky server test, schema_apply_route_hard_drops_property_with_allow_data_loss, intermittently fails under full parallel load — passes 4/4 in isolation, on clean main, and on the re-run; unrelated to this optimize-only change, which the server test never exercises.)
internal_table_scans_are_flat_in_history un-ignored and green; optimize_compacts_internal_tables (sheds fragments, no sidecar leak, coherent after); optimize_on_empty_graph updated (6 stats); idempotent test already covers internal no-op.

Note

Medium Risk
Changes the maintenance path and coordinator refresh behavior under concurrent writes, but compaction is non-destructive (no version GC) and is heavily covered by integration and cost-budget tests.

Overview
optimize now compacts internal system tables (__manifest, _graph_commits, _graph_commit_actors) after node/edge tables, via a new compact_internal_table path: in-place Lance compaction only—no __manifest publish and no recovery sidecar—plus coordinator refresh for warm handles. The early return when the schema has no data tables is removed so internal tables are still swept.

Safety and concurrency: clear_stale_auto_cleanup_config runs before compaction on internal tables and on data tables inside the existing optimize sidecar, so upgraded graphs cannot trigger Lance commit-time version GC during optimize. Internal compaction uses a bounded retry on retryable Lance write conflicts (no per-table write queue); missing _graph_commits / _graph_commit_actors datasets are skipped via storage existence checks.

Tests and gates: Integration tests cover internal compaction, absent commit-graph tables, and auto-cleanup stripping on internal and data tables. The internal_table_scans_are_flat_in_history cost LOCK is enabled and uses actorful writes with optimize at each depth checkpoint. User and RFC docs describe the new behavior; internal version cleanup remains deferred (step 2b).

^{Reviewed by Cursor Bugbot for commit edfc73d. Bugbot is set up for automated code reviews on this repo. Configure here.}

Greptile Summary

This PR brings the two (now three) internal system tables — __manifest, _graph_commits, and _graph_commit_actors — into db.optimize() as RFC-013 step 2a, fixing the O(fragments) per-write metadata scan that grew unboundedly on long-lived graphs. It also un-ignores the internal_table_scans_are_flat_in_history cost LOCK and adds clear_stale_auto_cleanup_config to both the new internal-table path and the existing data-table path, making optimize non-destructive by construction on pre-v7 upgraded graphs.

compact_internal_table (new) — a simplified variant of optimize_one_table with no manifest publish, no recovery sidecar, and a bounded app-level retry loop for Rewrite-vs-live-write conflicts; existence-guarded for the two optional commit-graph tables; followed by a coordinator refresh for cache coherence.
clear_stale_auto_cleanup_config (new, shared) — strips any stored lance.auto_cleanup.* config before compacting any table (data or internal) so the commit-time Lance GC hook cannot prune __manifest-pinned versions on upgraded graphs.
Test coverage — four new integration tests in maintenance.rs covering internal compaction, absent _graph_commits, auto-cleanup stripping on internal tables, and the same on data tables; the cost LOCK is now green on the authenticated write path.

Confidence Score: 5/5

Safe to merge — compaction is content-preserving, internal tables are read at HEAD so no manifest publish is needed, the bounded retry loop correctly handles concurrent writers, and the existence guard prevents failures on older graphs without commit-graph tables.

All changes are on the maintenance path (optimize), not the live write or read path. Compaction preserves all versions (no GC), the auto-cleanup strip prevents the one destructive upgrade hazard on older graphs, and the new compact_internal_table path is thoroughly tested.

The retry loop in compact_internal_table in optimize.rs is worth a second look: plan_compaction errors exit the loop entirely via ? rather than going through the retryable-conflict check, which is an asymmetry with the surrounding retry logic.

Important Files Changed

Filename	Overview
crates/omnigraph/src/db/omnigraph/optimize.rs	Core change: adds `compact_internal_table`, `clear_stale_auto_cleanup_config`, and `is_retryable_lance_conflict`; wires all three internal system tables into `optimize_all_tables`; removes the empty-table-tasks early return. The retry loop correctly handles `clear_stale_auto_cleanup_config` and `compact_files` conflicts, but `plan_compaction` errors exit the retry loop entirely via `?`. Multiple coordinator refreshes (one per internal table instead of one at the end) are correct but redundant.
crates/omnigraph/tests/maintenance.rs	Adds four new integration tests covering internal-table compaction, absent `_graph_commits` tolerance, stale auto-cleanup config stripping on internal tables, and the same on data tables; updates `optimize_on_empty_graph` to assert 7 stats. Comprehensive coverage.
crates/omnigraph/tests/write_cost.rs	Un-ignores `internal_table_scans_are_flat_in_history` LOCK, switches to actorful writes via `commit_many_as`/`measure_insert_as` so `_graph_commit_actors` is exercised, and inserts `db.optimize()` at each depth checkpoint to prove per-write scan is flat on a periodically-compacted graph.
crates/omnigraph/src/db/commit_graph.rs	One-line visibility change: `graph_commit_actors_uri` promoted from `fn` to `pub(crate) fn` so `optimize.rs` can call it directly.
crates/omnigraph/src/db/manifest.rs	Re-exports `manifest_uri` from `layout` as `pub(crate)` so `optimize.rs` can reference it without going through the internal `layout` module path.
crates/omnigraph/src/db/manifest/layout.rs	One-line visibility change: `manifest_uri` promoted from `pub(super)` to `pub(crate)` to support the re-export in `manifest.rs`.
crates/omnigraph/tests/helpers/cost.rs	Adds `measure_insert_as` helper for the authenticated write path, mirroring `measure_insert` but passing an actor so `_graph_commit_actors` is exercised in IO cost measurements.
crates/omnigraph/tests/helpers/mod.rs	Adds `commit_many_as` helper for building history on the authenticated write path, mirroring `commit_many`.
docs/dev/rfc-013-write-path-latency.md	Updates RFC-013 step 2 to mark 2a landed and 2b deferred; documents the three-table compaction, no-sidecar rationale, bounded retry, and auto-cleanup strip in detail.
docs/dev/invariants.md	Updates invariant 15 to note `optimize` now covers internal tables and that the version chain still grows until step 2b lands.
docs/user/operations/maintenance.md	Adds two new bullet points describing internal-table compaction (simplified path, no sidecar, no manifest publish) and the non-destructive guarantee (auto-cleanup strip, bounded retry).
docs/dev/testing.md	Updates the `write_cost.rs` row to reflect the LOCK is now green every-PR and describes the compaction-at-depth-checkpoint pattern.

Comments Outside Diff (1)

crates/omnigraph/src/db/omnigraph/optimize.rs, line 131-132 (link)

The committed doc-comment says "published to __manifest", but for internal tables committed = true only means Lance advanced its HEAD — there is no manifest publish for these tables. Callers who read committed as "a manifest entry changed" will be confused for the two new table_key values. Consider a short clarifying note here.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

_{Reviews (6): Last reviewed commit: "Merge branch 'main' into rfc-013-interna..." | Re-trigger Greptile}

…n optimize `optimize` iterated node/edge catalog tables only, so the two internal system tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and were never compacted -- making every write's metadata scan O(fragments), which grows forever on a long-lived graph (RFC-013 step 2). `optimize_all_tables` now also compacts both internal tables via a new `compact_internal_table`. They are not catalog-tracked (readers open them at their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`: compact in place, no manifest publish (nothing to publish to), no recovery sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no optimize_indices (they carry no Lance index, only object_id's unenforced-PK metadata). No application lock: Lance's compact_files auto-retries its Rewrite against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a coordinator refresh afterwards makes the warm handle observe the compacted HEAD. Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits into __manifest -- a one-call throwaway for the full interim win; __manifest compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup (version GC) of the internal tables is deliberately NOT included here: it needs the Q8 cleanup-resurrection watermark first (deferred). maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar, graph coherent for reads + strict writes after).

`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance gate staged in PR #288. With internal-table compaction landed, a write's __manifest/_graph_commits scan is flat in commit-history depth on a compacted graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before measuring and runs green every-PR.

- invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8db8937a6a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-20T15:35:39Z

+    let metrics = compact_files(&mut ds, options, None)
+        .await
+        .map_err(|e| OmniError::Lance(e.to_string()))?;


Preserve auto-cleanup skips during internal compaction

On upgraded graphs whose internal Lance tables still carry the pre-v7 lance.auto_cleanup.* config, this compact_files commit has no skip_auto_cleanup equivalent, so running optimize can trigger Lance's automatic version GC on __manifest / _graph_commits. Those versions are the snapshot/time-travel authority; docs/dev/lance.md and skip_auto_cleanup_suppresses_version_gc call out that every commit path must skip this hook for pre-bump graphs. Please disable/clear the legacy auto-cleanup config before committing internal compaction, or use a commit path that can set the skip.

Useful? React with 👍 / 👎.

…bles Addresses PR #291 review findings: - Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction, but a graph can validly have none (the coordinator opens it as `Option`, gated on `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the absent table errored and failed the whole optimize. Guard the `_graph_commits` compaction with the same `storage_adapter().exists()` check the coordinator uses; `__manifest` always exists so it stays unguarded. Regression test `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish recreates the table before the guard). - Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table compaction for a schema with no node/edge types. Removed it so the internal tables are compacted regardless of the data-table set. - Codex (auto-cleanup, P1): documented — `compact_files` commits with a default `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override, so on a graph storing an *on* auto_cleanup config the commit would fire version GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are safe; the only exposure is pre-fix upgraded graphs, identical to the existing data-table optimize path, with step 2b's watermark as the comprehensive guard. Added a comment in `compact_internal_table` recording this.

…publish) Step 2 compacts `__manifest` with no app-level lock (Lance OCC arbitrates, validated against LanceDB + the lance-7.0.0 conflict resolver). compact_files' `Operation::Rewrite` auto-retries 20x (CommitConfig default num_retries=20), so a live publish usually wins the race and the compaction rebases. But the publish runs its merge-insert with conflict_retries(0) = one rebase attempt; if the compaction commits first AND the merge touched a fragment the Rewrite rewrote, Lance preempts the publish with `Error::RetryableCommitConflict` — a DIFFERENT variant from the row-level `TooMuchWriteContention` the publisher already retries. Left unhandled, that surfaces a transient error to the caller, i.e. a maintenance compaction (physical op) failing a live write (logical op) — invariant 7. Map `LanceError::RetryableCommitConflict` to a new `ManifestConflictDetails::RetryableCommitConflict` and treat it as retryable in the publisher's outer loop (reload fresh state + re-merge), alongside RowLevelCasContention. `ExpectedVersionMismatch` still propagates (a genuine expectation break must not be blindly retried). This also hardens multi-process concurrent writers generally, not just compaction. Normal publishes are insert-only (new object_ids -> new fragments, disjoint from rewritten old ones), so the conflict is rare; the guard covers the same-fragment-update edge and multi-process writers. Unit tests in publisher.rs pin the mapping + the retry-predicate contract.

…side) Reverts d138902. Validated against lance-7.0.0: the publisher's merge-insert runs with conflict_retries(0), and execute_with_retry converts an exhausted retryable commit conflict to TooMuchWriteContention before the caller sees it (write/retry.rs ~95-130). So map_lance_publish_error NEVER receives RetryableCommitConflict from merge_rows — it receives TooMuchWriteContention, which the publisher already maps to RowLevelCasContention and retries. The reverted mapping was therefore dead on the real path and its unit test was synthetic. The actual exposure is the *compaction* side: compact_files -> commit_compaction -> apply_commit directly (no execute_with_retry), so a Rewrite-vs-Merge check_txn conflict propagates raw and optimize can fail on a live graph. That is fixed app-side in compact_internal_table in the following commit.

Address three findings from review of the step-2 internal-table compaction: - Non-destructive by construction: before compacting an internal table, strip any stored `lance.auto_cleanup.*` config off it. `compact_files` commits with a default `CommitConfig` (skip_auto_cleanup=false) and `CompactionOptions` exposes no override, so on a graph created by an older binary (on-by-default GC hook) the compaction commit would fire Lance's auto-cleanup and silently prune `__manifest`-pinned versions. Current binaries store no such config; the strip is the upgrade-path safety net so `optimize` can never GC versions. - App-level compaction retry: `compact_files` does NOT auto-retry a semantic conflict against a concurrent live writer (Rewrite vs Update/Merge/Delete propagates raw from apply_commit; Lance prescribes app-rerun). Wrap the internal-table compaction in a bounded retry loop that reopens fresh and replans on a retryable Lance conflict, so a maintenance compaction (a physical op) never fails a live write (a logical op) — invariant 7. - Compact all three internal tables, not two: `_graph_commit_actors` grows one fragment per commit on the authenticated write path, the same O(depth) scan as `__manifest`/`_graph_commits`. Drive the sweep from one source-of-truth list with per-table existence guards (the two commit-graph tables are optional). Make `graph_commit_actors_uri` pub(crate). Tests: the `internal_table_scans_are_flat_in_history` LOCK now runs the authenticated (actorful) write path so it covers `_graph_commit_actors` via the shared commit-graph IO wrapper (new `commit_many_as`/`measure_insert_as` helpers); `optimize_clears_stale_auto_cleanup_and_preserves_versions` pins the non-destructive guarantee (config cleared + no version GC); a unit test pins the retryable-conflict classifier; the empty-graph stats count is 7 (the actor table is created at init).

… retried Sync the RFC-013 step-2a section and the maintenance guide with the correctness-by-design refinements: - optimize compacts `__manifest`, `_graph_commits`, AND `_graph_commit_actors` (the actor table grows on the authenticated write path). - optimize is non-destructive by construction — it never GCs versions, and strips stale `lance.auto_cleanup.*` config so an upgraded graph's commit-time GC hook cannot fire during compaction. - internal-table compaction rebases and retries against concurrent live writers rather than failing the operator's optimize or the live write. - the cost LOCK is the authenticated-path acceptance test.

ragnorc · 2026-06-21T05:37:20Z

Correctness-by-design pass on internal-table compaction

Addressed four findings from review. All fixes are correct-by-construction (not symptom guards), each with a test at the matching boundary. Full workspace gate green (cargo test --workspace --locked: 65 suites ok, 0 failed).

F1 — optimize must be non-destructive, including on upgraded graphs. compact_files commits with a default CommitConfig (skip_auto_cleanup = false) and CompactionOptions exposes no override, so on a graph created by an older binary (Lance's auto_cleanup hook on by default) the compaction commit would fire version-GC and silently prune __manifest-pinned versions. Fix: clear_stale_auto_cleanup_config strips lance.auto_cleanup.* off the table before compacting — the resulting manifest lacks the keys, so the commit-time hook no-ops. New graphs already store auto_cleanup: None; this is the upgrade-path safety net. Test: optimize_clears_stale_auto_cleanup_and_preserves_versions (config cleared + no version GC; part (a) is non-vacuous since only this function removes config).

F2 — compact all three internal tables. _graph_commit_actors grows one fragment per commit on the authenticated write path (server/CLI always carry an actor), so it had the same O(depth) per-write scan as __manifest/_graph_commits but was excluded. Fix: one source-of-truth internal-table list with per-table existence guards (the two commit-graph tables are optional); graph_commit_actors_uri → pub(crate). The cost LOCK internal_table_scans_are_flat_in_history now runs the authenticated write path, so it covers the actor table via the shared commit-graph IO wrapper (__manifest 4→2, commit-graph+actors 10→2 across depth 10→100).

F3 — internal-table compaction must retry against live writers. compact_files does not auto-retry a semantic conflict: commit_compaction → apply_commit propagates a Rewrite-vs-Update/Merge/Delete check_txn conflict raw (unlike the merge-insert path, it's not wrapped in execute_with_retry). Lance prescribes app-rerun. Fix: bounded retry loop in compact_internal_table that reopens fresh and replans on a retryable Lance conflict — a maintenance compaction (physical op) never fails a live write (logical op), per invariant 7. Test: retryable_lance_conflicts_are_classified.

F4 — reverted the wrong-side hardening (d138902e). That commit added a RetryableCommitConflict mapping to the __manifest publisher, but the publisher's merge-insert path uses conflict_retries(0), which already converts an exhausted retryable conflict to TooMuchWriteContention (already handled) — so the added mapping was dead. The real gap was F3 (the compaction side, which has no retry wrapper at all). Reverted in 4ee7186c.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit e034636. Configure here.}

…ion work `compact_internal_table` returns early when `plan_compaction` finds no work, but `clear_stale_auto_cleanup_config` may have already committed a config-strip that advanced Lance HEAD. The early return skipped the coordinator refresh that the successful-compaction path performs, leaving warm `__manifest`/commit-graph handles pinned to the pre-strip version until the next read's version probe healed them. No correctness bug (the probe self-heals, and a stale-handle write would retry via publisher CAS), but the refresh makes coherence deterministic rather than probe-dependent. Refresh iff the config-strip actually committed.

… not auto-retry The function doc claimed "Lance's compact_files auto-retries its Operation::Rewrite against any concurrent writer" — wrong, and contradicting the is_retryable_lance_conflict doc just below it and the explicit retry loop that exists precisely because compact_files does NOT auto-retry semantic conflicts (Rewrite vs Update/Merge/Delete propagates raw through apply_commit). Also move the orphaned description from above the retry-budget const onto the function, and include the third internal table.

…oo (red) Regression test for a destructive bug on the data-table optimize path: on an upgraded graph whose node/edge table still carries pre-v7 lance.auto_cleanup.* config, `optimize`'s compact_files/optimize_indices commits fire Lance's version GC and prune __manifest-pinned data-table versions. Mirrors the internal-table auto_cleanup test on a Person table (force-repair realigns the config-induced drift so optimize doesn't skip the table). Red against the current code: the data-table path does not strip the config. The fix lands in the next commit.

… too The auto_cleanup scrub previously only protected the internal tables; the data-table path (optimize_one_table) ran compact_files/optimize_indices with a default CommitConfig (skip_auto_cleanup=false) and no override, so on an upgraded graph those commits could fire Lance's version-GC hook and prune __manifest-pinned node/edge versions — making the "non-destructive" contract false for data tables. Strip the config before the HEAD-advancing commits, capturing version_before first so the strip's own commit still triggers the Phase-C manifest publish (no uncovered drift). No retry loop needed: the data-table path holds the per-table write queue. Covered by the existing Optimize recovery sidecar. Turns the prior commit's test green. Also: switch clear_stale_auto_cleanup_config off the deprecated delete_config_keys to update_config(None values), and correct two now-inaccurate doc comments — compaction is "one or more content-preserving commits" (compact_files can emit a ReserveFragments before the Rewrite), not "a single atomic commit"; the sidecar-free property rests on content-preservation + read-at-HEAD, not single-commit atomicity.

…try claims - non-destructive guarantee now spans data + internal tables (the auto_cleanup strip runs on both paths), not just the internal ones. - "single atomic Lance commit" was inaccurate: compaction can emit a ReserveFragments commit before the Rewrite; the no-sidecar property rests on content-preservation + read-at-HEAD, not single-commit atomicity. - "retries rather than failing" softened to the truth: a *bounded* retry on the internal path; sustained contention surfaces a loud conflict error (bounded + observable, not an infinite loop). The data path holds the per-table queue and never contends.

ragnorc · 2026-06-21T14:14:59Z

Second review pass — two more findings, both valid, both addressed

Validated against the current code and Lance 7.0.0 source. Full cargo test --workspace --locked green (65 suites, 0 failed).

Finding 1 (High) — optimize was destructive on upgraded DATA tables. Valid; fixed.
Confirmed: the data-table path (optimize_one_table) ran compact_files/optimize_indices with no auto-cleanup scrub, while Lance's commit_compaction → apply_commit(.., &Default::default(), &Default::default()) (lance dataset/optimize.rs:1627) commits with skip_auto_cleanup=false and CompactionOptions exposes no override. So on a graph created by a pre-v7 binary (auto_cleanup ON), optimize could fire Lance's version-GC and prune __manifest-pinned node/edge versions — falsifying the "non-destructive" contract. This was a pre-existing gap on a path that predates this PR, but the PR strengthened that contract, so it's fixed here rather than left to lie.

Fix (864d296e): reuse clear_stale_auto_cleanup_config on the data path too, inside the existing Optimize sidecar window. version_before is captured before the strip so the strip's own commit still triggers the Phase-C manifest publish (no uncovered HEAD>manifest drift). No retry loop needed — the data path holds the per-table write queue, so it doesn't contend. Regression test optimize_clears_stale_auto_cleanup_on_data_tables_too landed as a red commit first (b2897c62), green after the fix.

Finding 2 (Medium) — docs overclaimed substrate guarantees. Valid; corrected.

"single atomic Lance commit" was wrong: compact_files emits a ReserveFragments commit before the Rewrite (lance dataset/optimize.rs:2364 test: "1 commit for reserve fragments and 1 for final commit"). Corrected to "one or more content-preserving commits"; the no-sidecar property is re-grounded on content-preservation + read-at-HEAD, not single-commit atomicity.
"retries rather than failing" softened to the truth: a bounded retry on the internal path; sustained contention past the budget surfaces a loud conflict error (bounded + observable per invariant 13, not an infinite loop). The data path holds the per-table queue and never contends.

Incidental cleanup: clear_stale_auto_cleanup_config was on the deprecated delete_config_keys; switched to update_config(None values).

Commits: b2897c62 (test, red) → 864d296e (fix, green) → 99d8358b (docs).

Fold in two measured findings from the deployed edge binary (f6d2cc0) on rustfs behind a latency+concurrency proxy: - §0(d): concurrency-cap A/B. Under unlimited concurrency the internal-table scan parallelizes (backbone ~110); under an R2-realistic cap (8) it serializes and an UNCOMPACTED graph runs away (per-write ops 1273->3505, wall 6->16s), while #291's internal compaction cuts it ~6x and bounds it (137->1 frag). The latency model is (serial_hops + ops/effective_concurrency)*RTT + compute. - Reframe step 2 across Summary/§2.4/§9: NOT de-ranked — on R2 (capped) it is a primary latency lever + the anti-runaway fix + Phase-7 prereq. The earlier 'step 2 is parallel, irrelevant to latency' was an unlimited-concurrency artifact. Deployed f6d2cc0 optimize is node/edge-only; #291 (undeployed) is the prod win. - §5.1: the cost-gate ThrottledStore must cap concurrency AND inject latency; assert serial_hops flat AND ops flat in history. - §2.3 + §8: Lance/LanceDB comparison from 7.0.0 source — Lance metadata is a single-file per-version manifest read O(1) (latest_version_hint), pruned by default; omnigraph's __manifest-as-Lance-dataset scan is self-inflicted by the cross-table-atomicity choice. Adds explicit defense of Lance-dataset __manifest (MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path). Design (§3/§4.1) unchanged and vindicated; corrections are measurement framing, step sizing, and one design-choice that was implicit.

…rency-cap correction + Lance-metadata comparison (#292) * feat(engine): compact the internal __manifest/_graph_commits tables in optimize `optimize` iterated node/edge catalog tables only, so the two internal system tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and were never compacted -- making every write's metadata scan O(fragments), which grows forever on a long-lived graph (RFC-013 step 2). `optimize_all_tables` now also compacts both internal tables via a new `compact_internal_table`. They are not catalog-tracked (readers open them at their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`: compact in place, no manifest publish (nothing to publish to), no recovery sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no optimize_indices (they carry no Lance index, only object_id's unenforced-PK metadata). No application lock: Lance's compact_files auto-retries its Rewrite against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a coordinator refresh afterwards makes the warm handle observe the compacted HEAD. Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits into __manifest -- a one-call throwaway for the full interim win; __manifest compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup (version GC) of the internal tables is deliberately NOT included here: it needs the Q8 cleanup-resurrection watermark first (deferred). maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar, graph coherent for reads + strict writes after). * test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance) `internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance gate staged in PR #288. With internal-table compaction landed, a write's __manifest/_graph_commits scan is flat in commit-history depth on a compacted graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before measuring and runs green every-PR. * docs: RFC-013 step 2 internal-table compaction landed - invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR. * fix(engine): guard absent _graph_commits + always compact internal tables Addresses PR #291 review findings: - Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction, but a graph can validly have none (the coordinator opens it as `Option`, gated on `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the absent table errored and failed the whole optimize. Guard the `_graph_commits` compaction with the same `storage_adapter().exists()` check the coordinator uses; `__manifest` always exists so it stays unguarded. Regression test `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish recreates the table before the guard). - Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table compaction for a schema with no node/edge types. Removed it so the internal tables are compacted regardless of the data-table set. - Codex (auto-cleanup, P1): documented — `compact_files` commits with a default `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override, so on a graph storing an *on* auto_cleanup config the commit would fire version GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are safe; the only exposure is pre-fix upgraded graphs, identical to the existing data-table optimize path, with step 2b's watermark as the comprehensive guard. Added a comment in `compact_internal_table` recording this. * docs(rfc-013): serial-hop correction — wall-clock is the ~110-hop backbone, not op count Latency-slope measurement on the deployed edge binary (f6d2cc0, steps 1+3a landed; rustfs + per-op latency proxy, depth 1..85) shows wall-clock is set by a ~110-hop SERIAL backbone that is depth-invariant. Total ops grow +~7/depth but PARALLELIZE (parallelism 1->6), so the depth term adds little wall-clock. - New §0(c): the serial-hop vs total-op finding + branch-op backbones (create ~77, delete ~87, branch-write ~258/1777-ops/21s floor = fork-on-first-write). - §2.4: correct the '1720->198 ops => 258s->30s' op-count->wall-clock conversion. - §5.1: promote serial-hop/num_stages to the PRIMARY latency LOCK; op-count flatness demoted to a cost/compute-floor gate. - §9 step 2: reprioritized as Phase-7 prerequisite + compute-floor/space, NOT the wall-clock fix; step 3b (parallel capture-once WriteTxn) is the headline latency lever; branch-write moved under step 3b + fork seam. - Summary: serial-backbone correction up front. Vindicates the §3/§4.1 design; corrects the op-count latency framing. * docs(rfc-013): concurrency-cap correction + Lance-metadata comparison Fold in two measured findings from the deployed edge binary (f6d2cc0) on rustfs behind a latency+concurrency proxy: - §0(d): concurrency-cap A/B. Under unlimited concurrency the internal-table scan parallelizes (backbone ~110); under an R2-realistic cap (8) it serializes and an UNCOMPACTED graph runs away (per-write ops 1273->3505, wall 6->16s), while #291's internal compaction cuts it ~6x and bounds it (137->1 frag). The latency model is (serial_hops + ops/effective_concurrency)*RTT + compute. - Reframe step 2 across Summary/§2.4/§9: NOT de-ranked — on R2 (capped) it is a primary latency lever + the anti-runaway fix + Phase-7 prereq. The earlier 'step 2 is parallel, irrelevant to latency' was an unlimited-concurrency artifact. Deployed f6d2cc0 optimize is node/edge-only; #291 (undeployed) is the prod win. - §5.1: the cost-gate ThrottledStore must cap concurrency AND inject latency; assert serial_hops flat AND ops flat in history. - §2.3 + §8: Lance/LanceDB comparison from 7.0.0 source — Lance metadata is a single-file per-version manifest read O(1) (latest_version_hint), pruned by default; omnigraph's __manifest-as-Lance-dataset scan is self-inflicted by the cross-table-atomicity choice. Adds explicit defense of Lance-dataset __manifest (MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path). Design (§3/§4.1) unchanged and vindicated; corrections are measurement framing, step sizing, and one design-choice that was implicit.

ragnorc added 3 commits June 20, 2026 17:28

cursor Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs

chatgpt-codex-connector Bot reviewed Jun 20, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs Outdated

ragnorc mentioned this pull request Jun 20, 2026

docs(rfc-013): latency = (serial_hops + ops/concurrency)·RTT — concurrency-cap correction + Lance-metadata comparison #292

Merged

ragnorc added 4 commits June 20, 2026 20:49

ragnorc changed the title ~~RFC-013 step 2a: compact the internal __manifest/_graph_commits tables~~ RFC-013 step 2a: compact the internal manifest/commit-graph tables in optimize Jun 21, 2026

cursor Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs

Comment thread crates/omnigraph/src/db/omnigraph/optimize.rs

ragnorc changed the title ~~RFC-013 step 2a: compact the internal manifest/commit-graph tables in optimize~~ (feat): compact the internal manifest/commit-graph tables in optimize Jun 21, 2026

ragnorc added 5 commits June 21, 2026 13:50

Merge branch 'main' into rfc-013-internal-table-compaction

edfc73d

ragnorc merged commit f2b792e into main Jun 21, 2026
8 checks passed

ragnorc mentioned this pull request Jun 22, 2026

fix(engine): optimize survives a cross-process write race on the same table #297

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(feat): compact the internal manifest/commit-graph tables in optimize#291

(feat): compact the internal manifest/commit-graph tables in optimize#291
ragnorc merged 14 commits into
mainfrom
rfc-013-internal-table-compaction

ragnorc commented Jun 20, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Uh oh!

Uh oh!

ragnorc commented Jun 21, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

ragnorc commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ragnorc commented Jun 20, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

Changes

Measured result (the LOCK, now green)

Scope — compaction only (cleanup deferred)

Tests

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Comments Outside Diff (1)

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ragnorc commented Jun 21, 2026

Correctness-by-design pass on internal-table compaction

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ragnorc commented Jun 21, 2026

Second review pass — two more findings, both valid, both addressed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ragnorc commented Jun 20, 2026 •

edited by greptile-apps Bot

Loading