docs(rfc-013): latency = (serial_hops + ops/concurrency)·RTT — concurrency-cap correction + Lance-metadata comparison#292
Conversation
…n optimize `optimize` iterated node/edge catalog tables only, so the two internal system tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and were never compacted -- making every write's metadata scan O(fragments), which grows forever on a long-lived graph (RFC-013 step 2). `optimize_all_tables` now also compacts both internal tables via a new `compact_internal_table`. They are not catalog-tracked (readers open them at their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`: compact in place, no manifest publish (nothing to publish to), no recovery sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no optimize_indices (they carry no Lance index, only object_id's unenforced-PK metadata). No application lock: Lance's compact_files auto-retries its Rewrite against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a coordinator refresh afterwards makes the warm handle observe the compacted HEAD. Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits into __manifest -- a one-call throwaway for the full interim win; __manifest compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup (version GC) of the internal tables is deliberately NOT included here: it needs the Q8 cleanup-resurrection watermark first (deferred). maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar, graph coherent for reads + strict writes after).
`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance gate staged in PR #288. With internal-table compaction landed, a write's __manifest/_graph_commits scan is flat in commit-history depth on a compacted graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before measuring and runs green every-PR.
- invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR.
…bles Addresses PR #291 review findings: - Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction, but a graph can validly have none (the coordinator opens it as `Option`, gated on `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the absent table errored and failed the whole optimize. Guard the `_graph_commits` compaction with the same `storage_adapter().exists()` check the coordinator uses; `__manifest` always exists so it stays unguarded. Regression test `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish recreates the table before the guard). - Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table compaction for a schema with no node/edge types. Removed it so the internal tables are compacted regardless of the data-table set. - Codex (auto-cleanup, P1): documented — `compact_files` commits with a default `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override, so on a graph storing an *on* auto_cleanup config the commit would fire version GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are safe; the only exposure is pre-fix upgraded graphs, identical to the existing data-table optimize path, with step 2b's watermark as the comprehensive guard. Added a comment in `compact_internal_table` recording this.
…kbone, not op count Latency-slope measurement on the deployed edge binary (f6d2cc0, steps 1+3a landed; rustfs + per-op latency proxy, depth 1..85) shows wall-clock is set by a ~110-hop SERIAL backbone that is depth-invariant. Total ops grow +~7/depth but PARALLELIZE (parallelism 1->6), so the depth term adds little wall-clock. - New §0(c): the serial-hop vs total-op finding + branch-op backbones (create ~77, delete ~87, branch-write ~258/1777-ops/21s floor = fork-on-first-write). - §2.4: correct the '1720->198 ops => 258s->30s' op-count->wall-clock conversion. - §5.1: promote serial-hop/num_stages to the PRIMARY latency LOCK; op-count flatness demoted to a cost/compute-floor gate. - §9 step 2: reprioritized as Phase-7 prerequisite + compute-floor/space, NOT the wall-clock fix; step 3b (parallel capture-once WriteTxn) is the headline latency lever; branch-write moved under step 3b + fork seam. - Summary: serial-backbone correction up front. Vindicates the §3/§4.1 design; corrects the op-count latency framing.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a560297a1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| assertion (`serial_hops ≤ K`, flat in depth) via a `ThrottledStore` wrapper that | ||
| injects per-op latency and reads the wall = `compute + serial_hops·L` slope (Lance's | ||
| `test_commit_iops` setup) — this is what catches the ~110-hop backbone §0(c) measured. |
There was a problem hiding this comment.
Keep the Tier-1 serial-hop gate deterministic
If this RFC is implemented literally, the every-PR Tier-1 gate would infer serial_hops by injecting latency and fitting a wall-clock slope, which conflicts with the same section's requirement that Tier 1 be deterministic/no-wall-clock and with §5.2 reserving wall-clock trend checks for nightly. On shared CI runners this would either make the PR gate slow/flaky or force implementers to choose arbitrary sleep values; the PR gate should count num_stages/critical-path events directly and leave latency-injection slope measurements to the Tier-2 benchmark.
Useful? React with 👍 / 👎.
Fold in two measured findings from the deployed edge binary (f6d2cc0) on rustfs behind a latency+concurrency proxy: - §0(d): concurrency-cap A/B. Under unlimited concurrency the internal-table scan parallelizes (backbone ~110); under an R2-realistic cap (8) it serializes and an UNCOMPACTED graph runs away (per-write ops 1273->3505, wall 6->16s), while #291's internal compaction cuts it ~6x and bounds it (137->1 frag). The latency model is (serial_hops + ops/effective_concurrency)*RTT + compute. - Reframe step 2 across Summary/§2.4/§9: NOT de-ranked — on R2 (capped) it is a primary latency lever + the anti-runaway fix + Phase-7 prereq. The earlier 'step 2 is parallel, irrelevant to latency' was an unlimited-concurrency artifact. Deployed f6d2cc0 optimize is node/edge-only; #291 (undeployed) is the prod win. - §5.1: the cost-gate ThrottledStore must cap concurrency AND inject latency; assert serial_hops flat AND ops flat in history. - §2.3 + §8: Lance/LanceDB comparison from 7.0.0 source — Lance metadata is a single-file per-version manifest read O(1) (latest_version_hint), pruned by default; omnigraph's __manifest-as-Lance-dataset scan is self-inflicted by the cross-table-atomicity choice. Adds explicit defense of Lance-dataset __manifest (MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path). Design (§3/§4.1) unchanged and vindicated; corrections are measurement framing, step sizing, and one design-choice that was implicit.
…rrection # Conflicts: # crates/omnigraph/src/db/omnigraph/optimize.rs # crates/omnigraph/tests/maintenance.rs # crates/omnigraph/tests/write_cost.rs # docs/dev/rfc-013-write-path-latency.md # docs/user/operations/maintenance.md
Stacked on #291 (the branch that holds the RFC). Doc-only correction to RFC-013's latency measurement framing, from a latency+concurrency study of the deployed edge binary (
f6d2cc03, steps 1+3a landed) on rustfs.The corrected latency model
wall ≈ (serial_hops + ops / effective_concurrency) · RTT + computeTwo measured findings:
§0(c) — serial backbone. Under unlimited concurrency, wall-clock is a ~110-hop serial backbone, depth-invariant; the depth-driven ops (+~7/depth, 107→716) parallelize away (
max_inflighthit 129). → step 3b (parallel capture-onceWriteTxn) is the lever.§0(d) — concurrency-cap A/B (the key result). Under an R2-realistic cap (8), the internal-table fragment scan can't fan out, so op count re-enters wall-clock. With fragment count as the only variable:
__manifest137 frags)optimize(137→1)This is the prod 35s mechanism and its degradation over time (the O(N²)): every uncompacted write scans all
__manifest/_graph_commitsfragments and adds one.What this corrects (vs. my first draft of this PR)
My initial commit de-ranked step 2 as "parallel, irrelevant to latency" — that was an unlimited-concurrency artifact. On R2 (capped), step 2a is a primary latency lever and the anti-runaway fix. Both step 3b (serial backbone) and step 2a (op count) are load-bearing; which dominates is set by
effective_concurrency × fragment_count.Prod note: the deployed
f6d2cc03optimize is node/edge-only, so an operatoroptimizeon prod cannot compact the internal tables — #291 (undeployed) is the immediate prod win.Changes (doc only)
ThrottledStoremust cap concurrency AND inject latency; assert serial-hops flat AND op-count flat in history.latest_version_hint), pruned by default; omnigraph's__manifest-as-Lance-dataset scan is self-inflicted by the cross-table-atomicity choice. Adds an explicit defense of the Lance-dataset__manifest(MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path).The design (§3/§4.1) is unchanged and vindicated — every finding maps onto an existing pillar (step 3b, step 2a,
graph_headone-CAS, MTT exit). Only the measurement framing, step sizing, and one previously-implicit design choice are corrected.Note
Low Risk
Single markdown file only; no code, APIs, or CI behavior changes.
Overview
Doc-only update to RFC-013 — reframes write-path latency measurement after rustfs/proxy studies on the deployed edge binary; no runtime or test changes.
The RFC now states wall-clock as
(serial_hops + ops / effective_concurrency) · RTT + compute, not total ops × RTT. New §0(c) adds serial-backbone tables (~110 hops, depth-invariant under unlimited concurrency); §0(d) adds an R2-realistic concurrency cap=8 A/B showing uncompacted internal-table scans run away while #291 compaction bounds latency (~6×).§2.4, §9 step 2, and the summary restore step 2a as a primary latency lever on capped stores (not “parallel-only”). §5.1 requires a
ThrottledStorewith both latency injection and concurrency cap, plus two LOCKs (serial hops flat; op count flat in history under cap). §2.3 and §8 add Lance O(1) metadata vs omnigraph catalog-as-dataset, and defend keeping__manifestas a Lance dataset for the MTT seam vs a flat CAS file. §9 step 6 moves branch-write cost to step 3b / fork seam.Reviewed by Cursor Bugbot for commit 8c27303. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR corrects RFC-013's latency framing from a simple op-count model to
(serial_hops + ops/effective_concurrency)·RTT + compute, backed by two new measurements on the deployed edge binary (f6d2cc03) behind a latency+concurrency proxy.__manifestO(fragments) scan is self-inflicted (cross-table-atomicity choice) and defending the Lance-dataset catalog over a flat-file CAS manifest due to the MTT seam (lance#7264).ThrottledStorenow required to cap concurrency and inject latency, with two explicit LOCKs: serial-hop flat in depth and op-count flat in history under a capped store.Confidence Score: 5/5
Documentation-only change with no runtime or test code affected; safe to merge.
All changes are confined to a single RFC markdown file. The new measurement sections (§0(c)/(d)), Lance comparison (§2.3/§8), and §5.1 gate corrections document existing measured behavior and design rationale without touching any source code. Two internal data-consistency questions in the §0(d) table are worth clarifying before this becomes the authoritative measurement record, but they do not affect any deployed code.
docs/dev/rfc-013-write-path-latency.md — §0(d) measurement table (lines 200–203) and the max_inflight cross-reference between §0(c) and §0(d) (lines 175 vs 194).
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["Write request"] --> B{Store concurrency?} B -->|"Unlimited (rustfs)"| C["Parallel fan-out\nmax_inflight up to 65–129"] B -->|"Capped at 8 (R2-realistic)"| D{"Fragment count?"} C --> E["Wall ≈ serial_backbone·RTT\n(~110 hops, depth-invariant)"] D -->|"Uncompacted\n(__manifest 137 frags)"| F["Ops 1273→3505\nwall 5.9→16.4s RUNAWAY"] D -->|"After #291 optimize\n(137→1 frag)"| G["Ops 275→197\nwall 6.2→3.8s BOUNDED"] E --> H["Lever: step 3b\ncapture-once WriteTxn"] F --> I["Lever: step 2a\ninternal-table compaction (#291)"] G --> J["Residual: step 3b\nserial backbone"] H --> K["§5.1 LOCKs:\nserial_hops flat in depth\nops flat in history under MAXCONC=8"] I --> K J --> K%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A["Write request"] --> B{Store concurrency?} B -->|"Unlimited (rustfs)"| C["Parallel fan-out\nmax_inflight up to 65–129"] B -->|"Capped at 8 (R2-realistic)"| D{"Fragment count?"} C --> E["Wall ≈ serial_backbone·RTT\n(~110 hops, depth-invariant)"] D -->|"Uncompacted\n(__manifest 137 frags)"| F["Ops 1273→3505\nwall 5.9→16.4s RUNAWAY"] D -->|"After #291 optimize\n(137→1 frag)"| G["Ops 275→197\nwall 6.2→3.8s BOUNDED"] E --> H["Lever: step 3b\ncapture-once WriteTxn"] F --> I["Lever: step 2a\ninternal-table compaction (#291)"] G --> J["Residual: step 3b\nserial backbone"] H --> K["§5.1 LOCKs:\nserial_hops flat in depth\nops flat in history under MAXCONC=8"] I --> K J --> KReviews (3): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile