Skip to content

docs(rfc-013): latency = (serial_hops + ops/concurrency)·RTT — concurrency-cap correction + Lance-metadata comparison#292

Merged
ragnorc merged 7 commits into
mainfrom
rfc-013-serial-hop-correction
Jun 21, 2026
Merged

docs(rfc-013): latency = (serial_hops + ops/concurrency)·RTT — concurrency-cap correction + Lance-metadata comparison#292
ragnorc merged 7 commits into
mainfrom
rfc-013-serial-hop-correction

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Stacked on #291 (the branch that holds the RFC). Doc-only correction to RFC-013's latency measurement framing, from a latency+concurrency study of the deployed edge binary (f6d2cc03, steps 1+3a landed) on rustfs.

The corrected latency model

wall ≈ (serial_hops + ops / effective_concurrency) · RTT + compute

Two measured findings:

  • §0(c) — serial backbone. Under unlimited concurrency, wall-clock is a ~110-hop serial backbone, depth-invariant; the depth-driven ops (+~7/depth, 107→716) parallelize away (max_inflight hit 129). → step 3b (parallel capture-once WriteTxn) is the lever.

  • §0(d) — concurrency-cap A/B (the key result). Under an R2-realistic cap (8), the internal-table fragment scan can't fan out, so op count re-enters wall-clock. With fragment count as the only variable:

    state per-write ops wall (cap=8, L=20)
    uncompacted (__manifest 137 frags) 1273 → 1487 → 3505 5.9 → 8.4 → 16.4s (runaway)
    after (feat): compact the internal manifest/commit-graph tables in optimize #291 optimize (137→1) 275 → 250 → 197 6.2 → 5.4 → 3.8s (bounded)

    This is the prod 35s mechanism and its degradation over time (the O(N²)): every uncompacted write scans all __manifest/_graph_commits fragments and adds one.

What this corrects (vs. my first draft of this PR)

My initial commit de-ranked step 2 as "parallel, irrelevant to latency" — that was an unlimited-concurrency artifact. On R2 (capped), step 2a is a primary latency lever and the anti-runaway fix. Both step 3b (serial backbone) and step 2a (op count) are load-bearing; which dominates is set by effective_concurrency × fragment_count.

Prod note: the deployed f6d2cc03 optimize is node/edge-only, so an operator optimize on prod cannot compact the internal tables — #291 (undeployed) is the immediate prod win.

Changes (doc only)

  • New §0(c)+(d) — serial-hop + concurrency-cap A/B; branch-op backbones.
  • §2.4 / §9 step 2 / Summary — reframe step 2 as concurrency-dependent (primary latency lever on R2), not de-ranked.
  • §5.1 — the cost-gate ThrottledStore must cap concurrency AND inject latency; assert serial-hops flat AND op-count flat in history.
  • §2.3 + §8 (new) — Lance/LanceDB comparison from 7.0.0 source: Lance metadata is a single-file per-version manifest read O(1) (latest_version_hint), pruned by default; omnigraph's __manifest-as-Lance-dataset scan is self-inflicted by the cross-table-atomicity choice. Adds an explicit defense of the Lance-dataset __manifest (MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path).

The design (§3/§4.1) is unchanged and vindicated — every finding maps onto an existing pillar (step 3b, step 2a, graph_head one-CAS, MTT exit). Only the measurement framing, step sizing, and one previously-implicit design choice are corrected.


Note

Low Risk
Single markdown file only; no code, APIs, or CI behavior changes.

Overview
Doc-only update to RFC-013 — reframes write-path latency measurement after rustfs/proxy studies on the deployed edge binary; no runtime or test changes.

The RFC now states wall-clock as (serial_hops + ops / effective_concurrency) · RTT + compute, not total ops × RTT. New §0(c) adds serial-backbone tables (~110 hops, depth-invariant under unlimited concurrency); §0(d) adds an R2-realistic concurrency cap=8 A/B showing uncompacted internal-table scans run away while #291 compaction bounds latency (~6×).

§2.4, §9 step 2, and the summary restore step 2a as a primary latency lever on capped stores (not “parallel-only”). §5.1 requires a ThrottledStore with both latency injection and concurrency cap, plus two LOCKs (serial hops flat; op count flat in history under cap). §2.3 and §8 add Lance O(1) metadata vs omnigraph catalog-as-dataset, and defend keeping __manifest as a Lance dataset for the MTT seam vs a flat CAS file. §9 step 6 moves branch-write cost to step 3b / fork seam.

Reviewed by Cursor Bugbot for commit 8c27303. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR corrects RFC-013's latency framing from a simple op-count model to (serial_hops + ops/effective_concurrency)·RTT + compute, backed by two new measurements on the deployed edge binary (f6d2cc03) behind a latency+concurrency proxy.

  • §0(c)+(d): Adds a serial-hop sweep (direct-main backbone ~110 hops, depth-invariant) and a concurrency-cap A/B (cap=8 mimicking R2) proving that uncompacted internal-table fragment scans run away under a cap (6→16 s) while (feat): compact the internal manifest/commit-graph tables in optimize #291's compaction bounds them (~3.8 s); reprioritizes step 2a from "parallel afterthought" to primary R2 latency lever.
  • §2.3/§8: New Lance 7.0.0 source comparison explaining why omnigraph's __manifest O(fragments) scan is self-inflicted (cross-table-atomicity choice) and defending the Lance-dataset catalog over a flat-file CAS manifest due to the MTT seam (lance#7264).
  • §5.1: ThrottledStore now required to cap concurrency and inject latency, with two explicit LOCKs: serial-hop flat in depth and op-count flat in history under a capped store.

Confidence Score: 5/5

Documentation-only change with no runtime or test code affected; safe to merge.

All changes are confined to a single RFC markdown file. The new measurement sections (§0(c)/(d)), Lance comparison (§2.3/§8), and §5.1 gate corrections document existing measured behavior and design rationale without touching any source code. Two internal data-consistency questions in the §0(d) table are worth clarifying before this becomes the authoritative measurement record, but they do not affect any deployed code.

docs/dev/rfc-013-write-path-latency.md — §0(d) measurement table (lines 200–203) and the max_inflight cross-reference between §0(c) and §0(d) (lines 175 vs 194).

Important Files Changed

Filename Overview
docs/dev/rfc-013-write-path-latency.md Doc-only reframe of RFC-013's latency model from op-count-serial to (serial_hops + ops/concurrency)·RTT; adds §0(c)/(d) measurements, §2.3/§8 Lance comparison, §5.1 two-LOCK gate, and §9 step-2 reprioritization. Two internal measurement inconsistencies: the §0(d) "after optimize" op sequence is decreasing (275→250→197) where the per-fragment-append model predicts growth, and max_inflight is cited as 65 in §0(c) but 129 for the same measurement in §0(d).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Write request"] --> B{Store concurrency?}
    B -->|"Unlimited (rustfs)"| C["Parallel fan-out\nmax_inflight up to 65–129"]
    B -->|"Capped at 8 (R2-realistic)"| D{"Fragment count?"}
    C --> E["Wall ≈ serial_backbone·RTT\n(~110 hops, depth-invariant)"]
    D -->|"Uncompacted\n(__manifest 137 frags)"| F["Ops 1273→3505\nwall 5.9→16.4s RUNAWAY"]
    D -->|"After #291 optimize\n(137→1 frag)"| G["Ops 275→197\nwall 6.2→3.8s BOUNDED"]
    E --> H["Lever: step 3b\ncapture-once WriteTxn"]
    F --> I["Lever: step 2a\ninternal-table compaction (#291)"]
    G --> J["Residual: step 3b\nserial backbone"]
    H --> K["§5.1 LOCKs:\nserial_hops flat in depth\nops flat in history under MAXCONC=8"]
    I --> K
    J --> K
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A["Write request"] --> B{Store concurrency?}
    B -->|"Unlimited (rustfs)"| C["Parallel fan-out\nmax_inflight up to 65–129"]
    B -->|"Capped at 8 (R2-realistic)"| D{"Fragment count?"}
    C --> E["Wall ≈ serial_backbone·RTT\n(~110 hops, depth-invariant)"]
    D -->|"Uncompacted\n(__manifest 137 frags)"| F["Ops 1273→3505\nwall 5.9→16.4s RUNAWAY"]
    D -->|"After #291 optimize\n(137→1 frag)"| G["Ops 275→197\nwall 6.2→3.8s BOUNDED"]
    E --> H["Lever: step 3b\ncapture-once WriteTxn"]
    F --> I["Lever: step 2a\ninternal-table compaction (#291)"]
    G --> J["Residual: step 3b\nserial backbone"]
    H --> K["§5.1 LOCKs:\nserial_hops flat in depth\nops flat in history under MAXCONC=8"]
    I --> K
    J --> K
Loading

Reviews (3): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

ragnorc added 5 commits June 20, 2026 17:28
…n optimize

`optimize` iterated node/edge catalog tables only, so the two internal system
tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and
were never compacted -- making every write's metadata scan O(fragments), which
grows forever on a long-lived graph (RFC-013 step 2).

`optimize_all_tables` now also compacts both internal tables via a new
`compact_internal_table`. They are not catalog-tracked (readers open them at
their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`:
compact in place, no manifest publish (nothing to publish to), no recovery
sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no
optimize_indices (they carry no Lance index, only object_id's unenforced-PK
metadata). No application lock: Lance's compact_files auto-retries its Rewrite
against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append
is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a
coordinator refresh afterwards makes the warm handle observe the compacted HEAD.

Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits
into __manifest -- a one-call throwaway for the full interim win; __manifest
compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup
(version GC) of the internal tables is deliberately NOT included here: it needs
the Q8 cleanup-resurrection watermark first (deferred).

maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds
optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar,
graph coherent for reads + strict writes after).
`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance
gate staged in PR #288. With internal-table compaction landed, a write's
__manifest/_graph_commits scan is flat in commit-history depth on a compacted
graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the
pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before
measuring and runs green every-PR.
- invariants.md: close the compaction half of the read-path-rederivation known
  gap (optimize now compacts the internal tables; cleanup half still deferred).
- maintenance.md: optimize covers __manifest/_graph_commits (no publish, no
  sidecar); not yet in cleanup.
- rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8
  watermark, deferred — debated; MTT-overlap + hot-path liability).
- testing.md: the internal-table LOCK is now green every-PR.
…bles

Addresses PR #291 review findings:

- Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction,
  but a graph can validly have none (the coordinator opens it as `Option`, gated on
  `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the
  absent table errored and failed the whole optimize. Guard the `_graph_commits`
  compaction with the same `storage_adapter().exists()` check the coordinator uses;
  `__manifest` always exists so it stays unguarded. Regression test
  `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish
  recreates the table before the guard).

- Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table
  compaction for a schema with no node/edge types. Removed it so the internal
  tables are compacted regardless of the data-table set.

- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default
  `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override,
  so on a graph storing an *on* auto_cleanup config the commit would fire version
  GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are
  safe; the only exposure is pre-fix upgraded graphs, identical to the existing
  data-table optimize path, with step 2b's watermark as the comprehensive guard.
  Added a comment in `compact_internal_table` recording this.
…kbone, not op count

Latency-slope measurement on the deployed edge binary (f6d2cc0, steps 1+3a
landed; rustfs + per-op latency proxy, depth 1..85) shows wall-clock is set by a
~110-hop SERIAL backbone that is depth-invariant. Total ops grow +~7/depth but
PARALLELIZE (parallelism 1->6), so the depth term adds little wall-clock.

- New §0(c): the serial-hop vs total-op finding + branch-op backbones
  (create ~77, delete ~87, branch-write ~258/1777-ops/21s floor = fork-on-first-write).
- §2.4: correct the '1720->198 ops => 258s->30s' op-count->wall-clock conversion.
- §5.1: promote serial-hop/num_stages to the PRIMARY latency LOCK; op-count
  flatness demoted to a cost/compute-floor gate.
- §9 step 2: reprioritized as Phase-7 prerequisite + compute-floor/space, NOT
  the wall-clock fix; step 3b (parallel capture-once WriteTxn) is the headline
  latency lever; branch-write moved under step 3b + fork seam.
- Summary: serial-backbone correction up front.

Vindicates the §3/§4.1 design; corrects the op-count latency framing.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a560297a1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/dev/rfc-013-write-path-latency.md Outdated
Comment on lines +587 to +589
assertion (`serial_hops ≤ K`, flat in depth) via a `ThrottledStore` wrapper that
injects per-op latency and reads the wall = `compute + serial_hops·L` slope (Lance's
`test_commit_iops` setup) — this is what catches the ~110-hop backbone §0(c) measured.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep the Tier-1 serial-hop gate deterministic

If this RFC is implemented literally, the every-PR Tier-1 gate would infer serial_hops by injecting latency and fitting a wall-clock slope, which conflicts with the same section's requirement that Tier 1 be deterministic/no-wall-clock and with §5.2 reserving wall-clock trend checks for nightly. On shared CI runners this would either make the PR gate slow/flaky or force implementers to choose arbitrary sleep values; the PR gate should count num_stages/critical-path events directly and leave latency-injection slope measurements to the Tier-2 benchmark.

Useful? React with 👍 / 👎.

Comment thread docs/dev/rfc-013-write-path-latency.md Outdated
Comment thread docs/dev/rfc-013-write-path-latency.md Outdated
Fold in two measured findings from the deployed edge binary (f6d2cc0) on
rustfs behind a latency+concurrency proxy:

- §0(d): concurrency-cap A/B. Under unlimited concurrency the internal-table
  scan parallelizes (backbone ~110); under an R2-realistic cap (8) it serializes
  and an UNCOMPACTED graph runs away (per-write ops 1273->3505, wall 6->16s),
  while #291's internal compaction cuts it ~6x and bounds it (137->1 frag). The
  latency model is (serial_hops + ops/effective_concurrency)*RTT + compute.
- Reframe step 2 across Summary/§2.4/§9: NOT de-ranked — on R2 (capped) it is a
  primary latency lever + the anti-runaway fix + Phase-7 prereq. The earlier
  'step 2 is parallel, irrelevant to latency' was an unlimited-concurrency
  artifact. Deployed f6d2cc0 optimize is node/edge-only; #291 (undeployed) is
  the prod win.
- §5.1: the cost-gate ThrottledStore must cap concurrency AND inject latency;
  assert serial_hops flat AND ops flat in history.
- §2.3 + §8: Lance/LanceDB comparison from 7.0.0 source — Lance metadata is a
  single-file per-version manifest read O(1) (latest_version_hint), pruned by
  default; omnigraph's __manifest-as-Lance-dataset scan is self-inflicted by the
  cross-table-atomicity choice. Adds explicit defense of Lance-dataset __manifest
  (MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path).

Design (§3/§4.1) unchanged and vindicated; corrections are measurement framing,
step sizing, and one design-choice that was implicit.
@ragnorc ragnorc changed the title docs(rfc-013): serial-hop correction — wall-clock is the ~110-hop backbone, not op count docs(rfc-013): latency = (serial_hops + ops/concurrency)·RTT — concurrency-cap correction + Lance-metadata comparison Jun 21, 2026
@ragnorc ragnorc changed the base branch from rfc-013-internal-table-compaction to main June 21, 2026 19:48
…rrection

# Conflicts:
#	crates/omnigraph/src/db/omnigraph/optimize.rs
#	crates/omnigraph/tests/maintenance.rs
#	crates/omnigraph/tests/write_cost.rs
#	docs/dev/rfc-013-write-path-latency.md
#	docs/user/operations/maintenance.md
@ragnorc ragnorc merged commit 5cfae9a into main Jun 21, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant