perf: warm-on-write read caching + leaf-split to fix reindex-churn read regression #1415
bplatz wants to merge 11 commits
Leaflet V3 column batches are cached under the projection (columns) a reader decoded, so a warm full-projection batch could not serve a narrower read. Add ColumnBatch::project_to and a superset fallback in try_get_or_decode_v3_batch: on an exact-key miss, serve a cached ColumnSet::ALL batch by projecting it down to the requested columns (invariant: cached columns must be a superset of requested; unrequested columns become AbsentDefault). Also add insert_leaf_dir/insert_v3_batch for writer-side seeding and a Debug impl for LeafletCache. This lets a single ALL batch per leaflet satisfy every projection and is the read-side half of warm-on-write.
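The superset fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the bitmask `ColumnSet`, the `Column` enum, the `Key` tuple, and `get_batch` are simplified stand-ins invented for the example, and a plain `HashMap` stands in for the real cache.

```rust
use std::collections::HashMap;

/// Hypothetical column bitmask: bit i set means column i is present.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ColumnSet(u8);

impl ColumnSet {
    const ALL: ColumnSet = ColumnSet(0b1111);

    fn is_superset_of(self, other: ColumnSet) -> bool {
        self.0 & other.0 == other.0
    }
}

/// A decoded column, or a placeholder for a column the reader did not request.
#[derive(Clone, Debug, PartialEq)]
enum Column {
    Data(Vec<u64>),
    AbsentDefault,
}

#[derive(Clone, Debug)]
struct ColumnBatch {
    columns: ColumnSet,
    cols: Vec<Column>, // indexed by column bit position
}

impl ColumnBatch {
    /// Narrow this batch to `requested`. Returns None when the invariant
    /// (cached columns are a superset of requested) does not hold.
    fn project_to(&self, requested: ColumnSet) -> Option<ColumnBatch> {
        if !self.columns.is_superset_of(requested) {
            return None;
        }
        let cols = self
            .cols
            .iter()
            .enumerate()
            .map(|(i, c)| {
                if requested.0 & (1u8 << i) != 0 {
                    c.clone()
                } else {
                    Column::AbsentDefault
                }
            })
            .collect();
        Some(ColumnBatch { columns: requested, cols })
    }
}

/// Assumed cache key shape: (leaf_id, leaflet_idx, columns).
type Key = (u128, u32, ColumnSet);

/// Exact-key lookup first; on a miss, project the cached ALL batch down.
fn get_batch(
    cache: &HashMap<Key, ColumnBatch>,
    leaf_id: u128,
    leaflet_idx: u32,
    want: ColumnSet,
) -> Option<ColumnBatch> {
    if let Some(hit) = cache.get(&(leaf_id, leaflet_idx, want)) {
        return Some(hit.clone());
    }
    cache.get(&(leaf_id, leaflet_idx, ColumnSet::ALL))?.project_to(want)
}
```

With only an ALL batch cached for a leaflet, a request for a two-column projection returns those two columns and marks the rest `AbsentDefault`; a request for an uncached leaflet still misses and would fall through to a decode.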
The streaming copy-on-write path now seeds the shared LeafletCache with the leaflets it just wrote, decoded under ColumnSet::ALL, so a co-located query server's immediate read of a freshly-rewritten (new-CID) leaf hits the cache instead of re-reading and re-decoding from disk. The leaf id derivation matches the reader (xxh3_128 of the leaf CID). Gated behind a late-bound WarmCacheSource resolver on IndexerConfig (None by default); only the CoW update path warms, never fresh/full rebuilds (which would decode the whole graph).
Resolve the shared read cache for the background indexer from the running LedgerManager (LedgerManagerWarmCache), set once in start_background_indexing_dyn so every co-located build path warms the exact cache readers use; separate-machine indexers leave it unset. Add reindex_swap_read_profile, an instrumented harness that drives BSBM-shape write bursts and measures post-swap read latency, build cadence, and cache occupancy so warm-on-write is measurable independent of QMpH.
Extend warm-on-write to the reverse-dictionary tree: when the incremental CoW build writes a new reverse-dict leaf, seed the shared cache (DictLeaf) with its bytes so a co-located reader resolving a just-added IRI/string hits the cache instead of a cold read. Adds LeafletCache::insert_dict_leaf; the reader keys dict leaves on the CAS address string (cid.to_string()), which the warm key matches exactly. Threaded through the reverse-tree upload path and gated by the same warm_cache_source resolver (co-located only).
Absorb the one-time apply/swap cost with a throwaway load before the measured reads, so latency reflects a query on the applied generation (as a client read does after the background listener applies) rather than the first-load apply cost.
The incremental copy-on-write build never split a touched leaf (leaf_target_rows was bumped to existing_total+novelty for CID stability), so appended high-SID subjects concentrated into an ever-growing trailing leaf whose whole-blob read + full directory decode inflate every point read. Now a touched leaf splits once its merged size exceeds 2x the leaf target (matching the config's leaf_max = 2*leaf_target and the full-build LeafWriter's greedy packing), into bounded ~target-sized leaves; below the ceiling it still grows in place to avoid churning the branch on small commits. Applies uniformly to every touched leaf (middle, leftmost, rightmost). Splits are gap-free: novelty is routed to leaves by first_key(next) half-open intervals (slice_novelty_to_leaves), and the leftmost/rightmost leaves keep their -inf/+inf coverage. Adds a test asserting a split preserves the full row count, keeps leaves strictly ordered and non-overlapping, and spans the full key range.
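As a rough illustration of the split rule above (grow in place up to 2x the target, then cut into bounded, roughly target-sized leaves), here is a minimal sketch. `plan_leaf_split` is a hypothetical helper, and it divides rows evenly rather than reproducing the greedy packing the commit message attributes to LeafWriter.

```rust
/// Decide how a touched leaf's merged rows are cut into output leaves.
/// Returns the row count of each output leaf, in key order.
/// Hypothetical helper: uses an even split, not LeafWriter's greedy packing.
fn plan_leaf_split(merged_rows: usize, leaf_target: usize) -> Vec<usize> {
    assert!(leaf_target > 0, "leaf_target must be positive");
    let leaf_max = 2 * leaf_target;

    // At or below the ceiling the leaf grows in place, so small commits
    // do not churn the branch.
    if merged_rows <= leaf_max {
        return vec![merged_rows];
    }

    // Above the ceiling, pick the piece count that keeps each piece
    // closest to the target (round to nearest), then spread the remainder.
    let pieces = (merged_rows + leaf_target / 2) / leaf_target;
    let base = merged_rows / pieces;
    let extra = merged_rows % pieces;
    (0..pieces).map(|i| base + usize::from(i < extra)).collect()
}
```

For a target of 100 rows, a merged leaf of 200 rows stays whole, while 250 rows becomes three leaves of 84, 83, and 83 rows; the total row count is preserved in every case.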
Absorb the one-time apply cost with a throwaway load before the measured reads and time graph().load() separately from execute(), reporting query-only latency. This showed the read-after-swap cost lives in query execution (load is ~0.01ms) on leaves that grow with each incremental burst, not in apply or novelty.
The local leaf-open fast path used to std::fs::read the entire leaf blob into a fresh heap buffer on every read and re-decode the directory each time — cost that grows with the leaf and is re-paid per read. Add MmapLeafHandle: mmap the (immutable, content-addressed) leaf so raw bytes stay in OS page cache and only touched pages fault in (no whole-blob copy), and take the decoded directory as an Arc from the shared LeafletCache (parsed once per leaf CID). This also activates the LeafDir warm-on-write already seeded by the incremental build. Column data is still materialized once per leaflet via the V3Batch cache; raw leaf bytes are never copied into the cache budget.
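The "parsed once per leaf CID" part of this change can be pictured with a small sketch. `DirCache`, `LeafDir`, and `get_or_parse` below are hypothetical simplifications (a mutex-guarded map rather than the real LeafletCache), and the mmap side is left out because it needs an external crate.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical decoded leaf directory.
#[derive(Debug)]
struct LeafDir {
    entry_count: usize,
}

/// Hypothetical CID-keyed directory cache shared between readers and the writer.
#[derive(Default)]
struct DirCache {
    dirs: Mutex<HashMap<String, Arc<LeafDir>>>,
}

impl DirCache {
    /// Return the cached directory for `cid`, parsing it only on first use.
    /// Leaves are immutable and content-addressed, so an entry never goes stale.
    /// For simplicity the lock is held while parsing; a real cache would avoid that.
    fn get_or_parse(&self, cid: &str, parse: impl FnOnce() -> LeafDir) -> Arc<LeafDir> {
        let mut dirs = self.dirs.lock().unwrap();
        if let Some(dir) = dirs.get(cid) {
            return Arc::clone(dir);
        }
        let dir = Arc::new(parse());
        dirs.insert(cid.to_string(), Arc::clone(&dir));
        dir
    }

    /// Writer-side seeding: insert a directory the build already has in hand.
    fn insert(&self, cid: &str, dir: LeafDir) {
        self.dirs.lock().unwrap().insert(cid.to_string(), Arc::new(dir));
    }
}
```

A second open of the same CID returns the same `Arc` without calling the parser, and a CID seeded by the writer never invokes the parser at all, which is the effect warm-on-write relies on.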
…let scan filter_batch scanned every row of a leaflet applying the row filter. Leaflet rows are sorted by the order's key, so when the leading sort column is pinned to a single value (e.g. a bound subject on a SPOT scan) binary-search its contiguous [start,end) range and scan only that, instead of the whole (possibly large) leaflet. Output is identical to a full scan — rows outside the range can't match a filter that pins the leading column — so replay/overlay downstream are unaffected. Falls back to a full scan when the leading column is unbound or not a materialized sorted block, so no rows are ever missed. Cuts a bound-subject point read from O(leaflet) to O(log leaflet + result); measured ~2x on warm base-subject reads in the read-after-swap bench.
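The range narrowing above amounts to two binary searches over the sorted leading column. A minimal sketch, assuming the leading column is available as a plain sorted `&[u64]`; `pinned_range`, `scan_pinned`, and `scan_full` are illustrative names, not functions from the PR.

```rust
/// Half-open row range [start, end) whose leading column equals `pinned`.
/// Requires `leading` to be sorted ascending.
fn pinned_range(leading: &[u64], pinned: u64) -> (usize, usize) {
    let start = leading.partition_point(|&v| v < pinned);
    // Within leading[start..], all rows equal to `pinned` come first.
    let end = start + leading[start..].partition_point(|&v| v == pinned);
    (start, end)
}

/// Narrowed scan: apply the residual filter only inside the pinned range.
fn scan_pinned(
    leading: &[u64],
    other: &[u64],
    pinned: u64,
    pred: impl Fn(u64) -> bool,
) -> Vec<usize> {
    let (start, end) = pinned_range(leading, pinned);
    (start..end).filter(|&i| pred(other[i])).collect()
}

/// Reference full scan, for comparison: checks every row.
fn scan_full(
    leading: &[u64],
    other: &[u64],
    pinned: u64,
    pred: impl Fn(u64) -> bool,
) -> Vec<usize> {
    (0..leading.len())
        .filter(|&i| leading[i] == pinned && pred(other[i]))
        .collect()
}
```

Both scans return the same row indices; the narrowed one touches only the rows in the pinned range plus the binary-search probes, which is where the O(log leaflet + result) bound comes from.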
…ap bench read_new2 (second read of the same just-inserted product) isolates a first-touch effect from insertedness; read_old_random (a different existing product each burst) isolates cold-subject from just-inserted. Together they showed the read tail is not storage: read_old_random and read_new2 are both fast (~0.4ms) while read_new (first query of a fresh generation) is ~3.9ms — pinpointing the per-generation stats-view rebuild that the first query after each reindex swap pays.
497a3be to 25d8c28
aaj3f left a comment
This feels like a good find/catch!
    // Warm-on-write (co-located only): seed the shared read cache with the
    // leaflets we just wrote, from bytes already in hand, so the query server's
    // immediate read of this new-CID leaf hits the cache instead of cold decode.
    if let Some(cache) = warm_cache {
        warm_leaf_into_cache(cache, &info.leaf_cid, &info.leaf_bytes);
    }
warm_leaf_into_cache runs a synchronous full zstd decode of every leaflet of every rewritten leaf (load_leaflet_columns at line 3850) directly on the async uploader task. upload_one_leaf_blob is awaited inside the async move uploader loop in run_update_branch (line ~340), i.e. on a Tokio worker thread — not the spawn_blocking CoW task (line 308). In the co-located deployment (the only case where warm_cache is Some) that runtime also serves queries, so a churn build can stall query-serving worker threads on decompression — partially working against the latency win this PR chases. (Verified: impact is bounded — the uploader is a single sequential future, so at most one runtime worker is intermittently occupied by a decode (≈1/num_cpus of capacity). Real anti-pattern and worth fixing, but a recommended cleanup, not a hard merge blocker.) Move the decode off the async path:
    // In upload_one_leaf_blob, replace the inline call:
    if let Some(cache) = warm_cache {
        let cache = cache.clone(); // Arc<LeafletCache>
        let leaf_cid = info.leaf_cid.clone();
        let leaf_bytes = info.leaf_bytes.clone(); // or Arc the bytes to avoid the copy
        // The JoinHandle is dropped, so the upload does not wait on the decode.
        tokio::task::spawn_blocking(move || {
            warm_leaf_into_cache(&cache, &leaf_cid, &leaf_bytes);
        });
    }

(Spawn-and-forget keeps the warm decode off the upload's critical path; alternatively, thread the warm into the existing spawn_blocking CoW task, where the bytes already exist. At minimum, wrap it in spawn_blocking so a large leaf's decompression cannot block the async reactor.)
    for (idx, entry) in dir.entries.iter().enumerate() {
        if entry.row_count == 0 {
            continue;
        }
        if let Ok(batch) = load_leaflet_columns(
            leaf_bytes,
            entry,
            dir.payload_base,
            &ColumnProjection::all(),
            order,
        ) {
            cache.insert_v3_batch(
                V3BatchCacheKey {
                    leaf_id,
                    leaflet_idx: idx as u32,
                    columns: ColumnSet::ALL.0,
                },
                batch,
            );
        }
    }
Warming inserts a ColumnSet::ALL batch per leaflet. ALL is wider than any read projection (for_scan never requests t unless time-travel), so warm-on-write pushes more bytes into the shared, TinyLFU-bounded cache than the reads it serves would — under heavy churn this can evict hot narrow entries that queries actually touch. This is the deliberate "one batch serves every projection" tradeoff, but consider (a) warming only when the leaf is small enough to fit a per-build warm budget, or (b) skipping the t column (warm ALL & !T) since reads rarely need it. Not a blocker; worth a budget knob.
// consider a bounded warm: skip leaflets whose ALL-decode would exceed a
// per-build byte budget, so a giant rewritten leaf can't evict the working set.
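One possible shape for such a budget knob, as a sketch only: `WarmBudget` and `select_leaflets_to_warm` are hypothetical names, and the decoded sizes are assumed to be known (or estimated) before the batch is inserted.

```rust
/// Hypothetical per-build warm budget, measured in decoded bytes.
struct WarmBudget {
    remaining: usize,
}

impl WarmBudget {
    fn new(limit_bytes: usize) -> Self {
        WarmBudget { remaining: limit_bytes }
    }

    /// Charge the budget if the batch fits; otherwise leave it untouched
    /// and signal that this leaflet should be skipped.
    fn try_charge(&mut self, decoded_bytes: usize) -> bool {
        if decoded_bytes > self.remaining {
            return false;
        }
        self.remaining -= decoded_bytes;
        true
    }
}

/// Return the indices of leaflets that may be warmed under `limit_bytes`.
/// Skipped leaflets are simply decoded on first read, as before.
fn select_leaflets_to_warm(decoded_sizes: &[usize], limit_bytes: usize) -> Vec<usize> {
    let mut budget = WarmBudget::new(limit_bytes);
    let mut selected = Vec::new();
    for (idx, &size) in decoded_sizes.iter().enumerate() {
        if budget.try_charge(size) {
            selected.push(idx);
        }
    }
    selected
}
```

With a 100-byte budget and leaflets of 40, 100, 30, and 50 decoded bytes, only leaflets 0 and 2 are warmed; the oversized one cannot push the working set out of the cache.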
Problem
Under continuous writes with a low reindex_min_bytes, frequent reindexing was
hurting read latency instead of helping, the opposite of the expected
"binary index ≫ novelty query" behavior. A sweep showed the default (100 KB)
churn regime degrading product-detail read latency ~30× versus a 2 MB setting.

Root cause was not cache invalidation (the shared cache is a single global
Arc and CID-keyed, so unchanged warmth survives a swap). It was that every
reindex swap left the touched artifacts cold (the incremental build decoded the
leaves it rewrote but never seeded the shared read cache), compounded by
local-read I/O amplification and unbounded trailing-leaf growth.
Changes
- Warm-on-write: the incremental build seeds the shared LeafletCache with the
  leaves it just wrote (decoded under ColumnSet::ALL) and warms the reverse-dict
  leaves, so the first read after a swap hits warm instead of cold-decoding.
  Wired through a late-bound WarmCacheSource at a single chokepoint
  (start_background_indexing_dyn) so every build path is covered.
- Superset projection: a narrower read is served from the cached ALL batch via
  ColumnBatch::project_to (invariant: cached ⊇ requested), so warming does not
  duplicate column data across cache entries.
- Leaf split: oversized touched leaves are split in the incremental build,
  gap-free, with first_key → next-first-key boundaries and ±∞ sentinels at the
  ends, fixing unbounded growth of the trailing leaf.
- Mmap leaf open: local reads mmap the leaf and take the decoded directory from
  the shared cache. Previously a local read did std::fs::read on the whole leaf
  to serve one of its leaflets and re-decoded the directory, bypassing the
  CID-keyed LeafDir cache.
- Bound-subject range scan: binary-searches the bound subject's row range in a
  scan instead of scanning the leaflet linearly to find the subject's rows.
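To make the gap-free boundary rule concrete, here is a minimal sketch of routing sorted novelty keys to leaves by half-open [first_key, next first_key) intervals, with the first and last leaves absorbing everything below and above. `route_novelty` is a hypothetical illustration over plain `u64` keys, not the PR's slice_novelty_to_leaves.

```rust
/// Route sorted novelty keys to leaves.
///
/// `first_keys[i]` is the first key of leaf i (ascending). Leaf i covers
/// [first_keys[i], first_keys[i + 1]); leaf 0 also takes every key below
/// its first key (-inf), and the last leaf takes every remaining key (+inf).
/// Returns one slice of `novelty` per leaf, in leaf order.
fn route_novelty<'a>(novelty: &'a [u64], first_keys: &[u64]) -> Vec<&'a [u64]> {
    assert!(!first_keys.is_empty(), "need at least one leaf");
    let mut routed = Vec::with_capacity(first_keys.len());
    let mut start = 0;
    for next_first in first_keys.iter().skip(1) {
        // Keys strictly below the next leaf's first key stay in the current leaf.
        let end = start + novelty[start..].partition_point(|k| k < next_first);
        routed.push(&novelty[start..end]);
        start = end;
    }
    // The rightmost leaf takes the tail, however large the keys are.
    routed.push(&novelty[start..]);
    routed
}
```

Because each slice begins exactly where the previous one ends, every novelty key lands in exactly one leaf: no key is dropped at a boundary and none is counted twice.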
Bench
Adds reindex_swap_read_profile, a deterministic read-after-reindex-swap harness
that isolates the per-swap cold cost as a single number (not inferred from QMpH),
with first-touch and cold-subject controls.
Validation
Existing-data reads after a churn swap drop from ~3 ms to ~0.2–0.3 ms (~10×) in
the read-after-swap microbench, with cold-miss cache insertions driven toward
zero. Query correctness is unchanged (query and api group suites pass).
Scope note
A follow-on experiment — warming the per-generation query caches (overlay
translation + stats view) at index apply — was prototyped and dropped: it is flat
on BSBM update (the between-swap overlay is small and the query:swap ratio is
high, so the per-swap translation cost amortizes to noise), and the remaining
default-regime cost is the co-located background build competing for CPU, which is
a separate lever tracked for later.