perf: warm-on-write read caching + leaf-split to fix reindex-churn read regression by bplatz · Pull Request #1415 · fluree/db

bplatz · 2026-07-01T19:07:06Z

Problem

Under continuous writes with a low reindex_min_bytes, frequent reindexing was
hurting read latency instead of helping — the opposite of the expected
"binary index ≫ novelty query" behavior. A sweep showed the default (100 KB)
churn regime collapsing product-detail read latency ~30× versus a 2 MB setting.

Root cause was not cache invalidation (the shared cache is a single global Arc
and CID-keyed, so unchanged warmth survives a swap). It was that every reindex
swap left the touched artifacts cold — the incremental build decoded the leaves
it rewrote but never seeded the shared read cache — compounded by local-read I/O
amplification and unbounded trailing-leaf growth.

Changes

Warm-on-write: the incremental CoW build now seeds the shared LeafletCache
with the leaves it just wrote (decoded under ColumnSet::ALL) and warms the
reverse-dict leaves, so the first read after a swap hits warm instead of
cold-decoding. Wired through a late-bound WarmCacheSource at a single
chokepoint (start_background_indexing_dyn) so every build path is covered.

Superset cache fallback: a narrow (projection-keyed) read is served from a
cached ALL batch via ColumnBatch::project_to (invariant: cached ⊇ requested),
so warming does not duplicate column data across cache entries.

Bounded incremental leaves: oversized touched leaves are split during the
incremental build (gap-free, with first_key → next-first-key boundaries and
±∞ sentinels at the ends), fixing unbounded growth of the trailing leaf.

Local read path: mmap local leaf reads and cache the decoded directory.
Previously a local read did std::fs::read on the whole leaf to serve one of
its leaflets and re-decoded the directory, bypassing the CID-keyed LeafDir
cache.

Within-leaflet seek: binary-search the bound leading key within a leaflet
scan instead of scanning the leaflet linearly to find the subject's rows.

Bench

Adds reindex_swap_read_profile, a deterministic read-after-reindex-swap harness
that isolates the per-swap cold cost as a single number (not inferred from QMpH),
with first-touch and cold-subject controls.

Validation

Existing-data reads after a churn swap drop from ~3 ms to ~0.2–0.3 ms (~10×) in
the read-after-swap microbench, with cold-miss cache insertions driven toward
zero. Query correctness is unchanged (query and api group suites pass).

Scope note

A follow-on experiment — warming the per-generation query caches (overlay
translation + stats view) at index apply — was prototyped and dropped: it is flat
on BSBM update (the between-swap overlay is small and the query:swap ratio is
high, so the per-swap translation cost amortizes to noise), and the remaining
default-regime cost is the co-located background build competing for CPU, which is
a separate lever tracked for later.

Leaflet V3 column batches are cached under the projection (columns) a reader decoded, so a warm full-projection batch could not serve a narrower read. Add ColumnBatch::project_to and a superset fallback in try_get_or_decode_v3_batch: on an exact-key miss, serve a cached ColumnSet::ALL batch by projecting it down to the requested columns (invariant: cached columns must be a superset of requested; unrequested columns become AbsentDefault). Also add insert_leaf_dir/insert_v3_batch for writer-side seeding and a Debug impl for LeafletCache. This lets a single ALL batch per leaflet satisfy every projection and is the read-side half of warm-on-write.
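For readers following the superset fallback, here is a minimal standalone sketch of the idea. It is not the fluree/db code: Cols, Batch, Key and get_with_superset_fallback are hypothetical stand-ins for ColumnSet, ColumnBatch, V3BatchCacheKey and try_get_or_decode_v3_batch, a plain HashMap stands in for LeafletCache, and None plays the role the description gives to AbsentDefault.

```rust
use std::collections::HashMap;

/// Bitmask of column ids; an assumed stand-in for the PR's ColumnSet.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct Cols(pub u16);

impl Cols {
    /// Four illustrative columns; the real column count is not assumed here.
    pub const ALL: Cols = Cols(0b1111);

    pub fn is_superset_of(self, other: Cols) -> bool {
        self.0 & other.0 == other.0
    }
}

/// One decoded leaflet. `None` marks a column that was not requested.
#[derive(Clone, Debug, PartialEq)]
pub struct Batch {
    pub cols: Cols,
    pub data: Vec<Option<Vec<u64>>>, // indexed by column bit position
}

impl Batch {
    /// Project down to `want`. Returns None when the invariant
    /// (cached columns are a superset of requested) does not hold.
    pub fn project_to(&self, want: Cols) -> Option<Batch> {
        if !self.cols.is_superset_of(want) {
            return None;
        }
        let data = self
            .data
            .iter()
            .enumerate()
            .map(|(bit, col)| if (want.0 >> bit) & 1 == 1 { col.clone() } else { None })
            .collect();
        Some(Batch { cols: want, data })
    }
}

/// (leaf_id, leaflet_idx, columns), mirroring the key shape shown in the diff.
pub type Key = (u128, u32, Cols);

/// Exact-key hit first; on a miss, fall back to the ALL entry and project it.
pub fn get_with_superset_fallback(
    cache: &HashMap<Key, Batch>,
    leaf_id: u128,
    leaflet_idx: u32,
    want: Cols,
) -> Option<Batch> {
    if let Some(hit) = cache.get(&(leaf_id, leaflet_idx, want)) {
        return Some(hit.clone());
    }
    cache
        .get(&(leaf_id, leaflet_idx, Cols::ALL))
        .and_then(|all| all.project_to(want))
}
```

With only an ALL entry cached for a leaflet, a request for Cols(0b0101) is answered by projection, which is why a single warmed batch can serve every narrower read.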

The streaming copy-on-write path now seeds the shared LeafletCache with the leaflets it just wrote, decoded under ColumnSet::ALL, so a co-located query server's immediate read of a freshly-rewritten (new-CID) leaf hits the cache instead of re-reading and re-decoding from disk. The leaf id derivation matches the reader (xxh3_128 of the leaf CID). Gated behind a late-bound WarmCacheSource resolver on IndexerConfig (None by default); only the CoW update path warms, never fresh/full rebuilds (which would decode the whole graph).

Resolve the shared read cache for the background indexer from the running LedgerManager (LedgerManagerWarmCache), set once in start_background_indexing_dyn so every co-located build path warms the exact cache readers use; separate-machine indexers leave it unset. Add reindex_swap_read_profile, an instrumented harness that drives BSBM-shape write bursts and measures post-swap read latency, build cadence, and cache occupancy so warm-on-write is measurable independent of QMpH.

Extend warm-on-write to the reverse-dictionary tree: when the incremental CoW build writes a new reverse-dict leaf, seed the shared cache (DictLeaf) with its bytes so a co-located reader resolving a just-added IRI/string hits the cache instead of a cold read. Adds LeafletCache::insert_dict_leaf; the reader keys dict leaves on the CAS address string (cid.to_string()), which the warm key matches exactly. Threaded through the reverse-tree upload path and gated by the same warm_cache_source resolver (co-located only).
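The gating described above (an optional resolver that is unset for separate-machine indexers, and dict leaves keyed on the CID string) can be pictured roughly as below. This is a hedged sketch: Cache, WarmSource, ManagerWarmSource, Config and after_leaf_write are hypothetical names chosen for illustration and do not reproduce the signatures of LeafletCache, WarmCacheSource, LedgerManagerWarmCache or IndexerConfig.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Toy shared cache keyed by the CID string (assumed shape).
#[derive(Default)]
pub struct Cache {
    entries: Mutex<HashMap<String, Vec<u8>>>,
}

impl Cache {
    pub fn insert(&self, key: &str, bytes: &[u8]) {
        self.entries.lock().unwrap().insert(key.to_string(), bytes.to_vec());
    }

    pub fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.lock().unwrap().get(key).cloned()
    }
}

/// Late-bound resolver: asked at build time which cache (if any) to warm.
pub trait WarmSource: Send + Sync {
    fn resolve(&self) -> Option<Arc<Cache>>;
}

/// Co-located case: hand back the same cache instance the readers use.
pub struct ManagerWarmSource {
    pub cache: Arc<Cache>,
}

impl WarmSource for ManagerWarmSource {
    fn resolve(&self) -> Option<Arc<Cache>> {
        Some(Arc::clone(&self.cache))
    }
}

/// Indexer configuration; `None` (the default) disables warming.
#[derive(Default)]
pub struct Config {
    pub warm_source: Option<Arc<dyn WarmSource>>,
}

/// Called after a leaf is written. Returns true when the cache was seeded.
pub fn after_leaf_write(cfg: &Config, cid: &str, bytes: &[u8]) -> bool {
    match cfg.warm_source.as_ref().and_then(|s| s.resolve()) {
        Some(cache) => {
            // The writer uses the same key the reader will look up.
            cache.insert(cid, bytes);
            true
        }
        None => false,
    }
}
```

Keeping the resolver optional means the build path itself needs no knowledge of the deployment: a default Config simply never warms.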

Absorb the one-time apply/swap cost with a throwaway load before the measured reads, so latency reflects a query on the applied generation (as a client read does after the background listener applies) rather than the first-load apply cost.

The incremental copy-on-write build never split a touched leaf (leaf_target_rows was bumped to existing_total+novelty for CID stability), so appended high-SID subjects concentrated into an ever-growing trailing leaf whose whole-blob read + full directory decode inflate every point read. Now a touched leaf splits once its merged size exceeds 2x the leaf target (matching the config's leaf_max = 2*leaf_target and the full-build LeafWriter's greedy packing), into bounded ~target-sized leaves; below the ceiling it still grows in place to avoid churning the branch on small commits. Applies uniformly to every touched leaf (middle, leftmost, rightmost). Splits are gap-free: novelty is routed to leaves by first_key(next) half-open intervals (slice_novelty_to_leaves), and the leftmost/rightmost leaves keep their -inf/+inf coverage. Adds a test asserting a split preserves the full row count, keeps leaves strictly ordered and non-overlapping, and spans the full key range.
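A small sketch of the split rule and the half-open routing may help when reviewing the test described above. It assumes strictly increasing u64 keys and fixed-size chunking; LeafSpan, split_touched_leaf and route are hypothetical names, and the real slice_novelty_to_leaves and leaf packing are not reproduced here.

```rust
/// Key range covered by a leaf: [lo, hi), where None means -inf / +inf.
#[derive(Debug, Clone, PartialEq)]
pub struct LeafSpan {
    pub lo: Option<u64>,
    pub hi: Option<u64>,
    pub rows: Vec<u64>,
}

/// Split a touched leaf's merged, sorted rows once they exceed 2x the target.
/// At or below that ceiling the leaf grows in place (one span is returned).
pub fn split_touched_leaf(
    lo: Option<u64>,
    hi: Option<u64>,
    merged: Vec<u64>,
    target_rows: usize,
) -> Vec<LeafSpan> {
    assert!(target_rows > 0);
    if merged.len() <= 2 * target_rows {
        return vec![LeafSpan { lo, hi, rows: merged }];
    }
    let chunks: Vec<Vec<u64>> = merged.chunks(target_rows).map(|c| c.to_vec()).collect();
    let n = chunks.len();
    let firsts: Vec<u64> = chunks.iter().map(|c| c[0]).collect();
    chunks
        .into_iter()
        .enumerate()
        .map(|(i, rows)| LeafSpan {
            // The first piece inherits the original lower bound; later pieces
            // start at their own first key.
            lo: if i == 0 { lo } else { Some(firsts[i]) },
            // Each piece ends where the next begins; the last inherits `hi`.
            hi: if i + 1 < n { Some(firsts[i + 1]) } else { hi },
            rows,
        })
        .collect()
}

/// Index of the span whose half-open interval contains `key`.
/// Assumes ordered spans whose last upper bound is +inf (None).
pub fn route(spans: &[LeafSpan], key: u64) -> usize {
    spans.partition_point(|s| matches!(s.hi, Some(h) if h <= key))
}
```

Because every piece ends exactly where the next one starts, a novelty key equal to a boundary lands in the right-hand leaf and no key falls between leaves.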

Absorb the one-time apply cost with a throwaway load before the measured reads and time graph().load() separately from execute(), reporting query-only latency. This showed the read-after-swap cost lives in query execution (load is ~0.01ms) on leaves that grow with each incremental burst, not in apply or novelty.

The local leaf-open fast path used to std::fs::read the entire leaf blob into a fresh heap buffer on every read and re-decode the directory each time — cost that grows with the leaf and is re-paid per read. Add MmapLeafHandle: mmap the (immutable, content-addressed) leaf so raw bytes stay in OS page cache and only touched pages fault in (no whole-blob copy), and take the decoded directory as an Arc from the shared LeafletCache (parsed once per leaf CID). This also activates the LeafDir warm-on-write already seeded by the incremental build. Column data is still materialized once per leaflet via the V3Batch cache; raw leaf bytes are never copied into the cache budget.
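The "parsed once per leaf CID" half of this change can be illustrated without mmap (which needs an external crate and is left out here). The sketch below is an assumption-laden toy: DirCache, LeafDir, decode_dir and the little-endian offset format are invented for illustration and are not the PR's MmapLeafHandle or LeafletCache.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Decoded leaf directory: byte offsets of each leaflet (toy format).
#[derive(Debug, PartialEq)]
pub struct LeafDir {
    pub offsets: Vec<u32>,
}

/// CID-keyed cache of decoded directories, with a decode counter for tests.
#[derive(Default)]
pub struct DirCache {
    dirs: Mutex<HashMap<String, Arc<LeafDir>>>,
    pub decodes: Mutex<usize>,
}

impl DirCache {
    /// Return the cached directory for `cid`, decoding `raw` only on a miss.
    /// Leaves are immutable and content-addressed, so an entry never goes stale.
    pub fn get_or_decode(&self, cid: &str, raw: &[u8]) -> Arc<LeafDir> {
        let mut dirs = self.dirs.lock().unwrap();
        if let Some(dir) = dirs.get(cid) {
            return Arc::clone(dir);
        }
        *self.decodes.lock().unwrap() += 1;
        let dir = Arc::new(decode_dir(raw));
        dirs.insert(cid.to_string(), Arc::clone(&dir));
        dir
    }
}

/// Toy decoder: the directory is a sequence of little-endian u32 offsets.
fn decode_dir(raw: &[u8]) -> LeafDir {
    LeafDir {
        offsets: raw
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    }
}
```

Repeated reads of the same leaf share one Arc, so the per-read cost no longer scales with directory size.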

…let scan filter_batch scanned every row of a leaflet applying the row filter. Leaflet rows are sorted by the order's key, so when the leading sort column is pinned to a single value (e.g. a bound subject on a SPOT scan) binary-search its contiguous [start,end) range and scan only that, instead of the whole (possibly large) leaflet. Output is identical to a full scan — rows outside the range can't match a filter that pins the leading column — so replay/overlay downstream are unaffected. Falls back to a full scan when the leading column is unbound or not a materialized sorted block, so no rows are ever missed. Cuts a bound-subject point read from O(leaflet) to O(log leaflet + result); measured ~2x on warm base-subject reads in the read-after-swap bench.
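The seek described above amounts to two binary searches over the sorted leading column. A minimal sketch, assuming the leading column is a plain sorted slice of u64 subject ids; bound_subject_range and filter_rows are hypothetical names, not the PR's filter_batch:

```rust
use std::ops::Range;

/// Contiguous [start, end) range of rows whose leading key equals `s`.
/// `subjects` must be sorted ascending.
pub fn bound_subject_range(subjects: &[u64], s: u64) -> Range<usize> {
    // First row with key >= s.
    let start = subjects.partition_point(|&x| x < s);
    // Length of the run of rows equal to s, starting at `start`.
    let end = start + subjects[start..].partition_point(|&x| x == s);
    start..end
}

/// Row indices passing `keep`. With a bound subject only its range is scanned;
/// with no bound subject this falls back to a full scan.
pub fn filter_rows(
    subjects: &[u64],
    values: &[u64],
    bound_subject: Option<u64>,
    keep: impl Fn(u64) -> bool,
) -> Vec<usize> {
    let range = match bound_subject {
        Some(s) => bound_subject_range(subjects, s),
        None => 0..subjects.len(),
    };
    range.filter(|&i| keep(values[i])).collect()
}
```

Rows outside the returned range cannot have the bound subject, so restricting the scan to that range yields the same rows as filtering the whole leaflet on that subject.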

…ap bench read_new2 (second read of the same just-inserted product) isolates a first-touch effect from insertedness; read_old_random (a different existing product each burst) isolates cold-subject from just-inserted. Together they showed the read tail is not storage: read_old_random and read_new2 are both fast (~0.4ms) while read_new (first query of a fresh generation) is ~3.9ms — pinpointing the per-generation stats-view rebuild that the first query after each reindex swap pays.

aaj3f

This feels like a good find/catch!

aaj3f · 2026-07-05T03:16:09Z

+    // Warm-on-write (co-located only): seed the shared read cache with the
+    // leaflets we just wrote, from bytes already in hand, so the query server's
+    // immediate read of this new-CID leaf hits the cache instead of cold decode.
+    if let Some(cache) = warm_cache {
+        warm_leaf_into_cache(cache, &info.leaf_cid, &info.leaf_bytes);
+    }


warm_leaf_into_cache runs a synchronous full zstd decode of every leaflet of every rewritten leaf (load_leaflet_columns at line 3850) directly on the async uploader task. upload_one_leaf_blob is awaited inside the async move uploader loop in run_update_branch (line ~340), i.e. on a Tokio worker thread — not the spawn_blocking CoW task (line 308). In the co-located deployment (the only case where warm_cache is Some) that runtime also serves queries, so a churn build can stall query-serving worker threads on decompression — partially working against the latency win this PR chases. (Verified: impact is bounded — the uploader is a single sequential future, so at most one runtime worker is intermittently occupied by a decode (≈1/num_cpus of capacity). Real anti-pattern and worth fixing, but a recommended cleanup, not a hard merge blocker.) Move the decode off the async path:

    // in upload_one_leaf_blob, replace the inline call:
    if let Some(cache) = warm_cache {
        let cache = cache.clone(); // Arc<LeafletCache>
        let leaf_cid = info.leaf_cid.clone();
        let leaf_bytes = info.leaf_bytes.clone(); // or Arc the bytes to avoid the copy
        tokio::task::spawn_blocking(move || {
            warm_leaf_into_cache(&cache, &leaf_cid, &leaf_bytes);
        });
    }

(Spawn-and-forget keeps the upload off the critical path; or thread the warm into the existing spawn_blocking CoW where the bytes already exist. At minimum, wrap in spawn_blocking so a large leaf's decompression cannot block the async reactor.)

aaj3f · 2026-07-05T03:17:41Z

+    for (idx, entry) in dir.entries.iter().enumerate() {
+        if entry.row_count == 0 {
+            continue;
+        }
+        if let Ok(batch) = load_leaflet_columns(
+            leaf_bytes,
+            entry,
+            dir.payload_base,
+            &ColumnProjection::all(),
+            order,
+        ) {
+            cache.insert_v3_batch(
+                V3BatchCacheKey {
+                    leaf_id,
+                    leaflet_idx: idx as u32,
+                    columns: ColumnSet::ALL.0,
+                },
+                batch,
+            );
+        }
+    }


Warming inserts a ColumnSet::ALL batch per leaflet. ALL is wider than any read projection (for_scan never requests t unless time-travel), so warm-on-write pushes more bytes into the shared, TinyLFU-bounded cache than the reads it serves would — under heavy churn this can evict hot narrow entries that queries actually touch. This is the deliberate "one batch serves every projection" tradeoff, but consider (a) warming only when the leaf is small enough to fit a per-build warm budget, or (b) skipping the t column (warm ALL & !T) since reads rarely need it. Not a blocker; worth a budget knob.

    // consider a bounded warm: skip leaflets whose ALL-decode would exceed a
    // per-build byte budget, so a giant rewritten leaf can't evict the working set.
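One possible shape for such a budget knob, offered only as a sketch of the suggestion: WarmBudget and select_leaflets_to_warm are hypothetical, and the decoded sizes are assumed to be known or estimated up front (for example from directory metadata), which the PR does not confirm.

```rust
/// Tracks how many decoded bytes one build may still push into the shared cache.
pub struct WarmBudget {
    remaining: usize,
}

impl WarmBudget {
    pub fn new(max_bytes: usize) -> Self {
        Self { remaining: max_bytes }
    }

    /// Reserve `bytes` if they fit; otherwise reserve nothing and return false.
    pub fn try_reserve(&mut self, bytes: usize) -> bool {
        if bytes <= self.remaining {
            self.remaining -= bytes;
            true
        } else {
            false
        }
    }

    pub fn remaining(&self) -> usize {
        self.remaining
    }
}

/// Indices of the leaflets to warm. Empty leaflets are skipped, as are
/// leaflets whose estimated ALL-decode size no longer fits the budget.
pub fn select_leaflets_to_warm(decoded_sizes: &[usize], budget: &mut WarmBudget) -> Vec<usize> {
    decoded_sizes
        .iter()
        .enumerate()
        .filter(|&(_, &size)| size > 0 && budget.try_reserve(size))
        .map(|(idx, _)| idx)
        .collect()
}
```

Sharing one WarmBudget across all leaves of a build caps the total bytes inserted, so a single oversized leaf cannot displace the hot narrow entries.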

bplatz requested review from aaj3f and zonotope July 1, 2026 19:07

bplatz added 11 commits July 4, 2026 11:23

fmt

25d8c28

bplatz force-pushed the feature/warm-on-write-reindex-cache branch from 497a3be to 25d8c28 Compare July 4, 2026 15:26

bplatz changed the base branch from main to fix/filtered-delete-staging-hang July 4, 2026 15:26

aaj3f approved these changes Jul 5, 2026

bplatz wants to merge 11 commits into fix/filtered-delete-staging-hang from feature/warm-on-write-reindex-cache

2 participants
