From 6ce11597d64945f6daa7b003b21fcfc220c3eb43 Mon Sep 17 00:00:00 2001
From: tamirms
Date: Mon, 15 Jun 2026 19:26:24 +0200
Subject: [PATCH 01/18] docs(full-history): streaming + getTransaction design docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add the full-history RPC design docs under design-docs/:

- full-history-streaming-workflow.md — the streaming daemon: geometry,
  the meta store and the one write protocol, catch-up, hot-DB ingestion,
  the lifecycle tick (freeze -> rebuild -> discard -> prune), the reader
  retention contract, and the correctness invariants (INV-1..4) with
  audits.
- gettransaction-full-history-design.md — the transaction-by-hash
  subsystem: the hot txhash CF, the .bin/.idx streamhash formats, the
  rolling window-index build protocol (pseudocode, crash matrix,
  rewriting-safety argument), the getTransaction read path, and capacity
  numbers.
- full-history-design-explorer.html — a self-contained interactive
  companion to both docs.

Docs live in the top-level design-docs/ alongside the events and
packfile designs; the standalone backfill workflow doc is retained as
historical and is subsumed by the streaming doc.

Docs only — no code changes.
Co-Authored-By: Claude Fable 5
---
 design-docs/full-history-design-explorer.html | 1611 +++++++++++++++++
 .../full-history-streaming-workflow.md        | 1274 +++++++++++++
 .../gettransaction-full-history-design.md     |  362 ++++
 full-history/design-docs/README.md            |   26 -
 4 files changed, 3247 insertions(+), 26 deletions(-)
 create mode 100644 design-docs/full-history-design-explorer.html
 create mode 100644 design-docs/full-history-streaming-workflow.md
 create mode 100644 design-docs/gettransaction-full-history-design.md
 delete mode 100644 full-history/design-docs/README.md

diff --git a/design-docs/full-history-design-explorer.html b/design-docs/full-history-design-explorer.html
new file mode 100644
index 000000000..28fc72614
--- /dev/null
+++ b/design-docs/full-history-design-explorer.html
@@ -0,0 +1,1611 @@
+Full-History Design — Interactive Explorer
+ +
+ +
+
Full-History RPC · Interactive Design Explorer
+

The Full-History Streaming Design

+
+ How the full-history daemon catches up, ingests live ledgers, freezes immutable history, and serves transactions by hash — explained with interactive models you can poke at. Companion to the streaming workflow and getTransaction design docs; the markdown remains the normative spec. Each section links to the doc that owns it.
+
+ + +
+

The big picture

+ +

+ Full-history RPC runs as one daemon in one mode. There is no separate backfill command and no explicit catch-up step: on startup the daemon figures out how far behind the network tip it is and catches up automatically; once caught up, it serves live ledgers as they're produced.

+
+
startup

1 · Catch up

+

Runs bulk catch-up as a subroutine: any chunk inside the retention window that isn't already frozen is pulled from the configured LedgerBackend (BSB by default) — skipping the tip chunk that captive core is actively ingesting. Covers first-ever start, downtime gaps, and retention widening.

+
steady state

2 · Ingest

+

Streams live ledgers from CaptiveStellarCore into one hot RocksDB per chunk — ledgers, tx hashes, and events as column families, written as one atomic synced WriteBatch per ledger. A ledger is either fully in the hot DB or absent.

+
steady state

3 · Freeze & prune

+

A background goroutine wakes on each chunk boundary and runs one tick: freeze the completed chunk to immutable files, rebuild the current tx-hash index to fold it in, discard hot DBs the cold artifacts now serve, and prune everything superseded or past retention.

+
+ +
+
Data flow
+
Two sources feed one set of artifacts. Whatever produced the bytes, the artifacts — and the meta-store keys that catalog them — are identical.
+
+ [Diagram: data flow. Nodes: CaptiveStellarCore (live ledgers at the tip); Object store (BSB, or any conformant backend); Hot RocksDB, one per chunk (column families: ledgers, txhash, events; serves reads for the live chunk); processChunk (one streaming pass over 10,000 LCMs); per-chunk, write-once outputs {chunk}.pack, events segment, and {chunk}.bin; per-window .idx (streamhash MPHF), rebuilt from .bin files on every chunk boundary. Edge labels: stream; catch-up; freeze at the chunk boundary (hot branch); k-way merge. Footer: meta-store RocksDB catalogs every file and directory above: mark-then-write keys, synced WAL, no directory is ever listed to find work.]
+
+ +
+ The one-sentence summary: data is born hot (one RocksDB per chunk), becomes cold and immutable at the chunk boundary (.pack / events segment / .bin → rolled into a per-window .idx), and every transition is recorded in a meta-store key before the bytes move — so a crash at any instant is recoverable from keys alone.
+
+ + +
+

Geometry

+ +

+ The chain starts at ledger 2 (GENESIS_LEDGER). Two units organize all storage: +

+
    +
  • Chunk: 10,000 ledgers (hardcoded). The atomic unit of ingestion, freezing, and crash recovery.
  • Window: chunks_per_txhash_index chunks (default 1000 = 10M ledgers). The unit of the rolling tx-hash index. Configurable, but immutable once stored.
+ +
+
Geometry explorer
+
Drag the slider or type any ledger sequence to see where it lives. All ids are zero-padded %08d; file buckets group 1000 chunks (%05d).
+
+ + +
+ + + + +
+
+ +
+
+
WINDOW — 1000 chunks · 10,000,000 ledgers
+
+
+
CHUNK — 10,000 ledgers
+
+
+
+
+
+
+ +
+ With the default chunks_per_txhash_index = 1000, the file-bucket size (fixed at 1000 chunks) and the window size coincide numerically — but they are different concepts: buckets are purely a filesystem concern and never appear in meta-store keys; windows define the tx-hash index layout and are pinned in the meta store forever.
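The geometry above is plain integer arithmetic. Below is a minimal Go sketch of it, using the formulas from the streaming workflow doc. The constant and function names are illustrative, and the bucket computation (chunk id divided by 1000) is an assumption inferred from the bucket description, not something the docs spell out.

```go
package main

import "fmt"

const (
	genesisLedger   = 2      // GENESIS_LEDGER: the chain's first ledger
	ledgersPerChunk = 10_000 // hardcoded chunk size
	chunksPerBucket = 1000   // fixed file-bucket size (filesystem concern only)
)

// chunkID returns the chunk holding ledger seq (seq must be >= 2).
func chunkID(seq uint32) uint32 { return (seq - genesisLedger) / ledgersPerChunk }

// chunkFirstLedger and chunkLastLedger bound a chunk, inclusive.
func chunkFirstLedger(c uint32) uint32 { return c*ledgersPerChunk + genesisLedger }
func chunkLastLedger(c uint32) uint32  { return (c+1)*ledgersPerChunk + genesisLedger - 1 }

// indexID takes a CHUNK id and returns its tx-hash window.
func indexID(c, chunksPerIndex uint32) uint32 { return c / chunksPerIndex }

// packPath builds the cold ledger file path. Assumption: bucket = chunk / 1000.
func packPath(c uint32) string {
	return fmt.Sprintf("ledgers/%05d/%08d.pack", c/chunksPerBucket, c)
}

func main() {
	seq := uint32(53_510_001)
	c := chunkID(seq)
	fmt.Println(c, chunkFirstLedger(c), chunkLastLedger(c), indexID(c, 1000), packPath(c))
}
```

With the default window size, ledger 53,510,001 resolves to chunk 5350 in window 5 and is that chunk's last ledger, which matches the boundary walkthrough later in this explorer.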
+
+ + +
+

The four guarantees

+ +

+ The daemon is built around four guarantees over its data. Everything else in the design — the write protocol, the derived watermark, the key-driven sweeps — exists to maintain these through any crash at any instant.

+
+

Retention is complete

+

No gaps within the retention window — for every ledger in [effectiveRetentionFloor, last_committed_ledger], all data derived from it (transactions, events) is present on disk and can serve any request that falls entirely inside the window.

+

Cold is canonical, hot is transient

+

Frozen chunks and finalized indexes live in immutable cold artifacts. A chunk's hot DB is discarded once every cold artifact derived from it is durable and the rolling index covers it — so a tx lookup always has exactly one home: the hot DB until coverage, the .idx after.

+

The meta-store catalogs what's on disk

+

Disk content is exactly what the meta-store specifies — every file is named by a meta-store key and every key in a final state has its file. File and key writes/deletes are ordered to preserve this across crashes.

+

Storage tracks retention

+

Disk usage scales with retention_chunks, not with uptime — files and keys for ledger ranges below the effective retention floor are pruned as the floor advances.

+
+
+ + +
+

Data model

+ +

+ Durable state lives in two places: the meta-store RocksDB (state markers and config pins) and the filesystem (immutable files, plus one per-chunk hot RocksDB holding in-progress data during ingestion).

+ +

On disk

+
{default_data_dir}/
+├── meta/rocksdb/ ← meta store (WAL always on)
+├── hot/{chunk:08d}/ ← per-chunk hot RocksDB (transient)
+├── ledgers/{bucket:05d}/{chunk:08d}.pack
+├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash)
+└── txhash/
+    ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization
+    └── index/{window:08d}/{lo:08d}-{hi:08d}.idx ← one frozen file per window, coverage-named +
+
+ Legend: hot / transient-per-chunk; cold, persists until retention pruning; transient rebuild input
+ +

+ The .bin is the interesting transient: it is the input to buildTxhashIndex and is retained for the whole life of its window (every boundary re-reads all of the window's .bin files to rebuild the index). The terminal build's commit batch demotes them to "pruning" and the sweep removes them.

+ +

The chunk hot DB

+

+ One RocksDB per chunk at hot/{chunk:08d}/, holding everything for that chunk not yet materialized to cold artifacts. The data types are column families of one instance — they share the instance's WAL, so each ledger commits as one atomic WriteBatch across all CFs. There is no cross-store ordering to reason about within a chunk.

+ + + + + +
Column family | Holds | Serves
ledgers | compressed LCMs, keyed by seq | getLedger for the live chunk; the source processChunk reads at freeze
txhash | tx hash → seq | getTransaction for the live chunk
events CFs | live events (schema per the events doc) | getEvents for the live chunk
+ +

Meta-store keys

+

Three groups: per-chunk artifact state, hot DB state, and config pins. Lifecycle states are shared by every artifact key in the system:

+
+ "freezing" = file being written (or crashed mid-write) — delete or re-derive +
+
+ "frozen" = fsynced and durable — truth; the only state readers resolve +
+
+ "pruning" = queued for removal — finish the delete +
+ + + + + + + + +
Key | Meaning
chunk:{c}:lfs | Per-chunk .pack file state.
chunk:{c}:txhash | Per-chunk .bin file state. Transient — removed at window finalization.
chunk:{c}:events | Per-chunk events cold segment state.
index:{w}:{lo}:{hi} | One key per index coverage. The key name carries the coverage and maps 1:1 to the file {lo}-{hi}.idx; the value is pure lifecycle state. At most one coverage per window is "frozen" at any moment.
hot:chunk:{c} | "ready" = dir exists and is usable; "transient" = a directory operation (create or delete) is in flight — the recovery is the same either way, which is why one value suffices.
config:earliest_ledger, config:chunks_per_txhash_index | Written on first start, immutable thereafter (startup aborts on mismatch).
+
+ Key names carry identity; values carry only lifecycle. An index key's filename is derived from its name by a fixed bijection — resolving a key to its file never reads the value or lists a directory. Every file on disk, including a crashed attempt's partial, is reachable from its key alone.
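The key-to-file bijection for index coverages can be sketched in Go as below. This is only an illustration: the function names are hypothetical, the zero-padding inside the key is an assumption (the design fixes only the shape index:{w}:{lo}:{hi}), and the path is shown relative to the data directory.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// indexKey names one index coverage. Padding inside the key is assumed.
func indexKey(w, lo, hi uint32) string {
	return fmt.Sprintf("index:%08d:%08d:%08d", w, lo, hi)
}

// indexPathFromKey maps a key name to its file path. It never reads the
// key's value and never lists a directory.
func indexPathFromKey(key string) (string, error) {
	parts := strings.Split(key, ":")
	if len(parts) != 4 || parts[0] != "index" {
		return "", fmt.Errorf("not an index key: %q", key)
	}
	ids := make([]uint64, 3)
	for i, p := range parts[1:] {
		n, err := strconv.ParseUint(p, 10, 32)
		if err != nil {
			return "", fmt.Errorf("bad id in %q: %w", key, err)
		}
		ids[i] = n
	}
	return fmt.Sprintf("txhash/index/%08d/%08d-%08d.idx", ids[0], ids[1], ids[2]), nil
}

func main() {
	key := indexKey(5, 5100, 5349)
	path, err := indexPathFromKey(key)
	fmt.Println(key, path, err)
}
```

Because the mapping is a pure function of the key name, a sweep that finds a "freezing" or "pruning" key can always locate the file to remove, including a partial left by a crashed build.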
+
+ + +
+

Artifact lifecycles

+ +

+ Three state machines cover every durable thing in the system. Click any state to see what it means and what recovery does if a crash leaves the system there.

+
+
Per-chunk artifacts — .pack, events segment, .bin
+
+
Click a state above.
+
+
+
Index coverage — {lo}-{hi}.idx
+
The one logically-mutable cold artifact: "mutation" happens by freezing the next coverage and demoting the old one in a single atomic batch. The frozen file readers resolve is immutable until unlinked.
+
+
Click a state above.
+
+
+
Hot DB — hot/{chunk:08d}/
+
+
Click a state above.
+
+
+ + +
+

One write protocol

+ +

+ Every durable artifact — per-chunk files and index coverages alike — uses the same protocol, mark-then-write: put "freezing" before any I/O; write the file; fsync the file and its dirent(s); flip the key to "frozen". The pre-mark guarantees every file on disk has a key, so all cleanup is key-driven. Deletion mirrors it: demote, unlink the file before the key, with an fsyncDir barrier between — giving the complementary guarantee, key absent ⟹ file gone.
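The ordering of the write and delete protocols can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the daemon's code: metaStore is a hypothetical in-memory stand-in for the meta-store RocksDB (where each put would be durable before the next step), the function names are invented, and the directory fsync assumes a POSIX filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// metaStore is a hypothetical stand-in; a map only models the step ordering.
type metaStore map[string]string

// fsyncDir makes directory-entry changes (create, unlink) durable on POSIX.
func fsyncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// writeArtifact: mark, write, fsync the file and its dirent, flip to frozen.
func writeArtifact(meta metaStore, key, path string, data []byte) error {
	meta[key] = "freezing" // pre-mark: any file on disk now has a key
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := fsyncDir(filepath.Dir(path)); err != nil {
		return err
	}
	meta[key] = "frozen"
	return nil
}

// deleteArtifact: demote, unlink the file, barrier, then drop the key.
func deleteArtifact(meta metaStore, key, path string) error {
	meta[key] = "pruning"
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := fsyncDir(filepath.Dir(path)); err != nil {
		return err
	}
	delete(meta, key)
	return nil
}

// demo runs one write and one delete in a temp dir. It reports the key state
// after the write and whether file and key are both gone after the delete.
func demo() (string, bool, error) {
	dir, err := os.MkdirTemp("", "artifact-demo")
	if err != nil {
		return "", false, err
	}
	defer os.RemoveAll(dir)
	meta := metaStore{}
	key, path := "chunk:00005350:lfs", filepath.Join(dir, "00005350.pack")
	if err := writeArtifact(meta, key, path, []byte("ledger bytes")); err != nil {
		return "", false, err
	}
	state := meta[key]
	if err := deleteArtifact(meta, key, path); err != nil {
		return state, false, err
	}
	_, statErr := os.Stat(path)
	_, keyLeft := meta[key]
	return state, os.IsNotExist(statErr) && !keyLeft, nil
}

func main() {
	fmt.Println(demo())
}
```

The sketch omits one detail the design calls for: a write that creates its parent directory also needs a barrier on the grandparent. Tolerating an already-missing file in deleteArtifact is what lets a sweep re-run safely after a crash between the unlink and the key delete.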

+
+
Crash simulator
+
Pick a protocol, then click any step. The right panel shows the durable state after that step completes — and what happens if the process dies right there.
+
+
+
+
+
+
+
+ Why the dirent fsyncs matter: without them, a power crash can revert a file's (or a freshly created directory's) creation under a durable "frozen" key — a state key-only idempotency would never repair. That's why writes that create their parent directory barrier the grandparent too.
+
+ The "never salvage" rule: a crashed index build's file might even be complete — but proving that buys nothing. A rebuild re-derives byte-identical output (the merge is a deterministic function of the coverage), and a single no-questions rule — delete "freezing" debris unread — collapses the entire crash inventory.
+
+ + +
+

Progress is derived, never stored

+ +

+ There is no stored watermark. The hot DB's synced per-ledger WriteBatch is the durable commit; recording it again in the meta store would create a second copy of the same fact, plus an ordering rule to keep the copy honest. Instead, two derivations read progress back out of the catalog. Both lean on one key-creation invariant: a hot:chunk key is created only after every ledger below its chunk has durably committed — so the highest hot key is the live chunk, and everything below it is complete.

+
+
deriveCompleteThrough — two terms, take the max
+
A cold term (highest chunk whose artifacts are all durable) leads at startup; a positional term (everything below the live chunk) leads in steady state. Pick a situation:
+
+
+
+
+ Postcondition-driven catch-up is what makes a derived watermark safe: catch-up converges ranges, not resume pointers, so derivation can never hide a hole — and a lost hot volume self-degrades the watermark to the last frozen boundary instead of requiring a manual rewind.
+
+ + +
+

The rolling tx-hash index

+ +

+ The current window's index is re-derived from scratch on every chunk boundary to absorb the chunk that just froze, growing until its window is complete. Only the window the network tip is in is ever rebuilt; a completed window's index is finalized (its .bin inputs swept) and never touched again. The rebuild is cheap relative to the cadence: a full-window streamhash build is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates.

+
+
Rolling-window simulator
+
+ Scaled down to 8 chunks per window so you can watch it roll (real default: 1000). Each step is one chunk boundary: the live chunk freezes, the window's coverage advances by one atomic promote-and-demote, the hot DB is discarded once covered. Enable retention to watch the floor chase the tip and lo rise.
+
+ + + + +
+
+
+ Legend: live chunk (hot DB, being written); hot DB awaiting coverage; frozen (.pack + events durable); .bin present (rebuild input); index coverage [lo, hi]; pruned (past retention)
+
+
+
+ Why per-chunk .bin files make this affordable: processChunk sorts each chunk's ~3M entries in memory before writing, so the every-boundary rebuild is a single streaming k-way merge of sorted runs — no two-pass build over unsorted input. Transient .bin disk is bounded by the windows actually in flight (floor: one dense window ≈ 60 GB), because every terminal commit's eager sweep deletes a finalized window's inputs immediately.
+
+ Provisioning note: old and new coverage files coexist from the start of a rebuild's write until the eager sweep's unlink, so the window dir transiently holds ~2× the index size (~25 GB at the end of a dense full window), and the window-end rebuild writes ~12.5 GB in ~1 minute (~200 MB/s burst) — trivial on instance NVMe, worth provisioning for on throughput-capped volumes like EBS gp3.
+
+ + +
+

A chunk boundary, end to end

+ +

+ The micro view: ledger 53,510,001 closes chunk 5350 (window 5, floor pinned at chunk 5100 by earliest_ledger, frozen index covering chunks 5100–5349). Step through every write the boundary performs — watch the meta store, the filesystem, and where reads are served at each instant.

+
+
+ + + + +
+
+
+
+
+
+
+
+
+ Every arrow in this walkthrough is the one write protocol or its exit sweep. At the end of the tick a re-plan and re-scan find nothing to do — that quiescence is what makes the invariant audits meaningful on a live daemon.
+
+ + +
+

Catch-up & the resolver

+ +

+ Catch-up has a contract — given a range, ensure every artifact derived from every ledger in it is durable and servable — and resolves what's missing before scheduling anything. A naive scheduler would register every task and rely on self-skips; that shape re-derives every chunk's .bin on every restart only for finalization to immediately delete it again. Instead, each artifact kind contributes one rule that compares its postcondition against the catalog and emits the difference as tasks:

+
    +
  • lfs / events (per-chunk): needed for chunk c iff the key isn't "frozen".
  • txhash (per-window): compare the stored coverage (from the window's unique frozen index key) with the desired coverage [max(window_start, floor), min(window_last, range_end)]. Desired ⊆ stored → schedule nothing. Desired exceeds stored → request .bin production for every chunk in the desired range (already-frozen ones self-skip; previously-covered ones re-derive from local .pack) and emit one buildTxhashIndex(w, desired_lo, desired_hi).
+

+ The plan is just a value — pure data recomputed from durable keys on every run, so a restart re-plans from what is actually on disk with nothing to resume and nothing to reconcile. And the comparison can trust "frozen" blindly: input keys are demoted in the same synced write that freezes the terminal coverage, and files are only ever deleted by sweeps under non-frozen keys — no crash can leave a frozen key whose file is gone.
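The per-window txhash rule can be written as a pure function from catalog state to a plan. The Go sketch below is an illustration only: the type and function names are hypothetical, floor and rangeEnd are treated as chunk ids (an assumption about units), and returning an empty plan when the window does not intersect the range is also an assumption.

```go
package main

import "fmt"

// coverage is an inclusive chunk range [lo, hi].
type coverage struct{ lo, hi uint32 }

// txhashPlan is the work the resolver would emit for one window.
type txhashPlan struct {
	binChunks []uint32  // chunks whose .bin production is requested
	build     *coverage // nil when no index build is needed
}

func maxU32(a, b uint32) uint32 {
	if a > b {
		return a
	}
	return b
}

func minU32(a, b uint32) uint32 {
	if a < b {
		return a
	}
	return b
}

// resolveTxhash compares the stored coverage with the desired coverage for
// window w. stored is nil when the window has no frozen index key.
func resolveTxhash(w, cpi, floor, rangeEnd uint32, stored *coverage) txhashPlan {
	windowStart, windowLast := w*cpi, (w+1)*cpi-1
	desired := coverage{lo: maxU32(windowStart, floor), hi: minU32(windowLast, rangeEnd)}
	if desired.lo > desired.hi {
		return txhashPlan{} // assumed: window outside the range needs no work
	}
	if stored != nil && stored.lo <= desired.lo && desired.hi <= stored.hi {
		return txhashPlan{} // desired is a subset of stored: schedule nothing
	}
	plan := txhashPlan{build: &desired}
	for c := desired.lo; c <= desired.hi; c++ {
		plan.binChunks = append(plan.binChunks, c) // frozen ones self-skip later
	}
	return plan
}

func main() {
	p := resolveTxhash(5, 1000, 5100, 5350, &coverage{lo: 5100, hi: 5349})
	fmt.Println(len(p.binChunks), *p.build)
}
```

Because the function reads nothing but its arguments, calling it again after a crash with the same catalog state yields the same plan, which is the property the paragraph above relies on.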

+ +
+
Resolver playground
+
Six situations the daemon actually encounters. Solid bar = stored coverage (the frozen index key); dashed bar = desired coverage. The plan below is what resolve() emits.
+
+
+
+
+
+
+ +

The execution model

+

+ executePlan is map/reduce without the shuffle or the job tracker: chunk builds are the maps, index builds are the per-window reduces (the .bins are map-side-sorted runs, so each reduce is one streaming merge), and completion is recorded as the artifacts themselves. There is deliberately no task engine and no persisted task state:

+
    +
  • The dependency structure is two strata with one edge type — an index build waits on the chunk builds inside its coverage — expressed directly with done-channels. Thousands of goroutines may exist, parked on a single worker semaphore (cfg.Workers, the only concurrency knob); at most Workers tasks execute at any instant.
  • Done-channels broadcast completion, not success: a build whose input failed starts anyway and trips buildTxhashIndex's loud .bin precondition check before touching any key — landing on the same abort-and-restart path as the original failure.
  • A persisted task graph would be a second source of truth that can drift from the artifact keys it describes; resolve re-plans from keys, so completed work never repeats and interrupted work needs no reconciliation.
+

+ The same resolve + executePlan pair is the lifecycle tick's first stage — one scheduler, two callers, so the two regimes can never disagree about what "done" looks like. processChunk's source selection (catchupSource) is also shared: a ready, complete hot DB beats the local .pack beats the bulk backend — which is exactly what lets the lifecycle's freeze be ordinary plan execution rather than a special path.
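The two-strata shape described in this section (done-channels plus one worker semaphore) can be sketched in Go as below. This is a toy model under stated assumptions: executeSketch and its counters are hypothetical, the tasks do no real work, and error handling is left out entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// executeSketch runs n chunk builds and one index build that waits on all of
// them, with at most workers tasks executing at once. It returns the peak
// number of concurrently executing tasks and whether the index build ran
// only after all n chunk builds had finished.
func executeSketch(workers, n int) (peak int, indexSawAll bool) {
	sem := make(chan struct{}, workers) // the single worker semaphore
	var mu sync.Mutex
	running, finished := 0, 0

	run := func(task func()) {
		sem <- struct{}{} // goroutines park here until a slot frees up
		mu.Lock()
		running++
		if running > peak {
			peak = running
		}
		mu.Unlock()
		task()
		mu.Lock()
		running--
		mu.Unlock()
		<-sem
	}

	done := make([]chan struct{}, n)
	var wg sync.WaitGroup
	for i := range done {
		done[i] = make(chan struct{})
		wg.Add(1)
		go func(ch chan struct{}) {
			defer wg.Done()
			defer close(ch) // broadcasts completion, not success
			run(func() {
				mu.Lock()
				finished++
				mu.Unlock()
			})
		}(done[i])
	}

	wg.Add(1)
	go func() { // the index build waits on every chunk build in its coverage
		defer wg.Done()
		for _, ch := range done {
			<-ch
		}
		run(func() {
			mu.Lock()
			indexSawAll = finished == n
			mu.Unlock()
		})
	}()

	wg.Wait()
	return peak, indexSawAll
}

func main() {
	fmt.Println(executeSketch(3, 50))
}
```

Closing the done-channel in a defer is what makes it a completion signal and not a success signal: the index build is released even if a chunk build fails, and in the real design its own precondition check is what then stops it.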

+
+ + +
+

Concurrency: two writers, one fence

+ +

+ Two writers; readers only read. Their domains partition at the live chunk, and the partition itself is encoded in the catalog — the lifecycle's derivation treats the highest hot key as the live chunk and touches only what lies below it.

+
+
+

Ingestion loop — owns the live chunk

+

The only writer of the live chunk's hot DB, and the creator of each chunk's hot:chunk:{c} key. One synced WriteBatch per ledger; no progress variable at all.

+
+
+

Lifecycle goroutine — owns everything below

+

Handed-off hot DBs (freeze + discard), all chunk:* and index:* keys, and the deletion side of hot:chunk:*. The tick's plan stage fans out to the bounded worker pool — every worker operating strictly below the live chunk.

+
+
+

+ The handoff fence is the boundary's write order: the ingestion loop closes its write handle before creating the next chunk's hot key. Creating that key is the act that moves the partition — the instant it exists, the closed chunk lies below the live chunk and any lifecycle scan (including one already in flight) may freeze and discard it, by which point no writer holds it.

+

+ The only connection between the goroutines is a payload-free doorbell — a non-blocking send on a size-1 channel, coalescing freely. Nothing is lost because the notification carries no information to lose: eligibility derives entirely from durable state, and one tick processes everything the catalog shows, however many boundaries contributed. The doorbell answers "when should the lifecycle look", never "what should it see". A tick racing a boundary only under-approximates eligibility — work deferred to the next tick, never incorrect work.
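The doorbell pattern is a standard Go idiom. A minimal sketch follows; the function names ring and drain are hypothetical, and drain exists only to make the coalescing observable.

```go
package main

import "fmt"

// ring is the ingestion side: a non-blocking send that coalesces with any
// notification already pending.
func ring(doorbell chan<- struct{}) {
	select {
	case doorbell <- struct{}{}:
	default: // a wake-up is already queued; nothing is lost
	}
}

// drain counts how many wake-ups are pending right now, without blocking.
func drain(doorbell <-chan struct{}) int {
	n := 0
	for {
		select {
		case <-doorbell:
			n++
		default:
			return n
		}
	}
}

func main() {
	doorbell := make(chan struct{}, 1) // size-1, payload-free
	ring(doorbell)
	ring(doorbell)
	ring(doorbell) // three boundaries pass before the lifecycle looks
	fmt.Println(drain(doorbell)) // prints 1
}
```

Three rings produce one pending wake-up. That is safe here only because the tick derives its work from durable state, so a single wake-up covers every boundary that contributed to it.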

+
+ + +
+

The reader retention contract

+ +

+ A read for any seq below effectiveRetentionFloor returns not-found, regardless of whether the underlying file still exists. This is what lets pruning remove chunks the moment they pass retention without coordinating with the index lifecycle: retention is the single source of truth for "is this data available?". For tx-hash lookups, the reader walks two tiers:

+ + + + +
Chunk state | Served from
at or below the frozen key's hi | the .idx named by the window's unique "frozen" key
above hi (live, or frozen and awaiting coverage) | the txhash CF of the chunk's hot DB
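The retention rule and the two-tier table can be combined into one small decision function. The Go sketch below is an illustration under stated assumptions: serveFrom is a hypothetical name, the retention floor is treated as a ledger sequence, and the sketch takes the transaction's ledger sequence as given and ignores how the reader discovers it. The ENOENT races discussed next are not modeled.

```go
package main

import "fmt"

const (
	genesisLedger   = 2
	ledgersPerChunk = 10_000
)

// serveFrom returns which tier answers a tx-hash lookup for a transaction in
// ledger seq. frozenHi is the hi of the window's unique frozen index key;
// hasFrozen is false when the window has no frozen key yet.
func serveFrom(seq, floor, frozenHi uint32, hasFrozen bool) string {
	if seq < genesisLedger || seq < floor {
		return "not-found" // below the floor, even if a file still exists
	}
	chunk := (seq - genesisLedger) / ledgersPerChunk
	if hasFrozen && chunk <= frozenHi {
		return "idx" // covered by the frozen .idx
	}
	return "hot" // live, or frozen and awaiting coverage
}

func main() {
	floor := uint32(51_000_002) // first ledger of chunk 5100
	fmt.Println(serveFrom(50_000_000, floor, 5349, true))
	fmt.Println(serveFrom(53_400_000, floor, 5349, true))
	fmt.Println(serveFrom(53_505_000, floor, 5349, true))
}
```

The three calls in main correspond to a ledger below the floor, a ledger in a chunk the frozen index covers, and a ledger in chunk 5350, which sits above the frozen hi of 5349.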
+

+ The transition is gap-free by write ordering: the hot DB is discarded only after the durable .idx covers the chunk. Two ENOENT sites need explicit rules — walk through them:

+
+
Read-path explorer
+
Four lookups, including both ENOENT races. The chain shows each step the reader takes.
+
+
+
+
+
+ + +
+

Correctness

+ +

+ Quiescence means the tick's plan is empty and both scans produce empty op lists — the state the system returns to between boundaries, and the state in which the invariants below are auditable on a live daemon. From any storage state — partial-completion crashes, operator actions, surgical recovery — startup (catch-up + the first tick) drives the system to quiescence satisfying all four.

+ +
+ INV-1 Read correctness +
Any data request whose ledger scope falls entirely within the retention window returns correct results — content matches what a conformant LedgerBackend would produce, no partial state visible, no in-retention range unreachable. Audit: issue reads, or re-derive artifacts via a conformant backend and byte-compare. One transient exception after hot-volume loss: the regressed floor briefly admits a few already-pruned bottom chunks — those reads fail soft via the missing-data-file rule (not-found, never wrong data) until the floor re-advances.
+
+
+ INV-2 Single canonical state +
At most one "frozen" index key per window — at all times, quiescent or not (the commit batch promotes and demotes in one write). At quiescence: no key anywhere is "freezing" or "pruning"; no hot DB persists for a chunk cold artifacts fully serve; no chunk:c:txhash key survives in a finalized window. Audit: walk meta-store keys, cross-check forbidden co-existence.
+
+
+ INV-3 Disk matches meta-store +
At quiescence, the set of artifact files and hot DB directories on disk equals exactly the set the meta-store specifies — no orphan files, no dangling keys, no duplicate artifacts. A non-key-named file in an index window dir is a real bug, not mid-tick debris. Audit: walk the filesystem against the meta-store, both directions.
+
+
+ INV-4 Retention bound +
At quiescence, no file or meta-store key maps to a ledger range strictly below the effective retention floor. Audit: walk meta-store keys, compare ledger ranges to the floor.
+
+ +
+ None of the invariants reference the phase scans that maintain them — so a bug in any scan shows up as a real invariant violation, not as something the buggy code silently considers acceptable. An audit admin command can implement the walks directly.
+ +

What a bug looks like

+

Common bugs land as concrete, detectable violations:

+ + + + + + + + + + + + +
Symptom | Violates | Detected by
A key flips "frozen" before fsync; key's {lo,hi} doesn't match the file; a frozen file mutated post-freeze | INV-1 | re-derive via a conformant backend, byte-compare
Pruning too aggressive — an in-retention read returns wrong/missing results | INV-1 | issue reads
Two frozen index keys in one window (promotion and demotion landed as separate writes) | INV-2 | walk index:*, count "frozen" per window
A "freezing"/"pruning" key survives served quiescence | INV-2 | walk keys for transient values at quiescence
A hot DB persists for a chunk cold artifacts fully serve | INV-2 | walk hot:chunk:* against coverage
Finalization demotions don't complete — .bin keys outlive their terminal index | INV-2 | walk chunk:c:txhash in finalized windows
A file on disk without its key (orphan — invisible to every key-driven scan) | INV-3 | walk filesystem against meta-store
A key without its file (dangling) | INV-3 | walk meta-store against filesystem
Duplicate cold artifacts for the same logical data | INV-3 | walk filesystem against key-specified paths
Files or keys remain below the retention floor | INV-4 | walk keys against the floor
+ +

Why convergence works

+

Three properties shared by the resolver and the scans, plus catch-up's postcondition contract:

+
    +
  • Eligibility from durable state alone — every decision derives from meta-store keys; nothing depends on in-memory history.
  • Idempotent ops — re-running any half-finished op is safe; re-materialization overwrites at canonical paths, sweeps re-run until the key is gone.
  • Everything re-derived on every notification — there is no persisted plan to drift.
+

+ Runtime op failure aborts the daemon (after bounded retries) rather than deferring silently — safe because startup is the recovery path: every state a run can leave behind is one startup is built to converge.

+
+ +
+ Interactive companion to full-history-streaming-workflow.md (the daemon: catch-up, ingestion, lifecycle, invariants) and gettransaction-full-history-design.md (the tx-by-hash subsystem: formats, the rolling index, the read path) — the markdown is the normative spec; numbers here (chunk = 10,000 ledgers, window default = 1000 chunks, build ≈ 1 min) come from those docs and the bench-fullhistory measurements they cite. Generated 2026-06-12. Self-contained; no external dependencies.
+ + + +
+
diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md
new file mode 100644
index 000000000..90ef64e72
--- /dev/null
+++ b/design-docs/full-history-streaming-workflow.md
@@ -0,0 +1,1274 @@
+# Streaming Workflow
+
+## Overview
+
+Full-history RPC runs as one daemon in one mode. On startup it figures out how far behind the network tip it is and catches up automatically; once caught up, it serves live ledgers as they're produced. There is no separate backfill command or explicit catch-up step for the operator to invoke.
+
+The daemon does three things:
+
+- **Catches up on startup** by running bulk catch-up as a subroutine. This brings on-disk coverage in line with the current retention window — pulling from a configured LedgerBackend (BSB — the Buffered Storage Backend, which reads ledgers from an object store — by default; captive core or any other conformant backend if BSB isn't available) any chunks inside that window that aren't already frozen — while skipping the tip chunk that captive core is actively ingesting; hot DB ingestion finishes that one. This covers first-ever start, downtime gaps, and retention-widening gaps.
+- **Ingests** live ledgers from `CaptiveStellarCore` into one hot RocksDB per chunk — ledgers, transaction hashes, and events as column families, written in one atomic batch per ledger.
+- **Freezes** completed chunks to immutable files, **rebuilds** the current tx-hash index from its frozen inputs on every chunk boundary, and **prunes** superseded and past-retention artifacts. All run in a background lifecycle goroutine.
+
+---
+
+## Geometry
+
+The Stellar blockchain starts at ledger 2 (`GENESIS_LEDGER`). Two units organize all storage; everything in this doc is described in terms of them:
+
+- **Chunk** — 10,000 ledgers (hardcoded). The atomic unit of ingestion, freezing, and crash recovery.
+- **Window** (tx-hash index) — `chunks_per_txhash_index` chunks (default 1000 = 10M ledgers). The unit of the rolling tx-hash index. Configurable, but immutable once stored.
+
+```
+chunkID(seq) = (seq - 2) / 10_000
+chunkFirstLedger(c) = c * 10_000 + 2
+chunkLastLedger(c) = (c + 1) * 10_000 + 1
+indexID(c) = c / chunks_per_txhash_index # takes a CHUNK id
+chunksInIndex(w) = [w*cpi, (w+1)*cpi - 1] # cpi = chunks_per_txhash_index
+```
+
+All chunk and window ids use uniform `%08d` zero-padding. Example, default `chunks_per_txhash_index = 1000`:
+
+| Window | First ledger | Last ledger | Chunks |
+|---|---|---|---|
+| 0 | 2 | 10,000,001 | 0–999 |
+| 1 | 10,000,002 | 20,000,001 | 1000–1999 |
+| N | N×10M + 2 | (N+1)×10M + 1 | N×1000 – (N+1)×1000−1 |
+
+---
+
+## What the daemon guarantees
+
+The daemon is built around four guarantees over its data:
+
+- **Retention is complete.** No gaps within the retention window — for every ledger in the window, all data derived from it (transactions, events) is present on disk and available to serve any data request that falls entirely within it.
+- **Cold is canonical, hot is transient.** Frozen chunks and finalized indexes live in immutable cold artifacts. The chunk hot DB is discarded once every cold artifact derived from the chunk is durable *and* the rolling index covers the chunk — so a tx lookup always has exactly one home: the chunk's hot DB until coverage, the `.idx` after. The current index is logically mutable — re-derived on every chunk boundary from the frozen `.bin` files — until its window finalizes.
+- **The meta-store catalogs what's on disk.** Disk content is exactly what the meta-store specifies — every file is named by a meta-store key and every key in a final state has its file. File and key writes/deletes are ordered to preserve this across crashes.
+- **Storage tracks retention.** Disk usage scales with `retention_chunks`, not with uptime — files and meta-store keys for ledger ranges below the effective retention floor are pruned as the floor advances.
+
+The retention window is bounded above by `last_committed_ledger` (the most recent ledger the daemon has durably committed) and below by `effectiveRetentionFloor` (computed from `retention_chunks` and `earliest_ledger`; defined in [Startup](#startup)).
+
+The rest of this doc explains how the daemon maintains these guarantees through three operational phases. The [Correctness](#correctness) section at the end gives the formal statement plus substrate assumptions, coverage scenarios, and audit shapes.
+
+---
+
+## How the daemon runs
+
+Three activities, in the order a fresh daemon encounters them; the last two run together as the steady state.
+
+**Catch up.** On first start the daemon checks how far behind the network it is by sampling the network tip via the configured LedgerBackend; on subsequent starts it picks up from `last_committed_ledger`. It then runs the [catch-up primitives](#catch-up-primitives) (`processChunk`, `buildTxhashIndex`) over the missing range. The catch-up loop excludes the chunk captive core is currently ingesting — that one's finished by hot DB ingestion, not by the catch-up source.
+
+**Hot DB ingestion.** Once caught up, the daemon streams live ledgers from CaptiveStellarCore into the live chunk's hot RocksDB — ledgers, tx hashes, and events land in their column families via **one atomic, synced WriteBatch per ledger**, so a ledger is either fully in the hot DB or absent. Ingestion's own progress marker, `last_committed_ledger`, advances per ledger only after that batch is durable — it is a local of the ingestion loop, shared with nothing.
+
+**Freeze and prune.** A background goroutine wakes whenever ingestion's set of hot chunk DBs changes — each chunk boundary, plus once when ingestion starts — and runs one **tick** of three stages: **plan-and-execute** (the same resolver and executor catch-up uses, which freezes complete chunks to immutable files and folds them into the current tx-hash index), **discard** (retire hot DBs the cold artifacts now fully serve), then **prune** (sweep demoted artifacts and everything past the retention window). Each stage sees the previous stage's effects.
+
+The current tx-hash index is **re-derived from scratch on every chunk boundary** to absorb the chunk that just froze, growing until its window is complete; only the window the network tip is in is ever rebuilt, and a completed window's index is finalized (inputs cleaned up) and never touched again. The build is cheap relative to the ~chunk cadence (a full-window rebuild with streamhash — the minimal-perfect-hash index library behind tx-hash lookups — is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates), so rebuilding from scratch each boundary is affordable.
+
+The boundary between "in-retention" and "past-retention" is the `effectiveRetentionFloor`. As the network tip advances, the floor advances with it; complete chunks below the floor are removed by the prune stage.
+
+[Daemon flow](#daemon-flow) below has the pseudocode for each phase. [Data model](#data-model) describes what's on disk and in the meta store. [Correctness](#correctness) details the invariants the design maintains.
+
+---
+
+## Configuration
+
+One TOML file (`--config`) configures the daemon.
+
+**[service]**
+
+| Key | Type | Default | Description |
+|---|---|---|---|
+| `default_data_dir` | string | **required** | Base directory for the meta store and default storage paths. |
+
+**[catch_up]**
+
+| Key | Type | Default | Description |
+|---|---|---|---|
+| `chunks_per_txhash_index` | uint32 | `1000` | Chunks per tx-hash window. Defines data layout — immutable once stored (startup aborts on mismatch; see `validateConfig`). |
+| `workers` | int | `GOMAXPROCS` | Concurrent task slots for bulk catch-up. |
+| `max_retries` | int | `3` | Retries per catch-up task before the daemon aborts. |
+
+**[catch_up.bsb]** — Buffered Storage Backend (the default bulk LedgerBackend; required **unless** another conformant LedgerBackend is configured as the bulk source — `backendNetworkTip`/`validateBackendCovers`/`processChunk`'s default `source` all go through whichever backend is configured)
+
+| Key | Type | Default | Description |
+|---|---|---|---|
+| `bucket_path` | string | **required** | Remote object store path for LedgerCloseMeta (no `gs://` prefix for GCS). |
+| `buffer_size` | int | `1000` | Prefetch buffer depth per connection. |
+| `num_workers` | int | `20` | Download workers per connection. |
+
+**[immutable_storage.*]** — one optional `path` per artifact tree (defaults under `{default_data_dir}`):
+
+| Section | Default path | Holds |
+|---|---|---|
+| `[immutable_storage.ledgers]` | `{default_data_dir}/ledgers` | `.pack` files |
+| `[immutable_storage.events]` | `{default_data_dir}/events` | events cold segments |
+| `[immutable_storage.txhash_raw]` | `{default_data_dir}/txhash/raw` | transient `.bin` files |
+| `[immutable_storage.txhash_index]` | `{default_data_dir}/txhash/index` | per-window `.idx` |
+
+**[meta_store]** — optional `path` (default `{default_data_dir}/meta/rocksdb`).
+
+**[logging]** — optional `level` (`debug`/`info`/`warn`/`error`, default `info`) and `format` (`text`/`json`, default `text`).
+
+**[streaming]**
+
+| Key | Type | Default | Description |
+|---|---|---|---|
+| `retention_chunks` | uint32 | `0` | Retention window in chunks. `0` = full history.
| +| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for. Acts as a fixed lower floor on history; combines with `retention_chunks` (the effective floor is the higher of the two). Must be chunk-aligned (i.e., `chunkFirstLedger` of some chunk); `"now"` resolves to `chunkFirstLedger(chunkID(backendNetworkTip()))` at first start. Stored on first start; immutable thereafter. Setting it higher than genesis skips upfront catch-up — useful for *frontfill* deployments (`earliest_ledger = "now"`) where bringing a fast bulk source online isn't possible. The current immutability is enforced only by `validateConfig`; the rest of the system reads the value through the meta store, so a future `set-earliest-ledger` admin command would be a small change. | +| `captive_core_config` | string | **required** | Path to CaptiveStellarCore config file. | + +**[streaming.hot_storage]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `path` | string | `{default_data_dir}/hot` | Base path for hot RocksDB databases. | + +**CLI** + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--config` | string | **required** | Path to TOML config file. | + +--- + +## Data model + +The daemon's durable state lives in two places: the meta-store RocksDB (state markers and config pins) and the filesystem (immutable files plus one per-chunk hot RocksDB that holds in-progress data during ingestion). + +Throughout this section, `chunk` is a chunk id, `txhash_index` is a window id, and `chunks_per_index` is shorthand for `config.chunks_per_txhash_index`. + +### Filesystem artifacts + +The per-chunk artifacts are each written once at chunk freeze; the txhash index is rebuilt on each chunk boundary while its window is current and then finalized. 
All four are produced by the [catch-up primitives](#catch-up-primitives): + +| Artifact | Granularity | Format | Produced by | +|---|---|---|---| +| Ledger pack file | per chunk | `.pack` | `processChunk` | +| Events cold segment | per chunk | three files per chunk (format defined in the events doc) | `processChunk` | +| Sorted txhash file | per chunk | `.bin` (sorted streamhash entries; see [rule 5](#catch-up-primitives)) | `processChunk` | +| Streamhash txhash index | per index | one `.idx` file per **coverage**, named `{lo:08d}-{hi:08d}.idx` inside the window's dir; at most one coverage frozen at any moment | `buildTxhashIndex` | + +The `.bin` is transient — it is the input to `buildTxhashIndex` and exists only until its index window finalizes, at which point the terminal build's commit batch demotes it to `"pruning"` and the sweep removes it. While a window is current, every boundary re-reads its `.bin` files to rebuild the index, so they are retained for the whole window. The pack file and events segment persist until retention-driven pruning removes them. The txhash index is rebuilt at a **new coverage** on every boundary (mark the new coverage's key `"freezing"` → write `{lo}-{hi'}.idx` → one atomic batch promotes it to `"frozen"` and demotes the predecessor coverage to `"pruning"`), then persists until pruning once its window has finalized and slid past retention. Key name and filename are a bijection, so every index file on disk — including a crashed attempt's partial — is reachable from its key alone; nothing ever lists the directory. + +### Directory layout + +Chunk-level files group into buckets of 1,000 chunks (`bucket_id = chunk_id / 1000`, formatted `%05d`) — a filesystem concern only; bucket ids never appear in meta-store keys. Directories are created on demand. 
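
The bucket rule above, combined with the path templates in the tree that follows, can be sketched as a pair of Go helpers. The function and type names here are hypothetical, and the sketch assumes every artifact tree sits at its default location under one data directory (the `[immutable_storage.*]` overrides are ignored).

```go
package main

import (
	"fmt"
	"path"
)

type artifact int

const (
	ledgerPack artifact = iota
	eventsPack
	txhashBin
)

// bucketID groups chunk files into directories of 1,000 chunks.
func bucketID(chunk uint32) uint32 { return chunk / 1000 }

// chunkArtifactPath builds the canonical path of a per-chunk file.
// Only the main events file is modelled, not its two index companions.
func chunkArtifactPath(dataDir string, kind artifact, chunk uint32) string {
	bucket := fmt.Sprintf("%05d", bucketID(chunk))
	switch kind {
	case ledgerPack:
		return path.Join(dataDir, "ledgers", bucket, fmt.Sprintf("%08d.pack", chunk))
	case eventsPack:
		return path.Join(dataDir, "events", bucket, fmt.Sprintf("%08d-events.pack", chunk))
	default:
		return path.Join(dataDir, "txhash", "raw", bucket, fmt.Sprintf("%08d.bin", chunk))
	}
}

// indexPath builds the coverage-named path of one window's .idx file.
func indexPath(dataDir string, window, lo, hi uint32) string {
	return path.Join(dataDir, "txhash", "index",
		fmt.Sprintf("%08d", window), fmt.Sprintf("%08d-%08d.idx", lo, hi))
}

func main() {
	fmt.Println(chunkArtifactPath("/data", ledgerPack, 1234))
	fmt.Println(chunkArtifactPath("/data", txhashBin, 1234))
	fmt.Println(indexPath("/data", 1, 1000, 1005))
}
```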
+ +``` +{default_data_dir}/ +├── meta/rocksdb/ ← meta store (WAL always on) +├── hot/{chunk:08d}/ ← per-chunk hot RocksDB (transient) +├── ledgers/{bucket:05d}/{chunk:08d}.pack +├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash) +└── txhash/ + ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization + └── index/{window:08d}/{lo:08d}-{hi:08d}.idx ← one frozen file per window, coverage-named +``` + +### The chunk hot DB + +During ingestion the daemon maintains **one hot RocksDB per chunk** at `{hot_storage.path}/{chunk:08d}/`, holding everything for that chunk not yet materialized to cold artifacts. The data types are column families of the one instance: + +| Column family | Holds | Serves | +|---|---|---| +| `ledgers` | compressed LCMs (LedgerCloseMeta), keyed by seq | `getLedger` for the live chunk; the source `processChunk` reads at freeze | +| `txhash` | tx hash → seq | `getTransaction` for the live chunk | +| events CFs | live events (schema per the events doc) | `getEvents` for the live chunk | + +CFs share the instance's WAL, so each ledger commits as **one atomic WriteBatch across all CFs** — there is no cross-store ordering to reason about within a chunk. Per-CF options keep tuning independent (the events CFs carry their own settings). The DB is created when ingestion enters the chunk and discarded whole once every cold artifact derived from the chunk is durable **and** the rolling index covers the chunk; it keeps serving tx lookups across the brief freeze-to-coverage interval, and freeze, rebuild, and discard all chain within one lifecycle tick. + +### Meta-store keys + +The meta store holds three groups of keys: per-chunk artifact state keys, hot DB state keys, and config pins. + +**Artifact state keys**: + +| Key | Value | Meaning | +|---|---|---| +| `chunk:{chunk:08d}:lfs` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk pack file state. 
| +| `chunk:{chunk:08d}:txhash` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk `.bin` file state. Transient — removed at window finalization. | +| `chunk:{chunk:08d}:events` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk events cold segment state. | +| `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` | `"freezing"` \| `"frozen"` \| `"pruning"` | One key per index **coverage**. The key *name* carries the coverage `[lo, hi]` and maps 1:1 to the file `{lo:08d}-{hi:08d}.idx`; the *value* is pure lifecycle state — the same three values as every other artifact key. At most one coverage per window is `"frozen"` at any moment, and a key with `hi` = its window's last chunk is **terminal** by definition (see [Index keys](#index-keys) below). | + +For the per-chunk keys, `"freezing"` means the immutable file is being written; `"frozen"` means it's fsynced and durable; `"pruning"` means the file is queued for removal; key absent means neither file nor in-progress write exists. Index keys use the **same three states with the same meanings** — a rebuild marks its coverage `"freezing"` before any I/O, and its commit batch flips it to `"frozen"` while demoting the superseded coverage to `"pruning"`. Every artifact key therefore obeys one set of crash rules: `"freezing"` = delete (or re-derive) the file, `"pruning"` = finish the delete, `"frozen"` = truth. + +**Hot DB state key**: + +| Key | Value | Tracks | +|---|---|---| +| `hot:chunk:{chunk:08d}` | `"transient"` \| `"ready"` | The chunk's hot DB. | + +`"ready"` means the RocksDB dir exists and is usable. `"transient"` brackets a directory operation in flight — creation or deletion; no code path ever needs to know which, since the recovery is the same either way (the open path wipes and recreates; the discard scan re-runs). A crash mid-operation is detectable from the key value alone. One key per chunk; the column families inside the DB carry no individual meta-store state. 
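
To make the key shapes and crash rules above concrete, here is a hedged Go sketch. The key formats follow the tables; the helper signatures and the recovery strings are illustrative assumptions, and the real daemon presumably acts on these states rather than describing them.

```go
package main

import "fmt"

// Key builders following the formats in the tables above.
func chunkKey(chunk uint32, kind string) string {
	return fmt.Sprintf("chunk:%08d:%s", chunk, kind) // kind: lfs, txhash, events
}

func indexKey(window, lo, hi uint32) string {
	return fmt.Sprintf("index:%08d:%08d:%08d", window, lo, hi)
}

func hotChunkKey(chunk uint32) string {
	return fmt.Sprintf("hot:chunk:%08d", chunk)
}

// artifactRecovery describes what a restart does for an artifact key,
// given its value and whether the key exists at all.
func artifactRecovery(state string, present bool) string {
	switch {
	case !present:
		return "nothing on disk, nothing to do"
	case state == "freezing":
		return "delete or re-derive the file"
	case state == "pruning":
		return "finish the delete"
	case state == "frozen":
		return "file is durable, trust it"
	default:
		return "unknown state"
	}
}

// hotRecovery is the same idea for the hot DB key. "transient" has one
// recovery whether creation or deletion was interrupted.
func hotRecovery(state string, present bool) string {
	switch {
	case !present:
		return "no hot DB"
	case state == "ready":
		return "open and use"
	case state == "transient":
		return "wipe and recreate on open, or let the discard scan re-run"
	default:
		return "unknown state"
	}
}

func main() {
	fmt.Println(chunkKey(42, "lfs"), "->", artifactRecovery("freezing", true))
	fmt.Println(indexKey(0, 0, 5), "->", artifactRecovery("pruning", true))
	fmt.Println(hotChunkKey(7), "->", hotRecovery("transient", true))
}
```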
+ +**Config pins** (there is no stored watermark — see below): + +| Key | Value | Written when | +|---|---|---| +| `config:earliest_ledger` | `uint32` (decimal string, chunk-aligned) | On the first daemon start. Immutable thereafter — changing it currently requires wiping the data directory, until a `set-earliest-ledger` admin command exists (see [Configuration](#configuration); the floor machinery already converges for either direction). | +| `config:chunks_per_txhash_index` | `uint32` (decimal string) | On the first daemon start; immutable thereafter. Startup aborts if the config value doesn't match. | + +**Progress is derived, never stored — and never shared.** The hot DB's synced per-ledger WriteBatch *is* the durable commit; recording it again in the meta store would only create a second copy of the same fact, plus the ordering rule needed to keep the copy honest. Two derivations read progress back out of the catalog, one per consumer, at the two granularities they need. Both lean on one **key-creation invariant**: a `hot:chunk` key is created only after every ledger below its chunk has durably committed — at a boundary, ingestion closes chunk C's write handle *before* creating C+1's key (the [ingestion loop](#hot-db-ingestion) enforces the ordering); at startup, the resume chunk's key is created only after derivation has already run. The highest hot key therefore *is* the live chunk, and everything below it is complete. + +The lifecycle tick needs only chunk granularity — which chunks are complete, and where the sliding retention floor anchors: + +```go +// completeThrough is the highest ledger the lifecycle may treat as durably +// ingested. 
Two complementary terms: a COLD term — the highest chunk whose +// artifacts are all durable — which leads at startup (catch-up just ran; +// ingestion hasn't started); and a POSITIONAL term — everything below the +// live chunk, by the key-creation invariant — which leads in steady state +// (a chunk completes long before its cold artifacts exist). +func deriveCompleteThrough(cat Catalog) uint32 { + // Cold term: a chunk counts only when pendingArtifacts() is empty (lfs + // AND events frozen; txhash frozen or index-covered). NOT merely "lfs + // frozen": a crash mid-freeze can leave lfs frozen while events is still + // "freezing", and counting that chunk would let reads open over a + // partial artifact. An incompletely frozen tip chunk must DEGRADE the + // bound so catch-up / re-ingestion repairs it. + through := chunkLastLedger(highestDurableChunk(cat)) + // Positional term. hotChunkKeys returns every hot:chunk:* key regardless + // of value — counting a "transient" key is sound because it is only ever + // put after the predecessor chunk's write handle closed. + if hot := hotChunkKeys(cat); len(hot) > 0 { + through = max(through, chunkLastLedger(maxChunk(hot)-1)) + } + return max(through, cat.EarliestLedger()-1) +} +``` + +Ingestion's resume point at startup needs the exact ledger — the one consumer of sub-chunk precision: + +```go +// deriveWatermark is deriveCompleteThrough refined by exactly ONE read: +// sub-chunk precision only ever matters inside the live chunk, and a lower +// ready chunk can never hold the maximum (key-creation invariant, above), so +// only the live chunk's DB is opened. Runs once, before ingestion starts — +// the only time opening a hot DB is safe (once ingestion runs, the live DB +// is held exclusively by its writer). 
+func deriveWatermark(cat Catalog) uint32 { + for _, c := range readyHotChunks(cat) { + if !dirExists(hotChunkPath(c)) { + // Checked for EVERY ready key, not just the one opened below: + // derivation runs before every other open site, so without this + // a lost hot volume dies as an opaque RocksDB error (or worse, + // a predecessor's loss is silently healed by discard) instead + // of the curated recovery instruction. Never skip silently — + // that would auto-heal the mount-misconfiguration case. + fatalf("hot:chunk:%08d is \"ready\" but its dir is missing — "+ + "hot storage lost; run surgical recovery (case 4).", c) + } + } + w := deriveCompleteThrough(cat) + if live, ok := highestReadyHotChunk(cat); ok { + w = max(w, maxCommittedSeq(openReadOnly(live))) + } + return w +} +``` + +During operation no shared watermark exists at all: ingestion keeps its progress as a plain local, and each lifecycle tick calls `deriveCompleteThrough` fresh. The meta-store catalog thereby stays a *pure* catalog — every key names a file/dir state or a config pin. Postcondition-driven catch-up is what makes a derived watermark safe: catch-up converges *ranges*, not resume pointers, so derivation can never hide a hole — and a lost hot volume self-degrades the watermark to the last frozen boundary instead of requiring a manual rewind ([surgical recovery case 4](#scenario-coverage)). + +### Index keys + +An index key `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` carries the chunk range `[lo, hi]` its `.idx` covers in the key **name**; the value holds only lifecycle state. The filename is derived from the key by a fixed bijection — `txhash/index/{txhash_index:08d}/{lo:08d}-{hi:08d}.idx` — so resolving a key to its file never involves reading the value or listing a directory. 
[The transactions design](./gettransaction-full-history-design.md) (§6.3) is the canonical reference for coverage semantics, with the rationale and a worked example; the properties this doc's protocols depend on: + +- **Coverage is the whole identity** — there is no per-attempt counter. A retry of a crashed build re-marks the same key and rewrites the same file from scratch, exactly as rule 1 re-materializes a per-chunk artifact at its canonical path. +- **`lo`** rises above the window's first chunk when `earliest_ledger` or the sliding floor cuts into the window at build time (`lo` rising is how a mid-window floor is encoded); **`hi`** advances by one chunk per boundary while the window is current, and equals the window's last chunk once it finalizes. +- **Terminal-ness is derived, not stored**: the key whose `hi` equals its window's last chunk (computable forever from the immutable `chunks_per_txhash_index` pin). A window whose frozen key is terminal is finalized — its `.bin` inputs were demoted in the same commit, and its index is never rebuilt again (only a retention-widening catch-up re-derives it, at its new, wider coverage). +- **The uniqueness invariant: at most one coverage per window is `"frozen"` at any moment.** The rebuild's commit is one atomic synced batch ([rule 3](#catch-up-primitives)'s commit step holds the exact composition), so the frozen coverage changes hands atomically and readers resolve "the window's index" as *the unique frozen key* — no tie-break, no value parsing. Everything else under the window's prefix is transient debris: `"freezing"` = a crashed attempt (re-marked and overwritten if its coverage is built again, otherwise swept); `"pruning"` = a superseded coverage (finish the unlink, drop the key). + +So the `.idx` hashes exactly the transactions in chunks `[lo, hi]`: chunks below `lo` are out of scope (floor); chunks above `hi` are served from their chunks' hot DBs until the next rebuild advances `hi`. 
While the window is current, `lo` tracks the floor and `hi` the tip automatically — no separate floor-driven rebuild is ever needed. Once finalized, the `.idx` is static; a floor that later advances within the window leaves it stale (`lo` referencing chunks pruning has since removed), which the reader retention contract handles cleanly. A window-straddling floor exists at most once at any moment — the window containing the effective retention floor. + +### Per-chunk artifact lifecycle + +Pack files, events segments, and `.bin` files (the `processChunk` outputs) are write-once: + +``` + absent ──► ingesting ──► freezing ──► frozen ──► pruning ──► absent +``` + +- **Absent** — no artifact key, no immutable file. +- **Ingesting** — hot DB holds the data being written; artifact key is absent; immutable file doesn't exist yet. +- **Freezing** — `processChunk` has put `"freezing"` and is materializing the file. The file may be partial on disk. A crash here is detectable from the key value alone — recovery either re-writes (within retention) or deletes the partial file (past retention). +- **Frozen** — immutable file is fsynced at its canonical path; artifact key is `"frozen"`. Once all three of a chunk's artifacts are frozen *and* the rolling index covers the chunk, the chunk's hot DB is discarded. +- **Pruning** — retention is deleting the immutable file; artifact key is `"pruning"`; the file may or may not still be on disk. + +The `"freezing"` mark is set **before** any I/O for that artifact. This gives the invariant **"any file on disk has its meta-store key set"** — a retention scan iterates keys, and every file is reachable that way. Without the pre-write mark, a crash between writing the file and setting the key would leave an artifact that no scan could see. + +### Index artifact lifecycle + +The streamhash index is the one logically-mutable cold artifact: it is re-derived on every chunk boundary while its window is current. 
Physically a file is writable only while its key has never been `"frozen"` in this run — mutation happens by freezing the next coverage and demoting the old one; the frozen file readers resolve is immutable until unlinked. + +Each coverage runs the same lifecycle as every per-chunk artifact: + +``` + absent ──► freezing ──► frozen ──► pruning ──► absent +``` + +- **Freezing** — the key was put (with its coverage in the name) *before* any I/O; the file may be partial or absent. A crashed attempt parks here. If its coverage is built again, the build re-marks the key and rewrites the file wholesale; one the prune scan observes was not retried — **delete file and key, never salvage**. Salvage would require proving the file complete; deletion needs no proof, and a rebuild re-derives identical bytes anyway (the merge is a deterministic function of the coverage). +- **Frozen** — the file and its dirent are fsynced and the commit batch has landed. The window's unique frozen coverage *is* the live index. +- **Pruning** — a newer coverage superseded this one (demoted in its commit batch), or retention is removing the window. The standard sweep finishes: unlink → `fsyncDir` → delete the key. + +The *window-level* progression — coverage advancing boundary by boundary, then finalization — emerges from the coverage chain: each boundary freezes the widened coverage and demotes its predecessor in one atomic batch, and the terminal key is the one whose `hi` equals the window's last chunk. The batch maintains **at most one frozen coverage per window at all times** — a crash at any instant leaves either the old coverage frozen (batch not landed; the new one is `"freezing"` debris) or the new one frozen (predecessor already `"pruning"`), never both, never neither. 
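
The uniqueness invariant and the derived terminal property can be sketched as a resolver over an in-memory catalog. This is an illustration only: `coverage`, `parseIndexKey`, `resolveFrozen`, and `isTerminal` are hypothetical names, and a real implementation would presumably prefix-scan the meta store rather than iterate a Go map.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// coverage is the identity carried in an index key's name.
type coverage struct{ window, lo, hi uint32 }

// parseIndexKey decodes "index:{window}:{lo}:{hi}".
func parseIndexKey(key string) (coverage, bool) {
	parts := strings.Split(key, ":")
	if len(parts) != 4 || parts[0] != "index" {
		return coverage{}, false
	}
	var v [3]uint32
	for i, p := range parts[1:] {
		n, err := strconv.ParseUint(p, 10, 32)
		if err != nil {
			return coverage{}, false
		}
		v[i] = uint32(n)
	}
	return coverage{v[0], v[1], v[2]}, true
}

// resolveFrozen returns the window's unique frozen coverage, if any.
// More than one frozen key is reported as an invariant violation.
func resolveFrozen(cat map[string]string, window uint32) (coverage, bool, error) {
	var found coverage
	count := 0
	for key, state := range cat {
		c, ok := parseIndexKey(key)
		if !ok || c.window != window || state != "frozen" {
			continue
		}
		found = c
		count++
	}
	if count > 1 {
		return coverage{}, false, fmt.Errorf("window %d: %d frozen coverages", window, count)
	}
	return found, count == 1, nil
}

// isTerminal derives finalization from the key name and the cpi pin.
func isTerminal(c coverage, cpi uint32) bool { return c.hi == (c.window+1)*cpi-1 }

func main() {
	cat := map[string]string{
		"index:00000001:00001000:00001003": "pruning",  // superseded
		"index:00000001:00001000:00001004": "frozen",   // the live index
		"index:00000001:00001000:00001005": "freezing", // crashed attempt
	}
	c, ok, err := resolveFrozen(cat, 1)
	fmt.Println(c, ok, err, "terminal:", isTerminal(c, 1000))
}
```

Transient keys never affect the result, which is why readers need no tie-break: only the value `"frozen"` is consulted, and the coverage comes from the name.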
+ +Why rewriting coverage-named files in place is safe — the question to ask, since readers hold the live `.idx` open while the next one is written — is argued in full in [the transactions design](./gettransaction-full-history-design.md) (§7.5). Four facts carry it: the build's skip rule (no scheduled build ever targets the name readers resolve), the stage ordering plus the eager sweep's window-locality and pruning-only scope, the floor's monotonicity within a run plus reader handles dying with the process, and the merge's determinism. A change to any of the four must re-prove the argument. + +### Hot DB lifecycle + +``` + absent ──► transient ──► ready ──► transient ──► absent + (creating) (deleting) +``` + +- **Absent** — no hot DB key, no dir on disk. +- **Transient** — a directory operation is in flight: either creation (key put, RocksDB dir initializing) or deletion (discard rmdir'ing the dir). A crash in either leaves a possibly-partial dir; the recovery is identical regardless of which operation was interrupted — the open path wipes and recreates, the discard scan re-runs — which is why one value suffices. +- **Ready** — dir exists and is usable for reads and writes. The chunk's contents at any moment run up to `last_committed_ledger`'s position within the chunk; ledgers above it are pending streaming. + +The hot DB is discarded whole once the chunk's cold artifacts are all frozen and the rolling index covers the chunk — freeze, rebuild, and discard chain within one lifecycle tick. All lifecycles in this section are observable purely from meta-store keys — no filesystem inspection needed. + +### One write protocol + +Every durable artifact — per-chunk files and index coverages alike — uses the same protocol, **mark-then-write**: put `"freezing"` *before* any I/O; write the file; fsync the file, its parent dirent, and (when the parent was just created) the grandparent dirent; flip the key to `"frozen"`. 
The pre-mark guarantees *every file on disk has a key*, so all cleanup is key-driven — nothing ever lists a directory to find work — and a crash mid-write is visible as a `"freezing"` key. The dirent barriers guarantee the key never outlives the file's creation: without them, a power crash can revert the file's — or a freshly created directory's — creation under a durable `"frozen"` key, which key-only idempotency would then never repair; that is why writes that create their parent dir barrier the grandparent too (the same two-level barrier `openHotDB` uses). + +Per-chunk artifacts write at a canonical path and flip with a single-key put. The index extends the flip into a **commit batch** — the artifact is logically mutable, so everything a build changes commits *together* in one atomic synced write; [rule 3](#catch-up-primitives)'s commit step holds the exact composition. The batch extension changes what commits together, not how a file becomes durable. + +Exits mirror the entries: every sweep demotes a still-`"frozen"` key first, then removes the *file before the key* with an `fsyncDir` barrier between (`sweepChunkArtifacts` and `sweepIndexKey` — the system's only two deletion bodies, one per key family), giving the complementary guarantee — *key absent ⟹ file gone*. The hot DB's `transient`/`ready` bracket is the same two ideas applied to a directory. + +--- + +## Catch-up primitives + +Two primitives materialize cold artifacts: `processChunk` and `buildTxhashIndex`. They have exactly two callers — the [Startup](#startup) catch-up loop and the [Lifecycle](#lifecycle) tick — and both call them the same way: through the [resolver](#postcondition-driven-scheduling) and executor, sharing one set of postconditions and one scheduler, so the two regimes can never disagree about what done looks like or how it gets there. 
Five protocol rules govern them; the [resolver](#postcondition-driven-scheduling) and the [execution model](#catch-up-execution-model) below turn them into a schedule. + +**(1) Artifact key values.** `processChunk` applies the [one write protocol](#one-write-protocol) to each requested kind (`lfs`, `events`, `txhash`/`.bin`): `"freezing"` before that kind's I/O begins, `"frozen"` only after its file and dirent barriers (a new bucket dir, every 1000th chunk, barriers the grandparent too). The per-kind idempotency rule: skip iff the key's value is `"frozen"`; a `"freezing"`, `"pruning"`, or absent key triggers re-materialization, itself idempotent — the writer overwrites the file at its canonical path and flips to `"frozen"`. The streamhash index uses the **same pattern at its coverage-named path**; only its commit differs — the `"freezing"`→`"frozen"` flip rides in rule 3's atomic batch instead of a single-key put. + +**(2) `processChunk(chunk, artifacts)`.** `artifacts` is the subset of outputs to produce; the [resolver](#postcondition-driven-scheduling) uses it to skip producing `.bin` when the window's `.idx` already covers the chunk. The LCM source is chosen internally by `catchupSource`, in preference order — and the same rule serves *both* callers, which is what lets the lifecycle's freeze be ordinary plan execution: + +1. **A ready, complete hot DB** (`maxCommittedSeq ≥ chunkLastLedger(chunk)`): read locally via `HotLedgers` — this is how a just-closed chunk freezes without refetching, and how a complete-but-unfrozen chunk is produced even when the bulk source lags behind it. +2. **The frozen local `.pack`**, when `lfs` is not among the requested outputs: re-derivation without a download. +3. **The configured bulk backend** (BSB by default — see `[catch_up.bsb]`). 
+ +The hot branch distinguishes *loss* from *staleness*: a `"ready"` key whose directory is **missing or unopenable** is hot-volume loss — the same case-4 fatal `deriveWatermark` enforces, never silently healed; a hot DB that opens but is **incomplete** is legitimate staleness (a leftover awaiting the discard scan, or a surgically stripped chunk's stale neighbor) and simply falls through to the next source — re-derivation *is* its recovery. Per ledger, the needed extractors run over one LCM stream; tx-hash entries are collected and **sorted in memory** before the `.bin` is written (rule 5). + +```go +// processChunk materializes the requested artifact kinds for one chunk from a +// single pass over its 10,000 LCMs, sourced by catchupSource (rule 2's +// preference order). +func processChunk(chunk ChunkID, artifacts ArtifactSet, cfg Config) error { + cat := cfg.Catalog + for _, kind := range artifacts.Kinds() { // rule 1 idempotency: frozen kinds self-skip + if cat.State(chunk, kind) == Frozen { + artifacts = artifacts.Remove(kind) + } + } + if artifacts.Empty() { + return nil + } + source := catchupSource(chunk, artifacts, cfg) + + batch := cat.NewBatch() // mark-then-write: "freezing" BEFORE any I/O + for _, kind := range artifacts.Kinds() { + batch.Put(chunkKey(chunk, kind), "freezing") + } + batch.Commit() + + // One streaming pass; only the requested extractors run. Files are + // (re)created at their canonical paths — re-materialization overwrites. 
+ w := newArtifactWriters(chunk, artifacts) // .pack, events segment, in-memory txhash entries + for seq := chunkFirstLedger(chunk); seq <= chunkLastLedger(chunk); seq++ { + w.Add(source.GetLedger(seq)) + } + w.Finish() // sorts txhash entries in memory, writes the .bin (rule 5) + w.FsyncAll() // files + parent dirents (+ grandparent for a new bucket dir) + // — all durable BEFORE the flips below (rule 1) + + batch = cat.NewBatch() + for _, kind := range artifacts.Kinds() { + batch.Put(chunkKey(chunk, kind), "frozen") + } + batch.Commit() + return nil +} + +// catchupSource implements rule 2's preference order. The hot branch fatals +// only on loss (ready key, missing/unopenable dir — deriveWatermark's rule, +// third call site); an incomplete-but-present DB is staleness and falls +// through, because re-derivation from the next source IS its recovery. +func catchupSource(chunk ChunkID, artifacts ArtifactSet, cfg Config) LedgerSource { + cat := cfg.Catalog + if state, _ := cat.Get(hotChunkKey(chunk)); state == "ready" { + if !dirExists(hotChunkPath(chunk)) { + fatalf("hot:chunk:%08d is \"ready\" but its dir is missing — "+ + "hot storage lost; run surgical recovery (case 4).", chunk) + } + if db := openRocksDBReadOnly(hotChunkPath(chunk)); maxCommittedSeq(db) >= chunkLastLedger(chunk) { + return &HotLedgers{chunk: chunk, store: db} + } // incomplete: stale leftover — fall through; the discard scan owns it + } + if cat.State(chunk, LFS) == Frozen && !artifacts.Has(LFS) { + return packReader(chunk) // re-derive locally; no redundant download + } + return bulkBackend(cfg) // BSB by default — see [catch_up.bsb] +} +``` + +**(3) `buildTxhashIndex(w, lo, hi)` — rolling rebuild.** The lifecycle rebuilds the **current** window's index on every chunk boundary, so a frozen chunk's tx hashes move into the index promptly and its hot DB is discarded in the same tick. 
The covered chunk range is explicit: + +- `lo` defaults to the window's first chunk, rising to `chunkID(effectiveRetentionFloor)` when the floor cuts into this index. +- `hi` is the highest frozen chunk in the window — the window's last chunk once it's complete, lower while it's still filling. + +The build runs the one write protocol with the batch-commit extension: + +1. **Skip check**: if the window's unique frozen key already covers exactly `[lo, hi]`, return — there is nothing to write, and any leftover transient keys are the sweeps' job (rule 4), not the builder's. (A frozen key covering the full window is terminal by definition — `hi` equals the window's last chunk — so the skip also covers re-scheduled builds of finalized windows, which must not demand `.bin` inputs the sweep has deleted.) +2. **Mark**: put `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` = `"freezing"` — an idempotent overwrite when a crashed attempt (or a demoted coverage made desired again by a cross-restart regression) left the key behind. The build is **terminal** iff `hi` is the window's last chunk — a derived property, marked nowhere. +3. **Write**: k-way merge the sorted `.bin` files for chunks `[lo, hi]` into streamhash's `SortedBuilder`, writing `txhash/index/{txhash_index:08d}/{lo:08d}-{hi:08d}.idx` — created or truncated wholesale; a writer only ever holds a file whose key is non-frozen, never one a reader can resolve. Fsync the file and its dir (and the dir's own dirent in `txhash/index/` when this build created the window dir — first build of a window). +4. **Commit**: one atomic synced batch — this coverage `"freezing"`→`"frozen"`; the window's predecessor frozen coverage (if any) →`"pruning"`; and iff this is the terminal build, every `chunk:{chunk}:txhash` key in the window →`"pruning"`. This batch is the *entire* finalization protocol — there is no separate cleanup step; the demoted keys become ordinary sweep work (rule 4). 
+ +A crash before step 4 leaves the predecessor frozen and the new coverage as `"freezing"` debris (file partial or complete — irrelevant; it is deleted unread). A crash after step 4 leaves the new coverage frozen and the demoted keys as `"pruning"` work the sweeps finish. There is no crash point at which two coverages are frozen, the live index is unreachable, or a `"frozen"` `chunk:c:txhash` key's `.bin` has been deleted — the batch only ever *demotes* keys, and files are touched exclusively by the sweeps, under non-frozen keys. + +Precondition: every chunk in `[lo, hi]` has `chunk:{chunk}:txhash == "frozen"` (its `.bin` exists). The function fails loudly if violated. Catch-up calls the same function for every window its range overlaps — a complete window's full desired range (a terminal build) or the trailing window's producible range (non-terminal); the terminal batch finalizes in the same write, so finalization is never a separate step for either caller. + +The full build mechanics live in [the transactions design](./gettransaction-full-history-design.md) (§7): the `buildTxhashIndex` pseudocode, the rationale for rebuilding from scratch each boundary, the disk bounds and provisioning numbers (≈2× index size transient per rebuild; ~12.5 GB written in ~1 minute at a dense window's end), the crash matrix, and the safety argument for rewriting coverage-named files. This doc keeps the protocol surface above — the steps, the commit batch's composition, and the precondition — because the resolver, the sweeps, and the crash analysis depend on exactly those. + +**(4) Key-driven sweeps.** All file deletion in the system happens through keys in a transient state — never by listing directories. Two sweep rules cover every case, sharing one mechanic (unlink the file → `fsyncDir` the parent → delete the key, batching the fsyncs and key-deletes when sweeping many at once): + +- **Index `"freezing"` keys** — an abandoned build attempt, left by a crash. 
A retry of the same coverage re-marks and overwrites the key in place (rule 3 step 2), and builds run before this sweep in every regime — so any `"freezing"` key the sweep observes was *not* retried: its coverage is no longer desired. Disposition: **delete file and key, never salvage**. The file might even be complete, but proving that buys nothing — a rebuild re-derives identical bytes — and a single no-questions rule collapses the crash inventory. (Per-chunk `"freezing"` keys are *not* swept this way: their artifacts live at canonical paths, so rule 1's idempotent re-materialization repairs them in place within retention, and the retention prune removes them past it. One exception: a `"freezing"` `chunk:c:txhash` key inside a *finalized* window — re-materialization will never be scheduled for a covered window, so the prune scan's redundant-input branch demotes and sweeps it like its `"frozen"` siblings.) +- **`"pruning"` keys** — superseded index coverages, a finalized window's `.bin` inputs (demoted en masse by the terminal commit batch), and everything demoted by retention pruning. The sweep finishes the removal. Because the key outlives the unlink — the `fsyncDir` barrier makes the unlink durable *before* the key delete commits — a power loss anywhere leaves the key in place and the sweep re-runs. This is the exit-side counterpart of rule 1's invariant: **key absent ⟹ file gone.** + +The unlink-before-key order is load-bearing: deleting the key first would, on a crash, leave a file with no key — invisible to every key-driven scan, the one orphan class this design cannot find. + +Both sweeps have two call sites: **eagerly, inside every `IndexBuild`'s execution** (`buildThenSweep`, right after the commit batch — whichever regime ran it), and from the tick's prune stage, which is the backstop for crash leftovers and the owner of retention pruning. 
The eager site is what bounds disk: without it, a long backfill would accumulate every finalized window's demoted `.bin`s until the first tick (≈20 bytes per transaction across all of history); with it, transient `.bin` disk is bounded by the windows actually in flight — the floor is one dense window's worth (~60 GB), irreducible because a window's build merges all of its `.bin`s at once. Crash anywhere mid-sweep leaves `"pruning"` keys the next tick finishes — the same convergence story regardless of caller. + +**(5) Streamhash formats.** The tx-hash artifacts use the streamhash pipeline, specified in [the transactions design](./gettransaction-full-history-design.md) (§6). What this doc's protocols rely on: the `.bin` is a **sorted** per-chunk run (`processChunk` sorts the chunk's entries in memory before writing — ~3M entries ≈ 60 MB, negligible), which is what makes the every-boundary rebuild a single streaming k-way merge instead of a two-pass build over unsorted input; the `.idx` is one self-contained streamhash MPHF file per coverage (its `MinLedger` derived from `lo`; no sidecar metadata); and hot tx hashes live as **a single column family inside the per-chunk hot DB** — there is no dedicated txhash store at any layer. The formats match the measured pipeline in the bench harness (`bench-fullhistory`), which is what makes the ~1-minute full-window build figure transfer to this design. + +### Postcondition-driven scheduling + +A naive scheduler would register every per-chunk and per-window task on every run and rely on each task to self-skip — but catch-up runs on every restart, and that shape re-derives every chunk's `.bin` only for finalization to immediately demote and delete it again: wasted work proportional to the retention window. Instead, catch-up has a contract — *given a range, ensure every artifact derived from every ledger in it is durable and servable* — and resolves what's missing before scheduling anything. 
The resolver is a registry of **kind rules**: each artifact kind contributes one rule that compares its postcondition against the catalog and emits the difference as tasks. The current kinds: + +- **`lfs`** (per-chunk): needed for chunk `c` iff `chunk:{c}:lfs` isn't `"frozen"`. +- **`events`** (per-chunk): same rule against `chunk:{c}:events`. +- **`txhash`** (per-window, the one kind with a cross-chunk artifact): for **each window overlapping the range**, compare the stored coverage (`{lo, hi}` from the *name* of the window's unique **frozen** index key) with the desired coverage `[max(window_start, chunkID(floor)), min(window_last_chunk, range_end)]` — the upper cap is what makes the rule uniform: for a complete window it's the window's last chunk, for the trailing window it's the range end, and no special trailing case exists. + - **Desired ⊆ stored** → schedule *nothing* for this window: no `.bin` production, no build. Three states land here: every steady-state restart; a floor that *rose* (the stale stored `lo` is the reader retention contract's problem, not a rebuild trigger); and a finalized window the range ends inside — a crash right after a terminal commit resumes exactly at that window's last ledger, where the terminal coverage already covers any desired range and the leftover `"pruning"` demotions stay the sweeps' job. + - **Desired exceeds stored** (`desired_lo < stored_lo`, or `desired_hi > stored_hi`, or no frozen key exists) → request `.bin` production for **every** chunk in the desired range — chunks whose `.bin` is already frozen self-skip inside `processChunk`, chunks the old `.idx` covered re-derive from local `.pack` files (no BSB) — and emit one index build `buildTxhashIndex(w, desired_lo, desired_hi)`. The build is terminal (input demotion) iff `desired_hi` is the window's last chunk. 
The `stored_hi` clause matters: a window that was *current* at shutdown carries a frozen key with `hi < last_chunk`, and when downtime crosses the window boundary it becomes a complete window that still needs its tail chunks' `.bin` and the full build — classifying by `lo` alone would strand chunks `(hi, last_chunk]` permanently. + +A new data type slots in as a new rule: a per-chunk kind adds a key check, an indexed kind adds another window loop contributing index builds. The skeleton that executes the plan (below) never changes. + +The comparison can trust `"frozen"` blindly: **a `"frozen"` `chunk:c:txhash` key implies its `.bin` exists, unconditionally.** Input keys are demoted to `"pruning"` in the same synced write that freezes the terminal coverage, and files are only ever deleted by the sweeps, under non-frozen keys — so no crash at any point can leave a frozen key whose file is gone. Whatever transient keys a crash does leave behind (`"freezing"` attempts, half-swept `"pruning"` demotions) are invisible to the resolver — it classifies on frozen state only — and are swept by the first lifecycle tick, rung at ingestion start. + +The per-window comparison is therefore crash-only-recoverable in *every* index state: a finalized window, a window mid-roll at shutdown, a terminal commit that landed but whose sweeps didn't run, and a crashed build attempt all converge through "desired vs stored → re-derive, rebuild," with inputs that are guaranteed producible (`.pack` files are within retention by definition of the desired range). One composition needs help from outside the resolver: a widening catch-up that re-froze a finalized window's `.bin` keys — or crashed mid-write, leaving one `"freezing"` — and then retention is narrowed back before its rebuild. 
The resolver then correctly schedules nothing (desired ⊆ stored), so re-materialization will never repair those keys; the prune stage's redundant-input branch demotes and sweeps them, `"frozen"` and `"freezing"` alike (see [Prune](#prune)). + +In code, the kind rules produce one flat value: + +```go +// Both strata are pure data — no behavior is baked into the plan; the +// executor interprets it. That is what makes "the plan is just a value" +// literally true: it can be logged, diffed, and tested without running it. +type ChunkBuild struct { + Chunk ChunkID + Artifacts ArtifactSet // which kinds this chunk still needs — one processChunk pass produces all +} + +type IndexBuild struct { + Window WindowID + Lo, Hi ChunkID // coverage to build; terminal iff Hi == windowLastChunk(Window) + // No input list: the build's dependencies are derivable — every chunk in + // [Lo, Hi] that has a ChunkBuild in the same plan. Carrying them as a + // field would be a second copy that can drift. +} + +type Plan struct { + ChunkBuilds []ChunkBuild + IndexBuilds []IndexBuild +} + +// resolve computes the diff between desired state and the catalog. Pure read; +// the plan is just a value, recomputed from durable keys on every run — a +// restart re-plans from what is actually on disk, with nothing to reconcile. 
+func resolve(cfg Config, rangeStart, rangeEnd ChunkID) Plan { + if rangeEnd < rangeStart { + return Plan{} // young network: no complete chunk exists yet + } + cat := cfg.Catalog + floor := chunkFirstLedger(rangeStart) // rangeStart already encodes the floor + needs := map[ChunkID]ArtifactSet{} // per-chunk work, union across kinds + + for c := rangeStart; c <= rangeEnd; c++ { // per-chunk kinds + for _, kind := range []Kind{LFS, Events} { + if cat.State(c, kind) != Frozen { + needs[c] = needs[c].Add(kind) + } + } + } + + var builds []IndexBuild + for _, w := range windowsOverlapping(rangeStart, rangeEnd) { // the txhash kind + desired := Range{ + Lo: max(windowFirstChunk(w), chunkID(floor)), + Hi: min(windowLastChunk(w), rangeEnd), // capped by range end ⇒ uniform trailing-window handling + } + stored := frozenCoverage(cat, w) // the unique "frozen" key's coverage, or none + if stored.Covers(desired) { + continue // steady-state restart, risen floor, or finalized window: nothing + } + for c := desired.Lo; c <= desired.Hi; c++ { + if cat.State(c, TxHashBin) != Frozen { + needs[c] = needs[c].Add(TxHashBin) + } + } + builds = append(builds, IndexBuild{Window: w, Lo: desired.Lo, Hi: desired.Hi}) + } + return Plan{ChunkBuilds: chunkBuilds(needs), IndexBuilds: builds} +} + +// buildThenSweep is how the executor runs an IndexBuild. The build's commit +// batch only demotes keys (rule 3); this eagerly runs the standard sweeps +// (rule 4) so the demoted files come back without waiting for a lifecycle +// tick. The sweep is WINDOW-LOCAL — it walks only this window's keys, so +// concurrent windows' sweeps touch disjoint keys — and as a bonus it +// finishes any "pruning" leftovers a previous crashed pass left in the same +// window. 
+func buildThenSweep(b IndexBuild, cfg Config) error { + cat := cfg.Catalog + if err := buildTxhashIndex(b.Window, b.Lo, b.Hi, cat); err != nil { + return err + } + for _, key := range indexKeys(cat, b.Window) { // superseded coverage(s) + if key.State == Pruning { + sweepIndexKey(key, cat) + } + } + var demoted []ArtifactRef // terminal build: the window's .bin inputs + for c := windowFirstChunk(b.Window); c <= windowLastChunk(b.Window); c++ { + if cat.State(c, TxHashBin) == Pruning { + demoted = append(demoted, ArtifactRef{Chunk: c, Kind: TxHashBin}) + } + } + if len(demoted) > 0 { + sweepChunkArtifacts(demoted, cfg, cat) + } + return nil +} +``` + +### Catch-up execution model + +`executePlan` executes a plan; `runBackfill` is just backend validation plus `executePlan(resolve(…))`, and the [lifecycle tick](#lifecycle) calls the *same* `executePlan` for its production work — one scheduler, two callers. The shape is map/reduce without the shuffle or the job tracker: chunk builds are the maps, index builds are the per-group reduces (the `.bin`s are map-side-sorted runs, so each reduce is one streaming merge), and completion is recorded as the artifacts themselves. There is deliberately no task engine and no persisted task state, for two reasons. First, the dependency structure is two strata with one edge type — an index build waits on the chunk builds inside its coverage — which the runtime expresses directly: each chunk build closes a done-channel, each index build waits on the in-coverage channels, and the *ready-set* a DAG scheduler would maintain is simply the goroutines parked on the one worker semaphore. Second, a persisted task graph would be a second source of truth about progress, one that can drift from the artifact keys it describes; `resolve` re-plans from the keys on every run, so there is nothing to resume and nothing to reconcile. 
+ +```go +func runBackfill(ctx context.Context, cfg Config, rangeStart, rangeEnd ChunkID) error { + // Every in-range chunk must be producible from SOME source: durable + // artifacts (self-skips), a complete ready hot DB, the local .pack, or + // the bulk backend — fail before any work otherwise. + if err := validateBackendCovers(cfg, rangeStart, rangeEnd); err != nil { + return err + } + return executePlan(ctx, resolve(cfg, rangeStart, rangeEnd), cfg) +} + +func executePlan(ctx context.Context, plan Plan, cfg Config) error { + slots := make(chan struct{}, cfg.Workers) // the ONLY concurrency knob: one pool, all work kinds + done := make(map[ChunkID]chan struct{}, len(plan.ChunkBuilds)) + for _, cb := range plan.ChunkBuilds { + done[cb.Chunk] = make(chan struct{}) + } + + g, gctx := errgroup.WithContext(ctx) + for _, cb := range plan.ChunkBuilds { + g.Go(func() error { + defer close(done[cb.Chunk]) // completion broadcast + slots <- struct{}{} // acquire a worker slot + defer func() { <-slots }() // release + return withRetries(gctx, cfg.MaxRetries, func() error { + return processChunk(cb.Chunk, cb.Artifacts, cfg) + }) + }) + } + for _, b := range plan.IndexBuilds { + g.Go(func() error { + for c := b.Lo; c <= b.Hi; c++ { // wait on the in-coverage chunk builds — + if ch, ok := done[c]; ok { // derived, not stored; already-frozen + select { // inputs have no channel and no wait + case <-ch: + case <-gctx.Done(): + return gctx.Err() + } + } + } + slots <- struct{}{} // index builds draw from the same pool + defer func() { <-slots }() + return withRetries(gctx, cfg.MaxRetries, func() error { + return buildThenSweep(b, cfg) + }) + }) + } + return g.Wait() +} +``` + +- **`cfg.Workers` is the only resource knob** (default `GOMAXPROCS`). 
The goroutines are structure, not resources: thousands may exist, parked either on the semaphore (queued tasks) or on done-channels (builds awaiting inputs), costing a few KB each; at most `Workers` tasks execute at any instant, drawn from all windows' eligible work mixed together. An index build fires the moment its own in-coverage chunk builds finish, without waiting on other windows. (The derived wait slightly over-approximates — a build also waits on an in-coverage chunk producing only `lfs`/`events` — which is harmless: waiting longer is always safe, and the case arises only in widening scenarios.) +- The executor runs each `IndexBuild` via `buildThenSweep` (defined with `resolve` above), which lands the commit batch (terminal for complete windows) and then runs the eager `"pruning"` sweep (rule 4). The sweep is window-local — this window's demoted inputs and superseded coverages, not a store-wide scan — so concurrent windows' sweeps touch disjoint keys, and `fsyncDir` on a bucket dir shared with another window's in-flight `.bin` writes is safe (a dir fsync with concurrent creates just makes more entries durable). +- Done-channels broadcast *completion*, not success: a chunk build that exhausts its retries still closes its channel (the `defer`), so a dependent index build can win the race against context cancellation and start — whereupon it fails `buildTxhashIndex`'s loud `.bin` precondition check before writing any key, landing on the same abort-and-restart path as the original failure. The precondition check is load-bearing here, by design. +- A task that exhausts its retries aborts the daemon, per the [error policy](#lifecycle); restart re-resolves from durable keys, and completed work never repeats. +- **Single-process enforcement:** the meta store holds a kernel `flock` on a `LOCK` file; a second daemon opening the **same meta-store path** fails immediately, and the lock releases on any process exit (including `kill -9`). 
Because `[meta_store]` and each `[immutable_storage.*]` path are independently configurable, the meta-store lock alone cannot stop two daemons with *different* meta stores from sharing one artifact tree — the daemon therefore also takes a `flock` in each configured storage root. + +--- + +## Daemon flow + +### Startup + +Startup runs in two steps — catch up, then serve: + +1. **Catch up via backfill.** Bring on-disk coverage in line with the retention window. Each pass backfills up through the last complete chunk at the network tip, with one exclusion: when the watermark is **mid-chunk** and within one chunk of the tip, the partial resume chunk is left to ingestion — core replays its tail faster than a bulk refetch would gate serving, and a mid-chunk watermark can only have come from the live hot DB, so the data is local by construction. Every *complete* chunk is in range — including one whose hot DB holds it but whose artifacts aren't frozen yet (a boundary crash): `catchupSource`'s hot branch produces it locally, no refetch. (On a first-deployment frontfill the loop simply terminates because the range is empty.) The loop re-passes if new chunks appear at the tip while a pass is in flight. Catch-up brings **every** overlapping window's index to its desired coverage via the per-window rule — the trailing partial window included — so when it returns, every in-retention range at or below the last backfilled chunk is servable from durable artifacts. + +2. **Serve + ingest.** Open the resume chunk's hot DB, start captive core, start serving reads, run the lifecycle goroutine and the hot DB ingestion loop — whose first act, having opened its hot DB, is one lifecycle notification. 
**That first tick doubles as startup convergence**: it finishes whatever a crash left behind (every leftover is a key in a transient state — `"freezing"` attempts to delete, `"pruning"` demotions to finish) and removes downtime leftovers (hot DBs and artifacts now past the effective retention floor), concurrently with early serving. There is no startup-only cleanup action and no startup-only tick: the sweeps are key-driven, so the ordinary tick reaches everything without inspecting a single directory. + +No other preparation exists: the resume chunk's hot DB is simply reopened (steady-state restart) or created fresh, and the backfilled windows' tx hashes — the trailing window's included — are already queryable through the `.idx` files catch-up built. + +**Serve-readiness is established entirely by step 1 plus the resume chunk's hot DB.** Catch-up's postcondition covers every complete in-retention chunk from durable artifacts — boundary-crash leftovers included, produced locally through `catchupSource`'s hot branch — and the only chunk it ever skips is the *partial* resume chunk, whose data lives in the hot DB startup reopens before `serveReads()` (a mid-chunk watermark can only have come from that DB). Nothing gates serving on a cleanup pass, because crash debris and downtime leftovers are reader-invisible at *every* moment of operation — readers resolve `"frozen"` keys exclusively, and the retention check masks past-floor files — so the first tick clears them concurrently with serving rather than ahead of it. The store reaches quiescence within that first tick — typically seconds after reads open, longer when it prunes a long-downtime backlog; from then on the [invariant audits](#correctness) carry their usual meaning. (The one nicety surrendered: a store so damaged that tick ops fail aborts seconds after joining the pool rather than just before — the restart loop is identical either way.) 
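The reader-invisibility argument above comes down to a two-part predicate: the key must be `"frozen"`, and the chunk must sit at or above the retention floor. The sketch below is illustrative only. `readerVisible` is a hypothetical helper, and `chunkFirstLedger` is inferred from the stated geometry (chunk 0 spans ledgers 2..10_001); the actual reader retention contract is specified elsewhere in this doc and may carry more conditions.

```go
package main

import "fmt"

const (
	GenesisLedger   = 2
	LedgersPerChunk = 10_000
)

// chunkFirstLedger is inferred from the geometry: chunk 0 starts at ledger 2,
// chunk 1 at ledger 10_002, and so on.
func chunkFirstLedger(chunk uint32) uint32 {
	return chunk*LedgersPerChunk + GenesisLedger
}

// readerVisible reports whether a reader may resolve an artifact: only
// "frozen" keys count, and chunks below the chunk-aligned floor are masked.
func readerVisible(state string, chunk, floor uint32) bool {
	return state == "frozen" && chunkFirstLedger(chunk) >= floor
}

func main() {
	floor := chunkFirstLedger(5) // retention starts at chunk 5 (illustrative)
	fmt.Println("frozen, in retention:", readerVisible("frozen", 7, floor))
	fmt.Println("freezing debris:     ", readerVisible("freezing", 7, floor))
	fmt.Println("pruning demotion:    ", readerVisible("pruning", 7, floor))
	fmt.Println("frozen, below floor: ", readerVisible("frozen", 4, floor))
}
```

Under these assumptions, every leftover the first tick cleans up (a `"freezing"` attempt, a `"pruning"` demotion, a past-floor artifact) fails the predicate, which is why serving does not need to wait for the cleanup.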
+ +Operational note — **peak disk after long downtime**: pruning runs only in the first tick's prune stage, *after* catch-up has materialized every newly-in-retention chunk, so a downtime approaching or exceeding the retention window transiently holds up to ~2× the retention footprint (the stale window plus its replacement). Size volumes accordingly, or prune stale ranges manually before restarting after very long downtime; a disk-full during catch-up otherwise aborts before the relieving prune can run, on every retry. + +`lastCommitted` is a *mutating* local that the catch-up loop advances as backfill makes progress; it determines `resumeLedger` and the hot DB ingestion start point. It is never written to the meta store, never shared — and not even carried into the ingestion loop, which needs no progress variable at all: each synced batch *is* the progress, re-derived from durable state at the next startup. + +The retention floor itself is computed by: + +```go +const ( + GenesisLedger = 2 + LedgersPerChunk = 10_000 +) + +// effectiveRetentionFloor is the lower bound of the retention window, +// chunk-aligned: the first ledger of the lowest in-scope chunk. Combines the +// sliding retention floor (lastCompleteChunkAt(upperBound) - retentionChunks +// + 1, when retentionChunks > 0) with the fixed earliest-ledger floor. +// +// The upper-bound ledger is ingestion's progress at runtime; the catch-up +// loop passes max(sampled network tip, derived watermark). The max() guards +// a lagging bulk tip: anchored on the tip alone, the floor would regress +// below where pruning has already advanced, scheduling a spurious re-derive +// of a pruned range. When the tip leads (long downtime), the tip is simply +// the correct anchor — it places the floor where retention will sit once +// caught up, so backfill starts at the true floor instead of wastefully +// below it. On a true first start the watermark is absent and the tip alone +// anchors the floor. 
+func effectiveRetentionFloor(upperBound uint32, retentionChunks uint32, earliest uint32) uint32 { + sliding := uint32(GenesisLedger) + if retentionChunks > 0 { + slidingChunk := lastCompleteChunkAt(upperBound) - int64(retentionChunks) + 1 + sliding = chunkFirstLedger(max(slidingChunk, 0)) + } + return max(sliding, earliest) +} + +// lastCompleteChunkAt is the inverse of chunkLastLedger: the largest chunk +// whose last ledger is <= ledger. E.g., lastCompleteChunkAt(10_001) == 0 +// (chunk 0 spans ledgers 2..10_001). +func lastCompleteChunkAt(ledger uint32) int64 { + // Widen to int64 before subtracting so ledger == 0 cannot wrap in uint32. + return (int64(ledger)-1)/LedgersPerChunk - 1 +} +``` + +```go +func startStreaming(ctx context.Context, cfg Config) error { + cat := openMetaStore(cfg) + cfg.Catalog = cat // catch-up's plumbing (resolve, runBackfill) reads it from cfg + validateConfig(cfg, cat) + + retentionChunks := cfg.RetentionChunks + earliest := cat.EarliestLedger() + + // Derived, not read: highest frozen chunk end vs ready hot DBs' max + // committed seq, clamped by earliest - 1 (the frontfill floor). + lastCommitted := deriveWatermark(cat) + + // Step 1: catch up via backfill. The loop re-passes while new chunks + // appear at the tip; backfilledThrough guards against infinite re-passes + // when the tip stops moving (a fixed rangeEnd matching the previous + // iteration breaks the loop). Edge case: on a network younger than one + // chunk, rangeEnd = lastCompleteChunkAt(anchor) = -1; the + // rangeEnd < rangeStart guard catches it cleanly. + backfilledThrough := int64(-1) + for { + tip := backendNetworkTip(cfg) + anchor := max(tip, lastCommitted) // guards a lagging bulk tip, in BOTH uses below + rangeStart := chunkID(effectiveRetentionFloor(anchor, retentionChunks, earliest)) + // Anchoring rangeEnd on the watermark too matters when the bulk tip + // lags: a complete watermark chunk must fall inside the range so the + // per-window rule folds it into its index before serving.
The span + // beyond the bulk tip consists only of chunks that are already + // durable (production self-skips) or complete in a ready hot DB + // (produced locally via catchupSource's hot branch) — the bulk + // backend is never asked for them. + rangeEnd := lastCompleteChunkAt(anchor) + watermarkMidChunk := lastCommitted != chunkLastLedger(chunkID(lastCommitted)) + withinOneChunkOfTip := int64(tip)-int64(lastCommitted) < LedgersPerChunk + // ^ signed: a lagging bulk tip can sit BELOW the resume point + if withinOneChunkOfTip && watermarkMidChunk { + // The partial resume chunk is ingestion's: near the tip, core + // replays its tail faster than a bulk refetch would gate serving. + // Mid-chunk watermarks only ever come from the live hot DB, so + // the data is local by construction. + rangeEnd = chunkID(lastCommitted) - 1 + } + if rangeEnd < rangeStart || rangeEnd <= backfilledThrough { + break + } + if err := runBackfill(ctx, cfg, rangeStart, rangeEnd); err != nil { + return err + } + lastCommitted = max(lastCommitted, chunkLastLedger(rangeEnd)) + backfilledThrough = rangeEnd + } + resumeLedger := lastCommitted + 1 + + // Step 2: serve + ingest. The first tick — rung by ingestion's at-start + // notification — finishes anything a crash left half-done and prunes + // downtime leftovers, concurrently with early serving. 
+ hotDB := openHotDBForChunk(cfg, cat, chunkID(resumeLedger)) + core := startCaptiveCore(cfg, resumeLedger) + doorbell := make(chan struct{}, 1) + go lifecycleLoop(ctx, cfg, cat, doorbell) + serveReads() + return runIngestionLoop(cfg, core, hotDB, cat, doorbell) +} +``` + +After `runBackfill` returns, every chunk in the backfilled range has `lfs` and `events` frozen, every overlapping window's index is at its desired coverage, and `txhash` keys are in one of three states: frozen (window still rolling), swept (finalized window whose terminal commit this pass landed — the batch demoted them, the eager sweep removed them), or `"pruning"` leftovers from a *pre-crash* terminal commit, which the resolver correctly skipped (desired ⊆ stored) and the first tick's prune phase sweeps; partially processed chunks were retried idempotently. The lowest chunk in the backfilled range is `chunkID(effectiveRetentionFloor(max(tip, lastCommitted), …))` — the same `max()` anchor the loop uses; if this falls mid-window, that window's finalized index is built with `lo` = that chunk (its terminal index key carrying that `lo` and `hi` = the window's last chunk). Streaming startup therefore doesn't re-validate contiguity or per-chunk flag-completeness — they follow from how backfill works. + +```go +func validateConfig(cfg Config, cat Catalog) { + // chunks_per_txhash_index immutability check. + if stored, ok := cat.Get("config:chunks_per_txhash_index"); !ok { + cat.Put("config:chunks_per_txhash_index", itoa(cfg.ChunksPerTxhashIndex)) + } else if stored != itoa(cfg.ChunksPerTxhashIndex) { + fatalf("chunks_per_txhash_index changed: stored=%s, config=%d", + stored, cfg.ChunksPerTxhashIndex) + } + + // earliest_ledger: resolve, validate, store on first start, log/abort on mismatch. 
+ var desired uint32 + switch cfg.EarliestLedger { + case "genesis": + desired = GenesisLedger + case "now": + desired = chunkFirstLedger(chunkID(backendNetworkTip(cfg))) + default: + desired = atoi(cfg.EarliestLedger) + if desired != chunkFirstLedger(chunkID(desired)) { + fatalf("earliest_ledger (%d) must be chunk-aligned.", desired) + } + } + if desired > backendNetworkTip(cfg) { + fatalf("earliest_ledger (%d) is past the current tip; reject.", desired) + } + if stored, ok := cat.Get("config:earliest_ledger"); !ok { + cat.Put("config:earliest_ledger", itoa(desired)) + } else if atoi(stored) != desired { + if cfg.EarliestLedger == "now" { + logInfof("earliest_ledger='now' resolves to %d, but stored is %s; "+ + "using stored value (no-op after first start).", desired, stored) + } else { + fatalf("earliest_ledger changed: stored=%s, config=%d. Wipe the data "+ + "directory to change earliest_ledger (or use the future "+ + "set-earliest-ledger admin command).", stored, desired) + } + } +} + +func openHotDBForChunk(cfg Config, cat Catalog, chunk ChunkID) *HotDB { + // createChunkHotDB creates the instance with its column families: + // ledgers, txhash, and the events CFs (schema per the events doc). + return openHotDB(cat, hotChunkKey(chunk), hotChunkPath(chunk), createChunkHotDB) +} +``` + +### Hot DB helpers + +These functions implement the hot DB state machine. Both startup and the lifecycle loop use them. + +`openHotDB` opens a ready hot DB, recovers from a prior crash, or creates a fresh one: + +```go +// openHotDB returns an open handle to the hot DB. If the key is "ready", +// opens the existing DB. Otherwise — "transient" from a crashed create or +// discard, or absent on first use — wipes any leftover dir and creates +// fresh. The caller owns the returned handle. 
+func openHotDB(cat Catalog, hotKey, path string, create func(string) *HotDB) *HotDB { + if state, _ := cat.Get(hotKey); state == "ready" { + if !dirExists(path) { + // The key promises a DB the filesystem doesn't have — hot storage + // was lost out from under a surviving meta store (e.g. ephemeral + // NVMe died). Recreating empty would silently lose the chunk's + // ledgers, so refuse: the operator deletes the orphaned hot:chunk + // keys (surgical recovery case 4) and restarts — the derived + // watermark then lands at the last frozen boundary automatically, + // and re-ingestion fills the gap. The fatal stays (rather than + // auto-healing) because a missing dir can also mean a mount + // misconfiguration, where auto-wiping state would be wrong. + fatalf("%s is \"ready\" but %s is missing — hot storage lost; "+ + "run surgical recovery (case 4).", hotKey, path) + } + return openExistingRocksDB(path) + } + // "transient" or absent — wipe any leftover dir and recreate. + deleteDirIfExists(path) + cat.Put(hotKey, "transient") + db := create(path) + fsyncDir(path) // dir + dirent durable BEFORE "ready" — else a power + fsyncParentDir(path) // crash fabricates the ready-without-dir fatal above + cat.Put(hotKey, "ready") + return db +} +``` + +`discardHotDBForChunk` retires a chunk's hot DB once every cold artifact derived from the chunk is durable (or the chunk has fallen past retention): + +```go +func discardHotDBForChunk(chunk ChunkID, cat Catalog) { + if !cat.Has(hotChunkKey(chunk)) { + return + } + cat.Put(hotChunkKey(chunk), "transient") + deleteDirIfExists(hotChunkPath(chunk)) + fsyncParentDir(hotChunkPath(chunk)) + cat.Delete(hotChunkKey(chunk)) +} +``` + +`HotLedgers` is the hot-DB reader `catchupSource` returns when its hot branch wins — a read-only view of the `ledgers` CF, opened and completeness-checked by `catchupSource` itself (the loss-vs-staleness rule in rule 2) before the wrapper is handed out: + +```go +type HotLedgers struct { + chunk ChunkID + 
store *RocksDB // opened (and verified complete) by catchupSource +} + +func (h *HotLedgers) GetLedger(seq uint32) LedgerCloseMeta { + return decompressLCM(h.store.GetCF("ledgers", beUint32(seq))) +} +``` + +### Hot DB Ingestion + +```go +func runIngestionLoop(cfg Config, core *CaptiveCore, hotDB *HotDB, cat Catalog, + doorbell chan struct{}) error { + + notify := func() { // payload-free doorbell: non-blocking send, coalescing + select { + case doorbell <- struct{}{}: + default: + } + } + notify() // first act: the hot-chunk set just changed (the resume DB was opened) + + for lcm := range core.StreamLedgers() { + // One atomic, synced WriteBatch across all CFs — a ledger is either + // fully in the hot DB or absent. The batch IS the durability + // boundary; the loop keeps no progress variable at all — progress is + // re-derived from durable state at the next startup. + batch := hotDB.NewBatch() + putLedger(batch, lcm) // ledgers CF + putTxHashes(batch, lcm) // txhash CF + putEvents(batch, lcm) // events CFs + batch.Commit( /*sync=*/ true) + + seq := lcm.LedgerSeq() + if seq == chunkLastLedger(chunkID(seq)) { // chunk boundary + // Close the write handle BEFORE creating the next chunk's hot + // key — the moment that key exists, a tick's derivation + // classifies this chunk as complete and may freeze and discard + // this hot DB, and no writer may hold it then. + hotDB.Close() + hotDB = openHotDBForChunk(cfg, cat, chunkID(seq)+1) + notify() + } + } + return nil +} +``` + +A batch error causes the loop to retry the entire ledger (the batch is all-or-nothing, so a retry can't double-apply). On repeated failure the daemon aborts; the next startup's derived watermark equals exactly what the last synced batch committed — there is no second durable write that could disagree with it — and ingestion resumes from the next seq. 
The close-before-open order at the boundary is load-bearing: the next chunk's hot key is what makes this chunk *visibly complete* to the lifecycle's derivation, so the write handle must already be released when that key appears — otherwise a tick still in flight from the *previous* notification could rmdir a dir whose writer is live. Readers hold their own independent read-only handles. + +The doorbell carries no payload, so its delivery semantics can be maximally sloppy: a non-blocking send on a size-1 buffered channel, coalescing freely. Nothing is lost because the notification carries no information to lose — eligibility derives entirely from durable state, and a tick triggered by one notification processes everything the catalog shows, however many boundaries contributed to it. The doorbell only answers "when should the lifecycle look", never "what should it see". + +### Lifecycle + +The lifecycle goroutine runs one **tick** per notification, in three stages: **plan-and-execute** (the same `resolve` + `executePlan` catch-up uses, from the retention floor up to `completeThrough` — this is where a just-closed chunk freezes, from its hot DB via `catchupSource`'s hot branch, and where the current window's index folds it in), then the **discard** scan (retire hot DBs the cold artifacts now fully serve), then the **prune** scan (sweep demoted and past-retention files). The retention floor plays two roles with *opposite safe directions*, and the design keeps them separate. As a **retention boundary** (the prune scan, the reader gate) it errs permissive: anchored on `completeThrough`, a floor that sits a little low keeps an extra chunk briefly, or admits a read that at worst lands on already-pruned data and returns not-found via the reader's missing-data-file rule — harmless either way. As a **production boundary** it would err dangerous: planning a build below existing storage means demanding chunks from the bulk source that nobody validated it can produce. 
So production below storage never consults the floor — the tick's plan range starts at the lowest chunk already materialized, and extending the *bottom* of storage (which is what retention widening means) is exclusively catch-up's job, the one path that runs `validateBackendCovers` before demanding anything. Ordering lives in two places, each natural to its half: freeze-before-build is a *plan dependency* (the window's `IndexBuild` waits on its in-coverage chunk builds' done-channels), and build-before-discard / demote-before-sweep is the *stage sequence* — the scans run after `executePlan` returns, so they see every commit the plan landed. Correctness never depends on any of it: every decision derives from durable keys, so work whose enabler hasn't landed is simply not scheduled, and the next tick picks it up. + +The one input the tick needs beyond the keys themselves is *how far ingestion has durably gotten* — which chunks are complete, and where the sliding retention floor anchors. That is `deriveCompleteThrough`, defined with the [derived-progress machinery](#meta-store-keys) in the data model; the tick derives it once, at tick start: + +```go +func runLifecycleTick(ctx context.Context, cfg Config, cat Catalog) { + // One derivation per tick — all stages see the same snapshot, so a + // boundary committing mid-tick can't make one stage's view contradict + // another's; the new chunk is simply next tick's work. + through := deriveCompleteThrough(cat) + floor := effectiveRetentionFloor(through, cfg.RetentionChunks, cat.EarliestLedger()) + start := chunkID(floor) + if low, ok := lowestMaterializedChunk(cat); ok && low > start { + // floor is a retention boundary (pruning, read gating), where erring + // low is harmless. As a PRODUCTION boundary it would err dangerous: + // a below-storage build demands chunks from a bulk source nobody + // validated. 
So the tick's plan range starts at existing storage; + // extending the bottom is catch-up's job, behind validateBackendCovers. + start = low + } + + if err := executePlan(ctx, resolve(cfg, start, lastCompleteChunkAt(through)), cfg); err != nil { + fatalf("lifecycle tick: %v", err) // error policy: retries exhausted ⇒ abort; + // startup is the recovery path + } + for _, op := range eligibleDiscardOps(cfg, cat, through) { + op() + } + for _, op := range eligiblePruneOps(cfg, cat, through) { + op() + } + // Assertable postcondition: re-running resolve and both scans against + // this same `through` snapshot yields nothing — a tick finishes + // everything its snapshot showed. (A fresh derivation may legitimately + // see a boundary that landed mid-tick; that is the next tick's work, + // not a violation.) +} + +// lowestMaterializedChunk is one more derivation over the same keys: the +// lowest chunk holding any chunk:* artifact key or hot:chunk key; ok=false +// on an empty catalog (first frontfill tick — resolve's inverted-range +// guard makes that tick a no-op anyway). +func lowestMaterializedChunk(cat Catalog) (ChunkID, bool) + +func lifecycleLoop(ctx context.Context, cfg Config, cat Catalog, doorbell <-chan struct{}) { + for range doorbell { + runLifecycleTick(ctx, cfg, cat) + } +} +``` + +With this, the tick is a pure function of the catalog: the two goroutines share no state at all, and any process holding the meta store could run a tick and reach the same decisions. One narrow consequence of the positional term: a crash *between* closing chunk C's handle and creating chunk C+1's key leaves C as the highest hot chunk, so the derivation conservatively treats it as live until ingestion opens the resume chunk's DB. That is a latency wart, not a correctness one — C's hot DB keeps serving it — and it closes at the first tick, which ingestion rings the moment that DB is open (the reason ingestion notifies at start, not just at boundaries). 
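The plan-range clamp described above can be sketched in a few lines. This is a minimal illustration, assuming the catalog can be listed as plain key names shaped like `chunk:{chunk}:{kind}` and `hot:chunk:{chunk}` with zero-padded decimal chunk IDs. The slice-of-strings catalog, the `lfs` kind label, and the `planStart` helper are hypothetical stand-ins, not the daemon's actual types.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lowestMaterializedChunk returns the lowest chunk ID named by any chunk:*
// artifact key or hot:chunk:* key; ok is false when no such key exists.
// Assumption: keys embed the chunk ID as zero-padded decimal.
func lowestMaterializedChunk(keys []string) (low uint32, ok bool) {
	for _, k := range keys {
		var idPart string
		switch {
		case strings.HasPrefix(k, "hot:chunk:"):
			idPart = strings.TrimPrefix(k, "hot:chunk:")
		case strings.HasPrefix(k, "chunk:"):
			idPart, _, _ = strings.Cut(strings.TrimPrefix(k, "chunk:"), ":")
		default:
			continue // index:* and any other key family do not count
		}
		id, err := strconv.ParseUint(idPart, 10, 32)
		if err != nil {
			continue // malformed key: ignored in this sketch
		}
		if !ok || uint32(id) < low {
			low, ok = uint32(id), true
		}
	}
	return low, ok
}

// planStart clamps the tick's plan range to existing storage: the floor is
// used as the start only when storage already reaches down to it.
func planStart(floorChunk uint32, keys []string) uint32 {
	if low, ok := lowestMaterializedChunk(keys); ok && low > floorChunk {
		return low
	}
	return floorChunk
}

func main() {
	keys := []string{
		"chunk:00005100:lfs",
		"hot:chunk:00005351",
		"index:00000005:00005100:00005350",
	}
	fmt.Println(planStart(5000, keys)) // floor below storage: plan starts at 5100
	fmt.Println(planStart(5200, keys)) // floor inside storage: plan starts at 5200
}
```

With a floor below existing storage the sketch starts the plan at the lowest materialized chunk, which leaves extending the bottom to catch-up, as the text requires.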
+ +The goroutine is event-driven, not polled. Notifications arrive from exactly one source — ingestion's hot-chunk-set changes: each boundary, plus the one at ingestion start, whose tick doubles as startup convergence. Between notifications the goroutine is idle — and idle means *quiescent*: a re-scan would produce no ops, so the [invariant audits](#correctness) are meaningful at any moment between ticks. + +**Error policy.** A failing op is retried with backoff a bounded number of times within the tick; on persistent failure the daemon aborts — the same policy as the ingestion loop. Aborting is safe because startup *is* the recovery path: catch-up plus the first tick re-derive or finish whatever the failure interrupted. No op failure is ever deferred to the next boundary (~14 h away) silently. + +#### Production (plan-and-execute) + +The tick's first stage is catch-up's machinery verbatim: `resolve` diffs `[floor, completeThrough]` against the catalog and `executePlan` runs the result. In steady state the plan is tiny — one `ChunkBuild` for the chunk that just closed (its artifacts produced from its hot DB, which `catchupSource`'s hot branch selects) and one `IndexBuild` folding it into the current window — and at quiescence the plan is empty. The hot DB is *not* touched by production: it keeps serving the chunk's tx lookups until the index covers it, and only the discard stage retires it. Nothing but a terminal `IndexBuild`'s commit ever finalizes a window. + +#### Discard + +The discard scan walks `hot:chunk:*` keys. Per chunk: past retention → discard; complete, nothing pending, and the index covers it (cold artifacts fully serve it) → discard; otherwise (live, or frozen and awaiting coverage) → leave alone. `discardHotDBForChunk`'s coverage-gated branch is what retires hot DBs in steady state, and re-deriving it from durable keys makes it self-healing across a crash between build and discard. 
A past-retention discard leaves the chunk's artifact files to the prune stage on the same tick — they carry their own keys. + +#### Prune + +The prune scan is the system's only file-deleter, driven entirely by keys — one stage, both key families: + +- **`index:*` keys**: any key in a transient state is swept regardless of window — `"freezing"` (a crashed build attempt) means delete file and key, never salvage; `"pruning"` (a coverage demoted by a later build's commit batch, or by retention) means finish the removal. A `"frozen"` index key is swept only when its window has fallen wholly past the retention floor. +- **`chunk:*` keys**: chunks wholly past retention are swept whole (all artifacts, any state). Within retention, `chunk:c:txhash` keys reading `"pruning"` — demoted by their window's terminal commit batch — are swept batched. One more in-retention branch: a `"frozen"` *or* `"freezing"` `chunk:c:txhash` key inside a window whose frozen index key is terminal (re-derived — or left mid-write — by a widening catch-up that crashed before its rebuild, then abandoned when retention narrowed back) is provably redundant — the final `.idx` covers the chunk, and the resolver will never schedule re-materialization for a covered window — so it is swept here; this branch is what makes INV-2's no-leftover-txhash-keys clause self-healing rather than merely auditable. + +Every sweep, both families, runs the same mechanic: **demote to `"pruning"` if the key is still `"frozen"`** (never unlink under a frozen key), then unlink → `fsyncDir` → key delete, batched per family. The two prune walks never interact with each other — index sweeps touch only index keys, chunk sweeps only chunk keys — and their only cross-family *read* (is the window's frozen index key terminal?) concerns a key no sweep modifies. + +#### Eligibility + +Each `eligible*` function scans the meta store and returns a list of zero-arg callables — each one a closure over the op to run and its arguments. 
The lifecycle loop just calls them in order. + +```go +func eligibleDiscardOps(cfg Config, cat Catalog, through uint32) []func() { + floor := effectiveRetentionFloor(through, cfg.RetentionChunks, cat.EarliestLedger()) + var ops []func() + for _, chunk := range hotChunkKeys(cat) { + switch { + case chunkLastLedger(chunk) < floor: // past retention OR below earliest_ledger + ops = append(ops, func() { discardHotDBForChunk(chunk, cat) }) + case chunkLastLedger(chunk) <= through && + pendingArtifacts(chunk, cfg, cat).Empty() && + indexCovers(chunk, cfg, cat): // cold artifacts fully serve it + ops = append(ops, func() { discardHotDBForChunk(chunk, cat) }) + // else: live, or frozen and awaiting coverage — leave alone. + } + } + return ops +} + +// pendingArtifacts lists which processChunk outputs this chunk still needs. +// The per-chunk counterpart of catch-up's per-window rule: txhash/.bin is +// exempt when the window's index already covers the chunk — after +// finalization the chunk:c:txhash keys are legitimately demoted ("pruning") +// or swept away, and regenerating the .bin would orphan it. +func pendingArtifacts(chunk ChunkID, cfg Config, cat Catalog) ArtifactSet { + var need ArtifactSet + for _, kind := range []Kind{LFS, Events} { + if cat.State(chunk, kind) != Frozen { + need = need.Add(kind) + } + } + if cat.State(chunk, TxHashBin) != Frozen && !indexCovers(chunk, cfg, cat) { + need = need.Add(TxHashBin) + } + return need +} + +// indexCovers reports whether the durable .idx for chunk's window already +// hashes that chunk. 
+func indexCovers(chunk ChunkID, cfg Config, cat Catalog) bool { + fk := frozenCoverage(cat, indexID(chunk)) // the unique "frozen" index key, or none + return fk != nil && fk.Lo <= chunk && chunk <= fk.Hi +} + +func eligiblePruneOps(cfg Config, cat Catalog, through uint32) []func() { + floor := effectiveRetentionFloor(through, cfg.RetentionChunks, cat.EarliestLedger()) + windowFloor := WindowID(-1) + chunkFloor := ChunkID(-1) + if floor != GenesisLedger { + windowFloor = indexID(chunkID(floor)) - 1 + chunkFloor = lastCompleteChunkAt(floor - 1) + } + var ops []func() + + for _, key := range indexKeys(cat) { // index family + switch { + case key.State == Freezing || key.State == Pruning: + // Transient debris from any window — an abandoned attempt + // ("freezing": delete, never salvage — a retried coverage was + // re-marked and frozen before this scan ran) or an unfinished + // demotion ("pruning"). Safe to run only because no build is in + // flight when this scan runs: the prune stage follows + // executePlan's return within the tick, and catch-up finishes + // before the lifecycle goroutine starts. + ops = append(ops, func() { sweepIndexKey(key, cat) }) + case key.Window <= windowFloor: + // A frozen index key wholly below the floor; the sweep demotes + // it first — never unlink under a "frozen" key. + ops = append(ops, func() { sweepIndexKey(key, cat) }) + } + } + + var refs []ArtifactRef // chunk family, swept in one batch + for _, ref := range chunkArtifactKeys(cat) { // (chunk, kind) per key + switch { + case ref.Chunk <= chunkFloor: // wholly past retention: any state goes + refs = append(refs, ref) + case cat.State(ref.Chunk, ref.Kind) == Pruning: + // In-retention .bin demoted by its window's terminal commit batch. 
+ refs = append(refs, ref) + case ref.Kind == TxHashBin: // "frozen" OR "freezing" inside a finalized window + if fk := frozenCoverage(cat, indexID(ref.Chunk)); fk != nil && fk.Hi == windowLastChunk(indexID(ref.Chunk)) { + // Redundant input: re-derived (or left mid-write) by a + // widening catch-up that crashed before its terminal rebuild, + // then abandoned. The terminal .idx provably covers the chunk + // and the resolver never re-materializes a covered window. + refs = append(refs, ref) + } + } + } + if len(refs) > 0 { + ops = append(ops, func() { sweepChunkArtifacts(refs, cfg, cat) }) + } + return ops +} +``` + +Hot DBs that outlived their retention window because of long downtime are removed by the discard stage; their files, if any, carry their flag keys and are picked up by the prune stage in the same tick. + +#### Op bodies + +```go +func sweepChunkArtifacts(refs []ArtifactRef, cfg Config, cat Catalog) { + batch := cat.NewBatch() // demote first — never unlink under a "frozen" key + for _, ref := range refs { + if cat.State(ref.Chunk, ref.Kind) != Pruning { + batch.Put(chunkKey(ref.Chunk, ref.Kind), "pruning") + } + } + batch.Commit() + + var paths []string // unlink (idempotent on already-gone paths) + for _, ref := range refs { + deleteArtifactFiles(ref.Chunk, ref.Kind) + paths = append(paths, artifactPaths(ref.Chunk, ref.Kind)...) + } + fsyncParentDirs(paths) // unlinks durable BEFORE the keys go + + batch = cat.NewBatch() + for _, ref := range refs { + batch.Delete(chunkKey(ref.Chunk, ref.Kind)) + } + batch.Commit() +} + +func sweepIndexKey(key IndexKey, cat Catalog) { + if key.State == Frozen { + cat.Put(key, "pruning") // never unlink under a "frozen" key — a crash + // mid-sweep must not leave a frozen key fileless + } + // "freezing" (crashed attempt — never salvage) and "pruning" (superseded + // or retention-demoted) take the same path from here; the key outlives + // the durable unlink, so a crash anywhere re-runs the sweep. 
+ deleteFileIfExists(indexFilePath(key)) // filename derived from the key name + fsyncDir(indexWindowDir(key)) + cat.Delete(key) + rmdirIfEmpty(indexWindowDir(key)) // best-effort tidiness; an empty dir is + // not an artifact +} +``` + +The discard stage has no separate op body — `discardHotDBForChunk` is called directly from the eligibility closures above. These two sweeps are the *entire* deletion surface of the system: one body per key family, identical internal shape (demote if frozen → unlink → `fsyncDir` → key delete). + +The prune walk's two families are independent of each other and of discard: a chunk swept while its containing window's `.idx` is still around could leave a `getTransaction` query resolving to a missing `.pack`, but the [reader retention contract](#reader-retention-contract) handles that — past-retention seqs return not-found regardless. Discard touches only hot DBs, which the prune walk's flag-key iteration can't see. + +### Concurrency model + +Two writers; readers only read. The ingestion loop is one goroutine; the lifecycle is one goroutine whose tick's plan stage fans work out to the executor's bounded worker pool — every worker operating strictly below the live chunk, so the pool inherits the lifecycle's side of the partition. Their domains partition at the live chunk: + +- **The ingestion loop owns the live chunk** — the highest chunk with a `hot:chunk:*` key. It is the only writer of that chunk's hot DB and the creator of each chunk's `hot:chunk:{chunk}` key (via `openHotDBForChunk` at the boundary). +- **The lifecycle goroutine owns everything below the live chunk** — handed-off hot DBs (freeze + discard), all `chunk:*` and `index:*` artifact keys, and the deletion side of `hot:chunk:*` keys. 
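As a small illustration of the partition in the two bullets above, the following sketch classifies a chunk by comparing it with the highest chunk that holds a hot key. The `owner` function and its string labels are hypothetical, introduced only for this example; the set of hot chunks is passed in as a plain slice rather than read from the meta store.

```go
package main

import "fmt"

// liveChunk returns the highest chunk holding a hot:chunk:* key, which the
// text defines as the live chunk. ok is false when there is no hot key.
func liveChunk(hotChunks []uint32) (live uint32, ok bool) {
	for _, c := range hotChunks {
		if !ok || c > live {
			live, ok = c, true
		}
	}
	return live, ok
}

// owner reports which goroutine may write a chunk's state in this sketch.
func owner(chunk uint32, hotChunks []uint32) string {
	live, ok := liveChunk(hotChunks)
	switch {
	case !ok:
		return "undetermined" // no hot key: this sketch does not model that state
	case chunk == live:
		return "ingestion"
	case chunk < live:
		return "lifecycle"
	default:
		return "none" // above the live chunk: not opened yet
	}
}

func main() {
	hot := []uint32{5349, 5350, 5351}
	fmt.Println(owner(5351, hot)) // ingestion
	fmt.Println(owner(5350, hot)) // lifecycle
}
```

In this model, adding 5351 to the hot set is the only thing that moves 5350 from "ingestion" to "lifecycle", which mirrors how creating the next chunk's hot key moves the partition.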
+ +**The two goroutines share no state.** Their only connection is the payload-free doorbell, and the partition itself is encoded in the catalog: the lifecycle's derivation treats the highest hot key as the live chunk and touches only what lies below it. The handoff fence is the boundary's write order — the ingestion loop closes its write handle *before* creating the next chunk's hot key. Creating that key is the act that moves the partition: the instant it exists, the closed chunk lies below the live chunk and any lifecycle scan (including one already in flight from the previous notification) may freeze and discard it — by which point no writer holds it. The two goroutines never write the same meta-store key, and never touch the same per-chunk hot RocksDB instance; both do write the meta store concurrently — on disjoint keys, relying on RocksDB's thread safety for the instance itself. The derivation is monotonic within the run (hot keys and frozen keys only advance), so a tick racing a boundary only under-approximates eligibility — work deferred to the next tick, never incorrect work. Readers hold their own read-only handles and resolve files through meta-store keys, so writer-side activity never races them. (The serving side will also need a notion of current progress — the [reader retention contract](#reader-retention-contract) bounds every read by the retention window — but how readers obtain it is the query-routing design's concern, not this doc's.) 
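The doorbell's coalescing behaviour can be demonstrated in isolation. The sketch below assumes nothing beyond the pattern the text describes (a non-blocking send on a size-1 buffered channel); the `ring` and `pendingTicks` helper names are hypothetical.

```go
package main

import "fmt"

// ring performs the payload-free, non-blocking send. It reports whether the
// notification was queued (true) or coalesced into one already pending (false).
func ring(doorbell chan<- struct{}) bool {
	select {
	case doorbell <- struct{}{}:
		return true
	default:
		return false
	}
}

// pendingTicks drains the doorbell without blocking and counts how many
// ticks the queued notifications would trigger.
func pendingTicks(doorbell <-chan struct{}) int {
	n := 0
	for {
		select {
		case <-doorbell:
			n++
		default:
			return n
		}
	}
}

func main() {
	doorbell := make(chan struct{}, 1)
	for i := 0; i < 5; i++ {
		ring(doorbell) // five boundaries arrive while the lifecycle is busy
	}
	fmt.Println(pendingTicks(doorbell)) // prints 1: one tick covers all five
}
```

Five rings collapse into a single pending notification. That is harmless only because the tick re-derives everything from durable state, so the count of notifications carries no meaning.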
+ +### One boundary, end to end + +Ledger 53,510,001 closes chunk 5350 (window 5, floor at chunk 5100, frozen index covering chunks 5100–5349): + +``` +ingestion batch for seq 53_510_001 commits (one fsync) + hotDB.Close() ← chunk 5350 handed off + open chunk 5351's hot DB (hot:chunk:00005351 = "ready") + ← chunk 5350 now visibly complete: + it sits below the highest hot key + notify ──────────────────────────────────────────────────┐ +lifecycle deriveCompleteThrough → 53_510_001 (positional term) ▼ + plan-and-execute: + resolve → Plan{ChunkBuild 5350, IndexBuild w5 [5100,5350]} + ChunkBuild 5350: catchupSource picks the hot DB (ready, + complete) → .pack, events segment, .bin all "frozen" + (the hot DB itself stays) + IndexBuild w5 (waited on 5350's done-channel): + put index:00000005:00005100:00005350 = "freezing" + merge .bin[5100..5350] → write 00005100-00005350.idx + → fsync → commit batch {[5100,5350] → "frozen", + [5100,5349] → "pruning"} + → eager sweep: unlink 00005100-00005349.idx → fsyncDir + → delete key + discard stage: index covers 5350 → discard chunk 5350's hot DB + prune stage: nothing left (the eager sweep already ran; floor + pinned at 5100 by earliest_ledger, so it doesn't + slide this tick) +reads tx in 5350: hot DB's txhash CF until the discard, .idx after + tx in 5351: hot DB of the new live chunk +``` + +Every arrow is the one write protocol or its exit sweep; at the end of the tick a re-plan and re-scan find nothing to do. + +--- + +## Reader retention contract + +A read for any seq below `effectiveRetentionFloor` returns *not found*, regardless of whether the underlying file still exists on disk. This is what lets pruning remove chunks the moment they pass retention, without coordinating with the index lifecycle: a stale `.idx` may resolve a tx-hash to a `.pack` that's been deleted, but the retention check at the top of the reader short-circuits the lookup before any file access. 
From the caller's perspective, retention is the single source of truth for "is this data available?" + +For tx-hash lookups specifically, the reader walks two tiers: + +| Chunk state | Served from | +|---|---| +| at or below the frozen key's `hi` | the `.idx` named by the window's unique `"frozen"` key (filename derived from the key name) | +| above `hi` (live, or frozen and awaiting coverage) | the `txhash` CF of the chunk's hot DB | + +The transition is gap-free by write ordering: the hot DB is discarded only after the durable `.idx` covers the chunk, and the rebuild commits atomically (one batch that promotes a fully-written coverage and demotes its predecessor). Two `ENOENT` sites need explicit rules — stated here because INV-1 depends on them, argued in full in [the transactions design](./gettransaction-full-history-design.md) (§8.4): + +- **Index-file `ENOENT`** (the reader resolved the frozen key, then lost the open race to a sweep): **re-resolve and retry**, never surface not-found directly. After a *supersession* demotion the retry always finds a frozen key — the commit that demoted the old coverage promoted its replacement in the same batch; after a *retention* demotion it finds none, and that not-found is legitimate — only retention pruning ever empties a window, so the queried seq is by then below the floor anyway. +- **Data-file `ENOENT`** (a live `.idx` resolved the hash into a `.pack` pruning has removed — a floor-straddling window's static `lo` keeps covering pruned chunks): **not-found directly**, no re-resolve. Artifacts are write-once and deletion is unlink-only, so a missing data file can only mean the chunk was pruned — never that wrong bytes could be served. Ordinarily the top-of-reader retention check short-circuits these reads before any file access; this rule is what keeps them fail-soft in the one state where it doesn't (INV-1's hot-volume-loss exception). 
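The two `ENOENT` rules, together with the top-of-reader retention check, can be sketched against an in-memory stand-in for the filesystem. Everything named here is hypothetical: the `fakeFS` type, the `reader` struct, the bounded `maxRetries`, and the simplification that an `.idx` file's "contents" is just the path of the `.pack` it resolves to. The real lookup goes through the streamhash index and is described in the transactions design.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

var ErrNotFound = errors.New("not found")

// fakeFS maps path to contents; a missing path reads as ENOENT.
type fakeFS map[string]string

func (f fakeFS) read(path string) (string, error) {
	v, ok := f[path]
	if !ok {
		return "", &fs.PathError{Op: "open", Path: path, Err: fs.ErrNotExist}
	}
	return v, nil
}

type reader struct {
	floor         uint32                // effective retention floor (ledger seq)
	files         fakeFS                // stand-in for the cold-artifact volume
	resolveFrozen func() (string, bool) // path named by the window's "frozen" key
	maxRetries    int                   // assumption: retries are bounded
}

func (r reader) getTx(seq uint32) (string, error) {
	if seq < r.floor {
		return "", ErrNotFound // retention check precedes any file access
	}
	for attempt := 0; attempt <= r.maxRetries; attempt++ {
		idx, ok := r.resolveFrozen()
		if !ok {
			return "", ErrNotFound // no frozen key: retention emptied the window
		}
		packPath, err := r.files.read(idx)
		if errors.Is(err, fs.ErrNotExist) {
			continue // index-file ENOENT: re-resolve and retry
		}
		if err != nil {
			return "", err
		}
		data, err := r.files.read(packPath)
		if errors.Is(err, fs.ErrNotExist) {
			return "", ErrNotFound // data-file ENOENT: the chunk was pruned
		}
		return data, err
	}
	return "", errors.New("index re-resolve retries exhausted")
}

func main() {
	calls := 0
	r := reader{
		floor: 100,
		files: fakeFS{"new.idx": "5350.pack", "5350.pack": "tx-bytes"},
		resolveFrozen: func() (string, bool) {
			calls++
			if calls == 1 {
				return "old.idx", true // superseded and swept before the open
			}
			return "new.idx", true
		},
		maxRetries: 3,
	}
	fmt.Println(r.getTx(150))
}
```

The first resolve returns a coverage that has already been swept, so the open fails and the loop re-resolves. The second resolve finds the replacement that the same commit batch promoted.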
+ +How the reader dispatches between hot DBs and frozen files for in-retention queries — and how it stays consistent across rebuild/freeze/discard transitions — is the query-routing design, out of scope for this doc. + +--- + +## Correctness + +This section states what the streaming workflow guarantees, the assumptions it relies on, and the operator actions and crash timings the design covers. + +### Invariants + +The **retention window** is `[effectiveRetentionFloor, last_committed_ledger]`. The floor serves *retention* consumers only — pruning and the reader gate, where erring low is safe; it is never a production boundary (the plan ranges that produce data start at existing storage in the tick, and at a validated floor in catch-up). A future floor consumer picks its side by the err-direction test: if that consumer erring low would be dangerous, it is a production consumer and belongs behind catch-up's validation. **Quiescence** means the tick's plan is empty and both scans produce empty op lists. + +**INV-1 (read correctness).** Any data request whose ledger scope falls entirely within the retention window returns correct results — content matches what a conformant LedgerBackend would produce, no partial state is visible, no in-retention range is unreachable. One transient exception mirrors INV-2's: after hot-volume loss, the floor (anchored on `completeThrough`) regresses with the lost completeness, so for the minutes until re-ingestion re-advances it, the window's *bottom* admits a few chunks that were already pruned under the pre-loss floor — those reads fail soft via the reader's missing-data-file rule (not-found, never wrong data: artifacts are write-once and pruning only unlinks) and the gap closes as the floor re-advances. 
+ +**INV-2 (single canonical state).** The meta-store records one home for each data range: +- **at most one `"frozen"` index key per window — at all times**, quiescent or not (the commit batch promotes and demotes in one write; this is what makes "the window's index" well-defined for readers); +- at quiescence, no artifact key anywhere is `"freezing"` or `"pruning"` — index transients are swept by the tick that observes them; per-chunk `"freezing"` keys are repaired by re-materialization (the plan stage, for chunks within `[floor, completeThrough]`, from whichever source `catchupSource` selects) and `"pruning"` keys are finished by the sweeps. One reachable exception: after hot-volume loss combined with a lagging bulk-backend tip, a partially-frozen chunk *above* the derived watermark can hold `"freezing"` keys at served quiescence — it lies outside every plan range (above `completeThrough`), and its ledgers exist nowhere any source can reach — until re-ingestion replays the chunk minutes later; it sits outside the retention window throughout, so no read can observe it; +- hot DB keys add one tolerated in-flight transient: `"transient"` brackets a directory operation in progress (the boundary's `openHotDBForChunk`, startup's resume-chunk open, a discard mid-op) and can be observed while the lifecycle sits idle between ticks; a crash-left bracket is finished by the next `openHotDB` or discard scan; +- at quiescence, no `hot:chunk:c` key for a chunk `c` whose artifacts are all durable *and* whose window's index covers `c` (the chunk is fully served by cold artifacts, so the hot DB must be gone); +- at quiescence, no `chunk:c:txhash` key for a chunk `c` in a window whose frozen index key is terminal (the terminal commit demoted them; the sweep removed them; the prune scan's redundant-input branch demotes any that a crashed widening re-froze or left mid-freeze). 
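A key-walk check for the index-key clauses of INV-2 might look like the sketch below. It assumes index keys have already been parsed into a small struct; the `IndexKey` shape here and the `auditIndexKeys` name are hypothetical, and the chunk-key and hot-key clauses are left out for brevity. Note that the first clause holds at all times, while the transient-state clause is only meaningful when the walk runs at quiescence.

```go
package main

import "fmt"

// IndexKey is a parsed index:{window}:{lo}:{hi} key plus its stored state.
type IndexKey struct {
	Window, Lo, Hi uint32
	State          string // "freezing", "frozen", or "pruning"
}

// auditIndexKeys returns one message per violation, in input order.
func auditIndexKeys(keys []IndexKey) []string {
	var violations []string
	frozenPerWindow := map[uint32]int{}
	for _, k := range keys {
		switch k.State {
		case "frozen":
			frozenPerWindow[k.Window]++
			if frozenPerWindow[k.Window] == 2 { // report each window once
				violations = append(violations,
					fmt.Sprintf("window %d: more than one \"frozen\" index key", k.Window))
			}
		case "freezing", "pruning":
			violations = append(violations,
				fmt.Sprintf("window %d [%d,%d]: %q at quiescence", k.Window, k.Lo, k.Hi, k.State))
		}
	}
	return violations
}

func main() {
	keys := []IndexKey{
		{Window: 5, Lo: 5100, Hi: 5350, State: "frozen"},
		{Window: 5, Lo: 5100, Hi: 5349, State: "pruning"},
	}
	for _, v := range auditIndexKeys(keys) {
		fmt.Println(v)
	}
}
```

The example input is the state right after a commit batch and before the sweep, so the audit flags the demoted predecessor. The same walk after the tick completes would return nothing.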
+ +**INV-3 (disk matches meta-store).** At quiescence, the set of artifact files and hot DB directories on disk equals exactly the set the meta-store specifies. Every key in a final state names exactly one expected path; the disk holds those paths and no others — no orphan files, no dangling keys, no duplicate artifacts. By INV-2 every artifact key at served quiescence *is* in a final state — the hot-key `"transient"` bracket around an in-flight directory operation is the one tolerated exception — so the correspondence is exact, with no tolerance carve-outs for artifacts: a non-key-named file in an index window dir is a real bug, not mid-tick debris. + +**INV-4 (retention bound).** At quiescence, no file or meta-store key maps to a ledger range strictly below the effective retention floor. + +Each invariant has a distinct audit. INV-1 you check by issuing reads or by re-deriving artifacts and byte-comparing. INV-2 you check by walking meta-store keys and cross-checking forbidden co-existence. INV-3 you check by walking the filesystem against the meta-store. INV-4 you check by walking meta-store keys against the floor. None of the invariants reference the phase scans that maintain them — so a bug in any scan shows up as a real invariant violation, not as something the buggy code silently considers acceptable. Quiescence between ticks makes these walks meaningful on a live daemon, so an `audit` admin command can implement them directly (with an optional deep mode that re-derives sampled artifacts via a conformant LedgerBackend and byte-compares, for INV-1). 
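The INV-3 walk reduces to a set comparison between the paths the keys name and the paths found on disk. The sketch below shows that comparison only, assuming both sides have already been collected into string slices; the function name and the example `.pack` file name are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// diffDiskAgainstKeys compares the paths implied by final-state keys with the
// paths actually present. orphans are files no key names; dangling are paths
// a key names that are missing from disk. INV-3 holds when both are empty.
func diffDiskAgainstKeys(expected, onDisk []string) (orphans, dangling []string) {
	want := make(map[string]bool, len(expected))
	for _, p := range expected {
		want[p] = true
	}
	have := make(map[string]bool, len(onDisk))
	for _, p := range onDisk {
		have[p] = true
		if !want[p] {
			orphans = append(orphans, p)
		}
	}
	for _, p := range expected {
		if !have[p] {
			dangling = append(dangling, p)
		}
	}
	sort.Strings(orphans)
	sort.Strings(dangling)
	return orphans, dangling
}

func main() {
	expected := []string{"00005100-00005350.idx", "00005350.pack"}
	onDisk := []string{"00005100-00005350.idx", "00005100-00005349.idx"}
	orphans, dangling := diffDiskAgainstKeys(expected, onDisk)
	fmt.Println("orphans:", orphans)
	fmt.Println("dangling:", dangling)
}
```

An audit built this way reports both directions of mismatch separately, which helps distinguish a missed sweep (orphan file) from a lost file under a live key (dangling key).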
+ +### Convergence + +From any storage state — partial-completion crashes, the state left after an operator action, the state left after surgical recovery — **startup** (the catch-up pass, then the first lifecycle tick, rung at ingestion start) drives the system to a quiescent state satisfying INV-1 ∧ INV-2 ∧ INV-3 ∧ INV-4 within the first tick (typically seconds after serving opens; bounded by that tick's freeze, rebuild, and prune workload). From any state reachable *during* a run, the lifecycle tick alone does, within a bounded number of ticks — and since runtime op failure aborts the daemon, every state a run can leave behind is one startup is built to converge. + +The split matters because some repairs are inherently catch-up's, not the tick's: a per-chunk `"freezing"` key with no hot DB behind it (a crashed catch-up write) is repaired by re-materialization, and a surgically removed range is re-derived from the LedgerBackend — no tick phase produces data. The tick's province is everything else: index transients, demotions, freezes from live hot DBs, prunes. + +Convergence rests on three properties shared by the resolver and the scans — eligibility is computed from durable meta-store state alone; ops are idempotent; everything is re-derived on every notification — plus catch-up's postcondition contract. Together, whatever a crash leaves half-done, the next tick or the next startup finishes. + +### Substrate assumptions + +Properties we rely on the underlying storage to provide: + +- **Sync WAL.** All meta-store puts and deletes that the invariants depend on use RocksDB's `WriteOptions.sync = true`, which fsyncs the WAL before the write returns. Multi-key commits — the index commit batch, the sweeps' key-delete batches — are single atomic synced WriteBatches: all-or-nothing across keys. 
+- **Per-ledger durability.** The chunk hot DB's synced WriteBatch (atomic across all CFs) is the sole per-ledger durability boundary; the watermark is derived from it, so no cross-store ordering exists to maintain. Per-artifact: the per-chunk file **and its directory entry** are fsynced before its key flips to `"frozen"`, and an index coverage's `.idx` (and its dir entry) is fsynced before the commit batch freezes its key. +- **Deterministic, idempotent writes.** Re-applying any write produces byte-identical state. Backed by deterministic LCM bytes from any conformant LedgerBackend and a byte-identical streamhash index from byte-identical sorted inputs. +- **Monotonic progress.** Within a process run, ingestion only moves forward (each synced batch extends the last), and the lifecycle's derived `completeThrough` only advances with it (hot keys and frozen keys move forward, never back). Across a crash, the startup derivation equals exactly the durable state — the pre-crash value or marginally above it (a batch that committed in the instant before the crash); it sits *below* the pre-crash value only when hot state was removed or lost, or when surgery demoted keys feeding the cold term (case 3's no-hot-key corner). There is no stored watermark to rewind; surgical recovery shrinks the derivation's inputs by demoting or removing state, not by editing a counter. + +### Design invariants + +These are streaming-specific properties the implementation guarantees on top of the substrate, and that INV-1 through INV-4 depend on: + +- **Every key precedes its file.** The pre-write `"freezing"` mark and post-fsync `"frozen"` flip mean any file on disk — per-chunk artifact or index file, partial or complete — has its meta-store key set. Every scan and sweep iterates keys, so every file is reachable that way; nothing ever lists a directory to find work. 
+- **Index promotion is atomic and gap-free.** The commit batch freezes the new coverage and demotes its predecessor in one synced write, so the window's unique frozen key changes hands atomically — never two frozen keys, never none once the window has one. A reader following the frozen key always lands on a complete, fsynced index; a crash mid-build leaves the prior coverage frozen and the attempt as `"freezing"` debris that is either overwritten by the next build of that coverage or deleted unread by the sweeps. +- **Key absent ⟹ file gone.** Every sweep's shared ordering (unlink → `fsyncDir` → atomic key delete) gives the exit-side counterpart. +- **Hot DB keys bracket the directory.** The `hot:chunk:{chunk}` key is put (`"transient"`) before the directory is created, and deleted only after rmdir completes — with `"transient"` re-marked first. +- **Tx hashes always have a queryable home.** The hot DB is discarded only after the durable `.idx` covers the chunk — hot CF, then `.idx`, with no gap. (The `.bin` is never a serving tier; it is rebuild input, demoted to `"pruning"` by the terminal commit batch — the same write that freezes the final `.idx` — or by retention pruning once its chunk falls past the floor, and deleted only by the sweep after that.) +- **`"frozen"` ⟹ the file is durable and complete.** Flips to `"frozen"` happen only after fsync, and files are deleted only under non-frozen keys (sweeps demote first) — so frozen keys can be trusted blindly by readers, the resolver, and `buildTxhashIndex`'s precondition check. +- **`"pruning"` is committed.** Once a key is in `"pruning"` — demoted by a commit batch or by retention — the sweep runs to completion on subsequent scans. Catch-up treats any non-`"frozen"` state as empty and overwrites cleanly if the range is re-ingested. 
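The exit-side ordering named above (unlink → `fsyncDir` → key delete) can be sketched in Go. This is a minimal illustration under stated assumptions, not the daemon's implementation: `catalog` is a hypothetical in-memory stand-in for the meta store, and `sweep` and `demoSweep` are illustrative names. The directory fsync assumes a POSIX-style filesystem where a directory can be opened and synced.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// catalog is a hypothetical in-memory stand-in for the meta store.
type catalog map[string]string

// fsyncDir makes the directory's entries (including a removal) durable.
func fsyncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// sweep retires one non-frozen key in the order the invariant requires:
// unlink the file, fsync its directory, and only then delete the key.
// A missing file is tolerated so a crashed sweep can simply be re-run.
func sweep(cat catalog, key, path string) error {
	if cat[key] == "frozen" {
		return fmt.Errorf("refusing to sweep frozen key %s", key)
	}
	if err := os.Remove(path); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return err
	}
	if err := fsyncDir(filepath.Dir(path)); err != nil {
		return err
	}
	delete(cat, key) // key absent implies the file is already gone
	return nil
}

// demoSweep runs the sweep twice against a temp dir; the second run
// models a retry after a crash between the unlink and the key delete.
func demoSweep() (fileGone, keyGone bool, err error) {
	dir, err := os.MkdirTemp("", "sweep-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "00005100-00005349.idx")
	if err := os.WriteFile(path, []byte("stale"), 0o644); err != nil {
		return false, false, err
	}
	key := "index:00000005:00005100:00005349"
	cat := catalog{key: "pruning"}
	for i := 0; i < 2; i++ {
		if err := sweep(cat, key, path); err != nil {
			return false, false, err
		}
	}
	_, statErr := os.Stat(path)
	_, present := cat[key]
	return errors.Is(statErr, fs.ErrNotExist), !present, nil
}

func main() {
	fileGone, keyGone, err := demoSweep()
	fmt.Println(fileGone, keyGone, err)
}
```

Deleting the key last is what keeps a crashed sweep discoverable: the key still names the work until the unlink is durable.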
+ +### Scenario coverage + +INV-1 holds at every point the daemon is serving reads — transient states are never externally visible, because readers resolve `"frozen"` keys exclusively and the retention check masks everything else. INV-2, INV-3, and INV-4 hold at every quiescence reached after the events below; startup's first quiescence arrives when the first tick completes, shortly after reads open. + +1. **Steady-state operation.** Hot DB ingestion advances `last_committed_ledger`; the lifecycle goroutine freezes complete chunks within retention and prunes anything past it. All four invariants hold by induction on `last_committed_ledger`. +2. **Operator state changes** — retention widening or shortening (`retention_chunks`), `earliest_ledger` raised. Both reduce to "`effectiveRetentionFloor` recomputes; the next startup converges to the new state." Catch-up's per-window resolver rule re-derives and rebuilds any window whose desired coverage now exceeds its stored coverage; the prune stage removes anything below a raised floor. The "next startup" is load-bearing for widening, enforced by the floor's two-role split: a lowered floor takes effect immediately in its *retention* role (pruning simply stops sooner), but the tick's *production* range still starts at existing storage — only the next catch-up, behind `validateBackendCovers`, materializes the new bottom. +3. **Surgical recovery, frozen-range case** (tainted range strictly below `chunkID(last_committed_ledger)`). The operator never touches the filesystem. 
Recovery is **one atomic meta-store batch**: every `chunk:{c}:*` key in the tainted range and every `index:*` key of every window overlapping it → `"freezing"` — the state that already means *this file is not to be trusted: re-derive or delete* — and any leftover `hot:chunk` key in the range (a crash between a tick's build and its discard stage leaves a frozen chunk's hot DB behind) → `"transient"`, which makes it instantly ineligible as a source (`catchupSource` reads only `"ready"`). The batch commits atomically or not at all, so there is no interruption analysis and re-running it is a no-op; the meta store's lock means it can only be written against a stopped daemon. Every demoted key then converges through machinery that already exists: on restart, catch-up re-derives the freezing chunk artifacts from a conformant LedgerBackend — overwriting the tainted files in place, rule 1's ordinary re-materialization — and rebuilds each window's index, re-marking the freezing index key whose coverage is still desired (or leaving it to the prune scan's sweep-on-sight when retention has moved past it); the discard scan retires the transient hot key once the rebuilt chunk regains coverage — or past retention, after long downtime — unlinking the tainted hot DB unread. `last_committed_ledger` is exactly unchanged whenever ingestion has ever started: the batch keeps every hot key, the positional term counts them value-blind, and the live chunk's `"ready"` DB restores the watermark precisely. Only in the no-hot-key corner — the daemon stopped during initial catch-up, before ingestion opened its first hot DB — does the cold term lead the derivation, and there it can regress through every untainted chunk of a finalized window the taint overlaps (their `.bin` keys were swept at finalization, and the demoted index key breaks coverage), bounded by one window; catch-up's `max(tip, lastCommitted)` anchor re-derives forward regardless. +4. 
**Surgical recovery, partial-tail-chunk case** (tainted range includes the live chunk), and equally **hot-volume loss**. The same artifact-key batch as case 3, but hot keys are **removed**, not demoted — case 3's leftover hot DB holds data that exists elsewhere and sits below the watermark, so a demoted key is harmlessly reclaimed; here the hot DB *was* the data, and a kept key either re-fires the fatal or silently inflates the watermark. Remove **every** `hot:chunk` key whose dir is missing or whose contents are partial. A half-recovery that keeps a `"ready"` key merely relocates the fatal to the next restart. One that keeps the boundary-crash `"transient"` key — the next chunk's key, sitting above the last frozen boundary — is recoverable but worse than compliance: the positional term counts all hot keys while the fatal checks only `"ready"` ones, so the kept key silently props the derived watermark up to the lost chunk's end, pulling the lost chunk into catch-up's anchored range, which re-derives it from the bulk source — stalling startup until that source has the chunk, where full removal would have resumed from the last frozen boundary immediately. Remove any surviving dirs along with the keys — dirs first, keys last: an interrupted pass then leaves keys whose dirs are missing, which the next startup catches (the fatal for `"ready"`, the watermark-prop analysis above for `"transient"`), never a keyless orphan dir no key-driven scan can find. The hot DB is the only copy of its ledgers — discarding it loses them, and the **derived watermark admits as much automatically**: with the hot DB gone, derivation lands at the last frozen boundary, and the next startup re-ingests from there. There is no watermark to edit; recovery is key removal, never file surgery beyond the lost dirs themselves. +5. 
**First deployment / downtime between restarts.** `last_committed_ledger` derives to `max(frozen/hot maxima, earliest_ledger - 1)`, ensuring `resumeLedger ≥ earliest_ledger`. Backfill fills `[earliest_ledger, lastCompleteChunkAt(network_tip)]` if needed (a no-op for `earliest_ledger = "now"` first deployment). +6. **LedgerBackend choice or mid-flight swap.** The LedgerBackend contract guarantees canonical LCM bytes for any range, so any conformant backend produces byte-identical artifacts. Different backends differ in performance, not behavior. An operator using BSB for backfill and CaptiveCore for hot DB ingestion, or swapping mid-deployment, satisfies all four invariants. +7. **Crash at any point during any of the above.** Sync WAL plus per-ledger durability ordering mean the meta store on next start is internally coherent and the derived watermark equals exactly what the last synced batch committed. Idempotency means re-running any half-finished op is safe. Convergence finishes whatever the crash interrupted. + +### What a bug looks like + +The invariants describe what storage should look like, not how the phase scans maintain it. So common bugs show up as concrete violations: + +- **A meta-store key claims something the file doesn't actually deliver** — e.g., a per-chunk writer flips a key to `"frozen"` before fsync (leaving a partial file the meta store advertises as complete), or an index key freezes before its `.idx` is fully fsynced, or the key name's `{lo, hi}` doesn't match the file's actual coverage, or a frozen file is mutated post-freeze ⟹ reads through the meta key see wrong or missing data. **INV-1** violated. Detectable by re-deriving an artifact via a conformant LedgerBackend and byte-comparing against the on-disk file. +- **Pruning too aggressive** ⟹ a request whose ledger scope is in retention returns wrong or missing results. Issue a read to find it. **INV-1** violated. 
+- **Two frozen index keys in one window** — a commit batch failed to demote the predecessor, or promotion and demotion landed as separate writes ⟹ readers have no well-defined index. Walk `index:*` keys, count `"frozen"` per window. **INV-2** violated. +- **A `"freezing"` or `"pruning"` key survives served quiescence** ⟹ its recovery mechanism was skipped — an index transient the sweeps should have deleted, a `"pruning"` demotion the sweeps should have finished, or a per-chunk `"freezing"` key that the freeze phase or startup catch-up should have re-materialized. Walk keys for transient values at quiescence. **INV-2** violated. +- **Chunk scan misses an orphan** ⟹ a hot DB persists for a chunk that cold artifacts fully serve. Walk `hot:chunk:c` keys whose chunk has its artifacts durable and its window's index covering `c`. **INV-2** violated. +- **Finalization demotions don't complete** ⟹ per-chunk frozen tx hash files outlive the index that consumed them. Walk `chunk:c:txhash` keys whose window's frozen key has `hi` = the window's last chunk. **INV-2** violated. +- **A writer leaves a file on disk without its meta-store key** (file fsynced before key was durable, or a sweep deleted the key before its unlink was durable) ⟹ orphan file — invisible to every key-driven scan. Walk the filesystem against the meta-store. **INV-3** violated. +- **A meta-store key persists without its file** (file deleted before key) ⟹ dangling key. Walk the meta-store against the filesystem. **INV-3** violated. +- **Duplicate cold artifacts for the same logical data** (e.g., two events files for the same chunk, from a migration or buggy retry) ⟹ the meta-store names one expected path; the extras are orphans. Walk the filesystem against meta-store-specified paths. **INV-3** violated. +- **Pruning fails past the floor** ⟹ files or keys remain for ranges below `effectiveRetentionFloor`. Walk meta-store keys, compare ledger ranges to the floor. **INV-4** violated. 
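As an illustration of one such walk, the "count `"frozen"` per window" audit could look like the Go sketch below. It assumes the meta store has been dumped into a plain `map[string]string` and that index keys follow the `index:{window}:{lo}:{hi}` layout from the transactions design; `frozenPerWindow` and `violations` are hypothetical helper names, not part of the daemon.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// frozenPerWindow counts "frozen" index keys per window id.
// Keys that are not index keys (or not frozen) are ignored.
func frozenPerWindow(keys map[string]string) map[string]int {
	counts := make(map[string]int)
	for k, v := range keys {
		parts := strings.Split(k, ":")
		if len(parts) != 4 || parts[0] != "index" || v != "frozen" {
			continue
		}
		counts[parts[1]]++
	}
	return counts
}

// violations returns, sorted, every window holding more than one frozen
// coverage, which is the uniqueness violation described in the list above.
func violations(keys map[string]string) []string {
	var bad []string
	for w, n := range frozenPerWindow(keys) {
		if n > 1 {
			bad = append(bad, w)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	keys := map[string]string{
		"index:00000004:00004000:00004999": "frozen",
		"index:00000005:00005100:00005349": "frozen",
		"index:00000005:00005100:00005350": "frozen", // predecessor never demoted
		"index:00000006:00006000:00006010": "freezing",
		"chunk:00005350:txhash":            "frozen",
	}
	fmt.Println(violations(keys)) // prints [00000005]
}
```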
+ +A storage walk against the invariants is enough to find these without inspecting the phase implementations. + +--- + +## Related documents + +- The transactions design ([gettransaction-full-history-design.md](./gettransaction-full-history-design.md)) — the tx-by-hash subsystem end to end: the hot `txhash` CF, the `.bin`/`.idx` formats, the rolling window index build protocol with its pseudocode and safety arguments, the `getTransaction` read path, and the capacity numbers. Canonical for everything rules 3 and 5 and the index-key section summarize. +- The events design ([getevents-full-history-design.md](./getevents-full-history-design.md), PR #635) — the cold-segment file formats and the hot events CF schema referenced by the data model. +- The reader / query-routing design — how reads dispatch between hot DBs and frozen files for in-retention queries. +- The original backfill workflow design ([../full-history/design-docs/03-backfill-workflow.md](../full-history/design-docs/03-backfill-workflow.md)) — the standalone backfill mode, **subsumed by this document**: its `process_chunk`, tx-hash index build and cleanup, geometry, configuration, directory layout, meta-store keys, crash recovery, and `getStatus` are all redefined here (and in the transactions design) in their current form. It is retained for history; where the two disagree, this document is current. In particular, that doc predates the 2026-06 streamhash redesign, so its 16-CF RecSplit `.idx` files, unsorted 36-byte `.bin` entries, `"1"`-valued meta keys, task-DAG scheduler, and standalone `full-history-backfill` CLI are all superseded. 
diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md new file mode 100644 index 000000000..9cf8fa827 --- /dev/null +++ b/design-docs/gettransaction-full-history-design.md @@ -0,0 +1,362 @@ +# RPC getTransaction Full-History Design + +## Summary + +How the full-history daemon ingests and serves transactions for the tx-by-hash endpoint (`getTransaction`). A transaction lookup is a two-step read: resolve the hash to a ledger sequence, then fetch the transaction from that ledger's stored LCM. This document is the canonical reference for the resolution structure end to end — the hot tier (a column family in the per-chunk hot RocksDB) and the cold tier (per-chunk sorted runs merged into a per-window minimal-perfect-hash index), the file formats, the rolling rebuild that keeps the cold tier current, the read path, and the capacity numbers. + +The daemon context — chunk geometry, the meta store, the one write protocol, catch-up, and the lifecycle tick — is defined in [full-history-streaming-workflow.md](./full-history-streaming-workflow.md) (the streaming doc). That document references this one for everything tx-hash-specific and restates only what its own protocols depend on. + +--- + +# Part 1: Problem and Scope + +## 1. Objective + +Serve `getTransaction(hash)` for any transaction whose ledger falls within the retention window (full history by default): + +- **Complete.** Every transaction in every in-retention ledger is resolvable by hash. No gaps, including across crashes, restarts, and retention changes. +- **Correct.** A lookup never returns the wrong transaction. A missing or out-of-retention transaction returns not-found — never an error dressed as data, never stale bytes. +- **Cheap to serve.** A cold lookup costs one index probe plus one ledger fetch. Memory does not scale with history size. 
+- **Cheap to maintain.** Ingestion adds negligible cost to the per-ledger write path, and the cold index stays current with a rebuild that is small relative to its cadence. + +Out of scope: how readers obtain the daemon's current coverage and dispatch between tiers across rebuild/freeze/discard transitions (the query-routing design), and the storage of the transactions themselves (the ledger store — `.pack` files and the hot `ledgers` CF — covered by the streaming doc and the packfile library doc). + +## 2. Lookup model + +`getTransaction` takes a 32-byte transaction hash and returns the transaction's envelope, result, and meta, plus its ledger and close time. The data flow: + +``` +hash ──► seq ──► LCM for seq ──► extract the tx ──► verify hash ──► respond + (this doc) (ledger store) +``` + +The subsystem this document owns is the **hash → seq map**, plus the read-path rules that make the final fetch correct. Three properties of the key space shape the design: + +- **Point lookups only.** There are no range or prefix queries over tx hashes, so order-preserving structures buy nothing — perfect-hash structures apply. +- **Hashes are uniform and immutable.** A transaction hash is never updated and corresponds to at most one applied transaction (the network's replay protection); the map is append-only, one batch of entries per ledger. +- **The full transaction is always fetched anyway.** The response needs the envelope/result/meta, so the read path always ends with the ledger store and can verify the full hash against the fetched transaction. The map therefore doesn't need to be exact — it needs to be *complete* (no false negatives) and *cheap*, with false positives screened first by a fingerprint and finally by the fetch-and-verify step. + +--- + +# Part 2: Architecture + +## 3. 
Two tiers, one home + +At any moment, every in-retention transaction hash has **exactly one queryable home**: + +| Tier | Structure | Serves | +|---|---|---| +| **Hot** | `txhash` CF of the per-chunk hot RocksDB | the live chunk, plus any frozen chunk the window index doesn't cover yet | +| **Cold** | one streamhash `.idx` per window, covering chunks `[lo, hi]` | every chunk at or below the window's frozen `hi` | + +``` + window w + chunks: [lo ···························· hi] [hi+1 ···] [live] + served by: └──────── {lo}-{hi}.idx ─────────┘ hot DBs hot DB + (awaiting (being + coverage) written) +``` + +The handoff is gap-free by write ordering: a chunk's hot DB is discarded only after the durable `.idx` covers the chunk (the streaming doc's discard stage gates on exactly this). Between a chunk's freeze and its coverage — normally one lifecycle tick, since freeze, rebuild, and discard chain within the tick — the chunk is served from its still-present hot DB. + +There is **no dedicated transaction store at any layer**. The map resolves hashes to sequences; transaction bytes live in the ledger store (`ledgers` CF while hot, `.pack` files when cold). + +## 4. Geometry + +Two units organize the map; both are defined in the streaming doc and restated here because every structure below is named by them: + +- **Chunk** — 10,000 ledgers (hardcoded). The unit of the hot DB and of the sorted runs. +- **Window** — `chunks_per_txhash_index` chunks (default 1000 = 10M ledgers). The unit of the cold index. Configurable, but pinned in the meta store on first start and immutable thereafter. 
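A minimal Go sketch of this geometry, mirroring the formulas given in this section. The function names follow those formulas; passing `cpi` as a parameter and using `uint32` are assumptions made for illustration, and `chunkID` assumes `seq ≥ 2`.

```go
package main

import "fmt"

// ledgersPerChunk is the hardcoded chunk size.
const ledgersPerChunk = 10_000

// chunkID maps a ledger sequence (assumed >= 2) to its chunk.
func chunkID(seq uint32) uint32 { return (seq - 2) / ledgersPerChunk }

// chunkFirstLedger and chunkLastLedger bound a chunk, inclusive.
func chunkFirstLedger(c uint32) uint32 { return c*ledgersPerChunk + 2 }
func chunkLastLedger(c uint32) uint32  { return (c+1)*ledgersPerChunk + 1 }

// indexID takes a CHUNK id; cpi stands for chunks_per_txhash_index.
func indexID(c, cpi uint32) uint32 { return c / cpi }

// chunksInIndex returns the inclusive chunk range of window w.
func chunksInIndex(w, cpi uint32) (lo, hi uint32) {
	return w * cpi, (w+1)*cpi - 1
}

func main() {
	const cpi = 1000 // the default window size
	lo, hi := chunksInIndex(0, cpi)
	fmt.Printf("window 0: chunks %d-%d, ledgers %d-%d\n",
		lo, hi, chunkFirstLedger(lo), chunkLastLedger(hi))
	fmt.Printf("chunk 5350 belongs to window %d\n", indexID(5350, cpi))
	fmt.Printf("zero-padded chunk id: %08d\n", chunkID(53_500_002))
}
```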
+ +``` +chunkID(seq) = (seq - 2) / 10_000 +chunkFirstLedger(c) = c * 10_000 + 2 +chunkLastLedger(c) = (c + 1) * 10_000 + 1 +indexID(c) = c / chunks_per_txhash_index # takes a CHUNK id +chunksInIndex(w) = [w*cpi, (w+1)*cpi - 1] # cpi = chunks_per_txhash_index +``` + +With the default `chunks_per_txhash_index = 1000`: window 0 spans ledgers 2–10,000,001 (chunks 0–999), window N spans N×10M+2 – (N+1)×10M+1 (chunks N×1000 – (N+1)×1000−1). All ids zero-pad `%08d`. + +--- + +# Part 3: Implementation Reference + +## 5. Hot tier + +### 5.1 Storage + +The `txhash` CF is one column family of the per-chunk hot RocksDB (alongside `ledgers` and the events CFs — see the streaming doc's data model): + +- **Key**: the full 32-byte transaction hash. +- **Value**: the 4-byte ledger sequence. + +Full-key storage means the hot tier is *exact*: a lookup of a hash not in the chunk simply misses, with no fingerprint or verification subtleties. The CF carries its own tuning options (point-lookup-oriented: bloom filters, no ordering requirements) independent of its siblings. + +### 5.2 Write path + +The hot write path is the streaming doc's ingestion loop verbatim: each ledger commits as **one atomic, synced WriteBatch across all CFs** of the chunk's hot DB, and `putTxHashes` contributes one `(hash, seq)` entry per transaction in the LCM. A ledger's hashes are either all present or all absent; there is no separate tx-hash durability boundary to reason about. + +### 5.3 Lifetime + +The chunk's hot DB is created when ingestion enters the chunk and discarded whole once every cold artifact derived from the chunk is durable **and** the window index covers the chunk. The `txhash` CF is the reason for the *and*: the `.bin` (below) is never a serving tier, so without the coverage gate there would be a window where a frozen chunk's hashes had no queryable home. 
In steady state the freeze-to-coverage interval is the one tick in which the boundary's `ChunkBuild` and `IndexBuild` run; after catch-up or crashes it can span longer — the discard scan re-derives eligibility from durable keys, so the hot DB simply persists until coverage genuinely lands. + +## 6. Cold artifacts + +Two artifact kinds, both produced and cataloged under the streaming doc's one write protocol (mark `"freezing"` before any I/O → write → fsync file + dirents → flip `"frozen"`): + +### 6.1 The per-chunk sorted run: `.bin` + +`txhash/raw/{bucket:05d}/{chunk:08d}.bin`, meta-store key `chunk:{chunk:08d}:txhash`. Produced by `processChunk` in the same streaming pass that writes the chunk's `.pack` and events segment — per ledger, the tx-hash extractor collects entries; at the end of the pass they are **sorted in memory** (~3M entries ≈ 60 MB for a dense chunk — negligible) and written out. + +**Format** (the streamhash merge format): + +``` +uint64 LE entry count +entry × count 20 bytes each: [key: 16][seq: 4 LE] +``` + +- `key` is the **first 16 bytes of the transaction hash**. +- Entries are sorted ascending by the **big-endian `uint64` prefix of `key`**. + +The `.bin` is a *map-side-sorted run*, never a serving tier. Sorted runs are what make the rolling rebuild cheap: the index builder consumes them in a single streaming k-way merge, instead of the two passes (count, then add) it needs over unsorted input. + +A `.bin` lives as long as its window needs it as rebuild input: every boundary re-merges **all** of the current window's runs, so they are retained for the whole life of the window and demoted en masse by the terminal build's commit batch (§7.3), then swept. + +### 6.2 The per-window index: `.idx` + +`txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, meta-store key `index:{window:08d}:{lo:08d}:{hi:08d}`. 
One streamhash minimal-perfect-hash file per **coverage**, built by streamhash's `SortedBuilder` over the k-way merge of `.bin[lo..hi]`, with the cold-txhash option set: + +- **Payload: 3 bytes** — the ledger seq stored as an offset from `MinLedger`, where `MinLedger = chunkFirstLedger(lo)` is derived from the build range and embedded in the file as user metadata. 3 bytes spans 16.7M ledgers, comfortably over a 10M-ledger window. No sidecar metadata. +- **Fingerprint: 1 byte** — screens foreign keys (§8.2). + +All-in, the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. + +These formats match the measured pipeline in the bench harness (`bench-fullhistory`: `cold-ingest --types=txhash` + `build-txhash-index`), which is where the performance figures in Part 4 come from — adopting the formats unchanged is what makes those figures transfer to this design. + +### 6.3 Keys, coverage, and the uniqueness invariant + +An index key's **name carries the coverage; the value carries only lifecycle state** (`"freezing"` / `"frozen"` / `"pruning"`, with the same meanings as every artifact key in the system). The filename is derived from the key by a fixed bijection, so resolving a key to its file never reads the value or lists a directory — every file on disk, including a crashed attempt's partial, is reachable from its key alone. + +- **Coverage is the whole identity** — there is no per-attempt counter. A retry of a crashed build re-marks the same key and rewrites the same file from scratch. The file readers hold is never a writer's target: a file is writable only under a key that has never been `"frozen"` in this run, and a scheduled build's coverage always differs from the window's frozen coverage — equality is precisely the case the build's skip check returns on. +- **`lo`** — the lowest chunk the perfect hash covers. 
Defaults to the window's first chunk; rises above that when `earliest_ledger` or the sliding retention floor cuts into the window at build time. `lo` rising is how a mid-window floor is encoded — there is no other floor representation in the index. +- **`hi`** — the highest chunk the perfect hash covers. For the *current* window (the one the network tip is in), `hi` advances by one chunk on each boundary. For a *finalized* window, `hi` is the window's last chunk. +- **Terminal-ness is derived, not stored**: a key is terminal when its `hi` equals its window's last chunk (`windowLastChunk` is computable forever from the window id and the immutable `chunks_per_txhash_index` pin). A window whose frozen key is terminal is finalized — never rebuilt again (only a retention-widening catch-up re-derives it, at its new, wider coverage). There is no `"finalized"` value and no marker: it would be a second copy of a derivable fact. + +**The uniqueness invariant: at most one coverage per window is `"frozen"` at any moment — at all times, not just at quiescence.** The rebuild's commit is one atomic synced batch that promotes the new coverage and demotes its predecessor in the same write, so the frozen coverage changes hands atomically. Readers resolve "the window's index" as *the unique frozen key*: no tie-break, no value parsing. Everything else under the window's prefix is transient debris with an unambiguous disposition: `"freezing"` = a crashed attempt — re-marked and overwritten if its coverage is built again, otherwise swept; `"pruning"` = a superseded coverage — finish the unlink and drop the key. + +So the `.idx` hashes exactly the transactions in chunks `[lo, hi]`. Chunks below `lo` are out of scope (floor); chunks above `hi` are not yet folded in — their lookups are served from their chunks' hot DBs until the next rebuild advances `hi`. + +**Concrete example** (default `chunks_per_txhash_index = 1000`): while the tip is in chunk 5350, window 5 (chunks 5000–5999) is the current window. 
If the floor sits at chunk 5100, the store holds `index:00000005:00005100:00005349 = "frozen"` and the live file is `txhash/index/00000005/00005100-00005349.idx` — covering chunks 5100–5349, with chunk 5350 still streaming into its hot DB and chunks 5000–5099 below the floor. The next boundary puts `index:00000005:00005100:00005350 = "freezing"`, writes and fsyncs `00005100-00005350.idx`, then commits the batch {`[5100,5350]` → `"frozen"`, `[5100,5349]` → `"pruning"`}; the eager sweep unlinks the superseded file and deletes its key right after the commit. + +## 7. The rolling rebuild + +### 7.1 Why rebuild from scratch on every boundary + +The current window's index is **re-derived from scratch on every chunk boundary** to absorb the chunk that just froze, growing until its window completes. Only the window the tip is in is ever rebuilt; a finalized window's index is static. + +The rebuild is cheap relative to its cadence: a full-window build is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates (Part 4). That ratio is what buys the design's simplicity — because the index is always rebuilt whole from sorted inputs: + +- There is no incremental-update machinery and no partially-updated index state for a crash to expose. Every `.idx` on disk is a complete, deterministic function of its coverage. +- `lo` tracks the floor and `hi` tracks the tip automatically — no separate floor-driven rebuild is ever needed while the window is current. +- Catch-up and the steady-state tick share the build path identically (one scheduler, one set of postconditions — see §9), so there is no second regime to verify. +- A same-coverage rebuild writes byte-identical output, which collapses crash recovery to "re-mark and rewrite" with no salvage analysis. + +### 7.2 The build protocol + +`buildTxhashIndex(w, lo, hi)` runs the one write protocol with a batch-commit extension. 
The covered range is explicit: `lo` defaults to the window's first chunk, rising to `chunkID(effectiveRetentionFloor)` when the floor cuts in; `hi` is the highest frozen chunk in the window — the window's last chunk once it's complete, lower while it's still filling. + +1. **Skip check**: if the window's unique frozen key already covers exactly `[lo, hi]`, return — there is nothing to write, and any leftover transient keys are the sweeps' job, not the builder's. (A frozen key covering the full window is terminal by definition, so the skip also covers re-scheduled builds of finalized windows, which must not demand `.bin` inputs the sweep has deleted.) +2. **Mark**: put `index:{w:08d}:{lo:08d}:{hi:08d}` = `"freezing"` — an idempotent overwrite when a crashed attempt (or a demoted coverage made desired again by a cross-restart regression) left the key behind. The build is **terminal** iff `hi` is the window's last chunk — a derived property, marked nowhere. +3. **Write**: k-way merge the sorted `.bin` files for chunks `[lo, hi]` into streamhash's `SortedBuilder`, writing `{lo:08d}-{hi:08d}.idx` — created or truncated wholesale; a writer only ever holds a file whose key is non-frozen, never one a reader can resolve. Fsync the file and its dir (and the dir's own dirent when this build created the window dir — the first build of a window). +4. **Commit**: one atomic synced batch — this coverage `"freezing"` → `"frozen"`; the window's predecessor frozen coverage (if any) → `"pruning"`; and iff this is the terminal build, every `chunk:{c}:txhash` key in the window → `"pruning"`. This batch is the *entire* finalization protocol — there is no separate cleanup step; the demoted keys become ordinary sweep work. + +Precondition: every chunk in `[lo, hi]` has `chunk:{c}:txhash == "frozen"` (its `.bin` exists). 
The function fails loudly if violated — checked before any key is touched, which is also the backstop for a build whose input task failed in the executor (the streaming doc's done-channels broadcast completion, not success). + +```go +func buildTxhashIndex(w WindowID, lo, hi ChunkID, cat Catalog) error { + // Step 1 — skip check. Also covers re-scheduled builds of finalized + // windows, whose .bin inputs the sweeps may already have removed. + if fk := frozenCoverage(cat, w); fk != nil && fk.Lo == lo && fk.Hi == hi { + return nil + } + // Precondition, checked loudly before any write. + for c := lo; c <= hi; c++ { + if cat.State(c, TxHashBin) != Frozen { + return fmt.Errorf("window %d: chunk %d .bin not frozen", w, c) + } + } + + // Step 2 — mark the coverage. Re-marking a crashed attempt's key (or a + // demoted coverage a cross-restart regression made desired again) is an + // idempotent overwrite; the file is rewritten wholesale either way. + terminal := hi == windowLastChunk(w) // derived; no marker anywhere + key := indexKey(w, lo, hi) + cat.Put(key, "freezing") + + // Step 3 — write the coverage's file from scratch (create-or-truncate: + // a crashed attempt's partial is overwritten wholesale, never appended). + f := createTruncate(indexFilePath(key)) + merge := newKWayMerge(binPaths(lo, hi)) // sorted runs → one streaming pass + sb := streamhash.NewSortedBuilder(f, coldTxhashOptions(lo)) // §6.2 options + for merge.Next() { + sb.Add(merge.Entry()) + } + sb.Finish() + fsyncFile(f) + fsyncDir(indexWindowDir(w)) // + fsyncParentDir on the window dir's first build + + // Step 4 — commit: ONE atomic synced batch, the entire finalization + // protocol. Note: no file is unlinked here, ever — the batch only + // DEMOTES keys, and deletion is exclusively the sweeps' job (§7.4): + // eagerly by buildThenSweep right after this batch, in both regimes; + // the tick's prune scan is the crash backstop. 
+ batch := cat.NewBatch() + batch.Put(key, "frozen") + if prev := frozenCoverage(cat, w); prev != nil { + batch.Put(prev.Key, "pruning") // supersede the predecessor — a distinct + // key: the skip check returned if the + // frozen coverage equaled [lo, hi] + } + if terminal { // demote every input key in the window + for c := windowFirstChunk(w); c <= hi; c++ { + if cat.Has(chunkKey(c, TxHashBin)) { + batch.Put(chunkKey(c, TxHashBin), "pruning") + } + } + } + batch.Commit() + return nil +} +``` + +### 7.3 Finalization + +A window finalizes when its terminal build's commit batch lands — the same write that freezes the full-coverage `.idx` demotes every `chunk:{c}:txhash` key in the window to `"pruning"`. Finalization is therefore never a separate step, for either caller: catch-up calls the same function for every window its range overlaps (a complete window's full desired range is a terminal build; the trailing window's producible range is non-terminal), and the boundary tick's window-end rebuild is terminal by arithmetic. + +After finalization the `.idx` is static. If the floor later advances *within* the finalized window, the file becomes stale (`lo` references chunks pruning has since removed) — deliberately tolerated: index keys are swept only when their window falls *wholly* past the floor, and the read path handles the straddling case (§8.4). A window-straddling floor exists at most once at any moment — the window containing the effective retention floor. + +### 7.4 Sweeps and disk bounds + +The commit batch only demotes keys; all file deletion happens through the streaming doc's key-driven sweeps (unlink → `fsyncDir` → delete key — key absent ⟹ file gone). Two call sites: + +- **Eagerly, inside every `IndexBuild`'s execution** (`buildThenSweep`, right after the commit batch, in both regimes): sweep the window's superseded coverage and, after a terminal build, its demoted `.bin` inputs. 
The sweep is **window-local** — it walks only this window's keys, so concurrent windows' sweeps touch disjoint keys and files. +- **The tick's prune scan** — the crash backstop, and the owner of retention pruning. A `"freezing"` index key it observes was *not* retried (builds run before the sweep in every regime), so its coverage is no longer desired: **delete file and key, never salvage**. The file might even be complete, but proving that buys nothing — a rebuild re-derives identical bytes — and a single no-questions rule collapses the crash inventory. + +The eager site is what bounds disk. Without it, a long backfill would accumulate every finalized window's demoted `.bin`s until the first tick (≈20 bytes per transaction across all of history); with it, transient `.bin` disk is bounded by the windows actually in flight — the floor is one dense window's worth (≈60 GB), irreducible because a window's build merges all of its runs at once. + +**Provisioning note**: the old and new coverage files coexist from the start of a rebuild's write until the eager sweep's unlink, so the window dir transiently holds ~2× the index size (~25 GB at the end of a dense full window), and the window-end rebuild writes ~12.5 GB in ~1 minute (~200 MB/s burst) — trivial on instance NVMe, but worth provisioning for on throughput-capped volumes like EBS gp3. + +### 7.5 Why rewriting coverage-named files in place is safe + +The question to ask, since readers hold the live `.idx` open while the next coverage is written. Four facts carry the argument: + +1. **The skip rule.** A build's target name equals the live file's name only when its coverage equals the frozen coverage — which is exactly the case the skip check returns on. So no scheduled build ever opens the file readers resolve. +2. 
**Stage ordering and sweep scope.** No sweep runs where a build could collide with it: the `"freezing"`-key sweep — the only sweep that can touch a name a future build may target — lives solely in the tick's prune scan, which follows the plan stage (and catch-up precedes the lifecycle goroutine entirely); the eager sweep inside `buildThenSweep` touches only `"pruning"` keys in its own window, strictly after that window's commit; and a plan holds at most one `IndexBuild` per window — so concurrent windows' sweeps and builds touch disjoint keys and files. +3. **Floor monotonicity and reader lifetime.** Desired coverage is monotone within a run, and reader file handles die with the process — so a name any reader has resolved is never rewritten while held. +4. **Determinism.** A same-coverage rebuild writes identical bytes regardless — the merge is a deterministic function of the coverage — leaving a partial file under a `"freezing"` key as the only hazardous state, and no reader resolves a non-frozen key. + +A change to any of the four — the skip rule, the stage ordering or the eager sweep's window-locality and pruning-only scope, the floor's monotonicity, or reader lifetime — must re-prove this argument. 
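The skip rule (fact 1) can be illustrated with a short Go sketch. It assumes the key and path layouts this design quotes for index coverages (`index:{window:08d}:{lo:08d}:{hi:08d}` and `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`). The `Coverage` type and the `shouldSkip` helper are hypothetical names introduced for illustration, not the daemon's actual API.

```go
package main

import "fmt"

// Coverage is a hypothetical value type for one window's chunk range [Lo, Hi].
type Coverage struct {
	Window, Lo, Hi uint32
}

// indexKey renders the meta-store key for a coverage.
func indexKey(c Coverage) string {
	return fmt.Sprintf("index:%08d:%08d:%08d", c.Window, c.Lo, c.Hi)
}

// indexFilePath renders the coverage-named file path derived from the key.
func indexFilePath(c Coverage) string {
	return fmt.Sprintf("txhash/index/%08d/%08d-%08d.idx", c.Window, c.Lo, c.Hi)
}

// shouldSkip mirrors the skip check: a build is a no-op only when the
// requested coverage equals the window's frozen coverage.
func shouldSkip(frozen *Coverage, want Coverage) bool {
	return frozen != nil && *frozen == want
}

func main() {
	live := Coverage{Window: 3, Lo: 3000, Hi: 3499} // what readers resolve
	next := Coverage{Window: 3, Lo: 3000, Hi: 3500} // the next boundary's build

	fmt.Println("live file:", indexFilePath(live))
	fmt.Println("next file:", indexFilePath(next))
	fmt.Println("same coverage skipped:", shouldSkip(&live, live))
	fmt.Println("wider coverage skipped:", shouldSkip(&live, next))
}
```

Because the file name is a pure function of the coverage, a build that is not skipped necessarily targets a different name from the one readers currently resolve.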
+ +### 7.6 Crash matrix + +| Crash point | Durable state left | Convergence | +|---|---|---| +| after step 2, or mid step 3 | predecessor coverage still `"frozen"` (readers unaffected); new key `"freezing"`, file absent/partial/complete | next build of the coverage re-marks and rewrites wholesale; if the coverage is no longer desired, the prune scan deletes file + key unread | +| after step 4, before the eager sweep | new coverage `"frozen"` and live; predecessor `"pruning"`; terminal: window's `.bin` keys `"pruning"` | the sweeps finish — eager on the next build, prune scan otherwise | +| mid-sweep | `"pruning"` key outlives the durable unlink | the sweep re-runs; key absent ⟹ file gone | + +At no crash instant are two coverages frozen, or none (once the window has one), or a `"frozen"` `chunk:{c}:txhash` key whose `.bin` has been deleted — the commit batch only ever demotes keys, and files are touched exclusively by the sweeps, under non-frozen keys. This is what lets the catch-up resolver (§9) trust `"frozen"` blindly. + +## 8. Query path + +### 8.1 Routing + +For a hash lookup, the reader walks two tiers: + +| Chunk state | Served from | +|---|---| +| at or below the frozen key's `hi` | the `.idx` named by the window's unique `"frozen"` key (filename derived from the key name) | +| above `hi` (live, or frozen and awaiting coverage) | the `txhash` CF of the chunk's hot DB | + +How the reader learns the current coverage and holds it consistently across rebuilds is the query-routing design's concern; this document requires only that the union of the two tiers covers the retention window — which the discard gate (§5.3) and the uniqueness invariant (§6.3) guarantee — and that each chunk has exactly one home at any moment. 
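The routing table reduces to one comparison against the frozen coverage's `hi`. The following is a minimal sketch of that dispatch, not the query-routing design itself. The `Tier` type and the `routeChunk` helper are hypothetical, and the sketch assumes that a window with no frozen key yet is served entirely from hot DBs.

```go
package main

import "fmt"

// Tier names where a chunk's tx-hash entries are served from.
type Tier int

const (
	ColdIdx Tier = iota // the .idx named by the window's frozen key
	HotCF               // the txhash CF of the chunk's hot DB
)

func (t Tier) String() string {
	if t == ColdIdx {
		return "cold .idx"
	}
	return "hot txhash CF"
}

// routeChunk applies the routing table. frozenHi is the hi of the window's
// unique frozen coverage; hasFrozen is false when the window has none yet
// (assumption: such a window is served entirely from hot DBs).
func routeChunk(chunk, frozenHi int64, hasFrozen bool) Tier {
	if hasFrozen && chunk <= frozenHi {
		return ColdIdx
	}
	return HotCF
}

func main() {
	const frozenHi = 3499
	for _, c := range []int64{3000, 3499, 3500, 3501} {
		fmt.Printf("chunk %d -> %v\n", c, routeChunk(c, frozenHi, true))
	}
}
```

Each chunk maps to exactly one tier for a given `hi`, which is the single-home property the paragraph above requires.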
+ +### 8.2 Cold lookup + +``` +resolve the window's unique "frozen" key + → open {lo}-{hi}.idx + → MPHF probe on the hash's 16-byte prefix + → fingerprint check (1 byte) — rejects ~255/256 of foreign keys + → seq = MinLedger + payload (3 bytes) + → retention gate: seq ≥ floor? — else not-found, no file access + → fetch the LCM for seq, extract the tx + → verify the full 32-byte hash — the correctness backstop + → respond (or not-found on mismatch) +``` + +The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. It also makes 16-byte prefix collisions harmless to serving: two distinct in-set hashes sharing a prefix would be a ~10⁻²⁰-per-window event (birthday bound over ~3×10⁹ keys against 2¹²⁸), but even then the verify step returns not-found rather than the wrong transaction. Wrong data is structurally unreachable from this path. + +### 8.3 Hot lookup + +Chunks above `hi` are probed in their hot DBs' `txhash` CF — an exact full-key point get, so misses are genuine misses with no verification subtleties (the fetch-and-verify still runs, as the response needs the transaction anyway). In steady state this tier is the live chunk plus, briefly, the chunk inside the freeze-to-coverage interval; after catch-up or a crash it can be several chunks, shrinking as rebuilds advance `hi`. + +### 8.4 ENOENT rules and the retention gate + +A read for any seq below `effectiveRetentionFloor` returns **not-found** before any file access, regardless of whether the underlying file still exists — retention is the single source of truth for "is this data available?". This is what lets pruning remove chunks the moment they pass retention without coordinating with the index lifecycle. 
Two `ENOENT` sites then need explicit rules: + +**Index-file `ENOENT` → re-resolve and retry; never surface directly.** A reader that resolves the frozen key and then loses the open race to a sweep (the key was demoted and its file unlinked between resolve and open) gets `ENOENT`. After a *supersession* demotion the retry always finds a frozen key: the commit that demoted the old coverage promoted its replacement in the same batch. After a *retention* demotion (the prune stage sweeping a window that just fell wholly past the floor) the retry finds **no** frozen key — and that is the legitimate not-found: only retention pruning ever empties a window, so the queried seq is by then below the floor, and a fresh request would be short-circuited by the retention gate anyway. (A reader already holding the old file open never even notices: POSIX unlink doesn't invalidate open handles; it picks up the new coverage on its next key resolution.) + +**Data-file `ENOENT` → not-found directly; never re-resolve.** A `.pack` the `.idx` resolved the hash into can be missing while the `.idx` itself is live: a floor-straddling window keeps its frozen index key (index keys are swept only when the window falls *wholly* past the floor), so its static `lo` keeps covering chunks whose files pruning has removed. Re-resolving is useless — the same frozen key resolves the hash identically — and unnecessary: artifacts are write-once and deletion is unlink-only, so a missing data file can only mean the chunk was pruned, never that wrong bytes could be served. Ordinarily the retention gate short-circuits these reads before any file access; this rule is what keeps them fail-soft in the one state where it doesn't (the streaming doc's INV-1 hot-volume-loss exception, where a transiently regressed floor admits already-pruned bottom chunks). + +## 9. 
Catch-up and recovery interaction + +The cold tier's converge-from-any-state story is the streaming doc's postcondition-driven resolver; the tx-hash kind contributes the one **per-window rule** (every other kind is per-chunk): + +For each window overlapping the catch-up range, compare the **stored** coverage — `{lo, hi}` from the name of the window's unique frozen index key — with the **desired** coverage `[max(window_start, chunkID(floor)), min(window_last_chunk, range_end)]`. The upper cap is what makes the rule uniform: for a complete window it is the window's last chunk, for the trailing window it is the range end, and no special trailing case exists. + +- **Desired ⊆ stored** → schedule *nothing* for this window: no `.bin` production, no build. Three states land here: every steady-state restart; a floor that *rose* (the stale stored `lo` is the read path's problem, §8.4 — never a rebuild trigger); and a finalized window the range ends inside. +- **Desired exceeds stored** (`desired_lo < stored_lo`, or `desired_hi > stored_hi`, or no frozen key exists) → request `.bin` production for **every** chunk in the desired range — chunks whose `.bin` is already frozen self-skip inside `processChunk`; chunks the old `.idx` covered re-derive from their local `.pack` files with no bulk-backend download — and emit one `buildTxhashIndex(w, desired_lo, desired_hi)`, terminal iff `desired_hi` is the window's last chunk. + +Two clauses are load-bearing: + +- **The `stored_hi` clause** catches downtime that crosses a window boundary: a window that was *current* at shutdown carries a frozen key with `hi <` its last chunk, and classifying by `lo` alone would see a frozen key and strand chunks `(hi, last_chunk]` permanently. 
+- **`"frozen"` is blindly trustable**: a `"frozen"` `chunk:{c}:txhash` key implies its `.bin` exists, unconditionally — input demotions ride the same synced write that freezes the terminal coverage, and files are deleted only by sweeps under non-frozen keys (§7.6). The resolver classifies on frozen state only; transient keys are invisible to it and are the sweeps' job. + +**Retention widening** re-derives a finalized window at its new, wider coverage: `.bin`s for previously-covered chunks come from local `.pack`s; fully-pruned chunks refetch from the bulk source; the rebuild is terminal at the wider `[lo', last_chunk]` and its commit batch demotes the old coverage. This runs at the next startup — extending the bottom of storage is exclusively catch-up's job, behind backend validation — never in a tick. One corner needs help from outside the resolver: a widening that re-froze (or left mid-write) a finalized window's `.bin` keys and was then abandoned by narrowing retention back — the resolver correctly schedules nothing (desired ⊆ stored), so the tick's prune scan demotes and sweeps those provably-redundant inputs (`"frozen"` and `"freezing"` alike) — the final `.idx` covers their chunks, and the resolver never re-materializes a covered window. + +In the executor, an `IndexBuild` waits on the done-channels of the chunk builds inside its coverage and draws from the same worker pool — the `.bin`s are map-side-sorted runs, the build is the per-window reduce. Done-channels broadcast completion, not success; the build's loud precondition (§7.2) is the backstop, by design. + +--- + +# Part 4: Capacity & Performance + +## 10. 
Storage footprint + +Per dense chunk (~3M transactions) and dense window (default 1000 chunks, ~3×10⁹ transactions): + +| Structure | Unit cost | Dense chunk | Dense window | Lifetime | +|---|---|---|---|---| +| hot `txhash` CF | 36 B/tx raw (32 key + 4 value), before RocksDB overhead | ~110 MB raw | — (per-chunk) | chunk ingestion → index coverage | +| `.bin` sorted run | 20 B/tx exactly | ~60 MB | ~60 GB | chunk freeze → window finalization | +| `.idx` | ≈4.2 B/tx | — (per-window) | ~12.5 GB | build → superseded next boundary, or retention | + +Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` floor is one dense in-flight window (~60 GB), bounded by the eager sweep (§7.4). Steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history. + +## 11. Performance + +- **Ingest, hot**: one `(hash, seq)` put per transaction inside the existing per-ledger WriteBatch — no separate sync, no separate store. +- **Ingest, cold**: the in-memory sort of ~3M entries is negligible against the chunk's streaming pass; the `.bin` write is sequential. +- **Rebuild**: a full dense window merges ~60 GB of sorted runs into a ~12.5 GB `.idx` in ≈1 minute (~200 MB/s write burst) — measured by the bench harness (`bench-fullhistory`: `cold-ingest --types=txhash` + `build-txhash-index`). Mid-window rebuilds scale with `hi − lo`. Against a ~14-hour boundary cadence at mainnet rates, the rebuild is ~0.1% duty cycle. +- **Lookup, cold**: one MPHF probe (O(1), a couple of small reads, typically page-cached) + one ledger fetch + hash verification. The index adds no per-history memory: it is a file, read through the page cache. +- **Lookup, hot**: one RocksDB point get in a bloom-filtered CF, then the same ledger fetch. 
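The figures in Part 4 follow from a few multiplications, reproduced in the Go sketch below. The per-transaction costs, the dense transaction counts, and the 10,000-ledger chunk size come from the design. The 5-second ledger close time used for the boundary cadence is an assumption of this sketch, and the helper names are illustrative.

```go
package main

import "fmt"

const (
	txPerChunk      = 3_000_000 // dense chunk, from the footprint table
	chunksPerWindow = 1000      // default chunks_per_txhash_index
	ledgersPerChunk = 10_000
)

// bytesFor returns the total size of a structure that costs bytesPerTx
// bytes per transaction over txCount transactions.
func bytesFor(bytesPerTx float64, txCount int64) float64 {
	return bytesPerTx * float64(txCount)
}

// dutyCycle returns the fraction of one chunk-boundary interval spent
// rebuilding, given the rebuild time and an assumed ledger close time.
func dutyCycle(rebuildSeconds, ledgerCloseSeconds float64) float64 {
	return rebuildSeconds / (ledgersPerChunk * ledgerCloseSeconds)
}

func main() {
	txPerWindow := int64(txPerChunk) * chunksPerWindow

	fmt.Printf("hot txhash CF per chunk: %.0f MB raw\n", bytesFor(36, txPerChunk)/1e6)
	fmt.Printf(".bin per chunk:          %.0f MB\n", bytesFor(20, txPerChunk)/1e6)
	fmt.Printf(".bin per window:         %.0f GB\n", bytesFor(20, txPerWindow)/1e9)
	fmt.Printf(".idx per window:         %.1f GB\n", bytesFor(4.2, txPerWindow)/1e9)
	fmt.Printf("rebuild duty cycle:      %.2f%%\n", 100*dutyCycle(60, 5))
}
```

With these inputs the `.idx` estimate comes out slightly above the quoted ~12.5 GB because 4.2 B/tx is itself a rounded figure, and the duty cycle lands near the quoted ~0.1%.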
+ +--- + +## Related documents + +- [full-history-streaming-workflow.md](./full-history-streaming-workflow.md) — the daemon this subsystem lives in: geometry, the meta store and one write protocol, `processChunk`, the resolver and executor, the lifecycle tick (freeze → rebuild → discard → prune), and the correctness invariants (INV-1 … INV-4) with their audits. +- The reader / query-routing design — how readers obtain current coverage and dispatch between hot DBs and frozen files across transitions. +- [getevents-full-history-design.md](./getevents-full-history-design.md) — the sibling subsystem (events), same hot/cold architecture over the same chunk geometry. +- [packfile-library.md](./packfile-library.md) — the `.pack` format the read path's ledger fetch lands on. +- `bench-fullhistory` — the measurement harness behind every figure in Part 4. diff --git a/full-history/design-docs/README.md b/full-history/design-docs/README.md deleted file mode 100644 index e682068d6..000000000 --- a/full-history/design-docs/README.md +++ /dev/null @@ -1,26 +0,0 @@ -# Stellar Full History RPC Service — Design Docs - -> **Scope**: Backfill pipeline only. Streaming pipeline design is covered separately. - -## Documents - -| Document | Description | -|----------|-------------| -| [03-backfill-workflow.md](./03-backfill-workflow.md) | Complete backfill design — geometry, meta store keys, directory layout, configuration, DAG task graph, execution model, crash recovery, getStatus API | - -The backfill doc is self-contained. Read it top-to-bottom for the full picture. - -## Quick Context - -The Stellar Full History RPC Service ingests the complete blockchain history. Primary use cases: - -- Retrieve any ledger from history -- Retrieve any transaction from history -- Retrieve any events with filter matching from history - -It has two modes: - -- **Backfill** — offline bulk import. Writes directly to immutable files (LFS chunks + RecSplit indexes). No RocksDB, no queries during ingestion. 
DAG-scheduled with a flat worker pool. -- **Streaming** — real-time ingestion via CaptiveStellarCore. Writes to RocksDB active stores, serves queries, transitions to immutable storage at index boundaries. Covered in a separate design doc. - -These modes are fully independent — separate code, separate crash recovery, separate transition workflows. From d18a26b00ae6c2c7d4fdd0d36099cd44cb5c2626 Mon Sep 17 00:00:00 2001 From: tamirms Date: Mon, 15 Jun 2026 20:51:24 +0200 Subject: [PATCH 02/18] docs(full-history): address Codex review — variable index payload, genesis-corner fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follow-up to the design-doc PR addressing the automated review. gettransaction (#1): size the .idx ledger-seq payload to the window — payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8) — instead of a fixed 3 bytes, so the format never caps chunks_per_txhash_index (3 bytes up to 1677 chunks, 4 bytes beyond). Capacity references qualified as default-window figures. streaming (#2): fix the young-genesis corner case at the root. Add the chunk-id convention to Geometry — signed ids, chunkID floor-divides so the earliest_ledger-1 watermark sentinel maps to chunk -1 with chunkLastLedger(-1) = 1, and the -1 sentinel is never serialized. With it, deriveCompleteThrough's cold term (highestDurableChunk returns -1 on a fresh start) and positional term (maxChunk-1 = -1 at chunk 0) degrade to the pre-genesis sentinel instead of a spurious chunk-0 bound that would resume a young network past its tip; the startup mid-chunk test reads the sentinel as a chunk boundary, no special-case guard needed. Codex finding #3 (earliest_ledger raise path) was a false positive: the doc consistently frames earliest_ledger movement as a deferred operation gated behind a future admin command. 
Co-Authored-By: Claude Fable 5 --- .../full-history-streaming-workflow.md | 24 +++++++++++++++---- .../gettransaction-full-history-design.md | 10 ++++---- 2 files changed, 24 insertions(+), 10 deletions(-) diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index 90ef64e72..0333d06f1 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -20,13 +20,15 @@ The Stellar blockchain starts at ledger 2 (`GENESIS_LEDGER`). Two units organize - **Window** (tx-hash index) — `chunks_per_txhash_index` chunks (default 1000 = 10M ledgers). The unit of the rolling tx-hash index. Configurable, but immutable once stored. ``` -chunkID(seq) = (seq - 2) / 10_000 +chunkID(seq) = floor((seq - 2) / 10_000) chunkFirstLedger(c) = c * 10_000 + 2 chunkLastLedger(c) = (c + 1) * 10_000 + 1 indexID(c) = c / chunks_per_txhash_index # takes a CHUNK id chunksInIndex(w) = [w*cpi, (w+1)*cpi - 1] # cpi = chunks_per_txhash_index ``` +Chunk ids are **signed**, and `chunkID` uses floor division. The only sub-genesis sequence the daemon ever forms is the "nothing ingested" watermark sentinel `earliest_ledger - 1` (which is `1` when `earliest_ledger` is genesis); floor division maps it to **chunk −1**, and `chunkLastLedger(-1) = 1` reproduces the sentinel. Chunk −1 means "before the first chunk" and exists only as a transient in derivation arithmetic — the cold and positional terms of `deriveCompleteThrough`, and the watermark mid-chunk test at startup. It is never serialized: every chunk id written to a meta-store key or file path is a real chunk `≥ 0`, so `%08d` only ever sees non-negative ids. (`chunkID(seq)` for an in-range `seq ≥ GENESIS_LEDGER` is unaffected — floor and truncating division agree on non-negative numerators.) + All chunk and window ids use uniform `%08d` zero-padding. 
Example, default `chunks_per_txhash_index = 1000`: | Window | First ledger | Last ledger | Chunks | @@ -224,11 +226,16 @@ func deriveCompleteThrough(cat Catalog) uint32 { // frozen": a crash mid-freeze can leave lfs frozen while events is still // "freezing", and counting that chunk would let reads open over a // partial artifact. An incompletely frozen tip chunk must DEGRADE the - // bound so catch-up / re-ingestion repairs it. + // bound so catch-up / re-ingestion repairs it. highestDurableChunk + // returns -1 when NO chunk is durable (a fresh start), so the cold term + // is then chunkLastLedger(-1) = 1 — the pre-genesis sentinel — never a + // spurious chunk-0 bound that would resume a young network past its tip. through := chunkLastLedger(highestDurableChunk(cat)) // Positional term. hotChunkKeys returns every hot:chunk:* key regardless // of value — counting a "transient" key is sound because it is only ever - // put after the predecessor chunk's write handle closed. + // put after the predecessor chunk's write handle closed. When the live + // chunk is chunk 0 (a young genesis network), maxChunk-1 = -1 and + // chunkLastLedger(-1) = 1: nothing below chunk 0 is complete. if hot := hotChunkKeys(cat); len(hot) > 0 { through = max(through, chunkLastLedger(maxChunk(hot)-1)) } @@ -687,8 +694,10 @@ func startStreaming(ctx context.Context, cfg Config) error { // appear at the tip; backfilledThrough guards against infinite re-passes // when the tip stops moving (a fixed rangeEnd matching the previous // iteration breaks the loop). Edge case: on a network younger than one - // chunk, rangeEnd = lastCompleteChunkAt(anchor) = -1 — the - // rangeEnd < rangeStart guard catches it cleanly. + // chunk, rangeEnd = lastCompleteChunkAt(anchor) = -1, and the watermark + // sentinel reads as a chunk boundary (Geometry convention) so the + // mid-chunk branch below leaves rangeEnd at -1 — the rangeEnd < rangeStart + // guard then catches it cleanly. 
backfilledThrough := int64(-1) for { tip := backendNetworkTip(cfg) @@ -702,6 +711,11 @@ func startStreaming(ctx context.Context, cfg Config) error { // (produced locally via catchupSource's hot branch) — the bulk // backend is never asked for them. rangeEnd := lastCompleteChunkAt(anchor) + // The watermark sentinel (lastCommitted = earliest_ledger-1, e.g. 1 on + // a genesis fresh start) sits on a chunk boundary by construction — + // earliest_ledger is chunk-aligned — so chunkID maps it to its chunk + // (chunk -1 for the genesis sentinel, per the Geometry convention) and + // this reads false, never spuriously mid-chunk. watermarkMidChunk := lastCommitted != chunkLastLedger(chunkID(lastCommitted)) withinOneChunkOfTip := int64(tip)-int64(lastCommitted) < LedgersPerChunk // ^ signed: a lagging bulk tip can sit BELOW the resume point diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md index 9cf8fa827..ebd7d25df 100644 --- a/design-docs/gettransaction-full-history-design.md +++ b/design-docs/gettransaction-full-history-design.md @@ -127,10 +127,10 @@ A `.bin` lives as long as its window needs it as rebuild input: every boundary r `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, meta-store key `index:{window:08d}:{lo:08d}:{hi:08d}`. One streamhash minimal-perfect-hash file per **coverage**, built by streamhash's `SortedBuilder` over the k-way merge of `.bin[lo..hi]`, with the cold-txhash option set: -- **Payload: 3 bytes** — the ledger seq stored as an offset from `MinLedger`, where `MinLedger = chunkFirstLedger(lo)` is derived from the build range and embedded in the file as user metadata. 3 bytes spans 16.7M ledgers, comfortably over a 10M-ledger window. No sidecar metadata. +- **Payload: `payloadWidth` bytes** — the ledger seq stored as an offset from `MinLedger`, where `MinLedger = chunkFirstLedger(lo)` is derived from the build range. 
The width is sized to the window so the format never caps `chunks_per_txhash_index`: `payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8)`, the bytes needed to hold the largest in-window offset (`chunks_per_txhash_index * 10_000 - 1`). At the default 1000 chunks (10M ledgers) this is **3 bytes** — a 24-bit offset spans 16.77M ledgers — and a window of 1678+ chunks (>16.77M ledgers) widens it to 4. Since `chunks_per_txhash_index` is immutable once stored, the width is fixed for every window's life; like `MinLedger`, it is embedded in the file as user metadata and read back at lookup time. No sidecar metadata. - **Fingerprint: 1 byte** — screens foreign keys (§8.2). -All-in, the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. +All-in, at the default 3-byte payload the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. A window past the 4-byte payload threshold adds one byte per transaction. These formats match the measured pipeline in the bench harness (`bench-fullhistory`: `cold-ingest --types=txhash` + `build-txhash-index`), which is where the performance figures in Part 4 come from — adopting the formats unchanged is what makes those figures transfer to this design. @@ -288,7 +288,7 @@ resolve the window's unique "frozen" key → open {lo}-{hi}.idx → MPHF probe on the hash's 16-byte prefix → fingerprint check (1 byte) — rejects ~255/256 of foreign keys - → seq = MinLedger + payload (3 bytes) + → seq = MinLedger + payload (payloadWidth bytes) → retention gate: seq ≥ floor? 
— else not-found, no file access → fetch the LCM for seq, extract the tx → verify the full 32-byte hash — the correctness backstop @@ -339,9 +339,9 @@ Per dense chunk (~3M transactions) and dense window (default 1000 chunks, ~3×10 |---|---|---|---|---| | hot `txhash` CF | 36 B/tx raw (32 key + 4 value), before RocksDB overhead | ~110 MB raw | — (per-chunk) | chunk ingestion → index coverage | | `.bin` sorted run | 20 B/tx exactly | ~60 MB | ~60 GB | chunk freeze → window finalization | -| `.idx` | ≈4.2 B/tx | — (per-window) | ~12.5 GB | build → superseded next boundary, or retention | +| `.idx` | ≈4.2 B/tx (3-byte payload) | — (per-window) | ~12.5 GB | build → superseded next boundary, or retention | -Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` floor is one dense in-flight window (~60 GB), bounded by the eager sweep (§7.4). Steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history. +Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` floor is one dense in-flight window (~60 GB), bounded by the eager sweep (§7.4). Steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history (at the default window; +1 B/tx past the 4-byte payload threshold). ## 11. Performance From 1c6c1adc90c24b8e5662e0fde56dc0f26882acbf Mon Sep 17 00:00:00 2001 From: tamirms Date: Mon, 15 Jun 2026 22:48:40 +0200 Subject: [PATCH 03/18] docs(full-history): genesis chunk-id convention, multi-window tx lookup, scope/wording cleanup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follow-up to the design-doc review. 
Genesis corner (streaming): replace the per-site sentinel guard with one chunk-id convention in Geometry — chunk ids are signed and chunkID floor- divides, so the earliest_ledger-1 watermark sentinel maps to chunk -1 with chunkLastLedger(-1) = 1, never serialized. deriveCompleteThrough's cold and positional terms are then correct by construction on a young genesis network (highestDurableChunk returns -1; maxChunk-1 = -1 at chunk 0), instead of a spurious chunk-0 bound that would resume past the tip. getTransaction read path: a cold lookup probes EVERY in-retention window's .idx (a hash carries no window hint); correctness is the streamhash fingerprint screen plus the mandatory fetch-and-verify, which confirms at most one window. payloadWidth/fpWidth are streamhash config knobs. Scope: how reads stay correct while sweeps and pruning unlink files concurrently (the ENOENT races, coverage re-resolution) is the query-routing design's concern, per the doc's own deferral — trim tx 8.4 and the streaming reader-retention-contract section to the in-scope contract plus that deferral. Wording: drop vacuous/overselling phrasing across both docs (the "one home" framing, "bounded by RAM", "structurally unreachable", filler qualifiers). 
Co-Authored-By: Claude Fable 5 --- .../full-history-streaming-workflow.md | 20 ++---- .../gettransaction-full-history-design.md | 70 ++++++++++--------- 2 files changed, 40 insertions(+), 50 deletions(-) diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index 0333d06f1..89fd67fb6 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -318,7 +318,7 @@ Each coverage runs the same lifecycle as every per-chunk artifact: The *window-level* progression — coverage advancing boundary by boundary, then finalization — emerges from the coverage chain: each boundary freezes the widened coverage and demotes its predecessor in one atomic batch, and the terminal key is the one whose `hi` equals the window's last chunk. The batch maintains **at most one frozen coverage per window at all times** — a crash at any instant leaves either the old coverage frozen (batch not landed; the new one is `"freezing"` debris) or the new one frozen (predecessor already `"pruning"`), never both, never neither. -Why rewriting coverage-named files in place is safe — the question to ask, since readers hold the live `.idx` open while the next one is written — is argued in full in [the transactions design](./gettransaction-full-history-design.md) (§7.5). Four facts carry it: the build's skip rule (no scheduled build ever targets the name readers resolve), the stage ordering plus the eager sweep's window-locality and pruning-only scope, the floor's monotonicity within a run plus reader handles dying with the process, and the merge's determinism. A change to any of the four must re-prove the argument. +Why rewriting coverage-named files in place is safe — readers hold the live `.idx` open while the next coverage is written into the same directory — is argued in full in [the transactions design](./gettransaction-full-history-design.md) (§7.5). 
Four facts carry it: the build's skip rule (no scheduled build ever targets the name readers resolve), the stage ordering plus the eager sweep's window-locality and pruning-only scope, the floor's monotonicity within a run plus reader handles dying with the process, and the merge's determinism. A change to any of the four must re-prove the argument. ### Hot DB lifecycle @@ -614,7 +614,7 @@ func executePlan(ctx context.Context, plan Plan, cfg Config) error { - **`cfg.Workers` is the only resource knob** (default `GOMAXPROCS`). The goroutines are structure, not resources: thousands may exist, parked either on the semaphore (queued tasks) or on done-channels (builds awaiting inputs), costing a few KB each; at most `Workers` tasks execute at any instant, drawn from all windows' eligible work mixed together. An index build fires the moment its own in-coverage chunk builds finish, without waiting on other windows. (The derived wait slightly over-approximates — a build also waits on an in-coverage chunk producing only `lfs`/`events` — which is harmless: waiting longer is always safe, and the case arises only in widening scenarios.) - The executor runs each `IndexBuild` via `buildThenSweep` (defined with `resolve` above), which lands the commit batch (terminal for complete windows) and then runs the eager `"pruning"` sweep (rule 4). The sweep is window-local — this window's demoted inputs and superseded coverages, not a store-wide scan — so concurrent windows' sweeps touch disjoint keys, and `fsyncDir` on a bucket dir shared with another window's in-flight `.bin` writes is safe (a dir fsync with concurrent creates just makes more entries durable). 
-- Done-channels broadcast *completion*, not success: a chunk build that exhausts its retries still closes its channel (the `defer`), so a dependent index build can win the race against context cancellation and start — whereupon it fails `buildTxhashIndex`'s loud `.bin` precondition check before writing any key, landing on the same abort-and-restart path as the original failure. The precondition check is load-bearing here, by design. +- Done-channels broadcast *completion*, not success: a chunk build that exhausts its retries still closes its channel (the `defer`), so a dependent index build can win the race against context cancellation and start — whereupon it fails `buildTxhashIndex`'s loud `.bin` precondition check before writing any key, landing on the same abort-and-restart path as the original failure. The precondition check is load-bearing here. - A task that exhausts its retries aborts the daemon, per the [error policy](#lifecycle); restart re-resolves from durable keys, and completed work never repeats. - **Single-process enforcement:** the meta store holds a kernel `flock` on a `LOCK` file; a second daemon opening the **same meta-store path** fails immediately, and the lock releases on any process exit (including `kill -9`). Because `[meta_store]` and each `[immutable_storage.*]` path are independently configurable, the meta-store lock alone cannot stop two daemons with *different* meta stores from sharing one artifact tree — the daemon therefore also takes a `flock` in each configured storage root. @@ -1179,21 +1179,9 @@ Every arrow is the one write protocol or its exit sweep; at the end of the tick ## Reader retention contract -A read for any seq below `effectiveRetentionFloor` returns *not found*, regardless of whether the underlying file still exists on disk. 
This is what lets pruning remove chunks the moment they pass retention, without coordinating with the index lifecycle: a stale `.idx` may resolve a tx-hash to a `.pack` that's been deleted, but the retention check at the top of the reader short-circuits the lookup before any file access. From the caller's perspective, retention is the single source of truth for "is this data available?" +A read for any seq below `effectiveRetentionFloor` returns *not found*, regardless of whether the underlying file still exists on disk. This is the contract that lets pruning remove chunks the moment they pass retention **without coordinating with the index lifecycle**: a stale `.idx` may resolve a tx-hash to a `.pack` that's been deleted, but a below-floor read is not-found regardless. From the storage layer's perspective, retention is the single source of truth for "is this data available?", and it is all the prune and sweep stages rely on. -For tx-hash lookups specifically, the reader walks two tiers: - -| Chunk state | Served from | -|---|---| -| at or below the frozen key's `hi` | the `.idx` named by the window's unique `"frozen"` key (filename derived from the key name) | -| above `hi` (live, or frozen and awaiting coverage) | the `txhash` CF of the chunk's hot DB | - -The transition is gap-free by write ordering: the hot DB is discarded only after the durable `.idx` covers the chunk, and the rebuild commits atomically (one batch that promotes a fully-written coverage and demotes its predecessor). Two `ENOENT` sites need explicit rules — stated here because INV-1 depends on them, argued in full in [the transactions design](./gettransaction-full-history-design.md) (§8.4): - -- **Index-file `ENOENT`** (the reader resolved the frozen key, then lost the open race to a sweep): **re-resolve and retry**, never surface not-found directly. 
After a *supersession* demotion the retry always finds a frozen key — the commit that demoted the old coverage promoted its replacement in the same batch; after a *retention* demotion it finds none, and that not-found is legitimate — only retention pruning ever empties a window, so the queried seq is by then below the floor anyway. -- **Data-file `ENOENT`** (a live `.idx` resolved the hash into a `.pack` pruning has removed — a floor-straddling window's static `lo` keeps covering pruned chunks): **not-found directly**, no re-resolve. Artifacts are write-once and deletion is unlink-only, so a missing data file can only mean the chunk was pruned — never that wrong bytes could be served. Ordinarily the top-of-reader retention check short-circuits these reads before any file access; this rule is what keeps them fail-soft in the one state where it doesn't (INV-1's hot-volume-loss exception). - -How the reader dispatches between hot DBs and frozen files for in-retention queries — and how it stays consistent across rebuild/freeze/discard transitions — is the query-routing design, out of scope for this doc. +How the reader actually dispatches between hot DBs and frozen `.idx` files, and how it stays correct while sweeps and pruning unlink files concurrently with in-flight reads (tier dispatch, coverage re-resolution, the file-vanishes-mid-read cases), is the **query-routing design's** concern — out of scope here and in the transactions design (§8.4). --- diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md index ebd7d25df..34b988c2b 100644 --- a/design-docs/gettransaction-full-history-design.md +++ b/design-docs/gettransaction-full-history-design.md @@ -2,7 +2,7 @@ ## Summary -How the full-history daemon ingests and serves transactions for the tx-by-hash endpoint (`getTransaction`). 
A transaction lookup is a two-step read: resolve the hash to a ledger sequence, then fetch the transaction from that ledger's stored LCM. This document is the canonical reference for the resolution structure end to end — the hot tier (a column family in the per-chunk hot RocksDB) and the cold tier (per-chunk sorted runs merged into a per-window minimal-perfect-hash index), the file formats, the rolling rebuild that keeps the cold tier current, the read path, and the capacity numbers. +How the full-history daemon ingests and serves transactions for the tx-by-hash endpoint (`getTransaction`). A transaction lookup is a two-step read: resolve the hash to a ledger sequence, then fetch the transaction from that ledger's stored LCM. This document covers the resolution structure: the hot tier (a column family in the per-chunk hot RocksDB) and the cold tier (per-chunk sorted runs merged into a per-window minimal-perfect-hash index), the file formats, the rolling rebuild that keeps the cold tier current, the read path, and the capacity numbers. The daemon context — chunk geometry, the meta store, the one write protocol, catch-up, and the lifecycle tick — is defined in [full-history-streaming-workflow.md](./full-history-streaming-workflow.md) (the streaming doc). That document references this one for everything tx-hash-specific and restates only what its own protocols depend on. @@ -15,8 +15,8 @@ The daemon context — chunk geometry, the meta store, the one write protocol, c Serve `getTransaction(hash)` for any transaction whose ledger falls within the retention window (full history by default): - **Complete.** Every transaction in every in-retention ledger is resolvable by hash. No gaps, including across crashes, restarts, and retention changes. -- **Correct.** A lookup never returns the wrong transaction. A missing or out-of-retention transaction returns not-found — never an error dressed as data, never stale bytes. 
-- **Cheap to serve.** A cold lookup costs one index probe plus one ledger fetch. Memory does not scale with history size. +- **Correct.** A lookup never returns the wrong transaction; a missing or out-of-retention one returns not-found. +- **No in-memory index.** The hash→seq map is on-disk `.idx` files (read through the page cache), not a RAM-resident structure sized to the transaction count — so the daemon holds no memory proportional to the number of transactions in history. (A lookup probes one `.idx` per in-retention window — a hash carries no window hint; that probe set and its cost are the query-routing design's concern.) - **Cheap to maintain.** Ingestion adds negligible cost to the per-ledger write path, and the cold index stays current with a rebuild that is small relative to its cadence. Out of scope: how readers obtain the daemon's current coverage and dispatch between tiers across rebuild/freeze/discard transitions (the query-routing design), and the storage of the transactions themselves (the ledger store — `.pack` files and the hot `ledgers` CF — covered by the streaming doc and the packfile library doc). @@ -34,15 +34,15 @@ The subsystem this document owns is the **hash → seq map**, plus the read-path - **Point lookups only.** There are no range or prefix queries over tx hashes, so order-preserving structures buy nothing — perfect-hash structures apply. - **Hashes are uniform and immutable.** A transaction hash is never updated and corresponds to at most one applied transaction (the network's replay protection); the map is append-only, one batch of entries per ledger. -- **The full transaction is always fetched anyway.** The response needs the envelope/result/meta, so the read path always ends with the ledger store and can verify the full hash against the fetched transaction. 
The map therefore doesn't need to be exact — it needs to be *complete* (no false negatives) and *cheap*, with false positives screened first by a fingerprint and finally by the fetch-and-verify step. +- **The full transaction is always fetched anyway.** The response needs the envelope/result/meta, so the read path always ends with the ledger store and can verify the full hash against the fetched transaction. The map therefore doesn't need to be exact — only *complete* (no false negatives); false positives are screened first by a fingerprint and finally by the fetch-and-verify step. --- # Part 2: Architecture -## 3. Two tiers, one home +## 3. The two tiers -At any moment, every in-retention transaction hash has **exactly one queryable home**: +An in-retention transaction is stored in exactly one place — one tier, one window, never duplicated — but a bare hash doesn't say *which*, so a lookup probes every home (§8.1) and at most one confirms (none, if the hash isn't there). The two homes: | Tier | Structure | Serves | |---|---|---| @@ -128,7 +128,7 @@ A `.bin` lives as long as its window needs it as rebuild input: every boundary r `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, meta-store key `index:{window:08d}:{lo:08d}:{hi:08d}`. One streamhash minimal-perfect-hash file per **coverage**, built by streamhash's `SortedBuilder` over the k-way merge of `.bin[lo..hi]`, with the cold-txhash option set: - **Payload: `payloadWidth` bytes** — the ledger seq stored as an offset from `MinLedger`, where `MinLedger = chunkFirstLedger(lo)` is derived from the build range. The width is sized to the window so the format never caps `chunks_per_txhash_index`: `payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8)`, the bytes needed to hold the largest in-window offset (`chunks_per_txhash_index * 10_000 - 1`). 
At the default 1000 chunks (10M ledgers) this is **3 bytes** — a 24-bit offset spans 16.77M ledgers — and a window of 1678+ chunks (>16.77M ledgers) widens it to 4. Since `chunks_per_txhash_index` is immutable once stored, the width is fixed for every window's life; like `MinLedger`, it is embedded in the file as user metadata and read back at lookup time. No sidecar metadata. -- **Fingerprint: 1 byte** — screens foreign keys (§8.2). +- **Fingerprint: `fpWidth` bytes (default 1)** — a streamhash option screening foreign keys before fetch-and-verify. Since a hash lookup probes every in-retention window (§8.2), a wider fingerprint trades index size (+1 byte/tx) for fewer false-positive fetches across those windows. Fixed per build, like `payloadWidth`. All-in, at the default 3-byte payload the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. A window past the 4-byte payload threshold adds one byte per transaction. @@ -155,7 +155,7 @@ So the `.idx` hashes exactly the transactions in chunks `[lo, hi]`. Chunks below The current window's index is **re-derived from scratch on every chunk boundary** to absorb the chunk that just froze, growing until its window completes. Only the window the tip is in is ever rebuilt; a finalized window's index is static. -The rebuild is cheap relative to its cadence: a full-window build is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates (Part 4). That ratio is what buys the design's simplicity — because the index is always rebuilt whole from sorted inputs: +The rebuild is cheap relative to its cadence: a full-window build is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates (Part 4). 
That headroom is what lets the index be rebuilt whole from sorted inputs every boundary, rather than updated incrementally: - There is no incremental-update machinery and no partially-updated index state for a crash to expose. Every `.idx` on disk is a complete, deterministic function of its coverage. - `lo` tracks the floor and `hi` tracks the tip automatically — no separate floor-driven rebuild is ever needed while the window is current. @@ -241,7 +241,7 @@ After finalization the `.idx` is static. If the floor later advances *within* th The commit batch only demotes keys; all file deletion happens through the streaming doc's key-driven sweeps (unlink → `fsyncDir` → delete key — key absent ⟹ file gone). Two call sites: - **Eagerly, inside every `IndexBuild`'s execution** (`buildThenSweep`, right after the commit batch, in both regimes): sweep the window's superseded coverage and, after a terminal build, its demoted `.bin` inputs. The sweep is **window-local** — it walks only this window's keys, so concurrent windows' sweeps touch disjoint keys and files. -- **The tick's prune scan** — the crash backstop, and the owner of retention pruning. A `"freezing"` index key it observes was *not* retried (builds run before the sweep in every regime), so its coverage is no longer desired: **delete file and key, never salvage**. The file might even be complete, but proving that buys nothing — a rebuild re-derives identical bytes — and a single no-questions rule collapses the crash inventory. +- **The tick's prune scan** — the crash backstop, and the owner of retention pruning. A `"freezing"` index key it observes was *not* retried (builds run before the sweep in every regime), so its coverage is no longer desired: **delete file and key, never salvage**. The file might even be complete, but proving that buys nothing — a rebuild re-derives identical bytes — so one no-questions rule covers every crashed attempt. The eager site is what bounds disk. 
Without it, a long backfill would accumulate every finalized window's demoted `.bin`s until the first tick (≈20 bytes per transaction across all of history); with it, transient `.bin` disk is bounded by the windows actually in flight — the floor is one dense window's worth (≈60 GB), irreducible because a window's build merges all of its runs at once. @@ -249,7 +249,7 @@ The eager site is what bounds disk. Without it, a long backfill would accumulate ### 7.5 Why rewriting coverage-named files in place is safe -The question to ask, since readers hold the live `.idx` open while the next coverage is written. Four facts carry the argument: +The hazard: a reader holds the live `.idx` open while the next coverage is written into the same window directory. Four facts make the in-place rewrite safe anyway: 1. **The skip rule.** A build's target name equals the live file's name only when its coverage equals the frozen coverage — which is exactly the case the skip check returns on. So no scheduled build ever opens the file readers resolve. 2. **Stage ordering and sweep scope.** No sweep runs where a build could collide with it: the `"freezing"`-key sweep — the only sweep that can touch a name a future build may target — lives solely in the tick's prune scan, which follows the plan stage (and catch-up precedes the lifecycle goroutine entirely); the eager sweep inside `buildThenSweep` touches only `"pruning"` keys in its own window, strictly after that window's commit; and a plan holds at most one `IndexBuild` per window — so concurrent windows' sweeps and builds touch disjoint keys and files. 
@@ -272,46 +272,48 @@ At no crash instant are two coverages frozen, or none (once the window has one), ### 8.1 Routing -For a hash lookup, the reader walks two tiers: +A hash names no ledger, so the reader cannot know which home holds it in advance — it **probes them all**, and the hash resolves in exactly one: -| Chunk state | Served from | -|---|---| -| at or below the frozen key's `hi` | the `.idx` named by the window's unique `"frozen"` key (filename derived from the key name) | -| above `hi` (live, or frozen and awaiting coverage) | the `txhash` CF of the chunk's hot DB | +| Tier | Probe set | How | +|---|---|---| +| cold — one `.idx` per window | **every in-retention window** | MPHF + fingerprint + verify (§8.2) | +| hot — `txhash` CF per chunk | the chunks above any window's `hi` (live, or frozen awaiting coverage) | exact full-key get (§8.3) | -How the reader learns the current coverage and holds it consistently across rebuilds is the query-routing design's concern; this document requires only that the union of the two tiers covers the retention window — which the discard gate (§5.3) and the uniqueness invariant (§6.3) guarantee — and that each chunk has exactly one home at any moment. +The hot tier is a few chunks at most — one window's tail, normally just the live chunk — so the probe set is `≈ (in-retention windows) + (a handful of chunks)`. How the reader learns current coverage and stays consistent across rebuilds is the query-routing design's concern; this document requires only that the homes' union covers the retention window — guaranteed by the discard gate (§5.3) and the uniqueness invariant (§6.3) — and that each ledger has exactly one home, so **at most one probe confirms** — the verify runs on every fingerprint hit but succeeds for at most one. 
### 8.2 Cold lookup +The cold tier **probes every in-retention window's `.idx`** — a hash gives no window hint (the window is `chunkID(seq) / chunks_per_txhash_index`, and `seq` is exactly what the lookup is trying to find), so there is nothing to pre-select. Each window probe: + ``` -resolve the window's unique "frozen" key - → open {lo}-{hi}.idx +for each in-retention window (its unique "frozen" key → {lo}-{hi}.idx): → MPHF probe on the hash's 16-byte prefix - → fingerprint check (1 byte) — rejects ~255/256 of foreign keys - → seq = MinLedger + payload (payloadWidth bytes) - → retention gate: seq ≥ floor? — else not-found, no file access - → fetch the LCM for seq, extract the tx - → verify the full 32-byte hash — the correctness backstop - → respond (or not-found on mismatch) + → fingerprint check (fpWidth bytes) — miss ⇒ skip this window + → on a fingerprint hit: + seq = MinLedger + payload (payloadWidth bytes) + retention gate: seq ≥ floor? — else skip this window + fetch the LCM for seq, extract the tx + verify the full 32-byte hash — confirms, or rejects a false positive +respond on the confirmed hit; not-found if no window confirms ``` -The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. It also makes 16-byte prefix collisions harmless to serving: two distinct in-set hashes sharing a prefix would be a ~10⁻²⁰-per-window event (birthday bound over ~3×10⁹ keys against 2¹²⁸), but even then the verify step returns not-found rather than the wrong transaction. Wrong data is structurally unreachable from this path. +Because the hash belongs to at most one window, **at most one window confirms**; a not-found lookup — a non-existent or not-yet-ingested hash — confirms none and must rule out every in-retention window. 
-### 8.3 Hot lookup +The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. It also makes 16-byte prefix collisions harmless to serving: two distinct in-set hashes sharing a prefix would be a ~10⁻²⁰-per-window event (birthday bound over ~3×10⁹ keys against 2¹²⁸), but even then the verify step returns not-found rather than the wrong transaction. -Chunks above `hi` are probed in their hot DBs' `txhash` CF — an exact full-key point get, so misses are genuine misses with no verification subtleties (the fetch-and-verify still runs, as the response needs the transaction anyway). In steady state this tier is the live chunk plus, briefly, the chunk inside the freeze-to-coverage interval; after catch-up or a crash it can be several chunks, shrinking as rebuilds advance `hi`. +**Probe ordering, parallelism, early-stop, and the resulting latency and I/O are the query-routing design's concern** (§8.1), out of scope here. -### 8.4 ENOENT rules and the retention gate +### 8.3 Hot lookup -A read for any seq below `effectiveRetentionFloor` returns **not-found** before any file access, regardless of whether the underlying file still exists — retention is the single source of truth for "is this data available?". This is what lets pruning remove chunks the moment they pass retention without coordinating with the index lifecycle. Two `ENOENT` sites then need explicit rules: +Chunks above `hi` are probed in their hot DBs' `txhash` CF — an exact full-key point get, so misses are genuine misses with no verification subtleties (the fetch-and-verify still runs, as the response needs the transaction anyway). 
In steady state this tier is the live chunk plus, briefly, the chunk inside the freeze-to-coverage interval; after catch-up or a crash it can be several chunks, shrinking as rebuilds advance `hi`. -**Index-file `ENOENT` → re-resolve and retry; never surface directly.** A reader that resolves the frozen key and then loses the open race to a sweep (the key was demoted and its file unlinked between resolve and open) gets `ENOENT`. After a *supersession* demotion the retry always finds a frozen key: the commit that demoted the old coverage promoted its replacement in the same batch. After a *retention* demotion (the prune stage sweeping a window that just fell wholly past the floor) the retry finds **no** frozen key — and that is the legitimate not-found: only retention pruning ever empties a window, so the queried seq is by then below the floor, and a fresh request would be short-circuited by the retention gate anyway. (A reader already holding the old file open never even notices: POSIX unlink doesn't invalidate open handles; it picks up the new coverage on its next key resolution.) +### 8.4 Reads and concurrent pruning -**Data-file `ENOENT` → not-found directly; never re-resolve.** A `.pack` the `.idx` resolved the hash into can be missing while the `.idx` itself is live: a floor-straddling window keeps its frozen index key (index keys are swept only when the window falls *wholly* past the floor), so its static `lo` keeps covering chunks whose files pruning has removed. Re-resolving is useless — the same frozen key resolves the hash identically — and unnecessary: artifacts are write-once and deletion is unlink-only, so a missing data file can only mean the chunk was pruned, never that wrong bytes could be served. 
Ordinarily the retention gate short-circuits these reads before any file access; this rule is what keeps them fail-soft in the one state where it doesn't (the streaming doc's INV-1 hot-volume-loss exception, where a transiently regressed floor admits already-pruned bottom chunks). +The cold-tier lifecycle unlinks files concurrently with in-flight reads — sweeps remove superseded `.idx` coverages (§7.4), and retention pruning removes a window's `.idx` and its chunks' `.pack`s. What makes that safe to do **unilaterally** is the streaming doc's reader retention contract: a read for any seq below `effectiveRetentionFloor` is not-found regardless of what is still on disk. How a read stays correct across these transitions otherwise — tier dispatch, coverage re-resolution, a file that vanishes mid-read — is the **query-routing design's** concern (§8.1), out of scope here. ## 9. Catch-up and recovery interaction -The cold tier's converge-from-any-state story is the streaming doc's postcondition-driven resolver; the tx-hash kind contributes the one **per-window rule** (every other kind is per-chunk): +The cold tier converges from any state through the streaming doc's postcondition-driven resolver; the tx-hash kind contributes the one **per-window rule** (every other kind is per-chunk): For each window overlapping the catch-up range, compare the **stored** coverage — `{lo, hi}` from the name of the window's unique frozen index key — with the **desired** coverage `[max(window_start, chunkID(floor)), min(window_last_chunk, range_end)]`. The upper cap is what makes the rule uniform: for a complete window it is the window's last chunk, for the trailing window it is the range end, and no special trailing case exists. 
@@ -325,7 +327,7 @@ Two clauses are load-bearing: **Retention widening** re-derives a finalized window at its new, wider coverage: `.bin`s for previously-covered chunks come from local `.pack`s; fully-pruned chunks refetch from the bulk source; the rebuild is terminal at the wider `[lo', last_chunk]` and its commit batch demotes the old coverage. This runs at the next startup — extending the bottom of storage is exclusively catch-up's job, behind backend validation — never in a tick. One corner needs help from outside the resolver: a widening that re-froze (or left mid-write) a finalized window's `.bin` keys and was then abandoned by narrowing retention back — the resolver correctly schedules nothing (desired ⊆ stored), so the tick's prune scan demotes and sweeps those provably-redundant inputs (`"frozen"` and `"freezing"` alike) — the final `.idx` covers their chunks, and the resolver never re-materializes a covered window. -In the executor, an `IndexBuild` waits on the done-channels of the chunk builds inside its coverage and draws from the same worker pool — the `.bin`s are map-side-sorted runs, the build is the per-window reduce. Done-channels broadcast completion, not success; the build's loud precondition (§7.2) is the backstop, by design. +In the executor, an `IndexBuild` waits on the done-channels of the chunk builds inside its coverage and draws from the same worker pool — the `.bin`s are map-side-sorted runs, the build is the per-window reduce. Done-channels broadcast completion, not success; the build's loud precondition (§7.2) is the backstop. --- @@ -348,7 +350,7 @@ Transient peaks: ~2× the index size in the window dir during each rebuild (~25 - **Ingest, hot**: one `(hash, seq)` put per transaction inside the existing per-ledger WriteBatch — no separate sync, no separate store. - **Ingest, cold**: the in-memory sort of ~3M entries is negligible against the chunk's streaming pass; the `.bin` write is sequential. 
- **Rebuild**: a full dense window merges ~60 GB of sorted runs into a ~12.5 GB `.idx` in ≈1 minute (~200 MB/s write burst) — measured by the bench harness (`bench-fullhistory`: `cold-ingest --types=txhash` + `build-txhash-index`). Mid-window rebuilds scale with `hi − lo`. Against a ~14-hour boundary cadence at mainnet rates, the rebuild is ~0.1% duty cycle. -- **Lookup, cold**: one MPHF probe (O(1), a couple of small reads, typically page-cached) + one ledger fetch + hash verification. The index adds no per-history memory: it is a file, read through the page cache. +- **Lookup, cold**: one MPHF probe per in-retention window — fingerprint screen, then fetch-and-verify on a hit. The hash is in at most one window, so at most one fetch confirms; fingerprint false positives (bounded by `fpWidth`, §6.2) are rejected by the full-hash verify. Probe ordering, parallelism, and the resulting latency/throughput are the query-routing design's concern (§8.1). - **Lookup, hot**: one RocksDB point get in a bloom-filtered CF, then the same ledger fetch. --- From 76d5d231a06721b098070898d7dd4fc722c4e263 Mon Sep 17 00:00:00 2001 From: tamirms Date: Tue, 16 Jun 2026 09:59:42 +0200 Subject: [PATCH 04/18] docs(full-history): reject chunks_per_txhash_index = 0 in validateConfig A zero value would divide-by-zero in indexID / window math (and break payloadWidth) and, pinned on first start, permanently brick the data dir. Validate non-zero before storing the config pin. 
Co-Authored-By: Claude Fable 5 --- design-docs/full-history-streaming-workflow.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index 89fd67fb6..d6a94c883 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -753,7 +753,10 @@ After `runBackfill` returns, every chunk in the backfilled range has `lfs` and ` ```go func validateConfig(cfg Config, cat Catalog) { - // chunks_per_txhash_index immutability check. + // chunks_per_txhash_index: validate, then immutability check. + if cfg.ChunksPerTxhashIndex == 0 { + fatalf("chunks_per_txhash_index must be > 0 (it defines the index layout).") + } if stored, ok := cat.Get("config:chunks_per_txhash_index"); !ok { cat.Put("config:chunks_per_txhash_index", itoa(cfg.ChunksPerTxhashIndex)) } else if stored != itoa(cfg.ChunksPerTxhashIndex) { From 9a51666c50013470bd7ac73a76cada36bfdb4272 Mon Sep 17 00:00:00 2001 From: tamirms Date: Tue, 16 Jun 2026 10:13:02 +0200 Subject: [PATCH 05/18] docs(full-history): sync HTML explorer with the md read-path changes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The explorer had drifted from the design docs: - Read-path explorer showed single-window resolution ("resolve the window's unique frozen key") — the exact bug the md was corrected away from. Rewrote it for probe-all: every in-retention window probed, fingerprint screen + fetch-and-verify, at most one confirms. Scenarios are now Found / Fingerprint false positive / Not found. - Dropped the four ENOENT race scenarios; that read-vs-pruning handling is deferred to the query-routing design (matches the md's 8.4 pullback). - chunkID shown as floor((seq-2)/10_000); payload referenced as payloadWidth (was a fixed 3 bytes). 
Co-Authored-By: Claude Fable 5 --- design-docs/full-history-design-explorer.html | 77 ++++++++----------- 1 file changed, 32 insertions(+), 45 deletions(-) diff --git a/design-docs/full-history-design-explorer.html b/design-docs/full-history-design-explorer.html index 28fc72614..04729788b 100644 --- a/design-docs/full-history-design-explorer.html +++ b/design-docs/full-history-design-explorer.html @@ -813,27 +813,26 @@

Lifecycle goroutine — owns everything below
-

The reader retention contract

+

Reading a transaction

- A read for any seq below effectiveRetentionFloor returns not-found, regardless of - whether the underlying file still exists. This is what lets pruning remove chunks the moment they pass - retention without coordinating with the index lifecycle: retention is the single source of truth for - "is this data available?". For tx-hash lookups, the reader walks two tiers: + A hash names no ledger, so a lookup can't tell which window holds it — the window is + chunkID(seq) / chunks_per_txhash_index, and seq is exactly what the lookup is + trying to find. So the cold tier probes every in-retention window's .idx: an MPHF + probe, a fingerprint screen (fpWidth bytes), and — on a fingerprint hit — a fetch of the + LCM and a verify of the full 32-byte hash. The transaction is in at most one window, so at most one + probe confirms; the verify runs on every fingerprint hit (false positives included) and rejects all + but that one. A not-found lookup confirms none and must rule out every window.

- - - - -
Chunk stateServed from
at or below the frozen key's hithe .idx named by the window's unique "frozen" key
above hi (live, or frozen and awaiting coverage)the txhash CF of the chunk's hot DB

- The transition is gap-free by write ordering: the hot DB is discarded only after the durable - .idx covers the chunk. Two ENOENT sites need explicit rules — walk through - them: + A read for any seq below effectiveRetentionFloor returns not-found regardless of + what's still on disk — the contract that lets pruning unlink files unilaterally. How the reader + stays correct while sweeps and pruning unlink files concurrently with a read (tier dispatch, a file that + vanishes mid-read) is the query-routing design's concern, out of scope here.

Read-path explorer
-
Four lookups, including both ENOENT races. The chain shows each step the reader takes.
+
Three cold lookups over a multi-window retention. The chain shows each per-window probe.
@@ -995,7 +994,7 @@

Why convergence works

   cb.querySelector(".marker").style.left = (cfrac * 100) + "%";
   cb.querySelector(".lbl").textContent = "ledger " + fmt(seq - cFirst + 1) + " of 10,000";
   $("#geo-readout").innerHTML = [
-    'chunkID(seq) = (seq − 2) / 10,000 = ' + fmt(chunk) + "",
+    'chunkID(seq) = floor((seq − 2) / 10,000) = ' + fmt(chunk) + "",
     'chunk spans ' + fmt(cFirst) + " – " + fmt(cLast) + "",
     'indexID(chunk) = chunk / 1000 = ' + fmt(win) + "",
     'window spans chunks ' + fmt(wcLo) + "–" + fmt(wcHi) + ' = ledgers ' + fmt(wFirst) + " – " + fmt(wLast) + "",
@@ -1550,42 +1549,30 @@

Why convergence works

 /* ---------------- read-path explorer ---------------- */
 {
   const SCEN = [
-    { label:"Normal lookup",
-      nodes:[
-        { c:"ok", ic:"✓", t:"Resolve the window's unique \"frozen\" index key", s:"index:00000005:00005100:00005350 — no tie-break, no value parsing" },
-        { c:"ok", ic:"✓", t:"Open 00005100-00005350.idx", s:"filename derived from the key name by a fixed bijection" },
-        { c:"ok", ic:"✓", t:"MPHF lookup: hash → seq", s:"3-byte payload = seq offset from MinLedger (embedded metadata); 1-byte fingerprint" },
-        { c:"ok", ic:"✓", t:"Retention gate: seq ≥ floor → admitted", s:"" },
-        { c:"ok", ic:"✓", t:"Open the chunk's .pack, fetch the tx, verify the full hash → return", s:"" },
-      ],
-      note:"Two tiers: chunks ≤ the frozen key's hi serve from the .idx; chunks above hi (live, or frozen-awaiting-coverage) serve from their hot DB's txhash CF. The transition is gap-free because the hot DB is discarded only after the durable .idx covers the chunk." },
-    { label:"Lost the race: supersession",
+    { label:"Found",
       nodes:[
-        { c:"ok", ic:"✓", t:"Resolve the frozen key → coverage [5100, 5349]", s:"" },
-        { c:"fail", ic:"✕", t:"open(00005100-00005349.idx) → ENOENT", s:"a rebuild's commit batch demoted this coverage and the eager sweep unlinked it, between resolve and open" },
-        { c:"retry", ic:"↻", t:"Re-resolve — never surface this ENOENT directly", s:"" },
-        { c:"ok", ic:"✓", t:"A frozen key exists: [5100, 5350]", s:"guaranteed: the batch that demoted the old coverage promoted its replacement in the same atomic write" },
-        { c:"ok", ic:"✓", t:"Serve from 00005100-00005350.idx", s:"" },
+        { c:"dim", ic:"·", t:"window 7 .idx: MPHF probe → fingerprint miss → skip", s:"non-containing window rejected with no fetch" },
+        { c:"dim", ic:"·", t:"window 6 .idx: fingerprint miss → skip", s:"" },
+        { c:"ok", ic:"✓", t:"window 5 .idx: fingerprint hit", s:"seq = MinLedger + payload (payloadWidth bytes)" },
+        { c:"ok", ic:"✓", t:"retention gate: seq ≥ floor → admitted", s:"" },
+        { c:"ok", ic:"✓", t:"fetch the LCM, verify the full 32-byte hash → confirms → return", s:"" },
       ],
-      note:"Rule — index-file ENOENT: re-resolve and retry. After a supersession demotion the retry always finds a frozen key. (A reader already holding the old fd never even notices — POSIX unlink doesn't invalidate open handles.)" },
-    { label:"Lost the race: retention",
+      note:"A hash has no window hint, so the reader probes every in-retention window. The fingerprint (fpWidth bytes) rejects non-containing windows with no fetch; the containing window's fingerprint hit is confirmed by the full-hash verify. The tx is in at most one window, so at most one probe confirms." },
+    { label:"Fingerprint false positive",
       nodes:[
-        { c:"ok", ic:"✓", t:"Resolve the frozen key in a window falling past the floor", s:"" },
-        { c:"fail", ic:"✕", t:"open(...) → ENOENT", s:"the prune stage swept the window — retention demotion, not supersession" },
-        { c:"retry", ic:"↻", t:"Re-resolve", s:"" },
-        { c:"fail", ic:"✕", t:"No frozen key in the window", s:"only retention pruning ever empties a window" },
-        { c:"ok", ic:"→", t:"Return not-found — legitimate", s:"the queried seq is by now below the floor; a fresh request would be short-circuited by the top-of-reader retention gate anyway" },
+        { c:"dim", ic:"·", t:"window 7: fingerprint miss → skip", s:"" },
+        { c:"fail", ic:"✕", t:"window 6: fingerprint HIT (false positive) → fetch + verify → fails", s:"~256^(−fpWidth) chance; the fetched tx's full hash doesn't match → rejected, keep probing" },
+        { c:"ok", ic:"✓", t:"window 5: fingerprint hit → fetch + verify → confirms → return", s:"" },
       ],
-      note:"Same rule, other branch: re-resolve; serve from the new frozen key if one exists, otherwise the not-found is the truth, not an error." },
-    { label:"Floor-straddling window",
+      note:"The fingerprint is a screen, not a decision: a non-containing window matches it with probability 256^(−fpWidth) and costs a wasted fetch the verify rejects. fpWidth is the knob that keeps spurious fetches ≪ 1; the verify is what guarantees only the true window is ever returned." },
+    { label:"Not found",
       nodes:[
-        { c:"dim", ic:"·", t:"(Normally the retention gate short-circuits this read before any file access)", s:"this path is reachable only in INV-1's hot-volume-loss exception, where the regressed floor transiently admits already-pruned bottom chunks" },
-        { c:"ok", ic:"✓", t:"Resolve the frozen key — it survives", s:"index keys are swept only when the window falls wholly past the floor, so a floor-straddling window keeps its .idx; its static lo still covers pruned chunks" },
-        { c:"ok", ic:"✓", t:".idx resolves the hash → seq in a pruned chunk", s:"" },
-        { c:"fail", ic:"✕", t:"open(.pack) → ENOENT — the data file is gone", s:"retention unlinked the chunk's files" },
-        { c:"ok", ic:"→", t:"Return not-found directly — no re-resolve", s:"re-resolving is useless (the same frozen key resolves identically). Never wrong data: artifacts are write-once and deletion is unlink-only, so a missing data file can only mean the chunk was pruned" },
+        { c:"dim", ic:"·", t:"window 7: fingerprint miss", s:"" },
+        { c:"dim", ic:"·", t:"window 6: fingerprint miss", s:"" },
+        { c:"dim", ic:"·", t:"windows 5 … 0: fingerprint miss (every in-retention window)", s:"a not-found lookup can't stop early — it must rule out every window" },
+        { c:"fail", ic:"→", t:"no window confirms → not-found", s:"a non-existent or not-yet-ingested hash" },
       ],
-      note:"Data-file ENOENT ≠ index-file ENOENT. Re-resolving helps only when the index key itself changed hands. This rule is what keeps reads fail-soft in the one state where the retention gate doesn't catch them first." },
+      note:"Not-found is the cost ceiling: with no window hint and nothing to confirm, the reader must probe the full set of in-retention windows before answering. (Ordering and parallelism of those probes are the query-routing design's concern.)" },
   ];
   const btns = $("#rd-buttons"), chain = $("#rd-chain"), note = $("#rd-note");
   function render(sc) {

From 535aa8810baed8fa84d1df89fafdb767ddbb68f7 Mon Sep 17 00:00:00 2001
From: tamirms
Date: Tue, 16 Jun 2026 11:57:30 +0200
Subject: [PATCH 06/18] docs(full-history): reject non-positive workers; correct prefix-collision note
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- validateConfig rejects workers < 1: a zero pool makes executePlan's
  slots channel unbuffered, so every task blocks on acquire forever
  (deadlock); negative panics in make.
- Correct the §8.2 16-byte-prefix-collision sentence. It claimed the
  collision "returns not-found", which is wrong for the cross-window
  case: probe-all + continue-past-verify-mismatch finds each
  transaction in its own window, and an intra-window collision is
  detected by streamhash at build time. The verify still guarantees
  the wrong transaction is never served; the residual collision
  (~10^-20/window) is accepted as negligible.
Co-Authored-By: Claude Fable 5
---
 design-docs/full-history-streaming-workflow.md    | 3 +++
 design-docs/gettransaction-full-history-design.md | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md
index d6a94c883..5ca7ef360 100644
--- a/design-docs/full-history-streaming-workflow.md
+++ b/design-docs/full-history-streaming-workflow.md
@@ -757,6 +757,9 @@ func validateConfig(cfg Config, cat Catalog) {
 	if cfg.ChunksPerTxhashIndex == 0 {
 		fatalf("chunks_per_txhash_index must be > 0 (it defines the index layout).")
 	}
+	if cfg.Workers < 1 {
+		fatalf("workers must be > 0 (got %d) — a zero pool deadlocks executePlan.", cfg.Workers)
+	}
 	if stored, ok := cat.Get("config:chunks_per_txhash_index"); !ok {
 		cat.Put("config:chunks_per_txhash_index", itoa(cfg.ChunksPerTxhashIndex))
 	} else if stored != itoa(cfg.ChunksPerTxhashIndex) {
diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md
index 34b988c2b..1e0e16ed4 100644
--- a/design-docs/gettransaction-full-history-design.md
+++ b/design-docs/gettransaction-full-history-design.md
@@ -299,7 +299,7 @@ respond on the confirmed hit; not-found if no window confirms
 Because the hash belongs to at most one window, **at most one window confirms**; a not-found lookup — a non-existent or not-yet-ingested hash — confirms none and must rule out every in-retention window.

-The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. It also makes 16-byte prefix collisions harmless to serving: two distinct in-set hashes sharing a prefix would be a ~10⁻²⁰-per-window event (birthday bound over ~3×10⁹ keys against 2¹²⁸), but even then the verify step returns not-found rather than the wrong transaction.
+The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. The same verify means a 16-byte prefix collision between two real transactions — a ~10⁻²⁰-per-window event (birthday bound over ~3×10⁹ keys against 2¹²⁸), accepted as a negligible risk — can never serve the *wrong* transaction.

 **Probe ordering, parallelism, early-stop, and the resulting latency and I/O are the query-routing design's concern** (§8.1), out of scope here.

From e54a73259d5ad57d408bca5714c7df8889f2fa45 Mon Sep 17 00:00:00 2001
From: tamirms
Date: Tue, 16 Jun 2026 14:25:06 +0200
Subject: [PATCH 07/18] docs(full-history): gate earliest_ledger tip-sampling to first start
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

validateConfig sampled the backend tip on every start — the numeric
floor-past-tip rejection and the "now" re-resolution both. On a restart
of a numeric-frontfill deployment, a lagging bulk backend (tip below the
pinned floor — the case the startup loop's max(tip, lastCommitted) is
built for) would spuriously fatal, even though local data is already
ahead.

Restructure around the pin: on restart, trust the stored pin and only
check the config didn't change (no backend call); sample the tip only on
first start, where the floor-past-tip rejection is meaningful. Fixes the
numeric restart fatal and the redundant "now" resampling, and removes
the backend dependency from the restart config path.
Co-Authored-By: Claude Fable 5
---
 .../full-history-streaming-workflow.md | 39 ++++++++++++-------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md
index 5ca7ef360..38031b230 100644
--- a/design-docs/full-history-streaming-workflow.md
+++ b/design-docs/full-history-streaming-workflow.md
@@ -767,7 +767,31 @@ func validateConfig(cfg Config, cat Catalog) {
 			stored, cfg.ChunksPerTxhashIndex)
 	}

-	// earliest_ledger: resolve, validate, store on first start, log/abort on mismatch.
+	// earliest_ledger. The backend tip is sampled ONLY on first start; once the
+	// pin exists it is the source of truth, and the tip may legitimately lag
+	// below it (the startup loop's max(tip, lastCommitted) is built for that),
+	// so a restart never re-checks against the tip — neither the numeric
+	// floor-past-tip rejection nor the "now" re-resolution.
+	stored, pinned := cat.Get("config:earliest_ledger")
+	if pinned {
+		// Restart: trust the pin; only confirm a static config didn't change.
+		// "now" is a no-op (it resolved once at first start and is now pinned).
+		if cfg.EarliestLedger != "now" {
+			want := uint32(GenesisLedger)
+			if cfg.EarliestLedger != "genesis" {
+				want = atoi(cfg.EarliestLedger)
+			}
+			if want != atoi(stored) {
+				fatalf("earliest_ledger changed: stored=%s, config=%s. Wipe the data "+
+					"directory to change earliest_ledger (or use the future "+
+					"set-earliest-ledger admin command).", stored, cfg.EarliestLedger)
+			}
+		}
+		return
+	}
+
+	// First start: resolve (sampling the tip for "now"), reject a floor past
+	// the tip, then pin.
 	var desired uint32
 	switch cfg.EarliestLedger {
 	case "genesis":
@@ -783,18 +807,7 @@ func validateConfig(cfg Config, cat Catalog) {
 	if desired > backendNetworkTip(cfg) {
 		fatalf("earliest_ledger (%d) is past the current tip; reject.", desired)
 	}
-	if stored, ok := cat.Get("config:earliest_ledger"); !ok {
-		cat.Put("config:earliest_ledger", itoa(desired))
-	} else if atoi(stored) != desired {
-		if cfg.EarliestLedger == "now" {
-			logInfof("earliest_ledger='now' resolves to %d, but stored is %s; "+
-				"using stored value (no-op after first start).", desired, stored)
-		} else {
-			fatalf("earliest_ledger changed: stored=%s, config=%d. Wipe the data "+
-				"directory to change earliest_ledger (or use the future "+
-				"set-earliest-ledger admin command).", stored, desired)
-		}
-	}
+	cat.Put("config:earliest_ledger", itoa(desired))
 }

 func openHotDBForChunk(cfg Config, cat Catalog, chunk ChunkID) *HotDB {

From d597a77a9dcf9d5b1cfe961d2f6b926a3fd1fa1f Mon Sep 17 00:00:00 2001
From: tamirms
Date: Tue, 16 Jun 2026 14:46:00 +0200
Subject: [PATCH 08/18] docs(full-history): commit first-start layout pins atomically; validate max_retries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- validateConfig pinned chunks_per_txhash_index before earliest_ledger
  was validated, so a first start that fataled on a bad earliest_ledger
  left a lone chunks pin — a retry correcting chunks_per_txhash_index
  then aborted as a layout mismatch though no artifacts existed.
  Validate both layout pins first, then commit them together in one
  atomic batch; treat "both pins present" as the committed/immutable
  state.
- Proactively: validate max_retries >= 0 in the same stateless block (a
  negative value is meaningless and, depending on withRetries, could
  skip a catch-up task silently). Same class as the chunks=0 /
  workers<1 guards.
Co-Authored-By: Claude Fable 5
---
 .../full-history-streaming-workflow.md | 67 +++++++++++--------
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md
index 38031b230..8231d3567 100644
--- a/design-docs/full-history-streaming-workflow.md
+++ b/design-docs/full-history-streaming-workflow.md
@@ -753,61 +753,70 @@ After `runBackfill` returns, every chunk in the backfilled range has `lfs` and `

 ```go
 func validateConfig(cfg Config, cat Catalog) {
-	// chunks_per_txhash_index: validate, then immutability check.
+	// Stateless config validation (no pins touched yet).
 	if cfg.ChunksPerTxhashIndex == 0 {
 		fatalf("chunks_per_txhash_index must be > 0 (it defines the index layout).")
 	}
 	if cfg.Workers < 1 {
 		fatalf("workers must be > 0 (got %d) — a zero pool deadlocks executePlan.", cfg.Workers)
 	}
-	if stored, ok := cat.Get("config:chunks_per_txhash_index"); !ok {
-		cat.Put("config:chunks_per_txhash_index", itoa(cfg.ChunksPerTxhashIndex))
-	} else if stored != itoa(cfg.ChunksPerTxhashIndex) {
-		fatalf("chunks_per_txhash_index changed: stored=%s, config=%d",
-			stored, cfg.ChunksPerTxhashIndex)
+	if cfg.MaxRetries < 0 {
+		fatalf("max_retries must be >= 0 (got %d).", cfg.MaxRetries) // 0 = run once, no retry
 	}
-
-	// earliest_ledger. The backend tip is sampled ONLY on first start; once the
-	// pin exists it is the source of truth, and the tip may legitimately lag
-	// below it (the startup loop's max(tip, lastCommitted) is built for that),
-	// so a restart never re-checks against the tip — neither the numeric
-	// floor-past-tip rejection nor the "now" re-resolution.
-	stored, pinned := cat.Get("config:earliest_ledger")
-	if pinned {
-		// Restart: trust the pin; only confirm a static config didn't change.
-		// "now" is a no-op (it resolved once at first start and is now pinned).
+	// The two layout pins (chunks_per_txhash_index, earliest_ledger) are
+	// committed together in one atomic batch on first start (below), so they
+	// exist all-or-nothing: BOTH present ⟹ a prior first start completed and the
+	// layout is immutable; otherwise startup never got past config validation,
+	// no artifacts exist, and re-validating + re-pinning is safe.
+	cpiStored, cpiPinned := cat.Get("config:chunks_per_txhash_index")
+	earliestStored, earliestPinned := cat.Get("config:earliest_ledger")
+
+	if cpiPinned && earliestPinned {
+		// Restart: the layout is committed — confirm nothing changed, write nothing.
+		if cpiStored != itoa(cfg.ChunksPerTxhashIndex) {
+			fatalf("chunks_per_txhash_index changed: stored=%s, config=%d",
+				cpiStored, cfg.ChunksPerTxhashIndex)
+		}
+		// earliest_ledger immutability. The backend tip is NOT re-sampled (it
+		// may lag below the pinned floor — the startup loop's
+		// max(tip, lastCommitted) handles that). "now" is a no-op: it resolved
+		// once at first start and is now pinned.
 		if cfg.EarliestLedger != "now" {
 			want := uint32(GenesisLedger)
 			if cfg.EarliestLedger != "genesis" {
 				want = atoi(cfg.EarliestLedger)
 			}
-			if want != atoi(stored) {
+			if want != atoi(earliestStored) {
 				fatalf("earliest_ledger changed: stored=%s, config=%s. Wipe the data "+
 					"directory to change earliest_ledger (or use the future "+
-					"set-earliest-ledger admin command).", stored, cfg.EarliestLedger)
+					"set-earliest-ledger admin command).", earliestStored, cfg.EarliestLedger)
 			}
 		}
 		return
 	}

-	// First start: resolve (sampling the tip for "now"), reject a floor past
-	// the tip, then pin.
-	var desired uint32
+	// First start (or an incomplete prior start — no artifacts yet). Resolve
+	// earliest_ledger, sampling the tip for "now" and rejecting a floor past the
+	// tip; then commit BOTH layout pins in one atomic synced batch.
+	var earliest uint32
 	switch cfg.EarliestLedger {
 	case "genesis":
-		desired = GenesisLedger
+		earliest = GenesisLedger
 	case "now":
-		desired = chunkFirstLedger(chunkID(backendNetworkTip(cfg)))
+		earliest = chunkFirstLedger(chunkID(backendNetworkTip(cfg)))
 	default:
-		desired = atoi(cfg.EarliestLedger)
-		if desired != chunkFirstLedger(chunkID(desired)) {
-			fatalf("earliest_ledger (%d) must be chunk-aligned.", desired)
+		earliest = atoi(cfg.EarliestLedger)
+		if earliest != chunkFirstLedger(chunkID(earliest)) {
+			fatalf("earliest_ledger (%d) must be chunk-aligned.", earliest)
 		}
 	}
-	if desired > backendNetworkTip(cfg) {
-		fatalf("earliest_ledger (%d) is past the current tip; reject.", desired)
+	if earliest > backendNetworkTip(cfg) {
+		fatalf("earliest_ledger (%d) is past the current tip; reject.", earliest)
 	}
-	cat.Put("config:earliest_ledger", itoa(desired))
+	batch := cat.NewBatch()
+	batch.Put("config:chunks_per_txhash_index", itoa(cfg.ChunksPerTxhashIndex))
+	batch.Put("config:earliest_ledger", itoa(earliest))
+	batch.Commit()
 }

 func openHotDBForChunk(cfg Config, cat Catalog, chunk ChunkID) *HotDB {

From 9007872dae9cac0a2f8415662887ffec9a8fba83 Mon Sep 17 00:00:00 2001
From: tamirms
Date: Tue, 16 Jun 2026 14:52:44 +0200
Subject: [PATCH 09/18] docs(full-history): validate earliest_ledger parses

The numeric earliest_ledger path called atoi(cfg.EarliestLedger) with no
parse check, so a malformed value panicked or fell through to a
confusing "must be chunk-aligned (0)". Validate the form once in the
stateless block ("genesis" / "now" / a ledger number), which also makes
the restart-branch atoi safe.
Co-Authored-By: Claude Fable 5
---
 design-docs/full-history-streaming-workflow.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md
index 8231d3567..b310391df 100644
--- a/design-docs/full-history-streaming-workflow.md
+++ b/design-docs/full-history-streaming-workflow.md
@@ -763,6 +763,15 @@ func validateConfig(cfg Config, cat Catalog) {
 	if cfg.MaxRetries < 0 {
 		fatalf("max_retries must be >= 0 (got %d).", cfg.MaxRetries) // 0 = run once, no retry
 	}
+	// earliest_ledger must be "genesis", "now", or a ledger number. Validating
+	// the form here (not in the branches below) keeps every later
+	// atoi(cfg.EarliestLedger) safe on both the restart and first-start paths.
+	if cfg.EarliestLedger != "genesis" && cfg.EarliestLedger != "now" {
+		if _, err := parseUint32(cfg.EarliestLedger); err != nil {
+			fatalf("earliest_ledger must be \"genesis\", \"now\", or a ledger number; got %q.",
+				cfg.EarliestLedger)
+		}
+	}
 	// The two layout pins (chunks_per_txhash_index, earliest_ledger) are
 	// committed together in one atomic batch on first start (below), so they
 	// exist all-or-nothing: BOTH present ⟹ a prior first start completed and the

From a356eaffd9cde4efe92ddf5cff2e37de40e1fa38 Mon Sep 17 00:00:00 2001
From: tamirms
Date: Tue, 16 Jun 2026 15:12:56 +0200
Subject: [PATCH 10/18] docs(full-history): validateBackendCovers -> validateRangeProducible
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The name read as "the bulk backend covers [rangeStart, rangeEnd]", but
the startup loop deliberately sets rangeEnd above a lagging bulk tip
(anchored on max(tip, lastCommitted)), and those chunks are produced
locally. Rename and restate the contract: validate every in-range chunk
is producible from SOME source (mirroring catchupSource) — the backend
need cover only the chunks that fall through to it, not the whole range.
Prevents a spurious abort on a lagging-backend restart.

Co-Authored-By: Claude Fable 5
---
 .../full-history-streaming-workflow.md | 20 +++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md
index b310391df..78bbad133 100644
--- a/design-docs/full-history-streaming-workflow.md
+++ b/design-docs/full-history-streaming-workflow.md
@@ -90,7 +90,7 @@ One TOML file (`--config`) configures the daemon.
 | `workers` | int | `GOMAXPROCS` | Concurrent task slots for bulk catch-up. |
 | `max_retries` | int | `3` | Retries per catch-up task before the daemon aborts. |

-**[catch_up.bsb]** — Buffered Storage Backend (the default bulk LedgerBackend; required **unless** another conformant LedgerBackend is configured as the bulk source — `backendNetworkTip`/`validateBackendCovers`/`processChunk`'s default `source` all go through whichever backend is configured)
+**[catch_up.bsb]** — Buffered Storage Backend (the default bulk LedgerBackend; required **unless** another conformant LedgerBackend is configured as the bulk source — `backendNetworkTip`/`validateRangeProducible`/`processChunk`'s default `source` all go through whichever backend is configured)

 | Key | Type | Default | Description |
 |---|---|---|---|
@@ -563,10 +563,14 @@ func buildThenSweep(b IndexBuild, cfg Config) error {

 ```go
 func runBackfill(ctx context.Context, cfg Config, rangeStart, rangeEnd ChunkID) error {
-	// Every in-range chunk must be producible from SOME source: durable
-	// artifacts (self-skips), a complete ready hot DB, the local .pack, or
-	// the bulk backend — fail before any work otherwise.
-	if err := validateBackendCovers(cfg, rangeStart, rangeEnd); err != nil {
+	// Fail before any work if a chunk can't be produced from ANY source. This
+	// mirrors catchupSource's preference: a chunk needs the bulk backend only
+	// when it is not already durable (self-skips), not complete in a ready hot
+	// DB, and not re-derivable from a local .pack — so the backend must cover
+	// only those fall-through chunks, NOT the whole range. Load-bearing on a
+	// restart where rangeEnd is anchored on max(tip, lastCommitted): the span
+	// above a lagging bulk tip is produced locally and must not abort here.
+	if err := validateRangeProducible(cfg, rangeStart, rangeEnd); err != nil {
 		return err
 	}
 	return executePlan(ctx, resolve(cfg, rangeStart, rangeEnd), cfg)
@@ -947,7 +951,7 @@ The doorbell carries no payload, so its delivery semantics can be maximally slop

 ### Lifecycle

-The lifecycle goroutine runs one **tick** per notification, in three stages: **plan-and-execute** (the same `resolve` + `executePlan` catch-up uses, from the retention floor up to `completeThrough` — this is where a just-closed chunk freezes, from its hot DB via `catchupSource`'s hot branch, and where the current window's index folds it in), then the **discard** scan (retire hot DBs the cold artifacts now fully serve), then the **prune** scan (sweep demoted and past-retention files). The retention floor plays two roles with *opposite safe directions*, and the design keeps them separate. As a **retention boundary** (the prune scan, the reader gate) it errs permissive: anchored on `completeThrough`, a floor that sits a little low keeps an extra chunk briefly, or admits a read that at worst lands on already-pruned data and returns not-found via the reader's missing-data-file rule — harmless either way. As a **production boundary** it would err dangerous: planning a build below existing storage means demanding chunks from the bulk source that nobody validated it can produce. So production below storage never consults the floor — the tick's plan range starts at the lowest chunk already materialized, and extending the *bottom* of storage (which is what retention widening means) is exclusively catch-up's job, the one path that runs `validateBackendCovers` before demanding anything. Ordering lives in two places, each natural to its half: freeze-before-build is a *plan dependency* (the window's `IndexBuild` waits on its in-coverage chunk builds' done-channels), and build-before-discard / demote-before-sweep is the *stage sequence* — the scans run after `executePlan` returns, so they see every commit the plan landed. Correctness never depends on any of it: every decision derives from durable keys, so work whose enabler hasn't landed is simply not scheduled, and the next tick picks it up.
+The lifecycle goroutine runs one **tick** per notification, in three stages: **plan-and-execute** (the same `resolve` + `executePlan` catch-up uses, from the retention floor up to `completeThrough` — this is where a just-closed chunk freezes, from its hot DB via `catchupSource`'s hot branch, and where the current window's index folds it in), then the **discard** scan (retire hot DBs the cold artifacts now fully serve), then the **prune** scan (sweep demoted and past-retention files). The retention floor plays two roles with *opposite safe directions*, and the design keeps them separate. As a **retention boundary** (the prune scan, the reader gate) it errs permissive: anchored on `completeThrough`, a floor that sits a little low keeps an extra chunk briefly, or admits a read that at worst lands on already-pruned data and returns not-found via the reader's missing-data-file rule — harmless either way. As a **production boundary** it would err dangerous: planning a build below existing storage means demanding chunks from the bulk source that nobody validated it can produce. So production below storage never consults the floor — the tick's plan range starts at the lowest chunk already materialized, and extending the *bottom* of storage (which is what retention widening means) is exclusively catch-up's job, the one path that runs `validateRangeProducible` before demanding anything. Ordering lives in two places, each natural to its half: freeze-before-build is a *plan dependency* (the window's `IndexBuild` waits on its in-coverage chunk builds' done-channels), and build-before-discard / demote-before-sweep is the *stage sequence* — the scans run after `executePlan` returns, so they see every commit the plan landed. Correctness never depends on any of it: every decision derives from durable keys, so work whose enabler hasn't landed is simply not scheduled, and the next tick picks it up.

 The one input the tick needs beyond the keys themselves is *how far ingestion has durably gotten* — which chunks are complete, and where the sliding retention floor anchors. That is `deriveCompleteThrough`, defined with the [derived-progress machinery](#meta-store-keys) in the data model; the tick derives it once, at tick start:

@@ -964,7 +968,7 @@ func runLifecycleTick(ctx context.Context, cfg Config, cat Catalog) {
 		// low is harmless. As a PRODUCTION boundary it would err dangerous:
 		// a below-storage build demands chunks from a bulk source nobody
 		// validated. So the tick's plan range starts at existing storage;
-		// extending the bottom is catch-up's job, behind validateBackendCovers.
+		// extending the bottom is catch-up's job, behind validateRangeProducible.
 		start = low
 	}

@@ -1279,7 +1283,7 @@ These are streaming-specific properties the implementation guarantees on top of
 INV-1 holds at every point the daemon is serving reads — transient states are never externally visible, because readers resolve `"frozen"` keys exclusively and the retention check masks everything else. INV-2, INV-3, and INV-4 hold at every quiescence reached after the events below; startup's first quiescence arrives when the first tick completes, shortly after reads open.

 1. **Steady-state operation.** Hot DB ingestion advances `last_committed_ledger`; the lifecycle goroutine freezes complete chunks within retention and prunes anything past it. All four invariants hold by induction on `last_committed_ledger`.
-2. **Operator state changes** — retention widening or shortening (`retention_chunks`), `earliest_ledger` raised. Both reduce to "`effectiveRetentionFloor` recomputes; the next startup converges to the new state." Catch-up's per-window resolver rule re-derives and rebuilds any window whose desired coverage now exceeds its stored coverage; the prune stage removes anything below a raised floor. The "next startup" is load-bearing for widening, enforced by the floor's two-role split: a lowered floor takes effect immediately in its *retention* role (pruning simply stops sooner), but the tick's *production* range still starts at existing storage — only the next catch-up, behind `validateBackendCovers`, materializes the new bottom.
+2. **Operator state changes** — retention widening or shortening (`retention_chunks`), `earliest_ledger` raised. Both reduce to "`effectiveRetentionFloor` recomputes; the next startup converges to the new state." Catch-up's per-window resolver rule re-derives and rebuilds any window whose desired coverage now exceeds its stored coverage; the prune stage removes anything below a raised floor. The "next startup" is load-bearing for widening, enforced by the floor's two-role split: a lowered floor takes effect immediately in its *retention* role (pruning simply stops sooner), but the tick's *production* range still starts at existing storage — only the next catch-up, behind `validateRangeProducible`, materializes the new bottom.
 3. **Surgical recovery, frozen-range case** (tainted range strictly below `chunkID(last_committed_ledger)`). The operator never touches the filesystem. Recovery is **one atomic meta-store batch**: every `chunk:{c}:*` key in the tainted range and every `index:*` key of every window overlapping it → `"freezing"` — the state that already means *this file is not to be trusted: re-derive or delete* — and any leftover `hot:chunk` key in the range (a crash between a tick's build and its discard stage leaves a frozen chunk's hot DB behind) → `"transient"`, which makes it instantly ineligible as a source (`catchupSource` reads only `"ready"`). The batch commits atomically or not at all, so there is no interruption analysis and re-running it is a no-op; the meta store's lock means it can only be written against a stopped daemon. Every demoted key then converges through machinery that already exists: on restart, catch-up re-derives the freezing chunk artifacts from a conformant LedgerBackend — overwriting the tainted files in place, rule 1's ordinary re-materialization — and rebuilds each window's index, re-marking the freezing index key whose coverage is still desired (or leaving it to the prune scan's sweep-on-sight when retention has moved past it); the discard scan retires the transient hot key once the rebuilt chunk regains coverage — or past retention, after long downtime — unlinking the tainted hot DB unread. `last_committed_ledger` is exactly unchanged whenever ingestion has ever started: the batch keeps every hot key, the positional term counts them value-blind, and the live chunk's `"ready"` DB restores the watermark precisely. Only in the no-hot-key corner — the daemon stopped during initial catch-up, before ingestion opened its first hot DB — does the cold term lead the derivation, and there it can regress through every untainted chunk of a finalized window the taint overlaps (their `.bin` keys were swept at finalization, and the demoted index key breaks coverage), bounded by one window; catch-up's `max(tip, lastCommitted)` anchor re-derives forward regardless.
 4. **Surgical recovery, partial-tail-chunk case** (tainted range includes the live chunk), and equally **hot-volume loss**. The same artifact-key batch as case 3, but hot keys are **removed**, not demoted — case 3's leftover hot DB holds data that exists elsewhere and sits below the watermark, so a demoted key is harmlessly reclaimed; here the hot DB *was* the data, and a kept key either re-fires the fatal or silently inflates the watermark. Remove **every** `hot:chunk` key whose dir is missing or whose contents are partial. A half-recovery that keeps a `"ready"` key merely relocates the fatal to the next restart. One that keeps the boundary-crash `"transient"` key — the next chunk's key, sitting above the last frozen boundary — is recoverable but worse than compliance: the positional term counts all hot keys while the fatal checks only `"ready"` ones, so the kept key silently props the derived watermark up to the lost chunk's end, pulling the lost chunk into catch-up's anchored range, which re-derives it from the bulk source — stalling startup until that source has the chunk, where full removal would have resumed from the last frozen boundary immediately. Remove any surviving dirs along with the keys — dirs first, keys last: an interrupted pass then leaves keys whose dirs are missing, which the next startup catches (the fatal for `"ready"`, the watermark-prop analysis above for `"transient"`), never a keyless orphan dir no key-driven scan can find. The hot DB is the only copy of its ledgers — discarding it loses them, and the **derived watermark admits as much automatically**: with the hot DB gone, derivation lands at the last frozen boundary, and the next startup re-ingests from there. There is no watermark to edit; recovery is key removal, never file surgery beyond the lost dirs themselves.
 5. **First deployment / downtime between restarts.** `last_committed_ledger` derives to `max(frozen/hot maxima, earliest_ledger - 1)`, ensuring `resumeLedger ≥ earliest_ledger`.
Backfill fills `[earliest_ledger, lastCompleteChunkAt(network_tip)]` if needed (a no-op for `earliest_ledger = "now"` first deployment). From b8cf6b909dd00412db698168464774198de99f27 Mon Sep 17 00:00:00 2001 From: tamirms Date: Tue, 16 Jun 2026 15:23:27 +0200 Subject: [PATCH 11/18] docs(full-history): harden external-I/O boundaries (backend tip, captive core) + config bounds MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The audit showed the core design (concurrency, atomicity, arithmetic, consistency) is sound; the gaps were all at external-dependency edges and a couple of missing config bounds. One principle: untrusted external state must never poison durable state or silently stop the daemon. - networkTip(): single hardened tip sampler — retry-with-backoff, and reject a tip < genesis ("backend not ready"). Routes all three call sites. "now" (no local substitute) fatals on failure; the catch-up loop falls back to lastCommitted (serve local, skip catch-up); the numeric past-tip check is best-effort. Fixes: first-start "now"/below-genesis tip pinning a garbage immutable floor; a transient backend outage bricking a local-only restart. - earliest_ledger static-form validated once (parse + >= genesis + aligned), so no path feeds a sub-genesis value into chunkID. - chunks_per_txhash_index upper-bounded (MaxChunksPerTxhashIndex) so the window span fits the index offset and can't overflow uint32. - lastCompleteChunkAt: cast before subtract — total over uint32 (was a latent underflow at ledger 0). - runIngestionLoop: distinguish ctx-cancel (clean, return nil) from an unexpected captive-core stream close (restartable error → non-zero exit → supervisor restarts), instead of silently exiting 0 when core dies. - tx §3 cold-tier cell: "in [lo, hi]" (was "at or below hi", dropping the lo bound). 
Co-Authored-By: Claude Fable 5 --- .../full-history-streaming-workflow.md | 102 +++++++++++++----- .../gettransaction-full-history-design.md | 2 +- 2 files changed, 77 insertions(+), 27 deletions(-) diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index 78bbad133..5c082b58e 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -116,7 +116,7 @@ One TOML file (`--config`) configures the daemon. | Key | Type | Default | Description | |---|---|---|---| | `retention_chunks` | uint32 | `0` | Retention window in chunks. `0` = full history. | -| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for. Acts as a fixed lower floor on history; combines with `retention_chunks` (the effective floor is the higher of the two). Must be chunk-aligned (i.e., `chunkFirstLedger` of some chunk); `"now"` resolves to `chunkFirstLedger(chunkID(backendNetworkTip()))` at first start. Stored on first start; immutable thereafter. Setting it higher than genesis skips upfront catch-up — useful for *frontfill* deployments (`earliest_ledger = "now"`) where bringing a fast bulk source online isn't possible. The current immutability is enforced only by `validateConfig`; the rest of the system reads the value through the meta store, so a future `set-earliest-ledger` admin command would be a small change. | +| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for. Acts as a fixed lower floor on history; combines with `retention_chunks` (the effective floor is the higher of the two). Must be chunk-aligned (i.e., `chunkFirstLedger` of some chunk); `"now"` resolves to `chunkFirstLedger(chunkID(networkTip()))` at first start (and requires a reachable, ready backend). Stored on first start; immutable thereafter. 
Setting it higher than genesis skips upfront catch-up — useful for *frontfill* deployments (`earliest_ledger = "now"`) where bringing a fast bulk source online isn't possible. The current immutability is enforced only by `validateConfig`; the rest of the system reads the value through the meta store, so a future `set-earliest-ledger` admin command would be a small change. | | `captive_core_config` | string | **required** | Path to CaptiveStellarCore config file. | **[streaming.hot_storage]** @@ -648,6 +648,10 @@ The retention floor itself is computed by: const ( GenesisLedger = 2 LedgersPerChunk = 10_000 + // MaxChunksPerTxhashIndex bounds chunks_per_txhash_index so the window's + // ledger span (cpi*LedgersPerChunk) fits a 4-byte index offset and the + // product can't overflow uint32; any real deployment stays far below it. + MaxChunksPerTxhashIndex = 429_496 // floor(2^32 / LedgersPerChunk) ) // effectiveRetentionFloor is the lower bound of the retention window, @@ -677,7 +681,26 @@ func effectiveRetentionFloor(upperBound uint32, retentionChunks uint32, earliest // whose last ledger is <= ledger. E.g., lastCompleteChunkAt(10_001) == 0 // (chunk 0 spans ledgers 2..10_001). func lastCompleteChunkAt(ledger uint32) int64 { - return int64(ledger-1)/LedgersPerChunk - 1 + return (int64(ledger)-1)/LedgersPerChunk - 1 // cast before subtract: total over uint32 +} + +// networkTip samples the configured backend's network tip, hardened against the +// two ways it lies. It retries with bounded backoff (transient object-store +// unavailability) and rejects a tip below GenesisLedger as "not ready" (an +// empty / not-yet-synced backend), so an unready tip never reaches the chunk +// arithmetic where it would pin a garbage floor. Callers with a local +// substitute degrade on error — the catch-up loop falls back to lastCommitted, +// the numeric past-tip check is skipped — and only "now" resolution, which has +// no substitute, fatals. 
+func networkTip(cfg Config) (uint32, error) { + tip, err := withBackoff(func() (uint32, error) { return backendNetworkTip(cfg) }) + if err != nil { + return 0, err + } + if tip < GenesisLedger { + return 0, fmt.Errorf("backend tip %d is below genesis — backend not ready", tip) + } + return tip, nil } ``` @@ -704,7 +727,10 @@ func startStreaming(ctx context.Context, cfg Config) error { // guard then catches it cleanly. backfilledThrough := int64(-1) for { - tip := backendNetworkTip(cfg) + tip, err := networkTip(cfg) + if err != nil { + tip = lastCommitted // backend unreachable: serve local, skip catch-up this pass + } anchor := max(tip, lastCommitted) // guards a lagging bulk tip, in BOTH uses below rangeStart := chunkID(effectiveRetentionFloor(anchor, retentionChunks, earliest)) // Anchoring rangeEnd on the watermark too matters when the bulk tip @@ -758,8 +784,9 @@ After `runBackfill` returns, every chunk in the backfilled range has `lfs` and ` ```go func validateConfig(cfg Config, cat Catalog) { // Stateless config validation (no pins touched yet). - if cfg.ChunksPerTxhashIndex == 0 { - fatalf("chunks_per_txhash_index must be > 0 (it defines the index layout).") + if cfg.ChunksPerTxhashIndex == 0 || cfg.ChunksPerTxhashIndex > MaxChunksPerTxhashIndex { + fatalf("chunks_per_txhash_index must be in [1, %d] (it defines the index "+ + "layout, immutable once stored).", MaxChunksPerTxhashIndex) } if cfg.Workers < 1 { fatalf("workers must be > 0 (got %d) — a zero pool deadlocks executePlan.", cfg.Workers) @@ -767,13 +794,15 @@ func validateConfig(cfg Config, cat Catalog) { if cfg.MaxRetries < 0 { fatalf("max_retries must be >= 0 (got %d).", cfg.MaxRetries) // 0 = run once, no retry } - // earliest_ledger must be "genesis", "now", or a ledger number. Validating - // the form here (not in the branches below) keeps every later - // atoi(cfg.EarliestLedger) safe on both the restart and first-start paths. 
+ // earliest_ledger must be "genesis", "now", or a chunk-aligned ledger >= + // genesis. Validating the full static form here keeps every later + // atoi(cfg.EarliestLedger) well-formed on both the restart and first-start + // paths (and out of chunkID's sub-genesis underflow domain). if cfg.EarliestLedger != "genesis" && cfg.EarliestLedger != "now" { - if _, err := parseUint32(cfg.EarliestLedger); err != nil { - fatalf("earliest_ledger must be \"genesis\", \"now\", or a ledger number; got %q.", - cfg.EarliestLedger) + n, err := parseUint32(cfg.EarliestLedger) + if err != nil || n < GenesisLedger || n != chunkFirstLedger(chunkID(n)) { + fatalf("earliest_ledger must be \"genesis\", \"now\", or a chunk-aligned "+ + "ledger >= %d; got %q.", GenesisLedger, cfg.EarliestLedger) } } // The two layout pins (chunks_per_txhash_index, earliest_ledger) are @@ -809,23 +838,29 @@ func validateConfig(cfg Config, cat Catalog) { } // First start (or an incomplete prior start — no artifacts yet). Resolve - // earliest_ledger, sampling the tip for "now" and rejecting a floor past the - // tip; then commit BOTH layout pins in one atomic synced batch. + // earliest_ledger, then commit BOTH layout pins in one atomic synced batch. + // The network tip is needed only to resolve "now" and as a best-effort + // sanity bound on a numeric floor; networkTip rejects an unready (< genesis) + // or unreachable tip, so neither path can pin a garbage floor. 
var earliest uint32 switch cfg.EarliestLedger { case "genesis": earliest = GenesisLedger case "now": - earliest = chunkFirstLedger(chunkID(backendNetworkTip(cfg))) + tip, err := networkTip(cfg) // no local substitute for "now": must succeed + if err != nil { + fatalf("earliest_ledger=now needs a reachable, ready backend: %v", err) + } + earliest = chunkFirstLedger(chunkID(tip)) // <= tip, so never past the tip default: - earliest = atoi(cfg.EarliestLedger) - if earliest != chunkFirstLedger(chunkID(earliest)) { - fatalf("earliest_ledger (%d) must be chunk-aligned.", earliest) + earliest = atoi(cfg.EarliestLedger) // already form-validated: parse, >= genesis, aligned + // Best-effort: reject a floor past where the network is, but skip the + // check if the backend is unreachable — the floor is well-formed and the + // catch-up loop tolerates a lagging/absent tip. + if tip, err := networkTip(cfg); err == nil && earliest > tip { + fatalf("earliest_ledger (%d) is past the current tip (%d); reject.", earliest, tip) } } - if earliest > backendNetworkTip(cfg) { - fatalf("earliest_ledger (%d) is past the current tip; reject.", earliest) - } batch := cat.NewBatch() batch.Put("config:chunks_per_txhash_index", itoa(cfg.ChunksPerTxhashIndex)) batch.Put("config:earliest_ledger", itoa(earliest)) @@ -919,11 +954,27 @@ func runIngestionLoop(cfg Config, core *CaptiveCore, hotDB *HotDB, cat Catalog, } notify() // first act: the hot-chunk set just changed (the resume DB was opened) - for lcm := range core.StreamLedgers() { + for { + var lcm LedgerCloseMeta + select { + case <-ctx.Done(): + return nil // clean shutdown: the daemon was asked to stop + case l, ok := <-core.StreamLedgers(): + if !ok { + // Captive core's stream closed without a shutdown request — core + // crashed/exited. RESTARTABLE, not success: return an error so the + // process exits non-zero and the supervisor restarts it. 
Startup + // re-derives progress from durable state; the last synced batch is + // the watermark, so nothing is lost. + return fmt.Errorf("captive core stream closed unexpectedly") + } + lcm = l + } + // One atomic, synced WriteBatch across all CFs — a ledger is either - // fully in the hot DB or absent. The batch IS the durability - // boundary; the loop keeps no progress variable at all — progress is - // re-derived from durable state at the next startup. + // fully in the hot DB or absent. The batch IS the durability boundary; + // the loop keeps no progress variable at all — progress is re-derived + // from durable state at the next startup. batch := hotDB.NewBatch() putLedger(batch, lcm) // ledgers CF putTxHashes(batch, lcm) // txhash CF @@ -941,11 +992,10 @@ func runIngestionLoop(cfg Config, core *CaptiveCore, hotDB *HotDB, cat Catalog, notify() } } - return nil } ``` -A batch error causes the loop to retry the entire ledger (the batch is all-or-nothing, so a retry can't double-apply). On repeated failure the daemon aborts; the next startup's derived watermark equals exactly what the last synced batch committed — there is no second durable write that could disagree with it — and ingestion resumes from the next seq. The close-before-open order at the boundary is load-bearing: the next chunk's hot key is what makes this chunk *visibly complete* to the lifecycle's derivation, so the write handle must already be released when that key appears — otherwise a tick still in flight from the *previous* notification could rmdir a dir whose writer is live. Readers hold their own independent read-only handles. +A batch error causes the loop to retry the entire ledger (the batch is all-or-nothing, so a retry can't double-apply). On repeated failure the daemon aborts; the next startup's derived watermark equals exactly what the last synced batch committed — there is no second durable write that could disagree with it — and ingestion resumes from the next seq. 
An *unexpected* close of the captive-core stream — core crash or exit, as opposed to a `ctx`-cancelled shutdown — is handled the same way: the loop returns an error so the process exits non-zero and the supervisor restarts it, resuming exactly where the last synced batch left off (a clean close would otherwise look like success and not restart). The close-before-open order at the boundary is load-bearing: the next chunk's hot key is what makes this chunk *visibly complete* to the lifecycle's derivation, so the write handle must already be released when that key appears — otherwise a tick still in flight from the *previous* notification could rmdir a dir whose writer is live. Readers hold their own independent read-only handles. The doorbell carries no payload, so its delivery semantics can be maximally sloppy: a non-blocking send on a size-1 buffered channel, coalescing freely. Nothing is lost because the notification carries no information to lose — eligibility derives entirely from durable state, and a tick triggered by one notification processes everything the catalog shows, however many boundaries contributed to it. The doorbell only answers "when should the lifecycle look", never "what should it see". 
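The doorbell described above can be sketched in a few lines of Go. This is a minimal illustration, not the daemon's actual code: the names `Doorbell`, `NewDoorbell`, `Ring`, and `Pending` are hypothetical and do not appear in the design doc. It shows only the delivery semantics the paragraph relies on, a non-blocking send on a size-1 buffered channel where repeated rings coalesce into one pending wake-up.

```go
package main

import "fmt"

// Doorbell is a hypothetical name for the payload-free wake-up channel
// between ingestion and the lifecycle goroutine.
type Doorbell chan struct{}

// NewDoorbell returns a doorbell with room for exactly one pending wake-up.
func NewDoorbell() Doorbell { return make(Doorbell, 1) }

// Ring requests a lifecycle tick without ever blocking the caller. If a
// wake-up is already pending, this one coalesces into it and is dropped.
func (d Doorbell) Ring() {
	select {
	case d <- struct{}{}:
	default: // already rung; the pending tick will read the same durable state
	}
}

// Pending reports whether a wake-up is waiting, consuming it if so.
func (d Doorbell) Pending() bool {
	select {
	case <-d:
		return true
	default:
		return false
	}
}

func main() {
	bell := NewDoorbell()
	for i := 0; i < 5; i++ { // five chunk boundaries before the lifecycle looks
		bell.Ring()
	}
	fmt.Println(bell.Pending()) // true: one coalesced wake-up
	fmt.Println(bell.Pending()) // false: the other four rings carried nothing extra
}
```

Dropping four of five rings is harmless in this sketch for the reason the text gives: the signal carries no payload, so the single tick it triggers would re-derive everything from the catalog regardless of how many boundaries were crossed.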
diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md index 1e0e16ed4..8094db429 100644 --- a/design-docs/gettransaction-full-history-design.md +++ b/design-docs/gettransaction-full-history-design.md @@ -47,7 +47,7 @@ An in-retention transaction is stored in exactly one place — one tier, one win | Tier | Structure | Serves | |---|---|---| | **Hot** | `txhash` CF of the per-chunk hot RocksDB | the live chunk, plus any frozen chunk the window index doesn't cover yet | -| **Cold** | one streamhash `.idx` per window, covering chunks `[lo, hi]` | every chunk at or below the window's frozen `hi` | +| **Cold** | one streamhash `.idx` per window, covering chunks `[lo, hi]` | every chunk in `[lo, hi]` (at/below the frozen `hi`, at/above the floor chunk `lo`) | ``` window w From 5631e4ce0dbe757c03dc45399e75b4ddf2a98ba6 Mon Sep 17 00:00:00 2001 From: tamirms Date: Tue, 16 Jun 2026 15:35:22 +0200 Subject: [PATCH 12/18] docs(full-history): fix clean-shutdown race in runIngestionLoop MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The select-based loop returned an error on the stream-close case without checking ctx, so a graceful shutdown (ctx cancelled AND core closes its stream) could exit non-zero if select picked the channel branch — making the supervisor restart a daemon that was asked to stop. Guard the close case with ctx.Err(): a close while ctx is done is a clean shutdown (return nil); only an unexpected close is the restartable error. 
Co-Authored-By: Claude Fable 5 --- design-docs/full-history-streaming-workflow.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index 5c082b58e..78d4aa783 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -961,11 +961,14 @@ func runIngestionLoop(cfg Config, core *CaptiveCore, hotDB *HotDB, cat Catalog, return nil // clean shutdown: the daemon was asked to stop case l, ok := <-core.StreamLedgers(): if !ok { - // Captive core's stream closed without a shutdown request — core - // crashed/exited. RESTARTABLE, not success: return an error so the - // process exits non-zero and the supervisor restarts it. Startup - // re-derives progress from durable state; the last synced batch is - // the watermark, so nothing is lost. + if ctx.Err() != nil { + return nil // stream closed *because* we're shutting down — clean + } + // Closed without a shutdown request — core crashed/exited. + // RESTARTABLE, not success: return an error so the process exits + // non-zero and the supervisor restarts it. Startup re-derives + // progress from durable state; the last synced batch is the + // watermark, so nothing is lost. return fmt.Errorf("captive core stream closed unexpectedly") } lcm = l From 4ff11b83d020706ce62e1bace80db6d52add4e9b Mon Sep 17 00:00:00 2001 From: tamirms Date: Tue, 16 Jun 2026 15:51:45 +0200 Subject: [PATCH 13/18] docs(full-history): don't serve empty history on first-start tip outage; lock hot storage MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - The networkTip fallback (tip = lastCommitted on backend outage) was wrong on a FIRST start: lastCommitted is only the earliest-1 sentinel, so the catch-up range goes empty and serveReads() starts on empty/incomplete history. 
Degrade to the local watermark only when there is committed local progress (lastCommitted >= earliest); otherwise fatal until a real tip is available. - Single-process enforcement: include [streaming.hot_storage] in the per-root flock set. Two daemons with different meta stores but a shared hot_storage path could both write the same hot/{chunk} tree — the only copy of recent ledgers — despite the immutable roots being locked. Co-Authored-By: Claude Fable 5 --- design-docs/full-history-streaming-workflow.md | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index 78d4aa783..a1f0b89d2 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -620,7 +620,7 @@ func executePlan(ctx context.Context, plan Plan, cfg Config) error { - The executor runs each `IndexBuild` via `buildThenSweep` (defined with `resolve` above), which lands the commit batch (terminal for complete windows) and then runs the eager `"pruning"` sweep (rule 4). The sweep is window-local — this window's demoted inputs and superseded coverages, not a store-wide scan — so concurrent windows' sweeps touch disjoint keys, and `fsyncDir` on a bucket dir shared with another window's in-flight `.bin` writes is safe (a dir fsync with concurrent creates just makes more entries durable). - Done-channels broadcast *completion*, not success: a chunk build that exhausts its retries still closes its channel (the `defer`), so a dependent index build can win the race against context cancellation and start — whereupon it fails `buildTxhashIndex`'s loud `.bin` precondition check before writing any key, landing on the same abort-and-restart path as the original failure. The precondition check is load-bearing here. 
- A task that exhausts its retries aborts the daemon, per the [error policy](#lifecycle); restart re-resolves from durable keys, and completed work never repeats. -- **Single-process enforcement:** the meta store holds a kernel `flock` on a `LOCK` file; a second daemon opening the **same meta-store path** fails immediately, and the lock releases on any process exit (including `kill -9`). Because `[meta_store]` and each `[immutable_storage.*]` path are independently configurable, the meta-store lock alone cannot stop two daemons with *different* meta stores from sharing one artifact tree — the daemon therefore also takes a `flock` in each configured storage root. +- **Single-process enforcement:** the meta store holds a kernel `flock` on a `LOCK` file; a second daemon opening the **same meta-store path** fails immediately, and the lock releases on any process exit (including `kill -9`). Because `[meta_store]`, each `[immutable_storage.*]` path, *and* `[streaming.hot_storage]` are independently configurable, the meta-store lock alone cannot stop two daemons with *different* meta stores from sharing an artifact tree or a hot-DB tree — the daemon therefore also takes a `flock` in **each configured storage root, including the hot-storage root**. The hot root matters most: its `hot/{chunk}` DBs are the only copy of recently-ingested ledgers, independently created/opened/deleted by ingestion and discard, so two daemons sharing it would corrupt or delete that sole copy even though the immutable roots are protected. --- @@ -729,7 +729,18 @@ func startStreaming(ctx context.Context, cfg Config) error { for { tip, err := networkTip(cfg) if err != nil { - tip = lastCommitted // backend unreachable: serve local, skip catch-up this pass + if lastCommitted < earliest { + // First start (no committed progress) with no reachable backend: + // we can neither catch up nor serve a local history. 
Fail until a + // real tip is available — never start serving on empty/incomplete + // history. The supervisor restarts and networkTip retries. + fatalf("network tip unavailable and no local history to serve: %v", err) + } + // Restart with local progress: serve what's already materialized + // (the window below lastCommitted is complete, by catch-up-before- + // advance) and skip catch-up this pass; a later pass with a reachable + // backend resumes it. + tip = lastCommitted } anchor := max(tip, lastCommitted) // guards a lagging bulk tip, in BOTH uses below rangeStart := chunkID(effectiveRetentionFloor(anchor, retentionChunks, earliest)) From a640d6b1a6fee70ed89bdf05f224ddbb77b6d93c Mon Sep 17 00:00:00 2001 From: tamirms Date: Tue, 16 Jun 2026 22:07:00 +0200 Subject: [PATCH 14/18] docs(full-history): count-only-ready watermark + unified recovery; address review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Watermark derivation now counts only "ready" hot keys (not value-blind). That single change lets surgical recovery uniformly demote hot keys to "transient" instead of the old demote-vs-remove split — collapsing recovery cases 3/4, dropping the watermark-inflation reasoning and the dirs-first-keys-last operator footgun. deriveWatermark's existing refinement read recovers the chunk-level frontier after a boundary crash (the one spot the ready-only count leans on opening a hot DB rather than key existence). Also: - catchupSource: bounded backend-coverage wait instead of abort/restart churn when a re-derived chunk sits above a lagging bulk tip; validateRangeProducible no longer aborts on coverage timing. - Folded the no-hot-key-corner caveat out of recovery case 3 into the Monotonic-progress assumption where it's referenced. Review fixes carried in the same pass: - Numeric first-start earliest_ledger validated against the tip before pinning. 
- earliest_ledger immutability scenario no longer claims a live config edit converges. - Cold-index 16-byte prefix collision: same-window is a fail-stop ErrDuplicateKey build failure; cross-window is harmless (verify rejects it). payloadWidth/cpi cap cross-referenced; payloadWidth lives in the streamhash header, not user metadata. - runIngestionLoop takes ctx; INV-3/INV-2 made consistent and the INV-2 audit scoped to [floor, completeThrough]; "readers resolve a ready hot DB or a frozen cold artifact, never a transient key"; read-only hot-DB handles closed before reopen/discard. Docs only; no code changes. Co-Authored-By: Claude Opus 4.8 (1M context) --- design-docs/full-history-design-explorer.html | 2 +- .../full-history-streaming-workflow.md | 168 ++++++++++++------ .../gettransaction-full-history-design.md | 16 +- 3 files changed, 121 insertions(+), 65 deletions(-) diff --git a/design-docs/full-history-design-explorer.html b/design-docs/full-history-design-explorer.html index 04729788b..d67f00365 100644 --- a/design-docs/full-history-design-explorer.html +++ b/design-docs/full-history-design-explorer.html @@ -1216,7 +1216,7 @@

Why convergence works

], cold:"end of chunk 5349", pos:"(would be end of 5350 — but the check fires first)", through:"—", wm:"FATAL: \"hot:chunk:00005350 is 'ready' but its dir is missing — hot storage lost; run surgical recovery (case 4).\"", - note:"A \"ready\" key whose dir is missing is loss, not staleness — never silently healed, because a missing dir can also mean a mount misconfiguration where auto-wiping would be wrong. Recovery is key removal, never file surgery: delete the orphaned hot keys and restart — the derived watermark then lands at the last frozen boundary (end of 5349) automatically, and re-ingestion fills the gap. No watermark to edit, because none is stored." }, + note:"A \"ready\" key whose dir is missing is loss, not staleness — never silently healed, because a missing dir can also mean a mount misconfiguration where auto-wiping would be wrong. Recovery is key demotion, never file surgery: demote the orphaned hot keys to \"transient\" and restart — the watermark (which counts only \"ready\" keys) then lands at the last frozen boundary (end of 5349) automatically, and re-ingestion fills the gap forward. No watermark to edit, because none is stored." }, ]; const host = $("#derived-presets"), viz = $("#derived-viz"); function render(p) { diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index a1f0b89d2..b2c55d518 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -116,7 +116,7 @@ One TOML file (`--config`) configures the daemon. | Key | Type | Default | Description | |---|---|---|---| | `retention_chunks` | uint32 | `0` | Retention window in chunks. `0` = full history. | -| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for. Acts as a fixed lower floor on history; combines with `retention_chunks` (the effective floor is the higher of the two). 
Must be chunk-aligned (i.e., `chunkFirstLedger` of some chunk); `"now"` resolves to `chunkFirstLedger(chunkID(networkTip()))` at first start (and requires a reachable, ready backend). Stored on first start; immutable thereafter. Setting it higher than genesis skips upfront catch-up — useful for *frontfill* deployments (`earliest_ledger = "now"`) where bringing a fast bulk source online isn't possible. The current immutability is enforced only by `validateConfig`; the rest of the system reads the value through the meta store, so a future `set-earliest-ledger` admin command would be a small change. | +| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for. Acts as a fixed lower floor on history; combines with `retention_chunks` (the effective floor is the higher of the two). Must be chunk-aligned (i.e., `chunkFirstLedger` of some chunk); `"now"` resolves to `chunkFirstLedger(chunkID(networkTip()))` at first start. A first start with `"now"` or with a numeric floor requires a reachable, ready backend — `"now"` has no other way to resolve, and a numeric floor is validated against the network tip (rejected if it is past the tip) before being pinned immutably; a genesis floor needs no tip, since genesis is always a valid lower bound. Stored on first start; immutable thereafter. Setting it higher than genesis skips upfront catch-up — useful for *frontfill* deployments (`earliest_ledger = "now"`) where bringing a fast bulk source online isn't possible. The current immutability is enforced only by `validateConfig`; the rest of the system reads the value through the meta store, so a future `set-earliest-ledger` admin command would be a small change. | | `captive_core_config` | string | **required** | Path to CaptiveStellarCore config file. 
| **[streaming.hot_storage]** @@ -209,7 +209,7 @@ For the per-chunk keys, `"freezing"` means the immutable file is being written; | `config:earliest_ledger` | `uint32` (decimal string, chunk-aligned) | On the first daemon start. Immutable thereafter — changing it currently requires wiping the data directory, until a `set-earliest-ledger` admin command exists (see [Configuration](#configuration); the floor machinery already converges for either direction). | | `config:chunks_per_txhash_index` | `uint32` (decimal string) | On the first daemon start; immutable thereafter. Startup aborts if the config value doesn't match. | -**Progress is derived, never stored — and never shared.** The hot DB's synced per-ledger WriteBatch *is* the durable commit; recording it again in the meta store would only create a second copy of the same fact, plus the ordering rule needed to keep the copy honest. Two derivations read progress back out of the catalog, one per consumer, at the two granularities they need. Both lean on one **key-creation invariant**: a `hot:chunk` key is created only after every ledger below its chunk has durably committed — at a boundary, ingestion closes chunk C's write handle *before* creating C+1's key (the [ingestion loop](#hot-db-ingestion) enforces the ordering); at startup, the resume chunk's key is created only after derivation has already run. The highest hot key therefore *is* the live chunk, and everything below it is complete. +**Progress is derived, never stored — and never shared.** The hot DB's synced per-ledger WriteBatch *is* the durable commit; recording it again in the meta store would only create a second copy of the same fact, plus the ordering rule needed to keep the copy honest. Two derivations read progress back out of the catalog, one per consumer, at the two granularities they need. 
Both lean on one **key-creation invariant**: a `hot:chunk` key is created only after every ledger below its chunk has durably committed — at a boundary, ingestion closes chunk C's write handle *before* creating C+1's key (the [ingestion loop](#hot-db-ingestion) enforces the ordering); at startup, the resume chunk's key is created only after derivation has already run. The highest hot key therefore *is* the live chunk, and everything below it is complete. The watermark derivation, though, counts only `"ready"` hot keys (refining the top chunk's exact ledger from its hot DB), so a `"transient"` key never advances the bound on its own — which is precisely what lets recovery demote any hot key to `"transient"` without disturbing the watermark (see `deriveCompleteThrough`). The lifecycle tick needs only chunk granularity — which chunks are complete, and where the sliding retention floor anchors: @@ -231,12 +231,19 @@ func deriveCompleteThrough(cat Catalog) uint32 { // is then chunkLastLedger(-1) = 1 — the pre-genesis sentinel — never a // spurious chunk-0 bound that would resume a young network past its tip. through := chunkLastLedger(highestDurableChunk(cat)) - // Positional term. hotChunkKeys returns every hot:chunk:* key regardless - // of value — counting a "transient" key is sound because it is only ever - // put after the predecessor chunk's write handle closed. When the live - // chunk is chunk 0 (a young genesis network), maxChunk-1 = -1 and - // chunkLastLedger(-1) = 1: nothing below chunk 0 is complete. - if hot := hotChunkKeys(cat); len(hot) > 0 { + // Positional term — counts only "ready" hot keys, NOT "transient" ones. A + // "transient" key marks a hot DB mid-create/mid-delete, or one a recovery + // demoted; excluding it is what lets recovery demote ANY hot key without + // inflating this bound (see [Surgical recovery](#scenario-coverage)). 
The + // one case this under-counts — a complete-but-unfrozen chunk left + // "transient" by a boundary crash — is recovered by deriveWatermark's + // refinement (below), which opens that highest ready chunk and reads its + // committed seq; the lifecycle tick never sees it, because the resume chunk + // is reopened "ready" before the first tick (and an under-count would only + // defer work anyway). When the live chunk is chunk 0 (a young genesis + // network), maxChunk-1 = -1 and chunkLastLedger(-1) = 1: nothing below + // chunk 0 is complete. + if hot := readyHotChunkKeys(cat); len(hot) > 0 { through = max(through, chunkLastLedger(maxChunk(hot)-1)) } return max(through, cat.EarliestLedger()-1) @@ -246,12 +253,18 @@ func deriveCompleteThrough(cat Catalog) uint32 { Ingestion's resume point at startup needs the exact ledger — the one consumer of sub-chunk precision: ```go -// deriveWatermark is deriveCompleteThrough refined by exactly ONE read: -// sub-chunk precision only ever matters inside the live chunk, and a lower -// ready chunk can never hold the maximum (key-creation invariant, above), so -// only the live chunk's DB is opened. Runs once, before ingestion starts — -// the only time opening a hot DB is safe (once ingestion runs, the live DB -// is held exclusively by its writer). +// deriveWatermark is deriveCompleteThrough refined by exactly ONE read of the +// highest ready hot DB. That read does two jobs: (1) sub-chunk precision inside +// the live chunk, and (2) recovering the chunk-level frontier when the +// positional term under-counts — a boundary crash can leave the live chunk +// "transient", so the highest *ready* chunk is the just-completed predecessor, +// whose completion no key now advertises; reading its maxCommittedSeq supplies +// that frontier. 
This is the one spot where the ready-only count leans on +// opening a hot DB rather than on key existence alone — and if that ready +// chunk's dir is missing we fatal (below) rather than degrade: the price of +// count-only-ready. Runs once, before ingestion starts — the only time opening +// a hot DB is safe (once ingestion runs, the live DB is held exclusively by +// its writer). func deriveWatermark(cat Catalog) uint32 { for _, c := range readyHotChunks(cat) { if !dirExists(hotChunkPath(c)) { @@ -267,7 +280,9 @@ func deriveWatermark(cat Catalog) uint32 { } w := deriveCompleteThrough(cat) if live, ok := highestReadyHotChunk(cat); ok { - w = max(w, maxCommittedSeq(openReadOnly(live))) + db := openReadOnly(live) + w = max(w, maxCommittedSeq(db)) + db.Close() // released before startup reopens the same path read-write } return w } @@ -408,14 +423,25 @@ func catchupSource(chunk ChunkID, artifacts ArtifactSet, cfg Config) LedgerSourc fatalf("hot:chunk:%08d is \"ready\" but its dir is missing — "+ "hot storage lost; run surgical recovery (case 4).", chunk) } - if db := openRocksDBReadOnly(hotChunkPath(chunk)); maxCommittedSeq(db) >= chunkLastLedger(chunk) { + db := openRocksDBReadOnly(hotChunkPath(chunk)) + if maxCommittedSeq(db) >= chunkLastLedger(chunk) { return &HotLedgers{chunk: chunk, store: db} - } // incomplete: stale leftover — fall through; the discard scan owns it + } + db.Close() // incomplete: stale leftover — close and fall through; the discard scan owns it } if cat.State(chunk, LFS) == Frozen && !artifacts.Has(LFS) { return packReader(chunk) // re-derive locally; no redundant download } - return bulkBackend(cfg) // BSB by default — see [catch_up.bsb] + // Bulk backend — the only source for a chunk with no local copy: one below + // the floor on a retention widen, or one a surgical recovery demoted. 
If the + // backend's tip lags below this chunk (a captive-core node runs ahead of its + // trailing object store), block for coverage rather than aborting — poll the + // tip on a bounded backoff and fatal with a specific error only if it never + // advances. A chunk WITH a local copy never reaches here (it took the hot or + // pack branch), so this wait never gates a normal restart whose range is + // entirely local; it fires only for genuinely backend-only chunks. + waitForBackendCoverage(cfg, chunk) // bounded; fatal on timeout + return bulkBackend(cfg) // BSB by default — see [catch_up.bsb] } ``` @@ -563,13 +589,17 @@ func buildThenSweep(b IndexBuild, cfg Config) error { ```go func runBackfill(ctx context.Context, cfg Config, rangeStart, rangeEnd ChunkID) error { - // Fail before any work if a chunk can't be produced from ANY source. This - // mirrors catchupSource's preference: a chunk needs the bulk backend only - // when it is not already durable (self-skips), not complete in a ready hot - // DB, and not re-derivable from a local .pack — so the backend must cover - // only those fall-through chunks, NOT the whole range. Load-bearing on a - // restart where rangeEnd is anchored on max(tip, lastCommitted): the span - // above a lagging bulk tip is produced locally and must not abort here. + // Fail before any work only if a fall-through chunk has NO configured source + // at all. This mirrors catchupSource's preference: a chunk needs the bulk + // backend only when it is not already durable (self-skips), not complete in + // a ready hot DB, and not re-derivable from a local .pack — so the check + // concerns only those fall-through chunks, NOT the whole range. Load-bearing + // on a restart where rangeEnd is anchored on max(tip, lastCommitted): the + // span above a lagging bulk tip is produced locally and must not abort here. 
+ // It does NOT check backend-tip COVERAGE: a fall-through chunk above a + // lagging-but-advancing backend is not doomed, only not-yet-producible, and + // catchupSource's bounded wait (rule 2) handles that per chunk rather than + // aborting the whole backfill. if err := validateRangeProducible(cfg, rangeStart, rangeEnd); err != nil { return err } @@ -636,7 +666,7 @@ Startup runs in two steps — catch up, then serve: No other preparation exists: the resume chunk's hot DB is simply reopened (steady-state restart) or created fresh, and the backfilled windows' tx hashes — the trailing window's included — are already queryable through the `.idx` files catch-up built. -**Serve-readiness is established entirely by step 1 plus the resume chunk's hot DB.** Catch-up's postcondition covers every complete in-retention chunk from durable artifacts — boundary-crash leftovers included, produced locally through `catchupSource`'s hot branch — and the only chunk it ever skips is the *partial* resume chunk, whose data lives in the hot DB startup reopens before `serveReads()` (a mid-chunk watermark can only have come from that DB). Nothing gates serving on a cleanup pass, because crash debris and downtime leftovers are reader-invisible at *every* moment of operation — readers resolve `"frozen"` keys exclusively, and the retention check masks past-floor files — so the first tick clears them concurrently with serving rather than ahead of it. The store reaches quiescence within that first tick — typically seconds after reads open, longer when it prunes a long-downtime backlog; from then on the [invariant audits](#correctness) carry their usual meaning. (The one nicety surrendered: a store so damaged that tick ops fail aborts seconds after joining the pool rather than just before — the restart loop is identical either way.) 
+**Serve-readiness is established entirely by step 1 plus the resume chunk's hot DB.** Catch-up's postcondition covers every complete in-retention chunk from durable artifacts — boundary-crash leftovers included, produced locally through `catchupSource`'s hot branch — and the only chunk it ever skips is the *partial* resume chunk, whose data lives in the hot DB startup reopens before `serveReads()` (a mid-chunk watermark can only have come from that DB). Nothing gates serving on a cleanup pass, because crash debris and downtime leftovers are reader-invisible at *every* moment of operation — a read resolves only a `"ready"` hot DB or a `"frozen"` cold artifact — never a `"freezing"`/`"pruning"`/`"transient"` key, and the retention check masks past-floor files — so the first tick clears them concurrently with serving rather than ahead of it. The store reaches quiescence within that first tick — typically seconds after reads open, longer when it prunes a long-downtime backlog; from then on the [invariant audits](#correctness) carry their usual meaning. (The one nicety surrendered: a store so damaged that tick ops fail aborts seconds after joining the pool rather than just before — the restart loop is identical either way.) Operational note — **peak disk after long downtime**: pruning runs only in the first tick's prune stage, *after* catch-up has materialized every newly-in-retention chunk, so a downtime approaching or exceeding the retention window transiently holds up to ~2× the retention footprint (the stale window plus its replacement). Size volumes accordingly, or prune stale ranges manually before restarting after very long downtime; a disk-full during catch-up otherwise aborts before the relieving prune can run, on every retry. @@ -688,10 +718,11 @@ func lastCompleteChunkAt(ledger uint32) int64 { // two ways it lies. 
It retries with bounded backoff (transient object-store // unavailability) and rejects a tip below GenesisLedger as "not ready" (an // empty / not-yet-synced backend), so an unready tip never reaches the chunk -// arithmetic where it would pin a garbage floor. Callers with a local -// substitute degrade on error — the catch-up loop falls back to lastCommitted, -// the numeric past-tip check is skipped — and only "now" resolution, which has -// no substitute, fatals. +// arithmetic where it would pin a garbage floor. The catch-up loop has a local +// substitute and degrades on error — it falls back to lastCommitted — but the +// two first-start consumers with no substitute (resolving "now", and validating +// a numeric floor against the tip before it is pinned immutably) fatal instead, +// so neither can commit an unverifiable layout. func networkTip(cfg Config) (uint32, error) { tip, err := withBackoff(func() (uint32, error) { return backendNetworkTip(cfg) }) if err != nil { @@ -786,7 +817,7 @@ func startStreaming(ctx context.Context, cfg Config) error { doorbell := make(chan struct{}, 1) go lifecycleLoop(ctx, cfg, cat, doorbell) serveReads() - return runIngestionLoop(cfg, core, hotDB, cat, doorbell) + return runIngestionLoop(ctx, cfg, core, hotDB, cat, doorbell) } ``` @@ -832,8 +863,14 @@ func validateConfig(cfg Config, cat Catalog) { } // earliest_ledger immutability. The backend tip is NOT re-sampled (it // may lag below the pinned floor — the startup loop's - // max(tip, lastCommitted) handles that). "now" is a no-op: it resolved - // once at first start and is now pinned. + // max(tip, lastCommitted) handles that). A genesis/numeric value must + // equal the stored pin or startup aborts. "now" cannot be re-resolved + // without re-sampling, so on a restart it is a deliberate no-op meaning + // "keep the pinned floor" — a frontfill deployment leaves "now" in its + // config across restarts and must not abort. 
One consequence: editing an + // existing deployment FROM genesis/numeric TO "now" is silently kept at + // the pinned floor rather than aborting (the floor is immutable either + // way); to actually move it, wipe the data dir. if cfg.EarliestLedger != "now" { want := uint32(GenesisLedger) if cfg.EarliestLedger != "genesis" { @@ -850,9 +887,11 @@ func validateConfig(cfg Config, cat Catalog) { // First start (or an incomplete prior start — no artifacts yet). Resolve // earliest_ledger, then commit BOTH layout pins in one atomic synced batch. - // The network tip is needed only to resolve "now" and as a best-effort - // sanity bound on a numeric floor; networkTip rejects an unready (< genesis) - // or unreachable tip, so neither path can pin a garbage floor. + // The network tip is required to resolve "now" and to validate a numeric + // floor against the network before pinning it — both forms therefore need a + // reachable, ready backend on first start. A genesis floor needs no tip: + // GenesisLedger is always a valid lower bound. networkTip rejects an unready + // (< genesis) or unreachable tip, so no path can pin a garbage or future floor. var earliest uint32 switch cfg.EarliestLedger { case "genesis": @@ -865,11 +904,21 @@ func validateConfig(cfg Config, cat Catalog) { earliest = chunkFirstLedger(chunkID(tip)) // <= tip, so never past the tip default: earliest = atoi(cfg.EarliestLedger) // already form-validated: parse, >= genesis, aligned - // Best-effort: reject a floor past where the network is, but skip the - // check if the backend is unreachable — the floor is well-formed and the - // catch-up loop tolerates a lagging/absent tip. - if tip, err := networkTip(cfg); err == nil && earliest > tip { - fatalf("earliest_ledger (%d) is past the current tip (%d); reject.", earliest, tip) + // A numeric floor is pinned immutably below, so it must be validated + // against a real tip FIRST — the check is mandatory, not best-effort. 
+ // Skipping it when the backend is down would let a floor AHEAD of the + // network become permanent: on a later pass the catch-up loop's + // max(tip, earliest-1) anchor collapses the backfill range to empty + // (earliest-1 >= tip), and the daemon would resume ingestion from a + // future ledger with the bad floor already pinned. Like "now", a numeric + // first-start floor therefore requires a reachable, ready backend. + tip, err := networkTip(cfg) + if err != nil { + fatalf("first start with a numeric earliest_ledger needs a reachable, "+ + "ready backend to validate the floor against the network tip: %v", err) + } + if earliest > tip { + fatalf("earliest_ledger (%d) is past the current network tip (%d); reject.", earliest, tip) } } batch := cat.NewBatch() @@ -902,12 +951,12 @@ func openHotDB(cat Catalog, hotKey, path string, create func(string) *HotDB) *Ho // The key promises a DB the filesystem doesn't have — hot storage // was lost out from under a surviving meta store (e.g. ephemeral // NVMe died). Recreating empty would silently lose the chunk's - // ledgers, so refuse: the operator deletes the orphaned hot:chunk - // keys (surgical recovery case 4) and restarts — the derived - // watermark then lands at the last frozen boundary automatically, - // and re-ingestion fills the gap. The fatal stays (rather than - // auto-healing) because a missing dir can also mean a mount - // misconfiguration, where auto-wiping state would be wrong. + // ledgers, so refuse: the operator demotes the orphaned hot:chunk + // keys to "transient" (surgical recovery case 4) and restarts — the + // watermark (count-only-ready) then lands at the last frozen boundary + // automatically, and re-ingestion fills the gap forward. The fatal + // stays (rather than auto-healing) because a missing dir can also + // mean a mount misconfiguration, where auto-wiping state would be wrong. 
fatalf("%s is \"ready\" but %s is missing — hot storage lost; "+ "run surgical recovery (case 4).", hotKey, path) } @@ -943,7 +992,8 @@ func discardHotDBForChunk(chunk ChunkID, cat Catalog) { ```go type HotLedgers struct { chunk ChunkID - store *RocksDB // opened (and verified complete) by catchupSource + store *RocksDB // opened + completeness-checked by catchupSource; closed when + // processChunk's pass ends — before the same tick's discard can rmdir the dir } func (h *HotLedgers) GetLedger(seq uint32) LedgerCloseMeta { @@ -954,8 +1004,8 @@ func (h *HotLedgers) GetLedger(seq uint32) LedgerCloseMeta { ### Hot DB Ingestion ```go -func runIngestionLoop(cfg Config, core *CaptiveCore, hotDB *HotDB, cat Catalog, - doorbell chan struct{}) error { +func runIngestionLoop(ctx context.Context, cfg Config, core *CaptiveCore, hotDB *HotDB, + cat Catalog, doorbell chan struct{}) error { notify := func() { // payload-free doorbell: non-blocking send, coalescing select { @@ -1244,7 +1294,7 @@ Two writers; readers only read. The ingestion loop is one goroutine; the lifecyc - **The ingestion loop owns the live chunk** — the highest chunk with a `hot:chunk:*` key. It is the only writer of that chunk's hot DB and the creator of each chunk's `hot:chunk:{chunk}` key (via `openHotDBForChunk` at the boundary). - **The lifecycle goroutine owns everything below the live chunk** — handed-off hot DBs (freeze + discard), all `chunk:*` and `index:*` artifact keys, and the deletion side of `hot:chunk:*` keys. -**The two goroutines share no state.** Their only connection is the payload-free doorbell, and the partition itself is encoded in the catalog: the lifecycle's derivation treats the highest hot key as the live chunk and touches only what lies below it. The handoff fence is the boundary's write order — the ingestion loop closes its write handle *before* creating the next chunk's hot key. 
Creating that key is the act that moves the partition: the instant it exists, the closed chunk lies below the live chunk and any lifecycle scan (including one already in flight from the previous notification) may freeze and discard it — by which point no writer holds it. The two goroutines never write the same meta-store key, and never touch the same per-chunk hot RocksDB instance; both do write the meta store concurrently — on disjoint keys, relying on RocksDB's thread safety for the instance itself. The derivation is monotonic within the run (hot keys and frozen keys only advance), so a tick racing a boundary only under-approximates eligibility — work deferred to the next tick, never incorrect work. Readers hold their own read-only handles and resolve files through meta-store keys, so writer-side activity never races them. (The serving side will also need a notion of current progress — the [reader retention contract](#reader-retention-contract) bounds every read by the retention window — but how readers obtain it is the query-routing design's concern, not this doc's.) +**The two goroutines share no state.** Their only connection is the payload-free doorbell, and the partition itself is encoded in the catalog: the lifecycle treats the highest `hot:chunk` key — *any* value — as the live chunk and touches only what lies below it. (This ownership boundary is value-blind: any `hot:chunk` key marks an owned chunk. Only the *watermark* derivation counts `"ready"` keys exclusively — a distinct concern, [defined earlier](#meta-store-keys).) The handoff fence is the boundary's write order — the ingestion loop closes its write handle *before* creating the next chunk's hot key. Creating that key is the act that moves the partition: the instant it exists, the closed chunk lies below the live chunk and any lifecycle scan (including one already in flight from the previous notification) may freeze and discard it — by which point no writer holds it. 
The two goroutines never write the same meta-store key, and never touch the same per-chunk hot RocksDB instance; both do write the meta store concurrently — on disjoint keys, relying on RocksDB's thread safety for the instance itself. The derivation is monotonic within the run (hot keys and frozen keys only advance), so a tick racing a boundary only under-approximates eligibility — work deferred to the next tick, never incorrect work. Readers hold their own read-only handles and resolve files through meta-store keys, so writer-side activity never races them. (The serving side will also need a notion of current progress — the [reader retention contract](#reader-retention-contract) bounds every read by the retention window — but how readers obtain it is the query-routing design's concern, not this doc's.) ### One boundary, end to end @@ -1302,12 +1352,12 @@ The **retention window** is `[effectiveRetentionFloor, last_committed_ledger]`. **INV-2 (single canonical state).** The meta-store records one home for each data range: - **at most one `"frozen"` index key per window — at all times**, quiescent or not (the commit batch promotes and demotes in one write; this is what makes "the window's index" well-defined for readers); -- at quiescence, no artifact key anywhere is `"freezing"` or `"pruning"` — index transients are swept by the tick that observes them; per-chunk `"freezing"` keys are repaired by re-materialization (the plan stage, for chunks within `[floor, completeThrough]`, from whichever source `catchupSource` selects) and `"pruning"` keys are finished by the sweeps. 
One reachable exception: after hot-volume loss combined with a lagging bulk-backend tip, a partially-frozen chunk *above* the derived watermark can hold `"freezing"` keys at served quiescence — it lies outside every plan range (above `completeThrough`), and its ledgers exist nowhere any source can reach — until re-ingestion replays the chunk minutes later; it sits outside the retention window throughout, so no read can observe it; +- at quiescence, no artifact key anywhere is `"freezing"` or `"pruning"` — index transients are swept by the tick that observes them; per-chunk `"freezing"` keys are repaired by re-materialization (the plan stage, for chunks within `[floor, completeThrough]`, from whichever source `catchupSource` selects) and `"pruning"` keys are finished by the sweeps. One reachable exception: after hot-volume loss, a partially-frozen chunk *above* the derived watermark can hold `"freezing"` keys at served quiescence — it lies above `completeThrough` (outside every plan range and the retention window, so no read can observe it) until re-ingestion replays it forward from the last frozen boundary and re-freezes it, minutes later; - hot DB keys add one tolerated in-flight transient: `"transient"` brackets a directory operation in progress (the boundary's `openHotDBForChunk`, startup's resume-chunk open, a discard mid-op) and can be observed while the lifecycle sits idle between ticks; a crash-left bracket is finished by the next `openHotDB` or discard scan; - at quiescence, no `hot:chunk:c` key for a chunk `c` whose artifacts are all durable *and* whose window's index covers `c` (the chunk is fully served by cold artifacts, so the hot DB must be gone); - at quiescence, no `chunk:c:txhash` key for a chunk `c` in a window whose frozen index key is terminal (the terminal commit demoted them; the sweep removed them; the prune scan's redundant-input branch demotes any that a crashed widening re-froze or left mid-freeze). 
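The INV-2 clauses above lend themselves to a mechanical audit over a snapshot of meta-store entries. The following is a minimal sketch, not the daemon's audit code: the function name `auditINV2`, the `index:{window}:{coverage}` key layout, and the `lfs` artifact name are assumptions made for illustration. It checks only two clauses (at most one `"frozen"` index key per window at all times; no `"freezing"` or `"pruning"` artifact key at quiescence) and does not model the above-watermark exception after hot-volume loss.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// auditINV2 reports violations of two INV-2 clauses in a snapshot of
// meta-store entries (key -> state). The key layout is an ASSUMPTION for
// illustration only:
//
//	index:{window}:{coverage}  and  chunk:{chunk}:{artifact}
//
// hot:chunk:* keys are deliberately ignored: a "transient" hot key is the
// one in-flight state INV-2 tolerates.
func auditINV2(entries map[string]string, quiescent bool) []string {
	var violations []string
	frozenPerWindow := map[string]int{}

	for key, state := range entries {
		parts := strings.Split(key, ":")
		isArtifact := parts[0] == "index" || parts[0] == "chunk"

		// Clause 1 (holds at all times): count frozen index keys per window.
		if parts[0] == "index" && len(parts) >= 2 && state == "frozen" {
			frozenPerWindow[parts[1]]++
		}
		// Clause 2 (holds only at quiescence): no transient artifact keys.
		if quiescent && isArtifact && (state == "freezing" || state == "pruning") {
			violations = append(violations,
				fmt.Sprintf("%s is %q at quiescence", key, state))
		}
	}
	for window, n := range frozenPerWindow {
		if n > 1 {
			violations = append(violations,
				fmt.Sprintf("window %s has %d frozen index keys", window, n))
		}
	}
	sort.Strings(violations) // deterministic order for reporting
	return violations
}

func main() {
	snapshot := map[string]string{
		"index:00000003:a":   "frozen",
		"index:00000003:b":   "frozen",    // second frozen key in the same window
		"chunk:00000041:lfs": "freezing",  // a violation only at quiescence
		"hot:chunk:00000042": "transient", // tolerated hot-key bracket
	}
	for _, v := range auditINV2(snapshot, true) {
		fmt.Println(v)
	}
}
```

Passing `quiescent=false` keeps only the per-window check, which matches the text's distinction between the clause that holds at all times and the clauses that hold only at quiescence.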
-**INV-3 (disk matches meta-store).** At quiescence, the set of artifact files and hot DB directories on disk equals exactly the set the meta-store specifies. Every key in a final state names exactly one expected path; the disk holds those paths and no others — no orphan files, no dangling keys, no duplicate artifacts. By INV-2 every artifact key at served quiescence *is* in a final state — the hot-key `"transient"` bracket around an in-flight directory operation is the one tolerated exception — so the correspondence is exact, with no tolerance carve-outs for artifacts: a non-key-named file in an index window dir is a real bug, not mid-tick debris. +**INV-3 (disk matches meta-store).** At quiescence, the set of artifact files and hot DB directories on disk equals exactly the set the meta-store specifies. Every key names exactly one expected path, and the mark-before-write rule keeps even a partial file reachable from its key — so the correspondence holds whether a key is in a final state or in one of the transients INV-2 tolerates (the hot-key `"transient"` bracket around an in-flight directory operation; the above-watermark `"freezing"` artifact key left by hot-volume loss with a lagging tip). The disk holds those paths and no others — no orphan files, no dangling keys, no duplicate artifacts: a non-key-named file in an index window dir is a real bug, not mid-tick debris. **INV-4 (retention bound).** At quiescence, no file or meta-store key maps to a ledger range strictly below the effective retention floor. @@ -1328,7 +1378,7 @@ Properties we rely on the underlying storage to provide: - **Sync WAL.** All meta-store puts and deletes that the invariants depend on use RocksDB's `WriteOptions.sync = true`, which fsyncs the WAL before the write returns. Multi-key commits — the index commit batch, the sweeps' key-delete batches — are single atomic synced WriteBatches: all-or-nothing across keys. 
- **Per-ledger durability.** The chunk hot DB's synced WriteBatch (atomic across all CFs) is the sole per-ledger durability boundary; the watermark is derived from it, so no cross-store ordering exists to maintain. Per-artifact: the per-chunk file **and its directory entry** are fsynced before its key flips to `"frozen"`, and an index coverage's `.idx` (and its dir entry) is fsynced before the commit batch freezes its key. - **Deterministic, idempotent writes.** Re-applying any write produces byte-identical state. Backed by deterministic LCM bytes from any conformant LedgerBackend and a byte-identical streamhash index from byte-identical sorted inputs. -- **Monotonic progress.** Within a process run, ingestion only moves forward (each synced batch extends the last), and the lifecycle's derived `completeThrough` only advances with it (hot keys and frozen keys move forward, never back). Across a crash, the startup derivation equals exactly the durable state — the pre-crash value or marginally above it (a batch that committed in the instant before the crash); it sits *below* the pre-crash value only when hot state was removed or lost, or when surgery demoted keys feeding the cold term (case 3's no-hot-key corner). There is no stored watermark to rewind; surgical recovery shrinks the derivation's inputs by demoting or removing state, not by editing a counter. +- **Monotonic progress.** Within a process run, ingestion only moves forward (each synced batch extends the last), and the lifecycle's derived `completeThrough` only advances with it (hot keys and frozen keys move forward, never back). 
Across a crash, the startup derivation equals exactly the durable state — the pre-crash value or marginally above it (a batch that committed in the instant before the crash); it sits *below* the pre-crash value only when hot state was lost or demoted to `"transient"`, or when — on a daemon interrupted during its first backfill, before any live ingestion — recovery demotes a finished window's index for rebuild: with no hot DBs to anchor the watermark, it drops below that whole window until catch-up rebuilds the index, re-deriving the untainted chunks' inputs from their on-disk `.pack`s and re-fetching only the tainted chunks. There is no stored watermark to rewind; surgical recovery shrinks the derivation's inputs by demoting state, not by editing a counter. ### Design invariants @@ -1344,12 +1394,12 @@ These are streaming-specific properties the implementation guarantees on top of ### Scenario coverage -INV-1 holds at every point the daemon is serving reads — transient states are never externally visible, because readers resolve `"frozen"` keys exclusively and the retention check masks everything else. INV-2, INV-3, and INV-4 hold at every quiescence reached after the events below; startup's first quiescence arrives when the first tick completes, shortly after reads open. +INV-1 holds at every point the daemon is serving reads — transient states are never externally visible, because a read resolves only a `"ready"` hot DB or a `"frozen"` cold artifact — never a `"freezing"`/`"pruning"`/`"transient"` key, and the retention check masks everything else. INV-2, INV-3, and INV-4 hold at every quiescence reached after the events below; startup's first quiescence arrives when the first tick completes, shortly after reads open. 1. **Steady-state operation.** Hot DB ingestion advances `last_committed_ledger`; the lifecycle goroutine freezes complete chunks within retention and prunes anything past it. All four invariants hold by induction on `last_committed_ledger`. -2. 
**Operator state changes** — retention widening or shortening (`retention_chunks`), `earliest_ledger` raised. Both reduce to "`effectiveRetentionFloor` recomputes; the next startup converges to the new state." Catch-up's per-window resolver rule re-derives and rebuilds any window whose desired coverage now exceeds its stored coverage; the prune stage removes anything below a raised floor. The "next startup" is load-bearing for widening, enforced by the floor's two-role split: a lowered floor takes effect immediately in its *retention* role (pruning simply stops sooner), but the tick's *production* range still starts at existing storage — only the next catch-up, behind `validateRangeProducible`, materializes the new bottom. -3. **Surgical recovery, frozen-range case** (tainted range strictly below `chunkID(last_committed_ledger)`). The operator never touches the filesystem. Recovery is **one atomic meta-store batch**: every `chunk:{c}:*` key in the tainted range and every `index:*` key of every window overlapping it → `"freezing"` — the state that already means *this file is not to be trusted: re-derive or delete* — and any leftover `hot:chunk` key in the range (a crash between a tick's build and its discard stage leaves a frozen chunk's hot DB behind) → `"transient"`, which makes it instantly ineligible as a source (`catchupSource` reads only `"ready"`). The batch commits atomically or not at all, so there is no interruption analysis and re-running it is a no-op; the meta store's lock means it can only be written against a stopped daemon. 
Every demoted key then converges through machinery that already exists: on restart, catch-up re-derives the freezing chunk artifacts from a conformant LedgerBackend — overwriting the tainted files in place, rule 1's ordinary re-materialization — and rebuilds each window's index, re-marking the freezing index key whose coverage is still desired (or leaving it to the prune scan's sweep-on-sight when retention has moved past it); the discard scan retires the transient hot key once the rebuilt chunk regains coverage — or past retention, after long downtime — unlinking the tainted hot DB unread. `last_committed_ledger` is exactly unchanged whenever ingestion has ever started: the batch keeps every hot key, the positional term counts them value-blind, and the live chunk's `"ready"` DB restores the watermark precisely. Only in the no-hot-key corner — the daemon stopped during initial catch-up, before ingestion opened its first hot DB — does the cold term lead the derivation, and there it can regress through every untainted chunk of a finalized window the taint overlaps (their `.bin` keys were swept at finalization, and the demoted index key breaks coverage), bounded by one window; catch-up's `max(tip, lastCommitted)` anchor re-derives forward regardless. -4. **Surgical recovery, partial-tail-chunk case** (tainted range includes the live chunk), and equally **hot-volume loss**. The same artifact-key batch as case 3, but hot keys are **removed**, not demoted — case 3's leftover hot DB holds data that exists elsewhere and sits below the watermark, so a demoted key is harmlessly reclaimed; here the hot DB *was* the data, and a kept key either re-fires the fatal or silently inflates the watermark. Remove **every** `hot:chunk` key whose dir is missing or whose contents are partial. A half-recovery that keeps a `"ready"` key merely relocates the fatal to the next restart. 
One that keeps the boundary-crash `"transient"` key — the next chunk's key, sitting above the last frozen boundary — is recoverable but worse than compliance: the positional term counts all hot keys while the fatal checks only `"ready"` ones, so the kept key silently props the derived watermark up to the lost chunk's end, pulling the lost chunk into catch-up's anchored range, which re-derives it from the bulk source — stalling startup until that source has the chunk, where full removal would have resumed from the last frozen boundary immediately. Remove any surviving dirs along with the keys — dirs first, keys last: an interrupted pass then leaves keys whose dirs are missing, which the next startup catches (the fatal for `"ready"`, the watermark-prop analysis above for `"transient"`), never a keyless orphan dir no key-driven scan can find. The hot DB is the only copy of its ledgers — discarding it loses them, and the **derived watermark admits as much automatically**: with the hot DB gone, derivation lands at the last frozen boundary, and the next startup re-ingests from there. There is no watermark to edit; recovery is key removal, never file surgery beyond the lost dirs themselves. +2. **Operator state changes** — widening or shortening retention (`retention_chunks`). A `retention_chunks` change reduces to "`effectiveRetentionFloor` recomputes; the next startup converges to the new state": catch-up's per-window resolver rule re-derives and rebuilds any window whose desired coverage now exceeds its stored coverage, and the prune stage removes anything below a raised floor. The "next startup" is load-bearing for widening, enforced by the floor's two-role split: a lowered floor takes effect immediately in its *retention* role (pruning simply stops sooner), but the tick's *production* range still starts at existing storage — only the next catch-up, behind `validateRangeProducible`, materializes the new bottom. 
**`earliest_ledger` is not a live operator change**: it is pinned on first start and immutable — `validateConfig` aborts on any later genesis/numeric value that differs from the pin, and treats `"now"` as the pinned floor (see [Configuration](#configuration)) — so a plain config edit never moves the floor. The same floor machinery *would* converge for either direction once a future `set-earliest-ledger` admin command demotes the pin; until then the only supported way to change it is wiping the data directory, which is simply a fresh first start. +3. **Surgical recovery (tainted data).** The operator never touches the filesystem. Recovery is **one atomic meta-store batch** that *demotes* the affected keys — never removes — split by tier: tainted cold artifacts (`chunk:{c}:*` and every overlapping `index:*` key) → `"freezing"`, the state that already means *this file is not to be trusted: re-derive or delete*; tainted or lost hot DBs (`hot:chunk`, the live chunk's included) → `"transient"`, instantly ineligible as a source (`catchupSource` reads only `"ready"`) and ignored by the watermark, which counts only `"ready"` keys. The batch commits atomically or not at all, so there is no interruption analysis and re-running it is a no-op; the meta store's lock means it can only be written against a stopped daemon. Everything converges through machinery that already exists: catch-up re-derives the `"freezing"` cold artifacts from a conformant LedgerBackend — overwriting in place, rule 1's ordinary re-materialization — and rebuilds each window's index (if the backend tip lags below a re-derived chunk, `catchupSource` waits for coverage rather than aborting — see [catch-up primitives](#catch-up-primitives)); the `"transient"` hot DBs need no file surgery — `openHotDB` wipes and recreates one when re-ingestion re-opens that chunk, and the discard scan retires any sitting below the live chunk. 
Demoting hot DBs is **self-correcting for `last_committed_ledger`** because the watermark ignores `"transient"`: a demotion that reaches the live chunk rewinds the watermark to the last frozen boundary, and captive core re-ingests the un-frozen tail **forward**, never through the lagging bulk backend; a demotion strictly below the live chunk leaves the watermark unchanged (those chunks aren't the highest `"ready"` key, and the live chunk's `"ready"` DB still pins it). This uniformity is what replaces the old demote-vs-remove / above-or-below-the-live-chunk split — every recovery demotes, nothing is removed by hand, so there is no dirs-first-keys-last ordering for an operator to get wrong; the daemon's own sweeps and `openHotDB` handle the dirs in their existing crash-safe order. +4. **Hot-volume loss.** The hot-tier demotion above, triggered by loss rather than taint: the hot storage tree is gone (e.g. ephemeral NVMe died) while the meta store survives, so its `hot:chunk` keys read `"ready"` with missing dirs. `deriveWatermark`/`openHotDB` fatal on that mismatch — deliberately, since a missing dir can also be a mount misconfiguration where auto-wiping would destroy state — and point the operator at recovery. The operator demotes the orphaned `hot:chunk` keys to `"transient"` (the case-3 batch, hot tier only). On restart the fatal no longer fires (it checks `"ready"` keys), the watermark falls to the last frozen boundary (the cold artifacts survive on durable storage), and captive core re-ingests the lost tail **forward**. The hot DB was the only copy of its un-frozen ledgers — losing it loses them — and the **derived watermark admits as much automatically**: with those keys no longer `"ready"`, derivation lands at the last frozen boundary and re-ingestion fills from there. There is no watermark to edit, and the dirs are already gone, so recovery is pure key demotion. 5. 
**First deployment / downtime between restarts.** `last_committed_ledger` derives to `max(frozen/hot maxima, earliest_ledger - 1)`, ensuring `resumeLedger ≥ earliest_ledger`. Backfill fills `[earliest_ledger, lastCompleteChunkAt(network_tip)]` if needed (a no-op for `earliest_ledger = "now"` first deployment). 6. **LedgerBackend choice or mid-flight swap.** The LedgerBackend contract guarantees canonical LCM bytes for any range, so any conformant backend produces byte-identical artifacts. Different backends differ in performance, not behavior. An operator using BSB for backfill and CaptiveCore for hot DB ingestion, or swapping mid-deployment, satisfies all four invariants. 7. **Crash at any point during any of the above.** Sync WAL plus per-ledger durability ordering mean the meta store on next start is internally coherent and the derived watermark equals exactly what the last synced batch committed. Idempotency means re-running any half-finished op is safe. Convergence finishes whatever the crash interrupted. @@ -1361,7 +1411,7 @@ The invariants describe what storage should look like, not how the phase scans m - **A meta-store key claims something the file doesn't actually deliver** — e.g., a per-chunk writer flips a key to `"frozen"` before fsync (leaving a partial file the meta store advertises as complete), or an index key freezes before its `.idx` is fully fsynced, or the key name's `{lo, hi}` doesn't match the file's actual coverage, or a frozen file is mutated post-freeze ⟹ reads through the meta key see wrong or missing data. **INV-1** violated. Detectable by re-deriving an artifact via a conformant LedgerBackend and byte-comparing against the on-disk file. - **Pruning too aggressive** ⟹ a request whose ledger scope is in retention returns wrong or missing results. Issue a read to find it. **INV-1** violated. 
- **Two frozen index keys in one window** — a commit batch failed to demote the predecessor, or promotion and demotion landed as separate writes ⟹ readers have no well-defined index. Walk `index:*` keys, count `"frozen"` per window. **INV-2** violated. -- **A `"freezing"` or `"pruning"` key survives served quiescence** ⟹ its recovery mechanism was skipped — an index transient the sweeps should have deleted, a `"pruning"` demotion the sweeps should have finished, or a per-chunk `"freezing"` key that the freeze phase or startup catch-up should have re-materialized. Walk keys for transient values at quiescence. **INV-2** violated. +- **A `"freezing"` or `"pruning"` key within `[floor, completeThrough]` survives served quiescence** ⟹ its recovery mechanism was skipped — an index transient the sweeps should have deleted, a `"pruning"` demotion the sweeps should have finished, or a per-chunk `"freezing"` key that the freeze phase or startup catch-up should have re-materialized. Walk keys for transient values at quiescence, excluding the one corner INV-2 tolerates — a `"freezing"` artifact key *above* `completeThrough` after hot-volume loss with a lagging tip, which no source can yet repair. **INV-2** violated. - **Chunk scan misses an orphan** ⟹ a hot DB persists for a chunk that cold artifacts fully serve. Walk `hot:chunk:c` keys whose chunk has its artifacts durable and its window's index covering `c`. **INV-2** violated. - **Finalization demotions don't complete** ⟹ per-chunk frozen tx hash files outlive the index that consumed them. Walk `chunk:c:txhash` keys whose window's frozen key has `hi` = the window's last chunk. **INV-2** violated. - **A writer leaves a file on disk without its meta-store key** (file fsynced before key was durable, or a sweep deleted the key before its unlink was durable) ⟹ orphan file — invisible to every key-driven scan. Walk the filesystem against the meta-store. **INV-3** violated. 
diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md index 8094db429..0899db3e4 100644 --- a/design-docs/gettransaction-full-history-design.md +++ b/design-docs/gettransaction-full-history-design.md @@ -14,7 +14,7 @@ The daemon context — chunk geometry, the meta store, the one write protocol, c Serve `getTransaction(hash)` for any transaction whose ledger falls within the retention window (full history by default): -- **Complete.** Every transaction in every in-retention ledger is resolvable by hash. No gaps, including across crashes, restarts, and retention changes. +- **Complete.** Every transaction in every in-retention ledger is resolvable by hash — no gaps, including across crashes, restarts, and retention changes — with one quantified residual: the cold index keys on streamhash's 128-bit routing key (the hash's first 16 bytes, §6.1), so two distinct in-retention hashes that share a 16-byte prefix *and fall in the same window* collide as one index key. That is a ~10⁻²⁰-per-dense-window event, and it is *fail-stop* rather than silent — the window's index build fails loudly rather than dropping or mis-resolving a transaction. (A shared prefix across *different* windows is harmless: the two never meet in one build, and the read path's verify rejects the resulting cross-window false positive — §8.2.) The guarantee is otherwise absolute. - **Correct.** A lookup never returns the wrong transaction; a missing or out-of-retention one returns not-found. - **No in-memory index.** The hash→seq map is on-disk `.idx` files (read through the page cache), not a RAM-resident structure sized to the transaction count — so the daemon holds no memory proportional to the number of transactions in history. (A lookup probes one `.idx` per in-retention window — a hash carries no window hint; that probe set and its cost are the query-routing design's concern.) 
- **Cheap to maintain.** Ingestion adds negligible cost to the per-ledger write path, and the cold index stays current with a rebuild that is small relative to its cadence. @@ -34,7 +34,7 @@ The subsystem this document owns is the **hash → seq map**, plus the read-path - **Point lookups only.** There are no range or prefix queries over tx hashes, so order-preserving structures buy nothing — perfect-hash structures apply. - **Hashes are uniform and immutable.** A transaction hash is never updated and corresponds to at most one applied transaction (the network's replay protection); the map is append-only, one batch of entries per ledger. -- **The full transaction is always fetched anyway.** The response needs the envelope/result/meta, so the read path always ends with the ledger store and can verify the full hash against the fetched transaction. The map therefore doesn't need to be exact — only *complete* (no false negatives); false positives are screened first by a fingerprint and finally by the fetch-and-verify step. +- **The full transaction is always fetched anyway.** The response needs the envelope/result/meta, so the read path always ends with the ledger store and can verify the full hash against the fetched transaction. The map therefore doesn't need to be exact — only *complete* (no false negatives; the one residual, a same-window 16-byte prefix collision, is fail-stop rather than a silent miss — §8.2); false positives are screened first by a fingerprint and finally by the fetch-and-verify step. --- @@ -116,7 +116,7 @@ uint64 LE entry count entry × count 20 bytes each: [key: 16][seq: 4 LE] ``` -- `key` is the **first 16 bytes of the transaction hash**. +- `key` is the **first 16 bytes of the transaction hash** — streamhash's routing-key width: it derives the perfect-hash slot from these 16 bytes alone, both when this run is built and when a lookup probes (§8.1). 
Two distinct in-retention hashes sharing these 16 bytes are a true duplicate key only when they land in the *same* window's build; across windows the shared prefix is harmless. Both cases, and their handling, are §8.2. - Entries are sorted ascending by the **big-endian `uint64` prefix of `key`**. The `.bin` is a *map-side-sorted run*, never a serving tier. Sorted runs are what make the rolling rebuild cheap: the index builder consumes them in a single streaming k-way merge, instead of the two passes (count, then add) it needs over unsorted input. @@ -127,7 +127,7 @@ A `.bin` lives as long as its window needs it as rebuild input: every boundary r `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, meta-store key `index:{window:08d}:{lo:08d}:{hi:08d}`. One streamhash minimal-perfect-hash file per **coverage**, built by streamhash's `SortedBuilder` over the k-way merge of `.bin[lo..hi]`, with the cold-txhash option set: -- **Payload: `payloadWidth` bytes** — the ledger seq stored as an offset from `MinLedger`, where `MinLedger = chunkFirstLedger(lo)` is derived from the build range. The width is sized to the window so the format never caps `chunks_per_txhash_index`: `payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8)`, the bytes needed to hold the largest in-window offset (`chunks_per_txhash_index * 10_000 - 1`). At the default 1000 chunks (10M ledgers) this is **3 bytes** — a 24-bit offset spans 16.77M ledgers — and a window of 1678+ chunks (>16.77M ledgers) widens it to 4. Since `chunks_per_txhash_index` is immutable once stored, the width is fixed for every window's life; like `MinLedger`, it is embedded in the file as user metadata and read back at lookup time. No sidecar metadata. +- **Payload: `payloadWidth` bytes** — the ledger seq stored as an offset from `MinLedger`, where `MinLedger = chunkFirstLedger(lo)` is derived from the build range. 
The width is sized to the window so the format never caps `chunks_per_txhash_index`: `payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8)`, the bytes needed to hold the largest in-window offset (`chunks_per_txhash_index * 10_000 - 1`). At the default 1000 chunks (10M ledgers) this is **3 bytes** — a 24-bit offset spans 16.77M ledgers — and a window of 1678+ chunks (>16.77M ledgers) widens it to 4. (The format imposes no upper bound of its own; `chunks_per_txhash_index` is independently capped at `MaxChunksPerTxhashIndex` ≈ 429,496 — `floor(2³²/10_000)` — by the streaming doc's `validateConfig` so a window's ledger span always fits a uint32 seq, which is why the offset never needs more than 4 bytes.) Since `chunks_per_txhash_index` is immutable once stored, the width is fixed for every window's life; streamhash records it in the index file header (recovered when the file is opened), while `MinLedger` — which streamhash itself does not model — rides in the user-metadata slot. Both are read back at lookup time; no sidecar metadata. - **Fingerprint: `fpWidth` bytes (default 1)** — a streamhash option screening foreign keys before fetch-and-verify. Since a hash lookup probes every in-retention window (§8.2), a wider fingerprint trades index size (+1 byte/tx) for fewer false-positive fetches across those windows. Fixed per build, like `payloadWidth`. All-in, at the default 3-byte payload the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. A window past the 4-byte payload threshold adds one byte per transaction. @@ -299,7 +299,13 @@ respond on the confirmed hit; not-found if no window confirms Because the hash belongs to at most one window, **at most one window confirms**; a not-found lookup — a non-existent or not-yet-ingested hash — confirms none and must rule out every in-retention window. 
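The `payloadWidth` sizing rule above can be checked with a small Go sketch. This is an illustration only, not the daemon's code: the function name `payloadWidth` and the constant `ledgersPerChunk` are hypothetical, and the sketch uses an integer bit-length in place of the floating-point `ceil(log2(...) / 8)`. The two agree because the width is defined as the bytes needed to hold the largest in-window offset. It assumes `chunks_per_txhash_index` is at least 1.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ledgersPerChunk mirrors the 10_000 ledgers-per-chunk factor in the formula.
// The constant name is hypothetical.
const ledgersPerChunk = 10_000

// payloadWidth returns the number of bytes needed to store the largest
// in-window ledger offset, chunks*10_000 - 1. Assumes chunks >= 1.
func payloadWidth(chunksPerTxhashIndex uint64) int {
	maxOffset := chunksPerTxhashIndex*ledgersPerChunk - 1
	// bits.Len64 is the minimum number of bits needed to represent maxOffset;
	// round that up to whole bytes.
	return (bits.Len64(maxOffset) + 7) / 8
}

func main() {
	// Default window, the last 3-byte window, the first 4-byte window,
	// and the documented cap on chunks_per_txhash_index.
	for _, chunks := range []uint64{1000, 1677, 1678, 429_496} {
		fmt.Printf("chunks=%d payloadWidth=%d\n", chunks, payloadWidth(chunks))
	}
}
```

The threshold falls between 1677 and 1678 chunks because 16,770,000 ledgers still fit under 2^24 = 16,777,216 while 16,780,000 do not.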
-The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. The same verify means a 16-byte prefix collision between two real transactions — a ~10⁻²⁰-per-window event (birthday bound over ~3×10⁹ keys against 2¹²⁸), accepted as a negligible risk — can never serve the *wrong* transaction. +The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. + +A **16-byte prefix collision between two distinct in-retention transactions** has two cases, and only one bounds completeness. The cold index keys on streamhash's 128-bit routing key (§6.1), so two hashes sharing their first 16 bytes are indistinguishable *to a single window's build*. + +*Different windows* — the more likely of the two, since a shared prefix is far more apt to straddle two of history's windows than to fall inside one. Each transaction keys into its own window's `.idx`, so neither build sees a duplicate and both transactions resolve normally. The collision shows up only as a fingerprint false-positive when a lookup probes the *other* window: that window's MPHF maps the shared prefix to its own resident transaction, the fingerprint (also derived from those 16 bytes) matches, and the fetch-and-verify above rejects it because the full 32-byte hashes differ. This is exactly the foreign-key path the verify already exists for — one wasted ledger fetch, no wrong answer and no false negative. + +*Same window* — the genuine residual. 
The two are a single key to that window's builder, so streamhash rejects the duplicate at build time (`ErrDuplicateKey`) and the window's build fails **loudly and deterministically** — never a silently dropped transaction (the false not-found a keep-one-payload index would produce), and never, thanks to the verify, a wrong transaction. This same-window case is the sole bound on completeness; the birthday bound over a dense window's ~3×10⁹ keys against 2¹²⁸ puts it at ~10⁻²⁰ per window — a cryptographic-scale probability accepted as negligible, comparable to the undetected storage bit-errors the design likewise does not defend against. The design carries no per-key full hash or collision list to drive it to zero, because at 10⁻²⁰ that machinery would cost far more than the risk it removes. **Probe ordering, parallelism, early-stop, and the resulting latency and I/O are the query-routing design's concern** (§8.1), out of scope here. From c586667a150777a724434596710b2a944c77f438 Mon Sep 17 00:00:00 2001 From: tamirms Date: Thu, 18 Jun 2026 14:49:12 +0200 Subject: [PATCH 15/18] docs(full-history): readability + simplification pass; explorer widgets Rewrite the full-history design docs in the plain, concept-first voice: lead with the plain concept and rely on the pseudocode instead of re-narrating it, drop defensive asides about alternatives that were never pursued, and define terms before first use. Rename for consistency throughout: meta store -> catalog, catch-up -> backfill, lifecycle tick -> lifecycle run, quiescent -> settled. Reorder the streaming doc's Backfill section top-down (postcondition-driven planning -> primitives -> execution model) and tighten Correctness; the gettransaction doc is cut to storage/rebuild/query only, with the crash-safety machinery left to the streaming doc. 
Re-sync the interactive explorer and add two widgets: an executePlan dependency graph (worker semaphore + done-channel barrier, with a failure-cascade toggle) and a Startup backfill-loop walkthrough that runs the real startStreaming arithmetic. Fix dead cross-doc anchors left by the gettransaction reorganization. Docs only. Co-Authored-By: Claude Opus 4.8 (1M context) --- design-docs/full-history-design-explorer.html | 671 ++++++--- .../full-history-streaming-workflow.md | 1323 ++++++----------- .../gettransaction-full-history-design.md | 245 +-- 3 files changed, 985 insertions(+), 1254 deletions(-) diff --git a/design-docs/full-history-design-explorer.html b/design-docs/full-history-design-explorer.html index d67f00365..b0dd35f1c 100644 --- a/design-docs/full-history-design-explorer.html +++ b/design-docs/full-history-design-explorer.html @@ -193,7 +193,7 @@ .chunk .cov { position: absolute; left: 0; right: 0; bottom: -7px; height: 3px; border-radius: 2px; background: var(--frozen); } @keyframes pulse { 0%,100% { box-shadow: 0 0 0 0 rgba(63,185,143,.35);} 50% { box-shadow: 0 0 0 5px rgba(63,185,143,0);} } .roll-log { font-family: var(--mono); font-size: 0.76rem; color: var(--muted); background: #11161f; border: 1px solid var(--line); border-radius: 8px; padding: 10px 12px; margin-top: 12px; max-height: 170px; overflow-y: auto; } -.roll-log .tick { color: var(--head); } +.roll-log .run { color: var(--head); } .roll-log div { padding: 1px 0; } .badge { font-size: 0.68rem; border-radius: 999px; padding: 1px 8px; border: 1px solid; } .badge.final { color: var(--ready); border-color: var(--ready); } @@ -266,7 +266,8 @@ Derived progress The rolling index A boundary, end to end - Catch-up & the resolver + Backfill & the resolver + Startup Concurrency Reader contract Correctness @@ -277,7 +278,7 @@
Full-History RPC · Interactive Design Explorer

The Full-History Streaming Design

- How the full-history daemon catches up, ingests live ledgers, freezes immutable history, + How the full-history daemon backfills old history, ingests live ledgers, freezes immutable history, and serves transactions by hash — explained with interactive models you can poke at. Companion to the streaming workflow and getTransaction design docs; the markdown remains the @@ -288,15 +289,15 @@

The Full-History Streaming Design

The big picture

- +

Full-history RPC runs as one daemon in one mode. There is no separate backfill command and no - explicit catch-up step: on startup the daemon figures out how far behind the network tip it is and - catches up automatically; once caught up, it serves live ledgers as they're produced. + explicit step for the operator: on startup the daemon figures out how far behind the network tip it is + and backfills to it automatically, then serves live ledgers as they're produced.

-
startup

1 · Catch up

-

Runs bulk catch-up as a subroutine: any chunk inside the retention window that isn't already +

startup

1 · Backfill

+

Runs bulk backfill as a subroutine: any chunk inside the retention window that isn't already frozen is pulled from the configured LedgerBackend (BSB by default) — skipping the tip chunk that captive core is actively ingesting. Covers first-ever start, downtime gaps, and retention widening.

steady state

2 · Ingest

@@ -304,14 +305,14 @@

The big picture

ledgers, tx hashes, and events as column families, written as one atomic synced WriteBatch per ledger. A ledger is either fully in the hot DB or absent.

steady state

3 · Freeze & prune

-

A background goroutine wakes on each chunk boundary and runs one tick: freeze the completed +

A background goroutine wakes on each chunk boundary and performs one lifecycle run: freeze the completed chunk to immutable files, rebuild the current tx-hash index to fold it in, discard hot DBs the cold artifacts now serve, and prune everything superseded or past retention.

Data flow
-
Two sources feed one set of artifacts. Whatever produced the bytes, the artifacts — and the meta-store keys that catalog them — are identical.
+
Two sources feed one set of artifacts. Whatever produced the bytes, the artifacts — and the catalog keys that record them — are identical.
@@ -357,7 +358,7 @@

The big picture

stream - catch-up + backfill freeze at the chunk boundary (hot branch) @@ -370,7 +371,7 @@

The big picture

- meta-store RocksDB — catalogs every file and directory above: mark-then-write keys, synced WAL, no directory is ever listed to find work + catalog RocksDB — catalogs every file and directory above: mark-then-write keys, synced WAL, no directory is ever listed to find work
@@ -379,7 +380,7 @@

The big picture

The one-sentence summary: data is born hot (one RocksDB per chunk), becomes cold and immutable at the chunk boundary (.pack / events segment / .bin → rolled into a per-window - .idx), and every transition is recorded in a meta-store key before the bytes move — + .idx), and every transition is recorded in a catalog key before the bytes move — so a crash at any instant is recoverable from keys alone.
@@ -405,7 +406,7 @@

Geometry

- +
@@ -427,31 +428,31 @@

Geometry

With the default chunks_per_txhash_index = 1000, the file-bucket size (fixed at 1000 chunks) and the window size coincide numerically — but they are different concepts: buckets are purely a - filesystem concern and never appear in meta-store keys; windows define the tx-hash index layout and are - pinned in the meta store forever. + filesystem concern and never appear in catalog keys; windows define the tx-hash index layout and are + pinned in the catalog forever.

The four guarantees

- +

The daemon is built around four guarantees over its data. Everything else in the design — the write - protocol, the derived watermark, the key-driven sweeps — exists to maintain these through any crash at + protocol, the derived last committed ledger, the key-driven sweeps — exists to maintain these through any crash at any instant.

Retention is complete

-

No gaps within the retention window — for every ledger in - [effectiveRetentionFloor, last_committed_ledger], all data derived from it (transactions, +

No gaps within the retention window — for every ledger from the retention floor up to + the last committed ledger, all data derived from it (transactions, events) is present on disk and can serve any request that falls entirely inside the window.

Cold is canonical, hot is transient

Frozen chunks and finalized indexes live in immutable cold artifacts. A chunk's hot DB is discarded once every cold artifact derived from it is durable and the rolling index covers it — so a tx lookup always has exactly one home: the hot DB until coverage, the .idx after.

-

The meta-store catalogs what's on disk

-

Disk content is exactly what the meta-store specifies — every file is named by a meta-store key and +

The catalog records what's on disk

+

Disk content is exactly what the catalog specifies — every file is named by a catalog key and every key in a final state has its file. File and key writes/deletes are ordered to preserve this across crashes.

Storage tracks retention

@@ -465,14 +466,14 @@

The four guarantees

Data model

- Durable state lives in two places: the meta-store RocksDB (state markers and config pins) and the + Durable state lives in two places: the catalog RocksDB (state markers and config pins) and the filesystem (immutable files, plus one per-chunk hot RocksDB holding in-progress data during ingestion).

On disk

{default_data_dir}/
-├── meta/rocksdb/ ← meta store (WAL always on)
+├── meta/rocksdb/ ← catalog (WAL always on)
├── hot/{chunk:08d}/ ← per-chunk hot RocksDB (transient)
├── ledgers/{bucket:05d}/{chunk:08d}.pack
├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash)
@@ -488,8 +489,8 @@

On disk

The .bin is the interesting transient: it is the input to buildTxhashIndex and - is retained for the whole life of its window (every boundary re-reads all of the window's - .bin files to rebuild the index). The terminal build's commit batch demotes them to + is retained for the whole life of its window (at each boundary the rebuild reads all of the + window's .bin files). The terminal build's commit batch demotes them to "pruning" and the sweep removes them.

@@ -497,8 +498,7 @@

The chunk hot DB

One RocksDB per chunk at hot/{chunk:08d}/, holding everything for that chunk not yet materialized to cold artifacts. The data types are column families of one instance — they share the - instance's WAL, so each ledger commits as one atomic WriteBatch across all CFs. There is no - cross-store ordering to reason about within a chunk. + instance's WAL, so each ledger commits as one atomic WriteBatch across all CFs.

@@ -507,7 +507,7 @@

The chunk hot DB

Column familyHoldsServes
events CFslive events (schema per the events doc)getEvents for the live chunk
-

Meta-store keys

+

Catalog keys

Three groups: per-chunk artifact state, hot DB state, and config pins. Lifecycle states are shared by every artifact key in the system:

@@ -521,7 +521,7 @@

Meta-store keys

- + @@ -538,7 +538,7 @@

Meta-store keys

Artifact lifecycles

- +

Three state machines cover every durable thing in the system. Click any state to see what it means and what recovery does if a crash leaves the system there. @@ -564,7 +564,7 @@

Artifact lifecycles

One write protocol

- +

Every durable artifact — per-chunk files and index coverages alike — uses the same protocol, mark-then-write: put "freezing" before any I/O; @@ -584,48 +584,48 @@

One write protocol

- Why the dirent fsyncs matter: without them, a power crash can revert a file's (or a freshly - created directory's) creation under a durable "frozen" key — a state - key-only idempotency would never repair. That's why writes that create their parent directory barrier - the grandparent too. + Why the dirent fsyncs matter: step 3 fsyncs the directory entries, not just the file, so a file's + (or a freshly created directory's) existence on disk is durable before its key flips to + "frozen" — which is why a write that creates its parent directory also + barriers the grandparent.
- The "never salvage" rule: a crashed index build's file might even be complete — but proving that - buys nothing. A rebuild re-derives byte-identical output (the merge is a deterministic function of the - coverage), and a single no-questions rule — delete "freezing" debris - unread — collapses the entire crash inventory. + A crashed index build is deleted, not salvaged: a rebuild re-derives byte-identical output (the merge + is a deterministic function of the coverage), so a partial + "freezing" file is just re-derived from scratch.

Progress is derived, never stored

- +

- There is no stored watermark. The hot DB's synced per-ledger WriteBatch is the durable commit; - recording it again in the meta store would create a second copy of the same fact, plus an ordering rule - to keep the copy honest. Instead, two derivations read progress back out of the catalog. Both lean on one - key-creation invariant: a hot:chunk key is created only after every ledger below its - chunk has durably committed — so the highest hot key is the live chunk, and everything below it - is complete. + There is no stored progress value. The hot DB's synced per-ledger WriteBatch is the durable commit; + recording it again in the catalog would create a second copy of the same fact. Instead, startup + recomputes the exact last committed ledger from the catalog, and during operation ingestion hands the + lifecycle each chunk as it completes. The recomputation leans on one key-creation invariant: a + hot:chunk key is created only after every ledger below its chunk has durably committed — so + everything below the highest hot key is complete, and a single read of the live hot DB pins the exact + ledger inside it.

-
deriveCompleteThrough — two terms, take the max
-
A cold term (highest chunk whose artifacts are all durable) leads at startup; a positional term (everything below the live chunk) leads in steady state. Pick a situation:
+
lastCommittedLedger — two terms, take the higher
+
The COLD term is the last ledger of the highest fully-frozen chunk; the HOT term reads the live hot DB for the exact ledger inside it. The higher chunk wins — HOT in steady state, COLD only when no hot chunk sits above the frozen ones. Pick a state startup might find:
- Postcondition-driven catch-up is what makes a derived watermark safe: catch-up converges - ranges, not resume pointers, so derivation can never hide a hole — and a lost hot volume - self-degrades the watermark to the last frozen boundary instead of requiring a manual rewind. + Postcondition-driven backfill is what makes a recomputed last committed ledger safe: backfill converges + whole ranges, so recomputation can never skip a hole. And because nothing is stored, a lost hot + volume drops the recomputed answer to the last frozen boundary on its own (surgical recovery).
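A minimal Go sketch of the "take the higher term" rule, for illustration only. The function and parameter names are hypothetical, the floor is assumed to sit at genesis, and the single hot DB read is reduced to a plain integer argument:

```go
package main

const ledgersPerChunk = 10_000

// chunkLastLedger returns the last ledger of chunk c. For c = -1 it yields 1,
// the "nothing ingested yet" sentinel when the floor is genesis.
func chunkLastLedger(c int64) int64 { return (c+1)*ledgersPerChunk + 1 }

// lastCommittedLedger recomputes progress from two catalog-derived terms.
//
//	frozenChunk: highest fully-frozen chunk, or -1 if none (COLD term).
//	hotChunk:    highest hot chunk key, or -1 if none (HOT term).
//	hotLedger:   last ledger found in that hot DB, or 0 if it is empty
//	             (hypothetical stand-in for the single hot DB read).
func lastCommittedLedger(frozenChunk, hotChunk, hotLedger int64) int64 {
	if hotChunk > frozenChunk {
		if hotLedger == 0 {
			// The hot key exists but holds no ledger yet. By the key-creation
			// invariant, everything below its chunk is already committed.
			return chunkLastLedger(hotChunk - 1)
		}
		return hotLedger
	}
	// No hot chunk above the frozen ones (for example a lost hot volume):
	// the answer falls back to the last frozen boundary.
	return chunkLastLedger(frozenChunk)
}
```

With chunk 5349 frozen and chunk 5350 live, the hot term wins; with the hot key gone, the same call returns the last ledger of chunk 5349 without any stored value needing a rewind.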

The rolling tx-hash index

- +

The current window's index is re-derived from scratch on every chunk boundary to absorb the chunk that just froze, growing until its window is complete. Only the window the network tip is in is ever @@ -662,10 +662,10 @@

The rolling tx-hash index

Why per-chunk .bin files make this affordable: processChunk sorts each - chunk's ~3M entries in memory before writing, so the every-boundary rebuild is a single streaming k-way - merge of sorted runs — no two-pass build over unsorted input. Transient .bin disk is bounded - by the windows actually in flight (floor: one dense window ≈ 60 GB), because every terminal commit's - eager sweep deletes a finalized window's inputs immediately. + chunk's ~3M entries in memory before writing, so the rebuild feeds streamhash sorted keys — its fast, + low-memory sorted-builder mode. Transient .bin disk is bounded by the windows actually in + flight (floor: one dense window ≈ 60 GB), because a finalized window's inputs are deleted as soon as its + final index is built.
Provisioning note: old and new coverage files coexist from the start of a rebuild's write until @@ -678,11 +678,11 @@

The rolling tx-hash index

A chunk boundary, end to end

- +

The micro view: ledger 53,510,001 closes chunk 5350 (window 5, floor pinned at chunk 5100 by earliest_ledger, frozen index covering chunks 5100–5349). Step through every - write the boundary performs — watch the meta store, the filesystem, and where reads are served at each + write the boundary performs — watch the catalog, the filesystem, and where reads are served at each instant.

@@ -700,26 +700,24 @@

A chunk boundary, end to end

- Every arrow in this walkthrough is the one write protocol or its exit sweep. At the end of the tick a - re-plan and re-scan find nothing to do — that quiescence is what makes the + Every arrow in this walkthrough is the one write protocol or its exit sweep. At the end of the run a + re-plan and re-scan find nothing to do; that settled state is what makes the invariant audits meaningful on a live daemon.
-

Catch-up & the resolver

- +

Backfill & the resolver

+

- Catch-up has a contract — given a range, ensure every artifact derived from every ledger in it is - durable and servable — and resolves what's missing before scheduling anything. A naive - scheduler would register every task and rely on self-skips; that shape re-derives every chunk's - .bin on every restart only for finalization to immediately delete it again. Instead, each - artifact kind contributes one rule that compares its postcondition against the catalog and emits - the difference as tasks: + Backfill has a contract — given a range, ensure every artifact derived from every ledger in it is + durable and servable — and resolves what's missing before scheduling anything, so a restart + re-plans from what is on disk instead of redoing finished work. Each artifact kind contributes one rule + that compares its postcondition against the catalog and emits the difference as tasks:

    -
  • lfs / events (per-chunk): needed for chunk c iff the key isn't "frozen".
  • +
  • ledgers / events (per-chunk): needed for chunk c iff the key isn't "frozen".
  • txhash (per-window): compare the stored coverage (from the window's unique frozen index key) with the desired coverage [max(window_start, floor), min(window_last, range_end)]. Desired ⊆ stored → schedule nothing. Desired exceeds stored → request .bin @@ -748,29 +746,81 @@

    Catch-up & the resolver

    The execution model

    executePlan is map/reduce without the shuffle or the job tracker: chunk builds are the maps, - index builds are the per-window reduces (the .bins are map-side-sorted runs, so each reduce - is one streaming merge), and completion is recorded as the artifacts themselves. There is - deliberately no task engine and no persisted task state: + index builds are the per-window reduces, and completion is recorded as the artifacts themselves. + Dependencies are simple, and nothing is persisted:

    • The dependency structure is two strata with one edge type — an index build waits on the chunk builds inside its coverage — expressed directly with done-channels. Thousands of goroutines may exist, parked on a single worker semaphore (cfg.Workers, the only concurrency knob); at most Workers tasks execute at any instant.
    • -
    • Done-channels broadcast completion, not success: a build whose input failed starts anyway and - trips buildTxhashIndex's loud .bin precondition check before touching any key - — landing on the same abort-and-restart path as the original failure.
    • -
    • A persisted task graph would be a second source of truth that can drift from the artifact keys it - describes; resolve re-plans from keys, so completed work never repeats and interrupted work - needs no reconciliation.
    • +
    • Done-channels signal success: a chunk build closes its channel only once its .bin + is frozen, so an index build proceeds only when every input it needs exists. A chunk build that exhausts + its retries leaves its channel open and returns an error, which cancels the group context; any dependent + waiting on it unblocks through the <-gctx.Done() case and bails — the daemon aborts and a + restart re-resolves from durable keys.
    • +
    • resolve re-plans from the artifact keys on every run, so completed work never repeats and + interrupted work needs no reconciliation.

    - The same resolve + executePlan pair is the lifecycle tick's first stage — one + The same resolve + executePlan pair is the lifecycle run's first stage — one scheduler, two callers, so the two regimes can never disagree about what "done" looks like. - processChunk's source selection (catchupSource) is also shared: a ready, - complete hot DB beats the local .pack beats the bulk backend — which is exactly what lets + processChunk's source selection (backfillSource) is also shared: a ready, + complete hot DB beats the local .pack beats the backfill backend — which is exactly what lets the lifecycle's freeze be ordinary plan execution rather than a special path.

    + +
    +
    executePlan — bounded concurrency & the done-channel barrier
    +
    + Five chunk builds feeding one window's index build, run under 2 worker slots. Step the scheduler: + chunk builds fill the slots, each closes its done-channel on success, and the index build stays parked + until every input it needs is frozen. Toggle a failure to watch a build leave its channel open, cancel + the group, and bail its dependents. +
    +
    + + + + +
    +
    +
    +
    +
    +
+ + +
+

Startup: the backfill loop

+
Normative spec: streaming — Startup
+

+ Before it serves anything, the daemon runs backfill in a loop until on-disk coverage reaches the last + complete chunk at the network tip. Each pass re-reads the tip, plans [floor, last complete + chunk], and executes it; if the tip advanced while the pass ran, another pass picks up the chunks it + moved past. The partial chunk still forming at the tip is never backfilled — its ledgers are already + in the live hot DB, and hot-DB ingestion finishes it. When the loop exits, the daemon opens the resume + chunk's hot DB, seeds the lifecycle, and starts serving. +
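A rough Go sketch of the loop's exit arithmetic, under stated assumptions. `lastCompleteChunkAt` and `backfilledThrough` are names the explorer mentions, but the bodies here are guesses from the chunk geometry; `tip` and `runPass` are hypothetical hooks, and the near-tip/mid-chunk trim is not modelled.

```go
package main

const ledgersPerChunk = 10_000

// lastCompleteChunkAt returns the highest chunk whose last ledger,
// (c+1)*10_000 + 1, is at or below tip, or -1 if no chunk is complete.
// Assumes tip >= 1 so the division never sees a negative numerator.
func lastCompleteChunkAt(tip int64) int64 { return (tip-1)/ledgersPerChunk - 1 }

// backfillUntilCaughtUp repeats backfill passes until the network tip stops
// outrunning on-disk coverage, then returns the last chunk it covered.
func backfillUntilCaughtUp(
	floorChunk int64,
	tip func() int64,
	runPass func(lo, hi int64) error,
) (int64, error) {
	backfilledThrough := floorChunk - 1
	for {
		rangeEnd := lastCompleteChunkAt(tip()) // re-read the tip every pass
		if rangeEnd <= backfilledThrough {
			return backfilledThrough, nil // hand off to serve + ingest
		}
		// Each pass plans the whole range; finished chunks resolve to no work.
		if err := runPass(floorChunk, rangeEnd); err != nil {
			return backfilledThrough, err
		}
		backfilledThrough = rangeEnd
	}
}
```

The partial chunk at the tip never enters a pass because `lastCompleteChunkAt` stops at the chunk below it.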

+
+
Startup walkthrough
+
+ Three situations, each running the real startStreaming loop arithmetic + (lastCompleteChunkAt, the near-tip/mid-chunk trim, the rangeEnd ≤ backfilledThrough + exit). Each card is one loop pass; the last is the hand-off to serve + ingest. +
+
+
+
+
+
+ The loop reads durable keys only, so it is its own crash recovery: a restart re-plans from what is on disk, + redoing no finished chunk and skipping no unfinished one. The same resolve + + executePlan pair runs here and in every lifecycle run — startup just drives the + bottom of storage down to the floor, where the running lifecycle never reaches. +
@@ -791,47 +841,55 @@

Ingestion loop — owns the live chunk

Lifecycle goroutine — owns everything below

Handed-off hot DBs (freeze + discard), all chunk:* and index:* keys, and - the deletion side of hot:chunk:*. The tick's plan stage fans out to the bounded worker + the deletion side of hot:chunk:*. The run's plan stage fans out to the bounded worker pool — every worker operating strictly below the live chunk.

- The handoff fence is the boundary's write order: the ingestion loop closes its write handle - before creating the next chunk's hot key. Creating that key is the act that moves the partition - — the instant it exists, the closed chunk lies below the live chunk and any lifecycle scan (including one - already in flight) may freeze and discard it, by which point no writer holds it. + The handoff fence is the boundary's write order: the ingestion loop closes its write handle and + opens the next chunk — which moves the partition, since the closed chunk now lies below the new live + chunk — before it hands the completed chunk to the lifecycle on the channel. So by the time the + lifecycle freezes and discards it, no writer holds it.

- The only connection between the goroutines is a payload-free doorbell — a non-blocking send on a - size-1 channel, coalescing freely. Nothing is lost because the notification carries no information to - lose: eligibility derives entirely from durable state, and one tick processes everything the catalog - shows, however many boundaries contributed. The doorbell answers "when should the lifecycle look", - never "what should it see". A tick racing a boundary only under-approximates eligibility — - work deferred to the next tick, never incorrect work. + The only connection between the goroutines is a buffered channel of depth lifecycleQueueDepth, + which carries the chunk that ingestion just completed. The lifecycle drains it to the + highest value each wake and plans up to that chunk, folding a backlog of boundaries into one run. The + value sets only the run's range; the work is still gated by durable keys: resolve + and the scans decide what to build, discard, and prune. A send onto a full buffer means the lifecycle has + fallen lifecycleQueueDepth boundaries behind ingestion, which is a fatal "freeze can't keep up," + never a silent drop (the depth sits well above the at-most-one signal a healthy daemon holds).
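The hand-off channel might look roughly like the following Go sketch. `lifecycleQueueDepth` is the name used in the text, but its value here is made up, and `handOff` and `drainHighest` are hypothetical helpers rather than the daemon's functions.

```go
package main

import "errors"

// lifecycleQueueDepth is given an arbitrary value for the sketch.
const lifecycleQueueDepth = 8

var errFreezeBehind = errors.New("lifecycle queue full: freeze can't keep up")

// handOff is the ingestion side: a non-blocking send of the chunk that just
// completed. A full buffer is reported as a fatal error, never dropped.
func handOff(queue chan<- int64, completedChunk int64) error {
	select {
	case queue <- completedChunk:
		return nil
	default:
		return errFreezeBehind
	}
}

// drainHighest is the lifecycle side: given the value that woke it, it
// empties whatever else is buffered and returns the highest chunk seen,
// folding a backlog of boundaries into one run.
func drainHighest(queue <-chan int64, first int64) int64 {
	highest := first
	for {
		select {
		case c := <-queue:
			if c > highest {
				highest = c
			}
		default:
			return highest
		}
	}
}
```

The returned chunk only bounds the run's range; in the design, what actually gets built, discarded, or pruned is still decided from durable keys.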

-

Reading a transaction

- +

The reader contract

+

- A hash names no ledger, so a lookup can't tell which window holds it — the window is - chunkID(seq) / chunks_per_txhash_index, and seq is exactly what the lookup is - trying to find. So the cold tier probes every in-retention window's .idx: an MPHF - probe, a fingerprint screen (fpWidth bytes), and — on a fingerprint hit — a fetch of the - LCM and a verify of the full 32-byte hash. The transaction is in at most one window, so at most one - probe confirms; the verify runs on every fingerprint hit (false positives included) and rejects all - but that one. A not-found lookup confirms none and must rule out every window. + A read resolves data through two rules, and the rest of the design relies on both:

+
    +
  1. Only "ready" and "frozen" are + visible. A read resolves a chunk only from a "ready" hot DB or a + "frozen" cold file — never a key in a transient state + ("freezing", "pruning", + "transient"). So a reader never sees a half-written file, crash + debris, or an in-flight sweep.
  2. +
  3. Below the floor is not-found. A read for any seq below the retention floor returns + not-found regardless of what's still on disk — the contract that lets pruning unlink files + unilaterally (a stale .idx may resolve a hash to a .pack that's been + deleted, but the below-floor read is not-found anyway).
  4. +
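Read as code, the two rules amount to a small predicate. The Go sketch below is illustrative only: the function names are hypothetical, and states are plain strings matching the catalog values quoted in the list.

```go
package main

// visible reports whether a catalog state may back a read (rule 1):
// only "ready" hot DBs and "frozen" cold files qualify.
func visible(state string) bool {
	return state == "ready" || state == "frozen"
}

// resolvable applies both rules to one lookup. A seq below the retention
// floor is not-found no matter what is still on disk (rule 2); above the
// floor, the backing key must be in a visible state.
func resolvable(seq, floorLedger uint32, state string) bool {
	if seq < floorLedger {
		return false
	}
	return visible(state)
}
```

Because the floor check comes first, a stale file that pruning has not yet unlinked can never change the answer for a below-floor read.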

- A read for any seq below effectiveRetentionFloor returns not-found regardless of - what's still on disk — the contract that lets pruning unlink files unilaterally. How the reader - stays correct while sweeps and pruning unlink files concurrently with a read (tier dispatch, a file that - vanishes mid-read) is the query-routing design's concern, out of scope here. + Together these make retention the single source of truth for "is this available?". Everything else + about serving a read — how the reader picks the tier, probes the right window, and stays correct while + a sweep unlinks a file mid-read — is the query-routing design's concern, out of scope here (and in + the streaming doc). The explorer below illustrates the cold-tier getTransaction probe from the + transactions design, for reference:

-
Read-path explorer
+
Read-path explorer · query-routing, out of scope of the streaming doc
Three cold lookups over a multi-window retention. The chain shows each per-window probe.
@@ -844,10 +902,10 @@

Reading a transaction

Correctness

- Quiescence means the tick's plan is empty and both scans produce empty op lists — the state the + Settled means the run's plan is empty and both scans produce empty op lists — the state the system returns to between boundaries, and the state in which the invariants below are auditable on a live daemon. From any storage state — partial-completion crashes, operator actions, surgical - recovery — startup (catch-up + the first tick) drives the system to quiescence satisfying all four. + recovery — startup (backfill + the first run) drives the system to a settled state satisfying all four.

@@ -855,30 +913,35 @@

Correctness

Any data request whose ledger scope falls entirely within the retention window returns correct results — content matches what a conformant LedgerBackend would produce, no partial state visible, no in-retention range unreachable. Audit: issue reads, or re-derive artifacts via a - conformant backend and byte-compare. One transient exception after hot-volume loss: the regressed floor - briefly admits a few already-pruned bottom chunks — those reads fail soft via the - missing-data-file rule (not-found, never wrong data) until the floor re-advances.
+ conformant backend and byte-compare. One transient exception: when surgical recovery demotes hot data down + to the live chunk, the last committed ledger rewinds and the floor regresses with it, briefly admitting a + few already-pruned bottom chunks — those reads fail soft (not-found, never wrong data) until the + floor re-advances.
INV-2 Single canonical state
At most one "frozen" index key per window — - at all times, quiescent or not (the commit batch promotes and demotes in one write). At - quiescence: no key anywhere is "freezing" or + at all times, settled or not (the commit batch promotes and demotes in one write). When + settled: no key anywhere is "freezing" or "pruning"; no hot DB persists for a chunk cold artifacts fully serve; - no chunk:c:txhash key survives in a finalized window. Audit: walk meta-store keys, - cross-check forbidden co-existence.
+ no chunk:c:txhash key survives in a finalized window. Two transients are tolerated even when + settled: a hot DB's "transient" bracket around an in-flight directory + op, and (after surgical recovery) a partially-frozen chunk above the last committed ledger + (no read can observe it). Audit: walk catalog keys, cross-check forbidden co-existence.
- INV-3 Disk matches meta-store -
At quiescence, the set of artifact files and hot DB directories on disk equals exactly - the set the meta-store specifies — no orphan files, no dangling keys, no duplicate artifacts. A - non-key-named file in an index window dir is a real bug, not mid-tick debris. Audit: walk the - filesystem against the meta-store, both directions.
+ INV-3 Disk matches catalog +
When settled, the set of artifact files and hot DB directories on disk equals exactly + the set the catalog specifies: no orphan files, no dangling keys, no duplicate artifacts. A + non-key-named file in an index window dir is a real bug, not mid-run debris. Audit: walk the + filesystem against the catalog, both directions.
INV-4 Retention bound -
At quiescence, no file or meta-store key maps to a ledger range strictly below the - effective retention floor. Audit: walk meta-store keys, compare ledger ranges to the floor.
+
When settled, no file or catalog key maps to a ledger range strictly below the + effective retention floor, except a frozen index key whose window straddles the floor, which keeps the + lo it was built with (its below-floor coverage is never served; the reader gate returns + not-found). Audit: walk catalog keys, compare ledger ranges to the floor.
@@ -887,6 +950,21 @@

Correctness

audit admin command can implement the walks directly.
+
+ Surgical recovery (tainted data). The operator never touches the filesystem — recovery is one + atomic catalog batch that demotes keys. Tainted cold artifacts go to + "freezing", and backfill re-derives them. For the hot tier, demote + every hot:chunk at or above the lowest tainted chunk — the live chunk always included — + to "transient". Why the whole tail, not just the tainted chunk: the + hot tier is repaired only by re-ingestion, which replays forward from the last committed ledger + (the highest "ready" hot chunk). To replay a tainted hot chunk the + watermark must first fall below it — and since it's the max over all + "ready" keys, that means demoting every hot DB at or above the lowest + tainted one. Then captive core re-ingests the tail forward; the untainted chunks swept up in the demotion + are re-derived byte-identically. (A lost hot volume is the same recovery, triggered by loss rather than + taint.) +
+

What a bug looks like

Common bugs land as concrete, detectable violations:

| Key | Meaning |
|---|---|
| chunk:{c}:lfs | Per-chunk .pack file state. |
| chunk:{c}:ledgers | Per-chunk .pack file state. |
| chunk:{c}:txhash | Per-chunk .bin file state. Transient — removed at window finalization. |
| chunk:{c}:events | Per-chunk events cold segment state. |
| index:{w}:{lo}:{hi} | One key per index coverage. The key name carries the coverage and maps 1:1 to the file {lo}-{hi}.idx; the value is pure lifecycle state. At most one coverage per window is "frozen" at any moment. |
@@ -894,19 +972,19 @@

What a bug looks like

- + - - + +
| A key flips "frozen" before fsync; key's {lo,hi} doesn't match the file; a frozen file mutated post-freeze | INV-1 | re-derive via a conformant backend, byte-compare |
| Pruning too aggressive — an in-retention read returns wrong/missing results | INV-1 | issue reads |
| Two frozen index keys in one window (promotion and demotion landed as separate writes) | INV-2 | walk index:*, count "frozen" per window |
| A "freezing"/"pruning" key survives served quiescence | INV-2 | walk keys for transient values at quiescence |
| A "freezing"/"pruning" key survives in the served settled state | INV-2 | walk keys for transient values once settled |
| A hot DB persists for a chunk cold artifacts fully serve | INV-2 | walk hot:chunk:* against coverage |
| Finalization demotions don't complete — .bin keys outlive their terminal index | INV-2 | walk chunk:c:txhash in finalized windows |
| A file on disk without its key (orphan — invisible to every key-driven scan) | INV-3 | walk filesystem against meta-store |
| A key without its file (dangling) | INV-3 | walk meta-store against filesystem |
| A file on disk without its key (orphan — invisible to every key-driven scan) | INV-3 | walk filesystem against catalog |
| A key without its file (dangling) | INV-3 | walk catalog against filesystem |
| Duplicate cold artifacts for the same logical data | INV-3 | walk filesystem against key-specified paths |
| Files or keys remain below the retention floor | INV-4 | walk keys against the floor |

Why convergence works

-

Three properties shared by the resolver and the scans, plus catch-up's postcondition contract:

+

Three properties shared by the resolver and the scans, plus backfill's postcondition contract:

    -
  • Eligibility from durable state alone — every decision derives from meta-store keys; nothing depends on in-memory history.
  • +
  • Eligibility from durable state alone — every decision derives from catalog keys; nothing depends on in-memory history.
  • Idempotent ops — re-running any half-finished op is safe; re-materialization overwrites at canonical paths, sweeps re-run until the key is gone.
  • Everything re-derived on every notification — there is no persisted plan to drift.
@@ -919,12 +997,12 @@

Why convergence works

Interactive companion to full-history-streaming-workflow.md - (the daemon: catch-up, ingestion, lifecycle, invariants) and + (the daemon: backfill, ingestion, lifecycle, invariants) and gettransaction-full-history-design.md (the tx-by-hash subsystem: formats, the rolling index, the read path) — the markdown is the normative spec; numbers here (chunk = 10,000 ledgers, window default = 1000 chunks, build ≈ 1 min) come from those docs and the bench-fullhistory measurements they cite. - Generated 2026-06-12. Self-contained; no external dependencies. + Re-synced to the current docs 2026-06-18. Self-contained; no external dependencies.
diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index b2c55d518..8c7cd7157 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -2,11 +2,13 @@ ## Overview -Full-history RPC runs as one daemon in one mode. On startup it figures out how far behind the network tip it is and catches up automatically; once caught up, it serves live ledgers as they're produced. There is no separate backfill command or explicit catch-up step for the operator to invoke. +Full-history RPC runs as one daemon in one mode: it both backfills old history and follows the live network. + +It keeps two tiers of data. **Hot** data is the most recent ledgers near the network tip, written append-only into RocksDB. **Cold** data is older ledgers, held as immutable files on disk. On startup RPC backfills to the current tip, then ingests new ledgers continuously into the hot DB; when the hot DB fills, it writes the immutable cold files for that ledger range and discards the hot DB. This migration from hot to cold is called **freezing**. The daemon does three things: -- **Catches up on startup** by running bulk catch-up as a subroutine. This brings on-disk coverage in line with the current retention window — pulling from a configured LedgerBackend (BSB — the Buffered Storage Backend, which reads ledgers from an object store — by default; captive core or any other conformant backend if BSB isn't available) any chunks inside that window that aren't already frozen — while skipping the tip chunk that captive core is actively ingesting; hot DB ingestion finishes that one. This covers first-ever start, downtime gaps, and retention-widening gaps. +- **Backfills on startup.** Before it serves anything, it runs backfill as a subroutine to bring what's on disk in line with the current retention window. 
It pulls every chunk inside that window that isn't already frozen from a configured `LedgerBackend` — by default BSB (the Buffered Storage Backend, which reads ledgers from an object store), or captive core or any other conformant backend if BSB isn't available. It skips the partial chunk still forming at the tip; hot-DB ingestion fills that one once it starts. This single mechanism covers a first-ever start, gaps left by downtime, and gaps opened by widening retention. - **Ingests** live ledgers from `CaptiveStellarCore` into one hot RocksDB per chunk — ledgers, transaction hashes, and events as column families, written in one atomic batch per ledger. - **Freezes** completed chunks to immutable files, **rebuilds** the current tx-hash index from its frozen inputs on every chunk boundary, and **prunes** superseded and past-retention artifacts. All run in a background lifecycle goroutine. @@ -16,18 +18,17 @@ The daemon does three things: The Stellar blockchain starts at ledger 2 (`GENESIS_LEDGER`). Two units organize all storage; everything in this doc is described in terms of them: -- **Chunk** — 10,000 ledgers (hardcoded). The atomic unit of ingestion, freezing, and crash recovery. -- **Window** (tx-hash index) — `chunks_per_txhash_index` chunks (default 1000 = 10M ledgers). The unit of the rolling tx-hash index. Configurable, but immutable once stored. +- **Chunk** — a run of 10,000 ledgers (hardcoded); the atomic unit of ingestion, freezing, and crash recovery. A hot DB holds at most one chunk, and each cold file — ledgers, events, transactions — spans exactly one chunk. +- **Window** — a run of chunks (`chunks_per_txhash_index`, default 1000 = 10M ledgers); the unit of the rolling tx-hash index. The index is the one exception to the per-chunk rule: it maps transaction hashes to ledger sequences across a whole window. Configurable, but immutable once stored. 
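The chunk and window arithmetic this section relies on can be sketched in Go as below. This is an illustration, not code from the daemon: the helper names mirror the doc's formulas, `floorDiv` is a hypothetical helper, and `chunksPerIndex` stands in for `chunks_per_txhash_index`. The one subtlety worth showing is that Go's `/` truncates toward zero, so floor division has to be written out for the sub-genesis sentinel to land on chunk -1.

```go
package main

const (
	genesisLedger   = 2
	ledgersPerChunk = 10_000
)

// floorDiv divides rounding toward negative infinity. Go's / truncates
// toward zero, which would map the sub-genesis sentinel to chunk 0.
func floorDiv(a, b int64) int64 {
	q := a / b
	if a%b != 0 && (a < 0) != (b < 0) {
		q--
	}
	return q
}

// chunkID maps a ledger sequence to its (signed) chunk id.
func chunkID(seq int64) int64 { return floorDiv(seq-genesisLedger, ledgersPerChunk) }

// chunkFirstLedger and chunkLastLedger give a chunk's inclusive bounds.
func chunkFirstLedger(c int64) int64 { return c*ledgersPerChunk + genesisLedger }
func chunkLastLedger(c int64) int64  { return (c+1)*ledgersPerChunk + 1 }

// indexID maps a chunk id (not a ledger) to its tx-hash window.
// Only non-negative chunk ids are ever passed here.
func indexID(c, chunksPerIndex int64) int64 { return c / chunksPerIndex }
```

For example, ledger 53,510,001 falls in chunk 5350, which sits in window 5 at the default of 1000 chunks per window.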
``` chunkID(seq) = floor((seq - 2) / 10_000) chunkFirstLedger(c) = c * 10_000 + 2 chunkLastLedger(c) = (c + 1) * 10_000 + 1 indexID(c) = c / chunks_per_txhash_index # takes a CHUNK id -chunksInIndex(w) = [w*cpi, (w+1)*cpi - 1] # cpi = chunks_per_txhash_index ``` -Chunk ids are **signed**, and `chunkID` uses floor division. The only sub-genesis sequence the daemon ever forms is the "nothing ingested" watermark sentinel `earliest_ledger - 1` (which is `1` when `earliest_ledger` is genesis); floor division maps it to **chunk −1**, and `chunkLastLedger(-1) = 1` reproduces the sentinel. Chunk −1 means "before the first chunk" and exists only as a transient in derivation arithmetic — the cold and positional terms of `deriveCompleteThrough`, and the watermark mid-chunk test at startup. It is never serialized: every chunk id written to a meta-store key or file path is a real chunk `≥ 0`, so `%08d` only ever sees non-negative ids. (`chunkID(seq)` for an in-range `seq ≥ GENESIS_LEDGER` is unaffected — floor and truncating division agree on non-negative numerators.) +Chunk ids are **signed**, because `chunkID` uses floor division. The only id below 0 is **chunk −1**, meaning "before the first chunk." It comes up in one place: the "nothing ingested yet" sentinel `earliest_ledger - 1`, which maps to chunk −1 (and `chunkLastLedger(-1) = 1` maps back). Chunk −1 only ever appears in startup arithmetic; every chunk id written to disk is `≥ 0`. All chunk and window ids use uniform `%08d` zero-padding. Example, default `chunks_per_txhash_index = 1000`: @@ -39,39 +40,6 @@ All chunk and window ids use uniform `%08d` zero-padding. Example, default `chun --- -## What the daemon guarantees - -The daemon is built around four guarantees over its data: - -- **Retention is complete.** No gaps within the retention window — for every ledger in the window, all data derived from it (transactions, events) is present on disk and available to serve any data request that falls entirely within it. 
-- **Cold is canonical, hot is transient.** Frozen chunks and finalized indexes live in immutable cold artifacts. The chunk hot DB is discarded once every cold artifact derived from the chunk is durable *and* the rolling index covers the chunk — so a tx lookup always has exactly one home: the chunk's hot DB until coverage, the `.idx` after. The current index is logically mutable — re-derived on every chunk boundary from the frozen `.bin` files — until its window finalizes. -- **The meta-store catalogs what's on disk.** Disk content is exactly what the meta-store specifies — every file is named by a meta-store key and every key in a final state has its file. File and key writes/deletes are ordered to preserve this across crashes. -- **Storage tracks retention.** Disk usage scales with `retention_chunks`, not with uptime — files and meta-store keys for ledger ranges below the effective retention floor are pruned as the floor advances. - -The retention window is bounded above by `last_committed_ledger` (the most recent ledger the daemon has durably committed) and below by `effectiveRetentionFloor` (computed from `retention_chunks` and `earliest_ledger`; defined in [Startup](#startup)). - -The rest of this doc explains how the daemon maintains these guarantees through three operational phases. The [Correctness](#correctness) section at the end gives the formal statement plus substrate assumptions, coverage scenarios, and audit shapes. - ---- - -## How the daemon runs - -Three activities, in the order a fresh daemon encounters them; the last two run together as the steady state. - -**Catch up.** On first start the daemon checks how far behind the network it is by sampling the network tip via the configured LedgerBackend; on subsequent starts it picks up from `last_committed_ledger`. It then runs the [catch-up primitives](#catch-up-primitives) (`processChunk`, `buildTxhashIndex`) over the missing range. 
The catch-up loop excludes the chunk captive core is currently ingesting — that one's finished by hot DB ingestion, not by the catch-up source. - -**Hot DB ingestion.** Once caught up, the daemon streams live ledgers from CaptiveStellarCore into the live chunk's hot RocksDB — ledgers, tx hashes, and events land in their column families via **one atomic, synced WriteBatch per ledger**, so a ledger is either fully in the hot DB or absent. Ingestion's own progress marker, `last_committed_ledger`, advances per ledger only after that batch is durable — it is a local of the ingestion loop, shared with nothing. - -**Freeze and prune.** A background goroutine wakes whenever ingestion's set of hot chunk DBs changes — each chunk boundary, plus once when ingestion starts — and runs one **tick** of three stages: **plan-and-execute** (the same resolver and executor catch-up uses, which freezes complete chunks to immutable files and folds them into the current tx-hash index), **discard** (retire hot DBs the cold artifacts now fully serve), then **prune** (sweep demoted artifacts and everything past the retention window). Each stage sees the previous stage's effects. - -The current tx-hash index is **re-derived from scratch on every chunk boundary** to absorb the chunk that just froze, growing until its window is complete; only the window the network tip is in is ever rebuilt, and a completed window's index is finalized (inputs cleaned up) and never touched again. The build is cheap relative to the ~chunk cadence (a full-window rebuild with streamhash — the minimal-perfect-hash index library behind tx-hash lookups — is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates), so rebuilding from scratch each boundary is affordable. - -The boundary between "in-retention" and "past-retention" is the `effectiveRetentionFloor`. As the network tip advances, the floor advances with it; complete chunks below the floor are removed by the prune stage. 
- -[Daemon flow](#daemon-flow) below has the pseudocode for each phase. [Data model](#data-model) describes what's on disk and in the meta store. [Correctness](#correctness) details the invariants the design maintains. - ---- - ## Configuration One TOML file (`--config`) configures the daemon. @@ -80,17 +48,17 @@ One TOML file (`--config`) configures the daemon. | Key | Type | Default | Description | |---|---|---|---| -| `default_data_dir` | string | **required** | Base directory for the meta store and default storage paths. | +| `default_data_dir` | string | **required** | Base directory for the catalog and default storage paths. | -**[catch_up]** +**[backfill]** | Key | Type | Default | Description | |---|---|---|---| | `chunks_per_txhash_index` | uint32 | `1000` | Chunks per tx-hash window. Defines data layout — immutable once stored (startup aborts on mismatch; see `validateConfig`). | -| `workers` | int | `GOMAXPROCS` | Concurrent task slots for bulk catch-up. | -| `max_retries` | int | `3` | Retries per catch-up task before the daemon aborts. | +| `workers` | int | `GOMAXPROCS` | Concurrent task slots for backfill. | +| `max_retries` | int | `3` | Retries per backfill task before the daemon aborts. | -**[catch_up.bsb]** — Buffered Storage Backend (the default bulk LedgerBackend; required **unless** another conformant LedgerBackend is configured as the bulk source — `backendNetworkTip`/`validateRangeProducible`/`processChunk`'s default `source` all go through whichever backend is configured) +**[backfill.bsb]** — Buffered Storage Backend (the default backfill `LedgerBackend`; required **unless** another conformant `LedgerBackend` is configured as the backfill source — `backendNetworkTip`/`processChunk`'s default `source` all go through whichever backend is configured) | Key | Type | Default | Description | |---|---|---|---| @@ -107,7 +75,7 @@ One TOML file (`--config`) configures the daemon. 
| `[immutable_storage.txhash_raw]` | `{default_data_dir}/txhash/raw` | transient `.bin` files | | `[immutable_storage.txhash_index]` | `{default_data_dir}/txhash/index` | per-window `.idx` | -**[meta_store]** — optional `path` (default `{default_data_dir}/meta/rocksdb`). +**[catalog]** — optional `path` (default `{default_data_dir}/catalog/rocksdb`). **[logging]** — optional `level` (`debug`/`info`/`warn`/`error`, default `info`) and `format` (`text`/`json`, default `text`). @@ -116,7 +84,7 @@ One TOML file (`--config`) configures the daemon. | Key | Type | Default | Description | |---|---|---|---| | `retention_chunks` | uint32 | `0` | Retention window in chunks. `0` = full history. | -| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for. Acts as a fixed lower floor on history; combines with `retention_chunks` (the effective floor is the higher of the two). Must be chunk-aligned (i.e., `chunkFirstLedger` of some chunk); `"now"` resolves to `chunkFirstLedger(chunkID(networkTip()))` at first start. A first start with `"now"` or with a numeric floor requires a reachable, ready backend — `"now"` has no other way to resolve, and a numeric floor is validated against the network tip (rejected if it is past the tip) before being pinned immutably; a genesis floor needs no tip, since genesis is always a valid lower bound. Stored on first start; immutable thereafter. Setting it higher than genesis skips upfront catch-up — useful for *frontfill* deployments (`earliest_ledger = "now"`) where bringing a fast bulk source online isn't possible. The current immutability is enforced only by `validateConfig`; the rest of the system reads the value through the meta store, so a future `set-earliest-ledger` admin command would be a small change. | +| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for — a fixed lower floor on history. 
Combined with `retention_chunks`, the effective floor is the higher of the two. Must be chunk-aligned; `"now"` resolves to the current network tip's chunk at first start. Resolved and stored on the first start (a reachable backend is required for `"now"` and numeric floors; see `validateConfig`), immutable thereafter. Setting it above genesis skips upfront backfill — useful when no fast backfill source is available and the daemon only follows the live network (`earliest_ledger = "now"`). | | `captive_core_config` | string | **required** | Path to CaptiveStellarCore config file. | **[streaming.hot_storage]** @@ -135,30 +103,30 @@ One TOML file (`--config`) configures the daemon. ## Data model -The daemon's durable state lives in two places: the meta-store RocksDB (state markers and config pins) and the filesystem (immutable files plus one per-chunk hot RocksDB that holds in-progress data during ingestion). +The daemon's durable state lives in two places. The **catalog** — a small RocksDB — records what's on disk and the state each file is in, plus a few config values fixed on the first start. The **filesystem** holds the data itself: the immutable cold files, and one per-chunk hot RocksDB for data still being ingested. -Throughout this section, `chunk` is a chunk id, `txhash_index` is a window id, and `chunks_per_index` is shorthand for `config.chunks_per_txhash_index`. +Throughout this section, `chunk` is a chunk id and `txhash_index` is a window id. ### Filesystem artifacts -The per-chunk artifacts are each written once at chunk freeze; the txhash index is rebuilt on each chunk boundary while its window is current and then finalized. All four are produced by the [catch-up primitives](#catch-up-primitives): +The per-chunk artifacts are each written once at chunk freeze; the txhash index is rebuilt on each chunk boundary while its window is current and then finalized. 
All four are produced by [the primitives](#the-primitives): | Artifact | Granularity | Format | Produced by | |---|---|---|---| | Ledger pack file | per chunk | `.pack` | `processChunk` | | Events cold segment | per chunk | three files per chunk (format defined in the events doc) | `processChunk` | -| Sorted txhash file | per chunk | `.bin` (sorted streamhash entries; see [rule 5](#catch-up-primitives)) | `processChunk` | -| Streamhash txhash index | per index | one `.idx` file per **coverage**, named `{lo:08d}-{hi:08d}.idx` inside the window's dir; at most one coverage frozen at any moment | `buildTxhashIndex` | +| Sorted txhash file | per chunk | `.bin` (sorted **streamhash** entries — the sorted on-disk tx-hash index format, specified in [the transactions design](./gettransaction-full-history-design.md) §6) | `processChunk` | +| Streamhash txhash index | per index | one `.idx` file per **coverage** (the chunk range `[lo, hi]` an index spans), named `{lo:08d}-{hi:08d}.idx` inside the window's dir; at most one coverage frozen at any moment | `buildTxhashIndex` | -The `.bin` is transient — it is the input to `buildTxhashIndex` and exists only until its index window finalizes, at which point the terminal build's commit batch demotes it to `"pruning"` and the sweep removes it. While a window is current, every boundary re-reads its `.bin` files to rebuild the index, so they are retained for the whole window. The pack file and events segment persist until retention-driven pruning removes them. The txhash index is rebuilt at a **new coverage** on every boundary (mark the new coverage's key `"freezing"` → write `{lo}-{hi'}.idx` → one atomic batch promotes it to `"frozen"` and demotes the predecessor coverage to `"pruning"`), then persists until pruning once its window has finalized and slid past retention. 
Key name and filename are a bijection, so every index file on disk — including a crashed attempt's partial — is reachable from its key alone; nothing ever lists the directory. +The `.bin` files are transient — they are the input `buildTxhashIndex` merges, and the terminal build deletes them once its window is complete. The pack files, events segments, and `.idx` files persist until retention pruning removes them. State for each lives in [Catalog keys](#catalog-keys); the write ordering is [One write protocol](#one-write-protocol). ### Directory layout -Chunk-level files group into buckets of 1,000 chunks (`bucket_id = chunk_id / 1000`, formatted `%05d`) — a filesystem concern only; bucket ids never appear in meta-store keys. Directories are created on demand. +Chunk-level files group into buckets of 1,000 chunks (`bucket_id = chunk_id / 1000`, formatted `%05d`) — a filesystem concern only; bucket ids never appear in catalog keys. Directories are created on demand. ``` {default_data_dir}/ -├── meta/rocksdb/ ← meta store (WAL always on) +├── catalog/rocksdb/ ← catalog (WAL always on) ├── hot/{chunk:08d}/ ← per-chunk hot RocksDB (transient) ├── ledgers/{bucket:05d}/{chunk:08d}.pack ├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash) @@ -177,17 +145,17 @@ During ingestion the daemon maintains **one hot RocksDB per chunk** at `{hot_sto | `txhash` | tx hash → seq | `getTransaction` for the live chunk | | events CFs | live events (schema per the events doc) | `getEvents` for the live chunk | -CFs share the instance's WAL, so each ledger commits as **one atomic WriteBatch across all CFs** — there is no cross-store ordering to reason about within a chunk. Per-CF options keep tuning independent (the events CFs carry their own settings). 
The DB is created when ingestion enters the chunk and discarded whole once every cold artifact derived from the chunk is durable **and** the rolling index covers the chunk; it keeps serving tx lookups across the brief freeze-to-coverage interval, and freeze, rebuild, and discard all chain within one lifecycle tick. +CFs share the instance's WAL, so each ledger commits as **one atomic WriteBatch across all CFs**. Per-CF options keep tuning independent (the events CFs carry their own settings). The DB is created when ingestion enters the chunk. It is discarded whole once every cold artifact derived from the chunk is durable **and** the rolling index covers the chunk. It keeps serving tx lookups across the brief freeze-to-coverage interval; freeze, rebuild, and discard all chain within one lifecycle run. -### Meta-store keys +### Catalog keys -The meta store holds three groups of keys: per-chunk artifact state keys, hot DB state keys, and config pins. +The catalog holds three groups of keys: per-chunk artifact state keys, hot DB state keys, and config pins. **Artifact state keys**: | Key | Value | Meaning | |---|---|---| -| `chunk:{chunk:08d}:lfs` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk pack file state. | +| `chunk:{chunk:08d}:ledgers` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk pack file state. | | `chunk:{chunk:08d}:txhash` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk `.bin` file state. Transient — removed at window finalization. | | `chunk:{chunk:08d}:events` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk events cold segment state. | | `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` | `"freezing"` \| `"frozen"` \| `"pruning"` | One key per index **coverage**. The key *name* carries the coverage `[lo, hi]` and maps 1:1 to the file `{lo:08d}-{hi:08d}.idx`; the *value* is pure lifecycle state — the same three values as every other artifact key. 
At most one coverage per window is `"frozen"` at any moment, and a key with `hi` = its window's last chunk is **terminal** by definition (see [Index keys](#index-keys) below). | @@ -200,209 +168,132 @@ For the per-chunk keys, `"freezing"` means the immutable file is being written; |---|---|---| | `hot:chunk:{chunk:08d}` | `"transient"` \| `"ready"` | The chunk's hot DB. | -`"ready"` means the RocksDB dir exists and is usable. `"transient"` brackets a directory operation in flight — creation or deletion; no code path ever needs to know which, since the recovery is the same either way (the open path wipes and recreates; the discard scan re-runs). A crash mid-operation is detectable from the key value alone. One key per chunk; the column families inside the DB carry no individual meta-store state. +`"ready"` means the RocksDB dir exists and is usable. `"transient"` brackets a directory operation in flight — creation or deletion; no code path ever needs to know which, since the recovery is the same either way (the open path wipes and recreates; the discard scan re-runs). A crash mid-operation is detectable from the key value alone. One key per chunk; the column families inside the DB carry no individual catalog state. -**Config pins** (there is no stored watermark — see below): +**Config pins:** | Key | Value | Written when | |---|---|---| | `config:earliest_ledger` | `uint32` (decimal string, chunk-aligned) | On the first daemon start. Immutable thereafter — changing it currently requires wiping the data directory, until a `set-earliest-ledger` admin command exists (see [Configuration](#configuration); the floor machinery already converges for either direction). | | `config:chunks_per_txhash_index` | `uint32` (decimal string) | On the first daemon start; immutable thereafter. Startup aborts if the config value doesn't match. 
| -**Progress is derived, never stored — and never shared.** The hot DB's synced per-ledger WriteBatch *is* the durable commit; recording it again in the meta store would only create a second copy of the same fact, plus the ordering rule needed to keep the copy honest. Two derivations read progress back out of the catalog, one per consumer, at the two granularities they need. Both lean on one **key-creation invariant**: a `hot:chunk` key is created only after every ledger below its chunk has durably committed — at a boundary, ingestion closes chunk C's write handle *before* creating C+1's key (the [ingestion loop](#hot-db-ingestion) enforces the ordering); at startup, the resume chunk's key is created only after derivation has already run. The highest hot key therefore *is* the live chunk, and everything below it is complete. The watermark derivation, though, counts only `"ready"` hot keys (refining the top chunk's exact ledger from its hot DB), so a `"transient"` key never advances the bound on its own — which is precisely what lets recovery demote any hot key to `"transient"` without disturbing the watermark (see `deriveCompleteThrough`). - -The lifecycle tick needs only chunk granularity — which chunks are complete, and where the sliding retention floor anchors: - -```go -// completeThrough is the highest ledger the lifecycle may treat as durably -// ingested. Two complementary terms: a COLD term — the highest chunk whose -// artifacts are all durable — which leads at startup (catch-up just ran; -// ingestion hasn't started); and a POSITIONAL term — everything below the -// live chunk, by the key-creation invariant — which leads in steady state -// (a chunk completes long before its cold artifacts exist). -func deriveCompleteThrough(cat Catalog) uint32 { - // Cold term: a chunk counts only when pendingArtifacts() is empty (lfs - // AND events frozen; txhash frozen or index-covered). 
NOT merely "lfs - // frozen": a crash mid-freeze can leave lfs frozen while events is still - // "freezing", and counting that chunk would let reads open over a - // partial artifact. An incompletely frozen tip chunk must DEGRADE the - // bound so catch-up / re-ingestion repairs it. highestDurableChunk - // returns -1 when NO chunk is durable (a fresh start), so the cold term - // is then chunkLastLedger(-1) = 1 — the pre-genesis sentinel — never a - // spurious chunk-0 bound that would resume a young network past its tip. - through := chunkLastLedger(highestDurableChunk(cat)) - // Positional term — counts only "ready" hot keys, NOT "transient" ones. A - // "transient" key marks a hot DB mid-create/mid-delete, or one a recovery - // demoted; excluding it is what lets recovery demote ANY hot key without - // inflating this bound (see [Surgical recovery](#scenario-coverage)). The - // one case this under-counts — a complete-but-unfrozen chunk left - // "transient" by a boundary crash — is recovered by deriveWatermark's - // refinement (below), which opens that highest ready chunk and reads its - // committed seq; the lifecycle tick never sees it, because the resume chunk - // is reopened "ready" before the first tick (and an under-count would only - // defer work anyway). When the live chunk is chunk 0 (a young genesis - // network), maxChunk-1 = -1 and chunkLastLedger(-1) = 1: nothing below - // chunk 0 is complete. - if hot := readyHotChunkKeys(cat); len(hot) > 0 { - through = max(through, chunkLastLedger(maxChunk(hot)-1)) - } - return max(through, cat.EarliestLedger()-1) -} -``` - -Ingestion's resume point at startup needs the exact ledger — the one consumer of sub-chunk precision: - -```go -// deriveWatermark is deriveCompleteThrough refined by exactly ONE read of the -// highest ready hot DB. 
That read does two jobs: (1) sub-chunk precision inside -// the live chunk, and (2) recovering the chunk-level frontier when the -// positional term under-counts — a boundary crash can leave the live chunk -// "transient", so the highest *ready* chunk is the just-completed predecessor, -// whose completion no key now advertises; reading its maxCommittedSeq supplies -// that frontier. This is the one spot where the ready-only count leans on -// opening a hot DB rather than on key existence alone — and if that ready -// chunk's dir is missing we fatal (below) rather than degrade: the price of -// count-only-ready. Runs once, before ingestion starts — the only time opening -// a hot DB is safe (once ingestion runs, the live DB is held exclusively by -// its writer). -func deriveWatermark(cat Catalog) uint32 { - for _, c := range readyHotChunks(cat) { - if !dirExists(hotChunkPath(c)) { - // Checked for EVERY ready key, not just the one opened below: - // derivation runs before every other open site, so without this - // a lost hot volume dies as an opaque RocksDB error (or worse, - // a predecessor's loss is silently healed by discard) instead - // of the curated recovery instruction. Never skip silently — - // that would auto-heal the mount-misconfiguration case. - fatalf("hot:chunk:%08d is \"ready\" but its dir is missing — "+ - "hot storage lost; run surgical recovery (case 4).", c) - } - } - w := deriveCompleteThrough(cat) - if live, ok := highestReadyHotChunk(cat); ok { - db := openReadOnly(live) - w = max(w, maxCommittedSeq(db)) - db.Close() // released before startup reopens the same path read-write - } - return w -} -``` - -During operation no shared watermark exists at all: ingestion keeps its progress as a plain local, and each lifecycle tick calls `deriveCompleteThrough` fresh. The meta-store catalog thereby stays a *pure* catalog — every key names a file/dir state or a config pin. 
Postcondition-driven catch-up is what makes a derived watermark safe: catch-up converges *ranges*, not resume pointers, so derivation can never hide a hole — and a lost hot volume self-degrades the watermark to the last frozen boundary instead of requiring a manual rewind ([surgical recovery case 4](#scenario-coverage)). +**Resume point.** Recomputed at startup from the durable keys plus a read of the live hot DB (see [Startup](#startup)). ### Index keys -An index key `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` carries the chunk range `[lo, hi]` its `.idx` covers in the key **name**; the value holds only lifecycle state. The filename is derived from the key by a fixed bijection — `txhash/index/{txhash_index:08d}/{lo:08d}-{hi:08d}.idx` — so resolving a key to its file never involves reading the value or listing a directory. [The transactions design](./gettransaction-full-history-design.md) (§6.3) is the canonical reference for coverage semantics, with the rationale and a worked example; the properties this doc's protocols depend on: - -- **Coverage is the whole identity** — there is no per-attempt counter. A retry of a crashed build re-marks the same key and rewrites the same file from scratch, exactly as rule 1 re-materializes a per-chunk artifact at its canonical path. -- **`lo`** rises above the window's first chunk when `earliest_ledger` or the sliding floor cuts into the window at build time (`lo` rising is how a mid-window floor is encoded); **`hi`** advances by one chunk per boundary while the window is current, and equals the window's last chunk once it finalizes. -- **Terminal-ness is derived, not stored**: the key whose `hi` equals its window's last chunk (computable forever from the immutable `chunks_per_txhash_index` pin). A window whose frozen key is terminal is finalized — its `.bin` inputs were demoted in the same commit, and its index is never rebuilt again (only a retention-widening catch-up re-derives it, at its new, wider coverage). 
-- **The uniqueness invariant: at most one coverage per window is `"frozen"` at any moment.** The rebuild's commit is one atomic synced batch ([rule 3](#catch-up-primitives)'s commit step holds the exact composition), so the frozen coverage changes hands atomically and readers resolve "the window's index" as *the unique frozen key* — no tie-break, no value parsing. Everything else under the window's prefix is transient debris: `"freezing"` = a crashed attempt (re-marked and overwritten if its coverage is built again, otherwise swept); `"pruning"` = a superseded coverage (finish the unlink, drop the key). - -So the `.idx` hashes exactly the transactions in chunks `[lo, hi]`: chunks below `lo` are out of scope (floor); chunks above `hi` are served from their chunks' hot DBs until the next rebuild advances `hi`. While the window is current, `lo` tracks the floor and `hi` the tip automatically — no separate floor-driven rebuild is ever needed. Once finalized, the `.idx` is static; a floor that later advances within the window leaves it stale (`lo` referencing chunks pruning has since removed), which the reader retention contract handles cleanly. A window-straddling floor exists at most once at any moment — the window containing the effective retention floor. +An index key `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` names the chunk range `[lo, hi]` that its `.idx` covers, mapping 1:1 to the file `txhash/index/{txhash_index:08d}/{lo:08d}-{hi:08d}.idx`. -### Per-chunk artifact lifecycle +`hi` grows as the window fills: at each chunk boundary the rebuild folds in the chunk that just froze, advancing `hi` by one. When `hi` reaches the window's last chunk, the window is **complete** and its index is **terminal** — final, never rebuilt again. -Pack files, events segments, and `.bin` files (the `processChunk` outputs) are write-once: +`lo` is the higher of the window's first chunk and the retention floor, fixed when the index is built. 
So: -``` - absent ──► ingesting ──► freezing ──► frozen ──► pruning ──► absent -``` - -- **Absent** — no artifact key, no immutable file. -- **Ingesting** — hot DB holds the data being written; artifact key is absent; immutable file doesn't exist yet. -- **Freezing** — `processChunk` has put `"freezing"` and is materializing the file. The file may be partial on disk. A crash here is detectable from the key value alone — recovery either re-writes (within retention) or deletes the partial file (past retention). -- **Frozen** — immutable file is fsynced at its canonical path; artifact key is `"frozen"`. Once all three of a chunk's artifacts are frozen *and* the rolling index covers the chunk, the chunk's hot DB is discarded. -- **Pruning** — retention is deleting the immutable file; artifact key is `"pruning"`; the file may or may not still be on disk. - -The `"freezing"` mark is set **before** any I/O for that artifact. This gives the invariant **"any file on disk has its meta-store key set"** — a retention scan iterates keys, and every file is reachable that way. Without the pre-write mark, a crash between writing the file and setting the key would leave an artifact that no scan could see. - -### Index artifact lifecycle +- a window still being rebuilt each boundary has its `lo` recomputed every time, so it rises as the floor does, dropping chunks that have aged out of retention; +- a terminal window's `.idx` keeps the `lo` it was built with; if the floor later climbs past that `lo`, the index still covers chunks that have dropped out of retention — but a read for any ledger below the floor returns not-found regardless of what the index says, so that stale coverage is never served. -The streamhash index is the one logically-mutable cold artifact: it is re-derived on every chunk boundary while its window is current. 
Physically a file is writable only while its key has never been `"frozen"` in this run — mutation happens by freezing the next coverage and demoting the old one; the frozen file readers resolve is immutable until unlinked. +So `lo` equals the window's first chunk unless the start of the window has dropped below the floor. -Each coverage runs the same lifecycle as every per-chunk artifact: +[The transactions design](./gettransaction-full-history-design.md) (§6.3) is canonical for coverage semantics, with a worked example. -``` - absent ──► freezing ──► frozen ──► pruning ──► absent -``` - -- **Freezing** — the key was put (with its coverage in the name) *before* any I/O; the file may be partial or absent. A crashed attempt parks here. If its coverage is built again, the build re-marks the key and rewrites the file wholesale; one the prune scan observes was not retried — **delete file and key, never salvage**. Salvage would require proving the file complete; deletion needs no proof, and a rebuild re-derives identical bytes anyway (the merge is a deterministic function of the coverage). -- **Frozen** — the file and its dirent are fsynced and the commit batch has landed. The window's unique frozen coverage *is* the live index. -- **Pruning** — a newer coverage superseded this one (demoted in its commit batch), or retention is removing the window. The standard sweep finishes: unlink → `fsyncDir` → delete the key. +### One write protocol -The *window-level* progression — coverage advancing boundary by boundary, then finalization — emerges from the coverage chain: each boundary freezes the widened coverage and demotes its predecessor in one atomic batch, and the terminal key is the one whose `hi` equals the window's last chunk. 
The batch maintains **at most one frozen coverage per window at all times** — a crash at any instant leaves either the old coverage frozen (batch not landed; the new one is `"freezing"` debris) or the new one frozen (predecessor already `"pruning"`), never both, never neither. +Every durable artifact — per-chunk files and index coverages alike — is written the same way, **mark-then-write**: -Why rewriting coverage-named files in place is safe — readers hold the live `.idx` open while the next coverage is written into the same directory — is argued in full in [the transactions design](./gettransaction-full-history-design.md) (§7.5). Four facts carry it: the build's skip rule (no scheduled build ever targets the name readers resolve), the stage ordering plus the eager sweep's window-locality and pruning-only scope, the floor's monotonicity within a run plus reader handles dying with the process, and the merge's determinism. A change to any of the four must re-prove the argument. +1. put `"freezing"` *before* any I/O; +2. write the file; +3. fsync the file, its parent dirent, and — when the parent was just created — the grandparent dirent; +4. flip the key to `"frozen"`. -### Hot DB lifecycle +The key is always written before the file. So every file can be found from its key — cleanup walks keys, never directories — and a file left half-written by a crash carries a `"freezing"` key, which marks it for re-derivation or removal. Step 3 fsyncs the directory entries, not just the file, so the file's existence on disk survives a crash before its key flips to `"frozen"`. -``` - absent ──► transient ──► ready ──► transient ──► absent - (creating) (deleting) -``` +Deletion is the same protocol in reverse: demote the key to `"pruning"`, unlink the file, then delete the key, with an `fsyncDir` between the unlink and the key delete. So a key is gone only once its file is — **key absent ⟹ file gone**. 
Two functions do all file deletion: `sweepChunkArtifacts` for per-chunk artifacts and `sweepIndexKey` for index files. -- **Absent** — no hot DB key, no dir on disk. -- **Transient** — a directory operation is in flight: either creation (key put, RocksDB dir initializing) or deletion (discard rmdir'ing the dir). A crash in either leaves a possibly-partial dir; the recovery is identical regardless of which operation was interrupted — the open path wipes and recreates, the discard scan re-runs — which is why one value suffices. -- **Ready** — dir exists and is usable for reads and writes. The chunk's contents at any moment run up to `last_committed_ledger`'s position within the chunk; ledgers above it are pending streaming. +--- -The hot DB is discarded whole once the chunk's cold artifacts are all frozen and the rolling index covers the chunk — freeze, rebuild, and discard chain within one lifecycle tick. All lifecycles in this section are observable purely from meta-store keys — no filesystem inspection needed. +## Backfill -### One write protocol +Backfill makes every artifact derived from a range of ledgers durable and servable. It has three parts, in the order below: a **resolver** (`resolve`) that diffs what's wanted against the catalog and returns a plan of the missing work; the **primitives** (`processChunk`, `buildTxhashIndex`) that produce each artifact; and an **executor** (`executePlan`) that runs the plan concurrently. The [Startup](#startup) backfill loop and the [Lifecycle](#lifecycle) run are its two callers. -Every durable artifact — per-chunk files and index coverages alike — uses the same protocol, **mark-then-write**: put `"freezing"` *before* any I/O; write the file; fsync the file, its parent dirent, and (when the parent was just created) the grandparent dirent; flip the key to `"frozen"`. 
The pre-mark guarantees *every file on disk has a key*, so all cleanup is key-driven — nothing ever lists a directory to find work — and a crash mid-write is visible as a `"freezing"` key. The dirent barriers guarantee the key never outlives the file's creation: without them, a power crash can revert the file's — or a freshly created directory's — creation under a durable `"frozen"` key, which key-only idempotency would then never repair; that is why writes that create their parent dir barrier the grandparent too (the same two-level barrier `openHotDB` uses). +### Postcondition-driven planning -Per-chunk artifacts write at a canonical path and flip with a single-key put. The index extends the flip into a **commit batch** — the artifact is logically mutable, so everything a build changes commits *together* in one atomic synced write; [rule 3](#catch-up-primitives)'s commit step holds the exact composition. The batch extension changes what commits together, not how a file becomes durable. +Backfill works from a postcondition: *given a range, every artifact derived from every ledger in it must be durable and servable.* `resolve` reads the catalog and returns a `Plan` of only the missing work — per-chunk artifacts whose key isn't `"frozen"`, and window indexes whose frozen coverage doesn't yet span the range. It reads nothing but durable keys, so every run re-plans from what's on disk; a restart neither redoes finished work nor skips unfinished work. The plan is a flat list of chunk builds and index builds: -Exits mirror the entries: every sweep demotes a still-`"frozen"` key first, then removes the *file before the key* with an `fsyncDir` barrier between (`sweepChunkArtifacts` and `sweepIndexKey` — the system's only two deletion bodies, one per key family), giving the complementary guarantee — *key absent ⟹ file gone*. The hot DB's `transient`/`ready` bracket is the same two ideas applied to a directory. 
+```go +type ChunkBuild struct { + Chunk ChunkID + Artifacts ArtifactSet // which kinds this chunk still needs — one processChunk pass produces all +} ---- +type IndexBuild struct { + Window WindowID + Lo, Hi ChunkID // coverage to build; terminal iff Hi == windowLastChunk(Window) + // dependencies are derivable (the ChunkBuilds in [Lo, Hi]), so no input list +} -## Catch-up primitives +type Plan struct { + ChunkBuilds []ChunkBuild + IndexBuilds []IndexBuild +} -Two primitives materialize cold artifacts: `processChunk` and `buildTxhashIndex`. They have exactly two callers — the [Startup](#startup) catch-up loop and the [Lifecycle](#lifecycle) tick — and both call them the same way: through the [resolver](#postcondition-driven-scheduling) and executor, sharing one set of postconditions and one scheduler, so the two regimes can never disagree about what done looks like or how it gets there. Five protocol rules govern them; the [resolver](#postcondition-driven-scheduling) and the [execution model](#catch-up-execution-model) below turn them into a schedule. +// resolve returns the work missing for [rangeStart, rangeEnd]. +func resolve(cfg Config, rangeStart, rangeEnd ChunkID) Plan { + if rangeEnd < rangeStart { + return Plan{} // young network: no complete chunk yet + } + cat := cfg.Catalog + needs := map[ChunkID]ArtifactSet{} -**(1) Artifact key values.** `processChunk` applies the [one write protocol](#one-write-protocol) to each requested kind (`lfs`, `events`, `txhash`/`.bin`): `"freezing"` before that kind's I/O begins, `"frozen"` only after its file and dirent barriers (a new bucket dir, every 1000th chunk, barriers the grandparent too). The per-kind idempotency rule: skip iff the key's value is `"frozen"`; a `"freezing"`, `"pruning"`, or absent key triggers re-materialization, itself idempotent — the writer overwrites the file at its canonical path and flips to `"frozen"`. 
The streamhash index uses the **same pattern at its coverage-named path**; only its commit differs — the `"freezing"`→`"frozen"` flip rides in rule 3's atomic batch instead of a single-key put. + for c := rangeStart; c <= rangeEnd; c++ { + for _, kind := range []Kind{Ledgers, Events} { + if cat.State(c, kind) != Frozen { + needs[c] = needs[c].Add(kind) + } + } + } -**(2) `processChunk(chunk, artifacts)`.** `artifacts` is the subset of outputs to produce; the [resolver](#postcondition-driven-scheduling) uses it to skip producing `.bin` when the window's `.idx` already covers the chunk. The LCM source is chosen internally by `catchupSource`, in preference order — and the same rule serves *both* callers, which is what lets the lifecycle's freeze be ordinary plan execution: + var builds []IndexBuild + for _, w := range windowsOverlapping(rangeStart, rangeEnd) { + desired := Range{ + Lo: max(windowFirstChunk(w), rangeStart), + Hi: min(windowLastChunk(w), rangeEnd), + } + if frozenCoverage(cat, w).Covers(desired) { + continue + } + for c := desired.Lo; c <= desired.Hi; c++ { + if cat.State(c, TxHashBin) != Frozen { + needs[c] = needs[c].Add(TxHashBin) + } + } + builds = append(builds, IndexBuild{Window: w, Lo: desired.Lo, Hi: desired.Hi}) + } + return Plan{ChunkBuilds: chunkBuilds(needs), IndexBuilds: builds} +} +``` -1. **A ready, complete hot DB** (`maxCommittedSeq ≥ chunkLastLedger(chunk)`): read locally via `HotLedgers` — this is how a just-closed chunk freezes without refetching, and how a complete-but-unfrozen chunk is produced even when the bulk source lags behind it. -2. **The frozen local `.pack`**, when `lfs` is not among the requested outputs: re-derivation without a download. -3. **The configured bulk backend** (BSB by default — see `[catch_up.bsb]`). 
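
The per-window coverage test inside `resolve` can be read in isolation. The following is a minimal sketch, not the daemon's code: the `Range` layout, the `Valid` flag standing in for "no frozen key exists", and the helper `desiredCoverage` are assumptions made for illustration, and the window bounds in `main` are made-up numbers.

```go
package main

import "fmt"

// ChunkID and Range borrow names from resolve; their definitions here are
// assumptions for this sketch.
type ChunkID uint32

type Range struct {
	Lo, Hi ChunkID
	Valid  bool // false stands in for "the window has no frozen coverage"
}

// Covers reports whether r contains every chunk in want.
func (r Range) Covers(want Range) bool {
	return r.Valid && r.Lo <= want.Lo && want.Hi <= r.Hi
}

// desiredCoverage (hypothetical helper) clamps a window's chunk span to the
// requested range: the lower bound rises to rangeStart, the upper bound is
// capped by rangeEnd.
func desiredCoverage(windowFirst, windowLast, rangeStart, rangeEnd ChunkID) Range {
	lo, hi := windowFirst, windowLast
	if rangeStart > lo {
		lo = rangeStart
	}
	if rangeEnd < hi {
		hi = rangeEnd
	}
	return Range{Lo: lo, Hi: hi, Valid: true}
}

func main() {
	// Made-up window spanning chunks 1000..1999, requested range 1200..1500.
	desired := desiredCoverage(1000, 1999, 1200, 1500)
	fmt.Println(desired.Lo, desired.Hi)

	wider := Range{Lo: 1000, Hi: 1600, Valid: true}   // stored coverage spans desired
	shorter := Range{Lo: 1200, Hi: 1400, Valid: true} // stored Hi falls short
	fmt.Println(wider.Covers(desired))   // no index build needed
	fmt.Println(shorter.Covers(desired)) // index build needed
	fmt.Println(Range{}.Covers(desired)) // no frozen key: index build needed
}
```

Capping the upper bound by the range end is what lets a complete window and the trailing partial window go through the same comparison.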
+### The primitives -The hot branch distinguishes *loss* from *staleness*: a `"ready"` key whose directory is **missing or unopenable** is hot-volume loss — the same case-4 fatal `deriveWatermark` enforces, never silently healed; a hot DB that opens but is **incomplete** is legitimate staleness (a leftover awaiting the discard scan, or a surgically stripped chunk's stale neighbor) and simply falls through to the next source — re-derivation *is* its recovery. Per ledger, the needed extractors run over one LCM stream; tx-hash entries are collected and **sorted in memory** before the `.bin` is written (rule 5). +`processChunk` writes a chunk's requested artifacts through the [one write protocol](#one-write-protocol), reading ledgers from `backfillSource`. Its hot-DB branch is what lets the lifecycle freeze a just-closed chunk from its own hot DB, on the same path as a cold backfill. ```go -// processChunk materializes the requested artifact kinds for one chunk from a -// single pass over its 10,000 LCMs, sourced by catchupSource (rule 2's -// preference order). -func processChunk(chunk ChunkID, artifacts ArtifactSet, cfg Config) error { +func processChunk(cfg Config, chunk ChunkID, artifacts ArtifactSet) error { cat := cfg.Catalog - for _, kind := range artifacts.Kinds() { // rule 1 idempotency: frozen kinds self-skip - if cat.State(chunk, kind) == Frozen { - artifacts = artifacts.Remove(kind) - } - } - if artifacts.Empty() { - return nil + source, err := backfillSource(cfg, chunk, artifacts) + if err != nil { + return err } - source := catchupSource(chunk, artifacts, cfg) - batch := cat.NewBatch() // mark-then-write: "freezing" BEFORE any I/O + batch := cat.NewBatch() // mark "freezing" before any I/O for _, kind := range artifacts.Kinds() { batch.Put(chunkKey(chunk, kind), "freezing") } batch.Commit() - // One streaming pass; only the requested extractors run. Files are - // (re)created at their canonical paths — re-materialization overwrites. 
- w := newArtifactWriters(chunk, artifacts) // .pack, events segment, in-memory txhash entries + w := newArtifactWriters(chunk, artifacts) for seq := chunkFirstLedger(chunk); seq <= chunkLastLedger(chunk); seq++ { w.Add(source.GetLedger(seq)) } - w.Finish() // sorts txhash entries in memory, writes the .bin (rule 5) - w.FsyncAll() // files + parent dirents (+ grandparent for a new bucket dir) - // — all durable BEFORE the flips below (rule 1) + w.Finish() + w.FsyncAll() // durable before the keys flip to "frozen" batch = cat.NewBatch() for _, kind := range artifacts.Kinds() { @@ -412,202 +303,76 @@ func processChunk(chunk ChunkID, artifacts ArtifactSet, cfg Config) error { return nil } -// catchupSource implements rule 2's preference order. The hot branch fatals -// only on loss (ready key, missing/unopenable dir — deriveWatermark's rule, -// third call site); an incomplete-but-present DB is staleness and falls -// through, because re-derivation from the next source IS its recovery. -func catchupSource(chunk ChunkID, artifacts ArtifactSet, cfg Config) LedgerSource { +// backfillSource picks a chunk's ledger source in a fixed preference order. The +// hot branch errors only when a "ready" hot DB won't open — its data is lost. +// An incomplete-but-present DB is just stale: it falls through to the next +// source, which re-derives the chunk and recovers it. 
+func backfillSource(cfg Config, chunk ChunkID, artifacts ArtifactSet) (LedgerSource, error) { cat := cfg.Catalog if state, _ := cat.Get(hotChunkKey(chunk)); state == "ready" { - if !dirExists(hotChunkPath(chunk)) { - fatalf("hot:chunk:%08d is \"ready\" but its dir is missing — "+ - "hot storage lost; run surgical recovery (case 4).", chunk) + db, err := openRocksDBReadOnly(hotChunkPath(chunk)) + if err != nil { + return nil, fmt.Errorf("hot DB for chunk %d is ready but won't open: %w", chunk, err) } - db := openRocksDBReadOnly(hotChunkPath(chunk)) if maxCommittedSeq(db) >= chunkLastLedger(chunk) { - return &HotLedgers{chunk: chunk, store: db} + return &HotLedgers{chunk: chunk, store: db}, nil } db.Close() // incomplete: stale leftover — close and fall through; the discard scan owns it } - if cat.State(chunk, LFS) == Frozen && !artifacts.Has(LFS) { - return packReader(chunk) // re-derive locally; no redundant download + if cat.State(chunk, Ledgers) == Frozen && !artifacts.Has(Ledgers) { + return packReader(chunk), nil // re-derive locally } - // Bulk backend — the only source for a chunk with no local copy: one below - // the floor on a retention widen, or one a surgical recovery demoted. If the - // backend's tip lags below this chunk (a captive-core node runs ahead of its - // trailing object store), block for coverage rather than aborting — poll the - // tip on a bounded backoff and fatal with a specific error only if it never - // advances. A chunk WITH a local copy never reaches here (it took the hot or - // pack branch), so this wait never gates a normal restart whose range is - // entirely local; it fires only for genuinely backend-only chunks. + // Backfill backend: the only source for a chunk with no local copy. If its + // tip lags below this chunk, wait for coverage. 
waitForBackendCoverage(cfg, chunk) // bounded; fatal on timeout - return bulkBackend(cfg) // BSB by default — see [catch_up.bsb] + return backfillBackend(cfg), nil // BSB by default } ``` -**(3) `buildTxhashIndex(w, lo, hi)` — rolling rebuild.** The lifecycle rebuilds the **current** window's index on every chunk boundary, so a frozen chunk's tx hashes move into the index promptly and its hot DB is discarded in the same tick. The covered chunk range is explicit: - -- `lo` defaults to the window's first chunk, rising to `chunkID(effectiveRetentionFloor)` when the floor cuts into this index. -- `hi` is the highest frozen chunk in the window — the window's last chunk once it's complete, lower while it's still filling. - -The build runs the one write protocol with the batch-commit extension: - -1. **Skip check**: if the window's unique frozen key already covers exactly `[lo, hi]`, return — there is nothing to write, and any leftover transient keys are the sweeps' job (rule 4), not the builder's. (A frozen key covering the full window is terminal by definition — `hi` equals the window's last chunk — so the skip also covers re-scheduled builds of finalized windows, which must not demand `.bin` inputs the sweep has deleted.) -2. **Mark**: put `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` = `"freezing"` — an idempotent overwrite when a crashed attempt (or a demoted coverage made desired again by a cross-restart regression) left the key behind. The build is **terminal** iff `hi` is the window's last chunk — a derived property, marked nowhere. -3. **Write**: k-way merge the sorted `.bin` files for chunks `[lo, hi]` into streamhash's `SortedBuilder`, writing `txhash/index/{txhash_index:08d}/{lo:08d}-{hi:08d}.idx` — created or truncated wholesale; a writer only ever holds a file whose key is non-frozen, never one a reader can resolve. Fsync the file and its dir (and the dir's own dirent in `txhash/index/` when this build created the window dir — first build of a window). -4. 
**Commit**: one atomic synced batch — this coverage `"freezing"`→`"frozen"`; the window's predecessor frozen coverage (if any) →`"pruning"`; and iff this is the terminal build, every `chunk:{chunk}:txhash` key in the window →`"pruning"`. This batch is the *entire* finalization protocol — there is no separate cleanup step; the demoted keys become ordinary sweep work (rule 4). - -A crash before step 4 leaves the predecessor frozen and the new coverage as `"freezing"` debris (file partial or complete — irrelevant; it is deleted unread). A crash after step 4 leaves the new coverage frozen and the demoted keys as `"pruning"` work the sweeps finish. There is no crash point at which two coverages are frozen, the live index is unreachable, or a `"frozen"` `chunk:c:txhash` key's `.bin` has been deleted — the batch only ever *demotes* keys, and files are touched exclusively by the sweeps, under non-frozen keys. - -Precondition: every chunk in `[lo, hi]` has `chunk:{chunk}:txhash == "frozen"` (its `.bin` exists). The function fails loudly if violated. Catch-up calls the same function for every window its range overlaps — a complete window's full desired range (a terminal build) or the trailing window's producible range (non-terminal); the terminal batch finalizes in the same write, so finalization is never a separate step for either caller. - -The full build mechanics live in [the transactions design](./gettransaction-full-history-design.md) (§7): the `buildTxhashIndex` pseudocode, the rationale for rebuilding from scratch each boundary, the disk bounds and provisioning numbers (≈2× index size transient per rebuild; ~12.5 GB written in ~1 minute at a dense window's end), the crash matrix, and the safety argument for rewriting coverage-named files. This doc keeps the protocol surface above — the steps, the commit batch's composition, and the precondition — because the resolver, the sweeps, and the crash analysis depend on exactly those. 
- -**(4) Key-driven sweeps.** All file deletion in the system happens through keys in a transient state — never by listing directories. Two sweep rules cover every case, sharing one mechanic (unlink the file → `fsyncDir` the parent → delete the key, batching the fsyncs and key-deletes when sweeping many at once): - -- **Index `"freezing"` keys** — an abandoned build attempt, left by a crash. A retry of the same coverage re-marks and overwrites the key in place (rule 3 step 2), and builds run before this sweep in every regime — so any `"freezing"` key the sweep observes was *not* retried: its coverage is no longer desired. Disposition: **delete file and key, never salvage**. The file might even be complete, but proving that buys nothing — a rebuild re-derives identical bytes — and a single no-questions rule collapses the crash inventory. (Per-chunk `"freezing"` keys are *not* swept this way: their artifacts live at canonical paths, so rule 1's idempotent re-materialization repairs them in place within retention, and the retention prune removes them past it. One exception: a `"freezing"` `chunk:c:txhash` key inside a *finalized* window — re-materialization will never be scheduled for a covered window, so the prune scan's redundant-input branch demotes and sweeps it like its `"frozen"` siblings.) -- **`"pruning"` keys** — superseded index coverages, a finalized window's `.bin` inputs (demoted en masse by the terminal commit batch), and everything demoted by retention pruning. The sweep finishes the removal. Because the key outlives the unlink — the `fsyncDir` barrier makes the unlink durable *before* the key delete commits — a power loss anywhere leaves the key in place and the sweep re-runs. 
This is the exit-side counterpart of rule 1's invariant: **key absent ⟹ file gone.** - -The unlink-before-key order is load-bearing: deleting the key first would, on a crash, leave a file with no key — invisible to every key-driven scan, the one orphan class this design cannot find. - -Both sweeps have two call sites: **eagerly, inside every `IndexBuild`'s execution** (`buildThenSweep`, right after the commit batch — whichever regime ran it), and from the tick's prune stage, which is the backstop for crash leftovers and the owner of retention pruning. The eager site is what bounds disk: without it, a long backfill would accumulate every finalized window's demoted `.bin`s until the first tick (≈20 bytes per transaction across all of history); with it, transient `.bin` disk is bounded by the windows actually in flight — the floor is one dense window's worth (~60 GB), irreducible because a window's build merges all of its `.bin`s at once. Crash anywhere mid-sweep leaves `"pruning"` keys the next tick finishes — the same convergence story regardless of caller. - -**(5) Streamhash formats.** The tx-hash artifacts use the streamhash pipeline, specified in [the transactions design](./gettransaction-full-history-design.md) (§6). What this doc's protocols rely on: the `.bin` is a **sorted** per-chunk run (`processChunk` sorts the chunk's entries in memory before writing — ~3M entries ≈ 60 MB, negligible), which is what makes the every-boundary rebuild a single streaming k-way merge instead of a two-pass build over unsorted input; the `.idx` is one self-contained streamhash MPHF file per coverage (its `MinLedger` derived from `lo`; no sidecar metadata); and hot tx hashes live as **a single column family inside the per-chunk hot DB** — there is no dedicated txhash store at any layer. The formats match the measured pipeline in the bench harness (`bench-fullhistory`), which is what makes the ~1-minute full-window build figure transfer to this design. 
- -### Postcondition-driven scheduling - -A naive scheduler would register every per-chunk and per-window task on every run and rely on each task to self-skip — but catch-up runs on every restart, and that shape re-derives every chunk's `.bin` only for finalization to immediately demote and delete it again: wasted work proportional to the retention window. Instead, catch-up has a contract — *given a range, ensure every artifact derived from every ledger in it is durable and servable* — and resolves what's missing before scheduling anything. The resolver is a registry of **kind rules**: each artifact kind contributes one rule that compares its postcondition against the catalog and emits the difference as tasks. The current kinds: - -- **`lfs`** (per-chunk): needed for chunk `c` iff `chunk:{c}:lfs` isn't `"frozen"`. -- **`events`** (per-chunk): same rule against `chunk:{c}:events`. -- **`txhash`** (per-window, the one kind with a cross-chunk artifact): for **each window overlapping the range**, compare the stored coverage (`{lo, hi}` from the *name* of the window's unique **frozen** index key) with the desired coverage `[max(window_start, chunkID(floor)), min(window_last_chunk, range_end)]` — the upper cap is what makes the rule uniform: for a complete window it's the window's last chunk, for the trailing window it's the range end, and no special trailing case exists. - - **Desired ⊆ stored** → schedule *nothing* for this window: no `.bin` production, no build. Three states land here: every steady-state restart; a floor that *rose* (the stale stored `lo` is the reader retention contract's problem, not a rebuild trigger); and a finalized window the range ends inside — a crash right after a terminal commit resumes exactly at that window's last ledger, where the terminal coverage already covers any desired range and the leftover `"pruning"` demotions stay the sweeps' job. 
- - **Desired exceeds stored** (`desired_lo < stored_lo`, or `desired_hi > stored_hi`, or no frozen key exists) → request `.bin` production for **every** chunk in the desired range — chunks whose `.bin` is already frozen self-skip inside `processChunk`, chunks the old `.idx` covered re-derive from local `.pack` files (no BSB) — and emit one index build `buildTxhashIndex(w, desired_lo, desired_hi)`. The build is terminal (input demotion) iff `desired_hi` is the window's last chunk. The `stored_hi` clause matters: a window that was *current* at shutdown carries a frozen key with `hi < last_chunk`, and when downtime crosses the window boundary it becomes a complete window that still needs its tail chunks' `.bin` and the full build — classifying by `lo` alone would strand chunks `(hi, last_chunk]` permanently. - -A new data type slots in as a new rule: a per-chunk kind adds a key check, an indexed kind adds another window loop contributing index builds. The skeleton that executes the plan (below) never changes. - -The comparison can trust `"frozen"` blindly: **a `"frozen"` `chunk:c:txhash` key implies its `.bin` exists, unconditionally.** Input keys are demoted to `"pruning"` in the same synced write that freezes the terminal coverage, and files are only ever deleted by the sweeps, under non-frozen keys — so no crash at any point can leave a frozen key whose file is gone. Whatever transient keys a crash does leave behind (`"freezing"` attempts, half-swept `"pruning"` demotions) are invisible to the resolver — it classifies on frozen state only — and are swept by the first lifecycle tick, rung at ingestion start. 
- -The per-window comparison is therefore crash-only-recoverable in *every* index state: a finalized window, a window mid-roll at shutdown, a terminal commit that landed but whose sweeps didn't run, and a crashed build attempt all converge through "desired vs stored → re-derive, rebuild," with inputs that are guaranteed producible (`.pack` files are within retention by definition of the desired range). One composition needs help from outside the resolver: a widening catch-up that re-froze a finalized window's `.bin` keys — or crashed mid-write, leaving one `"freezing"` — and then retention is narrowed back before its rebuild. The resolver then correctly schedules nothing (desired ⊆ stored), so re-materialization will never repair those keys; the prune stage's redundant-input branch demotes and sweeps them, `"frozen"` and `"freezing"` alike (see [Prune](#prune)). - -In code, the kind rules produce one flat value: +**`buildTxhashIndex(w, lo, hi, cat)`** rebuilds window `w`'s index to cover chunks `[lo, hi]` — `lo` the lowest in-floor chunk, `hi` the highest frozen chunk (the window's last once the window is complete). The lifecycle calls it on every chunk boundary while the window is current. ```go -// Both strata are pure data — no behavior is baked into the plan; the -// executor interprets it. That is what makes "the plan is just a value" -// literally true: it can be logged, diffed, and tested without running it. -type ChunkBuild struct { - Chunk ChunkID - Artifacts ArtifactSet // which kinds this chunk still needs — one processChunk pass produces all -} - -type IndexBuild struct { - Window WindowID - Lo, Hi ChunkID // coverage to build; terminal iff Hi == windowLastChunk(Window) - // No input list: the build's dependencies are derivable — every chunk in - // [Lo, Hi] that has a ChunkBuild in the same plan. Carrying them as a - // field would be a second copy that can drift. 
-} - -type Plan struct { - ChunkBuilds []ChunkBuild - IndexBuilds []IndexBuild -} - -// resolve computes the diff between desired state and the catalog. Pure read; -// the plan is just a value, recomputed from durable keys on every run — a -// restart re-plans from what is actually on disk, with nothing to reconcile. -func resolve(cfg Config, rangeStart, rangeEnd ChunkID) Plan { - if rangeEnd < rangeStart { - return Plan{} // young network: no complete chunk exists yet +func buildTxhashIndex(w WindowID, lo, hi ChunkID, cat Catalog) error { + prev := frozenCoverage(cat, w) + if prev != nil && prev.Lo == lo && prev.Hi == hi { + return nil // already built (e.g. a buildThenSweep retry re-entering after the commit) } - cat := cfg.Catalog - floor := chunkFirstLedger(rangeStart) // rangeStart already encodes the floor - needs := map[ChunkID]ArtifactSet{} // per-chunk work, union across kinds - for c := rangeStart; c <= rangeEnd; c++ { // per-chunk kinds - for _, kind := range []Kind{LFS, Events} { - if cat.State(c, kind) != Frozen { - needs[c] = needs[c].Add(kind) - } - } - } + key := indexKey(w, lo, hi) + cat.Put(key, "freezing") // mark before any I/O - var builds []IndexBuild - for _, w := range windowsOverlapping(rangeStart, rangeEnd) { // the txhash kind - desired := Range{ - Lo: max(windowFirstChunk(w), chunkID(floor)), - Hi: min(windowLastChunk(w), rangeEnd), // capped by range end ⇒ uniform trailing-window handling - } - stored := frozenCoverage(cat, w) // the unique "frozen" key's coverage, or none - if stored.Covers(desired) { - continue // steady-state restart, risen floor, or finalized window: nothing - } - for c := desired.Lo; c <= desired.Hi; c++ { - if cat.State(c, TxHashBin) != Frozen { - needs[c] = needs[c].Add(TxHashBin) - } - } - builds = append(builds, IndexBuild{Window: w, Lo: desired.Lo, Hi: desired.Hi}) + sb := streamhash.NewSortedBuilder(indexFilePath(key)) + for entry := range kWayMerge(binFiles(lo, hi)) { // sorted .bin files → one stream + 
sb.Add(entry) } - return Plan{ChunkBuilds: chunkBuilds(needs), IndexBuilds: builds} -} + sb.Finish() + fsyncFile(indexFilePath(key)) + fsyncDir(indexWindowDir(key)) // + grandparent on the window's first build -// buildThenSweep is how the executor runs an IndexBuild. The build's commit -// batch only demotes keys (rule 3); this eagerly runs the standard sweeps -// (rule 4) so the demoted files come back without waiting for a lifecycle -// tick. The sweep is WINDOW-LOCAL — it walks only this window's keys, so -// concurrent windows' sweeps touch disjoint keys — and as a bonus it -// finishes any "pruning" leftovers a previous crashed pass left in the same -// window. -func buildThenSweep(b IndexBuild, cfg Config) error { - cat := cfg.Catalog - if err := buildTxhashIndex(b.Window, b.Lo, b.Hi, cat); err != nil { - return err - } - for _, key := range indexKeys(cat, b.Window) { // superseded coverage(s) - if key.State == Pruning { - sweepIndexKey(key, cat) - } + batch := cat.NewBatch() // one atomic synced write — the whole finalization + batch.Put(key, "frozen") + if prev != nil { + batch.Put(indexKey(w, prev.Lo, prev.Hi), "pruning") // demote predecessor } - var demoted []ArtifactRef // terminal build: the window's .bin inputs - for c := windowFirstChunk(b.Window); c <= windowLastChunk(b.Window); c++ { - if cat.State(c, TxHashBin) == Pruning { - demoted = append(demoted, ArtifactRef{Chunk: c, Kind: TxHashBin}) + if hi == windowLastChunk(w) { // terminal: the merged .bin inputs are spent + for c := lo; c <= hi; c++ { + batch.Put(chunkKey(c, TxHashBin), "pruning") } } - if len(demoted) > 0 { - sweepChunkArtifacts(demoted, cfg, cat) - } + batch.Commit() return nil } ``` -### Catch-up execution model +`kWayMerge` and `SortedBuilder` are streamhash internals, covered in [the transactions design](./gettransaction-full-history-design.md) (§6–§7). 
-`executePlan` executes a plan; `runBackfill` is just backend validation plus `executePlan(resolve(…))`, and the [lifecycle tick](#lifecycle) calls the *same* `executePlan` for its production work — one scheduler, two callers. The shape is map/reduce without the shuffle or the job tracker: chunk builds are the maps, index builds are the per-group reduces (the `.bin`s are map-side-sorted runs, so each reduce is one streaming merge), and completion is recorded as the artifacts themselves. There is deliberately no task engine and no persisted task state, for two reasons. First, the dependency structure is two strata with one edge type — an index build waits on the chunk builds inside its coverage — which the runtime expresses directly: each chunk build closes a done-channel, each index build waits on the in-coverage channels, and the *ready-set* a DAG scheduler would maintain is simply the goroutines parked on the one worker semaphore. Second, a persisted task graph would be a second source of truth about progress, one that can drift from the artifact keys it describes; `resolve` re-plans from the keys on every run, so there is nothing to resume and nothing to reconcile. +### Execution model -```go -func runBackfill(ctx context.Context, cfg Config, rangeStart, rangeEnd ChunkID) error { - // Fail before any work only if a fall-through chunk has NO configured source - // at all. This mirrors catchupSource's preference: a chunk needs the bulk - // backend only when it is not already durable (self-skips), not complete in - // a ready hot DB, and not re-derivable from a local .pack — so the check - // concerns only those fall-through chunks, NOT the whole range. Load-bearing - // on a restart where rangeEnd is anchored on max(tip, lastCommitted): the - // span above a lagging bulk tip is produced locally and must not abort here. 
- // It does NOT check backend-tip COVERAGE: a fall-through chunk above a - // lagging-but-advancing backend is not doomed, only not-yet-producible, and - // catchupSource's bounded wait (rule 2) handles that per chunk rather than - // aborting the whole backfill. - if err := validateRangeProducible(cfg, rangeStart, rangeEnd); err != nil { - return err - } - return executePlan(ctx, resolve(cfg, rangeStart, rangeEnd), cfg) -} +`executePlan` runs a plan from either caller — startup backfill or the [lifecycle run](#lifecycle). Chunk builds run concurrently under one worker semaphore; each index build waits on the done-channels of the chunk builds inside its coverage, then runs. -func executePlan(ctx context.Context, plan Plan, cfg Config) error { - slots := make(chan struct{}, cfg.Workers) // the ONLY concurrency knob: one pool, all work kinds +```go +func executePlan(ctx context.Context, cfg Config, plan Plan) error { + slots := make(chan struct{}, cfg.Workers) // the only concurrency knob done := make(map[ChunkID]chan struct{}, len(plan.ChunkBuilds)) for _, cb := range plan.ChunkBuilds { done[cb.Chunk] = make(chan struct{}) @@ -616,113 +381,128 @@ func executePlan(ctx context.Context, plan Plan, cfg Config) error { g, gctx := errgroup.WithContext(ctx) for _, cb := range plan.ChunkBuilds { g.Go(func() error { - defer close(done[cb.Chunk]) // completion broadcast - slots <- struct{}{} // acquire a worker slot - defer func() { <-slots }() // release - return withRetries(gctx, cfg.MaxRetries, func() error { - return processChunk(cb.Chunk, cb.Artifacts, cfg) - }) + slots <- struct{}{} + defer func() { <-slots }() + if err := withRetries(gctx, cfg.MaxRetries, func() error { + return processChunk(cfg, cb.Chunk, cb.Artifacts) + }); err != nil { + return err // leave done[cb.Chunk] open; the error cancels gctx, freeing waiters + } + close(done[cb.Chunk]) // success: dependents may now read this chunk's .bin + return nil }) } for _, b := range plan.IndexBuilds { g.Go(func() 
error { - for c := b.Lo; c <= b.Hi; c++ { // wait on the in-coverage chunk builds — - if ch, ok := done[c]; ok { // derived, not stored; already-frozen - select { // inputs have no channel and no wait - case <-ch: - case <-gctx.Done(): + for c := b.Lo; c <= b.Hi; c++ { // wait on the in-coverage chunk builds + if ch, ok := done[c]; ok { + select { + case <-ch: // this chunk's .bin is frozen + case <-gctx.Done(): // a build failed (or cancel) — bail return gctx.Err() } } } - slots <- struct{}{} // index builds draw from the same pool + slots <- struct{}{} defer func() { <-slots }() return withRetries(gctx, cfg.MaxRetries, func() error { - return buildThenSweep(b, cfg) + return buildThenSweep(cfg, b) }) }) } return g.Wait() } + +// buildThenSweep runs an IndexBuild, then eagerly sweeps the keys its commit +// demoted (this window only), so freed disk returns without waiting for a run. +func buildThenSweep(cfg Config, b IndexBuild) error { + cat := cfg.Catalog + if err := buildTxhashIndex(b.Window, b.Lo, b.Hi, cat); err != nil { + return err + } + for _, key := range indexKeys(cat, b.Window) { // superseded coverage(s) + if key.State == Pruning { + sweepIndexKey(cat, key) + } + } + var demoted []ArtifactRef // terminal build: the window's .bin inputs + for c := windowFirstChunk(b.Window); c <= windowLastChunk(b.Window); c++ { + if cat.State(c, TxHashBin) == Pruning { + demoted = append(demoted, ArtifactRef{Chunk: c, Kind: TxHashBin}) + } + } + if len(demoted) > 0 { + sweepChunkArtifacts(cat, demoted) + } + return nil +} ``` -- **`cfg.Workers` is the only resource knob** (default `GOMAXPROCS`). The goroutines are structure, not resources: thousands may exist, parked either on the semaphore (queued tasks) or on done-channels (builds awaiting inputs), costing a few KB each; at most `Workers` tasks execute at any instant, drawn from all windows' eligible work mixed together. 
An index build fires the moment its own in-coverage chunk builds finish, without waiting on other windows. (The derived wait slightly over-approximates — a build also waits on an in-coverage chunk producing only `lfs`/`events` — which is harmless: waiting longer is always safe, and the case arises only in widening scenarios.) -- The executor runs each `IndexBuild` via `buildThenSweep` (defined with `resolve` above), which lands the commit batch (terminal for complete windows) and then runs the eager `"pruning"` sweep (rule 4). The sweep is window-local — this window's demoted inputs and superseded coverages, not a store-wide scan — so concurrent windows' sweeps touch disjoint keys, and `fsyncDir` on a bucket dir shared with another window's in-flight `.bin` writes is safe (a dir fsync with concurrent creates just makes more entries durable). -- Done-channels broadcast *completion*, not success: a chunk build that exhausts its retries still closes its channel (the `defer`), so a dependent index build can win the race against context cancellation and start — whereupon it fails `buildTxhashIndex`'s loud `.bin` precondition check before writing any key, landing on the same abort-and-restart path as the original failure. The precondition check is load-bearing here. -- A task that exhausts its retries aborts the daemon, per the [error policy](#lifecycle); restart re-resolves from durable keys, and completed work never repeats. -- **Single-process enforcement:** the meta store holds a kernel `flock` on a `LOCK` file; a second daemon opening the **same meta-store path** fails immediately, and the lock releases on any process exit (including `kill -9`). 
Because `[meta_store]`, each `[immutable_storage.*]` path, *and* `[streaming.hot_storage]` are independently configurable, the meta-store lock alone cannot stop two daemons with *different* meta stores from sharing an artifact tree or a hot-DB tree — the daemon therefore also takes a `flock` in **each configured storage root, including the hot-storage root**. The hot root matters most: its `hot/{chunk}` DBs are the only copy of recently-ingested ledgers, independently created/opened/deleted by ingestion and discard, so two daemons sharing it would corrupt or delete that sole copy even though the immutable roots are protected. +- **`cfg.Workers`** (default `GOMAXPROCS`) is the only resource knob: at most that many tasks run at once, drawn from all windows' eligible work. Goroutines are cheap structure — thousands may be parked on the semaphore or on done-channels. +- Done-channels signal *success*: a chunk build closes its channel only once its `.bin` is frozen, so an index build proceeds only when every input it needs exists. A chunk build that exhausts its retries leaves its channel open and returns an error, which cancels `gctx`; any dependent waiting on it unblocks through the `<-gctx.Done()` case and bails. A task that exhausts its retries aborts the daemon ([error policy](#lifecycle)); restart re-resolves from durable keys and completed work never repeats. --- ## Daemon flow -### Startup - -Startup runs in two steps — catch up, then serve: +After startup, the daemon runs two goroutines. **Hot-DB ingestion** pulls new ledgers from captive core into the per-chunk hot DBs as the network closes them, and hands each completed chunk to the lifecycle. (This is the live-network loop — distinct from startup backfill, which reads *old* ledgers into cold files.) 
The **lifecycle** is a background goroutine responsible for everything else, and it does two kinds of work: **freezing** complete chunks from hot storage into immutable cold files (rolling the tx-hash index forward as it goes), and **cleanup** — discarding hot DBs the cold files now serve, and pruning artifacts that are superseded or have fallen past the retention floor. The sections below cover startup, then each goroutine in turn. -1. **Catch up via backfill.** Bring on-disk coverage in line with the retention window. Each pass backfills up through the last complete chunk at the network tip, with one exclusion: when the watermark is **mid-chunk** and within one chunk of the tip, the partial resume chunk is left to ingestion — core replays its tail faster than a bulk refetch would gate serving, and a mid-chunk watermark can only have come from the live hot DB, so the data is local by construction. Every *complete* chunk is in range — including one whose hot DB holds it but whose artifacts aren't frozen yet (a boundary crash): `catchupSource`'s hot branch produces it locally, no refetch. (On a first-deployment frontfill the loop simply terminates because the range is empty.) The loop re-passes if new chunks appear at the tip while a pass is in flight. Catch-up brings **every** overlapping window's index to its desired coverage via the per-window rule — the trailing partial window included — so when it returns, every in-retention range at or below the last backfilled chunk is servable from durable artifacts. - -2. **Serve + ingest.** Open the resume chunk's hot DB, start captive core, start serving reads, run the lifecycle goroutine and the hot DB ingestion loop — whose first act, having opened its hot DB, is one lifecycle notification. 
**That first tick doubles as startup convergence**: it finishes whatever a crash left behind (every leftover is a key in a transient state — `"freezing"` attempts to delete, `"pruning"` demotions to finish) and removes downtime leftovers (hot DBs and artifacts now past the effective retention floor), concurrently with early serving. There is no startup-only cleanup action and no startup-only tick: the sweeps are key-driven, so the ordinary tick reaches everything without inspecting a single directory. - -No other preparation exists: the resume chunk's hot DB is simply reopened (steady-state restart) or created fresh, and the backfilled windows' tx hashes — the trailing window's included — are already queryable through the `.idx` files catch-up built. +### Startup -**Serve-readiness is established entirely by step 1 plus the resume chunk's hot DB.** Catch-up's postcondition covers every complete in-retention chunk from durable artifacts — boundary-crash leftovers included, produced locally through `catchupSource`'s hot branch — and the only chunk it ever skips is the *partial* resume chunk, whose data lives in the hot DB startup reopens before `serveReads()` (a mid-chunk watermark can only have come from that DB). Nothing gates serving on a cleanup pass, because crash debris and downtime leftovers are reader-invisible at *every* moment of operation — a read resolves only a `"ready"` hot DB or a `"frozen"` cold artifact — never a `"freezing"`/`"pruning"`/`"transient"` key, and the retention check masks past-floor files — so the first tick clears them concurrently with serving rather than ahead of it. The store reaches quiescence within that first tick — typically seconds after reads open, longer when it prunes a long-downtime backlog; from then on the [invariant audits](#correctness) carry their usual meaning. 
(The one nicety surrendered: a store so damaged that tick ops fail aborts seconds after joining the pool rather than just before — the restart loop is identical either way.) +Startup runs in two steps, both in `startStreaming` below: -Operational note — **peak disk after long downtime**: pruning runs only in the first tick's prune stage, *after* catch-up has materialized every newly-in-retention chunk, so a downtime approaching or exceeding the retention window transiently holds up to ~2× the retention footprint (the stale window plus its replacement). Size volumes accordingly, or prune stale ranges manually before restarting after very long downtime; a disk-full during catch-up otherwise aborts before the relieving prune can run, on every retry. +1. **Backfill** brings on-disk coverage in line with the retention window, up through the last *complete* chunk at the tip. The partial chunk still forming at the tip is left to hot-DB ingestion: its ledgers so far are already in the live hot DB (which serves them), and ingestion completes the chunk as new ledgers arrive. Backfill re-runs if the tip advances mid-pass, and when it returns, the whole in-retention history up to that point is on disk as frozen files — ready to serve. +2. **Serve + ingest** opens the resume chunk's hot DB, starts captive core, serving, the lifecycle goroutine, and the hot-DB ingestion loop. The lifecycle is seeded with the last complete chunk so its first run fires at once; that run finishes any crash/downtime leftovers concurrently with serving. Reads never wait for it, because a reader only ever resolves a `"ready"` hot DB or a `"frozen"` cold file — never a transient key. -`lastCommitted` is a *mutating* local that the catch-up loop advances as backfill makes progress; it determines `resumeLedger` and the hot DB ingestion start point. 
It is never written to the meta store, never shared — and not even carried into the ingestion loop, which needs no progress variable at all: each synced batch *is* the progress, re-derived from durable state at the next startup. +Operational note — **peak disk after long downtime**: pruning runs only in the first run's prune stage, *after* backfill has materialized every newly-in-retention chunk, so a downtime approaching or exceeding the retention window transiently holds up to ~2× the retention footprint (the stale window plus its replacement). Size volumes accordingly, or prune stale ranges manually before restarting after very long downtime; a disk-full during backfill otherwise aborts before the relieving prune can run, on every retry. -The retention floor itself is computed by: +The retention floor and resume point are computed by: ```go const ( GenesisLedger = 2 LedgersPerChunk = 10_000 - // MaxChunksPerTxhashIndex bounds chunks_per_txhash_index so the window's - // ledger span (cpi*LedgersPerChunk) fits a 4-byte index offset and the - // product can't overflow uint32; any real deployment stays far below it. + // MaxChunksPerTxhashIndex caps the window so its ledger span fits a 4-byte offset. MaxChunksPerTxhashIndex = 429_496 // floor(2^32 / LedgersPerChunk) ) -// effectiveRetentionFloor is the lower bound of the retention window, -// chunk-aligned: the first ledger of the lowest in-scope chunk. Combines the -// sliding retention floor (lastCompleteChunkAt(upperBound) - retentionChunks -// + 1, when retentionChunks > 0) with the fixed earliest-ledger floor. -// -// The upper-bound ledger is ingestion's progress at runtime; the catch-up -// loop passes max(sampled network tip, derived watermark). The max() guards -// a lagging bulk tip: anchored on the tip alone, the floor would regress -// below where pruning has already advanced, scheduling a spurious re-derive -// of a pruned range. 
When the tip leads (long downtime), the tip is simply -// the correct anchor — it places the floor where retention will sit once -// caught up, so backfill starts at the true floor instead of wastefully -// below it. On a true first start the watermark is absent and the tip alone -// anchors the floor. -func effectiveRetentionFloor(upperBound uint32, retentionChunks uint32, earliest uint32) uint32 { - sliding := uint32(GenesisLedger) +// retentionFloorChunk: the lowest chunk kept — retentionChunks back from +// lastChunk, never below earliest's chunk. +func retentionFloorChunk(lastChunk ChunkID, retentionChunks uint32, earliest uint32) ChunkID { + floor := chunkID(earliest) if retentionChunks > 0 { - slidingChunk := lastCompleteChunkAt(upperBound) - int64(retentionChunks) + 1 - sliding = chunkFirstLedger(max(slidingChunk, 0)) + floor = max(floor, lastChunk-ChunkID(retentionChunks)+1) } - return max(sliding, earliest) + return floor } -// lastCompleteChunkAt is the inverse of chunkLastLedger: the largest chunk -// whose last ledger is <= ledger. E.g., lastCompleteChunkAt(10_001) == 0 -// (chunk 0 spans ledgers 2..10_001). +// lastCompleteChunkAt: the largest chunk whose last ledger is <= ledger. func lastCompleteChunkAt(ledger uint32) int64 { - return (int64(ledger)-1)/LedgersPerChunk - 1 // cast before subtract: total over uint32 + return (int64(ledger)-1)/LedgersPerChunk - 1 +} + +// maxCommittedSeq returns the highest ledger committed to a hot DB; for a +// freshly opened, empty chunk-C DB it returns chunkFirstLedger(C) - 1 (the +// watermark just below the chunk), so the boundary-crash derivation is exact. +// +// lastCommittedLedger: the highest ledger in durable storage — the live hot DB's +// last, the highest frozen chunk's if it leads, or earliest-1 if neither exists. 
+func lastCommittedLedger(cat Catalog) uint32 { + base := cat.EarliestLedger() - 1 + cold := highestDurableChunk(cat) + hot := highestReadyHotChunk(cat) + switch { + case hot > cold: + db := openReadOnly(hot) + defer db.Close() + return max(base, maxCommittedSeq(db)) + case cold >= 0: + return max(base, chunkLastLedger(cold)) + default: + return base + } } -// networkTip samples the configured backend's network tip, hardened against the -// two ways it lies. It retries with bounded backoff (transient object-store -// unavailability) and rejects a tip below GenesisLedger as "not ready" (an -// empty / not-yet-synced backend), so an unready tip never reaches the chunk -// arithmetic where it would pin a garbage floor. The catch-up loop has a local -// substitute and degrades on error — it falls back to lastCommitted — but the -// two first-start consumers with no substitute (resolving "now", and validating -// a numeric floor against the tip before it is pinned immutably) fatal instead, -// so neither can commit an unverifiable layout. func networkTip(cfg Config) (uint32, error) { tip, err := withBackoff(func() (uint32, error) { return backendNetworkTip(cfg) }) if err != nil { @@ -737,71 +517,36 @@ func networkTip(cfg Config) (uint32, error) { ```go func startStreaming(ctx context.Context, cfg Config) error { - cat := openMetaStore(cfg) - cfg.Catalog = cat // catch-up's plumbing (resolve, runBackfill) reads it from cfg - validateConfig(cfg, cat) + cat := openCatalog(cfg) + cfg.Catalog = cat + validateConfig(cfg) - retentionChunks := cfg.RetentionChunks earliest := cat.EarliestLedger() + lastCommitted := lastCommittedLedger(cat) - // Derived, not read: highest frozen chunk end vs ready hot DBs' max - // committed seq, clamped by earliest - 1 (the frontfill floor). - lastCommitted := deriveWatermark(cat) - - // Step 1: catch up via backfill. 
The loop re-passes while new chunks - // appear at the tip; backfilledThrough guards against infinite re-passes - // when the tip stops moving (a fixed rangeEnd matching the previous - // iteration breaks the loop). Edge case: on a network younger than one - // chunk, rangeEnd = lastCompleteChunkAt(anchor) = -1, and the watermark - // sentinel reads as a chunk boundary (Geometry convention) so the - // mid-chunk branch below leaves rangeEnd at -1 — the rangeEnd < rangeStart - // guard then catches it cleanly. + // Step 1: backfill from the floor up to the last complete chunk at the tip, + // leaving the partial tip chunk to ingestion. Re-pass while the tip moves. backfilledThrough := int64(-1) for { tip, err := networkTip(cfg) if err != nil { if lastCommitted < earliest { - // First start (no committed progress) with no reachable backend: - // we can neither catch up nor serve a local history. Fail until a - // real tip is available — never start serving on empty/incomplete - // history. The supervisor restarts and networkTip retries. fatalf("network tip unavailable and no local history to serve: %v", err) } - // Restart with local progress: serve what's already materialized - // (the window below lastCommitted is complete, by catch-up-before- - // advance) and skip catch-up this pass; a later pass with a reachable - // backend resumes it. - tip = lastCommitted + tip = lastCommitted // backend down, but local data exists: serve it } - anchor := max(tip, lastCommitted) // guards a lagging bulk tip, in BOTH uses below - rangeStart := chunkID(effectiveRetentionFloor(anchor, retentionChunks, earliest)) - // Anchoring rangeEnd on the watermark too matters when the bulk tip - // lags: a complete watermark chunk must fall inside the range so the - // per-window rule folds it into its index before serving. 
The span - // beyond the bulk tip consists only of chunks that are already - // durable (production self-skips) or complete in a ready hot DB - // (produced locally via catchupSource's hot branch) — the bulk - // backend is never asked for them. + anchor := max(tip, lastCommitted) rangeEnd := lastCompleteChunkAt(anchor) - // The watermark sentinel (lastCommitted = earliest_ledger-1, e.g. 1 on - // a genesis fresh start) sits on a chunk boundary by construction — - // earliest_ledger is chunk-aligned — so chunkID maps it to its chunk - // (chunk -1 for the genesis sentinel, per the Geometry convention) and - // this reads false, never spuriously mid-chunk. - watermarkMidChunk := lastCommitted != chunkLastLedger(chunkID(lastCommitted)) - withinOneChunkOfTip := int64(tip)-int64(lastCommitted) < LedgersPerChunk - // ^ signed: a lagging bulk tip can sit BELOW the resume point - if withinOneChunkOfTip && watermarkMidChunk { - // The partial resume chunk is ingestion's: near the tip, core - // replays its tail faster than a bulk refetch would gate serving. - // Mid-chunk watermarks only ever come from the live hot DB, so - // the data is local by construction. - rangeEnd = chunkID(lastCommitted) - 1 + rangeStart := retentionFloorChunk(rangeEnd, cfg.RetentionChunks, earliest) + midChunk := lastCommitted != chunkLastLedger(chunkID(lastCommitted)) + nearTip := int64(tip)-int64(lastCommitted) < LedgersPerChunk + if nearTip && midChunk { + rangeEnd = chunkID(lastCommitted) - 1 // leave the partial resume chunk to ingestion } if rangeEnd < rangeStart || rangeEnd <= backfilledThrough { break } - if err := runBackfill(ctx, cfg, rangeStart, rangeEnd); err != nil { + if err := executePlan(ctx, cfg, resolve(cfg, rangeStart, rangeEnd)); err != nil { return err } lastCommitted = max(lastCommitted, chunkLastLedger(rangeEnd)) @@ -809,37 +554,35 @@ func startStreaming(ctx context.Context, cfg Config) error { } resumeLedger := lastCommitted + 1 - // Step 2: serve + ingest. 
The first tick — rung by ingestion's at-start - // notification — finishes anything a crash left half-done and prunes - // downtime leftovers, concurrently with early serving. - hotDB := openHotDBForChunk(cfg, cat, chunkID(resumeLedger)) + // Step 2: serve + ingest. Seed the lifecycle with the last complete chunk so + // its first run clears crash/downtime leftovers while serving is already live. + hotDB, err := openHotDBForChunk(cat, chunkID(resumeLedger)) + if err != nil { + return err + } core := startCaptiveCore(cfg, resumeLedger) - doorbell := make(chan struct{}, 1) - go lifecycleLoop(ctx, cfg, cat, doorbell) + lifecycleCh := make(chan ChunkID, lifecycleQueueDepth) + lifecycleCh <- lastCompleteChunkAt(resumeLedger - 1) // seed the first run + go lifecycleLoop(ctx, cfg, lifecycleCh) serveReads() - return runIngestionLoop(ctx, cfg, core, hotDB, cat, doorbell) + return runIngestionLoop(ctx, cat, core, hotDB, lifecycleCh, resumeLedger) } ``` -After `runBackfill` returns, every chunk in the backfilled range has `lfs` and `events` frozen, every overlapping window's index is at its desired coverage, and `txhash` keys are in one of three states: frozen (window still rolling), swept (finalized window whose terminal commit this pass landed — the batch demoted them, the eager sweep removed them), or `"pruning"` leftovers from a *pre-crash* terminal commit, which the resolver correctly skipped (desired ⊆ stored) and the first tick's prune phase sweeps; partially processed chunks were retried idempotently. The lowest chunk in the backfilled range is `chunkID(effectiveRetentionFloor(max(tip, lastCommitted), …))` — the same `max()` anchor the loop uses; if this falls mid-window, that window's finalized index is built with `lo` = that chunk (its terminal index key carrying that `lo` and `hi` = the window's last chunk). Streaming startup therefore doesn't re-validate contiguity or per-chunk flag-completeness — they follow from how backfill works. 
+`validateConfig` checks the config and, on the first start, resolves and pins `earliest_ledger` and `chunks_per_txhash_index`: ```go -func validateConfig(cfg Config, cat Catalog) { - // Stateless config validation (no pins touched yet). +func validateConfig(cfg Config) { + cat := cfg.Catalog if cfg.ChunksPerTxhashIndex == 0 || cfg.ChunksPerTxhashIndex > MaxChunksPerTxhashIndex { - fatalf("chunks_per_txhash_index must be in [1, %d] (it defines the index "+ - "layout, immutable once stored).", MaxChunksPerTxhashIndex) + fatalf("chunks_per_txhash_index must be in [1, %d]", MaxChunksPerTxhashIndex) } if cfg.Workers < 1 { - fatalf("workers must be > 0 (got %d) — a zero pool deadlocks executePlan.", cfg.Workers) + fatalf("workers must be > 0 (got %d)", cfg.Workers) } if cfg.MaxRetries < 0 { - fatalf("max_retries must be >= 0 (got %d).", cfg.MaxRetries) // 0 = run once, no retry + fatalf("max_retries must be >= 0 (got %d)", cfg.MaxRetries) } - // earliest_ledger must be "genesis", "now", or a chunk-aligned ledger >= - // genesis. Validating the full static form here keeps every later - // atoi(cfg.EarliestLedger) well-formed on both the restart and first-start - // paths (and out of chunkID's sub-genesis underflow domain). if cfg.EarliestLedger != "genesis" && cfg.EarliestLedger != "now" { n, err := parseUint32(cfg.EarliestLedger) if err != nil || n < GenesisLedger || n != chunkFirstLedger(chunkID(n)) { @@ -847,78 +590,50 @@ func validateConfig(cfg Config, cat Catalog) { "ledger >= %d; got %q.", GenesisLedger, cfg.EarliestLedger) } } - // The two layout pins (chunks_per_txhash_index, earliest_ledger) are - // committed together in one atomic batch on first start (below), so they - // exist all-or-nothing: BOTH present ⟹ a prior first start completed and the - // layout is immutable; otherwise startup never got past config validation, - // no artifacts exist, and re-validating + re-pinning is safe. 
+ + // Both pins are committed together on first start, so either both exist (a + // restart — the layout is immutable) or neither does (re-pinning is safe). cpiStored, cpiPinned := cat.Get("config:chunks_per_txhash_index") earliestStored, earliestPinned := cat.Get("config:earliest_ledger") - if cpiPinned && earliestPinned { - // Restart: the layout is committed — confirm nothing changed, write nothing. + if cpiPinned && earliestPinned { // restart: confirm nothing changed, write nothing if cpiStored != itoa(cfg.ChunksPerTxhashIndex) { - fatalf("chunks_per_txhash_index changed: stored=%s, config=%d", - cpiStored, cfg.ChunksPerTxhashIndex) + fatalf("chunks_per_txhash_index changed: stored=%s, config=%d", cpiStored, cfg.ChunksPerTxhashIndex) } - // earliest_ledger immutability. The backend tip is NOT re-sampled (it - // may lag below the pinned floor — the startup loop's - // max(tip, lastCommitted) handles that). A genesis/numeric value must - // equal the stored pin or startup aborts. "now" cannot be re-resolved - // without re-sampling, so on a restart it is a deliberate no-op meaning - // "keep the pinned floor" — a frontfill deployment leaves "now" in its - // config across restarts and must not abort. One consequence: editing an - // existing deployment FROM genesis/numeric TO "now" is silently kept at - // the pinned floor rather than aborting (the floor is immutable either - // way); to actually move it, wipe the data dir. - if cfg.EarliestLedger != "now" { + if cfg.EarliestLedger != "now" { // "now" on restart keeps the pinned floor want := uint32(GenesisLedger) if cfg.EarliestLedger != "genesis" { want = atoi(cfg.EarliestLedger) } if want != atoi(earliestStored) { - fatalf("earliest_ledger changed: stored=%s, config=%s. 
Wipe the data "+ - "directory to change earliest_ledger (or use the future "+ - "set-earliest-ledger admin command).", earliestStored, cfg.EarliestLedger) + fatalf("earliest_ledger changed: stored=%s, config=%s; wipe the data dir to change it.", + earliestStored, cfg.EarliestLedger) } } return } - // First start (or an incomplete prior start — no artifacts yet). Resolve - // earliest_ledger, then commit BOTH layout pins in one atomic synced batch. - // The network tip is required to resolve "now" and to validate a numeric - // floor against the network before pinning it — both forms therefore need a - // reachable, ready backend on first start. A genesis floor needs no tip: - // GenesisLedger is always a valid lower bound. networkTip rejects an unready - // (< genesis) or unreachable tip, so no path can pin a garbage or future floor. + // First start: resolve earliest_ledger, then pin both. "now" and a numeric + // floor each need a reachable backend — "now" to resolve, a numeric floor to + // reject one past the tip (it is pinned immutably, so it can't be checked later). var earliest uint32 switch cfg.EarliestLedger { case "genesis": earliest = GenesisLedger case "now": - tip, err := networkTip(cfg) // no local substitute for "now": must succeed + tip, err := networkTip(cfg) if err != nil { - fatalf("earliest_ledger=now needs a reachable, ready backend: %v", err) + fatalf("earliest_ledger=now needs a reachable backend: %v", err) } - earliest = chunkFirstLedger(chunkID(tip)) // <= tip, so never past the tip + earliest = chunkFirstLedger(chunkID(tip)) default: - earliest = atoi(cfg.EarliestLedger) // already form-validated: parse, >= genesis, aligned - // A numeric floor is pinned immutably below, so it must be validated - // against a real tip FIRST — the check is mandatory, not best-effort. 
- // Skipping it when the backend is down would let a floor AHEAD of the - // network become permanent: on a later pass the catch-up loop's - // max(tip, earliest-1) anchor collapses the backfill range to empty - // (earliest-1 >= tip), and the daemon would resume ingestion from a - // future ledger with the bad floor already pinned. Like "now", a numeric - // first-start floor therefore requires a reachable, ready backend. + earliest = atoi(cfg.EarliestLedger) tip, err := networkTip(cfg) if err != nil { - fatalf("first start with a numeric earliest_ledger needs a reachable, "+ - "ready backend to validate the floor against the network tip: %v", err) + fatalf("a numeric earliest_ledger needs a reachable backend to validate against the tip: %v", err) } if earliest > tip { - fatalf("earliest_ledger (%d) is past the current network tip (%d); reject.", earliest, tip) + fatalf("earliest_ledger (%d) is past the network tip (%d)", earliest, tip) } } batch := cat.NewBatch() @@ -926,339 +641,231 @@ func validateConfig(cfg Config, cat Catalog) { batch.Put("config:earliest_ledger", itoa(earliest)) batch.Commit() } - -func openHotDBForChunk(cfg Config, cat Catalog, chunk ChunkID) *HotDB { - // createChunkHotDB creates the instance with its column families: - // ledgers, txhash, and the events CFs (schema per the events doc). - return openHotDB(cat, hotChunkKey(chunk), hotChunkPath(chunk), createChunkHotDB) -} ``` ### Hot DB helpers -These functions implement the hot DB state machine. Both startup and the lifecycle loop use them. - -`openHotDB` opens a ready hot DB, recovers from a prior crash, or creates a fresh one: +`openHotDBForChunk` opens a chunk's hot DB — the existing one, or a fresh one after a crash or on first use: ```go -// openHotDB returns an open handle to the hot DB. If the key is "ready", -// opens the existing DB. Otherwise — "transient" from a crashed create or -// discard, or absent on first use — wipes any leftover dir and creates -// fresh. 
The caller owns the returned handle. -func openHotDB(cat Catalog, hotKey, path string, create func(string) *HotDB) *HotDB { +func openHotDBForChunk(cat Catalog, chunk ChunkID) (*HotDB, error) { + hotKey, path := hotChunkKey(chunk), hotChunkPath(chunk) if state, _ := cat.Get(hotKey); state == "ready" { - if !dirExists(path) { - // The key promises a DB the filesystem doesn't have — hot storage - // was lost out from under a surviving meta store (e.g. ephemeral - // NVMe died). Recreating empty would silently lose the chunk's - // ledgers, so refuse: the operator demotes the orphaned hot:chunk - // keys to "transient" (surgical recovery case 4) and restarts — the - // watermark (count-only-ready) then lands at the last frozen boundary - // automatically, and re-ingestion fills the gap forward. The fatal - // stays (rather than auto-healing) because a missing dir can also - // mean a mount misconfiguration, where auto-wiping state would be wrong. - fatalf("%s is \"ready\" but %s is missing — hot storage lost; "+ - "run surgical recovery (case 4).", hotKey, path) + db, err := openExistingRocksDB(path) + if err != nil { + return nil, fmt.Errorf("hot DB for chunk %d is ready but won't open: %w", chunk, err) } - return openExistingRocksDB(path) + return db, nil } - // "transient" or absent — wipe any leftover dir and recreate. + // transient or absent: wipe any leftover dir and create fresh. 
deleteDirIfExists(path) cat.Put(hotKey, "transient") - db := create(path) - fsyncDir(path) // dir + dirent durable BEFORE "ready" — else a power - fsyncParentDir(path) // crash fabricates the ready-without-dir fatal above + db := createChunkHotDB(path) + fsyncDir(path) // durable before the key flips to "ready" + fsyncParentDir(path) cat.Put(hotKey, "ready") - return db -} -``` - -`discardHotDBForChunk` retires a chunk's hot DB once every cold artifact derived from the chunk is durable (or the chunk has fallen past retention): - -```go -func discardHotDBForChunk(chunk ChunkID, cat Catalog) { - if !cat.Has(hotChunkKey(chunk)) { - return - } - cat.Put(hotChunkKey(chunk), "transient") - deleteDirIfExists(hotChunkPath(chunk)) - fsyncParentDir(hotChunkPath(chunk)) - cat.Delete(hotChunkKey(chunk)) -} -``` - -`HotLedgers` is the hot-DB reader `catchupSource` returns when its hot branch wins — a read-only view of the `ledgers` CF, opened and completeness-checked by `catchupSource` itself (the loss-vs-staleness rule in rule 2) before the wrapper is handed out: - -```go -type HotLedgers struct { - chunk ChunkID - store *RocksDB // opened + completeness-checked by catchupSource; closed when - // processChunk's pass ends — before the same tick's discard can rmdir the dir -} - -func (h *HotLedgers) GetLedger(seq uint32) LedgerCloseMeta { - return decompressLCM(h.store.GetCF("ledgers", beUint32(seq))) + return db, nil } ``` ### Hot DB Ingestion ```go -func runIngestionLoop(ctx context.Context, cfg Config, core *CaptiveCore, hotDB *HotDB, - cat Catalog, doorbell chan struct{}) error { +func runIngestionLoop(ctx context.Context, cat Catalog, core LedgerBackend, hotDB *HotDB, + lifecycleCh chan<- ChunkID, resumeLedger uint32) error { - notify := func() { // payload-free doorbell: non-blocking send, coalescing + // A full lifecycleCh means freeze has fallen lifecycleQueueDepth boundaries + // behind ingestion — fail loud. 
+ notify := func(complete ChunkID) { select { - case doorbell <- struct{}{}: + case lifecycleCh <- complete: default: + fatalf("lifecycle fell %d boundaries behind ingestion; investigate", lifecycleQueueDepth) } } - notify() // first act: the hot-chunk set just changed (the resume DB was opened) - for { - var lcm LedgerCloseMeta - select { - case <-ctx.Done(): - return nil // clean shutdown: the daemon was asked to stop - case l, ok := <-core.StreamLedgers(): - if !ok { - if ctx.Err() != nil { - return nil // stream closed *because* we're shutting down — clean - } - // Closed without a shutdown request — core crashed/exited. - // RESTARTABLE, not success: return an error so the process exits - // non-zero and the supervisor restarts it. Startup re-derives - // progress from durable state; the last synced batch is the - // watermark, so nothing is lost. - return fmt.Errorf("captive core stream closed unexpectedly") - } - lcm = l + for seq := resumeLedger; ; seq++ { + lcm, err := core.GetLedger(ctx, seq) // blocks until ledger seq is available + if err != nil { + return err } - // One atomic, synced WriteBatch across all CFs — a ledger is either - // fully in the hot DB or absent. The batch IS the durability boundary; - // the loop keeps no progress variable at all — progress is re-derived - // from durable state at the next startup. + // One atomic synced batch across all CFs, so a ledger is fully present or + // absent; it is the only per-ledger durability boundary. 
batch := hotDB.NewBatch() - putLedger(batch, lcm) // ledgers CF - putTxHashes(batch, lcm) // txhash CF - putEvents(batch, lcm) // events CFs + putLedger(batch, lcm) + putTxHashes(batch, lcm) + putEvents(batch, lcm) batch.Commit( /*sync=*/ true) - seq := lcm.LedgerSeq() - if seq == chunkLastLedger(chunkID(seq)) { // chunk boundary - // Close the write handle BEFORE creating the next chunk's hot - // key — the moment that key exists, a tick's derivation - // classifies this chunk as complete and may freeze and discard - // this hot DB, and no writer may hold it then. + if seq == chunkLastLedger(chunkID(seq)) { + // Close this chunk and open the next before notifying, so the lifecycle + // never races a live writer for the chunk it is about to freeze. hotDB.Close() - hotDB = openHotDBForChunk(cfg, cat, chunkID(seq)+1) - notify() + if hotDB, err = openHotDBForChunk(cat, chunkID(seq)+1); err != nil { + return err + } + notify(chunkID(seq)) } } } ``` -A batch error causes the loop to retry the entire ledger (the batch is all-or-nothing, so a retry can't double-apply). On repeated failure the daemon aborts; the next startup's derived watermark equals exactly what the last synced batch committed — there is no second durable write that could disagree with it — and ingestion resumes from the next seq. An *unexpected* close of the captive-core stream — core crash or exit, as opposed to a `ctx`-cancelled shutdown — is handled the same way: the loop returns an error so the process exits non-zero and the supervisor restarts it, resuming exactly where the last synced batch left off (a clean close would otherwise look like success and not restart). The close-before-open order at the boundary is load-bearing: the next chunk's hot key is what makes this chunk *visibly complete* to the lifecycle's derivation, so the write handle must already be released when that key appears — otherwise a tick still in flight from the *previous* notification could rmdir a dir whose writer is live. 
Readers hold their own independent read-only handles. - -The doorbell carries no payload, so its delivery semantics can be maximally sloppy: a non-blocking send on a size-1 buffered channel, coalescing freely. Nothing is lost because the notification carries no information to lose — eligibility derives entirely from durable state, and a tick triggered by one notification processes everything the catalog shows, however many boundaries contributed to it. The doorbell only answers "when should the lifecycle look", never "what should it see". +A `GetLedger` failure returns from the loop and exits the process; the next startup resumes from where the last synced batch left off, since the batch is all-or-nothing. A clean shutdown cancels `ctx` and returns the same way, distinguished from a crash at the daemon's top level. The completed chunk id is all ingestion sends the lifecycle — *how far to go*; what to build, discard, and prune the lifecycle reads from the catalog. ### Lifecycle -The lifecycle goroutine runs one **tick** per notification, in three stages: **plan-and-execute** (the same `resolve` + `executePlan` catch-up uses, from the retention floor up to `completeThrough` — this is where a just-closed chunk freezes, from its hot DB via `catchupSource`'s hot branch, and where the current window's index folds it in), then the **discard** scan (retire hot DBs the cold artifacts now fully serve), then the **prune** scan (sweep demoted and past-retention files). The retention floor plays two roles with *opposite safe directions*, and the design keeps them separate. As a **retention boundary** (the prune scan, the reader gate) it errs permissive: anchored on `completeThrough`, a floor that sits a little low keeps an extra chunk briefly, or admits a read that at worst lands on already-pruned data and returns not-found via the reader's missing-data-file rule — harmless either way. 
As a **production boundary** it would err dangerous: planning a build below existing storage means demanding chunks from the bulk source that nobody validated it can produce. So production below storage never consults the floor — the tick's plan range starts at the lowest chunk already materialized, and extending the *bottom* of storage (which is what retention widening means) is exclusively catch-up's job, the one path that runs `validateRangeProducible` before demanding anything. Ordering lives in two places, each natural to its half: freeze-before-build is a *plan dependency* (the window's `IndexBuild` waits on its in-coverage chunk builds' done-channels), and build-before-discard / demote-before-sweep is the *stage sequence* — the scans run after `executePlan` returns, so they see every commit the plan landed. Correctness never depends on any of it: every decision derives from durable keys, so work whose enabler hasn't landed is simply not scheduled, and the next tick picks it up. +The lifecycle is a background goroutine. Each notification — one per ingestion boundary, plus a startup seed — triggers one **run**, which does three stages in order: + +1. **Plan-and-execute** — `resolve` + `executePlan` over `[floor, last complete chunk]`, the same machinery backfill uses. In steady state this freezes the just-closed chunk from its hot DB and folds it into the current window's index; rebuilding the whole window each boundary costs ≈1 minute against a boundary that arrives only every ~14 h at mainnet rates. +2. **Discard** — retire hot DBs the cold artifacts now fully serve. +3. **Prune** — sweep demoted and past-retention files. + +At runtime the floor only rises (retention config is fixed for the life of the process; widening applies at the next startup), so `[floor, last complete chunk]` always sits within existing storage — a run produces only the just-closed chunk and never reaches below. 
Extending the *bottom* of storage — a fresh start, or filling to a widened floor — is startup backfill's job. -The one input the tick needs beyond the keys themselves is *how far ingestion has durably gotten* — which chunks are complete, and where the sliding retention floor anchors. That is `deriveCompleteThrough`, defined with the [derived-progress machinery](#meta-store-keys) in the data model; the tick derives it once, at tick start: +Everything the run does derives from the catalog plus the one chunk id ingestion hands it: ```go -func runLifecycleTick(ctx context.Context, cfg Config, cat Catalog) { - // One derivation per tick — all stages see the same snapshot, so a - // boundary committing mid-tick can't make one stage's view contradict - // another's; the new chunk is simply next tick's work. - through := deriveCompleteThrough(cat) - floor := effectiveRetentionFloor(through, cfg.RetentionChunks, cat.EarliestLedger()) - start := chunkID(floor) - if low, ok := lowestMaterializedChunk(cat); ok && low > start { - // floor is a retention boundary (pruning, read gating), where erring - // low is harmless. As a PRODUCTION boundary it would err dangerous: - // a below-storage build demands chunks from a bulk source nobody - // validated. So the tick's plan range starts at existing storage; - // extending the bottom is catch-up's job, behind validateRangeProducible. 
- start = low - } +func runLifecycle(ctx context.Context, cfg Config, lastChunk ChunkID) { + floor := retentionFloorChunk(lastChunk, cfg.RetentionChunks, cfg.Catalog.EarliestLedger()) - if err := executePlan(ctx, resolve(cfg, start, lastCompleteChunkAt(through)), cfg); err != nil { - fatalf("lifecycle tick: %v", err) // error policy: retries exhausted ⇒ abort; - // startup is the recovery path + if err := executePlan(ctx, cfg, resolve(cfg, floor, lastChunk)); err != nil { + fatalf("lifecycle run: %v", err) // abort; startup is the recovery path } - for _, op := range eligibleDiscardOps(cfg, cat, through) { + for _, op := range eligibleDiscardOps(cfg, lastChunk, floor) { op() } - for _, op := range eligiblePruneOps(cfg, cat, through) { + for _, op := range eligiblePruneOps(cfg, floor) { op() } - // Assertable postcondition: re-running resolve and both scans against - // this same `through` snapshot yields nothing — a tick finishes - // everything its snapshot showed. (A fresh derivation may legitimately - // see a boundary that landed mid-tick; that is the next tick's work, - // not a violation.) } -// lowestMaterializedChunk is one more derivation over the same keys: the -// lowest chunk holding any chunk:* artifact key or hot:chunk key; ok=false -// on an empty catalog (first frontfill tick — resolve's inverted-range -// guard makes that tick a no-op anyway). 
-func lowestMaterializedChunk(cat Catalog) (ChunkID, bool) +const lifecycleQueueDepth = 8 // far above the at-most-one a healthy daemon holds -func lifecycleLoop(ctx context.Context, cfg Config, cat Catalog, doorbell <-chan struct{}) { - for range doorbell { - runLifecycleTick(ctx, cfg, cat) +func lifecycleLoop(ctx context.Context, cfg Config, lifecycleCh <-chan ChunkID) { + for lastChunk := range lifecycleCh { + drain: // if several chunks queued, take the most recent — one run covers them + for { + select { + case lastChunk = <-lifecycleCh: + default: + break drain + } + } + runLifecycle(ctx, cfg, lastChunk) } } ``` -With this, the tick is a pure function of the catalog: the two goroutines share no state at all, and any process holding the meta store could run a tick and reach the same decisions. One narrow consequence of the positional term: a crash *between* closing chunk C's handle and creating chunk C+1's key leaves C as the highest hot chunk, so the derivation conservatively treats it as live until ingestion opens the resume chunk's DB. That is a latency wart, not a correctness one — C's hot DB keeps serving it — and it closes at the first tick, which ingestion rings the moment that DB is open (the reason ingestion notifies at start, not just at boundaries). - -The goroutine is event-driven, not polled. Notifications arrive from exactly one source — ingestion's hot-chunk-set changes: each boundary, plus the one at ingestion start, whose tick doubles as startup convergence. Between notifications the goroutine is idle — and idle means *quiescent*: a re-scan would produce no ops, so the [invariant audits](#correctness) are meaningful at any moment between ticks. - -**Error policy.** A failing op is retried with backoff a bounded number of times within the tick; on persistent failure the daemon aborts — the same policy as the ingestion loop. 
Aborting is safe because startup *is* the recovery path: catch-up plus the first tick re-derive or finish whatever the failure interrupted. No op failure is ever deferred to the next boundary (~14 h away) silently. - -#### Production (plan-and-execute) - -The tick's first stage is catch-up's machinery verbatim: `resolve` diffs `[floor, completeThrough]` against the catalog and `executePlan` runs the result. In steady state the plan is tiny — one `ChunkBuild` for the chunk that just closed (its artifacts produced from its hot DB, which `catchupSource`'s hot branch selects) and one `IndexBuild` folding it into the current window — and at quiescence the plan is empty. The hot DB is *not* touched by production: it keeps serving the chunk's tx lookups until the index covers it, and only the discard stage retires it. Nothing but a terminal `IndexBuild`'s commit ever finalizes a window. - -#### Discard +Between runs the goroutine is idle, and idle means **settled**: a re-scan would produce no ops and every storage invariant holds, so an [audit](#correctness) run at any such moment would pass. A failing op retries with backoff, then aborts the daemon — startup is the recovery path, the same policy as ingestion. -The discard scan walks `hot:chunk:*` keys. Per chunk: past retention → discard; complete, nothing pending, and the index covers it (cold artifacts fully serve it) → discard; otherwise (live, or frozen and awaiting coverage) → leave alone. `discardHotDBForChunk`'s coverage-gated branch is what retires hot DBs in steady state, and re-deriving it from durable keys makes it self-healing across a crash between build and discard. A past-retention discard leaves the chunk's artifact files to the prune stage on the same tick — they carry their own keys. 
- -#### Prune - -The prune scan is the system's only file-deleter, driven entirely by keys — one stage, both key families: - -- **`index:*` keys**: any key in a transient state is swept regardless of window — `"freezing"` (a crashed build attempt) means delete file and key, never salvage; `"pruning"` (a coverage demoted by a later build's commit batch, or by retention) means finish the removal. A `"frozen"` index key is swept only when its window has fallen wholly past the retention floor. -- **`chunk:*` keys**: chunks wholly past retention are swept whole (all artifacts, any state). Within retention, `chunk:c:txhash` keys reading `"pruning"` — demoted by their window's terminal commit batch — are swept batched. One more in-retention branch: a `"frozen"` *or* `"freezing"` `chunk:c:txhash` key inside a window whose frozen index key is terminal (re-derived — or left mid-write — by a widening catch-up that crashed before its rebuild, then abandoned when retention narrowed back) is provably redundant — the final `.idx` covers the chunk, and the resolver will never schedule re-materialization for a covered window — so it is swept here; this branch is what makes INV-2's no-leftover-txhash-keys clause self-healing rather than merely auditable. - -Every sweep, both families, runs the same mechanic: **demote to `"pruning"` if the key is still `"frozen"`** (never unlink under a frozen key), then unlink → `fsyncDir` → key delete, batched per family. The two prune walks never interact with each other — index sweeps touch only index keys, chunk sweeps only chunk keys — and their only cross-family *read* (is the window's frozen index key terminal?) concerns a key no sweep modifies. - -#### Eligibility - -Each `eligible*` function scans the meta store and returns a list of zero-arg callables — each one a closure over the op to run and its arguments. The lifecycle loop just calls them in order. +The discard and prune stages are the two `eligible*` scans below. 
**Discard** retires a chunk's hot DB once its cold artifacts fully serve it (the window's index covers the chunk), or once it falls past retention. **Prune** is the system's only file-deleter: it sweeps transient index keys, the `.bin` inputs a terminal commit demoted, and everything below the retention floor, through `sweepIndexKey`/`sweepChunkArtifacts`. Each scan returns zero-arg ops the run calls in order. ```go -func eligibleDiscardOps(cfg Config, cat Catalog, through uint32) []func() { - floor := effectiveRetentionFloor(through, cfg.RetentionChunks, cat.EarliestLedger()) +func eligibleDiscardOps(cfg Config, lastChunk, floor ChunkID) []func() { + cat := cfg.Catalog var ops []func() for _, chunk := range hotChunkKeys(cat) { switch { - case chunkLastLedger(chunk) < floor: // past retention OR below earliest_ledger - ops = append(ops, func() { discardHotDBForChunk(chunk, cat) }) - case chunkLastLedger(chunk) <= through && - pendingArtifacts(chunk, cfg, cat).Empty() && - indexCovers(chunk, cfg, cat): // cold artifacts fully serve it - ops = append(ops, func() { discardHotDBForChunk(chunk, cat) }) - // else: live, or frozen and awaiting coverage — leave alone. + case chunk < floor: + ops = append(ops, func() { discardHotDBForChunk(cat, chunk) }) + case chunk <= lastChunk && + pendingArtifacts(cfg, chunk).Empty() && + indexCovers(cfg, chunk): // cold artifacts fully serve it + ops = append(ops, func() { discardHotDBForChunk(cat, chunk) }) } } return ops } -// pendingArtifacts lists which processChunk outputs this chunk still needs. -// The per-chunk counterpart of catch-up's per-window rule: txhash/.bin is -// exempt when the window's index already covers the chunk — after -// finalization the chunk:c:txhash keys are legitimately demoted ("pruning") -// or swept away, and regenerating the .bin would orphan it. -func pendingArtifacts(chunk ChunkID, cfg Config, cat Catalog) ArtifactSet { +// pendingArtifacts lists which processChunk outputs the chunk still needs. 
The +// .bin is exempt once the window's index covers the chunk (the finalized window +// already demoted its key). +func pendingArtifacts(cfg Config, chunk ChunkID) ArtifactSet { + cat := cfg.Catalog var need ArtifactSet - for _, kind := range []Kind{LFS, Events} { + for _, kind := range []Kind{Ledgers, Events} { if cat.State(chunk, kind) != Frozen { need = need.Add(kind) } } - if cat.State(chunk, TxHashBin) != Frozen && !indexCovers(chunk, cfg, cat) { + if cat.State(chunk, TxHashBin) != Frozen && !indexCovers(cfg, chunk) { need = need.Add(TxHashBin) } return need } -// indexCovers reports whether the durable .idx for chunk's window already -// hashes that chunk. -func indexCovers(chunk ChunkID, cfg Config, cat Catalog) bool { - fk := frozenCoverage(cat, indexID(chunk)) // the unique "frozen" index key, or none +// indexCovers reports whether the window's durable .idx already hashes the chunk. +func indexCovers(cfg Config, chunk ChunkID) bool { + fk := frozenCoverage(cfg.Catalog, indexID(chunk)) return fk != nil && fk.Lo <= chunk && chunk <= fk.Hi } -func eligiblePruneOps(cfg Config, cat Catalog, through uint32) []func() { - floor := effectiveRetentionFloor(through, cfg.RetentionChunks, cat.EarliestLedger()) +func eligiblePruneOps(cfg Config, floor ChunkID) []func() { + cat := cfg.Catalog windowFloor := WindowID(-1) chunkFloor := ChunkID(-1) - if floor != GenesisLedger { - windowFloor = indexID(chunkID(floor)) - 1 - chunkFloor = lastCompleteChunkAt(floor - 1) + if floor > 0 { + windowFloor = indexID(floor) - 1 + chunkFloor = floor - 1 } var ops []func() - for _, key := range indexKeys(cat) { // index family + for _, key := range indexKeys(cat) { switch { - case key.State == Freezing || key.State == Pruning: - // Transient debris from any window — an abandoned attempt - // ("freezing": delete, never salvage — a retried coverage was - // re-marked and frozen before this scan ran) or an unfinished - // demotion ("pruning"). 
Safe to run only because no build is in - // flight when this scan runs: the prune stage follows - // executePlan's return within the tick, and catch-up finishes - // before the lifecycle goroutine starts. - ops = append(ops, func() { sweepIndexKey(key, cat) }) - case key.Window <= windowFloor: - // A frozen index key wholly below the floor; the sweep demotes - // it first — never unlink under a "frozen" key. - ops = append(ops, func() { sweepIndexKey(key, cat) }) + case key.State == Freezing || key.State == Pruning: // transient debris + ops = append(ops, func() { sweepIndexKey(cat, key) }) + case key.Window <= windowFloor: // frozen, wholly below the floor + ops = append(ops, func() { sweepIndexKey(cat, key) }) } } - var refs []ArtifactRef // chunk family, swept in one batch - for _, ref := range chunkArtifactKeys(cat) { // (chunk, kind) per key + var refs []ArtifactRef + for _, ref := range chunkArtifactKeys(cat) { switch { - case ref.Chunk <= chunkFloor: // wholly past retention: any state goes + case ref.Chunk <= chunkFloor: // wholly past retention refs = append(refs, ref) case cat.State(ref.Chunk, ref.Kind) == Pruning: - // In-retention .bin demoted by its window's terminal commit batch. refs = append(refs, ref) - case ref.Kind == TxHashBin: // "frozen" OR "freezing" inside a finalized window + case ref.Kind == TxHashBin: // redundant .bin in a finalized window if fk := frozenCoverage(cat, indexID(ref.Chunk)); fk != nil && fk.Hi == windowLastChunk(indexID(ref.Chunk)) { - // Redundant input: re-derived (or left mid-write) by a - // widening catch-up that crashed before its terminal rebuild, - // then abandoned. The terminal .idx provably covers the chunk - // and the resolver never re-materializes a covered window. 
refs = append(refs, ref) } } } if len(refs) > 0 { - ops = append(ops, func() { sweepChunkArtifacts(refs, cfg, cat) }) + ops = append(ops, func() { sweepChunkArtifacts(cat, refs) }) } return ops } ``` -Hot DBs that outlived their retention window because of long downtime are removed by the discard stage; their files, if any, carry their flag keys and are picked up by the prune stage in the same tick. - -#### Op bodies +The op bodies — one discard, two sweeps — are the daemon's entire directory- and file-deletion surface: ```go -func sweepChunkArtifacts(refs []ArtifactRef, cfg Config, cat Catalog) { - batch := cat.NewBatch() // demote first — never unlink under a "frozen" key +func discardHotDBForChunk(cat Catalog, chunk ChunkID) { + if !cat.Has(hotChunkKey(chunk)) { + return + } + cat.Put(hotChunkKey(chunk), "transient") + deleteDirIfExists(hotChunkPath(chunk)) + fsyncParentDir(hotChunkPath(chunk)) + cat.Delete(hotChunkKey(chunk)) +} + +func sweepChunkArtifacts(cat Catalog, refs []ArtifactRef) { + batch := cat.NewBatch() // demote before the unlink for _, ref := range refs { - if cat.State(ref.Chunk, ref.Kind) != Pruning { - batch.Put(chunkKey(ref.Chunk, ref.Kind), "pruning") - } + batch.Put(chunkKey(ref.Chunk, ref.Kind), "pruning") } batch.Commit() - var paths []string // unlink (idempotent on already-gone paths) + var paths []string for _, ref := range refs { deleteArtifactFiles(ref.Chunk, ref.Kind) paths = append(paths, artifactPaths(ref.Chunk, ref.Kind)...) 
} - fsyncParentDirs(paths) // unlinks durable BEFORE the keys go + fsyncParentDirs(paths) // unlinks durable before the keys go batch = cat.NewBatch() for _, ref := range refs { @@ -1267,76 +874,40 @@ func sweepChunkArtifacts(refs []ArtifactRef, cfg Config, cat Catalog) { batch.Commit() } -func sweepIndexKey(key IndexKey, cat Catalog) { - if key.State == Frozen { - cat.Put(key, "pruning") // never unlink under a "frozen" key — a crash - // mid-sweep must not leave a frozen key fileless - } - // "freezing" (crashed attempt — never salvage) and "pruning" (superseded - // or retention-demoted) take the same path from here; the key outlives - // the durable unlink, so a crash anywhere re-runs the sweep. - deleteFileIfExists(indexFilePath(key)) // filename derived from the key name +func sweepIndexKey(cat Catalog, key IndexKey) { + cat.Put(key, "pruning") // demote before the unlink (synced → durable first) + deleteFileIfExists(indexFilePath(key)) fsyncDir(indexWindowDir(key)) - cat.Delete(key) - rmdirIfEmpty(indexWindowDir(key)) // best-effort tidiness; an empty dir is - // not an artifact + cat.Delete(key) // key outlives the unlink, so a crash re-runs the sweep + rmdirIfEmpty(indexWindowDir(key)) } ``` -The discard stage has no separate op body — `discardHotDBForChunk` is called directly from the eligibility closures above. These two sweeps are the *entire* deletion surface of the system: one body per key family, identical internal shape (demote if frozen → unlink → `fsyncDir` → key delete). - -The prune walk's two families are independent of each other and of discard: a chunk swept while its containing window's `.idx` is still around could leave a `getTransaction` query resolving to a missing `.pack`, but the [reader retention contract](#reader-retention-contract) handles that — past-retention seqs return not-found regardless. Discard touches only hot DBs, which the prune walk's flag-key iteration can't see. 
+`discardHotDBForChunk` removes a hot DB directory under its `hot:chunk` key; the two `sweep*` functions are the entire file-deletion surface, one body per key family. The prune walk's two families are independent of each other and of discard — a chunk swept while its window's `.idx` still resolves to it could leave a `getTransaction` pointing at a deleted `.pack`, but a below-floor read is not-found regardless ([reader contract](#reader-contract)). ### Concurrency model -Two writers; readers only read. The ingestion loop is one goroutine; the lifecycle is one goroutine whose tick's plan stage fans work out to the executor's bounded worker pool — every worker operating strictly below the live chunk, so the pool inherits the lifecycle's side of the partition. Their domains partition at the live chunk: +Two writer goroutines and read-only readers. The catalog partitions their domains at the **live chunk** — the highest chunk with a `hot:chunk` key: -- **The ingestion loop owns the live chunk** — the highest chunk with a `hot:chunk:*` key. It is the only writer of that chunk's hot DB and the creator of each chunk's `hot:chunk:{chunk}` key (via `openHotDBForChunk` at the boundary). -- **The lifecycle goroutine owns everything below the live chunk** — handed-off hot DBs (freeze + discard), all `chunk:*` and `index:*` artifact keys, and the deletion side of `hot:chunk:*` keys. +- **Ingestion** owns the live chunk: the sole writer of its hot DB, and the creator of each `hot:chunk` key (via `openHotDBForChunk` at the boundary). +- **The lifecycle** owns everything below it: handed-off hot DBs (freeze + discard), all `chunk:*` and `index:*` keys, and the deletion side of `hot:chunk` keys. -**The two goroutines share no state.** Their only connection is the payload-free doorbell, and the partition itself is encoded in the catalog: the lifecycle treats the highest `hot:chunk` key — *any* value — as the live chunk and touches only what lies below it. 
(This ownership boundary is value-blind: any `hot:chunk` key marks an owned chunk. Only the *watermark* derivation counts `"ready"` keys exclusively — a distinct concern, [defined earlier](#meta-store-keys).) The handoff fence is the boundary's write order — the ingestion loop closes its write handle *before* creating the next chunk's hot key. Creating that key is the act that moves the partition: the instant it exists, the closed chunk lies below the live chunk and any lifecycle scan (including one already in flight from the previous notification) may freeze and discard it — by which point no writer holds it. The two goroutines never write the same meta-store key, and never touch the same per-chunk hot RocksDB instance; both do write the meta store concurrently — on disjoint keys, relying on RocksDB's thread safety for the instance itself. The derivation is monotonic within the run (hot keys and frozen keys only advance), so a tick racing a boundary only under-approximates eligibility — work deferred to the next tick, never incorrect work. Readers hold their own read-only handles and resolve files through meta-store keys, so writer-side activity never races them. (The serving side will also need a notion of current progress — the [reader retention contract](#reader-retention-contract) bounds every read by the retention window — but how readers obtain it is the query-routing design's concern, not this doc's.) +The two share no memory; their only link is the channel. The handoff is by write ordering — ingestion closes the chunk and opens the next (moving the partition) *before* sending it — so the lifecycle never freezes a chunk a writer still holds. Both write the catalog at the same time but never the same key (RocksDB handles concurrent writes safely). And because the chunk ids ingestion hands over only increase, a chunk completing while a lifecycle run is already in progress just bumps the starting point of the *next* run — it can't disturb the one underway. 
Readers hold their own read-only handles and resolve files through keys, so writer activity never races them. -### One boundary, end to end +**Single-process enforcement.** All of the above assumes a *single* daemon owns the data; two daemons sharing it would corrupt it. The daemon enforces that at startup by taking a kernel file lock (`flock`) on a `LOCK` file in **each** of its roots — the catalog and every configured storage tree. A second daemon pointed at any of those paths can't acquire the lock and exits; the lock releases on any exit, including `kill -9`, so it never goes stale. It has to lock every root, not just the catalog, because the catalog and the storage trees are configured as independent paths — otherwise two daemons with different catalogs could still share a storage tree. The hot tree matters most: its `hot/{chunk}` DBs are the only copy of recently-ingested ledgers that aren't frozen yet. -Ledger 53,510,001 closes chunk 5350 (window 5, floor at chunk 5100, frozen index covering chunks 5100–5349): +--- -``` -ingestion batch for seq 53_510_001 commits (one fsync) - hotDB.Close() ← chunk 5350 handed off - open chunk 5351's hot DB (hot:chunk:00005351 = "ready") - ← chunk 5350 now visibly complete: - it sits below the highest hot key - notify ──────────────────────────────────────────────────┐ -lifecycle deriveCompleteThrough → 53_510_001 (positional term) ▼ - plan-and-execute: - resolve → Plan{ChunkBuild 5350, IndexBuild w5 [5100,5350]} - ChunkBuild 5350: catchupSource picks the hot DB (ready, - complete) → .pack, events segment, .bin all "frozen" - (the hot DB itself stays) - IndexBuild w5 (waited on 5350's done-channel): - put index:00000005:00005100:00005350 = "freezing" - merge .bin[5100..5350] → write 00005100-00005350.idx - → fsync → commit batch {[5100,5350] → "frozen", - [5100,5349] → "pruning"} - → eager sweep: unlink 00005100-00005349.idx → fsyncDir - → delete key - discard stage: index covers 5350 → discard chunk 5350's hot DB - prune 
stage: nothing left (the eager sweep already ran; floor - pinned at 5100 by earliest_ledger, so it doesn't - slide this tick) -reads tx in 5350: hot DB's txhash CF until the discard, .idx after - tx in 5351: hot DB of the new live chunk -``` +## Reader contract -Every arrow is the one write protocol or its exit sweep; at the end of the tick a re-plan and re-scan find nothing to do. +A read resolves data through two rules, and the rest of the design relies on both: ---- +1. **Only `"ready"` and `"frozen"` are visible.** A read resolves a chunk only from a `"ready"` hot DB or a `"frozen"` cold file — never from a key in a transient state (`"freezing"`, `"pruning"`, `"transient"`). So a reader never sees a half-written file, crash debris, or an in-progress sweep; transient keys are invisible to it. +2. **Below the floor is *not found*.** A read for any seq below the retention floor returns not-found, whether or not the file still exists on disk. This is what lets pruning delete a chunk the instant it passes retention: a stale `.idx` might resolve a tx-hash to a `.pack` that's been unlinked, but the below-floor read is not-found anyway. -## Reader retention contract +Together they make retention the single source of truth for "is this data available?": the freeze, sweep, and prune stages constantly create transient states and delete below-floor data, and these rules guarantee a read never *resolves* either. (Whether a read already in flight survives a concurrent unlink is a separate question — see below.) -A read for any seq below `effectiveRetentionFloor` returns *not found*, regardless of whether the underlying file still exists on disk. This is the contract that lets pruning remove chunks the moment they pass retention **without coordinating with the index lifecycle**: a stale `.idx` may resolve a tx-hash to a `.pack` that's been deleted, but a below-floor read is not-found regardless. 
From the storage layer's perspective, retention is the single source of truth for "is this data available?", and it is all the prune and sweep stages rely on. - -How the reader actually dispatches between hot DBs and frozen `.idx` files, and how it stays correct while sweeps and pruning unlink files concurrently with in-flight reads (tier dispatch, coverage re-resolution, the file-vanishes-mid-read cases), is the **query-routing design's** concern — out of scope here and in the transactions design (§8.4). +How a read is actually served — choosing the hot DB or the cold files for a given query, reading across the cold artifact types (`.pack` ledgers, events segments, `.idx` index), and staying correct when a sweep or prune unlinks a file while a read is mid-flight — is the **query-routing design's** concern, out of scope here and in the transactions design (§8). --- @@ -1346,78 +917,90 @@ This section states what the streaming workflow guarantees, the assumptions it r ### Invariants -The **retention window** is `[effectiveRetentionFloor, last_committed_ledger]`. The floor serves *retention* consumers only — pruning and the reader gate, where erring low is safe; it is never a production boundary (the plan ranges that produce data start at existing storage in the tick, and at a validated floor in catch-up). A future floor consumer picks its side by the err-direction test: if that consumer erring low would be dangerous, it is a production consumer and belongs behind catch-up's validation. **Quiescence** means the tick's plan is empty and both scans produce empty op lists. +Two terms recur below. The **retention window** runs from the retention floor up to the last committed ledger; the reader gate and the prune scan both use the floor (rounding it a little low is harmless). The floor is also the bottom of the production range for both backfill and the lifecycle run, and at runtime it only rises — so a run never reaches below what's already on disk. 
The daemon is **settled** when a run's plan is empty and its discard and prune scans produce no ops: the state between runs, where the invariants below are meant to hold. + +**INV-1 (read correctness).** Any data request whose ledger scope falls entirely within the retention window returns correct results: the content matches what a conformant LedgerBackend would produce, no partial state is visible, and no in-retention range is unreachable. + +There is one transient exception. When surgical recovery demotes hot data down to the live chunk (scenario 3), the last committed ledger rewinds and the floor — anchored on the last complete chunk — regresses with it. For the few minutes until re-ingestion advances it again, the bottom of the window includes a handful of chunks already pruned under the old floor. Reads there fail soft — not-found, never wrong data, since files are write-once and pruning only unlinks — and the gap closes as the floor climbs back. + +**INV-2 (single canonical state).** The catalog records exactly one home for each data range. What it guarantees: -**INV-1 (read correctness).** Any data request whose ledger scope falls entirely within the retention window returns correct results — content matches what a conformant LedgerBackend would produce, no partial state is visible, no in-retention range is unreachable. One transient exception mirrors INV-2's: after hot-volume loss, the floor (anchored on `completeThrough`) regresses with the lost completeness, so for the minutes until re-ingestion re-advances it, the window's *bottom* admits a few chunks that were already pruned under the pre-loss floor — those reads fail soft via the reader's missing-data-file rule (not-found, never wrong data: artifacts are write-once and pruning only unlinks) and the gap closes as the floor re-advances. +- **One frozen index per window, at all times** (settled or not). 
The commit batch promotes the new coverage and demotes the old one in a single write, so "the window's index" is always well-defined for readers — never two frozen keys, never none once the window has one. +- **No transient artifact key survives a settled state.** Between runs, no `chunk:*` or `index:*` key is `"freezing"` or `"pruning"`. Each kind of transient has cleared: index transients by the run that observed them; per-chunk `"freezing"` keys by re-materialization (the plan stage rebuilds them, for chunks in `[floor, last complete chunk]`, from whatever source `backfillSource` picks); and `"pruning"` keys by the sweeps. +- **No leftover hot DB for a fully-cold chunk** (when settled). No `hot:chunk:c` exists for a chunk `c` whose artifacts are all durable *and* whose window's index covers `c` — that chunk is served entirely from cold files, so its hot DB must be gone. +- **No leftover `.bin` key in a finalized window** (when settled). No `chunk:c:txhash` exists for a chunk in a window whose frozen index is terminal: the terminal commit demotes the merged inputs `[lo, hi]` and the sweep removes them, chunks below the floor are cleared by retention pruning, and the prune scan's redundant-input branch catches any that a crashed widening re-froze. -**INV-2 (single canonical state).** The meta-store records one home for each data range: -- **at most one `"frozen"` index key per window — at all times**, quiescent or not (the commit batch promotes and demotes in one write; this is what makes "the window's index" well-defined for readers); -- at quiescence, no artifact key anywhere is `"freezing"` or `"pruning"` — index transients are swept by the tick that observes them; per-chunk `"freezing"` keys are repaired by re-materialization (the plan stage, for chunks within `[floor, completeThrough]`, from whichever source `catchupSource` selects) and `"pruning"` keys are finished by the sweeps. 
One reachable exception: after hot-volume loss, a partially-frozen chunk *above* the derived watermark can hold `"freezing"` keys at served quiescence — it lies above `completeThrough` (outside every plan range and the retention window, so no read can observe it) until re-ingestion replays it forward from the last frozen boundary and re-freezes it, minutes later; -- hot DB keys add one tolerated in-flight transient: `"transient"` brackets a directory operation in progress (the boundary's `openHotDBForChunk`, startup's resume-chunk open, a discard mid-op) and can be observed while the lifecycle sits idle between ticks; a crash-left bracket is finished by the next `openHotDB` or discard scan; -- at quiescence, no `hot:chunk:c` key for a chunk `c` whose artifacts are all durable *and* whose window's index covers `c` (the chunk is fully served by cold artifacts, so the hot DB must be gone); -- at quiescence, no `chunk:c:txhash` key for a chunk `c` in a window whose frozen index key is terminal (the terminal commit demoted them; the sweep removed them; the prune scan's redundant-input branch demotes any that a crashed widening re-froze or left mid-freeze). +Two transient states are tolerated even at a settled moment: -**INV-3 (disk matches meta-store).** At quiescence, the set of artifact files and hot DB directories on disk equals exactly the set the meta-store specifies. Every key names exactly one expected path, and the mark-before-write rule keeps even a partial file reachable from its key — so the correspondence holds whether a key is in a final state or in one of the transients INV-2 tolerates (the hot-key `"transient"` bracket around an in-flight directory operation; the above-watermark `"freezing"` artifact key left by hot-volume loss with a lagging tip). The disk holds those paths and no others — no orphan files, no dangling keys, no duplicate artifacts: a non-key-named file in an index window dir is a real bug, not mid-tick debris. 
+- **A hot DB's `"transient"` bracket** around an in-flight directory operation (the boundary's `openHotDBForChunk`, startup's resume-chunk open, a discard mid-op). A crash-left bracket is finished by the next `openHotDBForChunk` or discard scan. +- **After a hot-data recovery, a partially-frozen chunk above the last committed ledger** may hold `"freezing"` keys while serving and settled. It sits above the last complete chunk — outside every plan range and the retention window, so no read can observe it — until re-ingestion replays it forward from the last frozen boundary and re-freezes it, minutes later. -**INV-4 (retention bound).** At quiescence, no file or meta-store key maps to a ledger range strictly below the effective retention floor. +**INV-3 (disk matches catalog).** When settled, the files and hot-DB directories on disk are exactly the set the catalog names — no more, no less. Every key maps to one expected path, and because a key is written before its file (mark-before-write), even a partial file is reachable from its key. So the match holds whether a key is in a final state or in one of the transients INV-2 tolerates. No orphan files, no dangling keys, no duplicates: a file that no catalog key names is a real bug, not mid-run debris. -Each invariant has a distinct audit. INV-1 you check by issuing reads or by re-deriving artifacts and byte-comparing. INV-2 you check by walking meta-store keys and cross-checking forbidden co-existence. INV-3 you check by walking the filesystem against the meta-store. INV-4 you check by walking meta-store keys against the floor. None of the invariants reference the phase scans that maintain them — so a bug in any scan shows up as a real invariant violation, not as something the buggy code silently considers acceptable. 
Quiescence between ticks makes these walks meaningful on a live daemon, so an `audit` admin command can implement them directly (with an optional deep mode that re-derives sampled artifacts via a conformant LedgerBackend and byte-compares, for INV-1). +**INV-4 (retention bound).** When settled, no file or catalog key maps to a ledger range strictly below the effective retention floor — with one exception: a frozen index key whose window straddles the floor keeps the `lo` it was built with, so its coverage `[lo, hi]` reaches below the floor. That below-floor portion is never served ([reader contract](#reader-contract) rule 2 returns not-found), and the key and its `.idx` are swept once the whole window falls below the floor. + +Each invariant has a distinct audit. INV-1 you check by issuing reads or by re-deriving artifacts and byte-comparing. INV-2 you check by walking catalog keys and cross-checking forbidden co-existence. INV-3 you check by walking the filesystem against the catalog. INV-4 you check by walking catalog keys against the floor. None of the invariants reference the phase scans that maintain them — so a bug in any scan shows up as a real invariant violation, not as something the buggy code silently considers acceptable. A settled state between runs makes these walks meaningful on a live daemon, so an `audit` admin command can implement them directly (with an optional deep mode that re-derives sampled artifacts via a conformant LedgerBackend and byte-compares, for INV-1). ### Convergence -From any storage state — partial-completion crashes, the state left after an operator action, the state left after surgical recovery — **startup** (the catch-up pass, then the first lifecycle tick, rung at ingestion start) drives the system to a quiescent state satisfying INV-1 ∧ INV-2 ∧ INV-3 ∧ INV-4 within the first tick (typically seconds after serving opens; bounded by that tick's freeze, rebuild, and prune workload). 
From any state reachable *during* a run, the lifecycle tick alone does, within a bounded number of ticks — and since runtime op failure aborts the daemon, every state a run can leave behind is one startup is built to converge. +**Startup converges from any on-disk state.** Whatever a partial-completion crash, an operator action, or surgical recovery leaves behind, startup drives the system to a settled state satisfying INV-1 ∧ INV-2 ∧ INV-3 ∧ INV-4. Startup here is the backfill pass followed by the first lifecycle run (fired by the startup seed), and it reaches a settled state within that first run — typically seconds after serving opens, bounded by the run's freeze, rebuild, and prune workload. From any state reachable *during* a run, the lifecycle run alone converges, within a bounded number of runs. And since a runtime op failure aborts the daemon, every state a run can leave behind is one startup is built to converge. -The split matters because some repairs are inherently catch-up's, not the tick's: a per-chunk `"freezing"` key with no hot DB behind it (a crashed catch-up write) is repaired by re-materialization, and a surgically removed range is re-derived from the LedgerBackend — no tick phase produces data. The tick's province is everything else: index transients, demotions, freezes from live hot DBs, prunes. +The split matters because some repairs are inherently backfill's, not the run's: a per-chunk `"freezing"` key with no hot DB behind it (a crashed backfill write) is repaired by re-materialization, and a surgically removed range is re-derived from the LedgerBackend — no run phase produces data. The run's province is everything else: index transients, demotions, freezes from live hot DBs, prunes. -Convergence rests on three properties shared by the resolver and the scans — eligibility is computed from durable meta-store state alone; ops are idempotent; everything is re-derived on every notification — plus catch-up's postcondition contract. 
Together, whatever a crash leaves half-done, the next tick or the next startup finishes. +Convergence rests on three properties shared by the resolver and the scans — eligibility is computed from durable catalog state alone; ops are idempotent; everything is re-derived on every notification — plus backfill's postcondition contract. Together, whatever a crash leaves half-done, the next run or the next startup finishes. ### Substrate assumptions Properties we rely on the underlying storage to provide: -- **Sync WAL.** All meta-store puts and deletes that the invariants depend on use RocksDB's `WriteOptions.sync = true`, which fsyncs the WAL before the write returns. Multi-key commits — the index commit batch, the sweeps' key-delete batches — are single atomic synced WriteBatches: all-or-nothing across keys. -- **Per-ledger durability.** The chunk hot DB's synced WriteBatch (atomic across all CFs) is the sole per-ledger durability boundary; the watermark is derived from it, so no cross-store ordering exists to maintain. Per-artifact: the per-chunk file **and its directory entry** are fsynced before its key flips to `"frozen"`, and an index coverage's `.idx` (and its dir entry) is fsynced before the commit batch freezes its key. +- **Sync WAL.** All catalog puts and deletes that the invariants depend on use RocksDB's `WriteOptions.sync = true`, which fsyncs the WAL before the write returns. Multi-key commits — the index commit batch, the sweeps' key-delete batches — are single atomic synced WriteBatches: all-or-nothing across keys. +- **Per-ledger durability.** The chunk hot DB's synced WriteBatch (atomic across all CFs) is the sole per-ledger durability boundary; the last committed ledger is derived from it. Per-artifact: the per-chunk file **and its directory entry** are fsynced before its key flips to `"frozen"`, and an index coverage's `.idx` (and its dir entry) is fsynced before the commit batch freezes its key. 
- **Deterministic, idempotent writes.** Re-applying any write produces byte-identical state. Backed by deterministic LCM bytes from any conformant LedgerBackend and a byte-identical streamhash index from byte-identical sorted inputs. -- **Monotonic progress.** Within a process run, ingestion only moves forward (each synced batch extends the last), and the lifecycle's derived `completeThrough` only advances with it (hot keys and frozen keys move forward, never back). Across a crash, the startup derivation equals exactly the durable state — the pre-crash value or marginally above it (a batch that committed in the instant before the crash); it sits *below* the pre-crash value only when hot state was lost or demoted to `"transient"`, or when — on a daemon interrupted during its first backfill, before any live ingestion — recovery demotes a finished window's index for rebuild: with no hot DBs to anchor the watermark, it drops below that whole window until catch-up rebuilds the index, re-deriving the untainted chunks' inputs from their on-disk `.pack`s and re-fetching only the tainted chunks. There is no stored watermark to rewind; surgical recovery shrinks the derivation's inputs by demoting state, not by editing a counter. +- **Monotonic progress.** Within a process run, ingestion only moves forward: each synced batch extends the last, and the last-complete-chunk it hands the lifecycle climbs with it (strictly increasing chunk ids). Across a crash, the startup derivation equals exactly the durable state — the pre-crash value, or a hair above it (a batch that committed in the instant before the crash). It lands *below* the pre-crash value in only two cases: hot state was lost or demoted to `"transient"`, or recovery demoted a finished window's index for rebuild on a daemon interrupted during its first backfill (before any live ingestion). 
In that second case there are no hot DBs to anchor the last committed ledger, so it drops below that whole window until backfill rebuilds the index — re-deriving the untainted chunks from their on-disk `.pack`s and re-fetching only the tainted ones. Surgical recovery, in general, shrinks the derivation's inputs by demoting state. ### Design invariants These are streaming-specific properties the implementation guarantees on top of the substrate, and that INV-1 through INV-4 depend on: -- **Every key precedes its file.** The pre-write `"freezing"` mark and post-fsync `"frozen"` flip mean any file on disk — per-chunk artifact or index file, partial or complete — has its meta-store key set. Every scan and sweep iterates keys, so every file is reachable that way; nothing ever lists a directory to find work. +- **Every key precedes its file.** The pre-write `"freezing"` mark and post-fsync `"frozen"` flip mean any file on disk — per-chunk artifact or index file, partial or complete — has its catalog key set. Every scan and sweep iterates keys, so every file is reachable that way; nothing ever lists a directory to find work. - **Index promotion is atomic and gap-free.** The commit batch freezes the new coverage and demotes its predecessor in one synced write, so the window's unique frozen key changes hands atomically — never two frozen keys, never none once the window has one. A reader following the frozen key always lands on a complete, fsynced index; a crash mid-build leaves the prior coverage frozen and the attempt as `"freezing"` debris that is either overwritten by the next build of that coverage or deleted unread by the sweeps. - **Key absent ⟹ file gone.** Every sweep's shared ordering (unlink → `fsyncDir` → atomic key delete) gives the exit-side counterpart. - **Hot DB keys bracket the directory.** The `hot:chunk:{chunk}` key is put (`"transient"`) before the directory is created, and deleted only after rmdir completes — with `"transient"` re-marked first. 
- **Tx hashes always have a queryable home.** The hot DB is discarded only after the durable `.idx` covers the chunk — hot CF, then `.idx`, with no gap. (The `.bin` is never a serving tier; it is rebuild input, demoted to `"pruning"` by the terminal commit batch — the same write that freezes the final `.idx` — or by retention pruning once its chunk falls past the floor, and deleted only by the sweep after that.) -- **`"frozen"` ⟹ the file is durable and complete.** Flips to `"frozen"` happen only after fsync, and files are deleted only under non-frozen keys (sweeps demote first) — so frozen keys can be trusted blindly by readers, the resolver, and `buildTxhashIndex`'s precondition check. -- **`"pruning"` is committed.** Once a key is in `"pruning"` — demoted by a commit batch or by retention — the sweep runs to completion on subsequent scans. Catch-up treats any non-`"frozen"` state as empty and overwrites cleanly if the range is re-ingested. +- **`"frozen"` ⟹ the file is durable and complete.** Flips to `"frozen"` happen only after fsync, and files are deleted only under non-frozen keys (sweeps demote first) — so frozen keys can be trusted blindly by readers and the resolver. +- **`"pruning"` is committed.** Once a key is in `"pruning"` — demoted by a commit batch or by retention — the sweep runs to completion on subsequent scans. Backfill treats any non-`"frozen"` state as empty and overwrites cleanly if the range is re-ingested. ### Scenario coverage -INV-1 holds at every point the daemon is serving reads — transient states are never externally visible, because a read resolves only a `"ready"` hot DB or a `"frozen"` cold artifact — never a `"freezing"`/`"pruning"`/`"transient"` key, and the retention check masks everything else. INV-2, INV-3, and INV-4 hold at every quiescence reached after the events below; startup's first quiescence arrives when the first tick completes, shortly after reads open. 
+INV-1 holds at every point the daemon is serving reads — transient states are never externally visible, because a read resolves only a `"ready"` hot DB or a `"frozen"` cold artifact — never a `"freezing"`/`"pruning"`/`"transient"` key, and the retention check masks everything else. INV-2, INV-3, and INV-4 hold at every settled state reached after the events below; startup's first settled state arrives when the first run completes, shortly after reads open. + +1. **Steady-state operation.** Hot DB ingestion advances the last committed ledger; the lifecycle goroutine freezes complete chunks within retention and prunes anything past it. All four invariants hold by induction on the last committed ledger. +2. **Operator state changes — widening or shortening retention (`retention_chunks`).** Changing `retention_chunks` recomputes the retention floor, and the next startup converges to the new state. Backfill's per-window rule rebuilds any window whose desired coverage now exceeds what's stored, and the prune stage removes anything below a raised floor. + + Widening takes effect on the *next startup*, not immediately: a running daemon holds the retention config it started with, so its floor never drops mid-run — the lower floor, and the backfill that fills down to it, apply only at the next startup. `earliest_ledger` is not a live change at all: it is pinned on the first start and immutable, so editing the config never moves the floor (the only way to change it is to wipe the data directory and start fresh). +3. **Surgical recovery (tainted data).** The operator never touches the filesystem. Recovery is **one atomic catalog batch** that *demotes* the affected keys — it never removes them — split by tier. Tainted cold artifacts (`chunk:{c}:*` and every overlapping `index:*` key) go to `"freezing"`, the state that already means *this file is not to be trusted: re-derive or delete*.
For the hot tier, demote **every `hot:chunk` at or above the lowest tainted chunk — the live chunk always included** — to `"transient"`, not just the directly-tainted ones (the third paragraph of this scenario gives the reason). `"transient"` makes a hot DB instantly ineligible as a source (`backfillSource` reads only `"ready"`) and invisible to the last-committed-ledger derivation (which counts only `"ready"` keys). The batch commits atomically or not at all, and re-running it is a no-op; the catalog's lock means it can only be written against a stopped daemon. + + Everything then converges through machinery that already exists. Backfill re-derives the `"freezing"` cold artifacts from a conformant LedgerBackend — overwriting in place, the write protocol's ordinary re-materialization — and rebuilds each window's index. (If the backend tip lags below a re-derived chunk, `backfillSource` waits for coverage; see [the primitives](#the-primitives).) The `"transient"` hot DBs need no file surgery: `openHotDBForChunk` wipes and recreates one when re-ingestion re-opens that chunk, and the discard scan retires any sitting below the live chunk. -1. **Steady-state operation.** Hot DB ingestion advances `last_committed_ledger`; the lifecycle goroutine freezes complete chunks within retention and prunes anything past it. All four invariants hold by induction on `last_committed_ledger`. -2. **Operator state changes** — widening or shortening retention (`retention_chunks`). A `retention_chunks` change reduces to "`effectiveRetentionFloor` recomputes; the next startup converges to the new state": catch-up's per-window resolver rule re-derives and rebuilds any window whose desired coverage now exceeds its stored coverage, and the prune stage removes anything below a raised floor.
The "next startup" is load-bearing for widening, enforced by the floor's two-role split: a lowered floor takes effect immediately in its *retention* role (pruning simply stops sooner), but the tick's *production* range still starts at existing storage — only the next catch-up, behind `validateRangeProducible`, materializes the new bottom. **`earliest_ledger` is not a live operator change**: it is pinned on first start and immutable — `validateConfig` aborts on any later genesis/numeric value that differs from the pin, and treats `"now"` as the pinned floor (see [Configuration](#configuration)) — so a plain config edit never moves the floor. The same floor machinery *would* converge for either direction once a future `set-earliest-ledger` admin command demotes the pin; until then the only supported way to change it is wiping the data directory, which is simply a fresh first start. -3. **Surgical recovery (tainted data).** The operator never touches the filesystem. Recovery is **one atomic meta-store batch** that *demotes* the affected keys — never removes — split by tier: tainted cold artifacts (`chunk:{c}:*` and every overlapping `index:*` key) → `"freezing"`, the state that already means *this file is not to be trusted: re-derive or delete*; tainted or lost hot DBs (`hot:chunk`, the live chunk's included) → `"transient"`, instantly ineligible as a source (`catchupSource` reads only `"ready"`) and ignored by the watermark, which counts only `"ready"` keys. The batch commits atomically or not at all, so there is no interruption analysis and re-running it is a no-op; the meta store's lock means it can only be written against a stopped daemon. 
Everything converges through machinery that already exists: catch-up re-derives the `"freezing"` cold artifacts from a conformant LedgerBackend — overwriting in place, rule 1's ordinary re-materialization — and rebuilds each window's index (if the backend tip lags below a re-derived chunk, `catchupSource` waits for coverage rather than aborting — see [catch-up primitives](#catch-up-primitives)); the `"transient"` hot DBs need no file surgery — `openHotDB` wipes and recreates one when re-ingestion re-opens that chunk, and the discard scan retires any sitting below the live chunk. Demoting hot DBs is **self-correcting for `last_committed_ledger`** because the watermark ignores `"transient"`: a demotion that reaches the live chunk rewinds the watermark to the last frozen boundary, and captive core re-ingests the un-frozen tail **forward**, never through the lagging bulk backend; a demotion strictly below the live chunk leaves the watermark unchanged (those chunks aren't the highest `"ready"` key, and the live chunk's `"ready"` DB still pins it). This uniformity is what replaces the old demote-vs-remove / above-or-below-the-live-chunk split — every recovery demotes, nothing is removed by hand, so there is no dirs-first-keys-last ordering for an operator to get wrong; the daemon's own sweeps and `openHotDB` handle the dirs in their existing crash-safe order. -4. **Hot-volume loss.** The hot-tier demotion above, triggered by loss rather than taint: the hot storage tree is gone (e.g. ephemeral NVMe died) while the meta store survives, so its `hot:chunk` keys read `"ready"` with missing dirs. `deriveWatermark`/`openHotDB` fatal on that mismatch — deliberately, since a missing dir can also be a mount misconfiguration where auto-wiping would destroy state — and point the operator at recovery. The operator demotes the orphaned `hot:chunk` keys to `"transient"` (the case-3 batch, hot tier only). 
On restart the fatal no longer fires (it checks `"ready"` keys), the watermark falls to the last frozen boundary (the cold artifacts survive on durable storage), and captive core re-ingests the lost tail **forward**. The hot DB was the only copy of its un-frozen ledgers — losing it loses them — and the **derived watermark admits as much automatically**: with those keys no longer `"ready"`, derivation lands at the last frozen boundary and re-ingestion fills from there. There is no watermark to edit, and the dirs are already gone, so recovery is pure key demotion. -5. **First deployment / downtime between restarts.** `last_committed_ledger` derives to `max(frozen/hot maxima, earliest_ledger - 1)`, ensuring `resumeLedger ≥ earliest_ledger`. Backfill fills `[earliest_ledger, lastCompleteChunkAt(network_tip)]` if needed (a no-op for `earliest_ledger = "now"` first deployment). -6. **LedgerBackend choice or mid-flight swap.** The LedgerBackend contract guarantees canonical LCM bytes for any range, so any conformant backend produces byte-identical artifacts. Different backends differ in performance, not behavior. An operator using BSB for backfill and CaptiveCore for hot DB ingestion, or swapping mid-deployment, satisfies all four invariants. -7. **Crash at any point during any of the above.** Sync WAL plus per-ledger durability ordering mean the meta store on next start is internally coherent and the derived watermark equals exactly what the last synced batch committed. Idempotency means re-running any half-finished op is safe. Convergence finishes whatever the crash interrupted. + **Why every hot DB at or above the taint, not just the tainted one.** The hot tier is repaired only by re-ingestion, which replays **forward** from the last committed ledger — the highest `"ready"` hot chunk. 
To replay a tainted hot chunk, that watermark must first fall *below* it; and since the watermark is the maximum over all `"ready"` hot chunks, it falls below the taint only once every hot DB at or above the lowest tainted chunk is demoted. Demoting just the tainted chunk would leave a higher `"ready"` chunk — ultimately the live chunk — pinning the watermark above the taint, so re-ingestion would never reach it. Once they are all demoted, the watermark drops to the last frozen boundary below the taint, captive core re-ingests the tail forward, and the untainted hot chunks swept up in the demotion are re-derived byte-identically. Every recovery demotes; nothing is removed by hand — the daemon's own sweeps and `openHotDBForChunk` handle the dirs in their existing crash-safe order. +4. **First deployment / downtime between restarts.** The last committed ledger derives to `max(frozen/hot maxima, earliest_ledger - 1)`, ensuring `resumeLedger ≥ earliest_ledger`. Backfill fills `[earliest_ledger, lastCompleteChunkAt(network_tip)]` if needed (a no-op for `earliest_ledger = "now"` first deployment). +5. **LedgerBackend choice or mid-flight swap.** The LedgerBackend contract guarantees canonical LCM bytes for any range, so any conformant backend produces byte-identical artifacts. Different backends differ in performance, not behavior. An operator using BSB for backfill and CaptiveCore for hot DB ingestion, or swapping mid-deployment, satisfies all four invariants. +6. **Crash at any point during any of the above.** Sync WAL plus per-ledger durability ordering mean the catalog on next start is internally coherent and the derived last committed ledger equals exactly what the last synced batch committed. Idempotency means re-running any half-finished op is safe. Convergence finishes whatever the crash interrupted. ### What a bug looks like The invariants describe what storage should look like, not how the phase scans maintain it. 
So common bugs show up as concrete violations: -- **A meta-store key claims something the file doesn't actually deliver** — e.g., a per-chunk writer flips a key to `"frozen"` before fsync (leaving a partial file the meta store advertises as complete), or an index key freezes before its `.idx` is fully fsynced, or the key name's `{lo, hi}` doesn't match the file's actual coverage, or a frozen file is mutated post-freeze ⟹ reads through the meta key see wrong or missing data. **INV-1** violated. Detectable by re-deriving an artifact via a conformant LedgerBackend and byte-comparing against the on-disk file. +- **A catalog key claims something the file doesn't actually deliver** — e.g., a per-chunk writer flips a key to `"frozen"` before fsync (leaving a partial file the catalog advertises as complete), or an index key freezes before its `.idx` is fully fsynced, or the key name's `{lo, hi}` doesn't match the file's actual coverage, or a frozen file is mutated post-freeze ⟹ reads through the catalog key see wrong or missing data. **INV-1** violated. Detectable by re-deriving an artifact via a conformant LedgerBackend and byte-comparing against the on-disk file. - **Pruning too aggressive** ⟹ a request whose ledger scope is in retention returns wrong or missing results. Issue a read to find it. **INV-1** violated. - **Two frozen index keys in one window** — a commit batch failed to demote the predecessor, or promotion and demotion landed as separate writes ⟹ readers have no well-defined index. Walk `index:*` keys, count `"frozen"` per window. **INV-2** violated. -- **A `"freezing"` or `"pruning"` key within `[floor, completeThrough]` survives served quiescence** ⟹ its recovery mechanism was skipped — an index transient the sweeps should have deleted, a `"pruning"` demotion the sweeps should have finished, or a per-chunk `"freezing"` key that the freeze phase or startup catch-up should have re-materialized. 
Walk keys for transient values at quiescence, excluding the one corner INV-2 tolerates — a `"freezing"` artifact key *above* `completeThrough` after hot-volume loss with a lagging tip, which no source can yet repair. **INV-2** violated. +- **A `"freezing"` or `"pruning"` key within `[floor, last complete chunk]` survives while serving and settled** ⟹ its recovery mechanism was skipped — an index transient the sweeps should have deleted, a `"pruning"` demotion the sweeps should have finished, or a per-chunk `"freezing"` key that the freeze phase or startup backfill should have re-materialized. Walk keys for transient values when settled, excluding the one corner INV-2 tolerates — a `"freezing"` artifact key *above* the last complete chunk after a hot-data recovery with a lagging backend tip, which no source can yet repair. **INV-2** violated. - **Chunk scan misses an orphan** ⟹ a hot DB persists for a chunk that cold artifacts fully serve. Walk `hot:chunk:c` keys whose chunk has its artifacts durable and its window's index covering `c`. **INV-2** violated. - **Finalization demotions don't complete** ⟹ per-chunk frozen tx hash files outlive the index that consumed them. Walk `chunk:c:txhash` keys whose window's frozen key has `hi` = the window's last chunk. **INV-2** violated. -- **A writer leaves a file on disk without its meta-store key** (file fsynced before key was durable, or a sweep deleted the key before its unlink was durable) ⟹ orphan file — invisible to every key-driven scan. Walk the filesystem against the meta-store. **INV-3** violated. -- **A meta-store key persists without its file** (file deleted before key) ⟹ dangling key. Walk the meta-store against the filesystem. **INV-3** violated. -- **Duplicate cold artifacts for the same logical data** (e.g., two events files for the same chunk, from a migration or buggy retry) ⟹ the meta-store names one expected path; the extras are orphans. Walk the filesystem against meta-store-specified paths. 
**INV-3** violated. -- **Pruning fails past the floor** ⟹ files or keys remain for ranges below `effectiveRetentionFloor`. Walk meta-store keys, compare ledger ranges to the floor. **INV-4** violated. +- **A writer leaves a file on disk without its catalog key** (file fsynced before key was durable, or a sweep deleted the key before its unlink was durable) ⟹ orphan file — invisible to every key-driven scan. Walk the filesystem against the catalog. **INV-3** violated. +- **A catalog key persists without its file** (file deleted before key) ⟹ dangling key. Walk the catalog against the filesystem. **INV-3** violated. +- **Duplicate cold artifacts for the same logical data** (e.g., two events files for the same chunk, from a migration or buggy retry) ⟹ the catalog names one expected path; the extras are orphans. Walk the filesystem against catalog-specified paths. **INV-3** violated. +- **Pruning fails past the floor** ⟹ files or keys remain for ranges below the retention floor. Walk catalog keys, compare ledger ranges to the floor. **INV-4** violated. A storage walk against the invariants is enough to find these without inspecting the phase implementations. @@ -1425,7 +1008,7 @@ A storage walk against the invariants is enough to find these without inspecting ## Related documents -- The transactions design ([gettransaction-full-history-design.md](./gettransaction-full-history-design.md)) — the tx-by-hash subsystem end to end: the hot `txhash` CF, the `.bin`/`.idx` formats, the rolling window index build protocol with its pseudocode and safety arguments, the `getTransaction` read path, and the capacity numbers. Canonical for everything rules 3 and 5 and the index-key section summarize. 
+- The transactions design ([gettransaction-full-history-design.md](./gettransaction-full-history-design.md)) — the tx-by-hash subsystem end to end: the hot `txhash` CF, the `.bin`/`.idx` formats, the rolling window index rebuild — its streamhash merge internals and safety argument — the `getTransaction` read path, and the capacity numbers. Canonical for the streamhash `.bin`/`.idx` formats, the index merge internals, and the index-key coverage semantics this doc summarizes. - The events design ([getevents-full-history-design.md](./getevents-full-history-design.md), PR #635) — the cold-segment file formats and the hot events CF schema referenced by the data model. - The reader / query-routing design — how reads dispatch between hot DBs and frozen files for in-retention queries. -- The original backfill workflow design ([../full-history/design-docs/03-backfill-workflow.md](../full-history/design-docs/03-backfill-workflow.md)) — the standalone backfill mode, **subsumed by this document**: its `process_chunk`, tx-hash index build and cleanup, geometry, configuration, directory layout, meta-store keys, crash recovery, and `getStatus` are all redefined here (and in the transactions design) in their current form. It is retained for history; where the two disagree, this document is current. In particular, that doc predates the 2026-06 streamhash redesign, so its 16-CF RecSplit `.idx` files, unsorted 36-byte `.bin` entries, `"1"`-valued meta keys, task-DAG scheduler, and standalone `full-history-backfill` CLI are all superseded. +- The original backfill workflow design ([../full-history/design-docs/03-backfill-workflow.md](../full-history/design-docs/03-backfill-workflow.md)) — the standalone backfill mode, **subsumed by this document**: its `process_chunk`, tx-hash index build and cleanup, geometry, configuration, directory layout, catalog keys, and crash recovery are all redefined here (and in the transactions design) in their current form. 
It is retained for history; where the two disagree, this document is current. In particular, that doc predates the 2026-06 streamhash redesign, so its 16-CF RecSplit `.idx` files, unsorted 36-byte `.bin` entries, `"1"`-valued catalog keys, task-DAG scheduler, and standalone `full-history-backfill` CLI are all superseded. diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md index 0899db3e4..d3a71eb01 100644 --- a/design-docs/gettransaction-full-history-design.md +++ b/design-docs/gettransaction-full-history-design.md @@ -1,25 +1,17 @@ # RPC getTransaction Full-History Design -## Summary - -How the full-history daemon ingests and serves transactions for the tx-by-hash endpoint (`getTransaction`). A transaction lookup is a two-step read: resolve the hash to a ledger sequence, then fetch the transaction from that ledger's stored LCM. This document covers the resolution structure: the hot tier (a column family in the per-chunk hot RocksDB) and the cold tier (per-chunk sorted runs merged into a per-window minimal-perfect-hash index), the file formats, the rolling rebuild that keeps the cold tier current, the read path, and the capacity numbers. - -The daemon context — chunk geometry, the meta store, the one write protocol, catch-up, and the lifecycle tick — is defined in [full-history-streaming-workflow.md](./full-history-streaming-workflow.md) (the streaming doc). That document references this one for everything tx-hash-specific and restates only what its own protocols depend on. - ---- - # Part 1: Problem and Scope ## 1. 
Objective Serve `getTransaction(hash)` for any transaction whose ledger falls within the retention window (full history by default): -- **Complete.** Every transaction in every in-retention ledger is resolvable by hash — no gaps, including across crashes, restarts, and retention changes — with one quantified residual: the cold index keys on streamhash's 128-bit routing key (the hash's first 16 bytes, §6.1), so two distinct in-retention hashes that share a 16-byte prefix *and fall in the same window* collide as one index key. That is a ~10⁻²⁰-per-dense-window event, and it is *fail-stop* rather than silent — the window's index build fails loudly rather than dropping or mis-resolving a transaction. (A shared prefix across *different* windows is harmless: the two never meet in one build, and the read path's verify rejects the resulting cross-window false positive — §8.2.) The guarantee is otherwise absolute. +- **Complete.** Every transaction in every in-retention ledger is resolvable by its hash, with no gaps — across crashes, restarts, and retention changes alike. The one exception is a hash-prefix collision so rare (~10⁻²⁰ for a dense window) that it counts as negligible, and even then it fails loudly rather than silently. §8.2 has it. - **Correct.** A lookup never returns the wrong transaction; a missing or out-of-retention one returns not-found. -- **No in-memory index.** The hash→seq map is on-disk `.idx` files (read through the page cache), not a RAM-resident structure sized to the transaction count — so the daemon holds no memory proportional to the number of transactions in history. (A lookup probes one `.idx` per in-retention window — a hash carries no window hint; that probe set and its cost are the query-routing design's concern.) -- **Cheap to maintain.** Ingestion adds negligible cost to the per-ledger write path, and the cold index stays current with a rebuild that is small relative to its cadence. 
+- **No in-memory index.** The map lives in on-disk `.idx` files, read through the page cache — not a RAM structure sized to the transaction count. The daemon's memory does not grow with the number of transactions in history. +- **Cheap to maintain.** Ingestion adds negligible cost to the per-ledger write, and the cold index stays current with a rebuild that is small relative to how often it runs. -Out of scope: how readers obtain the daemon's current coverage and dispatch between tiers across rebuild/freeze/discard transitions (the query-routing design), and the storage of the transactions themselves (the ledger store — `.pack` files and the hot `ledgers` CF — covered by the streaming doc and the packfile library doc). +Out of scope: how a reader chooses which tier and window to consult and stays correct while files are added and removed (the query-routing design), and the storage of the transaction bytes themselves (the ledger store). ## 2. Lookup model @@ -30,11 +22,11 @@ hash ──► seq ──► LCM for seq ──► extract the tx ──► veri (this doc) (ledger store) ``` -The subsystem this document owns is the **hash → seq map**, plus the read-path rules that make the final fetch correct. Three properties of the key space shape the design: +Three properties of the transaction-hash key space shape the design: -- **Point lookups only.** There are no range or prefix queries over tx hashes, so order-preserving structures buy nothing — perfect-hash structures apply. -- **Hashes are uniform and immutable.** A transaction hash is never updated and corresponds to at most one applied transaction (the network's replay protection); the map is append-only, one batch of entries per ledger. -- **The full transaction is always fetched anyway.** The response needs the envelope/result/meta, so the read path always ends with the ledger store and can verify the full hash against the fetched transaction. 
The map therefore doesn't need to be exact — only *complete* (no false negatives; the one residual, a same-window 16-byte prefix collision, is fail-stop rather than a silent miss — §8.2); false positives are screened first by a fingerprint and finally by the fetch-and-verify step. +- **Point lookups only.** Every query is for one specific hash, never a range or prefix — exactly what a perfect hash is built for. +- **Hashes are uniform and immutable.** A transaction hash is never updated, and corresponds to at most one applied transaction (the network's replay protection). The map is append-only: one batch of entries per ledger. +- **The full transaction is always fetched anyway.** The response needs the envelope, result, and meta, so the read path always ends by fetching the transaction and checking its full 32-byte hash. That means the map needn't be exact — only *complete*, never missing a hash that is really there. False positives are harmless: a fingerprint screens most of them, and the final hash check catches the rest. --- @@ -42,7 +34,7 @@ The subsystem this document owns is the **hash → seq map**, plus the read-path ## 3. The two tiers -An in-retention transaction is stored in exactly one place — one tier, one window, never duplicated — but a bare hash doesn't say *which*, so a lookup probes every home (§8.1) and at most one confirms (none, if the hash isn't there). The two homes: +Each in-retention transaction lives in exactly one place — one tier, one window, never copied. But a hash on its own doesn't say which place, so a lookup checks them all, and at most one answers (none, if the hash isn't stored). 
The two places a transaction can live: | Tier | Structure | Serves | |---|---|---| @@ -57,16 +49,14 @@ An in-retention transaction is stored in exactly one place — one tier, one win coverage) written) ``` -The handoff is gap-free by write ordering: a chunk's hot DB is discarded only after the durable `.idx` covers the chunk (the streaming doc's discard stage gates on exactly this). Between a chunk's freeze and its coverage — normally one lifecycle tick, since freeze, rebuild, and discard chain within the tick — the chunk is served from its still-present hot DB. - -There is **no dedicated transaction store at any layer**. The map resolves hashes to sequences; transaction bytes live in the ledger store (`ledgers` CF while hot, `.pack` files when cold). +The two tiers hand off with no gap. A chunk's hot table is dropped only *after* the cold index covers that chunk. So a freshly frozen chunk keeps being answered from its hot table until the index can answer for it, and only then does the hot table go away. Every transaction is findable in exactly one tier at all times. ## 4. Geometry -Two units organize the map; both are defined in the streaming doc and restated here because every structure below is named by them: +Two units organize the map. Every structure below is named by them: - **Chunk** — 10,000 ledgers (hardcoded). The unit of the hot DB and of the sorted runs. -- **Window** — `chunks_per_txhash_index` chunks (default 1000 = 10M ledgers). The unit of the cold index. Configurable, but pinned in the meta store on first start and immutable thereafter. +- **Window** — `chunks_per_txhash_index` chunks (default 1000 = 10M ledgers). The unit of the cold index. Configurable, but pinned in the catalog on first start and immutable thereafter. 
``` chunkID(seq) = (seq - 2) / 10_000 @@ -86,28 +76,28 @@ With the default `chunks_per_txhash_index = 1000`: window 0 spans ledgers 2–10 ### 5.1 Storage -The `txhash` CF is one column family of the per-chunk hot RocksDB (alongside `ledgers` and the events CFs — see the streaming doc's data model): +The hot tier is a plain key-value table, one per chunk, stored as a `txhash` column family in that chunk's RocksDB: - **Key**: the full 32-byte transaction hash. - **Value**: the 4-byte ledger sequence. -Full-key storage means the hot tier is *exact*: a lookup of a hash not in the chunk simply misses, with no fingerprint or verification subtleties. The CF carries its own tuning options (point-lookup-oriented: bloom filters, no ordering requirements) independent of its siblings. +Storing the full hash makes the hot tier **exact**: a lookup either finds the hash or it doesn't. There are no false positives to screen out and nothing to verify. The table is tuned for point lookups — bloom filters on, ordering off. ### 5.2 Write path -The hot write path is the streaming doc's ingestion loop verbatim: each ledger commits as **one atomic, synced WriteBatch across all CFs** of the chunk's hot DB, and `putTxHashes` contributes one `(hash, seq)` entry per transaction in the LCM. A ledger's hashes are either all present or all absent; there is no separate tx-hash durability boundary to reason about. +Writing is straightforward. As each ledger is ingested, one `(hash, seq)` entry is added for every transaction in it, in the same atomic write that stores the rest of the ledger. So a ledger's hashes are written all-or-nothing, together with the rest of the ledger. ### 5.3 Lifetime -The chunk's hot DB is created when ingestion enters the chunk and discarded whole once every cold artifact derived from the chunk is durable **and** the window index covers the chunk. 
The `txhash` CF is the reason for the *and*: the `.bin` (below) is never a serving tier, so without the coverage gate there would be a window where a frozen chunk's hashes had no queryable home. In steady state the freeze-to-coverage interval is the one tick in which the boundary's `ChunkBuild` and `IndexBuild` run; after catch-up or crashes it can span longer — the discard scan re-derives eligibility from durable keys, so the hot DB simply persists until coverage genuinely lands. +A chunk's hot table lives from the moment the chunk starts ingesting until the cold index covers it. Coverage can lag the chunk's freeze by a while; until it lands, the chunk is simply answered from its hot table. ## 6. Cold artifacts -Two artifact kinds, both produced and cataloged under the streaming doc's one write protocol (mark `"freezing"` before any I/O → write → fsync file + dirents → flip `"frozen"`): +The cold tier has two kinds of file: a per-chunk sorted run (`.bin`) and the per-window index (`.idx`). ### 6.1 The per-chunk sorted run: `.bin` -`txhash/raw/{bucket:05d}/{chunk:08d}.bin`, meta-store key `chunk:{chunk:08d}:txhash`. Produced by `processChunk` in the same streaming pass that writes the chunk's `.pack` and events segment — per ledger, the tx-hash extractor collects entries; at the end of the pass they are **sorted in memory** (~3M entries ≈ 60 MB for a dense chunk — negligible) and written out. +The `.bin` lives at `txhash/raw/{bucket:05d}/{chunk:08d}.bin`, with catalog key `chunk:{chunk:08d}:txhash`. It is produced once, when the chunk is frozen: as the chunk's ledgers are read, each transaction's `(hash, seq)` is collected, and at the end they are **sorted in memory** (~3M entries ≈ 60 MB for a dense chunk — negligible) and written out. 
**Format** (the streamhash merge format): @@ -116,157 +106,72 @@ uint64 LE entry count entry × count 20 bytes each: [key: 16][seq: 4 LE] ``` -- `key` is the **first 16 bytes of the transaction hash** — streamhash's routing-key width: it derives the perfect-hash slot from these 16 bytes alone, both when this run is built and when a lookup probes (§8.1). Two distinct in-retention hashes sharing these 16 bytes are a true duplicate key only when they land in the *same* window's build; across windows the shared prefix is harmless. Both cases, and their handling, are §8.2. +- `key` is the **first 16 bytes of the transaction hash**. The index uses only these 16 bytes to place and find a transaction; what happens when two hashes share a 16-byte prefix is in §8.2. - Entries are sorted ascending by the **big-endian `uint64` prefix of `key`**. -The `.bin` is a *map-side-sorted run*, never a serving tier. Sorted runs are what make the rolling rebuild cheap: the index builder consumes them in a single streaming k-way merge, instead of the two passes (count, then add) it needs over unsorted input. +The `.bin` is a pre-sorted file, and a lookup never reads it directly. It is sorted because streamhash builds an index **much faster, and with much less memory, when its keys arrive already sorted** — its *sorted-builder mode*. -A `.bin` lives as long as its window needs it as rebuild input: every boundary re-merges **all** of the current window's runs, so they are retained for the whole life of the window and demoted en masse by the terminal build's commit batch (§7.3), then swept. +A `.bin` is kept for as long as its window is still being rebuilt — every rebuild re-merges all of the window's `.bin` files. Once the window is complete and its final index is built, the `.bin` files are no longer needed, and are deleted. ### 6.2 The per-window index: `.idx` -`txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, meta-store key `index:{window:08d}:{lo:08d}:{hi:08d}`. 
One streamhash minimal-perfect-hash file per **coverage**, built by streamhash's `SortedBuilder` over the k-way merge of `.bin[lo..hi]`, with the cold-txhash option set: +The `.idx` lives at `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, tracked by the catalog key `index:{window:08d}:{lo:08d}:{hi:08d}`. There is one minimal-perfect-hash file per **coverage** — a coverage being the chunk range `[lo, hi]` the file actually hashes. Streamhash's `SortedBuilder` builds it from the k-way merge of `.bin[lo..hi]`. Two fields are sized per build: -- **Payload: `payloadWidth` bytes** — the ledger seq stored as an offset from `MinLedger`, where `MinLedger = chunkFirstLedger(lo)` is derived from the build range. The width is sized to the window so the format never caps `chunks_per_txhash_index`: `payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8)`, the bytes needed to hold the largest in-window offset (`chunks_per_txhash_index * 10_000 - 1`). At the default 1000 chunks (10M ledgers) this is **3 bytes** — a 24-bit offset spans 16.77M ledgers — and a window of 1678+ chunks (>16.77M ledgers) widens it to 4. (The format imposes no upper bound of its own; `chunks_per_txhash_index` is independently capped at `MaxChunksPerTxhashIndex` ≈ 429,496 — `floor(2³²/10_000)` — by the streaming doc's `validateConfig` so a window's ledger span always fits a uint32 seq, which is why the offset never needs more than 4 bytes.) Since `chunks_per_txhash_index` is immutable once stored, the width is fixed for every window's life; streamhash records it in the index file header (recovered when the file is opened), while `MinLedger` — which streamhash itself does not model — rides in the user-metadata slot. Both are read back at lookup time; no sidecar metadata. -- **Fingerprint: `fpWidth` bytes (default 1)** — a streamhash option screening foreign keys before fetch-and-verify. 
Since a hash lookup probes every in-retention window (§8.2), a wider fingerprint trades index size (+1 byte/tx) for fewer false-positive fetches across those windows. Fixed per build, like `payloadWidth`. +- **Payload (`payloadWidth` bytes): the answer the hash maps to — a ledger seq.** It is stored as an offset from the window's first ledger (`MinLedger = chunkFirstLedger(lo)`) rather than as a full seq, to save bytes. The width is sized to the window, so the format never limits how big a window can be: `payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8)` — just enough bytes for the largest offset a window can produce (`chunks_per_txhash_index * 10_000 - 1`). At the default 1000 chunks (10M ledgers) that is **3 bytes**, a 24-bit offset spanning 16.77M ledgers; a window of 1678+ chunks (>16.77M ledgers) bumps it to 4. It never needs more than 4: `chunks_per_txhash_index` is capped at `MaxChunksPerTxhashIndex` ≈ 429,496 (`floor(2³²/10_000)`), so a window's ledger span always fits in a uint32. Because `chunks_per_txhash_index` is fixed once stored, the width is fixed for every window's life. Streamhash writes the width into the index file's header; `MinLedger`, which streamhash does not model itself, rides in the file's user-metadata slot. Both are read back at lookup time, so there is no separate sidecar file. +- **Fingerprint (`fpWidth` bytes, default 1): a few bytes per entry to screen out wrong hashes** before the expensive fetch-and-verify. Because a lookup probes every in-retention window (§8.2), a wider fingerprint is a trade-off: it costs index size (+1 byte per transaction) but cuts the number of false-positive fetches across those windows. Fixed per build, like the payload width. All-in, at the default 3-byte payload the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. 
A window past the 4-byte payload threshold adds one byte per transaction. -These formats match the measured pipeline in the bench harness (`bench-fullhistory`: `cold-ingest --types=txhash` + `build-txhash-index`), which is where the performance figures in Part 4 come from — adopting the formats unchanged is what makes those figures transfer to this design. +### 6.3 Coverage and the live index -### 6.3 Keys, coverage, and the uniqueness invariant +An index file is named by its **coverage** — the chunk range `[lo, hi]` it hashes: -An index key's **name carries the coverage; the value carries only lifecycle state** (`"freezing"` / `"frozen"` / `"pruning"`, with the same meanings as every artifact key in the system). The filename is derived from the key by a fixed bijection, so resolving a key to its file never reads the value or lists a directory — every file on disk, including a crashed attempt's partial, is reachable from its key alone. +- **`lo`** — the lowest chunk the index covers. It is the window's first chunk, unless the retention floor has cut into the window, in which case it rises to the first chunk still retained. +- **`hi`** — the highest chunk the index covers. While the window is the current one (the network tip is in it), `hi` advances by one chunk on each rebuild. Once the window is complete, `hi` is its last chunk and the index is final. -- **Coverage is the whole identity** — there is no per-attempt counter. A retry of a crashed build re-marks the same key and rewrites the same file from scratch. The file readers hold is never a writer's target: a file is writable only under a key that has never been `"frozen"` in this run, and a scheduled build's coverage always differs from the window's frozen coverage — equality is precisely the case the build's skip check returns on. -- **`lo`** — the lowest chunk the perfect hash covers. 
Defaults to the window's first chunk; rises above that when `earliest_ledger` or the sliding retention floor cuts into the window at build time. `lo` rising is how a mid-window floor is encoded — there is no other floor representation in the index. -- **`hi`** — the highest chunk the perfect hash covers. For the *current* window (the one the network tip is in), `hi` advances by one chunk on each boundary. For a *finalized* window, `hi` is the window's last chunk. -- **Terminal-ness is derived, not stored**: the key whose `hi` equals its window's last chunk (`windowLastChunk` is computable forever from the window id and the immutable `chunks_per_txhash_index` pin). A window whose frozen key is terminal is finalized — never rebuilt again (only a retention-widening catch-up re-derives it, at its new, wider coverage). There is no `"finalized"` value and no marker: it would be a second copy of a derivable fact. +A window has exactly **one live index** at a time, and a lookup resolves "the window's index" to that one file. A rebuild builds a new index at a wider coverage and replaces the live one; the replacement is atomic, so a lookup always sees one complete index, never a half-built one. (How that swap stays atomic across a crash is the daemon's write protocol, in the streaming doc.) -**The uniqueness invariant: at most one coverage per window is `"frozen"` at any moment — at all times, not just at quiescence.** The rebuild's commit is one atomic synced batch that promotes the new coverage and demotes its predecessor in the same write, so the frozen coverage changes hands atomically. Readers resolve "the window's index" as *the unique frozen key*: no tie-break, no value parsing. Everything else under the window's prefix is transient debris with an unambiguous disposition: `"freezing"` = a crashed attempt — re-marked and overwritten if its coverage is built again, otherwise swept; `"pruning"` = a superseded coverage — finish the unlink and drop the key. 
+So the index hashes exactly the transactions in chunks `[lo, hi]`. Chunks below `lo` are out of scope — cut off by the floor. Chunks above `hi` aren't folded in yet, and are served from their hot tables until the next rebuild advances `hi`. -So the `.idx` hashes exactly the transactions in chunks `[lo, hi]`. Chunks below `lo` are out of scope (floor); chunks above `hi` are not yet folded in — their lookups are served from their chunks' hot DBs until the next rebuild advances `hi`. - -**Concrete example** (default `chunks_per_index = 1000`): while the tip is in chunk 5350, index 5 (chunks 5000–5999) is the current window. If the floor sits at chunk 5100, the store holds `index:00000005:00005100:00005349 = "frozen"` and the live file is `txhash/index/00000005/00005100-00005349.idx` — covering chunks 5100–5349, with chunk 5350 still streaming into its hot DB and chunks 5000–5099 below the floor. The next boundary puts `index:00000005:00005100:00005350 = "freezing"`, writes and fsyncs `00005100-00005350.idx`, then commits the batch {`[5100,5350]` → `"frozen"`, `[5100,5349]` → `"pruning"`}; the eager sweep unlinks the superseded file and deletes its key right after the commit. +**Example** (default 1000 chunks per window): the tip is in chunk 5350, so window 5 (chunks 5000–5999) is the current window, and the floor is at chunk 5100. The live index covers chunks 5100–5349, in the file `txhash/index/00000005/00005100-00005349.idx`; chunk 5350 is still in its hot table, and chunks 5000–5099 are below the floor. At the next boundary the index is rebuilt to cover 5100–5350, and the old file is deleted. ## 7. The rolling rebuild -### 7.1 Why rebuild from scratch on every boundary - -The current window's index is **re-derived from scratch on every chunk boundary** to absorb the chunk that just froze, growing until its window completes. Only the window the tip is in is ever rebuilt; a finalized window's index is static. 
+### 7.1 Rebuild cadence and cost -The rebuild is cheap relative to its cadence: a full-window build is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates (Part 4). That headroom is what lets the index be rebuilt whole from sorted inputs every boundary, rather than updated incrementally: +The current window's index is **rebuilt from scratch on every chunk boundary**, to fold in the chunk that just froze; it grows until the window is complete. Only the current window is ever rebuilt — a finalized window's index never changes. -- There is no incremental-update machinery and no partially-updated index state for a crash to expose. Every `.idx` on disk is a complete, deterministic function of its coverage. -- `lo` tracks the floor and `hi` tracks the tip automatically — no separate floor-driven rebuild is ever needed while the window is current. -- Catch-up and the steady-state tick share the build path identically (one scheduler, one set of postconditions — see §9), so there is no second regime to verify. -- A same-coverage rebuild writes byte-identical output, which collapses crash recovery to "re-mark and rewrite" with no salvage analysis. +This is affordable because the rebuild is cheap relative to its cadence: a full-window build takes ≈1 minute, against a boundary only every ~14 hours at mainnet rates (Part 4). Rebuilding the whole index each time keeps every `.idx` on disk a complete index for its coverage, with no half-updated state. -### 7.2 The build protocol +### 7.2 The rebuild -`buildTxhashIndex(w, lo, hi)` runs the one write protocol with a batch-commit extension. The covered range is explicit: `lo` defaults to the window's first chunk, rising to `chunkID(effectiveRetentionFloor)` when the floor cuts in; `hi` is the highest frozen chunk in the window — the window's last chunk once it's complete, lower while it's still filling. +To rebuild window `w`'s index over coverage `[lo, hi]`: -1. 
**Skip check**: if the window's unique frozen key already covers exactly `[lo, hi]`, return — there is nothing to write, and any leftover transient keys are the sweeps' job, not the builder's. (A frozen key covering the full window is terminal by definition, so the skip also covers re-scheduled builds of finalized windows, which must not demand `.bin` inputs the sweep has deleted.) -2. **Mark**: put `index:{w:08d}:{lo:08d}:{hi:08d}` = `"freezing"` — an idempotent overwrite when a crashed attempt (or a demoted coverage made desired again by a cross-restart regression) left the key behind. The build is **terminal** iff `hi` is the window's last chunk — a derived property, marked nowhere. -3. **Write**: k-way merge the sorted `.bin` files for chunks `[lo, hi]` into streamhash's `SortedBuilder`, writing `{lo:08d}-{hi:08d}.idx` — created or truncated wholesale; a writer only ever holds a file whose key is non-frozen, never one a reader can resolve. Fsync the file and its dir (and the dir's own dirent when this build created the window dir — the first build of a window). -4. **Commit**: one atomic synced batch — this coverage `"freezing"` → `"frozen"`; the window's predecessor frozen coverage (if any) → `"pruning"`; and iff this is the terminal build, every `chunk:{c}:txhash` key in the window → `"pruning"`. This batch is the *entire* finalization protocol — there is no separate cleanup step; the demoted keys become ordinary sweep work. - -Precondition: every chunk in `[lo, hi]` has `chunk:{c}:txhash == "frozen"` (its `.bin` exists). The function fails loudly if violated — checked before any key is touched, which is also the backstop for a build whose input task failed in the executor (the streaming doc's done-channels broadcast completion, not success). +1. **Skip if already done.** If the live index already covers exactly `[lo, hi]`, there is nothing to do. +2. 
**Merge.** Merge the sorted `.bin` files for chunks `[lo, hi]` into a new index file, with streamhash's sorted-builder. (Every chunk in `[lo, hi]` must have a `.bin`; a missing one fails the merge.) +3. **Swap in.** Make the new file the window's live index, replacing the previous one. ```go -func buildTxhashIndex(w WindowID, lo, hi ChunkID, cat Catalog) error { - // Step 1 — skip check. Also covers re-scheduled builds of finalized - // windows, whose .bin inputs the sweeps may already have removed. - if fk := frozenCoverage(cat, w); fk != nil && fk.Lo == lo && fk.Hi == hi { - return nil - } - // Precondition, checked loudly before any write. - for c := lo; c <= hi; c++ { - if cat.State(c, TxHashBin) != Frozen { - return fmt.Errorf("window %d: chunk %d .bin not frozen", w, c) - } - } - - // Step 2 — mark the coverage. Re-marking a crashed attempt's key (or a - // demoted coverage a cross-restart regression made desired again) is an - // idempotent overwrite; the file is rewritten wholesale either way. - terminal := hi == windowLastChunk(w) // derived; no marker anywhere - key := indexKey(w, lo, hi) - cat.Put(key, "freezing") - - // Step 3 — write the coverage's file from scratch (create-or-truncate: - // a crashed attempt's partial is overwritten wholesale, never appended). - f := createTruncate(indexFilePath(key)) - merge := newKWayMerge(binPaths(lo, hi)) // sorted runs → one streaming pass - sb := streamhash.NewSortedBuilder(f, coldTxhashOptions(lo)) // §6.2 options - for merge.Next() { - sb.Add(merge.Entry()) - } - sb.Finish() - fsyncFile(f) - fsyncDir(indexWindowDir(w)) // + fsyncParentDir on the window dir's first build - - // Step 4 — commit: ONE atomic synced batch, the entire finalization - // protocol. Note: no file is unlinked here, ever — the batch only - // DEMOTES keys, and deletion is exclusively the sweeps' job (§7.4): - // eagerly by buildThenSweep right after this batch, in both regimes; - // the tick's prune scan is the crash backstop. 
- batch := cat.NewBatch() - batch.Put(key, "frozen") - if prev := frozenCoverage(cat, w); prev != nil { - batch.Put(prev.Key, "pruning") // supersede the predecessor — a distinct - // key: the skip check returned if the - // frozen coverage equaled [lo, hi] - } - if terminal { // demote every input key in the window - for c := windowFirstChunk(w); c <= hi; c++ { - if cat.Has(chunkKey(c, TxHashBin)) { - batch.Put(chunkKey(c, TxHashBin), "pruning") - } - } - } - batch.Commit() - return nil +// rebuild window w's index over [lo, hi] +sb := streamhash.NewSortedBuilder(newIndexFile, sortedBuilderOpts) +for entry := range kWayMerge(binFiles(lo, hi)) { // sorted .bin files → one stream + sb.Add(entry) } +sb.Finish() +// then make newIndexFile the window's live index, replacing the old one ``` -### 7.3 Finalization - -A window finalizes when its terminal build's commit batch lands — the same write that freezes the full-coverage `.idx` demotes every `chunk:{c}:txhash` key in the window to `"pruning"`. Finalization is therefore never a separate step, for either caller: catch-up calls the same function for every window its range overlaps (a complete window's full desired range is a terminal build; the trailing window's producible range is non-terminal), and the boundary tick's window-end rebuild is terminal by arithmetic. - -After finalization the `.idx` is static. If the floor later advances *within* the finalized window, the file becomes stale (`lo` references chunks pruning has since removed) — deliberately tolerated: index keys are swept only when their window falls *wholly* past the floor, and the read path handles the straddling case (§8.4). A window-straddling floor exists at most once at any moment — the window containing the effective retention floor. - -### 7.4 Sweeps and disk bounds - -The commit batch only demotes keys; all file deletion happens through the streaming doc's key-driven sweeps (unlink → `fsyncDir` → delete key — key absent ⟹ file gone). 
Two call sites: - -- **Eagerly, inside every `IndexBuild`'s execution** (`buildThenSweep`, right after the commit batch, in both regimes): sweep the window's superseded coverage and, after a terminal build, its demoted `.bin` inputs. The sweep is **window-local** — it walks only this window's keys, so concurrent windows' sweeps touch disjoint keys and files. -- **The tick's prune scan** — the crash backstop, and the owner of retention pruning. A `"freezing"` index key it observes was *not* retried (builds run before the sweep in every regime), so its coverage is no longer desired: **delete file and key, never salvage**. The file might even be complete, but proving that buys nothing — a rebuild re-derives identical bytes — so one no-questions rule covers every crashed attempt. - -The eager site is what bounds disk. Without it, a long backfill would accumulate every finalized window's demoted `.bin`s until the first tick (≈20 bytes per transaction across all of history); with it, transient `.bin` disk is bounded by the windows actually in flight — the floor is one dense window's worth (≈60 GB), irreducible because a window's build merges all of its runs at once. +Because a rebuild writes a whole new file and only swaps it in at the end, the live index is never partially updated: a lookup sees either the old index or the new one, never something in between. -**Provisioning note**: the old and new coverage files coexist from the start of a rebuild's write until the eager sweep's unlink, so the window dir transiently holds ~2× the index size (~25 GB at the end of a dense full window), and the window-end rebuild writes ~12.5 GB in ~1 minute (~200 MB/s burst) — trivial on instance NVMe, but worth provisioning for on throughput-capped volumes like EBS gp3. - -### 7.5 Why rewriting coverage-named files in place is safe - -The hazard: a reader holds the live `.idx` open while the next coverage is written into the same window directory. 
Four facts make the in-place rewrite safe anyway: - -1. **The skip rule.** A build's target name equals the live file's name only when its coverage equals the frozen coverage — which is exactly the case the skip check returns on. So no scheduled build ever opens the file readers resolve. -2. **Stage ordering and sweep scope.** No sweep runs where a build could collide with it: the `"freezing"`-key sweep — the only sweep that can touch a name a future build may target — lives solely in the tick's prune scan, which follows the plan stage (and catch-up precedes the lifecycle goroutine entirely); the eager sweep inside `buildThenSweep` touches only `"pruning"` keys in its own window, strictly after that window's commit; and a plan holds at most one `IndexBuild` per window — so concurrent windows' sweeps and builds touch disjoint keys and files. -3. **Floor monotonicity and reader lifetime.** Desired coverage is monotone within a run, and reader file handles die with the process — so a name any reader has resolved is never rewritten while held. -4. **Determinism.** A same-coverage rebuild writes identical bytes regardless — the merge is a deterministic function of the coverage — leaving a partial file under a `"freezing"` key as the only hazardous state, and no reader resolves a non-frozen key. +### 7.3 Finalization -A change to any of the four — the skip rule, the stage ordering or the eager sweep's window-locality and pruning-only scope, the floor's monotonicity, or reader lifetime — must re-prove this argument. +When a window's last chunk is folded in, its index is final: it covers the whole window and is not rebuilt again — unless retention later widens to include older chunks, when it is rebuilt wider to cover them. The window's `.bin` files have done their job as rebuild inputs, and are deleted. 
-### 7.6 Crash matrix +### 7.4 Disk use during a rebuild -| Crash point | Durable state left | Convergence | -|---|---|---| -| after step 2, or mid step 3 | predecessor coverage still `"frozen"` (readers unaffected); new key `"freezing"`, file absent/partial/complete | next build of the coverage re-marks and rewrites wholesale; if the coverage is no longer desired, the prune scan deletes file + key unread | -| after step 4, before the eager sweep | new coverage `"frozen"` and live; predecessor `"pruning"`; terminal: window's `.bin` keys `"pruning"` | the sweeps finish — eager on the next build, prune scan otherwise | -| mid-sweep | `"pruning"` key outlives the durable unlink | the sweep re-runs; key absent ⟹ file gone | +A rebuild writes a whole new index file before the old one is removed, so a window directory briefly holds ~2× the index size (~25 GB at the end of a dense window). The window's `.bin` files are also all on disk together, since the rebuild merges them at once — about 60 GB for a dense window. Both are transient. -At no crash instant are two coverages frozen, or none (once the window has one), or a `"frozen"` `chunk:{c}:txhash` key whose `.bin` has been deleted — the commit batch only ever demotes keys, and files are touched exclusively by the sweeps, under non-frozen keys. This is what lets the catch-up resolver (§9) trust `"frozen"` blindly. +The window-end rebuild writes ~12.5 GB in ~1 minute (~200 MB/s burst) — trivial on instance NVMe, but worth provisioning for on throughput-capped volumes like EBS gp3. ## 8. 
Query path @@ -279,14 +184,14 @@ A hash names no ledger, so the reader cannot know which home holds it in advance | cold — one `.idx` per window | **every in-retention window** | MPHF + fingerprint + verify (§8.2) | | hot — `txhash` CF per chunk | the chunks above any window's `hi` (live, or frozen awaiting coverage) | exact full-key get (§8.3) | -The hot tier is a few chunks at most — one window's tail, normally just the live chunk — so the probe set is `≈ (in-retention windows) + (a handful of chunks)`. How the reader learns current coverage and stays consistent across rebuilds is the query-routing design's concern; this document requires only that the homes' union covers the retention window — guaranteed by the discard gate (§5.3) and the uniqueness invariant (§6.3) — and that each ledger has exactly one home, so **at most one probe confirms** — the verify runs on every fingerprint hit but succeeds for at most one. +The hot tier is a few chunks at most — one window's tail, normally just the live chunk — so the probe set is `≈ (in-retention windows) + (a handful of chunks)`. How the reader learns current coverage and stays consistent across rebuilds is the query-routing design's concern. This document requires only two things: that the two tiers together cover the whole retention window (the gap-free hot→cold handoff, §5.3), and that each transaction lives in exactly one of them. So **at most one probe confirms**: the verify runs on every fingerprint hit but succeeds for at most one. ### 8.2 Cold lookup -The cold tier **probes every in-retention window's `.idx`** — a hash gives no window hint (the window is `chunkID(seq) / chunks_per_txhash_index`, and `seq` is exactly what the lookup is trying to find), so there is nothing to pre-select. Each window probe: +The cold tier **probes every in-retention window's `.idx`**. 
A hash gives no hint about which window it's in — to know the window you'd compute `chunkID(seq) / chunks_per_txhash_index`, and `seq` is the very thing the lookup is trying to find. So there is nothing to pre-select, and each window is probed in turn: ``` -for each in-retention window (its unique "frozen" key → {lo}-{hi}.idx): +for each in-retention window (its live index → {lo}-{hi}.idx): → MPHF probe on the hash's 16-byte prefix → fingerprint check (fpWidth bytes) — miss ⇒ skip this window → on a fingerprint hit: @@ -299,47 +204,25 @@ respond on the confirmed hit; not-found if no window confirms Because the hash belongs to at most one window, **at most one window confirms**; a not-found lookup — a non-existent or not-yet-ingested hash — confirms none and must rule out every in-retention window. -The final verification is **mandatory, not defensive**: a minimal perfect hash maps *any* probe key to some slot, so a hash that is not in the set resolves to an arbitrary entry — the fingerprint screens most foreign keys, and the fetch-and-verify rejects the remainder. +The final verification is essential: a minimal perfect hash returns a slot for *any* input, including a hash it doesn't contain, so every hit must be confirmed. The fingerprint screens out most foreign hashes cheaply, and the fetch-and-verify rejects the rest. A **16-byte prefix collision between two distinct in-retention transactions** has two cases, and only one bounds completeness. The cold index keys on streamhash's 128-bit routing key (§6.1), so two hashes sharing their first 16 bytes are indistinguishable *to a single window's build*. -*Different windows* — the more likely of the two, since a shared prefix is far more apt to straddle two of history's windows than to fall inside one. Each transaction keys into its own window's `.idx`, so neither build sees a duplicate and both transactions resolve normally. 
The collision shows up only as a fingerprint false-positive when a lookup probes the *other* window: that window's MPHF maps the shared prefix to its own resident transaction, the fingerprint (also derived from those 16 bytes) matches, and the fetch-and-verify above rejects it because the full 32-byte hashes differ. This is exactly the foreign-key path the verify already exists for — one wasted ledger fetch, no wrong answer and no false negative. +*Different windows* — the more likely of the two, since a shared prefix is far more apt to straddle two of history's windows than to fall inside one. Each transaction keys into its own window's `.idx`, so neither build sees a duplicate and both resolve normally. The collision shows up only as a fingerprint false-positive when a lookup probes the *other* window. That window's MPHF maps the shared prefix to its own resident transaction, and the fingerprint (also derived from those 16 bytes) matches — but the fetch-and-verify rejects it, because the full 32-byte hashes differ. This is exactly the foreign-key path the verify already exists for: one wasted ledger fetch, no wrong answer and no false negative. -*Same window* — the genuine residual. The two are a single key to that window's builder, so streamhash rejects the duplicate at build time (`ErrDuplicateKey`) and the window's build fails **loudly and deterministically** — never a silently dropped transaction (the false not-found a keep-one-payload index would produce), and never, thanks to the verify, a wrong transaction. This same-window case is the sole bound on completeness; the birthday bound over a dense window's ~3×10⁹ keys against 2¹²⁸ puts it at ~10⁻²⁰ per window — a cryptographic-scale probability accepted as negligible, comparable to the undetected storage bit-errors the design likewise does not defend against. 
The design carries no per-key full hash or collision list to drive it to zero, because at 10⁻²⁰ that machinery would cost far more than the risk it removes. +*Same window* — the genuine residual. The two are a single key to that window's builder, so streamhash rejects the duplicate at build time (`ErrDuplicateKey`) and the build fails **loudly**: it never silently drops a transaction, and the verify ensures it never returns a wrong one. This is the only bound on completeness, and it is tiny — the birthday probability over a dense window's ~3×10⁹ keys against 2¹²⁸ is ~10⁻²⁰ per window, a cryptographic-scale risk accepted as negligible. **Probe ordering, parallelism, early-stop, and the resulting latency and I/O are the query-routing design's concern** (§8.1), out of scope here. ### 8.3 Hot lookup -Chunks above `hi` are probed in their hot DBs' `txhash` CF — an exact full-key point get, so misses are genuine misses with no verification subtleties (the fetch-and-verify still runs, as the response needs the transaction anyway). In steady state this tier is the live chunk plus, briefly, the chunk inside the freeze-to-coverage interval; after catch-up or a crash it can be several chunks, shrinking as rebuilds advance `hi`. - -### 8.4 Reads and concurrent pruning - -The cold-tier lifecycle unlinks files concurrently with in-flight reads — sweeps remove superseded `.idx` coverages (§7.4), and retention pruning removes a window's `.idx` and its chunks' `.pack`s. What makes that safe to do **unilaterally** is the streaming doc's reader retention contract: a read for any seq below `effectiveRetentionFloor` is not-found regardless of what is still on disk. How a read stays correct across these transitions otherwise — tier dispatch, coverage re-resolution, a file that vanishes mid-read — is the **query-routing design's** concern (§8.1), out of scope here. - -## 9. 
Catch-up and recovery interaction - -The cold tier converges from any state through the streaming doc's postcondition-driven resolver; the tx-hash kind contributes the one **per-window rule** (every other kind is per-chunk): - -For each window overlapping the catch-up range, compare the **stored** coverage — `{lo, hi}` from the name of the window's unique frozen index key — with the **desired** coverage `[max(window_start, chunkID(floor)), min(window_last_chunk, range_end)]`. The upper cap is what makes the rule uniform: for a complete window it is the window's last chunk, for the trailing window it is the range end, and no special trailing case exists. - -- **Desired ⊆ stored** → schedule *nothing* for this window: no `.bin` production, no build. Three states land here: every steady-state restart; a floor that *rose* (the stale stored `lo` is the read path's problem, §8.4 — never a rebuild trigger); and a finalized window the range ends inside. -- **Desired exceeds stored** (`desired_lo < stored_lo`, or `desired_hi > stored_hi`, or no frozen key exists) → request `.bin` production for **every** chunk in the desired range — chunks whose `.bin` is already frozen self-skip inside `processChunk`; chunks the old `.idx` covered re-derive from their local `.pack` files with no bulk-backend download — and emit one `buildTxhashIndex(w, desired_lo, desired_hi)`, terminal iff `desired_hi` is the window's last chunk. - -Two clauses are load-bearing: - -- **The `stored_hi` clause** catches downtime that crosses a window boundary: a window that was *current* at shutdown carries a frozen key with `hi <` its last chunk, and classifying by `lo` alone would see a frozen key and strand chunks `(hi, last_chunk]` permanently. 
-- **`"frozen"` is blindly trustable**: a `"frozen"` `chunk:{c}:txhash` key implies its `.bin` exists, unconditionally — input demotions ride the same synced write that freezes the terminal coverage, and files are deleted only by sweeps under non-frozen keys (§7.6). The resolver classifies on frozen state only; transient keys are invisible to it and are the sweeps' job. - -**Retention widening** re-derives a finalized window at its new, wider coverage: `.bin`s for previously-covered chunks come from local `.pack`s; fully-pruned chunks refetch from the bulk source; the rebuild is terminal at the wider `[lo', last_chunk]` and its commit batch demotes the old coverage. This runs at the next startup — extending the bottom of storage is exclusively catch-up's job, behind backend validation — never in a tick. One corner needs help from outside the resolver: a widening that re-froze (or left mid-write) a finalized window's `.bin` keys and was then abandoned by narrowing retention back — the resolver correctly schedules nothing (desired ⊆ stored), so the tick's prune scan demotes and sweeps those provably-redundant inputs (`"frozen"` and `"freezing"` alike) — the final `.idx` covers their chunks, and the resolver never re-materializes a covered window. - -In the executor, an `IndexBuild` waits on the done-channels of the chunk builds inside its coverage and draws from the same worker pool — the `.bin`s are map-side-sorted runs, the build is the per-window reduce. Done-channels broadcast completion, not success; the build's loud precondition (§7.2) is the backstop. +Chunks above `hi` are probed in their hot DBs' `txhash` column family — an exact, full-key point get. A miss here is a real miss, with none of the cold tier's verification subtleties (the fetch-and-verify still runs, since the response needs the transaction anyway). In steady state this tier is just the live chunk, plus briefly the one chunk in the freeze-to-coverage gap. 
After catch-up or a crash it can be several chunks, shrinking as rebuilds advance `hi`. --- # Part 4: Capacity & Performance -## 10. Storage footprint +## 9. Storage footprint Per dense chunk (~3M transactions) and dense window (default 1000 chunks, ~3×10⁹ transactions): @@ -349,13 +232,13 @@ Per dense chunk (~3M transactions) and dense window (default 1000 chunks, ~3×10 | `.bin` sorted run | 20 B/tx exactly | ~60 MB | ~60 GB | chunk freeze → window finalization | | `.idx` | ≈4.2 B/tx (3-byte payload) | — (per-window) | ~12.5 GB | build → superseded next boundary, or retention | -Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` floor is one dense in-flight window (~60 GB), bounded by the eager sweep (§7.4). Steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history (at the default window; +1 B/tx past the 4-byte payload threshold). +Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` files for the in-flight window total ~60 GB. Both are transient (§7.4). The steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history (at the default window; +1 B/tx past the 4-byte payload threshold). -## 11. Performance +## 10. Performance -- **Ingest, hot**: one `(hash, seq)` put per transaction inside the existing per-ledger WriteBatch — no separate sync, no separate store. +- **Ingest, hot**: one `(hash, seq)` put per transaction, inside the ledger's existing write. - **Ingest, cold**: the in-memory sort of ~3M entries is negligible against the chunk's streaming pass; the `.bin` write is sequential. -- **Rebuild**: a full dense window merges ~60 GB of sorted runs into a ~12.5 GB `.idx` in ≈1 minute (~200 MB/s write burst) — measured by the bench harness (`bench-fullhistory`: `cold-ingest --types=txhash` + `build-txhash-index`). 
Mid-window rebuilds scale with `hi − lo`. Against a ~14-hour boundary cadence at mainnet rates, the rebuild is ~0.1% duty cycle. +- **Rebuild**: a full dense window merges ~60 GB of sorted `.bin` files into a ~12.5 GB `.idx` in ≈1 minute (~200 MB/s write burst), measured in the `bench-fullhistory` harness. Mid-window rebuilds scale with `hi − lo`. Against a ~14-hour boundary cadence at mainnet rates, the rebuild is a ~0.1% duty cycle. - **Lookup, cold**: one MPHF probe per in-retention window — fingerprint screen, then fetch-and-verify on a hit. The hash is in at most one window, so at most one fetch confirms; fingerprint false positives (bounded by `fpWidth`, §6.2) are rejected by the full-hash verify. Probe ordering, parallelism, and the resulting latency/throughput are the query-routing design's concern (§8.1). - **Lookup, hot**: one RocksDB point get in a bloom-filtered CF, then the same ledger fetch. @@ -363,7 +246,7 @@ Transient peaks: ~2× the index size in the window dir during each rebuild (~25 ## Related documents -- [full-history-streaming-workflow.md](./full-history-streaming-workflow.md) — the daemon this subsystem lives in: geometry, the meta store and one write protocol, `processChunk`, the resolver and executor, the lifecycle tick (freeze → rebuild → discard → prune), and the correctness invariants (INV-1 … INV-4) with their audits. +- [full-history-streaming-workflow.md](./full-history-streaming-workflow.md) — the daemon this subsystem lives in: geometry, the catalog and one write protocol, `processChunk`, the resolver and executor, the lifecycle run (freeze → rebuild → discard → prune), and the correctness invariants (INV-1 … INV-4) with their audits. - The reader / query-routing design — how readers obtain current coverage and dispatch between hot DBs and frozen files across transitions. 
- [getevents-full-history-design.md](./getevents-full-history-design.md) — the sibling subsystem (events), same hot/cold architecture over the same chunk geometry. - [packfile-library.md](./packfile-library.md) — the `.pack` format the read path's ledger fetch lands on. From 21acdb03b7426f0cecad7e43ebffb18b0cf10045 Mon Sep 17 00:00:00 2001 From: tamirms Date: Thu, 25 Jun 2026 23:17:05 +0200 Subject: [PATCH 16/18] docs(full-history): fix chunks_per_txhash_index at 1000 (not configurable) The tx-hash window is now a hardcoded 1,000-chunk (10M-ledger) constant rather than a config key. Drops the [backfill] config row, the config:chunks_per_txhash_index catalog pin and its validateConfig mismatch path, the MaxChunksPerTxhashIndex cap, and the variable payload-width machinery (payload is a flat 3 bytes). Mirrors the change in the companion HTML explorer. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01GtG9MzaQ7h3LqvhQCkWhrg --- design-docs/full-history-design-explorer.html | 15 ++++--- .../full-history-streaming-workflow.md | 39 ++++++------------- .../gettransaction-full-history-design.md | 26 ++++++------- 3 files changed, 32 insertions(+), 48 deletions(-) diff --git a/design-docs/full-history-design-explorer.html b/design-docs/full-history-design-explorer.html index b0dd35f1c..c008f80a3 100644 --- a/design-docs/full-history-design-explorer.html +++ b/design-docs/full-history-design-explorer.html @@ -394,8 +394,8 @@

Geometry

  • Chunk — 10,000 ledgers (hardcoded). The atomic unit of ingestion, freezing, and crash recovery.
  • -
  • Window — chunks_per_txhash_index chunks (default 1000 = 10M ledgers). The unit of - the rolling tx-hash index. Configurable, but immutable once stored.
  • +
  • Window — 1,000 chunks = 10,000,000 ledgers (hardcoded). The unit of + the rolling tx-hash index.
@@ -426,10 +426,9 @@

Geometry

- With the default chunks_per_txhash_index = 1000, the file-bucket size (fixed at 1000 - chunks) and the window size coincide numerically — but they are different concepts: buckets are purely a - filesystem concern and never appear in catalog keys; windows define the tx-hash index layout and are - pinned in the catalog forever. + The file-bucket size (fixed at 1000 chunks) and the window size (1,000 chunks) coincide + numerically — but they are different concepts: buckets are purely a filesystem concern and never + appear in catalog keys; windows define the tx-hash index layout.
@@ -526,7 +525,7 @@

Catalog keys

  chunk:{c}:events | Per-chunk events cold segment state.
  index:{w}:{lo}:{hi} | One key per index coverage. The key name carries the coverage and maps 1:1 to the file {lo}-{hi}.idx; the value is pure lifecycle state. At most one coverage per window is "frozen" at any moment.
  hot:chunk:{c} | "ready" = dir exists and is usable; "transient" = a directory operation (create or delete) is in flight — the recovery is the same either way, which is why one value suffices.
- config:earliest_ledger, config:chunks_per_txhash_index | Written on first start, immutable thereafter (startup aborts on mismatch).
+ config:earliest_ledger | Written on first start, immutable thereafter (startup aborts on mismatch).
Key names carry identity; values carry only lifecycle. An index key's filename is derived from @@ -1649,7 +1648,7 @@

Why convergence works

nodes:[ { c:"dim", ic:"·", t:"window 7 .idx: MPHF probe → fingerprint miss → skip", s:"non-containing window rejected with no fetch" }, { c:"dim", ic:"·", t:"window 6 .idx: fingerprint miss → skip", s:"" }, - { c:"ok", ic:"✓", t:"window 5 .idx: fingerprint hit", s:"seq = MinLedger + payload (payloadWidth bytes)" }, + { c:"ok", ic:"✓", t:"window 5 .idx: fingerprint hit", s:"seq = MinLedger + payload (3 bytes)" }, { c:"ok", ic:"✓", t:"retention gate: seq ≥ floor → admitted", s:"" }, { c:"ok", ic:"✓", t:"fetch the LCM, verify the full 32-byte hash → confirms → return", s:"" }, ], diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index 8c7cd7157..c3b33b9fd 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -19,18 +19,18 @@ The daemon does three things: The Stellar blockchain starts at ledger 2 (`GENESIS_LEDGER`). Two units organize all storage; everything in this doc is described in terms of them: - **Chunk** — a run of 10,000 ledgers (hardcoded); the atomic unit of ingestion, freezing, and crash recovery. A hot DB holds at most one chunk, and each cold file — ledgers, events, transactions — spans exactly one chunk. -- **Window** — a run of chunks (`chunks_per_txhash_index`, default 1000 = 10M ledgers); the unit of the rolling tx-hash index. The index is the one exception to the per-chunk rule: it maps transaction hashes to ledger sequences across a whole window. Configurable, but immutable once stored. +- **Window** — 1,000 chunks (10M ledgers); the unit of the rolling tx-hash index. The index is the one exception to the per-chunk rule: it maps transaction hashes to ledger sequences across a whole window. 
``` chunkID(seq) = floor((seq - 2) / 10_000) chunkFirstLedger(c) = c * 10_000 + 2 chunkLastLedger(c) = (c + 1) * 10_000 + 1 -indexID(c) = c / chunks_per_txhash_index # takes a CHUNK id +indexID(c) = c / 1000 # takes a CHUNK id ``` Chunk ids are **signed**, because `chunkID` uses floor division. The only id below 0 is **chunk −1**, meaning "before the first chunk." It comes up in one place: the "nothing ingested yet" sentinel `earliest_ledger - 1`, which maps to chunk −1 (and `chunkLastLedger(-1) = 1` maps back). Chunk −1 only ever appears in startup arithmetic; every chunk id written to disk is `≥ 0`. -All chunk and window ids use uniform `%08d` zero-padding. Example, default `chunks_per_txhash_index = 1000`: +All chunk and window ids use uniform `%08d` zero-padding. Example (window = 1,000 chunks): | Window | First ledger | Last ledger | Chunks | |---|---|---|---| @@ -54,7 +54,6 @@ One TOML file (`--config`) configures the daemon. | Key | Type | Default | Description | |---|---|---|---| -| `chunks_per_txhash_index` | uint32 | `1000` | Chunks per tx-hash window. Defines data layout — immutable once stored (startup aborts on mismatch; see `validateConfig`). | | `workers` | int | `GOMAXPROCS` | Concurrent task slots for backfill. | | `max_retries` | int | `3` | Retries per backfill task before the daemon aborts. | @@ -149,7 +148,7 @@ CFs share the instance's WAL, so each ledger commits as **one atomic WriteBatch ### Catalog keys -The catalog holds three groups of keys: per-chunk artifact state keys, hot DB state keys, and config pins. +The catalog holds three groups of keys: per-chunk artifact state keys, hot DB state keys, and the config pin. **Artifact state keys**: @@ -170,12 +169,11 @@ For the per-chunk keys, `"freezing"` means the immutable file is being written; `"ready"` means the RocksDB dir exists and is usable. 
`"transient"` brackets a directory operation in flight — creation or deletion; no code path ever needs to know which, since the recovery is the same either way (the open path wipes and recreates; the discard scan re-runs). A crash mid-operation is detectable from the key value alone. One key per chunk; the column families inside the DB carry no individual catalog state. -**Config pins:** +**Config pin:** | Key | Value | Written when | |---|---|---| | `config:earliest_ledger` | `uint32` (decimal string, chunk-aligned) | On the first daemon start. Immutable thereafter — changing it currently requires wiping the data directory, until a `set-earliest-ledger` admin command exists (see [Configuration](#configuration); the floor machinery already converges for either direction). | -| `config:chunks_per_txhash_index` | `uint32` (decimal string) | On the first daemon start; immutable thereafter. Startup aborts if the config value doesn't match. | **Resume point.** Recomputed at startup from the durable keys plus a read of the live hot DB (see [Startup](#startup)). @@ -460,10 +458,9 @@ The retention floor and resume point are computed by: ```go const ( - GenesisLedger = 2 - LedgersPerChunk = 10_000 - // MaxChunksPerTxhashIndex caps the window so its ledger span fits a 4-byte offset. 
- MaxChunksPerTxhashIndex = 429_496 // floor(2^32 / LedgersPerChunk) + GenesisLedger = 2 + LedgersPerChunk = 10_000 + ChunksPerTxhashIndex = 1_000 // window = 10M ledgers ) // retentionFloorChunk: the lowest chunk kept — retentionChunks back from @@ -569,14 +566,11 @@ func startStreaming(ctx context.Context, cfg Config) error { } ``` -`validateConfig` checks the config and, on the first start, resolves and pins `earliest_ledger` and `chunks_per_txhash_index`: +`validateConfig` checks the config and, on the first start, resolves and pins `earliest_ledger`: ```go func validateConfig(cfg Config) { cat := cfg.Catalog - if cfg.ChunksPerTxhashIndex == 0 || cfg.ChunksPerTxhashIndex > MaxChunksPerTxhashIndex { - fatalf("chunks_per_txhash_index must be in [1, %d]", MaxChunksPerTxhashIndex) - } if cfg.Workers < 1 { fatalf("workers must be > 0 (got %d)", cfg.Workers) } @@ -591,15 +585,9 @@ func validateConfig(cfg Config) { } } - // Both pins are committed together on first start, so either both exist (a - // restart — the layout is immutable) or neither does (re-pinning is safe). - cpiStored, cpiPinned := cat.Get("config:chunks_per_txhash_index") earliestStored, earliestPinned := cat.Get("config:earliest_ledger") - if cpiPinned && earliestPinned { // restart: confirm nothing changed, write nothing - if cpiStored != itoa(cfg.ChunksPerTxhashIndex) { - fatalf("chunks_per_txhash_index changed: stored=%s, config=%d", cpiStored, cfg.ChunksPerTxhashIndex) - } + if earliestPinned { // restart: confirm nothing changed, write nothing if cfg.EarliestLedger != "now" { // "now" on restart keeps the pinned floor want := uint32(GenesisLedger) if cfg.EarliestLedger != "genesis" { @@ -613,7 +601,7 @@ func validateConfig(cfg Config) { return } - // First start: resolve earliest_ledger, then pin both. "now" and a numeric + // First start: resolve earliest_ledger, then pin it. 
"now" and a numeric // floor each need a reachable backend — "now" to resolve, a numeric floor to // reject one past the tip (it is pinned immutably, so it can't be checked later). var earliest uint32 @@ -636,10 +624,7 @@ func validateConfig(cfg Config) { fatalf("earliest_ledger (%d) is past the network tip (%d)", earliest, tip) } } - batch := cat.NewBatch() - batch.Put("config:chunks_per_txhash_index", itoa(cfg.ChunksPerTxhashIndex)) - batch.Put("config:earliest_ledger", itoa(earliest)) - batch.Commit() + cat.Put("config:earliest_ledger", itoa(earliest)) } ``` diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md index d3a71eb01..a54179be5 100644 --- a/design-docs/gettransaction-full-history-design.md +++ b/design-docs/gettransaction-full-history-design.md @@ -56,17 +56,17 @@ The two tiers hand off with no gap. A chunk's hot table is dropped only *after* Two units organize the map. Every structure below is named by them: - **Chunk** — 10,000 ledgers (hardcoded). The unit of the hot DB and of the sorted runs. -- **Window** — `chunks_per_txhash_index` chunks (default 1000 = 10M ledgers). The unit of the cold index. Configurable, but pinned in the catalog on first start and immutable thereafter. +- **Window** — 1,000 chunks = 10,000,000 ledgers (hardcoded). The unit of the cold index. ``` chunkID(seq) = (seq - 2) / 10_000 chunkFirstLedger(c) = c * 10_000 + 2 chunkLastLedger(c) = (c + 1) * 10_000 + 1 -indexID(c) = c / chunks_per_txhash_index # takes a CHUNK id -chunksInIndex(w) = [w*cpi, (w+1)*cpi - 1] # cpi = chunks_per_txhash_index +indexID(c) = c / 1000 # takes a CHUNK id +chunksInIndex(w) = [w*1000, (w+1)*1000 - 1] ``` -With the default `chunks_per_txhash_index = 1000`: window 0 spans ledgers 2–10,000,001 (chunks 0–999), window N spans N×10M+2 – (N+1)×10M+1 (chunks N×1000 – (N+1)×1000−1). All ids zero-pad `%08d`. 
+Window 0 spans ledgers 2–10,000,001 (chunks 0–999), window N spans N×10M+2 – (N+1)×10M+1 (chunks N×1000 – (N+1)×1000−1). All ids zero-pad `%08d`. --- @@ -115,12 +115,12 @@ A `.bin` is kept for as long as its window is still being rebuilt — every rebu ### 6.2 The per-window index: `.idx` -The `.idx` lives at `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, tracked by the catalog key `index:{window:08d}:{lo:08d}:{hi:08d}`. There is one minimal-perfect-hash file per **coverage** — a coverage being the chunk range `[lo, hi]` the file actually hashes. Streamhash's `SortedBuilder` builds it from the k-way merge of `.bin[lo..hi]`. Two fields are sized per build: +The `.idx` lives at `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, tracked by the catalog key `index:{window:08d}:{lo:08d}:{hi:08d}`. There is one minimal-perfect-hash file per **coverage** — a coverage being the chunk range `[lo, hi]` the file actually hashes. Streamhash's `SortedBuilder` builds it from the k-way merge of `.bin[lo..hi]`. The index carries two per-entry fields: -- **Payload (`payloadWidth` bytes): the answer the hash maps to — a ledger seq.** It is stored as an offset from the window's first ledger (`MinLedger = chunkFirstLedger(lo)`) rather than as a full seq, to save bytes. The width is sized to the window, so the format never limits how big a window can be: `payloadWidth = ceil(log2(chunks_per_txhash_index * 10_000) / 8)` — just enough bytes for the largest offset a window can produce (`chunks_per_txhash_index * 10_000 - 1`). At the default 1000 chunks (10M ledgers) that is **3 bytes**, a 24-bit offset spanning 16.77M ledgers; a window of 1678+ chunks (>16.77M ledgers) bumps it to 4. It never needs more than 4: `chunks_per_txhash_index` is capped at `MaxChunksPerTxhashIndex` ≈ 429,496 (`floor(2³²/10_000)`), so a window's ledger span always fits in a uint32. Because `chunks_per_txhash_index` is fixed once stored, the width is fixed for every window's life. 
Streamhash writes the width into the index file's header; `MinLedger`, which streamhash does not model itself, rides in the file's user-metadata slot. Both are read back at lookup time, so there is no separate sidecar file. -- **Fingerprint (`fpWidth` bytes, default 1): a few bytes per entry to screen out wrong hashes** before the expensive fetch-and-verify. Because a lookup probes every in-retention window (§8.2), a wider fingerprint is a trade-off: it costs index size (+1 byte per transaction) but cuts the number of false-positive fetches across those windows. Fixed per build, like the payload width. +- **Payload (3 bytes): the answer the hash maps to — a ledger seq.** It is stored as an offset from the window's first ledger (`MinLedger = chunkFirstLedger(lo)`) rather than as a full seq, to save bytes. A window spans 10,000,000 ledgers, so the largest offset (`10_000_000 - 1`) fits in a 24-bit field. Streamhash writes the payload width into the index file's header; `MinLedger`, which streamhash does not model itself, rides in the file's user-metadata slot. Both are read back at lookup time, so there is no separate sidecar file. +- **Fingerprint (`fpWidth` bytes, default 1): a few bytes per entry to screen out wrong hashes** before the expensive fetch-and-verify. Because a lookup probes every in-retention window (§8.2), a wider fingerprint is a trade-off: it costs index size (+1 byte per transaction) but cuts the number of false-positive fetches across those windows. Fixed per build. -All-in, at the default 3-byte payload the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. A window past the 4-byte payload threshold adds one byte per transaction. +All-in, the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. 
### 6.3 Coverage and the live index @@ -133,7 +133,7 @@ A window has exactly **one live index** at a time, and a lookup resolves "the wi So the index hashes exactly the transactions in chunks `[lo, hi]`. Chunks below `lo` are out of scope — cut off by the floor. Chunks above `hi` aren't folded in yet, and are served from their hot tables until the next rebuild advances `hi`. -**Example** (default 1000 chunks per window): the tip is in chunk 5350, so window 5 (chunks 5000–5999) is the current window, and the floor is at chunk 5100. The live index covers chunks 5100–5349, in the file `txhash/index/00000005/00005100-00005349.idx`; chunk 5350 is still in its hot table, and chunks 5000–5099 are below the floor. At the next boundary the index is rebuilt to cover 5100–5350, and the old file is deleted. +**Example** (1,000 chunks per window): the tip is in chunk 5350, so window 5 (chunks 5000–5999) is the current window, and the floor is at chunk 5100. The live index covers chunks 5100–5349, in the file `txhash/index/00000005/00005100-00005349.idx`; chunk 5350 is still in its hot table, and chunks 5000–5099 are below the floor. At the next boundary the index is rebuilt to cover 5100–5350, and the old file is deleted. ## 7. The rolling rebuild @@ -188,14 +188,14 @@ The hot tier is a few chunks at most — one window's tail, normally just the li ### 8.2 Cold lookup -The cold tier **probes every in-retention window's `.idx`**. A hash gives no hint about which window it's in — to know the window you'd compute `chunkID(seq) / chunks_per_txhash_index`, and `seq` is the very thing the lookup is trying to find. So there is nothing to pre-select, and each window is probed in turn: +The cold tier **probes every in-retention window's `.idx`**. A hash gives no hint about which window it's in — to know the window you'd compute `chunkID(seq) / 1000`, and `seq` is the very thing the lookup is trying to find. 
So there is nothing to pre-select, and each window is probed in turn: ``` for each in-retention window (its live index → {lo}-{hi}.idx): → MPHF probe on the hash's 16-byte prefix → fingerprint check (fpWidth bytes) — miss ⇒ skip this window → on a fingerprint hit: - seq = MinLedger + payload (payloadWidth bytes) + seq = MinLedger + payload (3 bytes) retention gate: seq ≥ floor? — else skip this window fetch the LCM for seq, extract the tx verify the full 32-byte hash — confirms, or rejects a false positive @@ -224,7 +224,7 @@ Chunks above `hi` are probed in their hot DBs' `txhash` column family — an exa ## 9. Storage footprint -Per dense chunk (~3M transactions) and dense window (default 1000 chunks, ~3×10⁹ transactions): +Per dense chunk (~3M transactions) and dense window (1,000 chunks, ~3×10⁹ transactions): | Structure | Unit cost | Dense chunk | Dense window | Lifetime | |---|---|---|---|---| @@ -232,7 +232,7 @@ Per dense chunk (~3M transactions) and dense window (default 1000 chunks, ~3×10 | `.bin` sorted run | 20 B/tx exactly | ~60 MB | ~60 GB | chunk freeze → window finalization | | `.idx` | ≈4.2 B/tx (3-byte payload) | — (per-window) | ~12.5 GB | build → superseded next boundary, or retention | -Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` files for the in-flight window total ~60 GB. Both are transient (§7.4). The steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history (at the default window; +1 B/tx past the 4-byte payload threshold). +Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` files for the in-flight window total ~60 GB. Both are transient (§7.4). The steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history. ## 10. 
Performance From f4be6da97301ea856681a5e46caef763a21b6d3b Mon Sep 17 00:00:00 2001 From: tamirms Date: Fri, 26 Jun 2026 15:14:00 +0200 Subject: [PATCH 17/18] docs(full-history): note retention pruning removes txhash .bin under short retention The .bin lifetime descriptions stated it is removed at window finalization, which only holds when retention is at least a window wide. Under retention narrower than a window the window never finalizes; retention pruning removes each .bin as its chunk drops below the floor. Clarified across the streaming workflow doc, the getTransaction design, and the HTML explorer, consistent with the docs' existing invariant and crash-safety text. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01GtG9MzaQ7h3LqvhQCkWhrg --- design-docs/full-history-design-explorer.html | 14 ++++++++------ design-docs/full-history-streaming-workflow.md | 6 +++--- design-docs/gettransaction-full-history-design.md | 4 ++-- 3 files changed, 13 insertions(+), 11 deletions(-) diff --git a/design-docs/full-history-design-explorer.html b/design-docs/full-history-design-explorer.html index c008f80a3..e16290f23 100644 --- a/design-docs/full-history-design-explorer.html +++ b/design-docs/full-history-design-explorer.html @@ -477,7 +477,7 @@

On disk

├── ledgers/{bucket:05d}/{chunk:08d}.pack
├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash)
└── txhash/
-    ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization
+    ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization (or retention pruning)
    └── index/{window:08d}/{lo:08d}-{hi:08d}.idx ← one frozen file per window, coverage-named
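
As a rough illustration of the layout above, the following Go sketch formats the on-disk paths from a chunk id or an index coverage. The helper names (`ledgerPackPath`, `eventsPackPath`, `txhashBinPath`, `txhashIndexPath`) and the `chunksPerBucket` constant are hypothetical and are not taken from the implementation. The bucket rule (`bucket = chunk / 1000`) and the zero-padding widths follow the directory layout described in the design docs.

```go
package main

import "fmt"

// chunksPerBucket mirrors the documented grouping of chunk-level files:
// bucket_id = chunk_id / 1000. The constant name is an assumption.
const chunksPerBucket = 1000

// ledgerPackPath returns ledgers/{bucket:05d}/{chunk:08d}.pack.
func ledgerPackPath(chunk uint32) string {
	return fmt.Sprintf("ledgers/%05d/%08d.pack", chunk/chunksPerBucket, chunk)
}

// eventsPackPath returns events/{bucket:05d}/{chunk:08d}-events.pack.
func eventsPackPath(chunk uint32) string {
	return fmt.Sprintf("events/%05d/%08d-events.pack", chunk/chunksPerBucket, chunk)
}

// txhashBinPath returns txhash/raw/{bucket:05d}/{chunk:08d}.bin,
// the transient sorted run for one chunk.
func txhashBinPath(chunk uint32) string {
	return fmt.Sprintf("txhash/raw/%05d/%08d.bin", chunk/chunksPerBucket, chunk)
}

// txhashIndexPath returns txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx,
// one file per coverage [lo, hi] inside the window's directory.
func txhashIndexPath(window, lo, hi uint32) string {
	return fmt.Sprintf("txhash/index/%08d/%08d-%08d.idx", window, lo, hi)
}

func main() {
	fmt.Println(ledgerPackPath(5350))
	fmt.Println(eventsPackPath(5350))
	fmt.Println(txhashBinPath(5350))
	fmt.Println(txhashIndexPath(5, 5100, 5349))
}
```

For the worked example in the getTransaction design (window 5, coverage 5100 to 5349), the index helper yields `txhash/index/00000005/00005100-00005349.idx`, which matches the file name given there.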
@@ -487,10 +487,12 @@

On disk

- The .bin is the interesting transient: it is the input to buildTxhashIndex and
- is retained for the whole life of its window (at each boundary the rebuild reads all of the
- window's .bin files). The terminal build's commit batch demotes them to
- "pruning" and the sweep removes them.
+ The .bin is the interesting transient: it is the input to buildTxhashIndex,
+ retained while its chunk is still within the window's live [lo, hi] coverage (each boundary
+ the rebuild reads every in-coverage .bin). When the window finalizes, the terminal build's
+ commit batch demotes its inputs to "pruning" and the sweep removes
+ them — and under retention narrower than a window, a chunk drops below the floor before its window
+ completes, so retention pruning removes its .bin instead.

The chunk hot DB

@@ -521,7 +523,7 @@

Catalog keys

- + diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index c3b33b9fd..c095fbe0b 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -117,7 +117,7 @@ The per-chunk artifacts are each written once at chunk freeze; the txhash index | Sorted txhash file | per chunk | `.bin` (sorted **streamhash** entries — the sorted on-disk tx-hash index format, specified in [the transactions design](./gettransaction-full-history-design.md) §6) | `processChunk` | | Streamhash txhash index | per index | one `.idx` file per **coverage** (the chunk range `[lo, hi]` an index spans), named `{lo:08d}-{hi:08d}.idx` inside the window's dir; at most one coverage frozen at any moment | `buildTxhashIndex` | -The `.bin` files are transient — they are the input `buildTxhashIndex` merges, and the terminal build deletes them once its window is complete. The pack files, events segments, and `.idx` files persist until retention pruning removes them. State for each lives in [Catalog keys](#catalog-keys); the write ordering is [One write protocol](#one-write-protocol). +The `.bin` files are transient — they are the input `buildTxhashIndex` merges, and the terminal build deletes them once its window is complete (or retention pruning removes them first, once its chunks drop below the floor). The pack files, events segments, and `.idx` files persist until retention pruning removes them. State for each lives in [Catalog keys](#catalog-keys); the write ordering is [One write protocol](#one-write-protocol). 
### Directory layout @@ -130,7 +130,7 @@ Chunk-level files group into buckets of 1,000 chunks (`bucket_id = chunk_id / 10 ├── ledgers/{bucket:05d}/{chunk:08d}.pack ├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash) └── txhash/ - ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization + ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization (or retention pruning) └── index/{window:08d}/{lo:08d}-{hi:08d}.idx ← one frozen file per window, coverage-named ``` @@ -155,7 +155,7 @@ The catalog holds three groups of keys: per-chunk artifact state keys, hot DB st | Key | Value | Meaning | |---|---|---| | `chunk:{chunk:08d}:ledgers` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk pack file state. | -| `chunk:{chunk:08d}:txhash` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk `.bin` file state. Transient — removed at window finalization. | +| `chunk:{chunk:08d}:txhash` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk `.bin` file state. Transient — removed at window finalization, or by retention pruning if its chunk ages out first. | | `chunk:{chunk:08d}:events` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk events cold segment state. | | `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` | `"freezing"` \| `"frozen"` \| `"pruning"` | One key per index **coverage**. The key *name* carries the coverage `[lo, hi]` and maps 1:1 to the file `{lo:08d}-{hi:08d}.idx`; the *value* is pure lifecycle state — the same three values as every other artifact key. At most one coverage per window is `"frozen"` at any moment, and a key with `hi` = its window's last chunk is **terminal** by definition (see [Index keys](#index-keys) below). 
| diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md index a54179be5..cf07c0ec3 100644 --- a/design-docs/gettransaction-full-history-design.md +++ b/design-docs/gettransaction-full-history-design.md @@ -111,7 +111,7 @@ entry × count 20 bytes each: [key: 16][seq: 4 LE] The `.bin` is a pre-sorted file, and a lookup never reads it directly. It is sorted because streamhash builds an index **much faster, and with much less memory, when its keys arrive already sorted** — its *sorted-builder mode*. -A `.bin` is kept for as long as its window is still being rebuilt — every rebuild re-merges all of the window's `.bin` files. Once the window is complete and its final index is built, the `.bin` files are no longer needed, and are deleted. +A `.bin` is kept while it is still a rebuild input — every rebuild re-merges the `.bin` files for the chunks its window currently covers. Once the window is complete and its final index is built, the `.bin` files are no longer needed, and are deleted — or, if retention is narrower than a window so its chunks age out before the window completes, retention pruning deletes them first. 
### 6.2 The per-window index: `.idx` @@ -229,7 +229,7 @@ Per dense chunk (~3M transactions) and dense window (1,000 chunks, ~3×10⁹ tra | Structure | Unit cost | Dense chunk | Dense window | Lifetime | |---|---|---|---|---| | hot `txhash` CF | 36 B/tx raw (32 key + 4 value), before RocksDB overhead | ~110 MB raw | — (per-chunk) | chunk ingestion → index coverage | -| `.bin` sorted run | 20 B/tx exactly | ~60 MB | ~60 GB | chunk freeze → window finalization | +| `.bin` sorted run | 20 B/tx exactly | ~60 MB | ~60 GB | chunk freeze → window finalization, or retention floor | | `.idx` | ≈4.2 B/tx (3-byte payload) | — (per-window) | ~12.5 GB | build → superseded next boundary, or retention | Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` files for the in-flight window total ~60 GB. Both are transient (§7.4). The steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history. From 914e34aacbb1eec6d8ee6c9bbe62a0d28333aee6 Mon Sep 17 00:00:00 2001 From: tamirms Date: Wed, 1 Jul 2026 20:57:16 +0200 Subject: [PATCH 18/18] docs(full-history): delete subsumed standalone backfill design doc The standalone backfill workflow design (full-history/design-docs/03-backfill-workflow.md) is fully subsumed by the streaming workflow and getTransaction designs, and predates the 2026-06 streamhash redesign, so it no longer reflects the current design. Remove it, along with the now-dangling "Related documents" pointer to it in the streaming doc. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../full-history-streaming-workflow.md | 1 - .../design-docs/03-backfill-workflow.md | 698 ------------------ 2 files changed, 699 deletions(-) delete mode 100644 full-history/design-docs/03-backfill-workflow.md diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md index c095fbe0b..0c1e45fc2 100644 --- a/design-docs/full-history-streaming-workflow.md +++ b/design-docs/full-history-streaming-workflow.md @@ -996,4 +996,3 @@ A storage walk against the invariants is enough to find these without inspecting - The transactions design ([gettransaction-full-history-design.md](./gettransaction-full-history-design.md)) — the tx-by-hash subsystem end to end: the hot `txhash` CF, the `.bin`/`.idx` formats, the rolling window index rebuild — its streamhash merge internals and safety argument — the `getTransaction` read path, and the capacity numbers. Canonical for the streamhash `.bin`/`.idx` formats, the index merge internals, and the index-key coverage semantics this doc summarizes. - The events design ([getevents-full-history-design.md](./getevents-full-history-design.md), PR #635) — the cold-segment file formats and the hot events CF schema referenced by the data model. - The reader / query-routing design — how reads dispatch between hot DBs and frozen files for in-retention queries. -- The original backfill workflow design ([../full-history/design-docs/03-backfill-workflow.md](../full-history/design-docs/03-backfill-workflow.md)) — the standalone backfill mode, **subsumed by this document**: its `process_chunk`, tx-hash index build and cleanup, geometry, configuration, directory layout, catalog keys, and crash recovery are all redefined here (and in the transactions design) in their current form. It is retained for history; where the two disagree, this document is current. 
In particular, that doc predates the 2026-06 streamhash redesign, so its 16-CF RecSplit `.idx` files, unsorted 36-byte `.bin` entries, `"1"`-valued catalog keys, task-DAG scheduler, and standalone `full-history-backfill` CLI are all superseded. diff --git a/full-history/design-docs/03-backfill-workflow.md b/full-history/design-docs/03-backfill-workflow.md deleted file mode 100644 index dbb7aa05f..000000000 --- a/full-history/design-docs/03-backfill-workflow.md +++ /dev/null @@ -1,698 +0,0 @@ -# Backfill Workflow - -## Overview - -Backfill populates the immutable stores for a configured ledger range `[start_ledger, end_ledger]`. - -**What it does:** -- Ingests historical ledgers offline — no live queries served (only `getHealth` / `getStatus`). `getHealth` is the existing lightweight liveness check; `getStatus` is the new backfill-specific progress endpoint (see [getStatus API Response](#getstatus-api-response) below). -- Writes directly to immutable file formats — no RocksDB active stores -- Schedules work as a DAG of idempotent tasks, dispatched via a flat worker pool (default GOMAXPROCS slots) -- Exits when done; on failure, re-run the same command — completed work is never repeated - -**What it produces:** - -| Query it enables | Immutable output | Scope | -|-----------------|-----------------|-------| -| `getLedger` | Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | Per chunk (10K ledgers) | -| `getTransaction` | Txhash index files | Per txhash index (default 10M ledgers) | -| `getEvents` | [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | Per chunk | - ---- - -## Geometry - -The Stellar blockchain starts at ledger 2. 
Backfill organizes data using two concepts: - -- **Chunk** — 10_000 ledgers (hardcoded, not configurable) - - Atomic unit of ingestion and crash recovery - - Produces: one ledger `.pack` file, one raw txhash `.bin` file, one events cold segment (`events.pack`, `index.pack`, `index.hash`) - - `chunk_id = (ledger_seq - 2) / 10_000` -- **Txhash Index** — `CHUNKS_PER_TXHASH_INDEX` chunks (default 1000 = 10M ledgers) - - One RecSplit index covers all transactions across `CHUNKS_PER_TXHASH_INDEX` chunks (default: 10M ledgers worth of transactions) - - Produces 16 CF (column family) `.idx` files per txhash index - - `index_id = chunk_id / CHUNKS_PER_TXHASH_INDEX` - - Configurable via TOML, but must not change across runs — once set, it is fixed - -### ID Formulas - -``` -chunk_id = (ledger_seq - 2) / 10_000 -index_id = chunk_id / CHUNKS_PER_TXHASH_INDEX -``` - -Example with `CHUNKS_PER_TXHASH_INDEX = 1000` (default): - -| Txhash Index ID | First Ledger | Last Ledger | Chunks | -|-----------------|-------------|------------|--------| -| 0 | 2 | 10_000_001 | 0–999 | -| 1 | 10_000_002 | 20_000_001 | 1000–1999 | -| 2 | 20_000_002 | 30_000_001 | 2000–2999 | -| N | (N × 10M) + 2 | ((N+1) × 10M) + 1 | N×1000 – (N+1)×1000 - 1 | - -All IDs use uniform `%08d` zero-padding (supports up to 99_999_999). - ---- - -## Configuration - -TOML file, passed via `stellar-rpc full-history-backfill --config path/to/config.toml`. - -- **TOML** defines data layout and storage paths — must be stable across runs -- **CLI flags** define per-run parameters (range, workers, retries) - -### TOML Config - -**[SERVICE]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | - -**[BACKFILL]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per txhash index. 
Defines data layout — must be stable across runs. | - -**[IMMUTABLE_STORAGE.LEDGERS]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/ledgers` | Base path for ledger pack files. | - -**[IMMUTABLE_STORAGE.EVENTS]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/events` | Base path for events cold segments. | - -**[IMMUTABLE_STORAGE.TXHASH_RAW]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/raw` | Base path for raw txhash `.bin` files (transient). | - -**[IMMUTABLE_STORAGE.TXHASH_INDEX]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/index` | Base path for RecSplit index files (permanent). | - -The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-backed mutable stores used by the streaming workflow). - -**[BACKFILL.BSB]** — BSB / Buffered Storage Backend (required) - -| Key | Type | Default | Description | -|-----|------|---------|-------------------------------------------------------------------------------------| -| `BUCKET_PATH` | string | **required** | Remote object store path to fetch LedgerCloseMeta (without `gs://` prefix for GCS). | -| `BUFFER_SIZE` | int | `1000` | Prefetch buffer depth per connection. | -| `NUM_WORKERS` | int | `20` | Download workers per connection. | - -**[LOGGING]** - -Both keys are optional. When a key is set in both TOML and on the CLI, the CLI flag wins — specifying both is not an error. - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `LEVEL` | string | `"info"` | Minimum log severity. Accepted values: `debug` / `info` / `warn` / `error`. | -| `FORMAT` | string | `"text"` | Log output format. Accepted values: `text` / `json`. 
| - -### CLI Flags - -| Flag | Type | Default | Description | -|------|------|---------|-------------| -| `--start-ledger` | uint32 | **required** | First ledger (inclusive). Must be ≥ 2. | -| `--end-ledger` | uint32 | **required** | Last ledger (inclusive). Must be > `start_ledger`. | -| `--workers` | int | `GOMAXPROCS` | Total concurrent DAG task slots. | -| `--verify-recsplit` | bool | `true` | Run RecSplit verify phase after build. | -| `--max-retries` | int | `3` | Max retries per task before marking it failed. | -| `--log-level` | string | — | Overrides `[LOGGING].LEVEL` when set. | -| `--log-format` | string | — | Overrides `[LOGGING].FORMAT` when set. | - -### Optional TOML Sections - -| Section | Key | Default | Description | -|---------|-----|---------|-------------| -| `[META_STORE]` | `PATH` | `{DEFAULT_DATA_DIR}/meta/rocksdb` | Meta store RocksDB directory | - -### Validation Rules - -The only hard constraints are: - -- `start_ledger >= 2` -- `end_ledger > start_ledger` -- `[BACKFILL.BSB]` must be present -- `CHUNKS_PER_TXHASH_INDEX` must not change after the first run — changing it invalidates existing txhash index boundaries -- Backfill never prunes existing data — narrowing the range between runs is safe (completed work outside the new range is simply left untouched) -- No txhash-index-alignment required — the operator can pass any arbitrary ledger range -- If gaps remain after backfill, streaming mode validates completeness for all chunks and all txhash indexes at startup, reports any gaps to the operator, and aborts - -#### Chunk Boundary Expansion - -- System expands the requested range **outward** to the nearest chunk boundaries -- Start expands DOWN to the first ledger of its chunk -- End expands UP to the last ledger of its chunk -- Never clamps inward — the effective range is always ≥ the requested range -- Operator doesn't need to manually calculate chunk-aligned values - -``` -Operator requests: --start-ledger 5_000_000 --end-ledger 
56_337_842 -Chunk boundary expand: start=5_000_000 falls within chunk 499 (starts at 4_990_002) - → expand start to 4_990_002 - end=56_337_842 falls within chunk 5633 (ends at 56_340_001) - → expand end to 56_340_001 -Effective range: ledgers 4_990_002–56_340_001 = 5_135 chunks -``` - -#### BSB Availability Validation - -After expansion, the system validates that the remote object store referenced by BSB contains all ledgers in the expanded range: - -- Expanded end exceeds BSB availability → error at startup (no silent truncation) -- Operator must either reduce `--end-ledger` or wait for more ledgers to become available in BSB - -#### Partial Txhash Index Ranges - -If the expanded range does not complete a full txhash index: - -- Chunks are still backfilled and immediately serve `getLedger`/`getEvents` when the service is started in streaming mode -- Txhash index creation only happens once **all** input chunks for the txhash index are ready -- If txhash index creation does not happen in the current backfill run, the remaining chunks are completed either by a subsequent backfill run (should the operator run backfill again) or when streaming mode starts for the first time (see [Implications for Streaming Workflow](#implications-for-streaming-workflow) below) - -Ledger and events data are useful per-chunk and should not be blocked by txhash index alignment. Without relaxed validation: - -- A node at ledger 56_340_000 cannot backfill the latest ~6.3M ledgers because `50_000_002–56_340_001` doesn't align to a 10M txhash index boundary — the operator would have to wait until ledger 60_000_001 -- Incremental backfill (extending coverage from a completed txhash index to recent history) would be blocked unless the chain happens to sit on a txhash index boundary - -#### Implications for Streaming Workflow - -When backfill completes at a non-txhash-index-aligned boundary, a partially-filled txhash index remains. 
The streaming workflow completes the remaining chunks: - -- Streaming continues chunk ingestion from where backfill left off, writing the same per-chunk outputs (LFS, txhash, events) using the same flag-based idempotency -- When streaming completes the last chunk needed for a pending txhash index, txhash index creation becomes eligible and runs -- The meta store is the shared coordination point — streaming checks the same chunk flags as backfill, so there is no gap or overlap between backfill and streaming coverage - -See [PR #617 discussion](https://github.com/stellar/stellar-rpc/pull/617#discussion_r2969796337) for the original rationale. - -### Example: GCS Backfill Config - -```toml -[SERVICE] -DEFAULT_DATA_DIR = "/data/stellar-rpc" - -[BACKFILL] -CHUNKS_PER_TXHASH_INDEX = 1000 - -[IMMUTABLE_STORAGE.LEDGERS] -PATH = "/mnt/nvme/ledgers" - -[IMMUTABLE_STORAGE.EVENTS] -PATH = "/mnt/nvme/events" - -[IMMUTABLE_STORAGE.TXHASH_RAW] -PATH = "/mnt/nvme/txhash/raw" - -[IMMUTABLE_STORAGE.TXHASH_INDEX] -PATH = "/mnt/nvme/txhash/index" - -[BACKFILL.BSB] -BUCKET_PATH = "sdf-ledger-close-meta/v1/ledgers/pubnet" - -[LOGGING] -LEVEL = "info" -FORMAT = "text" -``` - -```bash -stellar-rpc full-history-backfill --config config.toml \ - --start-ledger 2 \ - --end-ledger 30_000_001 \ - --workers 40 -``` - ---- - -## Directory Structure - -With geometry (chunk, txhash index) and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is how they map to the filesystem. 
- -- Each data type has its own directory tree rooted at its `IMMUTABLE_STORAGE.*.PATH` -- Chunk-level files (ledgers, events, raw txhash) are grouped into subdirectories (bucket) of 1_000 chunks: - - `bucket_id = chunk_id / 1000` (hardcoded, not configurable), formatted as `%05d` - - `bucket_id` is purely a filesystem concern — it does not appear in meta store keys, DAG dependencies, or config -- Txhash index output is the only structure that uses `index_id` instead of `bucket_id` -- Directories are created on-demand via `os.MkdirAll` (safe for concurrent writes) - -``` -{DEFAULT_DATA_DIR}/ -├── meta/ -│ └── rocksdb/ ← Meta store (WAL always enabled) -│ -├── ledgers/ ← IMMUTABLE_STORAGE.LEDGERS.PATH -│ ├── 00000/ ← chunks 0–999 (1_000 .pack files) -│ │ ├── 00000000.pack ← ledger pack file (PR #633) -│ │ ├── 00000001.pack -│ │ └── ... -│ ├── 00001/ ← chunks 1000–1999 -│ │ └── ... -│ └── .../ -│ -├── events/ ← IMMUTABLE_STORAGE.EVENTS.PATH -│ ├── 00000/ ← chunks 0–999 (3_000 files: 3 per chunk) -│ │ ├── 00000000-events.pack ← compressed event blocks -│ │ ├── 00000000-index.pack ← serialized roaring bitmaps -│ │ ├── 00000000-index.hash ← MPHF for term → slot lookup -│ │ └── ... -│ └── .../ -│ -└── txhash/ - ├── raw/ ← IMMUTABLE_STORAGE.TXHASH_RAW.PATH - │ ├── 00000/ ← chunks 0–999 (1_000 .bin files) - │ │ ├── 00000000.bin ← TRANSIENT (deleted after RecSplit) - │ │ └── ... - │ └── .../ - └── index/ ← IMMUTABLE_STORAGE.TXHASH_INDEX.PATH - ├── 00000000/ ← txhash index 0 (16 RecSplit CF files) - │ └── cf-{0-f}.idx ← PERMANENT - └── .../ -``` - -`CHUNKS_PER_TXHASH_INDEX` only affects `txhash/index/` — all other trees use the hardcoded 1_000-chunk `bucket_id` grouping regardless. - -The directory tree above reflects the default `CHUNKS_PER_TXHASH_INDEX = 1000`. 
Using 20M ledgers (2_000 chunks) as an example: - -| `CHUNKS_PER_TXHASH_INDEX` | Txhash index dirs | Tradeoff | -|---------------------------|-------------------|----------| -| `1000` (default) | 2_000 / 1000 = 2 | Fewer dirs, larger indexes — longer build time per index, fewer files to search at query time | -| `100` | 2_000 / 100 = 20 | More dirs, smaller indexes — faster build time per index, more files to search at query time | -| `1` | 2_000 / 1 = 2_000 | One index per chunk — fastest build, most files to search | - -### Path Conventions - -| File Type | Pattern | Example | -|-----------|---------|---------| -| Ledger pack | `{IMMUTABLE_STORAGE.LEDGERS.PATH}/{bucketID:05d}/{chunkID:08d}.pack` | `ledgers/00000/00000042.pack` | -| Raw txhash | `{IMMUTABLE_STORAGE.TXHASH_RAW.PATH}/{bucketID:05d}/{chunkID:08d}.bin` | `txhash/raw/00000/00000042.bin` | -| RecSplit CF | `{IMMUTABLE_STORAGE.TXHASH_INDEX.PATH}/{indexID:08d}/cf-{nibble}.idx` | `txhash/index/00000000/cf-a.idx` | -| Events data | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-events.pack` | `events/00000/00000042-events.pack` | -| Events index | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-index.pack` | `events/00000/00000042-index.pack` | -| Events hash | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-index.hash` | `events/00000/00000042-index.hash` | - -- **Nibble** = high 4 bits of `txhash[0]`, i.e., `txhash[0] >> 4`. Values `0`–`f`. Determines which of 16 CFs a txhash is routed to. -- **Raw txhash format**: 36 bytes per entry, no header: `[txhash: 32 bytes][ledgerSeq: 4 bytes big-endian]` -- **Events cold segment**: See [getEvents full-history design](https://github.com/stellar/stellar-rpc/pull/635) for the full format specification. 

---

## Meta Store Keys

- Single RocksDB instance with WAL (Write-Ahead Log) always enabled
- Authoritative source for crash recovery: all resume decisions derive from key presence in this store

### Key Schema

All IDs in keys use uniform `%08d` zero-padding, matching the chunk file names and txhash index directory names.

| Key Pattern | Value | Written When |
|-------------|-------|-------------|
| `chunk:{C:08d}:lfs` | `"1"` | After ledger `.pack` file is fsynced |
| `chunk:{C:08d}:txhash` | `"1"` | After raw txhash `.bin` file is fsynced |
| `chunk:{C:08d}:events` | `"1"` | After events cold segment files (`events.pack`, `index.pack`, `index.hash`) are fsynced |
| `index:{N:08d}:txhash` | `"1"` | After all 16 RecSplit CF `.idx` files are built and fsynced |

- Values are `"1"` (retained for `ldb`/`sst_dump` readability); key presence is the signal
- Key absence means not started or incomplete; both are treated identically on resume
- Each chunk flag is written independently after its output's fsync, so a crash may leave some flags set and others absent for the same chunk
- On resume, each chunk's flags are checked independently and only missing outputs are produced
- WAL is always enabled; disabling it would invalidate all crash recovery
- `chunk:{C}:txhash` keys are deleted after the txhash index is built (the raw `.bin` files they reference are also deleted); all other flags are permanent

**Examples:**
```
chunk:00000000:lfs       →  "1"      chunk 0 ledger pack done
chunk:00000000:txhash    →  "1"      chunk 0 raw txhash done
chunk:00000000:events    →  "1"      chunk 0 events cold segment done
chunk:00000999:events    →  "1"      last chunk of txhash index 0
index:00000000:txhash    →  "1"      txhash index 0 RecSplit complete
index:00000001:txhash    →  absent   txhash index 1 not yet built
```

### Key Lifecycle

```
chunk ingestion     →  sets chunk:{C}:lfs, chunk:{C}:txhash, chunk:{C}:events
                       (each independently, after its output's fsync)
txhash index build  →  sets index:{N}:txhash
txhash cleanup      →  deletes chunk:{C}:txhash keys + raw .bin files
```

After a completed txhash index:
- `chunk:{C}:lfs`, `chunk:{C}:events`, `index:{N}:txhash`: permanent
- `chunk:{C}:txhash` keys + raw `.bin` files: deleted after txhash index is built

---

## Tasks and Dependencies

The backfill DAG has three task types:

| Task | Cadence | Dependencies | Produces |
|------|---------|-------------|----------|
| `process_chunk(chunk_id)` | Per chunk (10K ledgers) | None | Ledger `.pack` + raw txhash `.bin` + events cold segment |
| `build_txhash_index(index_id)` | Per txhash index | All `process_chunk` tasks for this txhash index | 16 RecSplit `.idx` files |
| `cleanup_txhash(index_id)` | Per txhash index | `build_txhash_index` for this txhash index | Deletes raw `.bin` files + `chunk:{C}:txhash` meta keys |

- Each task is a black box to the DAG scheduler, which calls `Execute()` and waits for it to return
- What happens inside (goroutines, I/O, parallelism) is up to the task

### Dependency Diagram

For a single txhash index with N chunks:

```
process_chunk(chunk 0)   ─┐
process_chunk(chunk 1)   ─┤
process_chunk(chunk 2)   ─┼──→ build_txhash_index(index_id) ──→ cleanup_txhash(index_id)
...                       │
process_chunk(chunk N-1) ─┘
```

- All `process_chunk` tasks for a txhash index must complete before `build_txhash_index` fires
- `cleanup_txhash` runs after `build_txhash_index` succeeds
- Cleanup deletes the raw `.bin` files and their `chunk:{C}:txhash` meta keys

### Main Flow

```python
def run_backfill(config, flags):

    # 1. Validate: abort before any work if config is incompatible with existing state
    validate(config, flags)

    # 2. Build DAG: register all tasks; each task's execute() handles its own no-op check
    dag = build_dag(config, flags)

    # 3. Execute: dispatch all tasks concurrently, bounded by worker count
    dag.execute(max_workers=flags.workers)  # default GOMAXPROCS
```

### Validation

Validation runs before DAG construction, not as a DAG task. If it were a DAG task, other tasks with no dependencies would start executing concurrently before validation completes, and if validation failed, in-flight work that should never have started would need to be cancelled. Running it first means a clean abort with no partial work.

```python
def validate(config, flags):
    # See Validation Rules for the full list of checks.
    assert flags.start_ledger >= 2
    assert flags.end_ledger > flags.start_ledger
    assert config.backfill.bsb is not None
    assert CHUNKS_PER_TXHASH_INDEX unchanged from prior runs (if meta store is non-empty)
```

### DAG Setup

```python
def build_dag(config, flags):
    # Wires up tasks and dependency edges; no completion checks or skip logic.
    # Each task's execute() handles its own no-op check (early return if already complete).

    dag = DAG()

    for index_id in configured_indexes(config, flags):
        chunk_tasks = []
        for chunk_id in chunks_for_index(index_id):
            t = dag.add(ProcessChunkTask(chunk_id), deps=[])
            chunk_tasks.append(t.id)
        b = dag.add(BuildTxHashIndexTask(index_id),
                    deps=chunk_tasks)
        dag.add(CleanupTxHashTask(index_id), deps=[b.id])

    return dag
```

---

## Task Details

### process_chunk(chunk_id)

- Processes a single 10K-ledger chunk end-to-end
- Occupies one DAG worker slot
- Only produces missing outputs, checking each flag independently
- Internal concurrency is an implementation detail

**Outputs** (all produced in a single task, only if missing):
- Ledger pack file (`{chunkID:08d}.pack`): compressed ledger data in [packfile format](https://github.com/stellar/stellar-rpc/pull/633)
- Raw txhash flat file (`{chunkID:08d}.bin`): 36-byte entries consumed by RecSplit builder
- Events cold segment (`events.pack` + `index.pack` + `index.hash`): per [getEvents design](https://github.com/stellar/stellar-rpc/pull/635)

**Pseudocode:**

```python
def process_chunk(chunk_id):
    bucket_id = chunk_id // 1000  # hardcoded subdirectory grouping (see Directory Structure)
    first_ledger = chunk_first_ledger(chunk_id)
    last_ledger = chunk_last_ledger(chunk_id)

    # 1. Check which outputs are missing
    need_lfs    = not meta_store.has(f"chunk:{chunk_id:08d}:lfs")
    need_txhash = not meta_store.has(f"chunk:{chunk_id:08d}:txhash")
    need_events = not meta_store.has(f"chunk:{chunk_id:08d}:events")

    if not (need_lfs or need_txhash or need_events):
        return  # all outputs already present

    # 2. Choose data source
    if not need_lfs:
        source = local_packfile(ledger_pack_path(bucket_id, chunk_id))  # NVMe, no BSB
    else:
        source = BSBFactory.create(first_ledger, last_ledger)  # BSB connection

    # 3. Open writers only for missing outputs
    ledger_writer = packfile.create(ledger_pack_path(bucket_id, chunk_id),
                                    overwrite=True) if need_lfs else None
    txhash_writer = open(raw_txhash_path(bucket_id, chunk_id),
                         overwrite=True) if need_txhash else None
    events_writer = events_segment.create(events_path(bucket_id, chunk_id),
                                          overwrite=True) if need_events else None

    # 4. Process each ledger
    for seq in range(first_ledger, last_ledger + 1):
        lcm = source.get_ledger(seq)

        if need_lfs:    ledger_writer.append(compress(lcm))
        if need_txhash: txhash_writer.append(extract_txhashes(lcm))  # 36 bytes per tx
        if need_events: events_writer.append(extract_events(lcm))

    # 5. Fsync + flag each output independently
    if need_lfs:
        ledger_writer.fsync_and_close()
        meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1")

    if need_txhash:
        txhash_writer.fsync_and_close()
        meta_store.put(f"chunk:{chunk_id:08d}:txhash", "1")

    if need_events:
        events_writer.finalize()  # flush, build MPHF + bitmap index, fsync
        meta_store.put(f"chunk:{chunk_id:08d}:events", "1")

    source.close()
```

Key properties:
- Only missing outputs are produced, so a partially-completed chunk resumes from where it left off
- If LFS is already present, reads from local NVMe instead of BSB (avoids redundant download)
- Each flag is written independently after its output's fsync; no atomic WriteBatch is needed
- `packfile.Create()` with `overwrite=True` handles truncation of partial files from prior crashes; no explicit `delete_if_exists` check is needed
- Naturally extends to new data types (add a fourth flag)

**BSB** (BufferedStorageBackend):
- Ledger source backed by a remote object store
- Each `process_chunk` task creates its own BSB connection
- Internal prefetch workers: `BUFFER_SIZE` ledgers ahead, `NUM_WORKERS` download goroutines

### build_txhash_index(index_id)

- Builds the RecSplit index files for one txhash index whose chunks are all complete
- Occupies one DAG worker slot, but spawns
  several goroutines internally
- The DAG guarantees all chunk `.bin` files exist before this runs

**Pseudocode:**

```python
def build_txhash_index(index_id):
    if meta_store.has(f"index:{index_id:08d}:txhash"):
        return  # already built, no-op

    bin_files = list_bin_files(index_id)  # all .bin files for chunks in this txhash index

    # Phase 1: COUNT. Scan all .bin files, count entries per CF
    cf_counts = parallel_count(bin_files, workers=100)
    # cf_counts[nibble] = number of (txhash, ledgerSeq) entries routed to that CF

    # Phase 2: ADD. Re-read .bin files, route entries to CF builders
    cf_builders = [RecSplitBuilder(cf_counts[n]) for n in range(16)]
    parallel_add(bin_files, cf_builders, workers=100)
    # each entry routed to cf_builders[txhash[0] >> 4] (mutex per CF)

    # Phase 3: BUILD. Build MPH index per CF, one .idx file each
    parallel_build(cf_builders, workers=16)
    # each CF produces one .idx file; all fsynced

    # Phase 4: VERIFY (optional). Look up every key in the built indexes
    if verify_recsplit:
        parallel_verify(bin_files, cf_builders, workers=100)

    # Mark index complete
    meta_store.put(f"index:{index_id:08d}:txhash", "1")
```

Key properties:
- COUNT and ADD each read all `.bin` files (two full passes over the data)
- BUILD runs 16 goroutines in parallel (one per CF); each CF is independent
- VERIFY is skippable via the `--verify-recsplit=false` CLI flag
- All-or-nothing recovery: if `index:{N}:txhash` is absent on restart → delete partial `.idx` files → rerun entire build

### cleanup_txhash(index_id)

- Runs after `build_txhash_index` completes successfully

**Pseudocode:**

```python
def cleanup_txhash(index_id):
    for chunk_id in chunks_for_index(index_id):
        if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"):
            continue  # already cleaned up, skip
        bucket_id = chunk_id // 1000  # same hardcoded grouping as process_chunk
        delete(raw_txhash_path(bucket_id, chunk_id))         # remove .bin file
        meta_store.delete(f"chunk:{chunk_id:08d}:txhash")    # remove meta key
```

Key properties:
- Modeled as a separate DAG task (not inline in `build_txhash_index`) so crash recovery works naturally
- Per-chunk idempotency: each chunk checks its own `chunk:{C}:txhash` key before deleting, so a crash mid-cleanup resumes from where cleanup left off
- On restart: DAG sees txhash index key present (build complete) but `chunk:{C}:txhash` keys still exist → cleanup runs as a normal task

---

## Execution Model

### DAG Scheduler

- Pipeline builds a single DAG at startup, executes it with bounded concurrency
- The DAG is the only scheduling mechanism: no per-txhash-index coordinators, no secondary worker pools
- Each task's `Execute()` is wrapped with a retry loop bounded by `--max-retries` (default 3). Any transient failure (BSB errors, temporary I/O issues) triggers a retry at the task level.

```python
def run_dag(dag, max_workers):
    worker_slots = Semaphore(max_workers)
    runnable_tasks = ThreadSafeQueue(dag.tasks_with_no_pending_dependencies())

    def execute_task(task):
        """Runs in a background thread, one per dispatched task."""
        succeeded = False
        for attempt in range(1, max_retries + 1):
            error = task.execute()
            if error is None:
                succeeded = True
                break
            if attempt == max_retries:
                mark_failed(task, error)            # halt all dependents
                break
            log.warn("retry", task, attempt, error)

        worker_slots.release()                      # free worker slot
        if not succeeded:
            return                                  # dependents of a failed task never become runnable

        # Check if completing this task unblocks any downstream tasks
        for downstream in dag.dependents_of(task):
            downstream.mark_dependency_done(task)
            if downstream.all_dependencies_done():
                runnable_tasks.push(downstream)     # now eligible to run

    # Main loop: dispatches tasks as they become runnable
    while runnable_tasks:
        current_task = runnable_tasks.pop()
        worker_slots.acquire()                          # block until a worker slot is free
        run_in_background(execute_task, current_task)   # launch, returns immediately
```

### Worker Pool

- Single flat pool of `workers` slots (default `GOMAXPROCS`)
- Any mix of task types can occupy slots simultaneously
- `process_chunk`: 1 slot per task
- `build_txhash_index`: 1 slot per task (uses many goroutines internally)
- `cleanup_txhash`: 1 slot per task

### How Work Flows Through the Pipeline

- All `process_chunk` tasks have no dependencies → DAG dispatches up to `workers` slots immediately at startup
- Chunks from different txhash indexes run side by side; the scheduler does not process txhash indexes sequentially
- When the last chunk of a txhash index completes → `build_txhash_index` becomes eligible, claims a slot
- After build completes → `cleanup_txhash` becomes eligible
- Remaining slots continue processing chunks for other txhash indexes throughout; no special coordination is needed

---

## Crash Recovery

There is no separate crash recovery, reconciliation, or startup triage phase. Recovery happens organically because every task's `execute()` checks its own completion state:

- On every startup, `build_dag()` registers ALL tasks for the configured range, with no meta store scanning in DAG setup
- `process_chunk` checks each output flag independently: missing outputs are produced, existing outputs are skipped
- `build_txhash_index` checks `index:{N}:txhash`: if present, returns immediately; if absent, deletes partial `.idx` files and reruns the full build
- `cleanup_txhash` checks `chunk:{C}:txhash` per-chunk: already-cleaned chunks are skipped, remaining chunks are cleaned up

This works because of three invariants:

1. **Key implies durable file**: a meta store flag is set only after fsync
2. **Tasks are idempotent**: each checks its own outputs and skips or overwrites what exists
3. **DAG registers all tasks on every startup**: completed tasks return immediately from `execute()`

### Concurrent Access Prevention

- Meta store RocksDB uses kernel-level `flock()` on a `LOCK` file
- A second process attempting to open the same meta store fails immediately
- The lock is released automatically on process exit (including `kill -9`)

---

## getStatus API Response

During backfill, `getStatus` returns progress as task-type summaries:
- No per-txhash-index breakdown, just `completed`/`pending`/`in_progress` counts per task type

```json
{
  "mode": "BACKFILL",
  "tasks": {
    "process_chunk":       {"completed": 288, "pending": 5712, "in_progress": 40},
    "build_txhash_index":  {"completed": 0,   "pending": 6,    "in_progress": 0},
    "cleanup_txhash":      {"completed": 0,   "pending": 6,    "in_progress": 0}
  },
  "eta_seconds": 1820
}
```

---

## Error Handling

Two layers of retry:

- **BSB retries**: BSB handles transient errors internally (connection resets, throttling, etc.). These retries happen within a single task execution and are not visible to the DAG scheduler.
- **Task-level retries**: the DAG scheduler wraps each task's `execute()` with a retry loop bounded by `--max-retries` (default 3). If a task returns an error after BSB has exhausted its own retries, the scheduler retries the entire task. After `--max-retries` attempts are exhausted → task marked failed → DAG halts all dependent tasks → process exits non-zero.

The operator re-runs the same command; completed work is never repeated.

| Error | Handled by | Action |
|-------|-----------|--------|
| BSB transient error (throttle, connection reset) | BSB internal retry | Retried within the task; transparent to DAG |
| BSB persistent error (BSB retries exhausted) | Task-level retry | `--max-retries` attempts; then ABORT |
| Ledger pack write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set |
| TxHash write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set |
| Events write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set |
| RecSplit build failure | Task-level retry | `--max-retries` attempts; then ABORT; txhash index key absent |
| Verify phase mismatch | None | ABORT immediately: data corruption, operator investigates |
| Meta store write failure | None | ABORT immediately: treat as crash, operator re-runs |
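
The task-level retry and halt-on-failure behavior described above can be exercised with a small runnable sketch. It is an illustration under stated assumptions rather than the pipeline's scheduler: `Task` and `run_dag` are hypothetical names, a `ThreadPoolExecutor` stands in for the semaphore-bounded worker slots, and a raised exception stands in for a task returning an error.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Task:
    """Hypothetical task record: a name, a callable, and the names it depends on."""

    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)


def run_dag(tasks, max_workers=4, max_retries=3):
    """Run tasks in dependency order; return (completed, failed) sets of names."""
    by_name = {t.name: t for t in tasks}
    waiting_on = {t.name: set(t.deps) for t in tasks}
    dependents = {t.name: [] for t in tasks}
    for t in tasks:
        for dep in set(t.deps):
            dependents[dep].append(t.name)

    completed, failed = set(), set()
    lock = threading.Lock()
    all_done = threading.Event()
    in_flight = 0  # tasks submitted but not yet settled

    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        def submit(name):  # caller must hold `lock`
            nonlocal in_flight
            in_flight += 1
            pool.submit(run, name)

        def run(name):
            nonlocal in_flight
            ok = False
            for _attempt in range(max_retries):  # task-level retry loop
                try:
                    by_name[name].fn()
                    ok = True
                    break
                except Exception:
                    pass  # treated as a retryable failure in this sketch
            with lock:
                if ok:
                    completed.add(name)
                    # Unblock dependents only on success.
                    for child in dependents[name]:
                        waiting_on[child].discard(name)
                        if not waiting_on[child]:
                            submit(child)
                else:
                    failed.add(name)  # dependents never become runnable
                in_flight -= 1
                if in_flight == 0:
                    all_done.set()

        with lock:
            roots = [n for n, deps in waiting_on.items() if not deps]
            for n in roots:
                submit(n)
            if not roots:
                all_done.set()
        all_done.wait()

    return completed, failed
```

Dependents are submitted before the finishing task decrements `in_flight`, so the counter only reaches zero once no task is running and none can become runnable. A caller would treat a non-empty `failed` set as the signal to exit non-zero.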

| Key | Meaning |
|-----|---------|
| `chunk:{c}:ledgers` | Per-chunk `.pack` file state. |
| `chunk:{c}:txhash` | Per-chunk `.bin` file state. Transient: removed at window finalization, or by retention pruning if its chunk ages out first. |
| `chunk:{c}:events` | Per-chunk events cold segment state. |
| `index:{w}:{lo}:{hi}` | One key per index coverage. The key name carries the coverage and maps 1:1 to the file `{lo}-{hi}.idx`; the value is pure lifecycle state. At most one coverage per window is `"frozen"` at any moment. |
| `hot:chunk:{c}` | `"ready"` = dir exists and is usable; `"transient"` = a directory operation (create or delete) is in flight. The recovery is the same either way, which is why one value suffices. |
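
The 1:1 mapping between an `index:{w}:{lo}:{hi}` key and its `{lo}-{hi}.idx` file, and the at-most-one-`"frozen"` rule, can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the key components are assumed to be plain unpadded integers (the table does not state a padding width), and `"building"` in the usage note is a made-up stand-in for any non-frozen state.

```python
from collections import Counter


def index_key(window: int, lo: int, hi: int) -> str:
    """Build an index coverage key; the coverage lives in the key name."""
    return f"index:{window}:{lo}:{hi}"


def parse_index_key(key: str) -> tuple[int, int, int]:
    """Split an index coverage key back into (window, lo, hi)."""
    prefix, window, lo, hi = key.split(":")
    if prefix != "index":
        raise ValueError(f"not an index coverage key: {key!r}")
    return int(window), int(lo), int(hi)


def index_filename(key: str) -> str:
    """File name implied by the key: {lo}-{hi}.idx."""
    _, lo, hi = parse_index_key(key)
    return f"{lo}-{hi}.idx"


def windows_with_multiple_frozen(store: dict[str, str]) -> list[int]:
    """Return windows that violate 'at most one frozen coverage per window'."""
    frozen = Counter(
        parse_index_key(key)[0]
        for key, state in store.items()
        if key.startswith("index:") and state == "frozen"
    )
    return sorted(window for window, count in frozen.items() if count > 1)
```

For example, a store holding `index:3:3000:3499` as `"frozen"` and `index:3:3000:3999` as `"building"` passes the check, while marking both `"frozen"` reports window 3.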