The Full-History Streaming Design
The big picture
Full-history RPC runs as one daemon in one mode. There is no separate backfill command and no explicit step for the operator: on startup the daemon works out how far behind the network tip it is, backfills to it automatically, then serves live ledgers as they're produced.
1 · Backfill
Runs bulk backfill as a subroutine: any chunk inside the retention window that isn't already frozen is pulled from the configured LedgerBackend (BSB by default), skipping the tip chunk that captive core is actively ingesting. Covers first-ever start, downtime gaps, and retention widening.
2 · Ingest
Streams live ledgers from CaptiveStellarCore into one hot RocksDB per chunk, with ledgers, tx hashes, and events as column families, written as one atomic synced WriteBatch per ledger. A ledger is either fully in the hot DB or absent.
3 · Freeze & prune
A background goroutine wakes on each chunk boundary and performs one run: freeze the completed chunk to immutable files, rebuild the current tx-hash index to fold it in, discard hot DBs the cold artifacts now serve, and prune everything superseded or past retention.
Data moves from hot to cold artifacts (.pack / events segment / .bin → rolled into a per-window .idx), and every transition is recorded in a catalog key before the bytes move, so a crash at any instant is recoverable from keys alone.
Geometry

The chain starts at ledger 2 (GENESIS_LEDGER). Two units organize all storage:

- Chunk: 10,000 ledgers (hardcoded). The atomic unit of ingestion, freezing, and crash recovery.
- Window: 1,000 chunks = 10,000,000 ledgers (hardcoded). The unit of the rolling tx-hash index.
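The two units imply a simple mapping from a ledger sequence to its chunk and window. The sketch below is an illustration, not the daemon's code: the function names are hypothetical, and the offset-by-genesis formula is inferred from the document's own boundary example, in which ledger 53,510,001 closes chunk 5350 in window 5.

```go
package main

import "fmt"

const (
	GenesisLedger   = 2      // first ledger of the chain, per the text
	LedgersPerChunk = 10_000 // hardcoded chunk size
	ChunksPerWindow = 1_000  // hardcoded window size
)

// ChunkOf returns the chunk containing ledger seq (seq must be >= GenesisLedger).
func ChunkOf(seq uint32) uint32 { return (seq - GenesisLedger) / LedgersPerChunk }

// ChunkFirst returns the first ledger of chunk c.
func ChunkFirst(c uint32) uint32 { return c*LedgersPerChunk + GenesisLedger }

// ChunkLast returns the last ledger of chunk c (the ledger that closes it).
func ChunkLast(c uint32) uint32 { return ChunkFirst(c+1) - 1 }

// WindowOf returns the window containing chunk c.
func WindowOf(c uint32) uint32 { return c / ChunksPerWindow }

func main() {
	seq := uint32(53_510_001)
	c := ChunkOf(seq)
	// Prints: 5350 53500002 53510001 5
	fmt.Println(c, ChunkFirst(c), ChunkLast(c), WindowOf(c))
}
```

Under this assumed mapping, chunk 0 starts at ledger 2 and every chunk boundary falls on a sequence ending in 0001.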
The four guarantees
The daemon is built around four guarantees over its data. Everything else in the design (the write protocol, the derived last committed ledger, the key-driven sweeps) exists to maintain these through any crash at any instant.

Retention is complete
No gaps within the retention window: for every ledger from the retention floor up to the last committed ledger, all data derived from it (transactions, events) is present on disk and can serve any request that falls entirely inside the window.
Cold is canonical, hot is transient
Frozen chunks and finalized indexes live in immutable cold artifacts. A chunk's hot DB is discarded once every cold artifact derived from it is durable and the rolling index covers it, so a tx lookup always has exactly one home: the hot DB until coverage, the .idx after.
The catalog catalogs what's on disk
Disk content is exactly what the catalog specifies: every file is named by a catalog key and every key in a final state has its file. File and key writes/deletes are ordered to preserve this across crashes.
Storage tracks retention
Disk usage scales with retention_chunks, not with uptime: files and keys for ledger ranges below the effective retention floor are pruned as the floor advances.
Data model
Durable state lives in two places: the catalog RocksDB (state markers and config pins) and the filesystem (immutable files, plus one per-chunk hot RocksDB holding in-progress data during ingestion).

On disk
├── meta/rocksdb/                                  ← catalog (WAL always on)
├── hot/{chunk:08d}/                               ← per-chunk hot RocksDB (transient)
├── ledgers/{bucket:05d}/{chunk:08d}.pack
├── events/{bucket:05d}/{chunk:08d}-events.pack    (+ -index.pack, -index.hash)
└── txhash/
    ├── raw/{bucket:05d}/{chunk:08d}.bin           ← transient until window finalization (or retention pruning)
    └── index/{window:08d}/{lo:08d}-{hi:08d}.idx   ← one frozen file per window, coverage-named
The .bin is the interesting transient: it is the input to buildTxhashIndex, retained while its chunk is still within the window's live [lo, hi] coverage (at each boundary the rebuild reads every in-coverage .bin). When the window finalizes, the terminal build's commit batch demotes its inputs to "pruning" and the sweep removes them. Under retention narrower than a window, a chunk drops below the floor before its window completes, so retention pruning removes its .bin instead.

The chunk hot DB

One RocksDB per chunk at hot/{chunk:08d}/, holding everything for that chunk not yet materialized to cold artifacts. The data types are column families of one instance; they share the instance's WAL, so each ledger commits as one atomic WriteBatch across all CFs.

| Column family | Holds | Serves |
|---|---|---|
| ledgers | compressed LCMs, keyed by seq | getLedger for the live chunk; the source processChunk reads at freeze |
| txhash | tx hash → seq | getTransaction for the live chunk |
| events CFs | live events (schema per the events doc) | getEvents for the live chunk |
Catalog keys
Three groups: per-chunk artifact state, hot DB state, and config pins. Lifecycle states are shared by every artifact key in the system:

| Key | Meaning |
|---|---|
| chunk:{c}:ledgers | Per-chunk .pack file state. |
| chunk:{c}:txhash | Per-chunk .bin file state. Transient: removed at window finalization, or by retention pruning if its chunk ages out first. |
| chunk:{c}:events | Per-chunk events cold segment state. |
| index:{w}:{lo}:{hi} | One key per index coverage. The key name carries the coverage and maps 1:1 to the file {lo}-{hi}.idx; the value is pure lifecycle state. At most one coverage per window is "frozen" at any moment. |
| hot:chunk:{c} | "ready" = dir exists and is usable; "transient" = a directory operation (create or delete) is in flight. The recovery is the same either way, which is why one value suffices. |
| config:earliest_ledger | Written on first start, immutable thereafter (startup aborts on mismatch). |
Artifact lifecycles
Three state machines cover every durable thing in the system.

One write protocol

Every durable artifact, per-chunk files and index coverages alike, uses the same protocol, mark-then-write: put "freezing" before any I/O; write the file; fsync the file and its dirent(s); flip the key to "frozen". The pre-mark guarantees every file on disk has a key, so all cleanup is key-driven. Deletion mirrors it: demote, then unlink the file before the key, with an fsyncDir barrier between, giving the complementary guarantee, key absent ⟹ file gone.
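The ordering can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the daemon's implementation: `Catalog` is a hypothetical in-memory map standing in for the catalog RocksDB (so the durability of key writes is not modeled), `freeze`, `remove`, and `demo` are illustrative names, and the key and file name used in `demo` are examples only. Directory fsync via `os.File.Sync` is assumed to be supported, as it is on Linux.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Catalog is a stand-in for the catalog RocksDB. It is an in-memory map,
// so the durability of key writes is NOT modeled in this sketch.
type Catalog map[string]string

// fsyncDir flushes a directory so created or removed entries are durable.
func fsyncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// freeze follows mark-then-write: key first, bytes next, fsyncs, then flip.
func freeze(cat Catalog, key, path string, data []byte) error {
	cat[key] = "freezing" // pre-mark: any file at path now has a key
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // fsync the file contents
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := fsyncDir(filepath.Dir(path)); err != nil { // fsync the dirent
		return err
	}
	cat[key] = "frozen"
	return nil
}

// remove mirrors freeze: demote, unlink, directory barrier, then drop the key.
func remove(cat Catalog, key, path string) error {
	cat[key] = "pruning"
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := fsyncDir(filepath.Dir(path)); err != nil {
		return err
	}
	delete(cat, key) // key absent implies file gone
	return nil
}

// demo runs one freeze and one remove in a temp dir and checks both guarantees.
func demo() error {
	dir, err := os.MkdirTemp("", "freeze-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	cat := Catalog{}
	key, path := "chunk:5350:ledgers", filepath.Join(dir, "00005350.pack")
	if err := freeze(cat, key, path, []byte("ledger bytes")); err != nil {
		return err
	}
	if _, err := os.Stat(path); err != nil || cat[key] != "frozen" {
		return fmt.Errorf("after freeze: state=%q statErr=%v", cat[key], err)
	}
	if err := remove(cat, key, path); err != nil {
		return err
	}
	_, statErr := os.Stat(path)
	if _, ok := cat[key]; ok || !os.IsNotExist(statErr) {
		return fmt.Errorf("after remove: keyPresent=%v statErr=%v", ok, statErr)
	}
	return nil
}

func main() {
	fmt.Println("demo error:", demo())
}
```

A crash between any two steps of `freeze` leaves a "freezing" key, with or without a partial file, which is exactly the state a key-driven sweep can find and clean up.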
+
Progress is derived, never stored

There is no stored progress value. The hot DB's synced per-ledger WriteBatch is the durable commit; recording it again in the catalog would create a second copy of the same fact. Instead, startup recomputes the exact last committed ledger from the catalog, and during operation ingestion hands the lifecycle each chunk as it completes. The recomputation leans on one key-creation invariant: a hot:chunk key is created only after every ledger below its chunk has durably committed, so everything below the highest hot key is complete, and a single read of the live hot DB pins the exact ledger inside it.
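The derivation can be sketched as a pure function over durable state. The names here are hypothetical: `readyChunks` stands for the chunk IDs whose hot:chunk key is "ready", and `maxSeqInHotDB` stands for the single read of the live hot DB. The chunk arithmetic assumes chunk c starts at ledger c × 10,000 + 2, which matches the document's boundary example but is not stated as a formula in the text.

```go
package main

import "fmt"

const (
	genesisLedger   = 2
	ledgersPerChunk = 10_000
)

// chunkFirst returns the first ledger of chunk c under the assumed mapping.
func chunkFirst(c uint32) uint32 { return c*ledgersPerChunk + genesisLedger }

// lastCommitted derives progress from durable state only.
// It returns ok=false when no hot key exists at all.
func lastCommitted(readyChunks []uint32, maxSeqInHotDB func(c uint32) (uint32, bool)) (uint32, bool) {
	if len(readyChunks) == 0 {
		return 0, false
	}
	live := readyChunks[0]
	for _, c := range readyChunks[1:] {
		if c > live {
			live = c // the highest hot key is treated as the live chunk
		}
	}
	if seq, ok := maxSeqInHotDB(live); ok {
		return seq, true // one read pins the exact ledger inside the live chunk
	}
	// The key exists but holds no ledger yet. By the key-creation invariant,
	// everything below this chunk has durably committed.
	return chunkFirst(live) - 1, true
}

func main() {
	read := func(c uint32) (uint32, bool) {
		if c == 5350 {
			return 53_500_123, true
		}
		return 0, false
	}
	fmt.Println(lastCommitted([]uint32{5349, 5350}, read))
}
```

Because the result depends only on keys and one hot-DB read, there is no second copy of the progress fact that could disagree with the first.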
+
The rolling tx-hash index

The current window's index is re-derived from scratch on every chunk boundary to absorb the chunk that just froze, growing until its window is complete. Only the window the network tip is in is ever rebuilt; a completed window's index is finalized (its .bin inputs swept) and never touched again. The rebuild is cheap relative to the cadence: a full-window streamhash build is ≈1 minute against a chunk boundary every ~14 hours at mainnet rates.

.bin files make this affordable: processChunk sorts each chunk's ~3M entries in memory before writing, so the rebuild feeds streamhash sorted keys, its fast, low-memory sorted-builder mode. Transient .bin disk is bounded by the windows actually in flight (floor: one dense window ≈ 60 GB), because a finalized window's inputs are deleted as soon as its final index is built.
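The figures above can be cross-checked with a little arithmetic. The sketch below assumes a ledger close time of about 5 seconds (an assumption, not a number from the text) and reads "60 GB" as 60 × 10⁹ bytes; the helper names are illustrative.

```go
package main

import "fmt"

// boundaryHours returns the hours between chunk boundaries for a close time.
func boundaryHours(ledgersPerChunk int, closeSeconds float64) float64 {
	return float64(ledgersPerChunk) * closeSeconds / 3600
}

// bytesPerEntry returns the average .bin entry size implied by a window's size.
func bytesPerEntry(windowBytes, chunksPerWindow, entriesPerChunk int64) int64 {
	return windowBytes / (chunksPerWindow * entriesPerChunk)
}

func main() {
	// 10,000 ledgers at an assumed 5 s each.
	fmt.Printf("boundary every %.1f h\n", boundaryHours(10_000, 5)) // 13.9 h
	// One dense window: 1,000 chunks of ~3M entries in ~60 GB.
	fmt.Println("bytes per entry:", bytesPerEntry(60_000_000_000, 1_000, 3_000_000)) // 20
	// ~3M entries per 10,000-ledger chunk.
	fmt.Println("tx per ledger:", 3_000_000/10_000) // 300
}
```

So a ≈1 minute rebuild consumes on the order of 0.1% of the interval between boundaries, and the quoted window size corresponds to roughly 20 bytes per tx-hash entry at about 300 transactions per ledger.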
A chunk boundary, end to end

The micro view: ledger 53,510,001 closes chunk 5350 (window 5, floor pinned at chunk 5100 by earliest_ledger, frozen index covering chunks 5100–5349).

Backfill & the resolver
Backfill has a contract (given a range, ensure every artifact derived from every ledger in it is durable and servable) and resolves what's missing before scheduling anything, so a restart re-plans from what is on disk instead of redoing finished work. Each artifact kind contributes one rule that compares its postcondition against the catalog and emits the difference as tasks:

- ledgers / events (per-chunk): needed for chunk c iff the key isn't "frozen".
- txhash (per-window): compare the stored coverage (from the window's unique frozen index key) with the desired coverage [max(window_start, floor), min(window_last, range_end)]. Desired ⊆ stored → schedule nothing. Desired exceeds stored → request .bin production for every chunk in the desired range (already-frozen ones self-skip; previously-covered ones re-derive from the local .pack) and emit one buildTxhashIndex(w, desired_lo, desired_hi).

The plan is just a value: pure data recomputed from durable keys on every run, so a restart re-plans from what is actually on disk with nothing to resume and nothing to reconcile. And the comparison can trust "frozen" blindly: input keys are demoted in the same synced write that freezes the terminal coverage, and files are only ever deleted by sweeps under non-frozen keys, so no crash can leave a frozen key whose file is gone.
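The per-window txhash rule reduces to an interval comparison. The following is a rough sketch, with hypothetical names (`Coverage`, `desiredCoverage`, `planTxhash`) and all bounds expressed as inclusive chunk IDs; it models only the decision, not the .bin scheduling that accompanies it.

```go
package main

import "fmt"

const chunksPerWindow = 1_000

// Coverage is an inclusive chunk range [Lo, Hi].
type Coverage struct{ Lo, Hi uint32 }

// desiredCoverage computes [max(window_start, floor), min(window_last, range_end)].
// ok is false when the window does not intersect the requested range.
func desiredCoverage(w, floor, rangeEnd uint32) (Coverage, bool) {
	lo, hi := w*chunksPerWindow, w*chunksPerWindow+chunksPerWindow-1
	if floor > lo {
		lo = floor
	}
	if rangeEnd < hi {
		hi = rangeEnd
	}
	return Coverage{lo, hi}, lo <= hi
}

// planTxhash returns the coverage to build for window w, or ok=false when
// the stored (frozen) coverage already contains the desired one.
func planTxhash(w, floor, rangeEnd uint32, stored *Coverage) (Coverage, bool) {
	want, ok := desiredCoverage(w, floor, rangeEnd)
	if !ok {
		return Coverage{}, false
	}
	if stored != nil && stored.Lo <= want.Lo && want.Hi <= stored.Hi {
		return Coverage{}, false // desired ⊆ stored: schedule nothing
	}
	return want, true // emit one index build over the desired coverage
}

func main() {
	stored := Coverage{5100, 5349}
	build, ok := planTxhash(5, 5100, 5350, &stored)
	fmt.Println(build, ok) // {5100 5350} true
}
```

Because the function takes only durable inputs, calling it again after a crash yields the same answer as the call that was interrupted.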
The execution model

executePlan is map/reduce without the shuffle or the job tracker: chunk builds are the maps, index builds are the per-window reduces, and completion is recorded as the artifacts themselves. Dependencies are simple, and nothing is persisted:

- The dependency structure is two strata with one edge type (an index build waits on the chunk builds inside its coverage), expressed directly with done-channels. Thousands of goroutines may exist, parked on a single worker semaphore (cfg.Workers, the only concurrency knob); at most Workers tasks execute at any instant.
- Done-channels signal success: a chunk build closes its channel only once its .bin is frozen, so an index build proceeds only when every input it needs exists. A chunk build that exhausts its retries leaves its channel open and returns an error, which cancels the group context; any dependent waiting on it unblocks through the <-gctx.Done() case and bails. The daemon aborts and a restart re-resolves from durable keys.
- resolve re-plans from the artifact keys on every run, so completed work never repeats and interrupted work needs no reconciliation.

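The done-channel pattern described in the list can be sketched with the standard library. This is an illustration under assumptions, not the actual executePlan: `runPlan`, `buildChunk`, `buildIndex`, and `demo` are hypothetical names, a single index build stands in for the per-window reduces, and `context.WithCancelCause` (Go 1.20+) plays the role of the group context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// runPlan runs one chunk build per chunk, then one index build that depends
// on all of them. At most workers task bodies execute at any instant.
func runPlan(ctx context.Context, workers int, chunks []uint32,
	buildChunk func(c uint32) error, buildIndex func() error) error {

	gctx, cancel := context.WithCancelCause(ctx)
	defer cancel(nil)

	sem := make(chan struct{}, workers) // the single worker semaphore
	done := make(map[uint32]chan struct{}, len(chunks))
	for _, c := range chunks {
		done[c] = make(chan struct{})
	}
	acquire := func() bool {
		select {
		case sem <- struct{}{}:
			return true
		case <-gctx.Done():
			return false
		}
	}

	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		go func(c uint32) {
			defer wg.Done()
			if !acquire() {
				return
			}
			defer func() { <-sem }()
			if err := buildChunk(c); err != nil {
				cancel(err) // channel stays open; dependents see gctx.Done()
				return
			}
			close(done[c]) // closed on success only
		}(c)
	}

	wg.Add(1)
	go func() {
		defer wg.Done()
		for _, c := range chunks { // wait for inputs without holding a slot
			select {
			case <-done[c]:
			case <-gctx.Done():
				return
			}
		}
		if !acquire() {
			return
		}
		defer func() { <-sem }()
		if err := buildIndex(); err != nil {
			cancel(err)
		}
	}()

	wg.Wait()
	if gctx.Err() != nil {
		return context.Cause(gctx)
	}
	return nil
}

// demo runs a three-chunk plan; failChunk < 0 means no chunk build fails.
func demo(failChunk int64) (indexBuilt bool, err error) {
	err = runPlan(context.Background(), 2, []uint32{5348, 5349, 5350},
		func(c uint32) error {
			if int64(c) == failChunk {
				return errors.New("chunk build: retries exhausted")
			}
			return nil
		},
		func() error { indexBuilt = true; return nil })
	return indexBuilt, err
}

func main() {
	fmt.Println(demo(-1))
	fmt.Println(demo(5349))
}
```

The index goroutine waits on its inputs before taking a semaphore slot, so a dependent can never starve the builds it is waiting for, even with a single worker.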
The same resolve + executePlan pair is the lifecycle run's first stage: one scheduler, two callers, so the two regimes can never disagree about what "done" looks like. processChunk's source selection (backfillSource) is also shared: a ready, complete hot DB beats the local .pack, which beats the backfill backend. That ordering is exactly what lets the lifecycle's freeze be ordinary plan execution rather than a special path.

Startup: the backfill loop

Before it serves anything, the daemon runs backfill in a loop until on-disk coverage reaches the last complete chunk at the network tip. Each pass re-reads the tip, plans [floor, last complete chunk], and executes it; if the tip advanced while the pass ran, another pass picks up the chunks it moved past. The partial chunk still forming at the tip is never backfilled: its ledgers are already in the live hot DB, and hot-DB ingestion finishes it. When the loop exits, the daemon opens the resume chunk's hot DB, seeds the lifecycle, and starts serving.

The same resolve + executePlan pair runs here and in every lifecycle run; startup just drives the bottom of storage down to the floor, where the running lifecycle never reaches.
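The loop's exit condition can be sketched as follows. The callbacks are hypothetical stand-ins: `lastCompleteChunk` re-reads the network tip and `backfill` plays the role of resolve + executePlan over an inclusive chunk range, assumed to be a no-op for work that is already durable.

```go
package main

import "fmt"

// backfillToTip repeats plan-and-execute until coverage reaches the last
// complete chunk at the tip. It returns the covered chunk and the pass count.
func backfillToTip(floor uint32, lastCompleteChunk func() (uint32, error),
	backfill func(lo, hi uint32) error) (uint32, int, error) {

	var covered uint32
	passes := 0
	for {
		hi, err := lastCompleteChunk() // re-read the tip on every pass
		if err != nil {
			return covered, passes, err
		}
		if passes > 0 && hi <= covered {
			return covered, passes, nil // tip did not move past our coverage
		}
		if err := backfill(floor, hi); err != nil {
			return covered, passes, err
		}
		covered = hi
		passes++
	}
}

func main() {
	// A fake tip that advances by one chunk while the first pass runs.
	tips, i := []uint32{5349, 5350, 5350}, 0
	tip := func() (uint32, error) {
		t := tips[i]
		if i < len(tips)-1 {
			i++
		}
		return t, nil
	}
	covered, passes, err := backfillToTip(5100, tip,
		func(lo, hi uint32) error { return nil })
	fmt.Println(covered, passes, err) // 5350 2 <nil>
}
```

Each later pass is cheap in this model because the resolver only emits tasks for chunks the previous pass did not already freeze.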
Concurrency: two writers, one fence
Two writers; readers only read. Their domains partition at the live chunk, and the partition itself is encoded in the catalog: the lifecycle's derivation treats the highest hot key as the live chunk and touches only what lies below it.

Ingestion loop: owns the live chunk
The only writer of the live chunk's hot DB, and the creator of each chunk's hot:chunk:{c} key. One synced WriteBatch per ledger; no progress variable at all.
Lifecycle goroutine: owns everything below
Handed-off hot DBs (freeze + discard), all chunk:* and index:* keys, and the deletion side of hot:chunk:*. The run's plan stage fans out to the bounded worker pool, with every worker operating strictly below the live chunk.

The handoff fence is the boundary's write order: the ingestion loop closes its write handle and opens the next chunk (which moves the partition, since the closed chunk now lies below the new live chunk) before it hands the completed chunk to the lifecycle on the channel. So by the time the lifecycle freezes and discards it, no writer holds it.

The only connection between the goroutines is a buffered channel of depth lifecycleQueueDepth, which carries each chunk that ingestion has just completed. The lifecycle drains it to the highest value on each wake and plans up to that chunk, folding a backlog of boundaries into one run. The value sets only the run's range; the work is still gated by durable keys: resolve and the scans decide what to build, discard, and prune. A send onto a full buffer means the lifecycle has fallen lifecycleQueueDepth boundaries behind ingestion, which is a fatal "freeze can't keep up," never a silent drop (the depth sits well above the at-most-one signal a healthy daemon holds).
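Both ends of the channel can be sketched in a few lines. The function names and the depth of 4 are illustrative assumptions (the text does not give the real depth); the pattern is a non-blocking send on the ingestion side and a drain-to-maximum on the lifecycle side.

```go
package main

import (
	"errors"
	"fmt"
)

const lifecycleQueueDepth = 4 // illustrative value only

var errFreezeBehind = errors.New("freeze can't keep up: lifecycle queue full")

// notify is the ingestion side: a non-blocking send that reports a full
// buffer as an error instead of dropping the boundary or blocking ingestion.
func notify(ch chan<- uint32, chunk uint32) error {
	select {
	case ch <- chunk:
		return nil
	default:
		return errFreezeBehind
	}
}

// drainHighest is the lifecycle side: block for one value, then fold any
// backlog into the highest chunk seen. ok is false if the channel is closed
// and empty.
func drainHighest(ch <-chan uint32) (uint32, bool) {
	hi, ok := <-ch
	if !ok {
		return 0, false
	}
	for {
		select {
		case c, ok := <-ch:
			if !ok {
				return hi, true
			}
			if c > hi {
				hi = c
			}
		default:
			return hi, true
		}
	}
}

func main() {
	ch := make(chan uint32, lifecycleQueueDepth)
	for c := uint32(5350); c < 5354; c++ {
		fmt.Println("send", c, notify(ch, c))
	}
	fmt.Println("send 5354:", notify(ch, 5354)) // buffer is full
	fmt.Println(drainHighest(ch))               // 5353 true
}
```

In this sketch the caller of `notify` is expected to treat the error as fatal, matching the "never a silent drop" rule.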
+
The reader contract
A read resolves data through two rules, and the rest of the design relies on both:

- Only "ready" and "frozen" are visible. A read resolves a chunk only from a "ready" hot DB or a "frozen" cold file, never a key in a transient state ("freezing", "pruning", "transient"). So a reader never sees a half-written file, crash debris, or an in-flight sweep.
- Below the floor is not-found. A read for any seq below the retention floor returns not-found regardless of what's still on disk. This is the contract that lets pruning unlink files unilaterally (a stale .idx may resolve a hash to a .pack that's been deleted, but the below-floor read is not-found anyway).

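The two rules compose into a single gate. A minimal sketch, with hypothetical function names and the floor expressed as a ledger sequence:

```go
package main

import "fmt"

// visible reports whether a catalog state may serve reads.
func visible(state string) bool {
	return state == "ready" || state == "frozen"
}

// resolvable applies both rules: the floor check first, then visibility.
func resolvable(seq, floorSeq uint32, state string) bool {
	if seq < floorSeq {
		return false // below the floor is not-found, whatever is on disk
	}
	return visible(state)
}

func main() {
	const floorSeq = 51_000_002 // first ledger of chunk 5100 under the assumed mapping
	fmt.Println(resolvable(53_500_123, floorSeq, "ready"))    // true
	fmt.Println(resolvable(53_500_123, floorSeq, "freezing")) // false
	fmt.Println(resolvable(50_000_000, floorSeq, "frozen"))   // false
}
```

Checking the floor before the state is what makes a leftover frozen file below the floor harmless to readers.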
Together these make retention the single source of truth for "is this available?". Everything else about serving a read (how the reader picks the tier, probes the right window, and stays correct while a sweep unlinks a file mid-read) is the query-routing design's concern, out of scope here (and in the streaming doc).

Correctness
Settled means the run's plan is empty and both scans produce empty op lists: the state the system returns to between boundaries, and the state in which the invariants below are auditable on a live daemon. From any storage state (partial-completion crashes, operator actions, surgical recovery), startup (backfill + the first run) drives the system to settled, satisfying all four.
▸ INV-1 Read correctness
▸ INV-2 Single canonical state
[…] chunk:c:txhash key survives in a finalized window. Two transients are tolerated even at settled: a hot DB's "transient" bracket around an in-flight directory op, and, after surgical recovery, a partially-frozen chunk above the last committed ledger (no read can observe it). Audit: walk catalog keys, cross-check forbidden co-existence.
▸ INV-3 Disk matches catalog
▸ INV-4 Retention bound
[…] lo it was built with (its below-floor coverage is never served; the reader gate returns not-found). Audit: walk catalog keys, compare ledger ranges to the floor.
[…] audit admin command can implement the walks directly.
[…] hot:chunk at or above the lowest tainted chunk (the live chunk always included) to "transient". Why the whole tail, not just the tainted chunk: the hot tier is repaired only by re-ingestion, which replays forward from the last committed ledger (the highest "ready" hot chunk). To replay a tainted hot chunk the watermark must first fall below it, and since it's the max over all "ready" keys, that means demoting every hot DB at or above the lowest tainted one. Then captive core re-ingests the tail forward; the untainted chunks swept up in the demotion are re-derived byte-identically. (A lost hot volume is the same recovery, triggered by loss rather than taint.)
What a bug looks like
Common bugs land as concrete, detectable violations:

| Symptom | Violates | Detected by |
|---|---|---|
| A key flips "frozen" before fsync; key's {lo,hi} doesn't match the file; a frozen file mutated post-freeze | INV-1 | re-derive via a conformant backend, byte-compare |
| Pruning too aggressive: an in-retention read returns wrong/missing results | INV-1 | issue reads |
| Two frozen index keys in one window (promotion and demotion landed as separate writes) | INV-2 | walk index:*, count "frozen" per window |
| A "freezing"/"pruning" key survives at settled | INV-2 | walk keys for transient values at settled |
| A hot DB persists for a chunk cold artifacts fully serve | INV-2 | walk hot:chunk:* against coverage |
| Finalization demotions don't complete: .bin keys outlive their terminal index | INV-2 | walk chunk:c:txhash in finalized windows |
| A file on disk without its key (orphan, invisible to every key-driven scan) | INV-3 | walk filesystem against catalog |
| A key without its file (dangling) | INV-3 | walk catalog against filesystem |
| Duplicate cold artifacts for the same logical data | INV-3 | walk filesystem against key-specified paths |
| Files or keys remain below the retention floor | INV-4 | walk keys against the floor |
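One of the walks in the table, counting "frozen" index keys per window, can be sketched over a plain key/value snapshot. The `map[string]string` snapshot and the function name are assumptions for illustration; the sketch relies only on the index:{w}:{lo}:{hi} key shape and treats the window field as an opaque string, since the text does not specify how numbers are formatted in keys.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// windowsWithMultipleFrozen returns every window that has more than one
// index key in the "frozen" state, which INV-2 forbids.
func windowsWithMultipleFrozen(catalog map[string]string) []string {
	count := map[string]int{}
	for key, state := range catalog {
		parts := strings.Split(key, ":")
		if len(parts) != 4 || parts[0] != "index" || state != "frozen" {
			continue // not an index key, or not frozen
		}
		count[parts[1]]++ // parts[1] is the window
	}
	var bad []string
	for w, n := range count {
		if n > 1 {
			bad = append(bad, w)
		}
	}
	sort.Strings(bad) // deterministic output for reports
	return bad
}

func main() {
	snapshot := map[string]string{
		"index:4:4000:4999":  "frozen",
		"index:5:5100:5349":  "frozen", // should have been demoted
		"index:5:5100:5350":  "frozen",
		"chunk:5350:txhash":  "frozen",
		"hot:chunk:5351":     "ready",
	}
	fmt.Println(windowsWithMultipleFrozen(snapshot)) // [5]
}
```

The other key-side audits in the table follow the same shape: one pass over the keys, one predicate per invariant.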
Why convergence works
Three properties shared by the resolver and the scans, plus backfill's postcondition contract:

- Eligibility from durable state alone: every decision derives from catalog keys; nothing depends on in-memory history.
- Idempotent ops: re-running any half-finished op is safe; re-materialization overwrites at canonical paths, sweeps re-run until the key is gone.
- Everything re-derived on every notification: there is no persisted plan to drift.

Runtime op failure aborts the daemon (after bounded retries) rather than deferring silently. This is safe because startup is the recovery path: every state a run can leave behind is one that startup is built to converge from.