diff --git a/design-docs/full-history-design-explorer.html b/design-docs/full-history-design-explorer.html new file mode 100644 index 000000000..e16290f23 --- /dev/null +++ b/design-docs/full-history-design-explorer.html @@ -0,0 +1,1864 @@ + + + + + +Full-History Design — Interactive Explorer + + + +
+ +
+ +
+
Full-History RPC · Interactive Design Explorer
+

The Full-History Streaming Design

+
+ How the full-history daemon backfills old history, ingests live ledgers, freezes immutable history, + and serves transactions by hash — explained with interactive models you can poke at. + Companion to the streaming workflow and + getTransaction design docs; the markdown remains the + normative spec. Each section links to the doc that owns it. +
+
+ + +
+

The big picture

+ +

+ Full-history RPC runs as one daemon in one mode. There is no separate backfill command and no + explicit step for the operator: on startup the daemon figures out how far behind the network tip it is + and backfills to it automatically, then serves live ledgers as they're produced. +

+
+
startup

1 · Backfill

+

Runs bulk backfill as a subroutine: any chunk inside the retention window that isn't already + frozen is pulled from the configured LedgerBackend (BSB by default) — skipping the tip chunk that + captive core is actively ingesting. Covers first-ever start, downtime gaps, and retention widening.

+
steady state

2 · Ingest

+

Streams live ledgers from CaptiveStellarCore into one hot RocksDB per chunk — + ledgers, tx hashes, and events as column families, written as one atomic synced WriteBatch per + ledger. A ledger is either fully in the hot DB or absent.

+
steady state

3 · Freeze & prune

+

A background goroutine wakes on each chunk boundary and runs one run: freeze the completed + chunk to immutable files, rebuild the current tx-hash index to fold it in, discard hot DBs the cold + artifacts now serve, and prune everything superseded or past retention.

+
+ +
+
Data flow
+
Two sources feed one set of artifacts. Whatever produced the bytes, the artifacts — and the catalog keys that catalog them — are identical.
+
+ + + + + + + + + CaptiveStellarCore + live ledgers at the tip + + + Object store (BSB) + or any conformant backend + + + + Hot RocksDB · one per chunk + column families: + ledgers · txhash · events + serves reads for the live chunk + + + + processChunk + one streaming pass over 10,000 LCMs + + + + {chunk}.pack + + events segment + + {chunk}.bin + per-chunk, write-once + + + per-window .idx (streamhash MPHF) + rebuilt from .bin files on every chunk boundary + + + + stream + + backfill + + freeze at the chunk + boundary (hot branch) + + + + + k-way merge + + + + + catalog RocksDB — catalogs every file and directory above: mark-then-write keys, synced WAL, no directory is ever listed to find work + + +
+
+ +
+ The one-sentence summary: data is born hot (one RocksDB per chunk), becomes cold and immutable + at the chunk boundary (.pack / events segment / .bin → rolled into a per-window + .idx), and every transition is recorded in a catalog key before the bytes move — + so a crash at any instant is recoverable from keys alone. +
+
+ + +
+

Geometry

+ +

+ The chain starts at ledger 2 (GENESIS_LEDGER). Two units organize all storage: +

+
    +
  • Chunk — 10,000 ledgers (hardcoded). The atomic unit of ingestion, freezing, and crash recovery.
  • +
  • Window — 1,000 chunks = 10,000,000 ledgers (hardcoded). The unit of + the rolling tx-hash index.
  • +
+ +
+
Geometry explorer
+
Drag the slider or type any ledger sequence to see where it lives. All ids are zero-padded %08d; file buckets group 1000 chunks (%05d).
+
+ + +
+ + + + +
+
+ +
+
+
WINDOW — 1000 chunks · 10,000,000 ledgers
+
+
+
CHUNK — 10,000 ledgers
+
+
+
+
+
+
+ +
+ The file-bucket size (fixed at 1000 chunks) and the window size (1,000 chunks) coincide + numerically — but they are different concepts: buckets are purely a filesystem concern and never + appear in catalog keys; windows define the tx-hash index layout. +
+
+ + +
+

The four guarantees

+ +

+ The daemon is built around four guarantees over its data. Everything else in the design — the write + protocol, the derived last committed ledger, the key-driven sweeps — exists to maintain these through any crash at + any instant. +

+
+

Retention is complete

+

No gaps within the retention window — for every ledger from the retention floor up to + the last committed ledger, all data derived from it (transactions, + events) is present on disk and can serve any request that falls entirely inside the window.

+

Cold is canonical, hot is transient

+

Frozen chunks and finalized indexes live in immutable cold artifacts. A chunk's hot DB is discarded + once every cold artifact derived from it is durable and the rolling index covers it — so a tx + lookup always has exactly one home: the hot DB until coverage, the .idx after.

+

The catalog catalogs what's on disk

+

Disk content is exactly what the catalog specifies — every file is named by a catalog key and + every key in a final state has its file. File and key writes/deletes are ordered to preserve this + across crashes.

+

Storage tracks retention

+

Disk usage scales with retention_chunks, not with uptime — files and keys for ledger + ranges below the effective retention floor are pruned as the floor advances.

+
+
+ + +
+

Data model

+ +

+ Durable state lives in two places: the catalog RocksDB (state markers and config pins) and the + filesystem (immutable files, plus one per-chunk hot RocksDB holding in-progress data during + ingestion). +

+ +

On disk

+
{default_data_dir}/
+├── meta/rocksdb/ ← catalog (WAL always on)
+├── hot/{chunk:08d}/ ← per-chunk hot RocksDB (transient)
+├── ledgers/{bucket:05d}/{chunk:08d}.pack
+├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash)
+└── txhash/
+    ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization (or retention pruning)
+    └── index/{window:08d}/{lo:08d}-{hi:08d}.idx ← one frozen file per window, coverage-named +
+
+ hot / transient-per-chunk + cold, persists until retention pruning + transient rebuild input +
+ +

+ The .bin is the interesting transient: it is the input to buildTxhashIndex, + retained while its chunk is still within the window's live [lo, hi] coverage (each boundary + the rebuild reads every in-coverage .bin). When the window finalizes, the terminal build's + commit batch demotes its inputs to "pruning" and the sweep removes + them — and under retention narrower than a window, a chunk drops below the floor before its window + completes, so retention pruning removes its .bin instead. +

+ +

The chunk hot DB

+

+ One RocksDB per chunk at hot/{chunk:08d}/, holding everything for that chunk not yet + materialized to cold artifacts. The data types are column families of one instance — they share the + instance's WAL, so each ledger commits as one atomic WriteBatch across all CFs. +

+ + + + + +
Column familyHoldsServes
ledgerscompressed LCMs, keyed by seqgetLedger for the live chunk; the source processChunk reads at freeze
txhashtx hash → seqgetTransaction for the live chunk
events CFslive events (schema per the events doc)getEvents for the live chunk
+ +

Catalog keys

+

Three groups: per-chunk artifact state, hot DB state, and config pins. Lifecycle states are shared by + every artifact key in the system:

+
+ "freezing" = file being written (or crashed mid-write) — delete or re-derive +
+
+ "frozen" = fsynced and durable — truth; the only state readers resolve +
+
+ "pruning" = queued for removal — finish the delete +
+ + + + + + + + +
KeyMeaning
chunk:{c}:ledgersPer-chunk .pack file state.
chunk:{c}:txhashPer-chunk .bin file state. Transient — removed at window finalization, or by retention pruning if its chunk ages out first.
chunk:{c}:eventsPer-chunk events cold segment state.
index:{w}:{lo}:{hi}One key per index coverage. The key name carries the coverage and maps 1:1 to the file {lo}-{hi}.idx; the value is pure lifecycle state. At most one coverage per window is "frozen" at any moment.
hot:chunk:{c}"ready" = dir exists and is usable; "transient" = a directory operation (create or delete) is in flight — the recovery is the same either way, which is why one value suffices.
config:earliest_ledgerWritten on first start, immutable thereafter (startup aborts on mismatch).
+
+ Key names carry identity; values carry only lifecycle. An index key's filename is derived from + its name by a fixed bijection — resolving a key to its file never reads the value or lists a directory. + Every file on disk, including a crashed attempt's partial, is reachable from its key alone. +
+
+ + +
+

Artifact lifecycles

+ +

+ Three state machines cover every durable thing in the system. Click any state to see what it means and + what recovery does if a crash leaves the system there. +

+
+
Per-chunk artifacts — .pack, events segment, .bin
+
+
Click a state above.
+
+
+
Index coverage — {lo}-{hi}.idx
+
The one logically-mutable cold artifact: "mutation" happens by freezing the next coverage and demoting the old one in a single atomic batch. The frozen file readers resolve is immutable until unlinked.
+
+
Click a state above.
+
+
+
Hot DB — hot/{chunk:08d}/
+
+
Click a state above.
+
+
+ + +
+

One write protocol

+ +

+ Every durable artifact — per-chunk files and index coverages alike — uses the same protocol, + mark-then-write: put "freezing" before any I/O; + write the file; fsync the file and its dirent(s); flip the key to + "frozen". The pre-mark guarantees every file on disk has a + key, so all cleanup is key-driven. Deletion mirrors it: demote, unlink the + file before the key, with an fsyncDir barrier between — giving the complementary + guarantee, key absent ⟹ file gone. +

+
+
Crash simulator
+
Pick a protocol, then click any step. The right panel shows the durable state after that step completes — and what happens if the process dies right there.
+
+
+
+
+
+
+
+ Why the dirent fsyncs matter: step 3 fsyncs the directory entries, not just the file, so a file's + (or a freshly created directory's) existence on disk is durable before its key flips to + "frozen" — which is why a write that creates its parent directory also + barriers the grandparent. +
+
+ A crashed index build is deleted, not salvaged: a rebuild re-derives byte-identical output (the merge + is a deterministic function of the coverage), so a partial + "freezing" file is just re-derived from scratch. +
+
+ + +
+

Progress is derived, never stored

+ +

+ There is no stored progress value. The hot DB's synced per-ledger WriteBatch is the durable commit; + recording it again in the catalog would create a second copy of the same fact. Instead, startup + recomputes the exact last committed ledger from the catalog, and during operation ingestion hands the + lifecycle each chunk as it completes. The recomputation leans on one key-creation invariant: a + hot:chunk key is created only after every ledger below its chunk has durably committed — so + everything below the highest hot key is complete, and a single read of the live hot DB pins the exact + ledger inside it. +

+
+
lastCommittedLedger — two terms, take the higher
+
The COLD term is the last ledger of the highest fully-frozen chunk; the HOT term reads the live hot DB for the exact ledger inside it. The higher chunk wins — HOT in steady state, COLD only when no hot chunk sits above the frozen ones. Pick a state startup might find:
+
+
+
+
+ Postcondition-driven backfill is what makes a recomputed last committed ledger safe: backfill converges + whole ranges, so recomputation can never skip a hole. And because nothing is stored, a lost hot + volume drops the recomputed answer to the last frozen boundary on its own (surgical recovery). +
+
+ + +
+

The rolling tx-hash index

+ +

+ The current window's index is re-derived from scratch on every chunk boundary to absorb the chunk + that just froze, growing until its window is complete. Only the window the network tip is in is ever + rebuilt; a completed window's index is finalized (its .bin inputs swept) and never touched + again. The rebuild is cheap relative to the cadence: a full-window streamhash build is ≈1 minute against + a chunk boundary every ~14 hours at mainnet rates. +

+
+
Rolling-window simulator
+
+ Scaled down to 8 chunks per window so you can watch it roll (real default: 1000). Each step is + one chunk boundary: the live chunk freezes, the window's coverage advances by one atomic + promote-and-demote, the hot DB is discarded once covered. Enable retention to watch the floor chase the + tip and lo rise. +
+
+ + + + +
+
+
+ live chunk (hot DB, being written) + hot DB awaiting coverage + frozen (.pack + events durable) + .bin present (rebuild input) + index coverage [lo, hi] + pruned (past retention) +
+
+
+
+ Why per-chunk .bin files make this affordable: processChunk sorts each + chunk's ~3M entries in memory before writing, so the rebuild feeds streamhash sorted keys — its fast, + low-memory sorted-builder mode. Transient .bin disk is bounded by the windows actually in + flight (floor: one dense window ≈ 60 GB), because a finalized window's inputs are deleted as soon as its + final index is built. +
+
+ Provisioning note: old and new coverage files coexist from the start of a rebuild's write until + the eager sweep's unlink, so the window dir transiently holds ~2× the index size (~25 GB at the end of a + dense full window), and the window-end rebuild writes ~12.5 GB in ~1 minute (~200 MB/s burst) — trivial + on instance NVMe, worth provisioning for on throughput-capped volumes like EBS gp3. +
+
+ + +
+

A chunk boundary, end to end

+ +

+ The micro view: ledger 53,510,001 closes chunk 5350 (window 5, floor pinned at + chunk 5100 by earliest_ledger, frozen index covering chunks 5100–5349). Step through every + write the boundary performs — watch the catalog, the filesystem, and where reads are served at each + instant. +

+
+
+ + + + +
+
+
+
+
+
+
+
+
+ Every arrow in this walkthrough is the one write protocol or its exit sweep. At the end of the run a + re-plan and re-scan find nothing to do — that settled is what makes the + invariant audits meaningful on a live daemon. +
+
+ + +
+

Backfill & the resolver

+ +

+ Backfill has a contract — given a range, ensure every artifact derived from every ledger in it is + durable and servable — and resolves what's missing before scheduling anything, so a restart + re-plans from what is on disk instead of redoing finished work. Each artifact kind contributes one rule + that compares its postcondition against the catalog and emits the difference as tasks: +

+
    +
  • ledgers / events (per-chunk): needed for chunk c iff the key isn't "frozen".
  • +
  • txhash (per-window): compare the stored coverage (from the window's unique frozen + index key) with the desired coverage [max(window_start, floor), min(window_last, range_end)]. + Desired ⊆ stored → schedule nothing. Desired exceeds stored → request .bin + production for every chunk in the desired range (already-frozen ones self-skip; previously-covered ones + re-derive from local .pack) and emit one + buildTxhashIndex(w, desired_lo, desired_hi).
  • +
+

+ The plan is just a value — pure data recomputed from durable keys on every run, so a restart + re-plans from what is actually on disk with nothing to resume and nothing to reconcile. And the + comparison can trust "frozen" blindly: input keys are demoted in the + same synced write that freezes the terminal coverage, and files are only ever deleted by sweeps under + non-frozen keys — no crash can leave a frozen key whose file is gone. +

+ +
+
Resolver playground
+
Six situations the daemon actually encounters. Solid bar = stored coverage (the frozen index key); dashed bar = desired coverage. The plan below is what resolve() emits.
+
+
+
+
+
+
+ +

The execution model

+

+ executePlan is map/reduce without the shuffle or the job tracker: chunk builds are the maps, + index builds are the per-window reduces, and completion is recorded as the artifacts themselves. + Dependencies are simple, and nothing is persisted: +

+
    +
  • The dependency structure is two strata with one edge type — an index build waits on the chunk builds + inside its coverage — expressed directly with done-channels. Thousands of goroutines may exist, parked + on a single worker semaphore (cfg.Workers, the only concurrency knob); at most + Workers tasks execute at any instant.
  • +
  • Done-channels signal success: a chunk build closes its channel only once its .bin + is frozen, so an index build proceeds only when every input it needs exists. A chunk build that exhausts + its retries leaves its channel open and returns an error, which cancels the group context; any dependent + waiting on it unblocks through the <-gctx.Done() case and bails — the daemon aborts and a + restart re-resolves from durable keys.
  • +
  • resolve re-plans from the artifact keys on every run, so completed work never repeats and + interrupted work needs no reconciliation.
  • +
+

+ The same resolve + executePlan pair is the lifecycle run's first stage — one + scheduler, two callers, so the two regimes can never disagree about what "done" looks like. + processChunk's source selection (backfillSource) is also shared: a ready, + complete hot DB beats the local .pack beats the backfill backend — which is exactly what lets + the lifecycle's freeze be ordinary plan execution rather than a special path. +

+ +
+
executePlan — bounded concurrency & the done-channel barrier
+
+ Five chunk builds feeding one window's index build, run under 2 worker slots. Step the scheduler: + chunk builds fill the slots, each closes its done-channel on success, and the index build stays parked + until every input it needs is frozen. Toggle a failure to watch a build leave its channel open, cancel + the group, and bail its dependents. +
+
+ + + + +
+
+
+
+
+
+ + +
+

Startup: the backfill loop

+
Normative spec: streaming — Startup
+

+ Before it serves anything, the daemon runs backfill in a loop until on-disk coverage reaches the last + complete chunk at the network tip. Each pass re-reads the tip, plans [floor, last complete + chunk], and executes it; if the tip advanced while the pass ran, another pass picks up the chunks it + moved past. The partial chunk still forming at the tip is never backfilled — its ledgers are already + in the live hot DB, and hot-DB ingestion finishes it. When the loop exits, the daemon opens the resume + chunk's hot DB, seeds the lifecycle, and starts serving. +

+
+
Startup walkthrough
+
+ Three situations, each running the real startStreaming loop arithmetic + (lastCompleteChunkAt, the near-tip/mid-chunk trim, the rangeEnd ≤ backfilledThrough + exit). Each card is one loop pass; the last is the hand-off to serve + ingest. +
+
+
+
+
+
+ The loop reads durable keys only, so it is its own crash recovery: a restart re-plans from what is on disk, + redoing no finished chunk and skipping no unfinished one. The same resolve + + executePlan pair runs here and in every lifecycle run — startup just drives the + bottom of storage down to the floor, where the running lifecycle never reaches. +
+
+ + +
+

Concurrency: two writers, one fence

+ +

+ Two writers; readers only read. Their domains partition at the live chunk, and the partition + itself is encoded in the catalog — the lifecycle's derivation treats the highest hot key as the live + chunk and touches only what lies below it. +

+
+
+

Ingestion loop — owns the live chunk

+

The only writer of the live chunk's hot DB, and the creator of each chunk's + hot:chunk:{c} key. One synced WriteBatch per ledger; no progress variable at all.

+
+
+

Lifecycle goroutine — owns everything below

+

Handed-off hot DBs (freeze + discard), all chunk:* and index:* keys, and + the deletion side of hot:chunk:*. The run's plan stage fans out to the bounded worker + pool — every worker operating strictly below the live chunk.

+
+
+

+ The handoff fence is the boundary's write order: the ingestion loop closes its write handle and + opens the next chunk — which moves the partition, since the closed chunk now lies below the new live + chunk — before it hands the completed chunk to the lifecycle on the channel. So by the time the + lifecycle freezes and discards it, no writer holds it. +

+

+ The only connection between the goroutines is the channel, which carries the chunk ingestion just + completed on a buffered channel of depth lifecycleQueueDepth. The lifecycle drains it to the + highest value each wake and plans up to that chunk, folding a backlog of boundaries into one run. The + value sets only the run's range; the work is still gated by durable keys — resolve + and the scans decide what to build, discard, and prune. A send onto a full buffer means the lifecycle has + fallen lifecycleQueueDepth boundaries behind ingestion — a fatal "freeze can't keep up," + never a silent drop (the depth sits well above the at-most-one signal a healthy daemon holds). +

+
+ + +
+

The reader contract

+ +

+ A read resolves data through two rules, and the rest of the design relies on both: +

+
    +
  1. Only "ready" and "frozen" are + visible. A read resolves a chunk only from a "ready" hot DB or a + "frozen" cold file — never a key in a transient state + ("freezing", "pruning", + "transient"). So a reader never sees a half-written file, crash + debris, or an in-flight sweep.
  2. +
  3. Below the floor is not-found. A read for any seq below the retention floor returns + not-found regardless of what's still on disk — the contract that lets pruning unlink files + unilaterally (a stale .idx may resolve a hash to a .pack that's been + deleted, but the below-floor read is not-found anyway).
  4. +
+

+ Together these make retention the single source of truth for "is this available?". Everything else + about serving a read — how the reader picks the tier, probes the right window, and stays correct while + a sweep unlinks a file mid-read — is the query-routing design's concern, out of scope here (and in + the streaming doc). The explorer below illustrates the cold-tier getTransaction probe from the + transactions design, for reference: +

+
+
Read-path explorer · query-routing, out of scope of the streaming doc
+
Three cold lookups over a multi-window retention. The chain shows each per-window probe.
+
+
+
+
+
+ + +
+

Correctness

+ +

+ settled means the run's plan is empty and both scans produce empty op lists — the state the + system returns to between boundaries, and the state in which the invariants below are auditable on a + live daemon. From any storage state — partial-completion crashes, operator actions, surgical + recovery — startup (backfill + the first run) drives the system to settled satisfying all four. +

+ +
+ INV-1 Read correctness +
Any data request whose ledger scope falls entirely within the retention window returns + correct results — content matches what a conformant LedgerBackend would produce, no partial state + visible, no in-retention range unreachable. Audit: issue reads, or re-derive artifacts via a + conformant backend and byte-compare. One transient exception: when surgical recovery demotes hot data down + to the live chunk, the last committed ledger rewinds and the floor regresses with it, briefly admitting a + few already-pruned bottom chunks — those reads fail soft (not-found, never wrong data) until the + floor re-advances.
+
+
+ INV-2 Single canonical state +
At most one "frozen" index key per window — + at all times, settled or not (the commit batch promotes and demotes in one write). At + settled: no key anywhere is "freezing" or + "pruning"; no hot DB persists for a chunk cold artifacts fully serve; + no chunk:c:txhash key survives in a finalized window. Two transients are tolerated even at + settled: a hot DB's "transient" bracket around an in-flight directory + op, and — after surgical recovery — a partially-frozen chunk above the last committed ledger + (no read can observe it). Audit: walk catalog keys, cross-check forbidden co-existence.
+
+
+ INV-3 Disk matches catalog +
At settled, the set of artifact files and hot DB directories on disk equals exactly + the set the catalog specifies — no orphan files, no dangling keys, no duplicate artifacts. A + non-key-named file in an index window dir is a real bug, not mid-run debris. Audit: walk the + filesystem against the catalog, both directions.
+
+
+ INV-4 Retention bound +
At settled, no file or catalog key maps to a ledger range strictly below the + effective retention floor — except a frozen index key whose window straddles the floor, which keeps the + lo it was built with (its below-floor coverage is never served; the reader gate returns + not-found). Audit: walk catalog keys, compare ledger ranges to the floor.
+
+ +
+ None of the invariants reference the phase scans that maintain them — so a bug in any scan shows up as a + real invariant violation, not as something the buggy code silently considers acceptable. An + audit admin command can implement the walks directly. +
+ +
+ Surgical recovery (tainted data). The operator never touches the filesystem — recovery is one + atomic catalog batch that demotes keys. Tainted cold artifacts go to + "freezing", and backfill re-derives them. For the hot tier, demote + every hot:chunk at or above the lowest tainted chunk — the live chunk always included — + to "transient". Why the whole tail, not just the tainted chunk: the + hot tier is repaired only by re-ingestion, which replays forward from the last committed ledger + (the highest "ready" hot chunk). To replay a tainted hot chunk the + watermark must first fall below it — and since it's the max over all + "ready" keys, that means demoting every hot DB at or above the lowest + tainted one. Then captive core re-ingests the tail forward; the untainted chunks swept up in the demotion + are re-derived byte-identically. (A lost hot volume is the same recovery, triggered by loss rather than + taint.) +
+ +

What a bug looks like

+

Common bugs land as concrete, detectable violations:

+ + + + + + + + + + + + +
SymptomViolatesDetected by
A key flips "frozen" before fsync; key's {lo,hi} doesn't match the file; a frozen file mutated post-freezeINV-1re-derive via a conformant backend, byte-compare
Pruning too aggressive — an in-retention read returns wrong/missing resultsINV-1issue reads
Two frozen index keys in one window (promotion and demotion landed as separate writes)INV-2walk index:*, count "frozen" per window
A "freezing"/"pruning" key survives served settledINV-2walk keys for transient values at settled
A hot DB persists for a chunk cold artifacts fully serveINV-2walk hot:chunk:* against coverage
Finalization demotions don't complete — .bin keys outlive their terminal indexINV-2walk chunk:c:txhash in finalized windows
A file on disk without its key (orphan — invisible to every key-driven scan)INV-3walk filesystem against catalog
A key without its file (dangling)INV-3walk catalog against filesystem
Duplicate cold artifacts for the same logical dataINV-3walk filesystem against key-specified paths
Files or keys remain below the retention floorINV-4walk keys against the floor
+ +

Why convergence works

+

Three properties shared by the resolver and the scans, plus backfill's postcondition contract:

+
    +
  • Eligibility from durable state alone — every decision derives from catalog keys; nothing depends on in-memory history.
  • +
  • Idempotent ops — re-running any half-finished op is safe; re-materialization overwrites at canonical paths, sweeps re-run until the key is gone.
  • +
  • Everything re-derived on every notification — there is no persisted plan to drift.
  • +
+

+ Runtime op failure aborts the daemon (after bounded retries) rather than deferring silently — safe + because startup is the recovery path: every state a run can leave behind is one startup is built + to converge. +

+
+ +
+ Interactive companion to full-history-streaming-workflow.md + (the daemon: backfill, ingestion, lifecycle, invariants) and + gettransaction-full-history-design.md + (the tx-by-hash subsystem: formats, the rolling index, the read path) — + the markdown is the normative spec; numbers here (chunk = 10,000 ledgers, window default = 1000 chunks, + build ≈ 1 min) come from those docs and the bench-fullhistory measurements they cite. + Re-synced to the current docs 2026-06-18. Self-contained; no external dependencies. +
+ + + +
+
+ + diff --git a/design-docs/full-history-streaming-workflow.md b/design-docs/full-history-streaming-workflow.md new file mode 100644 index 000000000..0c1e45fc2 --- /dev/null +++ b/design-docs/full-history-streaming-workflow.md @@ -0,0 +1,998 @@ +# Streaming Workflow + +## Overview + +Full-history RPC runs as one daemon in one mode: it both backfills old history and follows the live network. + +It keeps two tiers of data. **Hot** data is the most recent ledgers near the network tip, written append-only into RocksDB. **Cold** data is older ledgers, held as immutable files on disk. On startup RPC backfills to the current tip, then ingests new ledgers continuously into the hot DB; when the hot DB fills, it writes the immutable cold files for that ledger range and discards the hot DB. This migration from hot to cold is called **freezing**. + +The daemon does three things: + +- **Backfills on startup.** Before it serves anything, it runs backfill as a subroutine to bring what's on disk in line with the current retention window. It pulls every chunk inside that window that isn't already frozen from a configured `LedgerBackend` — by default BSB (the Buffered Storage Backend, which reads ledgers from an object store), or captive core or any other conformant backend if BSB isn't available. It skips the partial chunk still forming at the tip; hot-DB ingestion fills that one once it starts. This single mechanism covers a first-ever start, gaps left by downtime, and gaps opened by widening retention. +- **Ingests** live ledgers from `CaptiveStellarCore` into one hot RocksDB per chunk — ledgers, transaction hashes, and events as column families, written in one atomic batch per ledger. +- **Freezes** completed chunks to immutable files, **rebuilds** the current tx-hash index from its frozen inputs on every chunk boundary, and **prunes** superseded and past-retention artifacts. All run in a background lifecycle goroutine. + +--- + +## Geometry + +The Stellar blockchain starts at ledger 2 (`GENESIS_LEDGER`). Two units organize all storage; everything in this doc is described in terms of them: + +- **Chunk** — a run of 10,000 ledgers (hardcoded); the atomic unit of ingestion, freezing, and crash recovery. A hot DB holds at most one chunk, and each cold file — ledgers, events, transactions — spans exactly one chunk. +- **Window** — 1,000 chunks (10M ledgers); the unit of the rolling tx-hash index. The index is the one exception to the per-chunk rule: it maps transaction hashes to ledger sequences across a whole window. + +``` +chunkID(seq) = floor((seq - 2) / 10_000) +chunkFirstLedger(c) = c * 10_000 + 2 +chunkLastLedger(c) = (c + 1) * 10_000 + 1 +indexID(c) = c / 1000 # takes a CHUNK id +``` + +Chunk ids are **signed**, because `chunkID` uses floor division. The only id below 0 is **chunk −1**, meaning "before the first chunk." It comes up in one place: the "nothing ingested yet" sentinel `earliest_ledger - 1`, which maps to chunk −1 (and `chunkLastLedger(-1) = 1` maps back). Chunk −1 only ever appears in startup arithmetic; every chunk id written to disk is `≥ 0`. + +All chunk and window ids use uniform `%08d` zero-padding. Example (window = 1,000 chunks): + +| Window | First ledger | Last ledger | Chunks | +|---|---|---|---| +| 0 | 2 | 10,000,001 | 0–999 | +| 1 | 10,000,002 | 20,000,001 | 1000–1999 | +| N | N×10M + 2 | (N+1)×10M + 1 | N×1000 – (N+1)×1000−1 | + +--- + +## Configuration + +One TOML file (`--config`) configures the daemon. + +**[service]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `default_data_dir` | string | **required** | Base directory for the catalog and default storage paths. | + +**[backfill]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `workers` | int | `GOMAXPROCS` | Concurrent task slots for backfill. | +| `max_retries` | int | `3` | Retries per backfill task before the daemon aborts. | + +**[backfill.bsb]** — Buffered Storage Backend (the default backfill `LedgerBackend`; required **unless** another conformant `LedgerBackend` is configured as the backfill source — `backendNetworkTip`/`processChunk`'s default `source` all go through whichever backend is configured) + +| Key | Type | Default | Description | +|---|---|---|---| +| `bucket_path` | string | **required** | Remote object store path for LedgerCloseMeta (no `gs://` prefix for GCS). | +| `buffer_size` | int | `1000` | Prefetch buffer depth per connection. | +| `num_workers` | int | `20` | Download workers per connection. | + +**[immutable_storage.*]** — one optional `path` per artifact tree (defaults under `{default_data_dir}`): + +| Section | Default path | Holds | +|---|---|---| +| `[immutable_storage.ledgers]` | `{default_data_dir}/ledgers` | `.pack` files | +| `[immutable_storage.events]` | `{default_data_dir}/events` | events cold segments | +| `[immutable_storage.txhash_raw]` | `{default_data_dir}/txhash/raw` | transient `.bin` files | +| `[immutable_storage.txhash_index]` | `{default_data_dir}/txhash/index` | per-window `.idx` | + +**[catalog]** — optional `path` (default `{default_data_dir}/catalog/rocksdb`). + +**[logging]** — optional `level` (`debug`/`info`/`warn`/`error`, default `info`) and `format` (`text`/`json`, default `text`). + +**[streaming]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `retention_chunks` | uint32 | `0` | Retention window in chunks. `0` = full history. | +| `earliest_ledger` | uint32 \| `"genesis"` \| `"now"` | `"genesis"` | Earliest ledger this daemon will ever have data for — a fixed lower floor on history. Combined with `retention_chunks`, the effective floor is the higher of the two. Must be chunk-aligned; `"now"` resolves to the current network tip's chunk at first start. Resolved and stored on the first start (a reachable backend is required for `"now"` and numeric floors; see `validateConfig`), immutable thereafter. Setting it above genesis skips upfront backfill — useful when no fast backfill source is available and the daemon only follows the live network (`earliest_ledger = "now"`). | +| `captive_core_config` | string | **required** | Path to CaptiveStellarCore config file. | + +**[streaming.hot_storage]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `path` | string | `{default_data_dir}/hot` | Base path for hot RocksDB databases. | + +**CLI** + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--config` | string | **required** | Path to TOML config file. | + +--- + +## Data model + +The daemon's durable state lives in two places. The **catalog** — a small RocksDB — records what's on disk and the state each file is in, plus a few config values fixed on the first start. The **filesystem** holds the data itself: the immutable cold files, and one per-chunk hot RocksDB for data still being ingested. + +Throughout this section, `chunk` is a chunk id and `txhash_index` is a window id. + +### Filesystem artifacts + +The per-chunk artifacts are each written once at chunk freeze; the txhash index is rebuilt on each chunk boundary while its window is current and then finalized. All four are produced by [the primitives](#the-primitives): + +| Artifact | Granularity | Format | Produced by | +|---|---|---|---| +| Ledger pack file | per chunk | `.pack` | `processChunk` | +| Events cold segment | per chunk | three files per chunk (format defined in the events doc) | `processChunk` | +| Sorted txhash file | per chunk | `.bin` (sorted **streamhash** entries — the sorted on-disk tx-hash index format, specified in [the transactions design](./gettransaction-full-history-design.md) §6) | `processChunk` | +| Streamhash txhash index | per index | one `.idx` file per **coverage** (the chunk range `[lo, hi]` an index spans), named `{lo:08d}-{hi:08d}.idx` inside the window's dir; at most one coverage frozen at any moment | `buildTxhashIndex` | + +The `.bin` files are transient — they are the input `buildTxhashIndex` merges, and the terminal build deletes them once its window is complete (or retention pruning removes them first, once its chunks drop below the floor). The pack files, events segments, and `.idx` files persist until retention pruning removes them. State for each lives in [Catalog keys](#catalog-keys); the write ordering is [One write protocol](#one-write-protocol). + +### Directory layout + +Chunk-level files group into buckets of 1,000 chunks (`bucket_id = chunk_id / 1000`, formatted `%05d`) — a filesystem concern only; bucket ids never appear in catalog keys. Directories are created on demand. + +``` +{default_data_dir}/ +├── catalog/rocksdb/ ← catalog (WAL always on) +├── hot/{chunk:08d}/ ← per-chunk hot RocksDB (transient) +├── ledgers/{bucket:05d}/{chunk:08d}.pack +├── events/{bucket:05d}/{chunk:08d}-events.pack (+ -index.pack, -index.hash) +└── txhash/ + ├── raw/{bucket:05d}/{chunk:08d}.bin ← transient until window finalization (or retention pruning) + └── index/{window:08d}/{lo:08d}-{hi:08d}.idx ← one frozen file per window, coverage-named +``` + +### The chunk hot DB + +During ingestion the daemon maintains **one hot RocksDB per chunk** at `{hot_storage.path}/{chunk:08d}/`, holding everything for that chunk not yet materialized to cold artifacts. The data types are column families of the one instance: + +| Column family | Holds | Serves | +|---|---|---| +| `ledgers` | compressed LCMs (LedgerCloseMeta), keyed by seq | `getLedger` for the live chunk; the source `processChunk` reads at freeze | +| `txhash` | tx hash → seq | `getTransaction` for the live chunk | +| events CFs | live events (schema per the events doc) | `getEvents` for the live chunk | + +CFs share the instance's WAL, so each ledger commits as **one atomic WriteBatch across all CFs**. Per-CF options keep tuning independent (the events CFs carry their own settings). The DB is created when ingestion enters the chunk. It is discarded whole once every cold artifact derived from the chunk is durable **and** the rolling index covers the chunk. It keeps serving tx lookups across the brief freeze-to-coverage interval; freeze, rebuild, and discard all chain within one lifecycle run. + +### Catalog keys + +The catalog holds three groups of keys: per-chunk artifact state keys, hot DB state keys, and the config pin. + +**Artifact state keys**: + +| Key | Value | Meaning | +|---|---|---| +| `chunk:{chunk:08d}:ledgers` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk pack file state. | +| `chunk:{chunk:08d}:txhash` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk `.bin` file state. Transient — removed at window finalization, or by retention pruning if its chunk ages out first. | +| `chunk:{chunk:08d}:events` | `"freezing"` \| `"frozen"` \| `"pruning"` | Per-chunk events cold segment state. | +| `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` | `"freezing"` \| `"frozen"` \| `"pruning"` | One key per index **coverage**. The key *name* carries the coverage `[lo, hi]` and maps 1:1 to the file `{lo:08d}-{hi:08d}.idx`; the *value* is pure lifecycle state — the same three values as every other artifact key. At most one coverage per window is `"frozen"` at any moment, and a key with `hi` = its window's last chunk is **terminal** by definition (see [Index keys](#index-keys) below). | + +For the per-chunk keys, `"freezing"` means the immutable file is being written; `"frozen"` means it's fsynced and durable; `"pruning"` means the file is queued for removal; key absent means neither file nor in-progress write exists. Index keys use the **same three states with the same meanings** — a rebuild marks its coverage `"freezing"` before any I/O, and its commit batch flips it to `"frozen"` while demoting the superseded coverage to `"pruning"`. Every artifact key therefore obeys one set of crash rules: `"freezing"` = delete (or re-derive) the file, `"pruning"` = finish the delete, `"frozen"` = truth. + +**Hot DB state key**: + +| Key | Value | Tracks | +|---|---|---| +| `hot:chunk:{chunk:08d}` | `"transient"` \| `"ready"` | The chunk's hot DB. | + +`"ready"` means the RocksDB dir exists and is usable. `"transient"` brackets a directory operation in flight — creation or deletion; no code path ever needs to know which, since the recovery is the same either way (the open path wipes and recreates; the discard scan re-runs). A crash mid-operation is detectable from the key value alone. One key per chunk; the column families inside the DB carry no individual catalog state. + +**Config pin:** + +| Key | Value | Written when | +|---|---|---| +| `config:earliest_ledger` | `uint32` (decimal string, chunk-aligned) | On the first daemon start. Immutable thereafter — changing it currently requires wiping the data directory, until a `set-earliest-ledger` admin command exists (see [Configuration](#configuration); the floor machinery already converges for either direction). | + +**Resume point.** Recomputed at startup from the durable keys plus a read of the live hot DB (see [Startup](#startup)). + +### Index keys + +An index key `index:{txhash_index:08d}:{lo:08d}:{hi:08d}` names the chunk range `[lo, hi]` that its `.idx` covers, mapping 1:1 to the file `txhash/index/{txhash_index:08d}/{lo:08d}-{hi:08d}.idx`. + +`hi` grows as the window fills: at each chunk boundary the rebuild folds in the chunk that just froze, advancing `hi` by one. When `hi` reaches the window's last chunk, the window is **complete** and its index is **terminal** — final, never rebuilt again. + +`lo` is the higher of the window's first chunk and the retention floor, fixed when the index is built. So: + +- a window still being rebuilt each boundary has its `lo` recomputed every time, so it rises as the floor does, dropping chunks that have aged out of retention; +- a terminal window's `.idx` keeps the `lo` it was built with; if the floor later climbs past that `lo`, the index still covers chunks that have dropped out of retention — but a read for any ledger below the floor returns not-found regardless of what the index says, so that stale coverage is never served. + +So `lo` equals the window's first chunk unless the start of the window has dropped below the floor. + +[The transactions design](./gettransaction-full-history-design.md) (§6.3) is canonical for coverage semantics, with a worked example. + +### One write protocol + +Every durable artifact — per-chunk files and index coverages alike — is written the same way, **mark-then-write**: + +1. put `"freezing"` *before* any I/O; +2. write the file; +3. fsync the file, its parent dirent, and — when the parent was just created — the grandparent dirent; +4. flip the key to `"frozen"`. + +The key is always written before the file. So every file can be found from its key — cleanup walks keys, never directories — and a file left half-written by a crash carries a `"freezing"` key, which marks it for re-derivation or removal. Step 3 fsyncs the directory entries, not just the file, so the file's existence on disk survives a crash before its key flips to `"frozen"`. + +Deletion is the same protocol in reverse: demote the key to `"pruning"`, unlink the file, then delete the key, with an `fsyncDir` between the unlink and the key delete. So a key is gone only once its file is — **key absent ⟹ file gone**. Two functions do all file deletion: `sweepChunkArtifacts` for per-chunk artifacts and `sweepIndexKey` for index files. + +--- + +## Backfill + +Backfill makes every artifact derived from a range of ledgers durable and servable. It has three parts, in the order below: a **resolver** (`resolve`) that diffs what's wanted against the catalog and returns a plan of the missing work; the **primitives** (`processChunk`, `buildTxhashIndex`) that produce each artifact; and an **executor** (`executePlan`) that runs the plan concurrently. The [Startup](#startup) backfill loop and the [Lifecycle](#lifecycle) run are its two callers. + +### Postcondition-driven planning + +Backfill works from a postcondition: *given a range, every artifact derived from every ledger in it must be durable and servable.* `resolve` reads the catalog and returns a `Plan` of only the missing work — per-chunk artifacts whose key isn't `"frozen"`, and window indexes whose frozen coverage doesn't yet span the range. It reads nothing but durable keys, so every run re-plans from what's on disk; a restart neither redoes finished work nor skips unfinished work. The plan is a flat list of chunk builds and index builds: + +```go +type ChunkBuild struct { + Chunk ChunkID + Artifacts ArtifactSet // which kinds this chunk still needs — one processChunk pass produces all +} + +type IndexBuild struct { + Window WindowID + Lo, Hi ChunkID // coverage to build; terminal iff Hi == windowLastChunk(Window) + // dependencies are derivable (the ChunkBuilds in [Lo, Hi]), so no input list +} + +type Plan struct { + ChunkBuilds []ChunkBuild + IndexBuilds []IndexBuild +} + +// resolve returns the work missing for [rangeStart, rangeEnd]. +func resolve(cfg Config, rangeStart, rangeEnd ChunkID) Plan { + if rangeEnd < rangeStart { + return Plan{} // young network: no complete chunk yet + } + cat := cfg.Catalog + needs := map[ChunkID]ArtifactSet{} + + for c := rangeStart; c <= rangeEnd; c++ { + for _, kind := range []Kind{Ledgers, Events} { + if cat.State(c, kind) != Frozen { + needs[c] = needs[c].Add(kind) + } + } + } + + var builds []IndexBuild + for _, w := range windowsOverlapping(rangeStart, rangeEnd) { + desired := Range{ + Lo: max(windowFirstChunk(w), rangeStart), + Hi: min(windowLastChunk(w), rangeEnd), + } + if frozenCoverage(cat, w).Covers(desired) { + continue + } + for c := desired.Lo; c <= desired.Hi; c++ { + if cat.State(c, TxHashBin) != Frozen { + needs[c] = needs[c].Add(TxHashBin) + } + } + builds = append(builds, IndexBuild{Window: w, Lo: desired.Lo, Hi: desired.Hi}) + } + return Plan{ChunkBuilds: chunkBuilds(needs), IndexBuilds: builds} +} +``` + +### The primitives + +`processChunk` writes a chunk's requested artifacts through the [one write protocol](#one-write-protocol), reading ledgers from `backfillSource`. Its hot-DB branch is what lets the lifecycle freeze a just-closed chunk from its own hot DB, on the same path as a cold backfill. + +```go +func processChunk(cfg Config, chunk ChunkID, artifacts ArtifactSet) error { + cat := cfg.Catalog + source, err := backfillSource(cfg, chunk, artifacts) + if err != nil { + return err + } + + batch := cat.NewBatch() // mark "freezing" before any I/O + for _, kind := range artifacts.Kinds() { + batch.Put(chunkKey(chunk, kind), "freezing") + } + batch.Commit() + + w := newArtifactWriters(chunk, artifacts) + for seq := chunkFirstLedger(chunk); seq <= chunkLastLedger(chunk); seq++ { + w.Add(source.GetLedger(seq)) + } + w.Finish() + w.FsyncAll() // durable before the keys flip to "frozen" + + batch = cat.NewBatch() + for _, kind := range artifacts.Kinds() { + batch.Put(chunkKey(chunk, kind), "frozen") + } + batch.Commit() + return nil +} + +// backfillSource picks a chunk's ledger source in a fixed preference order. The +// hot branch errors only when a "ready" hot DB won't open — its data is lost. +// An incomplete-but-present DB is just stale: it falls through to the next +// source, which re-derives the chunk and recovers it. +func backfillSource(cfg Config, chunk ChunkID, artifacts ArtifactSet) (LedgerSource, error) { + cat := cfg.Catalog + if state, _ := cat.Get(hotChunkKey(chunk)); state == "ready" { + db, err := openRocksDBReadOnly(hotChunkPath(chunk)) + if err != nil { + return nil, fmt.Errorf("hot DB for chunk %d is ready but won't open: %w", chunk, err) + } + if maxCommittedSeq(db) >= chunkLastLedger(chunk) { + return &HotLedgers{chunk: chunk, store: db}, nil + } + db.Close() // incomplete: stale leftover — close and fall through; the discard scan owns it + } + if cat.State(chunk, Ledgers) == Frozen && !artifacts.Has(Ledgers) { + return packReader(chunk), nil // re-derive locally + } + // Backfill backend: the only source for a chunk with no local copy. If its + // tip lags below this chunk, wait for coverage. + waitForBackendCoverage(cfg, chunk) // bounded; fatal on timeout + return backfillBackend(cfg), nil // BSB by default +} +``` + +**`buildTxhashIndex(w, lo, hi, cat)`** rebuilds window `w`'s index to cover chunks `[lo, hi]` — `lo` the lowest in-floor chunk, `hi` the highest frozen chunk (the window's last once the window is complete). The lifecycle calls it on every chunk boundary while the window is current. + +```go +func buildTxhashIndex(w WindowID, lo, hi ChunkID, cat Catalog) error { + prev := frozenCoverage(cat, w) + if prev != nil && prev.Lo == lo && prev.Hi == hi { + return nil // already built (e.g. a buildThenSweep retry re-entering after the commit) + } + + key := indexKey(w, lo, hi) + cat.Put(key, "freezing") // mark before any I/O + + sb := streamhash.NewSortedBuilder(indexFilePath(key)) + for entry := range kWayMerge(binFiles(lo, hi)) { // sorted .bin files → one stream + sb.Add(entry) + } + sb.Finish() + fsyncFile(indexFilePath(key)) + fsyncDir(indexWindowDir(key)) // + grandparent on the window's first build + + batch := cat.NewBatch() // one atomic synced write — the whole finalization + batch.Put(key, "frozen") + if prev != nil { + batch.Put(indexKey(w, prev.Lo, prev.Hi), "pruning") // demote predecessor + } + if hi == windowLastChunk(w) { // terminal: the merged .bin inputs are spent + for c := lo; c <= hi; c++ { + batch.Put(chunkKey(c, TxHashBin), "pruning") + } + } + batch.Commit() + return nil +} +``` + +`kWayMerge` and `SortedBuilder` are streamhash internals, covered in [the transactions design](./gettransaction-full-history-design.md) (§6–§7). + +### Execution model + +`executePlan` runs a plan from either caller — startup backfill or the [lifecycle run](#lifecycle). Chunk builds run concurrently under one worker semaphore; each index build waits on the done-channels of the chunk builds inside its coverage, then runs. + +```go +func executePlan(ctx context.Context, cfg Config, plan Plan) error { + slots := make(chan struct{}, cfg.Workers) // the only concurrency knob + done := make(map[ChunkID]chan struct{}, len(plan.ChunkBuilds)) + for _, cb := range plan.ChunkBuilds { + done[cb.Chunk] = make(chan struct{}) + } + + g, gctx := errgroup.WithContext(ctx) + for _, cb := range plan.ChunkBuilds { + g.Go(func() error { + slots <- struct{}{} + defer func() { <-slots }() + if err := withRetries(gctx, cfg.MaxRetries, func() error { + return processChunk(cfg, cb.Chunk, cb.Artifacts) + }); err != nil { + return err // leave done[cb.Chunk] open; the error cancels gctx, freeing waiters + } + close(done[cb.Chunk]) // success: dependents may now read this chunk's .bin + return nil + }) + } + for _, b := range plan.IndexBuilds { + g.Go(func() error { + for c := b.Lo; c <= b.Hi; c++ { // wait on the in-coverage chunk builds + if ch, ok := done[c]; ok { + select { + case <-ch: // this chunk's .bin is frozen + case <-gctx.Done(): // a build failed (or cancel) — bail + return gctx.Err() + } + } + } + slots <- struct{}{} + defer func() { <-slots }() + return withRetries(gctx, cfg.MaxRetries, func() error { + return buildThenSweep(cfg, b) + }) + }) + } + return g.Wait() +} + +// buildThenSweep runs an IndexBuild, then eagerly sweeps the keys its commit +// demoted (this window only), so freed disk returns without waiting for a run. +func buildThenSweep(cfg Config, b IndexBuild) error { + cat := cfg.Catalog + if err := buildTxhashIndex(b.Window, b.Lo, b.Hi, cat); err != nil { + return err + } + for _, key := range indexKeys(cat, b.Window) { // superseded coverage(s) + if key.State == Pruning { + sweepIndexKey(cat, key) + } + } + var demoted []ArtifactRef // terminal build: the window's .bin inputs + for c := windowFirstChunk(b.Window); c <= windowLastChunk(b.Window); c++ { + if cat.State(c, TxHashBin) == Pruning { + demoted = append(demoted, ArtifactRef{Chunk: c, Kind: TxHashBin}) + } + } + if len(demoted) > 0 { + sweepChunkArtifacts(cat, demoted) + } + return nil +} +``` + +- **`cfg.Workers`** (default `GOMAXPROCS`) is the only resource knob: at most that many tasks run at once, drawn from all windows' eligible work. Goroutines are cheap structure — thousands may be parked on the semaphore or on done-channels. +- Done-channels signal *success*: a chunk build closes its channel only once its `.bin` is frozen, so an index build proceeds only when every input it needs exists. A chunk build that exhausts its retries leaves its channel open and returns an error, which cancels `gctx`; any dependent waiting on it unblocks through the `<-gctx.Done()` case and bails. A task that exhausts its retries aborts the daemon ([error policy](#lifecycle)); restart re-resolves from durable keys and completed work never repeats. + +--- + +## Daemon flow + +After startup, the daemon runs two goroutines. **Hot-DB ingestion** pulls new ledgers from captive core into the per-chunk hot DBs as the network closes them, and hands each completed chunk to the lifecycle. (This is the live-network loop — distinct from startup backfill, which reads *old* ledgers into cold files.) The **lifecycle** is a background goroutine responsible for everything else, and it does two kinds of work: **freezing** complete chunks from hot storage into immutable cold files (rolling the tx-hash index forward as it goes), and **cleanup** — discarding hot DBs the cold files now serve, and pruning artifacts that are superseded or have fallen past the retention floor. The sections below cover startup, then each goroutine in turn. + +### Startup + +Startup runs in two steps, both in `startStreaming` below: + +1. **Backfill** brings on-disk coverage in line with the retention window, up through the last *complete* chunk at the tip. The partial chunk still forming at the tip is left to hot-DB ingestion: its ledgers so far are already in the live hot DB (which serves them), and ingestion completes the chunk as new ledgers arrive. Backfill re-runs if the tip advances mid-pass, and when it returns, the whole in-retention history up to that point is on disk as frozen files — ready to serve. +2. **Serve + ingest** opens the resume chunk's hot DB, starts captive core, serving, the lifecycle goroutine, and the hot-DB ingestion loop. The lifecycle is seeded with the last complete chunk so its first run fires at once; that run finishes any crash/downtime leftovers concurrently with serving. Reads never wait for it, because a reader only ever resolves a `"ready"` hot DB or a `"frozen"` cold file — never a transient key. + +Operational note — **peak disk after long downtime**: pruning runs only in the first run's prune stage, *after* backfill has materialized every newly-in-retention chunk, so a downtime approaching or exceeding the retention window transiently holds up to ~2× the retention footprint (the stale window plus its replacement). Size volumes accordingly, or prune stale ranges manually before restarting after very long downtime; a disk-full during backfill otherwise aborts before the relieving prune can run, on every retry. + +The retention floor and resume point are computed by: + +```go +const ( + GenesisLedger = 2 + LedgersPerChunk = 10_000 + ChunksPerTxhashIndex = 1_000 // window = 10M ledgers +) + +// retentionFloorChunk: the lowest chunk kept — retentionChunks back from +// lastChunk, never below earliest's chunk. +func retentionFloorChunk(lastChunk ChunkID, retentionChunks uint32, earliest uint32) ChunkID { + floor := chunkID(earliest) + if retentionChunks > 0 { + floor = max(floor, lastChunk-ChunkID(retentionChunks)+1) + } + return floor +} + +// lastCompleteChunkAt: the largest chunk whose last ledger is <= ledger. +func lastCompleteChunkAt(ledger uint32) int64 { + return (int64(ledger)-1)/LedgersPerChunk - 1 +} + +// maxCommittedSeq returns the highest ledger committed to a hot DB; for a +// freshly opened, empty chunk-C DB it returns chunkFirstLedger(C) - 1 (the +// watermark just below the chunk), so the boundary-crash derivation is exact. +// +// lastCommittedLedger: the highest ledger in durable storage — the live hot DB's +// last, the highest frozen chunk's if it leads, or earliest-1 if neither exists. +func lastCommittedLedger(cat Catalog) uint32 { + base := cat.EarliestLedger() - 1 + cold := highestDurableChunk(cat) + hot := highestReadyHotChunk(cat) + switch { + case hot > cold: + db := openReadOnly(hot) + defer db.Close() + return max(base, maxCommittedSeq(db)) + case cold >= 0: + return max(base, chunkLastLedger(cold)) + default: + return base + } +} + +func networkTip(cfg Config) (uint32, error) { + tip, err := withBackoff(func() (uint32, error) { return backendNetworkTip(cfg) }) + if err != nil { + return 0, err + } + if tip < GenesisLedger { + return 0, fmt.Errorf("backend tip %d is below genesis — backend not ready", tip) + } + return tip, nil +} +``` + +```go +func startStreaming(ctx context.Context, cfg Config) error { + cat := openCatalog(cfg) + cfg.Catalog = cat + validateConfig(cfg) + + earliest := cat.EarliestLedger() + lastCommitted := lastCommittedLedger(cat) + + // Step 1: backfill from the floor up to the last complete chunk at the tip, + // leaving the partial tip chunk to ingestion. Re-pass while the tip moves. + backfilledThrough := int64(-1) + for { + tip, err := networkTip(cfg) + if err != nil { + if lastCommitted < earliest { + fatalf("network tip unavailable and no local history to serve: %v", err) + } + tip = lastCommitted // backend down, but local data exists: serve it + } + anchor := max(tip, lastCommitted) + rangeEnd := lastCompleteChunkAt(anchor) + rangeStart := retentionFloorChunk(rangeEnd, cfg.RetentionChunks, earliest) + midChunk := lastCommitted != chunkLastLedger(chunkID(lastCommitted)) + nearTip := int64(tip)-int64(lastCommitted) < LedgersPerChunk + if nearTip && midChunk { + rangeEnd = chunkID(lastCommitted) - 1 // leave the partial resume chunk to ingestion + } + if rangeEnd < rangeStart || rangeEnd <= backfilledThrough { + break + } + if err := executePlan(ctx, cfg, resolve(cfg, rangeStart, rangeEnd)); err != nil { + return err + } + lastCommitted = max(lastCommitted, chunkLastLedger(rangeEnd)) + backfilledThrough = rangeEnd + } + resumeLedger := lastCommitted + 1 + + // Step 2: serve + ingest. Seed the lifecycle with the last complete chunk so + // its first run clears crash/downtime leftovers while serving is already live. + hotDB, err := openHotDBForChunk(cat, chunkID(resumeLedger)) + if err != nil { + return err + } + core := startCaptiveCore(cfg, resumeLedger) + lifecycleCh := make(chan ChunkID, lifecycleQueueDepth) + lifecycleCh <- lastCompleteChunkAt(resumeLedger - 1) // seed the first run + go lifecycleLoop(ctx, cfg, lifecycleCh) + serveReads() + return runIngestionLoop(ctx, cat, core, hotDB, lifecycleCh, resumeLedger) +} +``` + +`validateConfig` checks the config and, on the first start, resolves and pins `earliest_ledger`: + +```go +func validateConfig(cfg Config) { + cat := cfg.Catalog + if cfg.Workers < 1 { + fatalf("workers must be > 0 (got %d)", cfg.Workers) + } + if cfg.MaxRetries < 0 { + fatalf("max_retries must be >= 0 (got %d)", cfg.MaxRetries) + } + if cfg.EarliestLedger != "genesis" && cfg.EarliestLedger != "now" { + n, err := parseUint32(cfg.EarliestLedger) + if err != nil || n < GenesisLedger || n != chunkFirstLedger(chunkID(n)) { + fatalf("earliest_ledger must be \"genesis\", \"now\", or a chunk-aligned "+ + "ledger >= %d; got %q.", GenesisLedger, cfg.EarliestLedger) + } + } + + earliestStored, earliestPinned := cat.Get("config:earliest_ledger") + + if earliestPinned { // restart: confirm nothing changed, write nothing + if cfg.EarliestLedger != "now" { // "now" on restart keeps the pinned floor + want := uint32(GenesisLedger) + if cfg.EarliestLedger != "genesis" { + want = atoi(cfg.EarliestLedger) + } + if want != atoi(earliestStored) { + fatalf("earliest_ledger changed: stored=%s, config=%s; wipe the data dir to change it.", + earliestStored, cfg.EarliestLedger) + } + } + return + } + + // First start: resolve earliest_ledger, then pin it. "now" and a numeric + // floor each need a reachable backend — "now" to resolve, a numeric floor to + // reject one past the tip (it is pinned immutably, so it can't be checked later). + var earliest uint32 + switch cfg.EarliestLedger { + case "genesis": + earliest = GenesisLedger + case "now": + tip, err := networkTip(cfg) + if err != nil { + fatalf("earliest_ledger=now needs a reachable backend: %v", err) + } + earliest = chunkFirstLedger(chunkID(tip)) + default: + earliest = atoi(cfg.EarliestLedger) + tip, err := networkTip(cfg) + if err != nil { + fatalf("a numeric earliest_ledger needs a reachable backend to validate against the tip: %v", err) + } + if earliest > tip { + fatalf("earliest_ledger (%d) is past the network tip (%d)", earliest, tip) + } + } + cat.Put("config:earliest_ledger", itoa(earliest)) +} +``` + +### Hot DB helpers + +`openHotDBForChunk` opens a chunk's hot DB — the existing one, or a fresh one after a crash or on first use: + +```go +func openHotDBForChunk(cat Catalog, chunk ChunkID) (*HotDB, error) { + hotKey, path := hotChunkKey(chunk), hotChunkPath(chunk) + if state, _ := cat.Get(hotKey); state == "ready" { + db, err := openExistingRocksDB(path) + if err != nil { + return nil, fmt.Errorf("hot DB for chunk %d is ready but won't open: %w", chunk, err) + } + return db, nil + } + // transient or absent: wipe any leftover dir and create fresh. + deleteDirIfExists(path) + cat.Put(hotKey, "transient") + db := createChunkHotDB(path) + fsyncDir(path) // durable before the key flips to "ready" + fsyncParentDir(path) + cat.Put(hotKey, "ready") + return db, nil +} +``` + +### Hot DB Ingestion + +```go +func runIngestionLoop(ctx context.Context, cat Catalog, core LedgerBackend, hotDB *HotDB, + lifecycleCh chan<- ChunkID, resumeLedger uint32) error { + + // A full lifecycleCh means freeze has fallen lifecycleQueueDepth boundaries + // behind ingestion — fail loud. + notify := func(complete ChunkID) { + select { + case lifecycleCh <- complete: + default: + fatalf("lifecycle fell %d boundaries behind ingestion; investigate", lifecycleQueueDepth) + } + } + + for seq := resumeLedger; ; seq++ { + lcm, err := core.GetLedger(ctx, seq) // blocks until ledger seq is available + if err != nil { + return err + } + + // One atomic synced batch across all CFs, so a ledger is fully present or + // absent; it is the only per-ledger durability boundary. + batch := hotDB.NewBatch() + putLedger(batch, lcm) + putTxHashes(batch, lcm) + putEvents(batch, lcm) + batch.Commit( /*sync=*/ true) + + if seq == chunkLastLedger(chunkID(seq)) { + // Close this chunk and open the next before notifying, so the lifecycle + // never races a live writer for the chunk it is about to freeze. + hotDB.Close() + if hotDB, err = openHotDBForChunk(cat, chunkID(seq)+1); err != nil { + return err + } + notify(chunkID(seq)) + } + } +} +``` + +A `GetLedger` failure returns from the loop and exits the process; the next startup resumes from where the last synced batch left off, since the batch is all-or-nothing. A clean shutdown cancels `ctx` and returns the same way, distinguished from a crash at the daemon's top level. The completed chunk id is all ingestion sends the lifecycle — *how far to go*; what to build, discard, and prune the lifecycle reads from the catalog. + +### Lifecycle + +The lifecycle is a background goroutine. Each notification — one per ingestion boundary, plus a startup seed — triggers one **run**, which does three stages in order: + +1. **Plan-and-execute** — `resolve` + `executePlan` over `[floor, last complete chunk]`, the same machinery backfill uses. In steady state this freezes the just-closed chunk from its hot DB and folds it into the current window's index; rebuilding the whole window each boundary costs ≈1 minute against a boundary that arrives only every ~14 h at mainnet rates. +2. **Discard** — retire hot DBs the cold artifacts now fully serve. +3. **Prune** — sweep demoted and past-retention files. + +At runtime the floor only rises (retention config is fixed for the life of the process; widening applies at the next startup), so `[floor, last complete chunk]` always sits within existing storage — a run produces only the just-closed chunk and never reaches below. Extending the *bottom* of storage — a fresh start, or filling to a widened floor — is startup backfill's job. + +Everything the run does derives from the catalog plus the one chunk id ingestion hands it: + +```go +func runLifecycle(ctx context.Context, cfg Config, lastChunk ChunkID) { + floor := retentionFloorChunk(lastChunk, cfg.RetentionChunks, cfg.Catalog.EarliestLedger()) + + if err := executePlan(ctx, cfg, resolve(cfg, floor, lastChunk)); err != nil { + fatalf("lifecycle run: %v", err) // abort; startup is the recovery path + } + for _, op := range eligibleDiscardOps(cfg, lastChunk, floor) { + op() + } + for _, op := range eligiblePruneOps(cfg, floor) { + op() + } +} + +const lifecycleQueueDepth = 8 // far above the at-most-one a healthy daemon holds + +func lifecycleLoop(ctx context.Context, cfg Config, lifecycleCh <-chan ChunkID) { + for lastChunk := range lifecycleCh { + drain: // if several chunks queued, take the most recent — one run covers them + for { + select { + case lastChunk = <-lifecycleCh: + default: + break drain + } + } + runLifecycle(ctx, cfg, lastChunk) + } +} +``` + +Between runs the goroutine is idle, and idle means **settled**: a re-scan would produce no ops and every storage invariant holds, so an [audit](#correctness) run at any such moment would pass. A failing op retries with backoff, then aborts the daemon — startup is the recovery path, the same policy as ingestion. + +The discard and prune stages are the two `eligible*` scans below. **Discard** retires a chunk's hot DB once its cold artifacts fully serve it (the window's index covers the chunk), or once it falls past retention. **Prune** is the system's only file-deleter: it sweeps transient index keys, the `.bin` inputs a terminal commit demoted, and everything below the retention floor, through `sweepIndexKey`/`sweepChunkArtifacts`. Each scan returns zero-arg ops the run calls in order. + +```go +func eligibleDiscardOps(cfg Config, lastChunk, floor ChunkID) []func() { + cat := cfg.Catalog + var ops []func() + for _, chunk := range hotChunkKeys(cat) { + switch { + case chunk < floor: + ops = append(ops, func() { discardHotDBForChunk(cat, chunk) }) + case chunk <= lastChunk && + pendingArtifacts(cfg, chunk).Empty() && + indexCovers(cfg, chunk): // cold artifacts fully serve it + ops = append(ops, func() { discardHotDBForChunk(cat, chunk) }) + } + } + return ops +} + +// pendingArtifacts lists which processChunk outputs the chunk still needs. The +// .bin is exempt once the window's index covers the chunk (the finalized window +// already demoted its key). +func pendingArtifacts(cfg Config, chunk ChunkID) ArtifactSet { + cat := cfg.Catalog + var need ArtifactSet + for _, kind := range []Kind{Ledgers, Events} { + if cat.State(chunk, kind) != Frozen { + need = need.Add(kind) + } + } + if cat.State(chunk, TxHashBin) != Frozen && !indexCovers(cfg, chunk) { + need = need.Add(TxHashBin) + } + return need +} + +// indexCovers reports whether the window's durable .idx already hashes the chunk. +func indexCovers(cfg Config, chunk ChunkID) bool { + fk := frozenCoverage(cfg.Catalog, indexID(chunk)) + return fk != nil && fk.Lo <= chunk && chunk <= fk.Hi +} + +func eligiblePruneOps(cfg Config, floor ChunkID) []func() { + cat := cfg.Catalog + windowFloor := WindowID(-1) + chunkFloor := ChunkID(-1) + if floor > 0 { + windowFloor = indexID(floor) - 1 + chunkFloor = floor - 1 + } + var ops []func() + + for _, key := range indexKeys(cat) { + switch { + case key.State == Freezing || key.State == Pruning: // transient debris + ops = append(ops, func() { sweepIndexKey(cat, key) }) + case key.Window <= windowFloor: // frozen, wholly below the floor + ops = append(ops, func() { sweepIndexKey(cat, key) }) + } + } + + var refs []ArtifactRef + for _, ref := range chunkArtifactKeys(cat) { + switch { + case ref.Chunk <= chunkFloor: // wholly past retention + refs = append(refs, ref) + case cat.State(ref.Chunk, ref.Kind) == Pruning: + refs = append(refs, ref) + case ref.Kind == TxHashBin: // redundant .bin in a finalized window + if fk := frozenCoverage(cat, indexID(ref.Chunk)); fk != nil && fk.Hi == windowLastChunk(indexID(ref.Chunk)) { + refs = append(refs, ref) + } + } + } + if len(refs) > 0 { + ops = append(ops, func() { sweepChunkArtifacts(cat, refs) }) + } + return ops +} +``` + +The op bodies — one discard, two sweeps — are the daemon's entire directory- and file-deletion surface: + +```go +func discardHotDBForChunk(cat Catalog, chunk ChunkID) { + if !cat.Has(hotChunkKey(chunk)) { + return + } + cat.Put(hotChunkKey(chunk), "transient") + deleteDirIfExists(hotChunkPath(chunk)) + fsyncParentDir(hotChunkPath(chunk)) + cat.Delete(hotChunkKey(chunk)) +} + +func sweepChunkArtifacts(cat Catalog, refs []ArtifactRef) { + batch := cat.NewBatch() // demote before the unlink + for _, ref := range refs { + batch.Put(chunkKey(ref.Chunk, ref.Kind), "pruning") + } + batch.Commit() + + var paths []string + for _, ref := range refs { + deleteArtifactFiles(ref.Chunk, ref.Kind) + paths = append(paths, artifactPaths(ref.Chunk, ref.Kind)...) + } + fsyncParentDirs(paths) // unlinks durable before the keys go + + batch = cat.NewBatch() + for _, ref := range refs { + batch.Delete(chunkKey(ref.Chunk, ref.Kind)) + } + batch.Commit() +} + +func sweepIndexKey(cat Catalog, key IndexKey) { + cat.Put(key, "pruning") // demote before the unlink (synced → durable first) + deleteFileIfExists(indexFilePath(key)) + fsyncDir(indexWindowDir(key)) + cat.Delete(key) // key outlives the unlink, so a crash re-runs the sweep + rmdirIfEmpty(indexWindowDir(key)) +} +``` + +`discardHotDBForChunk` removes a hot DB directory under its `hot:chunk` key; the two `sweep*` functions are the entire file-deletion surface, one body per key family. The prune walk's two families are independent of each other and of discard — a chunk swept while its window's `.idx` still resolves to it could leave a `getTransaction` pointing at a deleted `.pack`, but a below-floor read is not-found regardless ([reader contract](#reader-contract)). + +### Concurrency model + +Two writer goroutines and read-only readers. The catalog partitions their domains at the **live chunk** — the highest chunk with a `hot:chunk` key: + +- **Ingestion** owns the live chunk: the sole writer of its hot DB, and the creator of each `hot:chunk` key (via `openHotDBForChunk` at the boundary). +- **The lifecycle** owns everything below it: handed-off hot DBs (freeze + discard), all `chunk:*` and `index:*` keys, and the deletion side of `hot:chunk` keys. + +The two share no memory; their only link is the channel. The handoff is by write ordering — ingestion closes the chunk and opens the next (moving the partition) *before* sending it — so the lifecycle never freezes a chunk a writer still holds. Both write the catalog at the same time but never the same key (RocksDB handles concurrent writes safely). And because the chunk ids ingestion hands over only increase, a chunk completing while a lifecycle run is already in progress just bumps the starting point of the *next* run — it can't disturb the one underway. Readers hold their own read-only handles and resolve files through keys, so writer activity never races them. + +**Single-process enforcement.** All of the above assumes a *single* daemon owns the data; two daemons sharing it would corrupt it. The daemon enforces that at startup by taking a kernel file lock (`flock`) on a `LOCK` file in **each** of its roots — the catalog and every configured storage tree. A second daemon pointed at any of those paths can't acquire the lock and exits; the lock releases on any exit, including `kill -9`, so it never goes stale. It has to lock every root, not just the catalog, because the catalog and the storage trees are configured as independent paths — otherwise two daemons with different catalogs could still share a storage tree. The hot tree matters most: its `hot/{chunk}` DBs are the only copy of recently-ingested ledgers that aren't frozen yet. + +--- + +## Reader contract + +A read resolves data through two rules, and the rest of the design relies on both: + +1. **Only `"ready"` and `"frozen"` are visible.** A read resolves a chunk only from a `"ready"` hot DB or a `"frozen"` cold file — never from a key in a transient state (`"freezing"`, `"pruning"`, `"transient"`). So a reader never sees a half-written file, crash debris, or an in-progress sweep; transient keys are invisible to it. +2. **Below the floor is *not found*.** A read for any seq below the retention floor returns not-found, whether or not the file still exists on disk. This is what lets pruning delete a chunk the instant it passes retention: a stale `.idx` might resolve a tx-hash to a `.pack` that's been unlinked, but the below-floor read is not-found anyway. + +Together they make retention the single source of truth for "is this data available?": the freeze, sweep, and prune stages constantly create transient states and delete below-floor data, and these rules guarantee a read never *resolves* either. (Whether a read already in flight survives a concurrent unlink is a separate question — see below.) + +How a read is actually served — choosing the hot DB or the cold files for a given query, reading across the cold artifact types (`.pack` ledgers, events segments, `.idx` index), and staying correct when a sweep or prune unlinks a file while a read is mid-flight — is the **query-routing design's** concern, out of scope here and in the transactions design (§8). + +--- + +## Correctness + +This section states what the streaming workflow guarantees, the assumptions it relies on, and the operator actions and crash timings the design covers. + +### Invariants + +Two terms recur below. The **retention window** runs from the retention floor up to the last committed ledger; the reader gate and the prune scan both use the floor (rounding it a little low is harmless). The floor is also the bottom of the production range for both backfill and the lifecycle run, and at runtime it only rises — so a run never reaches below what's already on disk. The daemon is **settled** when a run's plan is empty and its discard and prune scans produce no ops: the state between runs, where the invariants below are meant to hold. + +**INV-1 (read correctness).** Any data request whose ledger scope falls entirely within the retention window returns correct results: the content matches what a conformant LedgerBackend would produce, no partial state is visible, and no in-retention range is unreachable. + +There is one transient exception. When surgical recovery demotes hot data down to the live chunk (scenario 3), the last committed ledger rewinds and the floor — anchored on the last complete chunk — regresses with it. For the few minutes until re-ingestion advances it again, the bottom of the window includes a handful of chunks already pruned under the old floor. Reads there fail soft — not-found, never wrong data, since files are write-once and pruning only unlinks — and the gap closes as the floor climbs back. + +**INV-2 (single canonical state).** The catalog records exactly one home for each data range. What it guarantees: + +- **One frozen index per window, at all times** (settled or not). The commit batch promotes the new coverage and demotes the old one in a single write, so "the window's index" is always well-defined for readers — never two frozen keys, never none once the window has one. +- **No transient artifact key survives a settled state.** Between runs, no `chunk:*` or `index:*` key is `"freezing"` or `"pruning"`. Each kind of transient has cleared: index transients by the run that observed them; per-chunk `"freezing"` keys by re-materialization (the plan stage rebuilds them, for chunks in `[floor, last complete chunk]`, from whatever source `backfillSource` picks); and `"pruning"` keys by the sweeps. +- **No leftover hot DB for a fully-cold chunk** (when settled). No `hot:chunk:c` exists for a chunk `c` whose artifacts are all durable *and* whose window's index covers `c` — that chunk is served entirely from cold files, so its hot DB must be gone. +- **No leftover `.bin` key in a finalized window** (when settled). No `chunk:c:txhash` exists for a chunk in a window whose frozen index is terminal: the terminal commit demotes the merged inputs `[lo, hi]` and the sweep removes them, chunks below the floor are cleared by retention pruning, and the prune scan's redundant-input branch catches any that a crashed widening re-froze. + +Two transient states are tolerated even at a settled moment: + +- **A hot DB's `"transient"` bracket** around an in-flight directory operation (the boundary's `openHotDBForChunk`, startup's resume-chunk open, a discard mid-op). A crash-left bracket is finished by the next `openHotDBForChunk` or discard scan. +- **After a hot-data recovery, a partially-frozen chunk above the last committed ledger** may hold `"freezing"` keys while serving and settled. It sits above the last complete chunk — outside every plan range and the retention window, so no read can observe it — until re-ingestion replays it forward from the last frozen boundary and re-freezes it, minutes later. + +**INV-3 (disk matches catalog).** When settled, the files and hot-DB directories on disk are exactly the set the catalog names — no more, no less. Every key maps to one expected path, and because a key is written before its file (mark-before-write), even a partial file is reachable from its key. So the match holds whether a key is in a final state or in one of the transients INV-2 tolerates. No orphan files, no dangling keys, no duplicates: a file that no catalog key names is a real bug, not mid-run debris. + +**INV-4 (retention bound).** When settled, no file or catalog key maps to a ledger range strictly below the effective retention floor — with one exception: a frozen index key whose window straddles the floor keeps the `lo` it was built with, so its coverage `[lo, hi]` reaches below the floor. That below-floor portion is never served ([reader contract](#reader-contract) rule 2 returns not-found), and the key and its `.idx` are swept once the whole window falls below the floor. + +Each invariant has a distinct audit. INV-1 you check by issuing reads or by re-deriving artifacts and byte-comparing. INV-2 you check by walking catalog keys and cross-checking forbidden co-existence. INV-3 you check by walking the filesystem against the catalog. INV-4 you check by walking catalog keys against the floor. None of the invariants reference the phase scans that maintain them — so a bug in any scan shows up as a real invariant violation, not as something the buggy code silently considers acceptable. A settled state between runs makes these walks meaningful on a live daemon, so an `audit` admin command can implement them directly (with an optional deep mode that re-derives sampled artifacts via a conformant LedgerBackend and byte-compares, for INV-1). + +### Convergence + +**Startup converges from any on-disk state.** Whatever a partial-completion crash, an operator action, or surgical recovery leaves behind, startup drives the system to a settled state satisfying INV-1 ∧ INV-2 ∧ INV-3 ∧ INV-4. Startup here is the backfill pass followed by the first lifecycle run (fired by the startup seed), and it reaches a settled state within that first run — typically seconds after serving opens, bounded by the run's freeze, rebuild, and prune workload. From any state reachable *during* a run, the lifecycle run alone converges, within a bounded number of runs. And since a runtime op failure aborts the daemon, every state a run can leave behind is one startup is built to converge. + +The split matters because some repairs are inherently backfill's, not the run's: a per-chunk `"freezing"` key with no hot DB behind it (a crashed backfill write) is repaired by re-materialization, and a surgically removed range is re-derived from the LedgerBackend — no run phase produces data. The run's province is everything else: index transients, demotions, freezes from live hot DBs, prunes. + +Convergence rests on three properties shared by the resolver and the scans — eligibility is computed from durable catalog state alone; ops are idempotent; everything is re-derived on every notification — plus backfill's postcondition contract. Together, whatever a crash leaves half-done, the next run or the next startup finishes. + +### Substrate assumptions + +Properties we rely on the underlying storage to provide: + +- **Sync WAL.** All catalog puts and deletes that the invariants depend on use RocksDB's `WriteOptions.sync = true`, which fsyncs the WAL before the write returns. Multi-key commits — the index commit batch, the sweeps' key-delete batches — are single atomic synced WriteBatches: all-or-nothing across keys. +- **Per-ledger durability.** The chunk hot DB's synced WriteBatch (atomic across all CFs) is the sole per-ledger durability boundary; the last committed ledger is derived from it. Per-artifact: the per-chunk file **and its directory entry** are fsynced before its key flips to `"frozen"`, and an index coverage's `.idx` (and its dir entry) is fsynced before the commit batch freezes its key. +- **Deterministic, idempotent writes.** Re-applying any write produces byte-identical state. Backed by deterministic LCM bytes from any conformant LedgerBackend and a byte-identical streamhash index from byte-identical sorted inputs. +- **Monotonic progress.** Within a process run, ingestion only moves forward: each synced batch extends the last, and the last-complete-chunk it hands the lifecycle climbs with it (strictly increasing chunk ids). Across a crash, the startup derivation equals exactly the durable state — the pre-crash value, or a hair above it (a batch that committed in the instant before the crash). It lands *below* the pre-crash value in only two cases: hot state was lost or demoted to `"transient"`, or recovery demoted a finished window's index for rebuild on a daemon interrupted during its first backfill (before any live ingestion). In that second case there are no hot DBs to anchor the last committed ledger, so it drops below that whole window until backfill rebuilds the index — re-deriving the untainted chunks from their on-disk `.pack`s and re-fetching only the tainted ones. Surgical recovery, in general, shrinks the derivation's inputs by demoting state. + +### Design invariants + +These are streaming-specific properties the implementation guarantees on top of the substrate, and that INV-1 through INV-4 depend on: + +- **Every key precedes its file.** The pre-write `"freezing"` mark and post-fsync `"frozen"` flip mean any file on disk — per-chunk artifact or index file, partial or complete — has its catalog key set. Every scan and sweep iterates keys, so every file is reachable that way; nothing ever lists a directory to find work. +- **Index promotion is atomic and gap-free.** The commit batch freezes the new coverage and demotes its predecessor in one synced write, so the window's unique frozen key changes hands atomically — never two frozen keys, never none once the window has one. A reader following the frozen key always lands on a complete, fsynced index; a crash mid-build leaves the prior coverage frozen and the attempt as `"freezing"` debris that is either overwritten by the next build of that coverage or deleted unread by the sweeps. +- **Key absent ⟹ file gone.** Every sweep's shared ordering (unlink → `fsyncDir` → atomic key delete) gives the exit-side counterpart. +- **Hot DB keys bracket the directory.** The `hot:chunk:{chunk}` key is put (`"transient"`) before the directory is created, and deleted only after rmdir completes — with `"transient"` re-marked first. +- **Tx hashes always have a queryable home.** The hot DB is discarded only after the durable `.idx` covers the chunk — hot CF, then `.idx`, with no gap. (The `.bin` is never a serving tier; it is rebuild input, demoted to `"pruning"` by the terminal commit batch — the same write that freezes the final `.idx` — or by retention pruning once its chunk falls past the floor, and deleted only by the sweep after that.) +- **`"frozen"` ⟹ the file is durable and complete.** Flips to `"frozen"` happen only after fsync, and files are deleted only under non-frozen keys (sweeps demote first) — so frozen keys can be trusted blindly by readers and the resolver. +- **`"pruning"` is committed.** Once a key is in `"pruning"` — demoted by a commit batch or by retention — the sweep runs to completion on subsequent scans. Backfill treats any non-`"frozen"` state as empty and overwrites cleanly if the range is re-ingested. + +### Scenario coverage + +INV-1 holds at every point the daemon is serving reads — transient states are never externally visible, because a read resolves only a `"ready"` hot DB or a `"frozen"` cold artifact — never a `"freezing"`/`"pruning"`/`"transient"` key, and the retention check masks everything else. INV-2, INV-3, and INV-4 hold at every settled state reached after the events below; startup's first settled state arrives when the first run completes, shortly after reads open. + +1. **Steady-state operation.** Hot DB ingestion advances the last committed ledger; the lifecycle goroutine freezes complete chunks within retention and prunes anything past it. All four invariants hold by induction on it. +2. **Operator state changes — widening or shortening retention (`retention_chunks`).** Changing `retention_chunks` recomputes the retention floor, and the next startup converges to the new state. Backfill's per-window rule rebuilds any window whose desired coverage now exceeds what's stored, and the prune stage removes anything below a raised floor. + + Widening takes effect on the *next startup*, not immediately: a running daemon holds the retention config it started with, so its floor never drops mid-run — the lower floor, and the backfill that fills down to it, apply only at the next startup. `earliest_ledger` is not a live change at all: it is pinned on the first start and immutable, so editing the config never moves the floor (the only way to change it is to wipe the data directory and start fresh). +3. **Surgical recovery (tainted data).** The operator never touches the filesystem. Recovery is **one atomic catalog batch** that *demotes* the affected keys — it never removes them — split by tier. Tainted cold artifacts (`chunk:{c}:*` and every overlapping `index:*` key) go to `"freezing"`, the state that already means *this file is not to be trusted: re-derive or delete*. For the hot tier, demote **every `hot:chunk` at or above the lowest tainted chunk — the live chunk always included** — to `"transient"`, not just the directly-tainted ones (the reason is the third paragraph). `"transient"` makes a hot DB instantly ineligible as a source (`backfillSource` reads only `"ready"`) and invisible to the last-committed-ledger derivation (which counts only `"ready"` keys). The batch commits atomically or not at all, and re-running it is a no-op; the catalog's lock means it can only be written against a stopped daemon. + + Everything then converges through machinery that already exists. Backfill re-derives the `"freezing"` cold artifacts from a conformant LedgerBackend — overwriting in place, the write protocol's ordinary re-materialization — and rebuilds each window's index. (If the backend tip lags below a re-derived chunk, `backfillSource` waits for coverage; see [the primitives](#the-primitives).) The `"transient"` hot DBs need no file surgery: `openHotDBForChunk` wipes and recreates one when re-ingestion re-opens that chunk, and the discard scan retires any sitting below the live chunk. + + **Why every hot DB at or above the taint, not just the tainted one.** The hot tier is repaired only by re-ingestion, which replays **forward** from the last committed ledger — the highest `"ready"` hot chunk. To replay a tainted hot chunk, that watermark must first fall *below* it; and since the watermark is the maximum over all `"ready"` hot chunks, it falls below the taint only once every hot DB at or above the lowest tainted chunk is demoted. Demoting just the tainted chunk would leave a higher `"ready"` chunk — ultimately the live chunk — pinning the watermark above the taint, so re-ingestion would never reach it. Once they are all demoted, the watermark drops to the last frozen boundary below the taint, captive core re-ingests the tail forward, and the untainted hot chunks swept up in the demotion are re-derived byte-identically. Every recovery demotes; nothing is removed by hand — the daemon's own sweeps and `openHotDBForChunk` handle the dirs in their existing crash-safe order. +4. **First deployment / downtime between restarts.** The last committed ledger derives to `max(frozen/hot maxima, earliest_ledger - 1)`, ensuring `resumeLedger ≥ earliest_ledger`. Backfill fills `[earliest_ledger, lastCompleteChunkAt(network_tip)]` if needed (a no-op for `earliest_ledger = "now"` first deployment). +5. **LedgerBackend choice or mid-flight swap.** The LedgerBackend contract guarantees canonical LCM bytes for any range, so any conformant backend produces byte-identical artifacts. Different backends differ in performance, not behavior. An operator using BSB for backfill and CaptiveCore for hot DB ingestion, or swapping mid-deployment, satisfies all four invariants. +6. **Crash at any point during any of the above.** Sync WAL plus per-ledger durability ordering mean the catalog on next start is internally coherent and the derived last committed ledger equals exactly what the last synced batch committed. Idempotency means re-running any half-finished op is safe. Convergence finishes whatever the crash interrupted. + +### What a bug looks like + +The invariants describe what storage should look like, not how the phase scans maintain it. So common bugs show up as concrete violations: + +- **A catalog key claims something the file doesn't actually deliver** — e.g., a per-chunk writer flips a key to `"frozen"` before fsync (leaving a partial file the catalog advertises as complete), or an index key freezes before its `.idx` is fully fsynced, or the key name's `{lo, hi}` doesn't match the file's actual coverage, or a frozen file is mutated post-freeze ⟹ reads through the catalog key see wrong or missing data. **INV-1** violated. Detectable by re-deriving an artifact via a conformant LedgerBackend and byte-comparing against the on-disk file. +- **Pruning too aggressive** ⟹ a request whose ledger scope is in retention returns wrong or missing results. Issue a read to find it. **INV-1** violated. +- **Two frozen index keys in one window** — a commit batch failed to demote the predecessor, or promotion and demotion landed as separate writes ⟹ readers have no well-defined index. Walk `index:*` keys, count `"frozen"` per window. **INV-2** violated. +- **A `"freezing"` or `"pruning"` key within `[floor, last complete chunk]` survives while serving and settled** ⟹ its recovery mechanism was skipped — an index transient the sweeps should have deleted, a `"pruning"` demotion the sweeps should have finished, or a per-chunk `"freezing"` key that the freeze phase or startup backfill should have re-materialized. Walk keys for transient values when settled, excluding the one corner INV-2 tolerates — a `"freezing"` artifact key *above* the last complete chunk after a hot-data recovery with a lagging backend tip, which no source can yet repair. **INV-2** violated. +- **Chunk scan misses an orphan** ⟹ a hot DB persists for a chunk that cold artifacts fully serve. Walk `hot:chunk:c` keys whose chunk has its artifacts durable and its window's index covering `c`. **INV-2** violated. +- **Finalization demotions don't complete** ⟹ per-chunk frozen tx hash files outlive the index that consumed them. Walk `chunk:c:txhash` keys whose window's frozen key has `hi` = the window's last chunk. **INV-2** violated. +- **A writer leaves a file on disk without its catalog key** (file fsynced before key was durable, or a sweep deleted the key before its unlink was durable) ⟹ orphan file — invisible to every key-driven scan. Walk the filesystem against the catalog. **INV-3** violated. +- **A catalog key persists without its file** (file deleted before key) ⟹ dangling key. Walk the catalog against the filesystem. **INV-3** violated. +- **Duplicate cold artifacts for the same logical data** (e.g., two events files for the same chunk, from a migration or buggy retry) ⟹ the catalog names one expected path; the extras are orphans. Walk the filesystem against catalog-specified paths. **INV-3** violated. +- **Pruning fails past the floor** ⟹ files or keys remain for ranges below the retention floor. Walk catalog keys, compare ledger ranges to the floor. **INV-4** violated. + +A storage walk against the invariants is enough to find these without inspecting the phase implementations. + +--- + +## Related documents + +- The transactions design ([gettransaction-full-history-design.md](./gettransaction-full-history-design.md)) — the tx-by-hash subsystem end to end: the hot `txhash` CF, the `.bin`/`.idx` formats, the rolling window index rebuild — its streamhash merge internals and safety argument — the `getTransaction` read path, and the capacity numbers. Canonical for the streamhash `.bin`/`.idx` formats, the index merge internals, and the index-key coverage semantics this doc summarizes. +- The events design ([getevents-full-history-design.md](./getevents-full-history-design.md), PR #635) — the cold-segment file formats and the hot events CF schema referenced by the data model. +- The reader / query-routing design — how reads dispatch between hot DBs and frozen files for in-retention queries. diff --git a/design-docs/gettransaction-full-history-design.md b/design-docs/gettransaction-full-history-design.md new file mode 100644 index 000000000..cf07c0ec3 --- /dev/null +++ b/design-docs/gettransaction-full-history-design.md @@ -0,0 +1,253 @@ +# RPC getTransaction Full-History Design + +# Part 1: Problem and Scope + +## 1. Objective + +Serve `getTransaction(hash)` for any transaction whose ledger falls within the retention window (full history by default): + +- **Complete.** Every transaction in every in-retention ledger is resolvable by its hash, with no gaps — across crashes, restarts, and retention changes alike. The one exception is a hash-prefix collision so rare (~10⁻²⁰ for a dense window) that it counts as negligible, and even then it fails loudly rather than silently. §8.2 has it. +- **Correct.** A lookup never returns the wrong transaction; a missing or out-of-retention one returns not-found. +- **No in-memory index.** The map lives in on-disk `.idx` files, read through the page cache — not a RAM structure sized to the transaction count. The daemon's memory does not grow with the number of transactions in history. +- **Cheap to maintain.** Ingestion adds negligible cost to the per-ledger write, and the cold index stays current with a rebuild that is small relative to how often it runs. + +Out of scope: how a reader chooses which tier and window to consult and stays correct while files are added and removed (the query-routing design), and the storage of the transaction bytes themselves (the ledger store). + +## 2. Lookup model + +`getTransaction` takes a 32-byte transaction hash and returns the transaction's envelope, result, and meta, plus its ledger and close time. The data flow: + +``` +hash ──► seq ──► LCM for seq ──► extract the tx ──► verify hash ──► respond + (this doc) (ledger store) +``` + +Three properties of the transaction-hash key space shape the design: + +- **Point lookups only.** Every query is for one specific hash, never a range or prefix — exactly what a perfect hash is built for. +- **Hashes are uniform and immutable.** A transaction hash is never updated, and corresponds to at most one applied transaction (the network's replay protection). The map is append-only: one batch of entries per ledger. +- **The full transaction is always fetched anyway.** The response needs the envelope, result, and meta, so the read path always ends by fetching the transaction and checking its full 32-byte hash. That means the map needn't be exact — only *complete*, never missing a hash that is really there. False positives are harmless: a fingerprint screens most of them, and the final hash check catches the rest. + +--- + +# Part 2: Architecture + +## 3. The two tiers + +Each in-retention transaction lives in exactly one place — one tier, one window, never copied. But a hash on its own doesn't say which place, so a lookup checks them all, and at most one answers (none, if the hash isn't stored). The two places a transaction can live: + +| Tier | Structure | Serves | +|---|---|---| +| **Hot** | `txhash` CF of the per-chunk hot RocksDB | the live chunk, plus any frozen chunk the window index doesn't cover yet | +| **Cold** | one streamhash `.idx` per window, covering chunks `[lo, hi]` | every chunk in `[lo, hi]` (at/below the frozen `hi`, at/above the floor chunk `lo`) | + +``` + window w + chunks: [lo ···························· hi] [hi+1 ···] [live] + served by: └──────── {lo}-{hi}.idx ─────────┘ hot DBs hot DB + (awaiting (being + coverage) written) +``` + +The two tiers hand off with no gap. A chunk's hot table is dropped only *after* the cold index covers that chunk. So a freshly frozen chunk keeps being answered from its hot table until the index can answer for it, and only then does the hot table go away. Every transaction is findable in exactly one tier at all times. + +## 4. Geometry + +Two units organize the map. Every structure below is named by them: + +- **Chunk** — 10,000 ledgers (hardcoded). The unit of the hot DB and of the sorted runs. +- **Window** — 1,000 chunks = 10,000,000 ledgers (hardcoded). The unit of the cold index. + +``` +chunkID(seq) = (seq - 2) / 10_000 +chunkFirstLedger(c) = c * 10_000 + 2 +chunkLastLedger(c) = (c + 1) * 10_000 + 1 +indexID(c) = c / 1000 # takes a CHUNK id +chunksInIndex(w) = [w*1000, (w+1)*1000 - 1] +``` + +Window 0 spans ledgers 2–10,000,001 (chunks 0–999), window N spans N×10M+2 – (N+1)×10M+1 (chunks N×1000 – (N+1)×1000−1). All ids zero-pad `%08d`. + +--- + +# Part 3: Implementation Reference + +## 5. Hot tier + +### 5.1 Storage + +The hot tier is a plain key-value table, one per chunk, stored as a `txhash` column family in that chunk's RocksDB: + +- **Key**: the full 32-byte transaction hash. +- **Value**: the 4-byte ledger sequence. + +Storing the full hash makes the hot tier **exact**: a lookup either finds the hash or it doesn't. There are no false positives to screen out and nothing to verify. The table is tuned for point lookups — bloom filters on, ordering off. + +### 5.2 Write path + +Writing is straightforward. As each ledger is ingested, one `(hash, seq)` entry is added for every transaction in it, in the same atomic write that stores the rest of the ledger. So a ledger's hashes are written all-or-nothing, together with the rest of the ledger. + +### 5.3 Lifetime + +A chunk's hot table lives from the moment the chunk starts ingesting until the cold index covers it. Coverage can lag the chunk's freeze by a while; until it lands, the chunk is simply answered from its hot table. + +## 6. Cold artifacts + +The cold tier has two kinds of file: a per-chunk sorted run (`.bin`) and the per-window index (`.idx`). + +### 6.1 The per-chunk sorted run: `.bin` + +The `.bin` lives at `txhash/raw/{bucket:05d}/{chunk:08d}.bin`, with catalog key `chunk:{chunk:08d}:txhash`. It is produced once, when the chunk is frozen: as the chunk's ledgers are read, each transaction's `(hash, seq)` is collected, and at the end they are **sorted in memory** (~3M entries ≈ 60 MB for a dense chunk — negligible) and written out. + +**Format** (the streamhash merge format): + +``` +uint64 LE entry count +entry × count 20 bytes each: [key: 16][seq: 4 LE] +``` + +- `key` is the **first 16 bytes of the transaction hash**. The index uses only these 16 bytes to place and find a transaction; what happens when two hashes share a 16-byte prefix is in §8.2. +- Entries are sorted ascending by the **big-endian `uint64` prefix of `key`**. + +The `.bin` is a pre-sorted file, and a lookup never reads it directly. It is sorted because streamhash builds an index **much faster, and with much less memory, when its keys arrive already sorted** — its *sorted-builder mode*. + +A `.bin` is kept while it is still a rebuild input — every rebuild re-merges the `.bin` files for the chunks its window currently covers. Once the window is complete and its final index is built, the `.bin` files are no longer needed, and are deleted — or, if retention is narrower than a window so its chunks age out before the window completes, retention pruning deletes them first. + +### 6.2 The per-window index: `.idx` + +The `.idx` lives at `txhash/index/{window:08d}/{lo:08d}-{hi:08d}.idx`, tracked by the catalog key `index:{window:08d}:{lo:08d}:{hi:08d}`. There is one minimal-perfect-hash file per **coverage** — a coverage being the chunk range `[lo, hi]` the file actually hashes. Streamhash's `SortedBuilder` builds it from the k-way merge of `.bin[lo..hi]`. The index carries two per-entry fields: + +- **Payload (3 bytes): the answer the hash maps to — a ledger seq.** It is stored as an offset from the window's first ledger (`MinLedger = chunkFirstLedger(lo)`) rather than as a full seq, to save bytes. A window spans 10,000,000 ledgers, so the largest offset (`10_000_000 - 1`) fits in a 24-bit field. Streamhash writes the payload width into the index file's header; `MinLedger`, which streamhash does not model itself, rides in the file's user-metadata slot. Both are read back at lookup time, so there is no separate sidecar file. +- **Fingerprint (`fpWidth` bytes, default 1): a few bytes per entry to screen out wrong hashes** before the expensive fetch-and-verify. Because a lookup probes every in-retention window (§8.2), a wider fingerprint is a trade-off: it costs index size (+1 byte per transaction) but cuts the number of false-positive fetches across those windows. Fixed per build. + +All-in, the index costs ≈4.2 bytes per transaction (MPHF structure + payload + fingerprint) — ≈12.5 GB for a dense full window, versus the ≈60 GB of `.bin` runs it consumes. + +### 6.3 Coverage and the live index + +An index file is named by its **coverage** — the chunk range `[lo, hi]` it hashes: + +- **`lo`** — the lowest chunk the index covers. It is the window's first chunk, unless the retention floor has cut into the window, in which case it rises to the first chunk still retained. +- **`hi`** — the highest chunk the index covers. While the window is the current one (the network tip is in it), `hi` advances by one chunk on each rebuild. Once the window is complete, `hi` is its last chunk and the index is final. + +A window has exactly **one live index** at a time, and a lookup resolves "the window's index" to that one file. A rebuild builds a new index at a wider coverage and replaces the live one; the replacement is atomic, so a lookup always sees one complete index, never a half-built one. (How that swap stays atomic across a crash is the daemon's write protocol, in the streaming doc.) + +So the index hashes exactly the transactions in chunks `[lo, hi]`. Chunks below `lo` are out of scope — cut off by the floor. Chunks above `hi` aren't folded in yet, and are served from their hot tables until the next rebuild advances `hi`. + +**Example** (1,000 chunks per window): the tip is in chunk 5350, so window 5 (chunks 5000–5999) is the current window, and the floor is at chunk 5100. The live index covers chunks 5100–5349, in the file `txhash/index/00000005/00005100-00005349.idx`; chunk 5350 is still in its hot table, and chunks 5000–5099 are below the floor. At the next boundary the index is rebuilt to cover 5100–5350, and the old file is deleted. + +## 7. The rolling rebuild + +### 7.1 Rebuild cadence and cost + +The current window's index is **rebuilt from scratch on every chunk boundary**, to fold in the chunk that just froze; it grows until the window is complete. Only the current window is ever rebuilt — a finalized window's index never changes. + +This is affordable because the rebuild is cheap relative to its cadence: a full-window build takes ≈1 minute, against a boundary only every ~14 hours at mainnet rates (Part 4). Rebuilding the whole index each time keeps every `.idx` on disk a complete index for its coverage, with no half-updated state. + +### 7.2 The rebuild + +To rebuild window `w`'s index over coverage `[lo, hi]`: + +1. **Skip if already done.** If the live index already covers exactly `[lo, hi]`, there is nothing to do. +2. **Merge.** Merge the sorted `.bin` files for chunks `[lo, hi]` into a new index file, with streamhash's sorted-builder. (Every chunk in `[lo, hi]` must have a `.bin`; a missing one fails the merge.) +3. **Swap in.** Make the new file the window's live index, replacing the previous one. + +```go +// rebuild window w's index over [lo, hi] +sb := streamhash.NewSortedBuilder(newIndexFile, sortedBuilderOpts) +for entry := range kWayMerge(binFiles(lo, hi)) { // sorted .bin files → one stream + sb.Add(entry) +} +sb.Finish() +// then make newIndexFile the window's live index, replacing the old one +``` + +Because a rebuild writes a whole new file and only swaps it in at the end, the live index is never partially updated: a lookup sees either the old index or the new one, never something in between. + +### 7.3 Finalization + +When a window's last chunk is folded in, its index is final: it covers the whole window and is not rebuilt again — unless retention later widens to include older chunks, when it is rebuilt wider to cover them. The window's `.bin` files have done their job as rebuild inputs, and are deleted. + +### 7.4 Disk use during a rebuild + +A rebuild writes a whole new index file before the old one is removed, so a window directory briefly holds ~2× the index size (~25 GB at the end of a dense window). The window's `.bin` files are also all on disk together, since the rebuild merges them at once — about 60 GB for a dense window. Both are transient. + +The window-end rebuild writes ~12.5 GB in ~1 minute (~200 MB/s burst) — trivial on instance NVMe, but worth provisioning for on throughput-capped volumes like EBS gp3. + +## 8. Query path + +### 8.1 Routing + +A hash names no ledger, so the reader cannot know which home holds it in advance — it **probes them all**, and the hash resolves in exactly one: + +| Tier | Probe set | How | +|---|---|---| +| cold — one `.idx` per window | **every in-retention window** | MPHF + fingerprint + verify (§8.2) | +| hot — `txhash` CF per chunk | the chunks above any window's `hi` (live, or frozen awaiting coverage) | exact full-key get (§8.3) | + +The hot tier is a few chunks at most — one window's tail, normally just the live chunk — so the probe set is `≈ (in-retention windows) + (a handful of chunks)`. How the reader learns current coverage and stays consistent across rebuilds is the query-routing design's concern. This document requires only two things: that the two tiers together cover the whole retention window (the gap-free hot→cold handoff, §5.3), and that each transaction lives in exactly one of them. So **at most one probe confirms**: the verify runs on every fingerprint hit but succeeds for at most one. + +### 8.2 Cold lookup + +The cold tier **probes every in-retention window's `.idx`**. A hash gives no hint about which window it's in — to know the window you'd compute `chunkID(seq) / 1000`, and `seq` is the very thing the lookup is trying to find. So there is nothing to pre-select, and each window is probed in turn: + +``` +for each in-retention window (its live index → {lo}-{hi}.idx): + → MPHF probe on the hash's 16-byte prefix + → fingerprint check (fpWidth bytes) — miss ⇒ skip this window + → on a fingerprint hit: + seq = MinLedger + payload (3 bytes) + retention gate: seq ≥ floor? — else skip this window + fetch the LCM for seq, extract the tx + verify the full 32-byte hash — confirms, or rejects a false positive +respond on the confirmed hit; not-found if no window confirms +``` + +Because the hash belongs to at most one window, **at most one window confirms**; a not-found lookup — a non-existent or not-yet-ingested hash — confirms none and must rule out every in-retention window. + +The final verification is essential: a minimal perfect hash returns a slot for *any* input, including a hash it doesn't contain, so every hit must be confirmed. The fingerprint screens out most foreign hashes cheaply, and the fetch-and-verify rejects the rest. + +A **16-byte prefix collision between two distinct in-retention transactions** has two cases, and only one bounds completeness. The cold index keys on streamhash's 128-bit routing key (§6.1), so two hashes sharing their first 16 bytes are indistinguishable *to a single window's build*. + +*Different windows* — the more likely of the two, since a shared prefix is far more apt to straddle two of history's windows than to fall inside one. Each transaction keys into its own window's `.idx`, so neither build sees a duplicate and both resolve normally. The collision shows up only as a fingerprint false-positive when a lookup probes the *other* window. That window's MPHF maps the shared prefix to its own resident transaction, and the fingerprint (also derived from those 16 bytes) matches — but the fetch-and-verify rejects it, because the full 32-byte hashes differ. This is exactly the foreign-key path the verify already exists for: one wasted ledger fetch, no wrong answer and no false negative. + +*Same window* — the genuine residual. The two are a single key to that window's builder, so streamhash rejects the duplicate at build time (`ErrDuplicateKey`) and the build fails **loudly**: it never silently drops a transaction, and the verify ensures it never returns a wrong one. This is the only bound on completeness, and it is tiny — the birthday probability over a dense window's ~3×10⁹ keys against 2¹²⁸ is ~10⁻²⁰ per window, a cryptographic-scale risk accepted as negligible. + +**Probe ordering, parallelism, early-stop, and the resulting latency and I/O are the query-routing design's concern** (§8.1), out of scope here. + +### 8.3 Hot lookup + +Chunks above `hi` are probed in their hot DBs' `txhash` column family — an exact, full-key point get. A miss here is a real miss, with none of the cold tier's verification subtleties (the fetch-and-verify still runs, since the response needs the transaction anyway). In steady state this tier is just the live chunk, plus briefly the one chunk in the freeze-to-coverage gap. After catch-up or a crash it can be several chunks, shrinking as rebuilds advance `hi`. + +--- + +# Part 4: Capacity & Performance + +## 9. Storage footprint + +Per dense chunk (~3M transactions) and dense window (1,000 chunks, ~3×10⁹ transactions): + +| Structure | Unit cost | Dense chunk | Dense window | Lifetime | +|---|---|---|---|---| +| hot `txhash` CF | 36 B/tx raw (32 key + 4 value), before RocksDB overhead | ~110 MB raw | — (per-chunk) | chunk ingestion → index coverage | +| `.bin` sorted run | 20 B/tx exactly | ~60 MB | ~60 GB | chunk freeze → window finalization, or retention floor | +| `.idx` | ≈4.2 B/tx (3-byte payload) | — (per-window) | ~12.5 GB | build → superseded next boundary, or retention | + +Transient peaks: ~2× the index size in the window dir during each rebuild (~25 GB at window end); the `.bin` files for the in-flight window total ~60 GB. Both are transient (§7.4). The steady-state durable cost of the cold tier is the `.idx` files alone: ≈4.2 bytes per transaction across all retained history. + +## 10. Performance + +- **Ingest, hot**: one `(hash, seq)` put per transaction, inside the ledger's existing write. +- **Ingest, cold**: the in-memory sort of ~3M entries is negligible against the chunk's streaming pass; the `.bin` write is sequential. +- **Rebuild**: a full dense window merges ~60 GB of sorted `.bin` files into a ~12.5 GB `.idx` in ≈1 minute (~200 MB/s write burst), measured in the `bench-fullhistory` harness. Mid-window rebuilds scale with `hi − lo`. Against a ~14-hour boundary cadence at mainnet rates, the rebuild is a ~0.1% duty cycle. +- **Lookup, cold**: one MPHF probe per in-retention window — fingerprint screen, then fetch-and-verify on a hit. The hash is in at most one window, so at most one fetch confirms; fingerprint false positives (bounded by `fpWidth`, §6.2) are rejected by the full-hash verify. Probe ordering, parallelism, and the resulting latency/throughput are the query-routing design's concern (§8.1). +- **Lookup, hot**: one RocksDB point get in a bloom-filtered CF, then the same ledger fetch. + +--- + +## Related documents + +- [full-history-streaming-workflow.md](./full-history-streaming-workflow.md) — the daemon this subsystem lives in: geometry, the catalog and one write protocol, `processChunk`, the resolver and executor, the lifecycle run (freeze → rebuild → discard → prune), and the correctness invariants (INV-1 … INV-4) with their audits. +- The reader / query-routing design — how readers obtain current coverage and dispatch between hot DBs and frozen files across transitions. +- [getevents-full-history-design.md](./getevents-full-history-design.md) — the sibling subsystem (events), same hot/cold architecture over the same chunk geometry. +- [packfile-library.md](./packfile-library.md) — the `.pack` format the read path's ledger fetch lands on. +- `bench-fullhistory` — the measurement harness behind every figure in Part 4. diff --git a/full-history/design-docs/03-backfill-workflow.md b/full-history/design-docs/03-backfill-workflow.md deleted file mode 100644 index dbb7aa05f..000000000 --- a/full-history/design-docs/03-backfill-workflow.md +++ /dev/null @@ -1,698 +0,0 @@ -# Backfill Workflow - -## Overview - -Backfill populates the immutable stores for a configured ledger range `[start_ledger, end_ledger]`. - -**What it does:** -- Ingests historical ledgers offline — no live queries served (only `getHealth` / `getStatus`). `getHealth` is the existing lightweight liveness check; `getStatus` is the new backfill-specific progress endpoint (see [getStatus API Response](#getstatus-api-response) below). -- Writes directly to immutable file formats — no RocksDB active stores -- Schedules work as a DAG of idempotent tasks, dispatched via a flat worker pool (default GOMAXPROCS slots) -- Exits when done; on failure, re-run the same command — completed work is never repeated - -**What it produces:** - -| Query it enables | Immutable output | Scope | -|-----------------|-----------------|-------| -| `getLedger` | Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | Per chunk (10K ledgers) | -| `getTransaction` | Txhash index files | Per txhash index (default 10M ledgers) | -| `getEvents` | [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | Per chunk | - ---- - -## Geometry - -The Stellar blockchain starts at ledger 2. Backfill organizes data using two concepts: - -- **Chunk** — 10_000 ledgers (hardcoded, not configurable) - - Atomic unit of ingestion and crash recovery - - Produces: one ledger `.pack` file, one raw txhash `.bin` file, one events cold segment (`events.pack`, `index.pack`, `index.hash`) - - `chunk_id = (ledger_seq - 2) / 10_000` -- **Txhash Index** — `CHUNKS_PER_TXHASH_INDEX` chunks (default 1000 = 10M ledgers) - - One RecSplit index covers all transactions across `CHUNKS_PER_TXHASH_INDEX` chunks (default: 10M ledgers worth of transactions) - - Produces 16 CF (column family) `.idx` files per txhash index - - `index_id = chunk_id / CHUNKS_PER_TXHASH_INDEX` - - Configurable via TOML, but must not change across runs — once set, it is fixed - -### ID Formulas - -``` -chunk_id = (ledger_seq - 2) / 10_000 -index_id = chunk_id / CHUNKS_PER_TXHASH_INDEX -``` - -Example with `CHUNKS_PER_TXHASH_INDEX = 1000` (default): - -| Txhash Index ID | First Ledger | Last Ledger | Chunks | -|-----------------|-------------|------------|--------| -| 0 | 2 | 10_000_001 | 0–999 | -| 1 | 10_000_002 | 20_000_001 | 1000–1999 | -| 2 | 20_000_002 | 30_000_001 | 2000–2999 | -| N | (N × 10M) + 2 | ((N+1) × 10M) + 1 | N×1000 – (N+1)×1000 - 1 | - -All IDs use uniform `%08d` zero-padding (supports up to 99_999_999). - ---- - -## Configuration - -TOML file, passed via `stellar-rpc full-history-backfill --config path/to/config.toml`. - -- **TOML** defines data layout and storage paths — must be stable across runs -- **CLI flags** define per-run parameters (range, workers, retries) - -### TOML Config - -**[SERVICE]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | - -**[BACKFILL]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per txhash index. Defines data layout — must be stable across runs. | - -**[IMMUTABLE_STORAGE.LEDGERS]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/ledgers` | Base path for ledger pack files. | - -**[IMMUTABLE_STORAGE.EVENTS]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/events` | Base path for events cold segments. | - -**[IMMUTABLE_STORAGE.TXHASH_RAW]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/raw` | Base path for raw txhash `.bin` files (transient). | - -**[IMMUTABLE_STORAGE.TXHASH_INDEX]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/index` | Base path for RecSplit index files (permanent). | - -The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-backed mutable stores used by the streaming workflow). - -**[BACKFILL.BSB]** — BSB / Buffered Storage Backend (required) - -| Key | Type | Default | Description | -|-----|------|---------|-------------------------------------------------------------------------------------| -| `BUCKET_PATH` | string | **required** | Remote object store path to fetch LedgerCloseMeta (without `gs://` prefix for GCS). | -| `BUFFER_SIZE` | int | `1000` | Prefetch buffer depth per connection. | -| `NUM_WORKERS` | int | `20` | Download workers per connection. | - -**[LOGGING]** - -Both keys are optional. When a key is set in both TOML and on the CLI, the CLI flag wins — specifying both is not an error. - -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `LEVEL` | string | `"info"` | Minimum log severity. Accepted values: `debug` / `info` / `warn` / `error`. | -| `FORMAT` | string | `"text"` | Log output format. Accepted values: `text` / `json`. | - -### CLI Flags - -| Flag | Type | Default | Description | -|------|------|---------|-------------| -| `--start-ledger` | uint32 | **required** | First ledger (inclusive). Must be ≥ 2. | -| `--end-ledger` | uint32 | **required** | Last ledger (inclusive). Must be > `start_ledger`. | -| `--workers` | int | `GOMAXPROCS` | Total concurrent DAG task slots. | -| `--verify-recsplit` | bool | `true` | Run RecSplit verify phase after build. | -| `--max-retries` | int | `3` | Max retries per task before marking it failed. | -| `--log-level` | string | — | Overrides `[LOGGING].LEVEL` when set. | -| `--log-format` | string | — | Overrides `[LOGGING].FORMAT` when set. | - -### Optional TOML Sections - -| Section | Key | Default | Description | -|---------|-----|---------|-------------| -| `[META_STORE]` | `PATH` | `{DEFAULT_DATA_DIR}/meta/rocksdb` | Meta store RocksDB directory | - -### Validation Rules - -The only hard constraints are: - -- `start_ledger >= 2` -- `end_ledger > start_ledger` -- `[BACKFILL.BSB]` must be present -- `CHUNKS_PER_TXHASH_INDEX` must not change after the first run — changing it invalidates existing txhash index boundaries -- Backfill never prunes existing data — narrowing the range between runs is safe (completed work outside the new range is simply left untouched) -- No txhash-index-alignment required — the operator can pass any arbitrary ledger range -- If gaps remain after backfill, streaming mode validates completeness for all chunks and all txhash indexes at startup, reports any gaps to the operator, and aborts - -#### Chunk Boundary Expansion - -- System expands the requested range **outward** to the nearest chunk boundaries -- Start expands DOWN to the first ledger of its chunk -- End expands UP to the last ledger of its chunk -- Never clamps inward — the effective range is always ≥ the requested range -- Operator doesn't need to manually calculate chunk-aligned values - -``` -Operator requests: --start-ledger 5_000_000 --end-ledger 56_337_842 -Chunk boundary expand: start=5_000_000 falls within chunk 499 (starts at 4_990_002) - → expand start to 4_990_002 - end=56_337_842 falls within chunk 5633 (ends at 56_340_001) - → expand end to 56_340_001 -Effective range: ledgers 4_990_002–56_340_001 = 5_135 chunks -``` - -#### BSB Availability Validation - -After expansion, the system validates that the remote object store referenced by BSB contains all ledgers in the expanded range: - -- Expanded end exceeds BSB availability → error at startup (no silent truncation) -- Operator must either reduce `--end-ledger` or wait for more ledgers to become available in BSB - -#### Partial Txhash Index Ranges - -If the expanded range does not complete a full txhash index: - -- Chunks are still backfilled and immediately serve `getLedger`/`getEvents` when the service is started in streaming mode -- Txhash index creation only happens once **all** input chunks for the txhash index are ready -- If txhash index creation does not happen in the current backfill run, the remaining chunks are completed either by a subsequent backfill run (should the operator run backfill again) or when streaming mode starts for the first time (see [Implications for Streaming Workflow](#implications-for-streaming-workflow) below) - -Ledger and events data are useful per-chunk and should not be blocked by txhash index alignment. Without relaxed validation: - -- A node at ledger 56_340_000 cannot backfill the latest ~6.3M ledgers because `50_000_002–56_340_001` doesn't align to a 10M txhash index boundary — the operator would have to wait until ledger 60_000_001 -- Incremental backfill (extending coverage from a completed txhash index to recent history) would be blocked unless the chain happens to sit on a txhash index boundary - -#### Implications for Streaming Workflow - -When backfill completes at a non-txhash-index-aligned boundary, a partially-filled txhash index remains. The streaming workflow completes the remaining chunks: - -- Streaming continues chunk ingestion from where backfill left off, writing the same per-chunk outputs (LFS, txhash, events) using the same flag-based idempotency -- When streaming completes the last chunk needed for a pending txhash index, txhash index creation becomes eligible and runs -- The meta store is the shared coordination point — streaming checks the same chunk flags as backfill, so there is no gap or overlap between backfill and streaming coverage - -See [PR #617 discussion](https://github.com/stellar/stellar-rpc/pull/617#discussion_r2969796337) for the original rationale. - -### Example: GCS Backfill Config - -```toml -[SERVICE] -DEFAULT_DATA_DIR = "/data/stellar-rpc" - -[BACKFILL] -CHUNKS_PER_TXHASH_INDEX = 1000 - -[IMMUTABLE_STORAGE.LEDGERS] -PATH = "/mnt/nvme/ledgers" - -[IMMUTABLE_STORAGE.EVENTS] -PATH = "/mnt/nvme/events" - -[IMMUTABLE_STORAGE.TXHASH_RAW] -PATH = "/mnt/nvme/txhash/raw" - -[IMMUTABLE_STORAGE.TXHASH_INDEX] -PATH = "/mnt/nvme/txhash/index" - -[BACKFILL.BSB] -BUCKET_PATH = "sdf-ledger-close-meta/v1/ledgers/pubnet" - -[LOGGING] -LEVEL = "info" -FORMAT = "text" -``` - -```bash -stellar-rpc full-history-backfill --config config.toml \ - --start-ledger 2 \ - --end-ledger 30_000_001 \ - --workers 40 -``` - ---- - -## Directory Structure - -With geometry (chunk, txhash index) and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is how they map to the filesystem. - -- Each data type has its own directory tree rooted at its `IMMUTABLE_STORAGE.*.PATH` -- Chunk-level files (ledgers, events, raw txhash) are grouped into subdirectories (bucket) of 1_000 chunks: - - `bucket_id = chunk_id / 1000` (hardcoded, not configurable), formatted as `%05d` - - `bucket_id` is purely a filesystem concern — it does not appear in meta store keys, DAG dependencies, or config -- Txhash index output is the only structure that uses `index_id` instead of `bucket_id` -- Directories are created on-demand via `os.MkdirAll` (safe for concurrent writes) - -``` -{DEFAULT_DATA_DIR}/ -├── meta/ -│ └── rocksdb/ ← Meta store (WAL always enabled) -│ -├── ledgers/ ← IMMUTABLE_STORAGE.LEDGERS.PATH -│ ├── 00000/ ← chunks 0–999 (1_000 .pack files) -│ │ ├── 00000000.pack ← ledger pack file (PR #633) -│ │ ├── 00000001.pack -│ │ └── ... -│ ├── 00001/ ← chunks 1000–1999 -│ │ └── ... -│ └── .../ -│ -├── events/ ← IMMUTABLE_STORAGE.EVENTS.PATH -│ ├── 00000/ ← chunks 0–999 (3_000 files: 3 per chunk) -│ │ ├── 00000000-events.pack ← compressed event blocks -│ │ ├── 00000000-index.pack ← serialized roaring bitmaps -│ │ ├── 00000000-index.hash ← MPHF for term → slot lookup -│ │ └── ... -│ └── .../ -│ -└── txhash/ - ├── raw/ ← IMMUTABLE_STORAGE.TXHASH_RAW.PATH - │ ├── 00000/ ← chunks 0–999 (1_000 .bin files) - │ │ ├── 00000000.bin ← TRANSIENT (deleted after RecSplit) - │ │ └── ... - │ └── .../ - └── index/ ← IMMUTABLE_STORAGE.TXHASH_INDEX.PATH - ├── 00000000/ ← txhash index 0 (16 RecSplit CF files) - │ └── cf-{0-f}.idx ← PERMANENT - └── .../ -``` - -`CHUNKS_PER_TXHASH_INDEX` only affects `txhash/index/` — all other trees use the hardcoded 1_000-chunk `bucket_id` grouping regardless. - -The directory tree above reflects the default `CHUNKS_PER_TXHASH_INDEX = 1000`. Using 20M ledgers (2_000 chunks) as an example: - -| `CHUNKS_PER_TXHASH_INDEX` | Txhash index dirs | Tradeoff | -|---------------------------|-------------------|----------| -| `1000` (default) | 2_000 / 1000 = 2 | Fewer dirs, larger indexes — longer build time per index, fewer files to search at query time | -| `100` | 2_000 / 100 = 20 | More dirs, smaller indexes — faster build time per index, more files to search at query time | -| `1` | 2_000 / 1 = 2_000 | One index per chunk — fastest build, most files to search | - -### Path Conventions - -| File Type | Pattern | Example | -|-----------|---------|---------| -| Ledger pack | `{IMMUTABLE_STORAGE.LEDGERS.PATH}/{bucketID:05d}/{chunkID:08d}.pack` | `ledgers/00000/00000042.pack` | -| Raw txhash | `{IMMUTABLE_STORAGE.TXHASH_RAW.PATH}/{bucketID:05d}/{chunkID:08d}.bin` | `txhash/raw/00000/00000042.bin` | -| RecSplit CF | `{IMMUTABLE_STORAGE.TXHASH_INDEX.PATH}/{indexID:08d}/cf-{nibble}.idx` | `txhash/index/00000000/cf-a.idx` | -| Events data | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-events.pack` | `events/00000/00000042-events.pack` | -| Events index | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-index.pack` | `events/00000/00000042-index.pack` | -| Events hash | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-index.hash` | `events/00000/00000042-index.hash` | - -- **Nibble** = high 4 bits of `txhash[0]`, i.e., `txhash[0] >> 4`. Values `0`–`f`. Determines which of 16 CFs a txhash is routed to. -- **Raw txhash format**: 36 bytes per entry, no header: `[txhash: 32 bytes][ledgerSeq: 4 bytes big-endian]` -- **Events cold segment**: See [getEvents full-history design](https://github.com/stellar/stellar-rpc/pull/635) for the full format specification. - ---- - -## Meta Store Keys - -- Single RocksDB instance with WAL (Write-Ahead Log) always enabled -- Authoritative source for crash recovery — all resume decisions derive from key presence in this store - -### Key Schema - -All IDs use uniform `%08d` zero-padding, matching the directory structure. - -| Key Pattern | Value | Written When | -|-------------|-------|-------------| -| `chunk:{C:08d}:lfs` | `"1"` | After ledger `.pack` file is fsynced | -| `chunk:{C:08d}:txhash` | `"1"` | After raw txhash `.bin` file is fsynced | -| `chunk:{C:08d}:events` | `"1"` | After events cold segment files (`events.pack`, `index.pack`, `index.hash`) are fsynced | -| `index:{N:08d}:txhash` | `"1"` | After all 16 RecSplit CF `.idx` files are built and fsynced | - -- Values are `"1"` (retained for `ldb`/`sst_dump` readability); key presence is the signal -- Key absence means not started or incomplete — treated identically on resume -- Each chunk flag is written independently after its output's fsync — a crash may leave some flags set and others absent for the same chunk -- On resume, each chunk's flags are checked independently — only missing outputs are produced -- WAL is always enabled — disabling it would invalidate all crash recovery -- `chunk:{C}:txhash` keys are deleted after the txhash index is built (the raw `.bin` files they reference are also deleted); all other flags are permanent - -**Examples:** -``` -chunk:00000000:lfs → "1" chunk 0 ledger pack done -chunk:00000000:txhash → "1" chunk 0 raw txhash done -chunk:00000000:events → "1" chunk 0 events cold segment done -chunk:00000999:events → "1" last chunk of txhash index 0 -index:00000000:txhash → "1" txhash index 0 RecSplit complete -index:00000001:txhash → absent txhash index 1 not yet built -``` - -### Key Lifecycle - -``` -chunk ingestion → sets chunk:{C}:lfs, chunk:{C}:txhash, chunk:{C}:events - (each independently, after its output's fsync) -txhash index build → sets index:{N}:txhash -txhash cleanup → deletes chunk:{C}:txhash keys + raw .bin files -``` - -After a completed txhash index: -- `chunk:{C}:lfs`, `chunk:{C}:events`, `index:{N}:txhash` — permanent -- `chunk:{C}:txhash` keys + raw `.bin` files — deleted after txhash index is built - ---- - -## Tasks and Dependencies - -The backfill DAG has three task types: - -| Task | Cadence | Dependencies | Produces | -|------|---------|-------------|----------| -| `process_chunk(chunk_id)` | Per chunk (10K ledgers) | None | Ledger `.pack` + raw txhash `.bin` + events cold segment | -| `build_txhash_index(index_id)` | Per txhash index | All `process_chunk` tasks for this txhash index | 16 RecSplit `.idx` files | -| `cleanup_txhash(index_id)` | Per txhash index | `build_txhash_index` for this txhash index | Deletes raw `.bin` files + `chunk:{C}:txhash` meta keys | - -- Each task is a black box to the DAG scheduler — it calls `Execute()` and waits for return -- What happens inside (goroutines, I/O, parallelism) is up to the task - -### Dependency Diagram - -For a single txhash index with N chunks: - -``` -process_chunk(chunk 0) ─┐ -process_chunk(chunk 1) ─┤ -process_chunk(chunk 2) ─┼──→ build_txhash_index(index_id) ──→ cleanup_txhash(index_id) -... │ -process_chunk(chunk N) ─┘ -``` - -- All `process_chunk` tasks for a txhash index must complete before `build_txhash_index` fires -- `cleanup_txhash` runs after `build_txhash_index` succeeds -- Cleanup deletes the raw `.bin` files and their `chunk:{C}:txhash` meta keys - -### Main Flow - -```python -def run_backfill(config, flags): - - # 1. Validate — abort before any work if config is incompatible with existing state - validate(config, flags) - - # 2. Build DAG — register all tasks; each task's execute() handles its own no-op check - dag = build_dag(config, flags) - - # 3. Execute — dispatch all tasks concurrently, bounded by worker count - dag.execute(max_workers=flags.workers) # default GOMAXPROCS -``` - -### Validation - -Validation runs before DAG construction, not as a DAG task. If it were a DAG task, other tasks with no dependencies would start executing concurrently before validation completes — and if validation fails, in-flight work that should never have started would need to be cancelled. Running it first means a clean abort with no partial work. - -```python -def validate(config, flags): - # See Validation Rules for the full list of checks. - assert flags.start_ledger >= 2 - assert flags.end_ledger > flags.start_ledger - assert config.backfill.bsb is not None - assert CHUNKS_PER_TXHASH_INDEX unchanged from prior runs (if meta store is non-empty) -``` - -### DAG Setup - -```python -def build_dag(config, flags): - # Wires up tasks and dependency edges — no completion checks or skip logic. - # Each task's execute() handles its own no-op check (early return if already complete). - - dag = new DAG() - - for index_id in configured_indexes(config, flags): - chunk_tasks = [] - for chunk_id in chunks_for_index(index_id): - t = dag.add(ProcessChunkTask(chunk_id), deps=[]) - chunk_tasks.append(t.id) - b = dag.add(BuildTxHashIndexTask(index_id), - deps=chunk_tasks) - dag.add(CleanupTxHashTask(index_id), deps=[b.id]) - - return dag -``` - ---- - -## Task Details - -### process_chunk(chunk_id) - -- Processes a single 10K-ledger chunk end-to-end -- Occupies one DAG worker slot -- Only produces missing outputs — checks each flag independently -- Internal concurrency is an implementation detail - -**Outputs** (all produced in a single task, only if missing): -- Ledger pack file (`{chunkID:08d}.pack`) — compressed ledger data in [packfile format](https://github.com/stellar/stellar-rpc/pull/633) -- Raw txhash flat file (`{chunkID:08d}.bin`) — 36-byte entries consumed by RecSplit builder -- Events cold segment (`events.pack` + `index.pack` + `index.hash`) — per [getEvents design](https://github.com/stellar/stellar-rpc/pull/635) - -**Pseudocode:** - -```python -process_chunk(chunk_id): - bucket_id = chunk_id / 1000 # hardcoded subdirectory grouping (see Directory Structure) - first_ledger = chunk_first_ledger(chunk_id) - last_ledger = chunk_last_ledger(chunk_id) - - # 1. Check which outputs are missing - need_lfs = not meta_store.has(f"chunk:{chunk_id:08d}:lfs") - need_txhash = not meta_store.has(f"chunk:{chunk_id:08d}:txhash") - need_events = not meta_store.has(f"chunk:{chunk_id:08d}:events") - - if not (need_lfs or need_txhash or need_events): - return # all outputs already present - - # 2. Choose data source - if not need_lfs: - source = local_packfile(ledger_pack_path(bucket_id, chunk_id)) # NVMe, no BSB - else: - source = BSBFactory.create(first_ledger, last_ledger) # BSB connection - - # 3. Open writers only for missing outputs - ledger_writer = packfile.create(ledger_pack_path(bucket_id, chunk_id), - overwrite=True) if need_lfs else None - txhash_writer = open(raw_txhash_path(bucket_id, chunk_id), - overwrite=True) if need_txhash else None - events_writer = events_segment.create(events_path(bucket_id, chunk_id), - overwrite=True) if need_events else None - - # 4. Process each ledger - for seq in range(first_ledger, last_ledger + 1): - lcm = source.get_ledger(seq) - - if need_lfs: ledger_writer.append(compress(lcm)) - if need_txhash: txhash_writer.append(extract_txhashes(lcm)) # 36 bytes per tx - if need_events: events_writer.append(extract_events(lcm)) - - # 5. Fsync + flag each output independently - if need_lfs: - ledger_writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") - - if need_txhash: - txhash_writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:txhash", "1") - - if need_events: - events_writer.finalize() # flush, build MPHF + bitmap index, fsync - meta_store.put(f"chunk:{chunk_id:08d}:events", "1") - - source.close() -``` - -Key properties: -- Only missing outputs are produced — a partially-completed chunk resumes from where it left off -- If LFS is already present, reads from local NVMe instead of BSB (avoids redundant download) -- Each flag is written independently after its output's fsync — no atomic WriteBatch needed -- `packfile.Create()` with `overwrite=True` handles truncation of partial files from prior crashes — no explicit `delete_if_exists` check needed -- Naturally extends to new data types (add a fourth flag) - -**BSB** (BufferedStorageBackend): -- Ledger source backed by a remote object store -- Each `process_chunk` task creates its own BSB connection -- Internal prefetch workers: `BUFFER_SIZE` ledgers ahead, `NUM_WORKERS` download goroutines - -### build_txhash_index(index_id) - -- Builds the RecSplit txhash index for one completed txhash index -- Occupies one DAG worker slot, but spawns several goroutines internally -- The DAG guarantees all chunk `.bin` files exist before this runs - -**Pseudocode:** - -```python -build_txhash_index(index_id): - if meta_store.has(f"index:{index_id:08d}:txhash"): - return # already built — no-op - - bin_files = list_bin_files(index_id) # all .bin files for chunks in this txhash index - - # Phase 1: COUNT — scan all .bin files, count entries per CF - cf_counts = parallel_count(bin_files, workers=100) - # cf_counts[nibble] = number of (txhash, ledgerSeq) entries routed to that CF - - # Phase 2: ADD — re-read .bin files, route entries to CF builders - cf_builders = [RecSplitBuilder(cf_counts[n]) for n in range(16)] - parallel_add(bin_files, cf_builders, workers=100) - # each entry routed to cf_builders[txhash[0] >> 4] (mutex per CF) - - # Phase 3: BUILD — build MPH index per CF, one .idx file each - parallel_build(cf_builders, workers=16) - # each CF produces one .idx file; all fsynced - - # Phase 4: VERIFY (optional) — look up every key in the built indexes - if verify_recsplit: - parallel_verify(bin_files, cf_builders, workers=100) - - # Mark index complete - meta_store.put(f"index:{index_id:08d}:txhash", "1") -``` - -Key properties: -- COUNT and ADD each read all `.bin` files (two full passes over the data) -- BUILD runs 16 goroutines in parallel (one per CF) — each CF is independent -- VERIFY is skippable via `--verify-recsplit=false` cli flag -- All-or-nothing recovery: if `index:{N}:txhash` is absent on restart → delete partial `.idx` files → rerun entire build - -### cleanup_txhash(index_id) - -- Runs after `build_txhash_index` completes successfully - -**Pseudocode:** - -```python -cleanup_txhash(index_id): - for chunk_id in chunks_for_index(index_id): - if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - continue # already cleaned up — skip - delete(raw_txhash_path(bucket_id, chunk_id)) # remove .bin file - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # remove meta key -``` - -Key properties: -- Modeled as a separate DAG task (not inline in `build_txhash_index`) so crash recovery works naturally -- Per-chunk idempotency: each chunk checks its own `chunk:{C}:txhash` key before deleting — a crash mid-cleanup resumes from where cleanup left off -- On restart: DAG sees txhash index key present (build complete) but `chunk:{C}:txhash` keys still exist → cleanup runs as a normal task - ---- - -## Execution Model - -### DAG Scheduler - -- Pipeline builds a single DAG at startup, executes it with bounded concurrency -- The DAG is the only scheduling mechanism — no per-txhash-index coordinators, no secondary worker pools -- Each task's `Execute()` is wrapped with a retry loop bounded by `--max-retries` (default 3). Any transient failure (BSB errors, temporary I/O issues) triggers a retry at the task level. - -```python -run_dag(dag, max_workers): - worker_slots = Semaphore(max_workers) - runnable_tasks = ThreadSafeQueue(dag.tasks_with_no_pending_dependencies()) - - def execute_task(task): - """Runs in a background thread — one per dispatched task.""" - for attempt in range(1, max_retries + 1): - error = task.execute() - if error is None: - break - if attempt == max_retries: - mark_failed(task, error) # halt all dependents - break - log.warn("retry", task, attempt, error) - - worker_slots.release() # free worker slot - - # Check if completing this task unblocks any downstream tasks - for downstream in dag.dependents_of(task): - downstream.mark_dependency_done(task) - if downstream.all_dependencies_done(): - runnable_tasks.push(downstream) # now eligible to run - - # Main loop — dispatches tasks as they become runnable - while runnable_tasks: - current_task = runnable_tasks.pop() - worker_slots.acquire() # block until a worker slot is free - run_in_background(execute_task, current_task) # launch — returns immediately -``` - -### Worker Pool - -- Single flat pool of `workers` slots (default `GOMAXPROCS`) -- Any mix of task types can occupy slots simultaneously -- `process_chunk`: 1 slot per task -- `build_txhash_index`: 1 slot per task (uses many goroutines internally) -- `cleanup_txhash`: 1 slot per task - -### How Work Flows Through the Pipeline - -- All `process_chunk` tasks have no dependencies → DAG dispatches up to `workers` slots immediately at startup -- Chunks from different txhash indexes run side by side — the scheduler does not process txhash indexes sequentially -- When the last chunk of a txhash index completes → `build_txhash_index` becomes eligible, claims a slot -- After build completes → `cleanup_txhash` becomes eligible -- Remaining slots continue processing chunks for other txhash indexes throughout — no special coordination needed - ---- - -## Crash Recovery - -There is no separate crash recovery, reconciliation, or startup triage phase. Recovery happens organically because every task's `execute()` checks its own completion state: - -- On every startup, `build_dag()` registers ALL tasks for the configured range — no meta store scanning in DAG setup -- `process_chunk` checks each output flag independently — missing outputs are produced, existing outputs are skipped -- `build_txhash_index` checks `index:{N}:txhash` — if present, returns immediately; if absent, deletes partial `.idx` files and reruns the full build -- `cleanup_txhash` checks `chunk:{C}:txhash` per-chunk — already-cleaned chunks are skipped, remaining chunks are cleaned up - -This works because of three invariants: - -1. **Key implies durable file** — a meta store flag is set only after fsync -2. **Tasks are idempotent** — each checks its own outputs and skips or overwrites what exists -3. **DAG registers all tasks on every startup** — completed tasks return immediately from `execute()` - -### Concurrent Access Prevention - -- Meta store RocksDB uses kernel-level `flock()` on a `LOCK` file -- A second process attempting to open the same meta store fails immediately -- Released automatically on process exit (including `kill -9`) - - ---- - -## getStatus API Response - -During backfill, `getStatus` returns progress as task-type summaries: -- No per-txhash-index breakdown — just completed/pending/in_progress counts per task type - -```json -{ - "mode": "BACKFILL", - "tasks": { - "process_chunk": {"completed": 288, "pending": 5712, "in_progress": 40}, - "build_txhash_index": {"completed": 0, "pending": 6, "in_progress": 0}, - "cleanup_txhash": {"completed": 0, "pending": 6, "in_progress": 0} - }, - "eta_seconds": 1820 -} -``` - ---- - -## Error Handling - -Two layers of retry: - -- **BSB retries** — BSB handles transient errors internally (connection resets, throttling, etc). These retries happen within a single task execution and are not visible to the DAG scheduler. -- **Task-level retries** — the DAG scheduler wraps each task's `execute()` with a retry loop bounded by `--max-retries` (default 3). If a task returns an error after BSB has exhausted its own retries, the scheduler retries the entire task. After `--max-retries` exhausted → task marked failed → DAG halts all dependent tasks → process exits non-zero. - -Operator re-runs the same command; completed work is never repeated. - -| Error | Handled by | Action | -|-------|-----------|--------| -| BSB transient error (throttle, connection reset) | BSB internal retry | Retried within the task; transparent to DAG | -| BSB persistent error (BSB retries exhausted) | Task-level retry | `--max-retries` attempts; then ABORT | -| Ledger pack write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set | -| TxHash write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set | -| Events write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set | -| RecSplit build failure | Task-level retry | `--max-retries` attempts; then ABORT; txhash index key absent | -| Verify phase mismatch | None | ABORT immediately — data corruption, operator investigates | -| Meta store write failure | None | ABORT immediately — treat as crash, operator re-runs | diff --git a/full-history/design-docs/README.md b/full-history/design-docs/README.md deleted file mode 100644 index e682068d6..000000000 --- a/full-history/design-docs/README.md +++ /dev/null @@ -1,26 +0,0 @@ -# Stellar Full History RPC Service — Design Docs - -> **Scope**: Backfill pipeline only. Streaming pipeline design is covered separately. - -## Documents - -| Document | Description | -|----------|-------------| -| [03-backfill-workflow.md](./03-backfill-workflow.md) | Complete backfill design — geometry, meta store keys, directory layout, configuration, DAG task graph, execution model, crash recovery, getStatus API | - -The backfill doc is self-contained. Read it top-to-bottom for the full picture. - -## Quick Context - -The Stellar Full History RPC Service ingests the complete blockchain history. Primary use cases: - -- Retrieve any ledger from history -- Retrieve any transaction from history -- Retrieve any events with filter matching from history - -It has two modes: - -- **Backfill** — offline bulk import. Writes directly to immutable files (LFS chunks + RecSplit indexes). No RocksDB, no queries during ingestion. DAG-scheduled with a flat worker pool. -- **Streaming** — real-time ingestion via CaptiveStellarCore. Writes to RocksDB active stores, serves queries, transitions to immutable storage at index boundaries. Covered in a separate design doc. - -These modes are fully independent — separate code, separate crash recovery, separate transition workflows.