From 50d5da3a9acac410d9856fffaf1132bef338c3e8 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 15 Apr 2026 16:56:44 -0700 Subject: [PATCH 01/34] Add streaming workflow design doc, renumber design docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add 02-streaming-workflow.md: streaming mode design covering startup validation, first-start .bin loading, per-ledger ingestion loop, three independent sub-flow transitions (LFS, events, txhash), crash recovery invariants, backfill-to-streaming migration, and error handling - Rename 03-backfill-workflow.md → 01-backfill-workflow.md - Update README with new numbering, reading order, and completeness status - Add _ref-old-* files to .gitignore (local reference only) --- .gitignore | 3 + ...ll-workflow.md => 01-backfill-workflow.md} | 0 .../design-docs/02-streaming-workflow.md | 780 ++++++++++++++++++ full-history/design-docs/README.md | 74 +- 4 files changed, 843 insertions(+), 14 deletions(-) rename full-history/design-docs/{03-backfill-workflow.md => 01-backfill-workflow.md} (100%) create mode 100644 full-history/design-docs/02-streaming-workflow.md diff --git a/.gitignore b/.gitignore index 3978ca2bd..bd24138b6 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,6 @@ captive-core/ .soroban/ !test.toml *.sqlite* + +# Old design-docs-og reference files (local only) +full-history/design-docs/_ref-old-* diff --git a/full-history/design-docs/03-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md similarity index 100% rename from full-history/design-docs/03-backfill-workflow.md rename to full-history/design-docs/01-backfill-workflow.md diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md new file mode 100644 index 000000000..cc26e424c --- /dev/null +++ b/full-history/design-docs/02-streaming-workflow.md @@ -0,0 +1,780 @@ +# Streaming Workflow + +## Overview + +- Streaming mode ingests live 
Stellar ledgers from CaptiveStellarCore, one ledger at a time, while serving queries +- Default mode — when `--mode` is not specified, the service runs in streaming mode + +**What streaming does:** +- Ingests ledgers from CaptiveStellarCore in real-time (~1 ledger every 6 seconds) +- Writes each ledger to three active stores: ledger RocksDB, txhash RocksDB (16 CFs), and events hot segment +- Serves `getLedger`, `getTransaction`, and `getEvents` queries concurrently with ingestion +- At chunk boundaries (every 10_000 ledgers): transitions ledger store to LFS pack file and freezes events hot segment to cold segment — both in background +- At index boundaries (every 10_000_000 ledgers): builds RecSplit txhash index from transitioning txhash store — in background +- Long-running daemon that exits only on fatal error + +**How streaming differs from backfill:** + +| Dimension | Backfill | Streaming | +|---|---|---| +| Data source | BSB (GCS) | CaptiveStellarCore | +| RocksDB for ingestion | No — writes directly to files | Yes — three active stores | +| Txhash write format | `.bin` flat files (36 bytes/entry) | RocksDB txhash store (16 CFs) | +| RecSplit input | `.bin` flat files | RocksDB txhash store | +| Checkpoint granularity | Per-chunk (10_000 ledgers) | Per-ledger | +| Concurrency | Flat worker pool (DAG scheduler, default GOMAXPROCS slots) | Single ingestion goroutine, background transition goroutines | +| Queries | Not served (`getHealth`/`getStatus` only) | All endpoints available | +| Crash recovery | Re-run from first incomplete chunk | Resume from `streaming:last_committed_ledger + 1` | +| Process lifecycle | Exits when done | Long-running daemon | + +### Boundary Math Reference + +Boundary formulas are defined in [01-backfill-workflow.md](./01-backfill-workflow.md#geometry). 
Quick reference: + +```python +chunk_id = (ledger_seq - 2) // 10_000 +chunk_first_ledger(C) = (C * 10_000) + 2 +chunk_last_ledger(C) = ((C + 1) * 10_000) + 1 +index_id = chunk_id // chunks_per_txhash_index +index_last_ledger(N) = ((N + 1) * chunks_per_txhash_index * 10_000) + 1 +``` + +--- + +## Configuration + +### Immutable Config (shared with backfill) + +Streaming reads the same TOML config as backfill. The following keys are immutable — set on first backfill, fatal error if changed: + +| Key | Stored in Meta Store | Set By | Description | +|---|---|---|---| +| `chunks_per_txhash_index` | `config:chunks_per_txhash_index` | First backfill run | Chunks per txhash index. Must never change. | + +All `[immutable_storage.*]` paths from the backfill config apply unchanged. + +### Streaming-Specific TOML + +**[streaming]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `captive_core_config` | string | **required** | Path to CaptiveStellarCore config file. | +| `drift_warning_ledgers` | int | `10` | Log warning when ingestion lags network tip by this many ledgers (~60 seconds at 10 ledgers). | + +**[streaming.active_storage]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `path` | string | `{default_data_dir}/active` | Base path for active RocksDB stores (ledger, txhash, events). | + +### CLI Flags + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--mode` | string | `streaming` | `streaming` or `full-history-backfill`. | +| `--config` | string | **required** | Path to TOML config file. 
| + +--- + +## Active Store Architecture + +Streaming maintains three active stores for the current ingestion position: + +| Store | Path | Key | Value | Transition cadence | +|---|---|---|---|---| +| Ledger | `{active_storage.path}/ledger-store-chunk-{chunkID:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | +| TxHash | `{active_storage.path}/txhash-store-index-{indexID:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every 10_000_000 ledgers (index) | +| Events | In-memory hot segment + WAL-backed deltas | Sequential event ID | Event XDR + metadata | Every 10_000 ledgers (chunk) | + +- **Ledger store**: default CF only. One RocksDB instance per chunk. WAL required. +- **TxHash store**: 16 column families (`cf-0` through `cf-f`), routed by `txhash[0] >> 4`. One RocksDB instance per index (spans 1_000 chunks). WAL required. +- **Events hot segment**: in-memory roaring bitmaps + WAL-backed per-ledger deltas. ~370 MB per segment (measured). See [getEvents full-history design](../../design-docs/getevents-full-history-design.md) for format details. 
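
The key encodings and directory names in the table above can be sketched in a few lines. This is a minimal illustration, not the service's implementation: the helper names (`ledger_key`, `txhash_cf_name`, `ledger_store_dir`, `txhash_store_dir`) are hypothetical, and only the formats themselves (`uint32BE(ledgerSeq)`, routing by `txhash[0] >> 4`, `{chunkID:08d}` / `{indexID:08d}` padding) come from this document.

```python
import struct


def ledger_key(ledger_seq: int) -> bytes:
    """uint32BE(ledgerSeq): 4-byte big-endian key, so byte order matches numeric order."""
    return struct.pack(">I", ledger_seq)


def txhash_cf_name(txhash: bytes) -> str:
    """Route a 32-byte transaction hash to one of 16 column families by its high nibble."""
    if len(txhash) != 32:
        raise ValueError("txhash must be 32 bytes")
    nibble = txhash[0] >> 4  # 0..15
    return f"cf-{nibble:x}"  # cf-0 .. cf-f


def ledger_store_dir(base: str, chunk_id: int) -> str:
    """One ledger RocksDB directory per chunk, zero-padded to 8 digits."""
    return f"{base}/ledger-store-chunk-{chunk_id:08d}/"


def txhash_store_dir(base: str, index_id: int) -> str:
    """One txhash RocksDB directory per index, zero-padded to 8 digits."""
    return f"{base}/txhash-store-index-{index_id:08d}/"
```

Big-endian keys mean a range scan over the ledger store returns ledgers in sequence order, which is what the LFS transition relies on when it reads a chunk from first to last ledger.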
+ +### Store Pre-creation + +- Stores for the next chunk/index are pre-created before the boundary is reached +- At transition time, only internal pointers change (active → transitioning, pre-created → active) +- Pre-created stores eliminate creation failures at boundary time +- On restart, pre-created stores are expected to exist — not treated as orphans + +### Max Concurrent Stores + +| Sub-flow | Max active | Max transitioning | Max total | +|---|---|---|---| +| Ledger | 1 | 1 | 2 | +| Events | 1 (hot) | 1 (freezing) | 2 | +| TxHash | 1 | 1 | 2 | + +--- + +## Startup Sequence + +- The startup sequence is the same code path for first start in streaming mode (after backfill) and regular restarts +- The presence or absence of `streaming:last_committed_ledger` in the meta store distinguishes the two cases + +### Startup Flow + +```python +def start_streaming(config): + meta_store = open_meta_store(config) + + # ── 1. Validate immutable config ── + validate_config(config, meta_store) + + # ── 2. Validate backfill completeness ── + head_chunk, durable_tail = validate_chunk_coverage(config, meta_store) + + # ── 3. First-start txhash store initialization ── + # Load backfill .bin files into RocksDB, delete .bin files + txhash flags. + # On restart (streaming:last_committed_ledger present), usually a no-op. + load_backfill_bins(config, meta_store) + + # ── 4. Reconcile orphaned transitions ── + # Complete any in-flight transitions from a previous crash. + reconcile_orphaned_transitions(config, meta_store) + + # ── 5. Replay missed boundary handling ── + # If checkpoint is at a chunk/index boundary but transitions never fired. + replay_missed_boundaries(config, meta_store) + + # ── 6. Detect BUILD_READY indexes, spawn background RecSplit ── + spawn_pending_recsplit_builds(config, meta_store) + + # ── 7. 
Determine resume ledger ── + last_committed = meta_store.get("streaming:last_committed_ledger") + if last_committed is None: + # First start in streaming mode: set checkpoint to backfill's last ledger + last_committed = chunk_last_ledger(durable_tail) + meta_store.put("streaming:last_committed_ledger", last_committed) + resume_ledger = last_committed + 1 + + # ── 8. Open/create active stores for the resume position ── + active_stores = open_active_stores(config, meta_store, resume_ledger) + + # ── 9. Start CaptiveStellarCore ── + # CaptiveStellarCore takes ~4-5 minutes to spin up to the target ledger. + # Steps 1-8 also take minutes (WAL replay, .bin loading) and complete + # before CaptiveStellarCore is needed — no parallelism required. + core = start_captive_core(config, resume_ledger) + + # ── 10. Begin ingestion loop ── + run_ingestion_loop(core, active_stores, meta_store) +``` + +### Step 1: Validate Immutable Config + +```python +def validate_config(config, meta_store): + stored_cpi = meta_store.get("config:chunks_per_txhash_index") + if stored_cpi is not None and stored_cpi != config.chunks_per_txhash_index: + fatal(f"chunks_per_txhash_index changed: {stored_cpi} -> {config.chunks_per_txhash_index}") + if stored_cpi is None: + # First ever run writes the value. Backfill writes this on first run; + # if streaming runs first (no prior backfill), streaming writes it. 
+ meta_store.put("config:chunks_per_txhash_index", config.chunks_per_txhash_index) +``` + +### Step 2: Validate Chunk Coverage + +```python +def validate_chunk_coverage(config, meta_store): + cpi = config.chunks_per_txhash_index + + lfs_set = meta_store.scan_keys_with_suffix(":lfs") # set of chunk IDs + events_set = meta_store.scan_keys_with_suffix(":events") + txhash_set = meta_store.scan_keys_with_suffix(":txhash") # backfill-only flags + + if not lfs_set: + fatal("no chunk data found — run backfill first") + + head_chunk = min(lfs_set | events_set) + head_index = head_chunk // cpi + + # Head must be index-aligned + if head_chunk != head_index * cpi: + fatal(f"partial index at head: chunk {head_chunk} is not first chunk " + f"of index {head_index} (expected {head_index * cpi}). " + f"Run backfill to complete index {head_index}.") + + # Find contiguous tails + durable_tail_lfs = contiguous_tail(lfs_set, head_chunk) + durable_tail_events = contiguous_tail(events_set, head_chunk) + durable_tail = min(durable_tail_lfs, durable_tail_events) + + # Verify complete indexes have index:{N}:txhash keys + for index_id in range(head_index, durable_tail // cpi + 1): + idx_last = (index_id + 1) * cpi - 1 + if idx_last > durable_tail: + break # partial index at tail — streaming will complete + if not meta_store.has(f"index:{index_id:08d}:txhash"): + # All chunks present but RecSplit not built — BUILD_READY + # (handled in step 6, not an error) + pass + + # First start in streaming mode only: verify backfill chunks have all three flags + if not meta_store.has("streaming:last_committed_ledger"): + for chunk_id in range(head_chunk, durable_tail + 1): + index_id = chunk_id // cpi + if meta_store.has(f"index:{index_id:08d}:txhash"): + continue # complete index — txhash flags already cleaned up + if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): + fatal(f"first start in streaming mode: chunk {chunk_id} missing txhash flag " + f"in incomplete index {index_id} — run backfill to 
complete") + + return head_chunk, durable_tail +``` + +**Example — first start in streaming mode after backfill:** +``` +Backfill ran: --start-ledger 20_000_002 --end-ledger 56_340_001 +Expanded to chunks 2000-5633. chunks_per_txhash_index = 1000. + +Meta store: + chunk:00002000:lfs through chunk:00005633:lfs = "1" (3634 chunks) + chunk:00002000:txhash through chunk:00005633:txhash = "1" + chunk:00002000:events through chunk:00005633:events = "1" + index:00000002:txhash = "1" (chunks 2000-2999) + index:00000003:txhash = "1" (chunks 3000-3999) + index:00000004:txhash = "1" (chunks 4000-4999) + index:00000005:txhash = absent (chunks 5000-5633 done, 5634-5999 missing) + +Validation: + head_chunk = 2000 → 2000 % 1000 == 0 → index-aligned + durable_tail = 5633 + Indexes 2,3,4 → COMPLETE + Index 5: partial (634 of 1000 chunks) + First start in streaming mode: all backfill chunks 2000-5633 have txhash flags → valid + +Result: head_chunk=2000, durable_tail=5633 +``` + +### Step 3: First-Boot TxHash Store Initialization + +- On first start in streaming mode, backfill's `.bin` files contain txhash data that must be loaded into the RocksDB txhash store — needed for query serving and as the single input source for RecSplit +- This step runs on every startup (not just first start in streaming mode) for robustness — on restarts, the loop finds no `txhash` flags and is a no-op + +```python +def load_backfill_bins(config, meta_store): + cpi = config.chunks_per_txhash_index + + # 1. Clean up complete indexes with leftover .bin files + # (backfill crashed after RecSplit but before cleanup_txhash) + for index_id in all_indexes_with_txhash_key(meta_store): + for chunk_id in range(index_id * cpi, (index_id + 1) * cpi): + if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") + delete_if_exists(raw_txhash_path(chunk_id)) + + # 2. 
Load .bin files for current incomplete index into RocksDB txhash store + # IMPORTANT: open existing store (WAL recovery), do NOT delete-and-recreate — + # previously loaded chunks' .bin files are already deleted + txhash_store = open_or_create_txhash_store(config, current_incomplete_index(meta_store)) + for chunk_id in chunks_for_current_incomplete_index(meta_store, cpi): + if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): + continue # already loaded (flag deleted), or streaming chunk (no .bin) + bin_path = raw_txhash_path(chunk_id) + if os.path.exists(bin_path): + load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first + delete_if_exists(bin_path) # delete .bin second + + # 3. Sweep orphaned .bin files (flag gone, file lingering from prior crash) + for bin_file in scan_bin_files_for_index(current_incomplete_index(meta_store)): + chunk_id = parse_chunk_id(bin_file) + if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): + os.remove(bin_file) +``` + +**Crash safety for first-start loading:** +- If crash during loading: `streaming:last_committed_ledger` is still absent → next startup redoes the first-start sequence +- Already-loaded chunks: `.bin` deleted, flag deleted, data in RocksDB via WAL recovery +- Not-yet-loaded chunks: `.bin` and flag still present → loop picks up where it left off +- The txhash store MUST be opened (WAL recovery), never deleted-and-recreated — already-loaded chunks' `.bin` files are gone and cannot be re-read + +### Step 4: Reconcile Orphaned Transitions + +```python +def reconcile_orphaned_transitions(config, meta_store): + last_committed = meta_store.get("streaming:last_committed_ledger") + if last_committed is None: + return # first start in streaming mode — no prior streaming state to reconcile + + current_chunk = (last_committed - 2) // 10_000 + current_index = current_chunk // config.chunks_per_txhash_index + + # Orphaned transitioning ledger 
stores + for store_dir in scan_ledger_store_dirs(config): + chunk_id = parse_chunk_id_from_dir(store_dir) + if chunk_id == current_chunk: + continue # active store — keep (WAL recovery) + if chunk_id == current_chunk + 1: + continue # pre-created store — keep + if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): + delete_dir(store_dir) # flag set, cleanup didn't finish → delete + elif chunk_id < current_chunk: + # transitioning store, flush didn't complete → complete it + complete_lfs_flush(store_dir, chunk_id, meta_store) + else: + delete_dir(store_dir) # orphaned future store → delete + + # Orphaned transitioning txhash stores + for store_dir in scan_txhash_store_dirs(config): + index_id = parse_index_id_from_dir(store_dir) + if index_id == current_index: + continue # active store — keep + if meta_store.has(f"index:{index_id:08d}:txhash"): + delete_dir(store_dir) # complete, cleanup didn't finish → delete + # BUILD_READY handled in step 6 +``` + +### Step 5: Replay Missed Boundary Handling + +If `streaming:last_committed_ledger` is exactly at a chunk or index boundary but the boundary transitions never fired (crash between checkpoint and transitions): + +```python +def replay_missed_boundaries(config, meta_store): + last_committed = meta_store.get("streaming:last_committed_ledger") + if last_committed is None: + return + + current_chunk = (last_committed - 2) // 10_000 + cpi = config.chunks_per_txhash_index + + # Check if checkpoint is at a chunk boundary + if last_committed == chunk_last_ledger(current_chunk): + # Chunk boundary was reached but transitions may not have fired + if not meta_store.has(f"chunk:{current_chunk:08d}:lfs"): + trigger_lfs_flush(current_chunk, config, meta_store) + if not meta_store.has(f"chunk:{current_chunk:08d}:events"): + trigger_events_freeze(current_chunk, config, meta_store) + + # Check if checkpoint is also at an index boundary + current_index = current_chunk // cpi + if current_chunk == (current_index + 1) * cpi - 1: + # Index 
boundary — handled by step 6 (BUILD_READY detection) + pass +``` + +**Example — crash between checkpoint and chunk boundary transitions:** +``` +streaming:last_committed_ledger = 56_370_001 (= chunk_last_ledger(5636)) +chunk:00005636:lfs = absent (swap never happened, flush never spawned) +chunk:00005636:events = absent (freeze never started) +Active ledger store: ledger-store-chunk-00005636/ (has all 10_000 ledgers via WAL) + +Detection: 56_370_001 == chunk_last_ledger(5636) AND flags absent +Action: trigger LFS flush for chunk 5636, trigger events freeze for chunk 5636 +``` + +--- + +## Ingestion Loop + +### Per-Ledger Processing + +```python +def run_ingestion_loop(core, active_stores, meta_store): + for lcm in core.stream_ledgers(): + ledger_seq = lcm.ledger_sequence + process_ledger(ledger_seq, lcm, active_stores, meta_store) + +def process_ledger(ledger_seq, lcm, active_stores, meta_store): + # 1. Write to all three stores in parallel goroutines. + # Each store's write is atomic (WriteBatch + WAL for RocksDB, + # atomic commit for events hot segment). + run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm) + run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm) + run_in_background(write_events_hot_segment, active_stores.events, ledger_seq, lcm) + wait_for_all() # all three must succeed + + # 2. Set per-ledger checkpoint AFTER all writes succeed. + # INVARIANT: checkpoint is written ONLY after all three stores + # have durably committed the ledger data (WAL flush). + meta_store.put("streaming:last_committed_ledger", ledger_seq) + + # 3. If chunk boundary: trigger sub-flow transitions. + # Transitions happen AFTER checkpoint — a crash between checkpoint + # and transitions is detected on startup (see replay_missed_boundaries). + current_chunk = (ledger_seq - 2) // 10_000 + if ledger_seq == chunk_last_ledger(current_chunk): + on_chunk_boundary(current_chunk, active_stores, meta_store) + + # 4. 
If index boundary: trigger index-level transitions. + # The index boundary ledger is always also a chunk boundary ledger. + current_index = current_chunk // chunks_per_txhash_index + if ledger_seq == index_last_ledger(current_index): + on_index_boundary(current_index, active_stores, meta_store) +``` + +### Per-Store Write Details + +**Ledger store write:** +```python +def write_ledger_store(store, ledger_seq, lcm): + key = uint32_big_endian(ledger_seq) + value = zstd_compress(lcm.to_bytes()) + store.put(key, value) # WriteBatch + WAL +``` + +**TxHash store write:** +```python +def write_txhash_store(store, ledger_seq, lcm): + batch = WriteBatch() + for tx in lcm.transactions: + cf_name = f"cf-{tx.hash[0] >> 4:x}" # route by first nibble + batch.put_cf(cf_name, tx.hash, uint32_big_endian(ledger_seq)) + store.write(batch) # single WriteBatch across all CFs + WAL +``` + +**Events hot segment write:** +```python +def write_events_hot_segment(hot_segment, ledger_seq, lcm): + events = extract_contract_and_system_events(lcm) # excludes diagnostic events + for event in events: + event_id = hot_segment.next_event_id() + hot_segment.store_event(event_id, event) # persist event data + for term_key in index_terms(event): # contractId + topic0-3 + hot_segment.add_to_bitmap(term_key, event_id) # in-memory bitmap update + hot_segment.persist_deltas(ledger_seq) # WAL-backed per-ledger delta + hot_segment.update_offset_array(ledger_seq) # cumulative event count + hot_segment.commit(ledger_seq) # atomic commit +``` + +--- + +## Sub-flow Transitions + +Three independent sub-flows, each with its own goroutine, flag, and cleanup step. No combined transitions — each sub-flow waits for its own predecessor only. 
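
The "own predecessor only" rule can be sketched with one small tracker per sub-flow. This is an illustrative model using Python threads in place of goroutines; `SubFlow`, `fake_transition`, and the `completed` list are hypothetical names, not part of the design. It assumes, as the document states, that a single ingestion thread is the only caller of `start`.

```python
import threading


class SubFlow:
    """Tracks at most one in-flight background transition for a single sub-flow."""

    def __init__(self, name: str):
        self.name = name
        self._in_flight = None  # thread running the current transition, if any

    def wait_for_predecessor(self) -> None:
        # Block only on this sub-flow's previous transition, never on another sub-flow's.
        if self._in_flight is not None:
            self._in_flight.join()
            self._in_flight = None

    def start(self, transition, *args) -> None:
        # Max-1-transitioning: the previous transition must finish before the next starts.
        self.wait_for_predecessor()
        self._in_flight = threading.Thread(target=transition, args=args, name=self.name)
        self._in_flight.start()


completed = []
completed_lock = threading.Lock()


def fake_transition(sub_flow_name: str, chunk_id: int) -> None:
    # Stand-in for lfs_transition / events_transition: just record completion.
    with completed_lock:
        completed.append((sub_flow_name, chunk_id))


lfs = SubFlow("lfs")
events = SubFlow("events")
for chunk_id in (5636, 5637):
    lfs.start(fake_transition, "lfs", chunk_id)        # waits only on the previous LFS flush
    events.start(fake_transition, "events", chunk_id)  # waits only on the previous events freeze
lfs.wait_for_predecessor()
events.wait_for_predecessor()
```

Because each tracker joins only its own thread, a slow events freeze never delays the next LFS flush, while transitions within one sub-flow still complete in chunk order.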
+ +### Chunk Boundary (every 10_000 ledgers) + +Triggered when `ledger_seq == chunk_last_ledger(current_chunk)`: + +```python +def on_chunk_boundary(chunk_id, active_stores, meta_store): + # ── LFS sub-flow ── + # Wait for OWN predecessor only (max 1 transitioning ledger store) + wait_for_lfs_complete() + transitioning_ledger = active_stores.ledger + active_stores.ledger = open_precreated_ledger_store(chunk_id + 1) + run_in_background(lfs_transition, chunk_id, transitioning_ledger, meta_store) + + # ── Events sub-flow ── + # Wait for OWN predecessor only (max 1 freezing events segment) + wait_for_events_complete() + freezing_segment = active_stores.events + active_stores.events = create_events_hot_segment(chunk_id + 1) + run_in_background(events_transition, chunk_id, freezing_segment, meta_store) +``` + +### LFS Transition (background goroutine) + +```python +def lfs_transition(chunk_id, transitioning_ledger_store, meta_store): + # ── Transition: read from RocksDB, write pack file ── + pack_path = ledger_pack_path(chunk_id) + first_ledger = chunk_first_ledger(chunk_id) + last_ledger = chunk_last_ledger(chunk_id) + + writer = packfile.create(pack_path, overwrite=True) # handles partial files + for seq in range(first_ledger, last_ledger + 1): + lcm_bytes = transitioning_ledger_store.get(uint32_big_endian(seq)) + writer.append(lcm_bytes) + writer.fsync_and_close() + meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") # flag after fsync + + # ── Cleanup (separate step — if crash here, flag is set, retry cleanup) ── + transitioning_ledger_store.close() + delete_dir(ledger_store_path(chunk_id)) + signal_lfs_complete() +``` + +### Events Transition (background goroutine) + +```python +def events_transition(chunk_id, freezing_segment, meta_store): + # ── Transition: freeze hot segment to cold segment ── + events_path = events_segment_path(chunk_id) + + # Write three cold segment files: + # {chunkID:08d}-events.pack — zstd-compressed event blocks + # {chunkID:08d}-index.pack — 
serialized roaring bitmaps + # {chunkID:08d}-index.hash — MPHF for term lookup + write_cold_segment(freezing_segment, events_path) + fsync_all(events_path) + meta_store.put(f"chunk:{chunk_id:08d}:events", "1") # flag after fsync + + # ── Cleanup (separate step) ── + freezing_segment.discard() # delete WAL deltas + in-memory bitmaps + signal_events_complete() +``` + +### Index Boundary (every 10_000_000 ledgers) + +- Triggered when `ledger_seq == index_last_ledger(current_index)` +- The index boundary ledger is always also a chunk boundary ledger — `on_chunk_boundary` fires first, then `on_index_boundary` + +```python +def on_index_boundary(index_id, active_stores, meta_store): + # Wait for ALL chunk-level sub-flows to complete + # (the last chunk's LFS flush and events freeze must finish + # before the txhash store can be promoted) + wait_for_lfs_complete() + wait_for_events_complete() + + # Defense-in-depth: verify all chunk flags for the index + verify_all_chunk_flags(index_id, meta_store) + + # ── TxHash sub-flow ── + transitioning_txhash = active_stores.txhash + active_stores.txhash = open_precreated_txhash_store(index_id + 1) + + run_in_background(recsplit_transition, index_id, transitioning_txhash, meta_store) +``` + +### RecSplit Transition (background goroutine) + +```python +def recsplit_transition(index_id, transitioning_txhash_store, meta_store): + # ── Transition: build RecSplit from RocksDB ── + # RecSplit builder reads from the transitioning txhash store (RocksDB, 16 CFs). + # This is the ONLY input source — both backfill .bin data (loaded at startup) + # and streaming txhash data are in the same RocksDB store. + # + # Contrast with backfill: backfill builds RecSplit from .bin flat files. + # Streaming builds RecSplit from RocksDB. 
+ + idx_path = recsplit_index_path(index_id) + delete_partial_idx_files(idx_path) # clean up any partial files from prior crash + + # Build all 16 CF index files — what happens inside (goroutines, + # parallelism, memory) is up to the task. All-or-nothing. + build_recsplit(transitioning_txhash_store, idx_path) + fsync_all_idx_files(idx_path) + + # Verify: spot-check random ledgers and txhashes against immutable stores + verify_spot_check(index_id, idx_path, meta_store) + + meta_store.put(f"index:{index_id:08d}:txhash", "1") # flag after fsync + verify + + # ── Cleanup (separate step) ── + transitioning_txhash_store.close() + delete_dir(txhash_store_path(index_id)) +``` + +### Transition DAG Structure + +The transitions form a DAG where tasks fire when dependencies are met. Unlike backfill (where the full DAG is known upfront), streaming builds the DAG dynamically as boundaries are hit. + +```mermaid +flowchart TD + CB(["Chunk C boundary"]) --> LFS["lfs_transition(C)"] + CB --> EVT["events_transition(C)"] + LFS --> LFS_CLEAN["cleanup: delete ledger store"] + EVT --> EVT_CLEAN["cleanup: delete events WAL"] + + LFS_CLEAN --> NEXT_CB(["Chunk C+1 boundary"]) + EVT_CLEAN --> NEXT_CB + + IB(["Index N boundary\n= last chunk boundary"]) --> WAIT["wait for ALL\nchunk transitions"] + WAIT --> RS["recsplit_transition(N)"] + RS --> RS_CLEAN["cleanup: delete txhash store"] +``` + +- Each sub-flow waits for its own predecessor (not all sub-flows) at chunk boundaries +- At the index boundary, ALL sub-flows must complete before the txhash transition starts +- Cleanup is always a separate step after the flag is set — if crash after flag, cleanup is retried on restart + +--- + +## Crash Recovery + +### Invariants + +Six invariants handle all crash recovery scenarios. No special-case logic exists outside these invariants. + +1. 
**Flag-after-fsync** — `chunk:{C:08d}:lfs`, `chunk:{C:08d}:events`, and `index:{N:08d}:txhash` are set in the meta store only after the corresponding file(s) are fsynced. A crash before the flag is set means the output is treated as absent — the transition is retried from scratch. + +2. **Idempotent writes** — every write to every store produces the same key-value pair for the same input ledger. Re-processing a ledger after crash is always safe. RocksDB WriteBatch guarantees atomicity per-store. + +3. **Per-ledger checkpoint** — `streaming:last_committed_ledger` is written only after all three stores have durably committed the ledger data. On crash, resume from `last_committed_ledger + 1`. The events system truncates hot segment data beyond the checkpoint on startup, preventing duplicate event IDs. + +4. **No separate recovery phase** — on startup, reconciliation derives the state of every sub-flow from meta store keys and on-disk artifacts, then completes or discards incomplete work. The same startup code handles first start in streaming mode, regular restarts, and crash recovery — no mode-specific recovery paths. + +5. **Max-1-transitioning per sub-flow** — at most one transitioning store per sub-flow at any time. The previous transition must complete (flag set + cleanup) before the next transition starts. This constraint applies to both steady-state operation and crash recovery — recovery respects the same ordering as steady state. + +6. **DAG-structured cleanup** — cleanup (deleting transitioning stores, discarding WAL data) is a separate step that runs after the flag is set. If crash between flag and cleanup: flag is durable, artifact is on disk, startup detects this and retries just the cleanup. The transition is never re-run. 
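
Invariants 1 and 6 together reduce per-transition recovery to a two-bit decision, which the following sketch illustrates. It is a simplified model under stated assumptions: a plain `dict` stands in for the meta store, `example.pack` is a made-up file name, and `write_then_flag` / `recovery_action` are hypothetical helpers. A real implementation would also need the meta store write itself to be durable.

```python
import os
import tempfile


def write_then_flag(path: str, payload: bytes, meta: dict, flag_key: str) -> None:
    """Write the artifact, fsync it, and only then record the completion flag."""
    with open(path, "wb") as f:  # "wb" truncates any partial file left by a prior crash
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    meta[flag_key] = "1"  # reached only if the write and fsync did not raise


def recovery_action(flag_set: bool, transitioning_store_exists: bool) -> str:
    """Decide what startup does for one transition, from the flag and on-disk state."""
    if flag_set:
        # Output is durable: never re-run the transition, at most finish cleanup.
        return "retry_cleanup" if transitioning_store_exists else "nothing_to_do"
    # No flag: any output is treated as absent, so the transition is redone.
    return "rerun_transition" if transitioning_store_exists else "nothing_to_do"


meta = {}
with tempfile.TemporaryDirectory() as tmp:
    pack_path = os.path.join(tmp, "example.pack")
    write_then_flag(pack_path, b"ledger bytes", meta, "chunk:00005636:lfs")
    with open(pack_path, "rb") as f:
        written = f.read()
```

The ordering is the whole point: a crash at any line of `write_then_flag` before the last one leaves the flag absent, so startup sees "transitioning store exists, no flag" and reruns the transition from scratch.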
+ +### Startup Validation Guards + +In addition to the crash recovery invariants, four validation rules prevent starting in an invalid state: + +- **`chunks_per_txhash_index` immutable** — fatal if changed after first run +- **Head index-aligned** — fatal if the lowest chunk is not the first chunk of its index +- **Contiguous flags** — fatal if any gap exists in `lfs` or `events` flags within backfill's range +- **Backfill completeness (first start in streaming mode)** — fatal if any backfill chunk is missing a `txhash` flag in an incomplete index + +### How Invariants Resolve the Hardest Scenarios + +**Compound recovery — orphaned transition from chunk C-1 + missed boundary for chunk C:** +``` +State: streaming:last_committed_ledger = 56_380_001 (= chunk_last_ledger(5637)) + chunk:00005636:events = absent (freeze crashed) + chunk:00005637:lfs = absent (transitions never started) + chunk:00005637:events = absent + Events WAL has data for chunks 5636 + 5637 + +Invariants applied: + - Flag-after-fsync (#1): chunk 5636 events flag absent → freeze is retried + - No separate recovery (#4): startup detects absent flags, triggers transitions + - Max-1-transitioning (#5): 5636 freeze completes BEFORE 5637 freeze starts + - Per-ledger checkpoint (#3): resume from 56_380_002 after transitions complete +``` + +**Checkpoint-boundary gap — crash between checkpoint and boundary transitions:** +``` +State: streaming:last_committed_ledger = 56_370_001 (= chunk_last_ledger(5636)) + chunk:00005636:lfs = absent (swap never happened) + chunk:00005636:events = absent (freeze never started) + +Invariants applied: + - Per-ledger checkpoint (#3): checkpoint is at the boundary, data is in WAL + - No separate recovery (#4): startup detects checkpoint at boundary + absent flags → + triggers LFS flush and events freeze before resuming + - Idempotent writes (#2): if any partial data exists from a half-started transition, + re-writing with overwrite=True produces identical results +``` + 
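
The detection in the checkpoint-boundary scenario above follows directly from the boundary formulas. A minimal sketch, assuming a `set` of flag keys stands in for the meta store; `chunk_id_of` and `missed_chunk_transitions` are hypothetical names, while the arithmetic is the document's own.

```python
CHUNK_SIZE = 10_000  # ledgers per chunk


def chunk_id_of(ledger_seq: int) -> int:
    return (ledger_seq - 2) // CHUNK_SIZE


def chunk_first_ledger(chunk_id: int) -> int:
    return chunk_id * CHUNK_SIZE + 2


def chunk_last_ledger(chunk_id: int) -> int:
    return (chunk_id + 1) * CHUNK_SIZE + 1


def missed_chunk_transitions(last_committed: int, flags: set) -> list:
    """Return the flag keys whose transitions must be replayed before ingestion resumes."""
    chunk_id = chunk_id_of(last_committed)
    if last_committed != chunk_last_ledger(chunk_id):
        return []  # checkpoint is mid-chunk: no boundary was reached
    wanted = [f"chunk:{chunk_id:08d}:lfs", f"chunk:{chunk_id:08d}:events"]
    return [key for key in wanted if key not in flags]
```

For the state shown above, `missed_chunk_transitions(56_370_001, set())` returns both keys for chunk 5636, and a checkpoint one ledger earlier returns an empty list because no boundary was crossed.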
+**First-start .bin loading crash — partial load with some .bin files already deleted:** +``` +State: chunks 5000-5399: .bin deleted, txhash flag deleted, data in RocksDB (WAL) + chunks 5400-5633: .bin present, txhash flag present, not yet loaded + streaming:last_committed_ledger = absent + +Invariants applied: + - Per-ledger checkpoint (#3): absence of checkpoint signals "first start in streaming mode" → redo sequence + - Idempotent writes (#2): WAL-recovered data for 5000-5399 survives; loop loads 5400-5633 + - Flag-after-fsync (#1): txhash flags track which .bin files are pending (flag deleted = loaded) + - CRITICAL: must open existing txhash store (WAL recovery), not delete-and-recreate — + chunks 5000-5399 data would be lost since .bin files are gone +``` + +### Recovery Decision Tree + +```mermaid +flowchart TD + START["Open meta store"] --> HAS_LCL{"streaming:last_committed_ledger\npresent?"} + HAS_LCL -->|no| FIRST["First start:\nvalidate backfill flags\nload .bin files\nset checkpoint"] + HAS_LCL -->|yes| RECONCILE["Reconcile:\ncomplete orphaned transitions\ndelete orphaned artifacts"] + FIRST --> RECONCILE + RECONCILE --> BOUNDARY{"Checkpoint at\nchunk/index boundary\nwith absent flags?"} + BOUNDARY -->|yes| REPLAY["Replay missed transitions"] + BOUNDARY -->|no| BUILD + REPLAY --> BUILD{"Any BUILD_READY\nindexes?"} + BUILD -->|yes| SPAWN["Spawn background\nRecSplit builds"] + BUILD -->|no| RESUME + SPAWN --> RESUME["Open active stores\nStart CaptiveStellarCore\nBegin ingestion"] +``` + +--- + +## Meta Store Keys + +### Keys Introduced by Streaming + +| Key | Value | Written When | +|---|---|---| +| `streaming:last_committed_ledger` | `uint32` | After every successfully committed ledger | +| `config:chunks_per_txhash_index` | `uint32` | On first run (backfill or streaming) — immutable | + +### Keys Shared with Backfill + +| Key | Written By | Notes | +|---|---|---| +| `chunk:{C:08d}:lfs` | Both | Backfill: after pack file fsync in `process_chunk`. 
Streaming: after LFS flush goroutine fsync. | +| `chunk:{C:08d}:events` | Both | Backfill: after cold segment fsync in `process_chunk`. Streaming: after events freeze goroutine fsync. | +| `chunk:{C:08d}:txhash` | Backfill only | After `.bin` file fsync. Streaming does NOT write `.bin` files — txhash data goes to RocksDB. These flags are deleted during first-start .bin loading (step 3). | +| `index:{N:08d}:txhash` | Both | After all 16 RecSplit CF `.idx` files built + fsynced. Backfill: from `.bin` files. Streaming: from RocksDB txhash store. | + +### Key Lifecycle in Streaming + +``` +Per ledger: + streaming:last_committed_ledger = ledger_seq (after all 3 stores commit) + +Per chunk (background, after chunk boundary): + chunk:{C:08d}:lfs = "1" (after pack file fsync) + chunk:{C:08d}:events = "1" (after cold segment fsync) + +Per index (background, after index boundary): + index:{N:08d}:txhash = "1" (after all 16 .idx files fsync + verify) + +First-start only (step 3): + chunk:{C:08d}:txhash → DELETED (after .bin loaded into RocksDB) + .bin files → DELETED (after flag deleted) +``` + +--- + +## Backpressure and Drift Detection + +- Streaming ingests ledgers at the Stellar network's production rate (~1 ledger every 6 seconds) +- All transitions (LFS flush, events freeze, RecSplit build) run in background goroutines — the ingestion loop should never stall +- If ingestion falls behind, the cause is typically disk I/O saturation or RocksDB compaction stalls + +### Drift Metric + +```python +drift_ledgers = network_tip_ledger - last_committed_ledger +``` + +- `network_tip_ledger` obtained from CaptiveStellarCore metadata +- Exposed as a Prometheus gauge: `streaming_drift_ledgers` +- Threshold: **10 ledgers** (~60 seconds). More than 10 ledgers behind means something is wrong — transitions are background and should not cause drift. 
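As a rough illustration of the drift metric and threshold above, the following sketch computes the drift, an approximate lag in seconds, and a health verdict. The function name `drift_status`, the returned dictionary shape, and the constant `LEDGER_CLOSE_SECONDS` are assumptions made for illustration; only the subtraction, the ~6 second close time, and the default threshold of 10 ledgers come from this document.

```python
LEDGER_CLOSE_SECONDS = 6  # approximate ledger close time quoted in this doc


def drift_status(network_tip_ledger: int,
                 last_committed_ledger: int,
                 drift_warning_ledgers: int = 10) -> dict:
    """Hypothetical helper: compute drift and whether it is within the threshold."""
    drift = network_tip_ledger - last_committed_ledger
    return {
        "drift_ledgers": drift,
        "approx_lag_seconds": drift * LEDGER_CLOSE_SECONDS,
        # Unhealthy only when drift is strictly greater than the threshold.
        "healthy": drift <= drift_warning_ledgers,
    }


status = drift_status(network_tip_ledger=56_380_011,
                      last_committed_ledger=56_380_001)
print(status)
# → {'drift_ledgers': 10, 'approx_lag_seconds': 60, 'healthy': True}
```

At exactly 10 ledgers behind the node still counts as healthy; an 11th ledger of drift tips it over the threshold.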
+ +### Health Endpoint + +- `getHealth` returns unhealthy when `drift_ledgers > drift_warning_ledgers` (default 10) +- Kubernetes readiness probes use `getHealth` to remove the node from the service pool +- No automatic response (no pause, no abort) — the operator investigates and acts + +--- + +## Error Handling + +| Error | Action | +|---|---| +| CaptiveStellarCore unavailable | RETRY with backoff; ABORT after N retries | +| Ledger store write failure | ABORT — disk full or storage corruption | +| TxHash store write failure | ABORT — disk full or storage corruption | +| Events hot segment write failure | ABORT — disk full or storage corruption | +| Meta store write failure | ABORT — cannot maintain checkpoint | +| LFS flush failure (pack file write/fsync) | Do NOT set `chunk:{C:08d}:lfs`; ABORT transition goroutine; restart re-triggers flush | +| Events freeze failure (cold segment write/fsync) | Do NOT set `chunk:{C:08d}:events`; ABORT transition goroutine; restart re-triggers freeze | +| RecSplit build failure | Do NOT set `index:{N:08d}:txhash`; ABORT transition goroutine; restart deletes partials and rebuilds | +| RecSplit verification mismatch | ABORT; do NOT delete transitioning txhash store; log error; operator investigates | +| Startup: `chunks_per_txhash_index` changed | FATAL — cannot change after first run | +| Startup: head not index-aligned | FATAL — run backfill to complete the head index | +| Startup: gap in chunk flags | FATAL — run backfill to fill the gap | +| Startup: backfill chunk missing txhash flag (first start in streaming mode) | FATAL — run backfill to complete | + +--- + +## Query Routing + +Query routing during streaming transitions is covered in a separate design document. 
For reference, the routing summary: + +| Query | Active phase | Transitioning phase | Complete phase | +|---|---|---|---| +| `getLedger` | Active ledger store (or LFS for already-flushed chunks) | LFS (all ledger stores already flushed) | LFS | +| `getTransaction` | Active txhash store (RocksDB CF lookup) | Transitioning txhash store (still open for reads) | RecSplit index | +| `getEvents` | Events hot segment (in-memory bitmap lookup) | Freezing segment (still serves reads) | Cold segment (MPHF + packfile) | + +--- + +## Related Documents + +- [01-backfill-workflow.md](./01-backfill-workflow.md) — backfill pipeline, DAG tasks, partial index handling +- [getEvents full-history design](../../design-docs/getevents-full-history-design.md) — events hot/cold segments, bitmap indexes, freeze process +- Query routing — separate design document (TBD) diff --git a/full-history/design-docs/README.md b/full-history/design-docs/README.md index 2198233fd..af0bc3e76 100644 --- a/full-history/design-docs/README.md +++ b/full-history/design-docs/README.md @@ -1,26 +1,72 @@ # Stellar Full History RPC Service — Design Docs -> **Scope**: Backfill pipeline only. Streaming pipeline design follows in a separate PR. 
+## Quick Context + +The Stellar Full History RPC Service ingests the complete blockchain history and serves queries: + +- `getLedger` — retrieve any ledger from history +- `getTransaction` — retrieve any transaction from history +- `getEvents` — retrieve events with filter matching from history + +Two mutually exclusive modes: + +- **Backfill** — offline bulk import + - Writes directly to immutable files (LFS pack files + RecSplit indexes + events cold segments) + - No RocksDB active stores, no queries during ingestion + - DAG-scheduled with a flat worker pool + - Exits when done +- **Streaming** (default) — real-time ingestion via CaptiveStellarCore + - Writes to RocksDB active stores + events hot segment + - Serves queries concurrently with ingestion + - Transitions completed data to immutable storage in background + - Long-running daemon +- Backfill typically runs first to populate historical data +- Streaming picks up where backfill left off ## Documents -| Document | Description | -|----------|-------------| -| [03-backfill-workflow.md](./03-backfill-workflow.md) | Complete backfill design — geometry, meta store keys, directory layout, configuration, DAG task graph, execution model, crash recovery, getStatus API | +| Doc | Title | Status | +|-----|-------|--------| +| [events](../../design-docs/getevents-full-history-design.md) | getEvents Full-History Design | Complete | +| [01](./01-backfill-workflow.md) | Backfill Workflow | Complete | +| [02](./02-streaming-workflow.md) | Streaming Workflow | Complete | +| 03 | Query Routing | **Not started** | +| 04 | Operator Guide | **Not started** | -The backfill doc is self-contained. Read it top-to-bottom for the full picture. 
+**What each doc covers:** -## Quick Context +- **Events** — hot/cold segments, roaring bitmap indexes, MPHF, freeze process, query path +- **01 Backfill** — geometry, directory layout, meta store keys, config, DAG tasks, execution model, crash recovery +- **02 Streaming** — startup, ingestion loop, three sub-flow transitions, crash recovery invariants, backfill-to-streaming migration +- **03 Query Routing** — routing `getLedger`/`getTransaction`/`getEvents` to correct store during active/transitioning/complete phases +- **04 Operator Guide** — end-to-end setup, hardware sizing, monitoring, troubleshooting + +**What's folded into existing docs (no separate doc needed):** + +- Architecture overview → split across 01 overview, 02 overview, this README +- Meta store keys → defined inline in 01 and 02 +- Directory structure → inline in 01 +- Configuration → inline in 01 and 02 +- Checkpointing math → inline in 01, referenced by 02 +- Crash recovery → inline in 01 (backfill) and 02 (streaming invariants) -The Stellar Full History RPC Service ingests the complete blockchain history. Primary use cases: +## Reading Order -- Retrieve any ledger from history -- Retrieve any transaction from history -- Retrieve any events with filter matching from history +- Read events doc first — standalone, no prerequisites +- Read 01 (backfill) second — defines all shared concepts: geometry, meta store keys, directory layout, flag-after-fsync invariant +- Read 02 (streaming) third — assumes familiarity with 01 -It has two modes: +## Shared Concepts -- **Backfill** (this PR) — offline bulk import. Writes directly to immutable files (LFS chunks + RecSplit indexes). No RocksDB, no queries during ingestion. DAG-scheduled with a flat worker pool. -- **Streaming** (future PR) — real-time ingestion via CaptiveStellarCore. Writes to RocksDB active stores, serves queries, transitions to immutable storage at index boundaries. 
+Defined in the backfill doc, used by all documents: -These modes are fully independent — separate code, separate crash recovery, separate transition workflows. +- **Chunk** — 10_000 ledgers, atomic unit of ingestion and file I/O +- **Index** — `chunks_per_txhash_index` chunks (default 1_000 = 10_000_000 ledgers), unit of RecSplit build +- **Meta store** — single RocksDB instance, source of truth for crash recovery + - `chunk:{C:08d}:lfs` — ledger pack file complete + - `chunk:{C:08d}:txhash` — raw txhash `.bin` file complete (backfill only) + - `chunk:{C:08d}:events` — events cold segment complete + - `index:{N:08d}:txhash` — RecSplit index complete + - `streaming:last_committed_ledger` — per-ledger checkpoint (streaming only) + - `config:chunks_per_txhash_index` — immutable after first run +- **Flag-after-fsync** — meta store flags set only after durable file writes, core crash recovery invariant for both modes From 40669f6606cedce7db2722d6e171a2104a36778e Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 15 Apr 2026 17:25:47 -0700 Subject: [PATCH 02/34] Refine streaming design doc: code over mermaid, tighten language - Replace DAG mermaid diagram with pseudocode dependency comments - Convert remaining prose paragraphs to bullet lists - Restructure crash recovery invariants as bullet sub-lists - Fix section heading from "First-Boot TxHash Store" to "Load Backfill TxHash Data into RocksDB" - Replace all "first boot" with "first start in streaming mode" - Add inline comments explaining boundary math (subtract 2, chunk_last_ledger formula) - Add concise main flow (run_streaming) near top of doc - Add dynamic vs static DAG explanation - Call out .bin loading as one-time cost --- .../design-docs/02-streaming-workflow.md | 235 +++++++++++------- 1 file changed, 149 insertions(+), 86 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index cc26e424c..a22d6fb36 100644 --- 
a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -26,17 +26,44 @@ | Queries | Not served (`getHealth`/`getStatus` only) | All endpoints available | | Crash recovery | Re-run from first incomplete chunk | Resume from `streaming:last_committed_ledger + 1` | | Process lifecycle | Exits when done | Long-running daemon | +| Transition workflow | Direct-write to immutable files (no active store to tear down) | Active RocksDB → immutable (LFS + RecSplit) via background goroutines | +| Dependency tracking | Static DAG — all tasks known upfront, dispatched as dependencies resolve | Dynamic DAG — tasks created at chunk/index boundaries, dependencies enforced by per-sub-flow sync points | + +**Why streaming uses a dynamic DAG:** +- Backfill knows the full ledger range at startup — all `process_chunk`, `build_txhash_index`, and `cleanup_txhash` tasks can be registered upfront +- Streaming ingests indefinitely — chunk/index boundaries are discovered as ledgers arrive, so tasks cannot be registered upfront +- The dependency structure is the same (flush/freeze must complete before next boundary, all chunk transitions must complete before RecSplit) — only the registration timing differs +- See [Sub-flow Transitions](#sub-flow-transitions) for the full transition DAG + +### Main Flow + +```python +def run_streaming(config): + + # 1. Startup — validate, reconcile, recover from any prior crash + meta_store, resume_ledger = startup(config) + + # 2. Start CaptiveStellarCore — takes ~4-5 min to reach resume_ledger + core = start_captive_core(config, resume_ledger) + + # 3. Ingest — one ledger at a time, forever + for lcm in core.stream_ledgers(): + process_ledger(lcm, meta_store) +``` + +Each step is detailed in the sections that follow: [Startup Sequence](#startup-sequence), [Ingestion Loop](#ingestion-loop), and [Sub-flow Transitions](#sub-flow-transitions). 
### Boundary Math Reference Boundary formulas are defined in [01-backfill-workflow.md](./01-backfill-workflow.md#geometry). Quick reference: ```python -chunk_id = (ledger_seq - 2) // 10_000 -chunk_first_ledger(C) = (C * 10_000) + 2 -chunk_last_ledger(C) = ((C + 1) * 10_000) + 1 -index_id = chunk_id // chunks_per_txhash_index -index_last_ledger(N) = ((N + 1) * chunks_per_txhash_index * 10_000) + 1 +# Stellar ledgers start at 2 (not 0 or 1), so all formulas subtract 2 to zero-base +chunk_id = (ledger_seq - 2) // 10_000 # e.g. ledger 56_340_001 → chunk 5633 +chunk_first_ledger(C) = (C * 10_000) + 2 # e.g. chunk 5634 → ledger 56_340_002 +chunk_last_ledger(C) = ((C + 1) * 10_000) + 1 # e.g. chunk 5634 → ledger 56_350_001 +index_id = chunk_id // chunks_per_txhash_index # e.g. chunk 5634 → index 5 +index_last_ledger(N) = ((N + 1) * chunks_per_txhash_index * 10_000) + 1 # e.g. index 5 → ledger 60_000_001 ``` --- @@ -45,7 +72,8 @@ index_last_ledger(N) = ((N + 1) * chunks_per_txhash_index * 10_000) + 1 ### Immutable Config (shared with backfill) -Streaming reads the same TOML config as backfill. The following keys are immutable — set on first backfill, fatal error if changed: +- Streaming reads the same TOML config as backfill +- The following keys are immutable — set on first backfill, fatal error if changed: | Key | Stored in Meta Store | Set By | Description | |---|---|---|---| @@ -125,9 +153,10 @@ def start_streaming(config): # ── 2. Validate backfill completeness ── head_chunk, durable_tail = validate_chunk_coverage(config, meta_store) - # ── 3. First-start txhash store initialization ── - # Load backfill .bin files into RocksDB, delete .bin files + txhash flags. - # On restart (streaming:last_committed_ledger present), usually a no-op. + # ── 3. Load backfill txhash data into RocksDB ── + # Backfill wrote txhash data as .bin flat files. Streaming needs the data in RocksDB + # for queries and as the single input source for RecSplit. 
Load .bin → RocksDB, + # then delete .bin files + txhash flags. Only does work on first start. No-op on restarts. load_backfill_bins(config, meta_store) # ── 4. Reconcile orphaned transitions ── @@ -154,8 +183,10 @@ def start_streaming(config): # ── 9. Start CaptiveStellarCore ── # CaptiveStellarCore takes ~4-5 minutes to spin up to the target ledger. - # Steps 1-8 also take minutes (WAL replay, .bin loading) and complete - # before CaptiveStellarCore is needed — no parallelism required. + # Steps 1-8 run sequentially before this point. On the first start in + # streaming mode, step 3 (.bin loading) takes ~4-5 minutes — a one-time + # cost that does not recur on subsequent restarts. On restarts, steps 1-8 + # complete in seconds (no .bin files to load, WAL replay is fast). core = start_captive_core(config, resume_ledger) # ── 10. Begin ingestion loop ── @@ -188,23 +219,25 @@ def validate_chunk_coverage(config, meta_store): if not lfs_set: fatal("no chunk data found — run backfill first") - head_chunk = min(lfs_set | events_set) - head_index = head_chunk // cpi + head_chunk = min(lfs_set | events_set) # lowest chunk with any flag + head_index = head_chunk // cpi # which index that chunk belongs to - # Head must be index-aligned - if head_chunk != head_index * cpi: + # Head must be index-aligned — the first chunk must be the first chunk of its index. + # If not, the head index can never be completed (RecSplit needs all cpi chunks). + if head_chunk != head_index * cpi: # e.g. head_chunk=2300, expected=2000 fatal(f"partial index at head: chunk {head_chunk} is not first chunk " f"of index {head_index} (expected {head_index * cpi}). 
" f"Run backfill to complete index {head_index}.") - # Find contiguous tails - durable_tail_lfs = contiguous_tail(lfs_set, head_chunk) - durable_tail_events = contiguous_tail(events_set, head_chunk) - durable_tail = min(durable_tail_lfs, durable_tail_events) + # Find contiguous tails — walk forward from head_chunk, stop at first gap + durable_tail_lfs = contiguous_tail(lfs_set, head_chunk) # last contiguous lfs chunk + durable_tail_events = contiguous_tail(events_set, head_chunk) # last contiguous events chunk + durable_tail = min(durable_tail_lfs, durable_tail_events) # both must be present - # Verify complete indexes have index:{N}:txhash keys + # Verify complete indexes have index:{N}:txhash keys. + # durable_tail // cpi gives the index containing the tail chunk. for index_id in range(head_index, durable_tail // cpi + 1): - idx_last = (index_id + 1) * cpi - 1 + idx_last = (index_id + 1) * cpi - 1 # last chunk of this index if idx_last > durable_tail: break # partial index at tail — streaming will complete if not meta_store.has(f"index:{index_id:08d}:txhash"): @@ -249,7 +282,7 @@ Validation: Result: head_chunk=2000, durable_tail=5633 ``` -### Step 3: First-Boot TxHash Store Initialization +### Step 3: Load Backfill TxHash Data into RocksDB - On first start in streaming mode, backfill's `.bin` files contain txhash data that must be loaded into the RocksDB txhash store — needed for query serving and as the single input source for RecSplit - This step runs on every startup (not just first start in streaming mode) for robustness — on restarts, the loop finds no `txhash` flags and is a no-op @@ -286,8 +319,8 @@ def load_backfill_bins(config, meta_store): os.remove(bin_file) ``` -**Crash safety for first-start loading:** -- If crash during loading: `streaming:last_committed_ledger` is still absent → next startup redoes the first-start sequence +**Crash safety for .bin loading (step 3):** +- If crash during loading: `streaming:last_committed_ledger` is still absent → 
next startup redoes the sequence - Already-loaded chunks: `.bin` deleted, flag deleted, data in RocksDB via WAL recovery - Not-yet-loaded chunks: `.bin` and flag still present → loop picks up where it left off - The txhash store MUST be opened (WAL recovery), never deleted-and-recreated — already-loaded chunks' `.bin` files are gone and cannot be re-read @@ -300,8 +333,10 @@ def reconcile_orphaned_transitions(config, meta_store): if last_committed is None: return # first start in streaming mode — no prior streaming state to reconcile - current_chunk = (last_committed - 2) // 10_000 - current_index = current_chunk // config.chunks_per_txhash_index + # Derive which chunk and index the checkpoint falls in. + # Subtract 2 because Stellar ledgers start at 2, not 0. + current_chunk = (last_committed - 2) // 10_000 # chunk containing the last committed ledger + current_index = current_chunk // config.chunks_per_txhash_index # index containing that chunk # Orphaned transitioning ledger stores for store_dir in scan_ledger_store_dirs(config): @@ -330,7 +365,7 @@ def reconcile_orphaned_transitions(config, meta_store): ### Step 5: Replay Missed Boundary Handling -If `streaming:last_committed_ledger` is exactly at a chunk or index boundary but the boundary transitions never fired (crash between checkpoint and transitions): +- Handles the case where `streaming:last_committed_ledger` is exactly at a boundary but transitions never fired (crash between checkpoint and transitions) ```python def replay_missed_boundaries(config, meta_store): @@ -338,18 +373,21 @@ def replay_missed_boundaries(config, meta_store): if last_committed is None: return - current_chunk = (last_committed - 2) // 10_000 + current_chunk = (last_committed - 2) // 10_000 # subtract 2: ledgers start at 2 cpi = config.chunks_per_txhash_index - # Check if checkpoint is at a chunk boundary + # Check if checkpoint is exactly at a chunk boundary. 
+ # chunk_last_ledger(C) = ((C + 1) * 10_000) + 1, the last ledger that belongs to chunk C. + # If checkpoint == this value, the chunk is fully ingested but transitions may not have fired. if last_committed == chunk_last_ledger(current_chunk): - # Chunk boundary was reached but transitions may not have fired if not meta_store.has(f"chunk:{current_chunk:08d}:lfs"): trigger_lfs_flush(current_chunk, config, meta_store) if not meta_store.has(f"chunk:{current_chunk:08d}:events"): trigger_events_freeze(current_chunk, config, meta_store) - # Check if checkpoint is also at an index boundary + # Check if checkpoint is also at an index boundary. + # The last chunk of index N is ((N + 1) * cpi) - 1. + # If current_chunk equals this, the index is fully ingested. current_index = current_chunk // cpi if current_chunk == (current_index + 1) * cpi - 1: # Index boundary — handled by step 6 (BUILD_READY detection) @@ -396,13 +434,18 @@ def process_ledger(ledger_seq, lcm, active_stores, meta_store): # 3. If chunk boundary: trigger sub-flow transitions. # Transitions happen AFTER checkpoint — a crash between checkpoint # and transitions is detected on startup (see replay_missed_boundaries). + # Subtract 2 because Stellar ledgers start at 2. current_chunk = (ledger_seq - 2) // 10_000 + # chunk_last_ledger(C) = ((C+1) * 10_000) + 1 — the last ledger in chunk C. + # Equality means this ledger completes the chunk. if ledger_seq == chunk_last_ledger(current_chunk): on_chunk_boundary(current_chunk, active_stores, meta_store) # 4. If index boundary: trigger index-level transitions. - # The index boundary ledger is always also a chunk boundary ledger. + # The index boundary ledger is always also a chunk boundary ledger + # (every index boundary falls on a chunk boundary, but not vice versa). current_index = current_chunk // chunks_per_txhash_index + # index_last_ledger(N) = ((N+1) * cpi * 10_000) + 1 — the last ledger in index N.
if ledger_seq == index_last_ledger(current_index): on_index_boundary(current_index, active_stores, meta_store) ``` @@ -445,7 +488,8 @@ def write_events_hot_segment(hot_segment, ledger_seq, lcm): ## Sub-flow Transitions -Three independent sub-flows, each with its own goroutine, flag, and cleanup step. No combined transitions — each sub-flow waits for its own predecessor only. +- Three independent sub-flows, each with its own goroutine, flag, and cleanup step +- No combined transitions — each sub-flow waits for its own predecessor only ### Chunk Boundary (every 10_000 ledgers) @@ -458,14 +502,14 @@ def on_chunk_boundary(chunk_id, active_stores, meta_store): wait_for_lfs_complete() transitioning_ledger = active_stores.ledger active_stores.ledger = open_precreated_ledger_store(chunk_id + 1) - run_in_background(lfs_transition,chunk_id, transitioning_ledger, meta_store) + run_in_background(lfs_transition, chunk_id, transitioning_ledger, meta_store) # ── Events sub-flow ── # Wait for OWN predecessor only (max 1 freezing events segment) wait_for_events_complete() freezing_segment = active_stores.events active_stores.events = create_events_hot_segment(chunk_id + 1) - run_in_background(events_transition,chunk_id, freezing_segment, meta_store) + run_in_background(events_transition, chunk_id, freezing_segment, meta_store) ``` ### LFS Transition (background goroutine) @@ -530,7 +574,7 @@ def on_index_boundary(index_id, active_stores, meta_store): transitioning_txhash = active_stores.txhash active_stores.txhash = open_precreated_txhash_store(index_id + 1) - run_in_background(recsplit_transition,index_id, transitioning_txhash, meta_store) + run_in_background(recsplit_transition, index_id, transitioning_txhash, meta_store) ``` ### RecSplit Transition (background goroutine) @@ -563,28 +607,21 @@ def recsplit_transition(index_id, transitioning_txhash_store, meta_store): delete_dir(txhash_store_path(index_id)) ``` -### Transition DAG Structure - -The transitions form a DAG where 
tasks fire when dependencies are met. Unlike backfill (where the full DAG is known upfront), streaming builds the DAG dynamically as boundaries are hit. - -```mermaid -flowchart TD - CB(["Chunk C boundary"]) --> LFS["lfs_transition(C)"] - CB --> EVT["events_transition(C)"] - LFS --> LFS_CLEAN["cleanup: delete ledger store"] - EVT --> EVT_CLEAN["cleanup: delete events WAL"] - - LFS_CLEAN --> NEXT_CB(["Chunk C+1 boundary"]) - EVT_CLEAN --> NEXT_CB +### Transition Dependencies - IB(["Index N boundary\n= last chunk boundary"]) --> WAIT["wait for ALL\nchunk transitions"] - WAIT --> RS["recsplit_transition(N)"] - RS --> RS_CLEAN["cleanup: delete txhash store"] +```python +# Per chunk boundary: +# lfs_transition(C) → cleanup: delete ledger store → unblocks lfs_transition(C+1) +# events_transition(C) → cleanup: delete events WAL → unblocks events_transition(C+1) +# +# Per index boundary (= last chunk boundary of that index): +# ALL lfs_transition + events_transition for the index must complete +# → recsplit_transition(N) → cleanup: delete txhash store ``` -- Each sub-flow waits for its own predecessor (not all sub-flows) at chunk boundaries -- At the index boundary, ALL sub-flows must complete before the txhash transition starts -- Cleanup is always a separate step after the flag is set — if crash after flag, cleanup is retried on restart +- Each sub-flow waits for its own predecessor at chunk boundaries +- At index boundary: ALL sub-flows must complete before RecSplit starts +- Cleanup is a separate step after the flag — crash after flag = retry just cleanup on restart --- @@ -592,23 +629,26 @@ flowchart TD ### Invariants -Six invariants handle all crash recovery scenarios. No special-case logic exists outside these invariants. - -1. **Flag-after-fsync** — `chunk:{C:08d}:lfs`, `chunk:{C:08d}:events`, and `index:{N:08d}:txhash` are set in the meta store only after the corresponding file(s) are fsynced. 
A crash before the flag is set means the output is treated as absent — the transition is retried from scratch. - -2. **Idempotent writes** — every write to every store produces the same key-value pair for the same input ledger. Re-processing a ledger after crash is always safe. RocksDB WriteBatch guarantees atomicity per-store. - -3. **Per-ledger checkpoint** — `streaming:last_committed_ledger` is written only after all three stores have durably committed the ledger data. On crash, resume from `last_committed_ledger + 1`. The events system truncates hot segment data beyond the checkpoint on startup, preventing duplicate event IDs. - -4. **No separate recovery phase** — on startup, reconciliation derives the state of every sub-flow from meta store keys and on-disk artifacts, then completes or discards incomplete work. The same startup code handles first start in streaming mode, regular restarts, and crash recovery — no mode-specific recovery paths. - -5. **Max-1-transitioning per sub-flow** — at most one transitioning store per sub-flow at any time. The previous transition must complete (flag set + cleanup) before the next transition starts. This constraint applies to both steady-state operation and crash recovery — recovery respects the same ordering as steady state. - -6. **DAG-structured cleanup** — cleanup (deleting transitioning stores, discarding WAL data) is a separate step that runs after the flag is set. If crash between flag and cleanup: flag is durable, artifact is on disk, startup detects this and retries just the cleanup. The transition is never re-run. +Six invariants handle all crash recovery. No special-case logic outside these. + +1. **Flag-after-fsync** — meta store flags set only after corresponding files are fsynced + - Flag absent = output treated as missing → transition retried from scratch +2. **Idempotent writes** — same input ledger always produces same key-value pairs in all stores + - Re-processing after crash is always safe +3. 
**Per-ledger checkpoint** — `streaming:last_committed_ledger` written only after all three stores durably commit + - On crash: resume from `last_committed_ledger + 1` + - Events system truncates hot segment data beyond checkpoint on startup (prevents duplicate event IDs) +4. **No separate recovery phase** — startup derives state from meta store keys + on-disk artifacts, completes or discards incomplete work + - Same code path for first start, restarts, and crash recovery +5. **Max-1-transitioning per sub-flow** — previous transition must complete before next starts + - Applies to both steady-state and crash recovery +6. **DAG-structured cleanup** — cleanup runs as a separate step after flag is set + - Crash between flag and cleanup: flag is durable, startup retries just the cleanup + - The transition itself is never re-run ### Startup Validation Guards -In addition to the crash recovery invariants, four validation rules prevent starting in an invalid state: +Four validation rules prevent starting in an invalid state: - **`chunks_per_txhash_index` immutable** — fatal if changed after first run - **Head index-aligned** — fatal if the lowest chunk is not the first chunk of its index @@ -646,7 +686,7 @@ Invariants applied: re-writing with overwrite=True produces identical results ``` -**First-start .bin loading crash — partial load with some .bin files already deleted:** +**Crash during .bin loading (step 3) — partial load, some .bin files already deleted:** ``` State: chunks 5000-5399: .bin deleted, txhash flag deleted, data in RocksDB (WAL) chunks 5400-5633: .bin present, txhash flag present, not yet loaded @@ -662,19 +702,41 @@ Invariants applied: ### Recovery Decision Tree -```mermaid -flowchart TD - START["Open meta store"] --> HAS_LCL{"streaming:last_committed_ledger\npresent?"} - HAS_LCL -->|no| FIRST["First start:\nvalidate backfill flags\nload .bin files\nset checkpoint"] - HAS_LCL -->|yes| RECONCILE["Reconcile:\ncomplete orphaned transitions\ndelete orphaned 
artifacts"] - FIRST --> RECONCILE - RECONCILE --> BOUNDARY{"Checkpoint at\nchunk/index boundary\nwith absent flags?"} - BOUNDARY -->|yes| REPLAY["Replay missed transitions"] - BOUNDARY -->|no| BUILD - REPLAY --> BUILD{"Any BUILD_READY\nindexes?"} - BUILD -->|yes| SPAWN["Spawn background\nRecSplit builds"] - BUILD -->|no| RESUME - SPAWN --> RESUME["Open active stores\nStart CaptiveStellarCore\nBegin ingestion"] +```python +def recover_on_startup(config, meta_store): + last_committed = meta_store.get("streaming:last_committed_ledger") + + if last_committed is None: + # First start in streaming mode — no prior streaming checkpoint exists. + # Validate that backfill left complete data, then load .bin files into RocksDB. + _head_chunk, durable_tail = validate_chunk_coverage(config, meta_store) + load_backfill_bins(config, meta_store) + last_committed = chunk_last_ledger(durable_tail) + meta_store.put("streaming:last_committed_ledger", last_committed) + + # Reconcile: complete orphaned transitions, delete orphaned artifacts. + # Handles crashes during LFS flush, events freeze, or RecSplit build. + reconcile_orphaned_transitions(config, meta_store) + + # Replay missed boundary transitions. + # If the checkpoint landed exactly on a chunk boundary but the process crashed + # before the boundary transitions (swap store, spawn flush/freeze) fired, + # the flags for that chunk are absent. Detect and trigger them now. + current_chunk = (last_committed - 2) // 10_000 # subtract 2: Stellar ledgers start at 2 + if last_committed == chunk_last_ledger(current_chunk): + if not meta_store.has(f"chunk:{current_chunk:08d}:lfs"): + trigger_lfs_flush(current_chunk, config, meta_store) + if not meta_store.has(f"chunk:{current_chunk:08d}:events"): + trigger_events_freeze(current_chunk, config, meta_store) + + # Spawn background RecSplit for any BUILD_READY indexes. + # An index is BUILD_READY when all chunk lfs+events flags are set but + # index:{N:08d}:txhash is absent (RecSplit not yet built or crashed mid-build).
+ for index_id in indexes_with_all_chunk_flags_but_no_index_flag(meta_store): + run_in_background(recsplit_transition, index_id) + + resume_ledger = last_committed + 1 + return meta_store, resume_ledger ``` --- @@ -694,7 +756,7 @@ flowchart TD |---|---|---| | `chunk:{C:08d}:lfs` | Both | Backfill: after pack file fsync in `process_chunk`. Streaming: after LFS flush goroutine fsync. | | `chunk:{C:08d}:events` | Both | Backfill: after cold segment fsync in `process_chunk`. Streaming: after events freeze goroutine fsync. | -| `chunk:{C:08d}:txhash` | Backfill only | After `.bin` file fsync. Streaming does NOT write `.bin` files — txhash data goes to RocksDB. These flags are deleted during first-start .bin loading (step 3). | +| `chunk:{C:08d}:txhash` | Backfill only | After `.bin` file fsync. Streaming does NOT write `.bin` files — txhash data goes to RocksDB. Flags deleted during startup step 3 (.bin loading). | | `index:{N:08d}:txhash` | Both | After all 16 RecSplit CF `.idx` files built + fsynced. Backfill: from `.bin` files. Streaming: from RocksDB txhash store. | ### Key Lifecycle in Streaming @@ -710,7 +772,7 @@ Per chunk (background, after chunk boundary): Per index (background, after index boundary): index:{N:08d}:txhash = "1" (after all 16 .idx files fsync + verify) -First-start only (step 3): +Startup step 3 (first start in streaming mode only): chunk:{C:08d}:txhash → DELETED (after .bin loaded into RocksDB) .bin files → DELETED (after flag deleted) ``` @@ -763,7 +825,8 @@ drift_ledgers = network_tip_ledger - last_committed_ledger ## Query Routing -Query routing during streaming transitions is covered in a separate design document. 
For reference, the routing summary: +- Covered in a separate design document +- Summary routing table for reference: | Query | Active phase | Transitioning phase | Complete phase | |---|---|---|---| From 3e4a2cfdc2f4e9a977ad7dcd9f68afcbd331e858 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 15 Apr 2026 18:41:13 -0700 Subject: [PATCH 03/34] Fix .bin loading language: not always present on first start - .bin files only exist if backfill left a partial txhash index - If backfill ended on an index boundary, step 3 is a no-op - Remove language implying .bin loading always happens on first start --- .../design-docs/02-streaming-workflow.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index a22d6fb36..102659703 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -154,9 +154,9 @@ def start_streaming(config): head_chunk, durable_tail = validate_chunk_coverage(config, meta_store) # ── 3. Load backfill txhash data into RocksDB ── - # Backfill wrote txhash data as .bin flat files. Streaming needs the data in RocksDB - # for queries and as the single input source for RecSplit. Load .bin → RocksDB, - # then delete .bin files + txhash flags. Only does work on first start. No-op on restarts. + # If backfill left a partial txhash index, .bin files exist for the backfill chunks. + # Load .bin → RocksDB, then delete .bin files + txhash flags. + # No-op if backfill ended on an index boundary or on restarts. load_backfill_bins(config, meta_store) # ── 4. Reconcile orphaned transitions ── @@ -183,10 +183,9 @@ def start_streaming(config): # ── 9. Start CaptiveStellarCore ── # CaptiveStellarCore takes ~4-5 minutes to spin up to the target ledger. - # Steps 1-8 run sequentially before this point. 
On the first start in - # streaming mode, step 3 (.bin loading) takes ~4-5 minutes — a one-time - # cost that does not recur on subsequent restarts. On restarts, steps 1-8 - # complete in seconds (no .bin files to load, WAL replay is fast). + # Steps 1-8 run sequentially before this point. + # If backfill left a partial index with .bin files, step 3 takes ~4-5 minutes + # (one-time cost, does not recur). Otherwise, steps 1-8 complete in seconds. core = start_captive_core(config, resume_ledger) # ── 10. Begin ingestion loop ── @@ -284,8 +283,9 @@ Result: head_chunk=2000, durable_tail=5633 ### Step 3: Load Backfill TxHash Data into RocksDB -- On first start in streaming mode, backfill's `.bin` files contain txhash data that must be loaded into the RocksDB txhash store — needed for query serving and as the single input source for RecSplit -- This step runs on every startup (not just first start in streaming mode) for robustness — on restarts, the loop finds no `txhash` flags and is a no-op +- If backfill left a partial txhash index, `.bin` files exist for the backfill chunks. Streaming loads these into the RocksDB txhash store — needed for query serving and as the single input source for RecSplit. +- If backfill ended on an index boundary, no `.bin` files exist — this step is a no-op +- Runs on every startup for robustness. On restarts, the loop finds no `txhash` flags and skips. ```python def load_backfill_bins(config, meta_store): From 7c7f014188f8e8ac91be050fcfd07f043313ab60 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 15 Apr 2026 18:55:28 -0700 Subject: [PATCH 04/34] Fix events terminology: persisted index deltas, not WAL-backed deltas MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Events system persists (term_key, event_id) pairs per ledger in the embedded DB for crash recovery. On restart, deltas are replayed to rebuild in-memory bitmaps. - WAL is an internal DB mechanism — nobody reads it directly. 
- Replace all "WAL-backed deltas" / "Events WAL" with "persisted deltas" / "persisted index deltas" for events-specific references. - RocksDB WAL references for ledger/txhash stores remain unchanged (correct usage). --- full-history/design-docs/02-streaming-workflow.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 102659703..885b67a7e 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -113,11 +113,11 @@ Streaming maintains three active stores for the current ingestion position: |---|---|---|---|---| | Ledger | `{active_storage.path}/ledger-store-chunk-{chunkID:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | | TxHash | `{active_storage.path}/txhash-store-index-{indexID:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every 10_000_000 ledgers (index) | -| Events | In-memory hot segment + WAL-backed deltas | Sequential event ID | Event XDR + metadata | Every 10_000 ledgers (chunk) | +| Events | In-memory hot segment + persisted index deltas | Sequential event ID | Event XDR + metadata | Every 10_000 ledgers (chunk) | - **Ledger store**: default CF only. One RocksDB instance per chunk. WAL required. - **TxHash store**: 16 column families (`cf-0` through `cf-f`), routed by `txhash[0] >> 4`. One RocksDB instance per index (spans 1_000 chunks). WAL required. -- **Events hot segment**: in-memory roaring bitmaps + WAL-backed per-ledger deltas. ~370 MB per segment (measured). See [getEvents full-history design](../../design-docs/getevents-full-history-design.md) for format details. +- **Events hot segment**: in-memory roaring bitmaps + persisted per-ledger index deltas (for crash recovery — replayed on startup to rebuild bitmaps). ~370 MB per segment (measured). 
See [getEvents full-history design](../../design-docs/getevents-full-history-design.md) for format details. ### Store Pre-creation @@ -479,7 +479,7 @@ def write_events_hot_segment(hot_segment, ledger_seq, lcm): hot_segment.store_event(event_id, event) # persist event data for term_key in index_terms(event): # contractId + topic0-3 hot_segment.add_to_bitmap(term_key, event_id) # in-memory bitmap update - hot_segment.persist_deltas(ledger_seq) # WAL-backed per-ledger delta + hot_segment.persist_deltas(ledger_seq) # (term_key, event_id) pairs → DB for crash recovery hot_segment.update_offset_array(ledger_seq) # cumulative event count hot_segment.commit(ledger_seq) # atomic commit ``` @@ -550,7 +550,7 @@ def events_transition(chunk_id, freezing_segment, meta_store): meta_store.put(f"chunk:{chunk_id:08d}:events", "1") # flag after fsync # ── Cleanup (separate step) ── - freezing_segment.discard() # delete WAL deltas + in-memory bitmaps + freezing_segment.discard() # delete persisted deltas + in-memory bitmaps signal_events_complete() ``` @@ -612,7 +612,7 @@ def recsplit_transition(index_id, transitioning_txhash_store, meta_store): ```python # Per chunk boundary: # lfs_transition(C) → cleanup: delete ledger store → unblocks lfs_transition(C+1) -# events_transition(C) → cleanup: delete events WAL → unblocks events_transition(C+1) +# events_transition(C) → cleanup: delete persisted deltas → unblocks events_transition(C+1) # # Per index boundary (= last chunk boundary of that index): # ALL lfs_transition + events_transition for the index must complete @@ -663,7 +663,7 @@ State: streaming:last_committed_ledger = 56_380_001 (= chunk_last_ledger(5637)) chunk:00005636:events = absent (freeze crashed) chunk:00005637:lfs = absent (transitions never started) chunk:00005637:events = absent - Events WAL has data for chunks 5636 + 5637 + Events DB has persisted deltas for chunks 5636 + 5637 Invariants applied: - Flag-after-fsync (#1): chunk 5636 events flag absent → freeze is retried 
From 53c3d94778652505ac16bf2d3756d3951f844fc1 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Tue, 21 Apr 2026 08:26:09 -0700 Subject: [PATCH 05/34] Rewrite streaming doc as unified-daemon design MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replaces the two-mode streaming spec with the unified daemon design that absorbs Tamir's gist plus the four resolved questions and Karthik's session corrections. No separate backfill subcommand — one stellar-rpc --config ... invocation, four startup phases, index-atomic pruning, HTTP 4xx query gating, retention-immutable enforcement, load-then-delete .bin hydration. Adds: - Terminology section (low-water, leapfrog, backfill subroutine, active vs immutable store, freeze transitions, chunk/index boundaries) - Operator Scenarios section (fresh archive with seamless cutover; Alice's tip-tracker without BSB) - WHY comments on every non-trivial pseudocode block - Ledger Source abstraction (BSBSource / CaptiveCoreSource) - Pruning section (index-atomic, two-phase "deleting" marker) - Query Contract section (HTTP 4xx during Phases 1-3, in-memory daemon_ready flag) - Required Backfill Design Changes appendix Removes: - --mode CLI flag and all per-run flags (now just --config / log overrides) - Two-mode "How streaming differs from backfill" framing - Mutability-era retention language - "streaming mode" / "sub-flow" legacy wording Reorders: Active Store Architecture now precedes Ingestion Loop so readers see data structures before the code that uses them. 
--- .../design-docs/02-streaming-workflow.md | 1508 ++++++++++------- 1 file changed, 890 insertions(+), 618 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 885b67a7e..dd1d42bc1 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -2,804 +2,1079 @@ ## Overview -- Streaming mode ingests live Stellar ledgers from CaptiveStellarCore, one ledger at a time, while serving queries -- Default mode — when `--mode` is not specified, the service runs in streaming mode +The stellar-rpc daemon is the full-history RPC service. One binary, one invocation, one long-running process. + +- Operator runs `stellar-rpc --config path/to/config.toml`. No subcommand. No `--mode` flag. No behavior-switching flags. +- On every start the daemon runs four sequential startup phases, then enters a live ingestion loop it stays in until killed. +- Behavior across the three operator profiles (archive, pruning-history, tip-tracker) is determined entirely by TOML config — no profile flag. +- Backfill (`01-backfill-workflow.md`) is used as an internal subroutine by Startup Phase 1. Operators never invoke backfill directly. + +**What the daemon does end-to-end:** +- Validates config against immutable meta-store state (`CHUNKS_PER_TXHASH_INDEX` and `RETENTION_LEDGERS`). +- Catches up to the current network tip using BSB or captive core, whichever is configured. +- Hydrates any in-flight state left by a prior run. +- Ingests live ledgers from `CaptiveStellarCore` at ~1 per 6 seconds. +- Writes each live ledger to three active stores (ledger, txhash, events). +- Freezes active stores to immutable files at chunk and index boundaries in background. +- Prunes past-retention indexes atomically when retention is configured. +- Serves `getLedger`, `getTransaction`, `getEvents` only after startup phases complete. Returns HTTP 4xx during startup. 
-**What streaming does:** -- Ingests ledgers from CaptiveStellarCore in real-time (~1 ledger every 6 seconds) -- Writes each ledger to three active stores: ledger RocksDB, txhash RocksDB (16 CFs), and events hot segment -- Serves `getLedger`, `getTransaction`, and `getEvents` queries concurrently with ingestion -- At chunk boundaries (every 10_000 ledgers): transitions ledger store to LFS pack file and freezes events hot segment to cold segment — both in background -- At index boundaries (every 10_000_000 ledgers): builds RecSplit txhash index from transitioning txhash store — in background -- Long-running daemon that exits only on fatal error - -**How streaming differs from backfill:** - -| Dimension | Backfill | Streaming | -|---|---|---| -| Data source | BSB (GCS) | CaptiveStellarCore | -| RocksDB for ingestion | No — writes directly to files | Yes — three active stores | -| Txhash write format | `.bin` flat files (36 bytes/entry) | RocksDB txhash store (16 CFs) | -| RecSplit input | `.bin` flat files | RocksDB txhash store | -| Checkpoint granularity | Per-chunk (10_000 ledgers) | Per-ledger | -| Concurrency | Flat worker pool (DAG scheduler, default GOMAXPROCS slots) | Single ingestion goroutine, background transition goroutines | -| Queries | Not served (`getHealth`/`getStatus` only) | All endpoints available | -| Crash recovery | Re-run from first incomplete chunk | Resume from `streaming:last_committed_ledger + 1` | -| Process lifecycle | Exits when done | Long-running daemon | -| Transition workflow | Direct-write to immutable files (no active store to tear down) | Active RocksDB → immutable (LFS + RecSplit) via background goroutines | -| Dependency tracking | Static DAG — all tasks known upfront, dispatched as dependencies resolve | Dynamic DAG — tasks created at chunk/index boundaries, dependencies enforced by per-sub-flow sync points | - -**Why streaming uses a dynamic DAG:** -- Backfill knows the full ledger range at startup — all `process_chunk`, 
`build_txhash_index`, and `cleanup_txhash` tasks can be registered upfront -- Streaming ingests indefinitely — chunk/index boundaries are discovered as ledgers arrive, so tasks cannot be registered upfront -- The dependency structure is the same (flush/freeze must complete before next boundary, all chunk transitions must complete before RecSplit) — only the registration timing differs -- See [Sub-flow Transitions](#sub-flow-transitions) for the full transition DAG - -### Main Flow - -```python -def run_streaming(config): - - # 1. Startup — validate, reconcile, recover from any prior crash - meta_store, resume_ledger = startup(config) - - # 2. Start CaptiveStellarCore — takes ~4-5 min to reach resume_ledger - core = start_captive_core(config, resume_ledger) +--- - # 3. Ingest — one ledger at a time, forever - for lcm in core.stream_ledgers(): - process_ledger(lcm, meta_store) -``` +## Terminology + +Terms used repeatedly throughout this doc. Skim on first read, refer back when a term surfaces later. + +- **Daemon** — the stellar-rpc binary running as one long-lived process. The only operator-facing entry point. +- **Startup phases 1–4** — sequential bootstrap work the daemon runs once per process start, before serving queries. Not a lifecycle concept — once Phase 4 is reached, it stays there until the process exits. [Details](#startup-sequence). +- **Phase 1 catchup** — the startup phase that closes the gap between the last-committed ledger and the current network tip. Invokes the backfill subroutine internally. +- **Backfill (subroutine)** — a self-contained mechanism that ingests a known `[range_start, range_end]` chunk range via a static DAG of per-chunk tasks (`process_chunk`, `build_txhash_index`, `cleanup_txhash`). Specified in `01-backfill-workflow.md`. In the unified design, backfill is an internal callable only — no CLI entry point exists. 
+- **Leapfrog** — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. +- **Phase 1 low-water** (`derive_phase1_low_water`) — the last ledger of the contiguous prefix of `chunk:{C}:lfs` flags starting from the lowest chunk on disk. Phase 1 uses this to decide what's still left to ingest. **Not the same** as `streaming:last_committed_ledger`. +- **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside the Phase 4 ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. +- **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: + - Ledger active store — a per-chunk RocksDB (one instance per chunk). + - TxHash active store — a per-index RocksDB with 16 column families (one instance per index). + - Events hot segment — in-memory roaring bitmaps plus persisted per-ledger index deltas (not a RocksDB; see [getEvents design](../../design-docs/getevents-full-history-design.md)). +- **Immutable store** — on-disk files produced by freezing an active store. Three kinds: + - Ledger pack file (one per chunk). + - RecSplit index `.idx` files (16 per index). + - Events cold segment (three files per chunk: `events.pack`, `index.pack`, `index.hash`). +- **Freeze transition** — a background goroutine that converts an active store's contents to immutable files and deletes the active store. Three transitions total: two per chunk (LFS, events) and one per index (RecSplit). +- **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze. 
`chunk_first_ledger(C)` always ends in `..._02`; `chunk_last_ledger(C)` always ends in `..._01`. No partial chunks — every chunk on disk is a full 10_000-ledger chunk. +- **Txhash index** (a.k.a. "index") — `CHUNKS_PER_TXHASH_INDEX` consecutive chunks. Atomic unit of retention pruning. Formulas in [Geometry](#geometry). +- **Chunk boundary** — the moment ingestion commits the last ledger of a chunk. Triggers background LFS + events freeze for that chunk. +- **Index boundary** — the moment ingestion commits the last ledger of an index. Triggers background RecSplit build for that index. Every index boundary is also a chunk boundary. +- **Catchup** — synonym for "close the gap between last-committed ledger and current tip". Performed inside Phase 1. +- **`.bin` file** — a backfill-produced raw txhash flat file (transient). Exists only for chunks the backfill subroutine has flagged `:txhash` but whose containing index has not yet had its RecSplit built. Deleted by Phase 2 once loaded into the active txhash RocksDB. Streaming's live path never produces `.bin` files. -Each step is detailed in the sections that follow: [Startup Sequence](#startup-sequence), [Ingestion Loop](#ingestion-loop), and [Sub-flow Transitions](#sub-flow-transitions). +--- -### Boundary Math Reference +## Geometry -Boundary formulas are defined in [01-backfill-workflow.md](./01-backfill-workflow.md#geometry). Quick reference: +Chunk and txhash index math are defined in [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Quick reference: ```python -# Stellar ledgers start at 2 (not 0 or 1), so all formulas subtract 2 to zero-base -chunk_id = (ledger_seq - 2) // 10_000 # e.g. ledger 56_340_001 → chunk 5633 -chunk_first_ledger(C) = (C * 10_000) + 2 # e.g. chunk 5634 → ledger 56_340_002 -chunk_last_ledger(C) = ((C + 1) * 10_000) + 1 # e.g. chunk 5634 → ledger 56_350_001 -index_id = chunk_id // chunks_per_txhash_index # e.g. 
chunk 5634 → index 5 -index_last_ledger(N) = ((N + 1) * chunks_per_txhash_index * 10_000) + 1 # e.g. index 5 → ledger 60_000_001 +# Stellar ledgers start at 2. All formulas subtract 2 to zero-base. +chunk_id = (ledger_seq - 2) // 10_000 # ledger 56_340_001 → chunk 5633 +chunk_first_ledger(C) = (C * 10_000) + 2 # chunk 5634 → ledger 56_340_002 +chunk_last_ledger(C) = ((C + 1) * 10_000) + 1 # chunk 5634 → ledger 56_350_001 +index_id(C) = C // CHUNKS_PER_TXHASH_INDEX # chunk 5634 → index 5 (at cpi=1000) +index_last_ledger(N) = ((N + 1) * CHUNKS_PER_TXHASH_INDEX * 10_000) + 1 # index 5 → ledger 60_000_001 +LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * 10_000 # derived; at cpi=1000 this is 10_000_000 ``` --- ## Configuration -### Immutable Config (shared with backfill) +Streaming reads the same TOML file as backfill, plus additional keys described below. + +### Shared Config (from backfill) -- Streaming reads the same TOML config as backfill -- The following keys are immutable — set on first backfill, fatal error if changed: +All of `[SERVICE]`, `[BACKFILL]`, `[IMMUTABLE_STORAGE.*]`, `[META_STORE]`, `[LOGGING]` apply unchanged. See [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration) for the full schema. -| Key | Stored in Meta Store | Set By | Description | +### Immutable Keys (stored in meta store, fatal if changed) + +Two keys are stored on first start and enforced on every subsequent start. Changing either requires wiping the datadir. + +| Key | Stored under | Set by | Rule | |---|---|---|---| -| `chunks_per_txhash_index` | `config:chunks_per_txhash_index` | First backfill run | Chunks per txhash index. Must never change. | +| `CHUNKS_PER_TXHASH_INDEX` | `config:chunks_per_txhash_index` | first run | Fatal if changed. | +| `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | -All `[immutable_storage.*]` paths from the backfill config apply unchanged. 
+Source selection (BSB vs captive core) is determined per-startup by `[BACKFILL.BSB]` presence. Operators may add or remove BSB between runs without wiping — the daemon extends coverage forward from `derive_phase1_low_water` regardless of source. Retention immutability is what constrains the data envelope; source choice doesn't need its own immutability gate. ### Streaming-Specific TOML -**[streaming]** +**[STREAMING]** + +| Key | Type | Default | Description | +|---|---|---|---| +| `RETENTION_LEDGERS` | uint32 | `0` | `0` = full history; otherwise must be a positive multiple of `LEDGERS_PER_INDEX`. See [Validation Rules](#validation-rules). | +| `CAPTIVE_CORE_CONFIG` | string | **required** | Path to CaptiveStellarCore config file. | +| `DRIFT_WARNING_LEDGERS` | uint32 | `10` | `getHealth` reports unhealthy when ingestion drift exceeds this. ~60 seconds at 10 ledgers. | + +**[STREAMING.ACTIVE_STORAGE]** | Key | Type | Default | Description | |---|---|---|---| -| `captive_core_config` | string | **required** | Path to CaptiveStellarCore config file. | -| `drift_warning_ledgers` | int | `10` | Log warning when ingestion lags network tip by this many ledgers (~60 seconds at 10 ledgers). | +| `PATH` | string | `{DEFAULT_DATA_DIR}/active` | Base path for active RocksDB stores (ledger, txhash, events). | -**[streaming.active_storage]** +**[HISTORY_ARCHIVES]** | Key | Type | Default | Description | |---|---|---|---| -| `path` | string | `{default_data_dir}/active` | Base path for active RocksDB stores (ledger, txhash, events). | +| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample tip via `/.well-known/stellar-history.json` when Phase 1 uses captive core. Same key the existing ingest service reads. | + +**[BACKFILL.BSB]** — optional when the daemon runs + +Same schema as in the backfill doc. 
Presence in the config file determines which source Phase 1 uses: + +- If `[BACKFILL.BSB]` is present: Phase 1 uses BSB (fast, parallel catchup from GCS). +- If `[BACKFILL.BSB]` is absent: Phase 1 uses captive core (slower, but no GCS dep). + +See [Ledger Source](#ledger-source) for the full source-selection rule. ### CLI Flags | Flag | Type | Default | Description | |---|---|---|---| -| `--mode` | string | `streaming` | `streaming` or `full-history-backfill`. | | `--config` | string | **required** | Path to TOML config file. | +| `--log-level` | string | from `[LOGGING].LEVEL` | Override log level. | +| `--log-format` | string | from `[LOGGING].FORMAT` | Override log format. | + +**No other flags.** No `--mode`, no `--start-ledger`, no `--end-ledger`, no subcommand. Any per-run behavior is either driven by config or derived at runtime from meta store + tip. + +### Validation Rules + +- `CHUNKS_PER_TXHASH_INDEX` immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). +- `RETENTION_LEDGERS` immutable across runs. +- `RETENTION_LEDGERS` must be `0` OR a positive integer multiple of `LEDGERS_PER_INDEX`. Valid at `cpi=1000`: `0`, `10_000_000`, `20_000_000`, `30_000_000`, etc. Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum). Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. +- `[BACKFILL.BSB]` optional — presence determines Phase 1 source. May be added or removed between runs. +- `[HISTORY_ARCHIVES].URLS` required in all profiles. +- `CAPTIVE_CORE_CONFIG` required in all profiles. + +### Operator Profiles + +Three profiles emerge from config combinations. No profile flag. + +| Profile | `RETENTION_LEDGERS` | `[BACKFILL.BSB]` | Phase 1 source | Use case | +|---|---|---|---|---| +| Archive | `0` | present | BSB | Public archive node; full history. 
| +| Pruning-history | `N * LEDGERS_PER_INDEX`, N ≥ 1 | present | BSB | Windowed history with bulk initial catchup. | +| Tip-tracker | `N * LEDGERS_PER_INDEX`, N ≥ 1 | absent | captive core | App developer; small retention; no GCS dep. | + +--- + +## Operator Scenarios + +Worked examples showing what operators configure, what happens at runtime, and how crashes recover. Reference scenarios for PRD / test planning. + +### Scenario A — Fresh full-history archive, seamless cutover to live + +**Setup**: operator Bob wants a public archive node. Full history, retained forever, catchup from BSB, then live streaming. + +**Config** (`/etc/stellar-rpc/config.toml`): + +```toml +[SERVICE] +DEFAULT_DATA_DIR = "/data/stellar-rpc" + +[BACKFILL] +CHUNKS_PER_TXHASH_INDEX = 1000 # default; 10M ledgers per index + +[BACKFILL.BSB] +BUCKET_PATH = "sdf-ledger-close-meta/v1/ledgers/pubnet" + +[STREAMING] +RETENTION_LEDGERS = 0 # full history +CAPTIVE_CORE_CONFIG = "/etc/stellar-rpc/captive-core.cfg" + +[HISTORY_ARCHIVES] +URLS = ["https://history.stellar.org/prd/core-live/core_live_001/"] + +[LOGGING] +LEVEL = "info" +FORMAT = "text" +``` + +**Invocation**: + +``` +stellar-rpc --config /etc/stellar-rpc/config.toml +``` + +**Happy path**: + +- Daemon starts. `validate_config` stores `config:chunks_per_txhash_index = "1000"` and `config:retention_ledgers = "0"` on first start. +- Phase 1 picks `BSBSource` (BSB is configured). `run_backfill(0, last_complete_chunk_at_tip, source=BSBSource)`. +- Static DAG over ~5_600 chunks (at tip ~56M). Parallel BSB workers pull ledgers at GOMAXPROCS chunks at a time. Runs ~12h. +- Phase 1 exits when `T - L < 10_000`. All indexes 0..N complete with RecSplit built by the DAG; at most one partial trailing index remains. +- Phase 2 loads any `.bin` files left by the trailing partial index into the active txhash store, deletes `.bin` + `:txhash` flags. +- Phase 3 is a no-op (no orphan stores on a clean first run). 
+- Phase 4 opens active stores at `resume_ledger = last_phase1_ledger + 1`, starts captive core via `PrepareRange(UnboundedRange(resume_ledger))`, enters live ingestion. +- Queries begin serving at the moment Phase 4 flips `daemon_ready = true`. + +**No operator action between Phase 1 and Phase 4.** The cutover is automatic. + +**Crash recovery**: + +- Crash during Phase 1's BSB download at chunk 3_457: on restart, `derive_phase1_low_water` walks `:lfs` flags, returns the end of the contiguous prefix (say, chunk 3_200). `phase1_catchup` re-enters, `compute_backfill_range` produces a new range, backfill re-runs from chunk 3_201 forward. Chunks that already had `:lfs` are skipped via per-chunk idempotency. +- Crash after all chunks written but before index 3's RecSplit built: on restart, Phase 1 sees `index:3:txhash` absent → backfill's DAG re-runs the RecSplit build from the `.bin` files. Succeeds. +- Crash while `.bin` files from the trailing index are being loaded into the active txhash store (Phase 2): on restart, Phase 2 re-runs. Chunks that were already loaded had their `:txhash` flag deleted and `.bin` file removed — the loop skips them via the flag check. Chunks not yet loaded retain their `:txhash` flag and `.bin` file — the loop picks them up. +- Crash between the per-ledger checkpoint commit and chunk freeze during live ingestion: `streaming:last_committed_ledger = chunk_last_ledger(C)` but `chunk:{C}:lfs` absent. Phase 3 triggers the missing transitions when the daemon restarts, before Phase 4 re-enters. + +In every case: the daemon reaches a consistent state after one restart. No manual intervention. Dangling `.bin` files from incomplete indexes are cleaned by Phase 2 once the owning index progresses further. + +### Scenario B — Alice's tip-tracker (no BSB, small retention) + +**Setup**: Alice is building a wallet app. She wants live events only, starting from the current network tip. She doesn't want to stand up a GCS bucket for BSB. 
She picks `cpi=1` and `RETENTION_LEDGERS = 10_000` (one index worth, ~16 hours at 6s/ledger). + +**Config**: + +```toml +[SERVICE] +DEFAULT_DATA_DIR = "/data/stellar-rpc" + +[BACKFILL] +CHUNKS_PER_TXHASH_INDEX = 1 # minimum; one chunk per index + +[STREAMING] +RETENTION_LEDGERS = 10_000 # one index = ~16 hours +CAPTIVE_CORE_CONFIG = "/etc/stellar-rpc/captive-core.cfg" + +[HISTORY_ARCHIVES] +URLS = ["https://history.stellar.org/prd/core-live/core_live_001/"] + +[LOGGING] +LEVEL = "info" +FORMAT = "text" + +# No [BACKFILL.BSB] section — Phase 1 uses captive core. +``` + +**Invocation**: same as Scenario A. + +**Happy path on first-ever start** (say network tip is `56_342_637`): + +- Daemon starts. Validates config; stores immutable keys. +- `[BACKFILL.BSB]` absent → Phase 1 source is `CaptiveCoreSource`. +- Source samples tip via HTTP GET against `HISTORY_ARCHIVE_URLS`: tip = `56_342_637`. +- `compute_backfill_range(L=1, T=56_342_637, R=10_000, cpi=1)` — leapfrog lands at `index_first_ledger(index_id(T - R))`: + - `T - R = 56_332_637`. + - `chunk_id(56_332_637) = 5_633`. `index_id_of_chunk(5_633) = 5_633` (cpi=1). + - `index_first_ledger(5_633) = 56_330_002`. +- Backfill range is chunks `5_633..5_633` (one chunk to close the gap to tip at chunk 5_633, which is `last_complete_chunk_at(56_342_637)`). Up to ~10_000 ledgers of archive-catchup via captive core. Takes ~3–8 minutes. +- Phase 2 loads the one `.bin` file into the active txhash store, deletes it. +- Phase 4 opens active stores, starts captive core for live streaming from `resume_ledger = 56_340_002`, enters ingestion loop. + +**Why leapfrog lands ~10_000 ledgers back instead of exactly at tip**: Alice's first chunk must be a complete chunk (starts at `..._02`, ends at `..._01`). If the daemon started ingesting at `tip = 56_342_637` (mid-chunk), chunk 5_634 would be missing ledgers `56_340_002..56_342_636` — the no-gaps invariant would break and RecSplit for chunk 5_634's index could never be built. 
Leapfrog alignment is what keeps no-gaps intact. + +**What if Alice picks `cpi=10` instead of `cpi=1`?** + +- `LEDGERS_PER_INDEX = 100_000`. Minimum `RETENTION_LEDGERS = 100_000` (~7 days). Alice's `RETENTION_LEDGERS = 10_000` is invalid — `validate_config` fatals at startup with a clear error message. +- If Alice fixes retention to `100_000`, Phase 1 captive-core archive-catchup spans up to 100_000 ledgers (~30–60 min on first start). Once live, steady state is the same as cpi=1. + +**What if Alice wants "just start from tip, don't catch up anything"?** + +- Not possible under this design. The no-gaps invariant requires the first chunk to be complete. If tip falls mid-chunk, the daemon must ingest earlier ledgers to round down to an index boundary. Minimum leapfrog catchup at cpi=1 is ≤10_000 ledgers (~minutes via captive core). That's the floor. + +**Subsequent restart** (say after 1h downtime): + +- Daemon starts. `streaming:last_committed_ledger` is present from the prior run. Phase 1 samples tip; `T - L` is ~600 ledgers (10 min at 6s) — less than one chunk → Phase 1 exits immediately. +- Phase 2 finds no `.bin` files (deleted on first start). No-op. +- Phase 3 reconciles any orphan active stores from the crash — typical case is completing an interrupted chunk freeze. +- Phase 4 re-opens active stores, starts captive core, re-enters the ingestion loop. Captive core's own archive-catchup closes the 600-ledger gap in ~seconds, then cadence settles to live closes. + +**Crash recovery within Phase 1 (first-ever start)**: + +- Captive core subprocess crashes mid-archive-catchup: daemon retries spinning captive core up. No persisted state to roll back — the partial chunk's data was in the active store's WAL; captive core re-archive-catches-up from whatever ledger the WAL wasn't past. +- Daemon process itself crashes: on restart, `derive_phase1_low_water` returns whatever contiguous prefix exists. Phase 1 re-enters. Eventually completes. 
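The leapfrog arithmetic and the retention check used in Scenario B can be sketched in a few lines of Python. This is an illustrative sketch only: `retention_is_valid`, `leapfrog_first_chunk`, and `last_complete_chunk_at` are hypothetical helper names chosen for this example (the design's own `validate_config` and `compute_backfill_range` may be shaped differently), and the chunk formulas are the ones from the Geometry section.

```python
LEDGERS_PER_CHUNK = 10_000
FIRST_LEDGER = 2  # Stellar ledgers start at 2


def chunk_id(ledger_seq):
    return (ledger_seq - FIRST_LEDGER) // LEDGERS_PER_CHUNK


def chunk_first_ledger(c):
    return c * LEDGERS_PER_CHUNK + FIRST_LEDGER


def chunk_last_ledger(c):
    return (c + 1) * LEDGERS_PER_CHUNK + 1


def retention_is_valid(retention, cpi):
    # Hypothetical check: 0 means full history; otherwise retention must be
    # a positive multiple of LEDGERS_PER_INDEX (= cpi * 10_000).
    ledgers_per_index = cpi * LEDGERS_PER_CHUNK
    return retention == 0 or (retention > 0 and retention % ledgers_per_index == 0)


def leapfrog_first_chunk(tip, retention, cpi):
    # Hypothetical helper: first chunk of the index containing (tip - retention).
    if retention == 0:
        return 0  # full history starts at chunk 0
    target = max(tip - retention, FIRST_LEDGER)
    index = chunk_id(target) // cpi
    return index * cpi


def last_complete_chunk_at(tip):
    # Highest chunk whose last ledger is <= tip (-1 if none is complete yet).
    c = chunk_id(tip)
    return c if chunk_last_ledger(c) == tip else c - 1


tip = 56_342_637
first = leapfrog_first_chunk(tip, retention=10_000, cpi=1)
last = last_complete_chunk_at(tip)
print(first, last, chunk_first_ledger(first), chunk_last_ledger(last) + 1)
```

With Alice's values this prints `5633 5633 56330002 56340002`: the backfill range is the single chunk 5_633, starting at ledger `56_330_002`, and live ingestion resumes at `56_340_002`. Under the same assumptions, `retention_is_valid(10_000, 10)` is false, which corresponds to the `cpi=10` case where startup validation rejects the config.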
+ +**Query behavior during Phase 1**: `HTTP 4xx` for all three query endpoints. `getHealth` reports `catching_up` and the drift. + +## Meta Store Keys + +Single RocksDB instance, WAL always enabled. Authoritative source for every startup decision. + +### Keys Introduced by Streaming + +| Key | Value | Written when | +|---|---|---| +| `streaming:last_committed_ledger` | uint32 (big-endian) | First written at top of Phase 4 to `derive_phase1_low_water(meta_store)`, the last ledger of Phase 1's contiguous `:lfs` coverage; subsequently after every committed live ledger. **Not updated during Phases 1–3.** Phase 1 progress is tracked by `chunk:{C}:lfs` flags alone. | +| `config:retention_ledgers` | decimal string | First run (stored); enforced on subsequent starts. | + +### Keys Shared with Backfill + +| Key | Semantics | +|---|---| +| `config:chunks_per_txhash_index` | Set on first run by whichever invocation runs first (here, the first daemon start). | +| `chunk:{C:08d}:lfs` | Set after ledger pack file fsync. | +| `chunk:{C:08d}:events` | Set after events cold segment fsync. | +| `chunk:{C:08d}:txhash` | Set by backfill subroutine after `.bin` fsync; deleted during Phase 2 hydration after `.bin` is loaded into RocksDB. Streaming live path does not write this key; it writes txhash directly to the active RocksDB txhash store. | +| `index:{N:08d}:txhash` | `"1"` after all 16 RecSplit CF `.idx` files built and fsynced. Transitions to `"deleting"` at the start of `prune_index`, deleted entirely when prune completes. Query routing treats `"deleting"` the same as absent. 
| + +### Key Lifecycle in Streaming + +``` +Phase 1 (backfill subroutine): + chunk:{C}:lfs = "1" (after pack fsync) + chunk:{C}:txhash = "1" (after .bin fsync) # only present for chunks that still have .bin on disk + chunk:{C}:events = "1" (after cold segment fsync) + index:{N}:txhash = "1" (after RecSplit, when all chunks of index N are done in Phase 1) + +Phase 2 (.bin hydration — see Startup Sequence): + For every chunk with :txhash flag and a .bin file: + load .bin into RocksDB txhash store + delete chunk:{C}:txhash flag + delete .bin file + After Phase 2, no chunk:{C}:txhash flags and no .bin files remain. + +Live path (per ledger): + streaming:last_committed_ledger = ledger_seq (after all 3 active stores commit) + +Live path (per chunk, background): + chunk:{C}:lfs = "1" (after pack fsync) + chunk:{C}:events = "1" (after cold segment fsync) + +Live path (per index, background): + index:{N}:txhash = "1" (after RecSplit + verify) + +Pruning (background, when index N is past retention): + index:{N}:txhash = "deleting" (FIRST; queries now return 4xx for this index) + [delete all files + per-chunk :lfs + :events keys for index N] + index:{N}:txhash → deleted (LAST) +``` + +### Flag Semantics + +- **Flag-after-fsync.** A flag is set only after the artifact it represents has been fsynced. Flag absent = artifact missing (or incomplete). +- **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility — derives from meta store key presence. No filesystem-scan-and-infer. --- ## Active Store Architecture -Streaming maintains three active stores for the current ingestion position: +The daemon maintains three active stores for the current ingestion position. All per-chunk and per-index lifecycle is driven by the [freeze transitions](#freeze-transitions). 
| Store | Path | Key | Value | Transition cadence | |---|---|---|---|---| -| Ledger | `{active_storage.path}/ledger-store-chunk-{chunkID:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | -| TxHash | `{active_storage.path}/txhash-store-index-{indexID:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every 10_000_000 ledgers (index) | +| Ledger | `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{C:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | +| TxHash | `{ACTIVE_STORAGE.PATH}/txhash-store-index-{N:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every `LEDGERS_PER_INDEX` ledgers (index) | | Events | In-memory hot segment + persisted index deltas | Sequential event ID | Event XDR + metadata | Every 10_000 ledgers (chunk) | -- **Ledger store**: default CF only. One RocksDB instance per chunk. WAL required. -- **TxHash store**: 16 column families (`cf-0` through `cf-f`), routed by `txhash[0] >> 4`. One RocksDB instance per index (spans 1_000 chunks). WAL required. -- **Events hot segment**: in-memory roaring bitmaps + persisted per-ledger index deltas (for crash recovery — replayed on startup to rebuild bitmaps). ~370 MB per segment (measured). See [getEvents full-history design](../../design-docs/getevents-full-history-design.md) for format details. +- Ledger and txhash stores are RocksDB. WAL required. +- TxHash store uses 16 column families (`cf-0`..`cf-f`) routed by `txhash[0] >> 4`. +- Events hot segment is in-memory roaring bitmaps plus persisted per-ledger index deltas for crash recovery. See [getEvents full-history design](../../design-docs/getevents-full-history-design.md). 
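
A minimal sketch of the key encodings and column-family routing described above, assuming the `cf-0`..`cf-f` names map to the hexadecimal value of the hash's high nibble. The function names are hypothetical and chosen for illustration; the real store code may differ.

```python
import struct


def ledger_store_key(ledger_seq: int) -> bytes:
    """uint32BE(ledgerSeq): big-endian keys sort bytewise in ledger order."""
    return struct.pack(">I", ledger_seq)


def txhash_store_value(ledger_seq: int) -> bytes:
    """Value stored under a txhash key: the ledger sequence as uint32BE."""
    return struct.pack(">I", ledger_seq)


def txhash_cf_name(txhash: bytes) -> str:
    """Column family for a 32-byte transaction hash: txhash[0] >> 4 selects 1 of 16."""
    if len(txhash) != 32:
        raise ValueError("txhash must be exactly 32 bytes")
    return f"cf-{txhash[0] >> 4:x}"


# Illustrative hash whose first byte is 0xab, so the high nibble is 0xa.
example_hash = bytes.fromhex("ab" + "00" * 31)
print(txhash_cf_name(example_hash))
```

Routing on the high nibble spreads hashes evenly across 16 column families, which is what lets the RecSplit build later process one CF at a time.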
### Store Pre-creation -- Stores for the next chunk/index are pre-created before the boundary is reached -- At transition time, only internal pointers change (active → transitioning, pre-created → active) -- Pre-created stores eliminate creation failures at boundary time -- On restart, pre-created stores are expected to exist — not treated as orphans +- The store for the next chunk / index is pre-created before the boundary is reached, so boundary-time work is a pointer swap only. +- On restart, a pre-created store is expected to exist — Phase 3 treats it as active, not an orphan. ### Max Concurrent Stores -| Sub-flow | Max active | Max transitioning | Max total | +| Store | Max active | Max transitioning | Max total | |---|---|---|---| | Ledger | 1 | 1 | 2 | -| Events | 1 (hot) | 1 (freezing) | 2 | +| Events | 1 (hot segment) | 1 (freezing cold segment) | 2 | | TxHash | 1 | 1 | 2 | --- -## Startup Sequence - -- The startup sequence is the same code path for first start in streaming mode (after backfill) and regular restarts -- The presence or absence of `streaming:last_committed_ledger` in the meta store distinguishes the two cases +## Ledger Source -### Startup Flow +Phase 1 reads ledgers from a source. Two implementations share one interface. Source is selected per-startup based on `[BACKFILL.BSB]` presence — no stored immutability gate. Operators may add or remove BSB between runs; retention immutability alone constrains the data envelope. ```python -def start_streaming(config): - meta_store = open_meta_store(config) +class LedgerSource: + """ + Provides a stream of LedgerCloseMeta for a contiguous ledger range. Used by the backfill + subroutine inside Phase 1. Live streaming (Phase 4) does NOT go through this abstraction — + it reads directly from CaptiveStellarCore via `ledgerBackend.PrepareRange(UnboundedRange(...))`. + """ + + def tip(self) -> int: + """Current network tip ledger. 
Used to compute Phase 1 target range.""" + + def get_range(self, start_ledger, end_ledger) -> Iterator[LedgerCloseMeta]: + """Stream LCMs for [start_ledger, end_ledger] inclusive. Must tolerate re-invocation — + the backfill DAG resumes per-chunk on crash.""" + + def max_parallelism(self) -> int: + """Upper bound on concurrent get_range calls the source can sustain. Backfill DAG + honors this when dispatching process_chunk workers.""" + + +class BSBSource(LedgerSource): + """ + Reads from the BSB (Buffered Storage Backend) bucket configured in [BACKFILL.BSB]. + + - Tip: queried from BSB's own range-end metadata. Same mechanism backfill uses today. + - get_range: parallel prefetch via BUFFER_SIZE + NUM_WORKERS knobs; same shape as backfill. + - max_parallelism: GOMAXPROCS (backfill's current default). + """ + + +class CaptiveCoreSource(LedgerSource): + """ + Drives a CaptiveStellarCore subprocess to replay ledgers from the history archive + peers. + + - Tip: fetched via HTTP GET on /.well-known/stellar-history.json against HISTORY_ARCHIVE_URLS. + Matches the existing ingest service pattern (Service.getNextLedgerSequence → archive.GetRootHAS()). + - get_range: drives captive core with ledgerBackend.PrepareRange(BoundedRange(start, end)), + drains sequential GetLedger(seq) calls. + - max_parallelism: 1. Captive core is a single heavy subprocess; parallelism would require + multiple subprocesses, each consuming several GB RAM. Backfill DAG dispatches chunks + sequentially when source is captive core. + """ +``` - # ── 1. Validate immutable config ── - validate_config(config, meta_store) +### Source Selection Rule - # ── 2. Validate backfill completeness ── - head_chunk, durable_tail = validate_chunk_coverage(config, meta_store) +```python +def choose_phase1_source(config): + """Called once at the top of Phase 1. 
Re-evaluated per startup.""" + if config.backfill.bsb is not None: + return BSBSource(config.backfill.bsb) + return CaptiveCoreSource(config.streaming.captive_core_config, + config.history_archives.urls) +``` - # ── 3. Load backfill txhash data into RocksDB ── - # If backfill left a partial txhash index, .bin files exist for the backfill chunks. - # Load .bin → RocksDB, then delete .bin files + txhash flags. - # No-op if backfill ended on an index boundary or on restarts. - load_backfill_bins(config, meta_store) +### Retention Semantics Under Captive Core - # ── 4. Reconcile orphaned transitions ── - # Complete any in-flight transitions from a previous crash. - reconcile_orphaned_transitions(config, meta_store) +When Phase 1 uses captive core, `RETENTION_LEDGERS` directly determines how many ledgers captive core must archive-catchup on first start: - # ── 5. Replay missed boundary handling ── - # If checkpoint is at a chunk/index boundary but transitions never fired. - replay_missed_boundaries(config, meta_store) +- `RETENTION_LEDGERS = 10_000_000` at `cpi=1000`: captive core archive-catches-up ~10M ledgers (hours to days). +- `RETENTION_LEDGERS = 10_000` at `cpi=1`: captive core archive-catches-up ~10K ledgers (~3–8 min). - # ── 6. Detect BUILD_READY indexes, spawn background RecSplit ── - spawn_pending_recsplit_builds(config, meta_store) +This is the main reason tip-tracker operators default to `cpi=1`: at cpi=1 a full index is 10K ledgers, so retention can be set small without violating the "multiple of LEDGERS_PER_INDEX" rule. - # ── 7. Determine resume ledger ── - last_committed = meta_store.get("streaming:last_committed_ledger") - if last_committed is None: - # First start in streaming mode: set checkpoint to backfill's last ledger - last_committed = chunk_last_ledger(durable_tail) - meta_store.put("streaming:last_committed_ledger", last_committed) - resume_ledger = last_committed + 1 +--- - # ── 8. 
Open/create active stores for the resume position ── - active_stores = open_active_stores(config, meta_store, resume_ledger) +## Startup Sequence - # ── 9. Start CaptiveStellarCore ── - # CaptiveStellarCore takes ~4-5 minutes to spin up to the target ledger. - # Steps 1-8 run sequentially before this point. - # If backfill left a partial index with .bin files, step 3 takes ~4-5 minutes - # (one-time cost, does not recur). Otherwise, steps 1-8 complete in seconds. - core = start_captive_core(config, resume_ledger) +Four sequential phases, same code path for first start and every restart. The first three are bounded bootstrap work; Phase 4 is the long-running state the daemon stays in until process exit. - # ── 10. Begin ingestion loop ── - run_ingestion_loop(core, active_stores, meta_store) -``` +- **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip by invoking the backfill subroutine in a loop. +- **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 left (for the trailing partial index) into the active txhash store, then deletes them. +- **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. Truncates events hot segment beyond the last committed ledger. +- **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle goroutine, flips the `daemon_ready` flag, enters the ingestion loop. Runs until process exit. -### Step 1: Validate Immutable Config +"Phase" here refers to the startup ordering only. Once Phase 4 is entered, there's no Phase 5 — the daemon is in live-streaming steady state. ```python -def validate_config(config, meta_store): - stored_cpi = meta_store.get("config:chunks_per_txhash_index") - if stored_cpi is not None and stored_cpi != config.chunks_per_txhash_index: - fatal(f"chunks_per_txhash_index changed: {stored_cpi} -> {config.chunks_per_txhash_index}") - if stored_cpi is None: - # First ever run writes the value. 
Backfill writes this on first run; - # if streaming runs first (no prior backfill), streaming writes it. - meta_store.put("config:chunks_per_txhash_index", config.chunks_per_txhash_index) +def run_streaming(config): + meta_store = open_meta_store(config) + validate_config(config, meta_store) # immutable key enforcement + + # ── Phase 1: catch up from last_committed_ledger (or genesis) to tip ── + source = choose_phase1_source(config) + phase1_catchup(config, meta_store, source) + + # ── Phase 2: load any .bin files left by Phase 1 into RocksDB; delete them ── + phase2_hydrate_txhash(config, meta_store) + + # ── Phase 3: reconcile orphaned transitions from prior crash ── + phase3_reconcile_orphans(config, meta_store) + + # ── Phase 4: open active stores, spawn lifecycle goroutine, start captive core, ingest ── + phase4_ingest(config, meta_store) ``` -### Step 2: Validate Chunk Coverage +Query serving is gated on Phase 4 being reached — see [Query Contract](#query-contract). + +### Phase 1 — Catchup + +Runs the backfill subroutine (`run_backfill` from `01-backfill-workflow.md`) once per source-tip sample, until the gap closes to less than one chunk. + +- Phase 1's unit of work is an entire chunk — never a partial chunk. Backfill's DAG dispatches integer chunk IDs; `process_chunk(C)` ingests ledgers `chunk_first_ledger(C)..chunk_last_ledger(C)` inclusive. Every chunk ever persisted by Phase 1 starts at `..._02` and ends at `..._01`. This is the chunk-alignment invariant the no-gaps guarantee rests on. +- Works the same whether the source is BSB (parallel) or captive core (sequential) — per-chunk work is atomic in both cases. 
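
The chunk-alignment invariant can be made concrete with a small check. This is a sketch that restates the boundary formulas as they are used in this document (the authoritative definitions live in `01-backfill-workflow.md`), assuming `GENESIS_LEDGER = 2`; `check_chunk_alignment` is a hypothetical helper, not part of the daemon.

```python
GENESIS_LEDGER = 2           # assumption: first ledger, as used by the boundary math
LEDGERS_PER_CHUNK = 10_000


def chunk_first_ledger(chunk: int) -> int:
    return chunk * LEDGERS_PER_CHUNK + GENESIS_LEDGER


def chunk_last_ledger(chunk: int) -> int:
    return (chunk + 1) * LEDGERS_PER_CHUNK + GENESIS_LEDGER - 1


def check_chunk_alignment(first_chunk: int, last_chunk: int) -> None:
    """Assert every chunk is whole, starts at ..._02, ends at ..._01, and abuts its successor."""
    for c in range(first_chunk, last_chunk + 1):
        first, last = chunk_first_ledger(c), chunk_last_ledger(c)
        assert first % LEDGERS_PER_CHUNK == 2
        assert last % LEDGERS_PER_CHUNK == 1
        assert last - first + 1 == LEDGERS_PER_CHUNK
        if c < last_chunk:
            # No-gaps: the next chunk begins exactly one ledger later.
            assert last + 1 == chunk_first_ledger(c + 1)


check_chunk_alignment(5_630, 5_640)
print(chunk_first_ledger(5_633), chunk_last_ledger(5_633))
```

Because `process_chunk(C)` only ever ingests the full range between these two bounds, any set of contiguous chunk IDs covers a contiguous ledger range with no holes.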
```python -def validate_chunk_coverage(config, meta_store): - cpi = config.chunks_per_txhash_index - - lfs_set = meta_store.scan_keys_with_suffix(":lfs") # set of chunk IDs - events_set = meta_store.scan_keys_with_suffix(":events") - txhash_set = meta_store.scan_keys_with_suffix(":txhash") # backfill-only flags - - if not lfs_set: - fatal("no chunk data found — run backfill first") - - head_chunk = min(lfs_set | events_set) # lowest chunk with any flag - head_index = head_chunk // cpi # which index that chunk belongs to - - # Head must be index-aligned — the first chunk must be the first chunk of its index. - # If not, the head index can never be completed (RecSplit needs all cpi chunks). - if head_chunk != head_index * cpi: # e.g. head_chunk=2300, expected=2000 - fatal(f"partial index at head: chunk {head_chunk} is not first chunk " - f"of index {head_index} (expected {head_index * cpi}). " - f"Run backfill to complete index {head_index}.") - - # Find contiguous tails — walk forward from head_chunk, stop at first gap - durable_tail_lfs = contiguous_tail(lfs_set, head_chunk) # last contiguous lfs chunk - durable_tail_events = contiguous_tail(events_set, head_chunk) # last contiguous events chunk - durable_tail = min(durable_tail_lfs, durable_tail_events) # both must be present - - # Verify complete indexes have index:{N}:txhash keys. - # durable_tail // cpi gives the index containing the tail chunk. 
- for index_id in range(head_index, durable_tail // cpi + 1): - idx_last = (index_id + 1) * cpi - 1 # last chunk of this index - if idx_last > durable_tail: - break # partial index at tail — streaming will complete - if not meta_store.has(f"index:{index_id:08d}:txhash"): - # All chunks present but RecSplit not built — BUILD_READY - # (handled in step 6, not an error) - pass - - # First start in streaming mode only: verify backfill chunks have all three flags - if not meta_store.has("streaming:last_committed_ledger"): - for chunk_id in range(head_chunk, durable_tail + 1): - index_id = chunk_id // cpi - if meta_store.has(f"index:{index_id:08d}:txhash"): - continue # complete index — txhash flags already cleaned up - if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - fatal(f"first start in streaming mode: chunk {chunk_id} missing txhash flag " - f"in incomplete index {index_id} — run backfill to complete") - - return head_chunk, durable_tail +def phase1_catchup(config, meta_store, source): + """ + Close the gap between what's already on disk and the current network tip. + + Control flow (outer loop): + 1. derive L from :lfs flags on disk (NOT from streaming:last_committed_ledger — + that key isn't written during Phases 1–3). + 2. sample the current tip T from the source. + 3. if T - L is less than one chunk, exit (captive core will close the residual + few-thousand-ledger gap in Phase 4 via its own archive-catchup). + 4. compute the chunk range to backfill this iteration. Leapfrog-alignment inside + compute_backfill_range guarantees range_start is the first chunk of an index + when retention is configured. + 5. invoke backfill's static-DAG subroutine. Backfill's own per-chunk idempotency + + crash recovery handle mid-iteration crashes. + 6. re-derive L from :lfs flags. Loop. 
+ + The while loop is needed because the network tip advances while we catch up — + each run_backfill call covers the range known at the start of that iteration, + and subsequent iterations close whatever new ledgers accumulated. + """ + cpi = config.backfill.chunks_per_txhash_index + R = config.streaming.retention_ledgers + L = derive_phase1_low_water(meta_store) + + while True: + T = source.tip() + if T - L < 10_000: # less than one chunk remaining + break + + range_start, range_end = compute_backfill_range(L, T, R, cpi) + if range_end < range_start: + # Leapfrog landed past the last complete chunk at tip — happens when the + # network hasn't produced a full chunk past the retention line yet. Exit. + break + + # Backfill's DAG ingests [range_start..range_end] inclusive. Per-chunk idempotent: + # chunks with :lfs already set are skipped. Crash here resumes on restart. + run_backfill(config, range_start, range_end, source=source) + + # Re-derive L — not just range_end — because a mid-iteration crash could leave + # holes in [range_start..range_end] that the contiguous-prefix scan catches. + L = derive_phase1_low_water(meta_store) + + +def compute_backfill_range(L, T, R, cpi): + """ + Returns (range_start_chunk, range_end_chunk). Leapfrog aligns DOWN to the first chunk + of the index containing (T - R). No-op when R = 0 (full history archive). + + - R is a multiple of LEDGERS_PER_INDEX (validated at startup), but T itself is arbitrary + — so T - R is NOT on an index boundary in general. Leapfrog must explicitly round + T - R down to the first ledger of its containing index. That rounded value is the + new head of coverage; every earlier ledger is past retention and skipped. + - Worst-case: up to LEDGERS_PER_INDEX - 1 ledgers past the strict retention line are + ingested and held on disk. At cpi=1000 this is ~10M ledgers; at cpi=1 it is ~10k. 
+ """ + gap_start_ledger = L + 1 + if R > 0: + target_ledger = max(T - R, GENESIS_LEDGER) + target_chunk = (target_ledger - 2) // 10_000 + target_index = target_chunk // cpi + # First ledger of target_index = target_index * LEDGERS_PER_INDEX + GENESIS_LEDGER + leapfrog_start_ledger = target_index * cpi * 10_000 + GENESIS_LEDGER + else: + leapfrog_start_ledger = GENESIS_LEDGER + + range_start_ledger = max(gap_start_ledger, leapfrog_start_ledger) + range_start_chunk = (range_start_ledger - 2) // 10_000 + range_end_chunk = ((T - 1) // 10_000) - 1 # last complete chunk at tip + return range_start_chunk, range_end_chunk + + +def derive_phase1_low_water(meta_store): + """ + Returns the last ledger of the contiguous tail of :lfs flags starting at the lowest + chunk currently on disk. + + - Finds min_chunk = lowest C with chunk:{C}:lfs set. + - Walks forward from min_chunk counting contiguous :lfs flags. Stops at the first gap. + - Returns GENESIS_LEDGER - 1 if no :lfs flags exist at all. + + Contiguous-tail semantics matter because: + - BSB workers complete chunks in parallel; a mid-Phase-1 crash can leave holes in the + middle of the ingested range. Resuming from the highest :lfs would skip those holes + and break the no-gaps invariant. + - Lifecycle pruning removes :lfs flags of past-retention indexes. The lowest remaining + :lfs after prune is naturally the head of surviving coverage — no separate tip sample + or leapfrog calculation needed here. + + Leapfrog decisions (where Phase 1 should start ingesting THIS run) are made separately + inside compute_backfill_range, which has access to the current tip sample. 
+ """ + min_chunk = None + for key in meta_store.iter_prefix("chunk:"): + if not key.endswith(":lfs"): + continue + C = parse_chunk_id(key) + if min_chunk is None or C < min_chunk: + min_chunk = C + if min_chunk is None: + return GENESIS_LEDGER - 1 + + C = min_chunk + while meta_store.has(f"chunk:{C:08d}:lfs"): + C += 1 + return chunk_last_ledger(C - 1) # last contiguous chunk ``` -**Example — first start in streaming mode after backfill:** -``` -Backfill ran: --start-ledger 20_000_002 --end-ledger 56_340_001 -Expanded to chunks 2000-5633. chunks_per_txhash_index = 1000. - -Meta store: - chunk:00002000:lfs through chunk:00005633:lfs = "1" (3634 chunks) - chunk:00002000:txhash through chunk:00005633:txhash = "1" - chunk:00002000:events through chunk:00005633:events = "1" - index:00000002:txhash = "1" (chunks 2000-2999) - index:00000003:txhash = "1" (chunks 3000-3999) - index:00000004:txhash = "1" (chunks 4000-4999) - index:00000005:txhash = absent (chunks 5000-5633 done, 5634-5999 missing) - -Validation: - head_chunk = 2000 → 2000 % 1000 == 0 → index-aligned - durable_tail = 5633 - Indexes 2,3,4 → COMPLETE - Index 5: partial (634 of 1000 chunks) - First start in streaming mode: all backfill chunks 2000-5633 have txhash flags → valid - -Result: head_chunk=2000, durable_tail=5633 -``` +**Worker concurrency**: `run_backfill` honors `source.max_parallelism()` when dispatching `process_chunk` tasks. With BSB this is GOMAXPROCS (unchanged from backfill today). With captive core it is 1 — the DAG dispatches chunks sequentially to avoid spawning multiple captive core subprocesses. -### Step 3: Load Backfill TxHash Data into RocksDB +**Retention semantics** depend on source: +- With BSB: retention determines the Phase 1 range; catchup time scales with `RETENTION_LEDGERS / (BSB throughput)`. +- With captive core: retention determines the Phase 1 range AND captive core's archive-catchup scope. 
Operators must size retention against the wall-clock cost of captive-core archive catchup. -- If backfill left a partial txhash index, `.bin` files exist for the backfill chunks. Streaming loads these into the RocksDB txhash store — needed for query serving and as the single input source for RecSplit. -- If backfill ended on an index boundary, no `.bin` files exist — this step is a no-op -- Runs on every startup for robustness. On restarts, the loop finds no `txhash` flags and skips. +### Phase 2 — Hydrate TxHash Data from `.bin` -```python -def load_backfill_bins(config, meta_store): - cpi = config.chunks_per_txhash_index +Phase 1 may leave `.bin` files for chunks in the last (incomplete) index. Phase 2 loads each into the active txhash store and deletes the `.bin` file + its `chunk:{C}:txhash` flag. After Phase 2, no `.bin` files and no `chunk:{C}:txhash` flags remain. - # 1. Clean up complete indexes with leftover .bin files - # (backfill crashed after RecSplit but before cleanup_txhash) - for index_id in all_indexes_with_txhash_key(meta_store): +```python +def phase2_hydrate_txhash(config, meta_store): + """ + Loads every remaining .bin into the active txhash store, then deletes the .bin and flag. + + - Runs on every startup for robustness. On a restart where a previous Phase 2 completed, + no :txhash flags remain and this is a no-op. + - After each chunk is loaded: delete the flag FIRST, then delete the .bin. A crash + between these two steps leaves an orphan .bin that the sweep in step 3 handles. + - The txhash store must be opened (not re-created) — prior Phase 2 runs may have loaded + earlier chunks, and their .bin files are already gone. + """ + cpi = config.backfill.chunks_per_txhash_index + + # 1. Backfill may have completed an index (index:{N}:txhash = "1") before a crash + # prevented cleanup_txhash from deleting leftover .bin. Sweep those first. 
+ for index_id in indexes_with_txhash_flag(meta_store): for chunk_id in range(index_id * cpi, (index_id + 1) * cpi): if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_if_exists(raw_txhash_path(chunk_id)) - # 2. Load .bin files for current incomplete index into RocksDB txhash store - # IMPORTANT: open existing store (WAL recovery), do NOT delete-and-recreate — - # previously loaded chunks' .bin files are already deleted - txhash_store = open_or_create_txhash_store(config, current_incomplete_index(meta_store)) - for chunk_id in chunks_for_current_incomplete_index(meta_store, cpi): + # 2. Load .bin files for the current incomplete index (if any) into RocksDB txhash store. + N = current_incomplete_index(meta_store) + if N is None: + return + + txhash_store = open_active_txhash_store(config, N) # WAL recovery; do NOT recreate + for chunk_id in range(N * cpi, (N + 1) * cpi): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - continue # already loaded (flag deleted), or streaming chunk (no .bin) + continue # already loaded (flag cleared) bin_path = raw_txhash_path(chunk_id) if os.path.exists(bin_path): - load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first - delete_if_exists(bin_path) # delete .bin second - - # 3. Sweep orphaned .bin files (flag gone, file lingering from prior crash) - for bin_file in scan_bin_files_for_index(current_incomplete_index(meta_store)): - chunk_id = parse_chunk_id(bin_file) - if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): + load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first + delete_if_exists(bin_path) # delete .bin second + + # 3. Sweep orphan .bin files (flag already gone, .bin lingering from crash between + # flag-delete and file-delete in a prior run). 
+ for bin_file in scan_bin_files_for_index(N): + if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): os.remove(bin_file) ``` -**Crash safety for .bin loading (step 3):** -- If crash during loading: `streaming:last_committed_ledger` is still absent → next startup redoes the sequence -- Already-loaded chunks: `.bin` deleted, flag deleted, data in RocksDB via WAL recovery -- Not-yet-loaded chunks: `.bin` and flag still present → loop picks up where it left off -- The txhash store MUST be opened (WAL recovery), never deleted-and-recreated — already-loaded chunks' `.bin` files are gone and cannot be re-read +**Why "load then delete" matters.** Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. At `cpi=1000` with frequent restarts over a day, that is thousands of redundant loads. Deleting the `.bin` after the first successful load makes Phase 2 a no-op on every subsequent restart until the next Phase 1 deposits new `.bin` files. -### Step 4: Reconcile Orphaned Transitions +**Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files — streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 is a trivial no-op in that case. -```python -def reconcile_orphaned_transitions(config, meta_store): - last_committed = meta_store.get("streaming:last_committed_ledger") - if last_committed is None: - return # first start in streaming mode — no prior streaming state to reconcile +### Phase 3 — Reconcile Orphaned Transitions - # Derive which chunk and index the checkpoint falls in. - # Subtract 2 because Stellar ledgers start at 2, not 0. - current_chunk = (last_committed - 2) // 10_000 # chunk containing the last committed ledger - current_index = current_chunk // config.chunks_per_txhash_index # index containing that chunk +Completes any in-flight transitions left by a prior crash. 
All decisions derive from meta store state + on-disk store directories. - # Orphaned transitioning ledger stores +```python +def phase3_reconcile_orphans(config, meta_store): + """ + Finishes any mid-flight LFS flush, events freeze, or RecSplit build from a crashed run. + + - Active store for resume_chunk: keep (Phase 4 will open it). + - Pre-created store for resume_chunk + 1: keep. + - Orphaned ledger store: + flag present → cleanup lingered; delete the store. + flag absent, chunk below resume_chunk → mid-flush crash; complete the flush. + flag absent, chunk above resume_chunk + 1 → orphan future store; delete. + - Orphaned txhash store: + flag present → cleanup lingered; delete the store. + flag absent, all chunks of index N have :lfs set → spawn RecSplit build. + + On a fresh datadir (no :lfs flags anywhere, Phase 1 had nothing to do) this is a no-op: + resume_ledger = GENESIS_LEDGER, resume_chunk = 0, no active stores on disk yet. + """ + # Derive resume_ledger the same way Phase 4 will. resume_chunk is the chunk Phase 4 + # ingests into first; its active store is preserved through Phase 3. 
+ cpi = config.backfill.chunks_per_txhash_index + resume_ledger = derive_phase1_low_water(meta_store) + 1 + if resume_ledger < GENESIS_LEDGER: + resume_ledger = GENESIS_LEDGER + resume_chunk = (resume_ledger - 2) // 10_000 + + # Ledger stores for store_dir in scan_ledger_store_dirs(config): - chunk_id = parse_chunk_id_from_dir(store_dir) - if chunk_id == current_chunk: - continue # active store — keep (WAL recovery) - if chunk_id == current_chunk + 1: - continue # pre-created store — keep - if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): - delete_dir(store_dir) # flag set, cleanup didn't finish → delete - elif chunk_id < current_chunk: - # transitioning store, flush didn't complete → complete it - complete_lfs_flush(store_dir, chunk_id, meta_store) + C = parse_chunk_id_from_dir(store_dir) + if C == resume_chunk or C == resume_chunk + 1: + continue # active or pre-created + if meta_store.has(f"chunk:{C:08d}:lfs"): + delete_dir(store_dir) # orphaned post-flush cleanup + elif C < resume_chunk: + complete_lfs_flush(store_dir, C, meta_store) # mid-flush crash; finish else: - delete_dir(store_dir) # orphaned future store → delete + delete_dir(store_dir) # orphan future store - # Orphaned transitioning txhash stores + # Txhash stores + resume_index = resume_chunk // cpi for store_dir in scan_txhash_store_dirs(config): - index_id = parse_index_id_from_dir(store_dir) - if index_id == current_index: - continue # active store — keep - if meta_store.has(f"index:{index_id:08d}:txhash"): - delete_dir(store_dir) # complete, cleanup didn't finish → delete - # BUILD_READY handled in step 6 + N = parse_index_id_from_dir(store_dir) + if N == resume_index or N == resume_index + 1: + continue # active or pre-created + if meta_store.has(f"index:{N:08d}:txhash"): + delete_dir(store_dir) # RecSplit done, cleanup lingered + elif all_chunks_frozen(meta_store, N, cpi): + run_in_background(recsplit_transition, N, store_dir, meta_store) + + # Events hot segment: truncate any persisted deltas 
beyond resume_ledger - 1. + # Prevents duplicate event IDs when Phase 4 replays the first live ledger. + truncate_events_hot_segment(config, resume_ledger - 1) ``` -### Step 5: Replay Missed Boundary Handling +### Phase 4 — Live Ingestion -- Handles the case where `streaming:last_committed_ledger` is exactly at a boundary but transitions never fired (crash between checkpoint and transitions) +Opens active stores for the resume position, spawns the lifecycle goroutine, starts captive core, and enters the ingestion loop. Query serving starts here (see [Query Contract](#query-contract)). ```python -def replay_missed_boundaries(config, meta_store): +def phase4_ingest(config, meta_store): last_committed = meta_store.get("streaming:last_committed_ledger") if last_committed is None: - return + # First start after Phase 1: set checkpoint to end of Phase 1's coverage. + last_committed = derive_phase1_low_water(meta_store) + meta_store.put("streaming:last_committed_ledger", last_committed) + resume_ledger = last_committed + 1 - current_chunk = (last_committed - 2) // 10_000 # subtract 2: ledgers start at 2 - cpi = config.chunks_per_txhash_index - - # Check if checkpoint is exactly at a chunk boundary. - # chunk_last_ledger(C) = ((C + 1) * 10_000) + 1, the last ledger that belongs to chunk C. - # If checkpoint == this value, the chunk is fully ingested but transitions may not have fired. - if last_committed == chunk_last_ledger(current_chunk): - if not meta_store.has(f"chunk:{current_chunk:08d}:lfs"): - trigger_lfs_flush(current_chunk, config, meta_store) - if not meta_store.has(f"chunk:{current_chunk:08d}:events"): - trigger_events_freeze(current_chunk, config, meta_store) - - # Check if checkpoint is also at an index boundary. - # The last chunk of index N is ((N + 1) * cpi) - 1. - # If current_chunk equals this, the index is fully ingested. 
- current_index = current_chunk // cpi - if current_chunk == (current_index + 1) * cpi - 1: - # Index boundary — handled by step 6 (BUILD_READY detection) - pass -``` + active_stores = open_active_stores(config, meta_store, resume_ledger) -**Example — crash between checkpoint and chunk boundary transitions:** -``` -streaming:last_committed_ledger = 56_370_001 (= chunk_last_ledger(5636)) -chunk:00005636:lfs = absent (swap never happened, flush never spawned) -chunk:00005636:events = absent (freeze never started) -Active ledger store: ledger-store-chunk-005636/ (has all 10_000 ledgers via WAL) + run_in_background(lifecycle_loop, config, meta_store) + + # Prime captive core for unbounded stream from resume_ledger. + ledger_backend = make_ledger_backend(config.streaming.captive_core_config) + ledger_backend.PrepareRange(UnboundedRange(resume_ledger)) + + set_daemon_ready() # in-memory flag; unblocks queries -Detection: 56_370_001 == chunk_last_ledger(5636) AND flags absent -Action: trigger LFS flush for chunk 5636, trigger events freeze for chunk 5636 + run_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) ``` +Captive core takes 4–5 minutes to spin up and start emitting at `resume_ledger`. During that window `getHealth` remains in `catching_up` state (see [Query Contract](#query-contract)). + --- ## Ingestion Loop -### Per-Ledger Processing +Single goroutine. Pull-based: the daemon drives sequential `GetLedger(seq)` calls. The same code path drains captive core's internal buffer during catchup and switches cadence to live closes (~6 s per ledger) once caught up. ```python -def run_ingestion_loop(core, active_stores, meta_store): - for lcm in core.stream_ledgers(): - ledger_seq = lcm.ledger_sequence - process_ledger(ledger_seq, lcm, active_stores, meta_store) - -def process_ledger(ledger_seq, lcm, active_stores, meta_store): - # 1. Write to all three stores in parallel goroutines.
- # Each store's write is atomic (WriteBatch + WAL for RocksDB, - # atomic commit for events hot segment). - run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm) - run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm) - run_in_background(write_events_hot_segment, active_stores.events, ledger_seq, lcm) - wait_for_all() # all three must succeed - - # 2. Set per-ledger checkpoint AFTER all writes succeed. - # INVARIANT: checkpoint is written ONLY after all three stores - # have durably committed the ledger data (WAL flush). - meta_store.put("streaming:last_committed_ledger", ledger_seq) - - # 3. If chunk boundary: trigger sub-flow transitions. - # Transitions happen AFTER checkpoint — a crash between checkpoint - # and transitions is detected on startup (see replay_missed_boundaries). - # Subtract 2 because Stellar ledgers start at 2. - current_chunk = (ledger_seq - 2) // 10_000 - # chunk_last_ledger(C) = ((C+1) * 10_000) + 1 — the last ledger in chunk C. - # Equality means this ledger completes the chunk. - if ledger_seq == chunk_last_ledger(current_chunk): - on_chunk_boundary(current_chunk, active_stores, meta_store) - - # 4. If index boundary: trigger index-level transitions. - # The index boundary ledger is always also a chunk boundary ledger - # (chunk boundaries align exactly with index boundaries). - current_index = current_chunk // chunks_per_txhash_index - # index_last_ledger(N) = ((N+1) * cpi * 10_000) + 1 — the last ledger in index N. - if ledger_seq == index_last_ledger(current_index): - on_index_boundary(current_index, active_stores, meta_store) +def run_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): + """ + Sequential pull-based live ingestion. The daemon stays here until process exit. + + Per-ledger steps: + 1. Block on GetLedger(seq) until the ledger is available. + 2. Fan out writes to all three active stores in parallel. 
Each write is atomic + + WAL-backed, so each store alone is crash-safe. + 3. wait_all — all three must succeed before the per-ledger checkpoint advances. + 4. Commit streaming:last_committed_ledger = seq. This is the atomic 'the daemon + owns everything up to and including seq' signal. + 5. If seq completes a chunk, fire on_chunk_boundary (non-blocking — freeze + transitions run in background). + 6. If seq completes an index, fire on_index_boundary — RecSplit build kicks off. + 7. seq += 1. Loop. + + Immutable config values (cpi) are read once outside the loop — never per ledger. + """ + cpi = config.backfill.chunks_per_txhash_index # immutable; read once at loop entry + seq = resume_ledger + while True: + lcm = ledger_backend.GetLedger(seq) # blocks until ledger seq available + + # Write to all three active stores in parallel. Order: fan out, wait for all. + # Each store is idempotent on re-write of the same ledger (crash-safe). + wait_all( + run_in_background(write_ledger_store, active_stores.ledger, seq, lcm), + run_in_background(write_txhash_store, active_stores.txhash, seq, lcm), + run_in_background(write_events_hot_segment, active_stores.events, seq, lcm), + ) + + # Commit the per-ledger checkpoint (streaming:last_committed_ledger) only AFTER + # all three active stores have durably committed the ledger. This is the key + # atomic boundary for Phase 4 crash recovery — the checkpoint is the sole + # 'the daemon owns everything up to and including this ledger' signal. It's NOT + # the same as Phase 1's low-water (which derives from :lfs flags). + meta_store.put("streaming:last_committed_ledger", seq) + + # Chunk rollover: hand off to background LFS + events freeze transitions. + C = (seq - 2) // 10_000 + if seq == chunk_last_ledger(C): + on_chunk_boundary(C, active_stores, meta_store) + + # Index rollover — every index boundary is also a chunk boundary, so this runs + # AFTER on_chunk_boundary has already dispatched the last chunk's freeze transitions. 
+ if seq == index_last_ledger(C // cpi): + on_index_boundary(C // cpi, active_stores, meta_store) + + seq += 1 ``` -### Per-Store Write Details - -**Ledger store write:** -```python -def write_ledger_store(store, ledger_seq, lcm): - key = uint32_big_endian(ledger_seq) - value = zstd_compress(lcm.to_bytes()) - store.put(key, value) # WriteBatch + WAL -``` +Each per-store write is atomic: RocksDB WriteBatch + WAL for ledger and txhash stores; atomic commit of events hot-segment + persisted deltas. Key/value schemas are in [Active Store Architecture](#active-store-architecture). -**TxHash store write:** -```python -def write_txhash_store(store, ledger_seq, lcm): - batch = WriteBatch() - for tx in lcm.transactions: - cf_name = f"cf-{tx.hash[0] >> 4:x}" # route by first nibble - batch.put_cf(cf_name, tx.hash, uint32_big_endian(ledger_seq)) - store.write(batch) # single WriteBatch across all CFs + WAL -``` +--- -**Events hot segment write:** -```python -def write_events_hot_segment(hot_segment, ledger_seq, lcm): - events = extract_contract_and_system_events(lcm) # excludes diagnostic events - for event in events: - event_id = hot_segment.next_event_id() - hot_segment.store_event(event_id, event) # persist event data - for term_key in index_terms(event): # contractId + topic0-3 - hot_segment.add_to_bitmap(term_key, event_id) # in-memory bitmap update - hot_segment.persist_deltas(ledger_seq) # (term_key, event_id) pairs → DB for crash recovery - hot_segment.update_offset_array(ledger_seq) # cumulative event count - hot_segment.commit(ledger_seq) # atomic commit -``` +## Freeze Transitions ---- +Three independent background transitions per chunk/index boundary. Each has its own goroutine, flag, and cleanup. Live ingestion never waits on them synchronously — they must not stall the ingestion loop. -## Sub-flow Transitions +- **LFS transition** — per chunk. Converts the retired ledger RocksDB to a `.pack` file. +- **Events transition** — per chunk. 
Converts the retired events hot segment to a cold segment (three files). +- **RecSplit transition** — per index. Builds 16 `.idx` files from the retired txhash RocksDB. -## Sub-flow Transitions +Streaming's freeze transitions never produce `.bin` files. `.bin` files exist only as transient output of the backfill subroutine (inside Phase 1). ### Chunk Boundary (every 10_000 ledgers) -Triggered when `ledger_seq == chunk_last_ledger(current_chunk)`: +Triggered when the ingestion loop commits `chunk_last_ledger(C)`. Hands off to two freeze transitions (LFS + events) that run in the background. ```python -def on_chunk_boundary(chunk_id, active_stores, meta_store): - # ── LFS sub-flow ── - # Wait for OWN predecessor only (max 1 transitioning ledger store) +def on_chunk_boundary(C, active_stores, meta_store): + """ + Swap active stores and kick off LFS + events freeze transitions for chunk C. + + Ingestion for chunk C+1 continues unimpeded — the ingestion loop's active_stores + reference now points at pre-created stores for C+1, while the transitions below + read from the stores just retired. + """ + + # LFS transition — drain the last in-flight LFS freeze (max-1-transitioning invariant), + # then swap pointers so the next chunk writes to pre-created stores. wait_for_lfs_complete() transitioning_ledger = active_stores.ledger - active_stores.ledger = open_precreated_ledger_store(chunk_id + 1) - run_in_background(lfs_transition, chunk_id, transitioning_ledger, meta_store) + active_stores.ledger = open_precreated_ledger_store(C + 1) + run_in_background(lfs_transition, C, transitioning_ledger, meta_store) - # ── Events sub-flow ── - # Wait for OWN predecessor only (max 1 freezing events segment) + # Events transition — same shape. Independent goroutine; does NOT wait for LFS.
wait_for_events_complete() freezing_segment = active_stores.events - active_stores.events = create_events_hot_segment(chunk_id + 1) - run_in_background(events_transition, chunk_id, freezing_segment, meta_store) + active_stores.events = create_events_hot_segment(C + 1) + run_in_background(events_transition, C, freezing_segment, meta_store) ``` -### LFS Transition (background goroutine) +### LFS Transition + +Converts the retired ledger RocksDB store to an immutable `.pack` file, then discards the store. ```python -def lfs_transition(chunk_id, transitioning_ledger_store, meta_store): - # ── Transition: read from RocksDB, write pack file ── - pack_path = ledger_pack_path(chunk_id) - first_ledger = chunk_first_ledger(chunk_id) - last_ledger = chunk_last_ledger(chunk_id) - - writer = packfile.create(pack_path, overwrite=True) # handles partial files - for seq in range(first_ledger, last_ledger + 1): - lcm_bytes = transitioning_ledger_store.get(uint32_big_endian(seq)) - writer.append(lcm_bytes) - writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") # flag after fsync - - # ── Cleanup (separate step — if crash here, flag is set, retry cleanup) ── - transitioning_ledger_store.close() - delete_dir(ledger_store_path(chunk_id)) +def lfs_transition(C, transitioning_ledger_store, meta_store): + """ + Read all 10_000 ledgers for chunk C from its active store, write the pack file, + fsync, flag, then delete the store. + + Order matters: + 1. Open pack file with overwrite=True so a prior crashed attempt's bytes are discarded. + 2. Write all ledgers in order. + 3. fsync_and_close — the pack file is durable on disk after this. + 4. Set :lfs flag — the 'flag-after-fsync' invariant. Queries can now route here. + 5. Close and delete the active store. Crash between (4) and (5) leaves an orphan + directory; Phase 3's scan_ledger_store_dirs + :lfs-present check deletes it. 
+ """ + pack_path = ledger_pack_path(C) + writer = packfile.create(pack_path, overwrite=True) # 1 + for seq in range(chunk_first_ledger(C), chunk_last_ledger(C) + 1): + writer.append(transitioning_ledger_store.get(uint32_big_endian(seq))) # 2 + writer.fsync_and_close() # 3 + meta_store.put(f"chunk:{C:08d}:lfs", "1") # 4 + + transitioning_ledger_store.close() # 5 + delete_dir(ledger_store_path(C)) signal_lfs_complete() ``` -### Events Transition (background goroutine) +### Events Transition + +Converts the retired events hot segment to three immutable files (events cold segment). ```python -def events_transition(chunk_id, freezing_segment, meta_store): - # ── Transition: freeze hot segment to cold segment ── - events_path = events_segment_path(chunk_id) - - # Write three cold segment files: - # {chunkID:08d}-events.pack — zstd-compressed event blocks - # {chunkID:08d}-index.pack — serialized roaring bitmaps - # {chunkID:08d}-index.hash — MPHF for term lookup - write_cold_segment(freezing_segment, events_path) +def events_transition(C, freezing_segment, meta_store): + """ + Freeze the events hot segment for chunk C. Same flag-after-fsync + cleanup order + as lfs_transition. 
+ """ + events_path = events_segment_path(C) + write_cold_segment(freezing_segment, events_path) # 3 files: events.pack, index.pack, index.hash fsync_all(events_path) - meta_store.put(f"chunk:{chunk_id:08d}:events", "1") # flag after fsync + meta_store.put(f"chunk:{C:08d}:events", "1") # flag-after-fsync - # ── Cleanup (separate step) ── - freezing_segment.discard() # delete persisted deltas + in-memory bitmaps + freezing_segment.discard() # drops in-memory bitmaps + persisted deltas signal_events_complete() ``` -### Index Boundary (every 10_000_000 ledgers) +### Index Boundary (every `LEDGERS_PER_INDEX` ledgers) -- Triggered when `ledger_seq == index_last_ledger(current_index)` -- The index boundary ledger is always also a chunk boundary ledger — `on_chunk_boundary` fires first, then `on_index_boundary` +The last chunk of an index has just rolled over. Before RecSplit can start, every chunk in the index must have its `:lfs` and `:events` flags set. ```python -def on_index_boundary(index_id, active_stores, meta_store): - # Wait for ALL chunk-level sub-flows to complete - # (the last chunk's LFS flush and events freeze must finish - # before the txhash store can be promoted) +def on_index_boundary(N, active_stores, meta_store): + """ + Dispatch RecSplit build for index N. Prerequisites: + - Every chunk in N has finished its LFS + events freeze transitions. + - No LFS or events transition is in flight for any chunk of N (would race the RecSplit input). + """ + + # Drain ALL in-flight LFS + events transitions. on_chunk_boundary dispatches them; + # here we wait for them to finish — the final chunk of N may still be in-flight. wait_for_lfs_complete() wait_for_events_complete() + verify_all_chunk_flags(N, meta_store) # defense-in-depth - # Defense-in-depth: verify all chunk flags for the index - verify_all_chunk_flags(index_id, meta_store) - - # ── TxHash sub-flow ── + # Swap the txhash active store. RecSplit reads from the retired store.
transitioning_txhash = active_stores.txhash - active_stores.txhash = open_precreated_txhash_store(index_id + 1) - - run_in_background(recsplit_transition, index_id, transitioning_txhash, meta_store) + active_stores.txhash = open_precreated_txhash_store(N + 1) + run_in_background(recsplit_transition, N, transitioning_txhash, meta_store) ``` -### RecSplit Transition (background goroutine) - -```python -def recsplit_transition(index_id, transitioning_txhash_store, meta_store): - # ── Transition: build RecSplit from RocksDB ── - # RecSplit builder reads from the transitioning txhash store (RocksDB, 16 CFs). - # This is the ONLY input source — both backfill .bin data (loaded at startup) - # and streaming txhash data are in the same RocksDB store. - # - # Contrast with backfill: backfill builds RecSplit from .bin flat files. - # Streaming builds RecSplit from RocksDB. - - idx_path = recsplit_index_path(index_id) - delete_partial_idx_files(idx_path) # clean up any partial files from prior crash - - # Build all 16 CF index files — what happens inside (goroutines, - # parallelism, memory) is up to the task. All-or-nothing. - build_recsplit(transitioning_txhash_store, idx_path) - fsync_all_idx_files(idx_path) - - # Verify: spot-check random ledgers and txhashes against immutable stores - verify_spot_check(index_id, idx_path, meta_store) - - meta_store.put(f"index:{index_id:08d}:txhash", "1") # flag after fsync + verify - - # ── Cleanup (separate step) ── - transitioning_txhash_store.close() - delete_dir(txhash_store_path(index_id)) -``` +### RecSplit Transition -### Transition Dependencies +Builds the 16 RecSplit `.idx` files for index N from the retired txhash active store. 
```python -# Per chunk boundary: -# lfs_transition(C) → cleanup: delete ledger store → unblocks lfs_transition(C+1) -# events_transition(C) → cleanup: delete persisted deltas → unblocks events_transition(C+1) -# -# Per index boundary (= last chunk boundary of that index): -# ALL lfs_transition + events_transition for the index must complete -# → recsplit_transition(N) → cleanup: delete txhash store +def recsplit_transition(N, transitioning_txhash_store, meta_store): + """ + Same flag-after-fsync pattern as lfs/events: + 1. Delete any partial .idx files from a prior crashed attempt. + 2. Build the 16 RecSplit indexes (one per CF). + 3. fsync all .idx files. + 4. Verify spot-check against the txhash store. + 5. Flag. + 6. Close + delete the txhash active store. + """ + idx_path = recsplit_index_path(N) + delete_partial_idx_files(idx_path) # 1 + build_recsplit(transitioning_txhash_store, idx_path) # 2 (16 .idx files) + fsync_all_idx_files(idx_path) # 3 + verify_spot_check(N, idx_path, meta_store) # 4 + meta_store.put(f"index:{N:08d}:txhash", "1") # 5 + + transitioning_txhash_store.close() # 6 + delete_dir(txhash_store_path(N)) ``` -- Each sub-flow waits for its own predecessor at chunk boundaries -- At index boundary: ALL sub-flows must complete before RecSplit starts -- Cleanup is a separate step after the flag — crash after flag = retry just cleanup on restart - --- -## Crash Recovery +## Pruning -### Invariants +Retention is enforced by a single background goroutine, woken at chunk boundaries. Prune granularity is the whole txhash index — never per chunk. -Six invariants handle all crash recovery. No special-case logic outside these. +```python +def lifecycle_loop(config, meta_store): + """ + Runs as a single background goroutine. Chunk-boundary-notified. Prune gate is + uniform across all artifact kinds — LFS, events, RecSplit — for a given index. 
+ """ + while True: + wait_for_chunk_boundary_notification() + + cpi = config.backfill.chunks_per_txhash_index + R = config.streaming.retention_ledgers + for N in eligible_prune_indexes(meta_store, R, cpi): + prune_index(N, meta_store, config) + + +def eligible_prune_indexes(meta_store, R, cpi): + """ + Returns indexes whose entire footprint is past the retention window and are still + prune-eligible (either :txhash == "1" meaning prune hasn't started, or "deleting" + meaning a prior run crashed mid-prune). + + - R = 0 → no pruning; archive profile retains everything. + - R > 0 → index N is eligible when tip > index_last_ledger(N) + R. + - tip ledger is streaming:last_committed_ledger (the daemon's own progress). + """ + if R == 0: + return [] + L = meta_store.get("streaming:last_committed_ledger") + ledgers_per_index = cpi * 10_000 + # index_last_ledger(N) = (N + 1) * ledgers_per_index + 1, so tip > index_last_ledger(N) + R + # is equivalent to N <= (L - R - 2) // ledgers_per_index - 1. + max_eligible_N = (L - R - 2) // ledgers_per_index - 1 + if max_eligible_N < 0: + return [] + result = [] + for N in range(0, max_eligible_N + 1): + val = meta_store.get(f"index:{N:08d}:txhash") + if val in ("1", "deleting"): + result.append(N) + return result + + +def prune_index(N, meta_store, config): + """ + Deletes every artifact for index N and clears its meta store keys. Two-phase marker + for query-routing safety: + + - Set :txhash = "deleting" FIRST. Queries short-circuit (treat as absent). + - Delete files + chunk keys. + - Delete :txhash key LAST. + + Crash between set-deleting and delete-key leaves :txhash == "deleting"; next startup + re-runs prune_index, which is idempotent (rm -f + delete_if_exists semantics). + """ + cpi = config.backfill.chunks_per_txhash_index + + # Stage 1: commit to pruning. Once this lands, queries for any ledger in index N + # return HTTP 4xx (past retention). + meta_store.put(f"index:{N:08d}:txhash", "deleting") + + # Stage 2: delete files and per-chunk keys. Idempotent on re-run.
+ for C in range(N * cpi, (N + 1) * cpi): + delete_if_exists(ledger_pack_path(C)) + delete_events_segment(C) + meta_store.delete(f"chunk:{C:08d}:lfs") + meta_store.delete(f"chunk:{C:08d}:events") + delete_recsplit_idx_files(N) + + # Stage 3: clear the index key. Index is now fully gone. + meta_store.delete(f"index:{N:08d}:txhash") +``` -1. **Flag-after-fsync** — meta store flags set only after corresponding files are fsynced - - Flag absent = output treated as missing → transition retried from scratch -2. **Idempotent writes** — same input ledger always produces same key-value pairs in all stores - - Re-processing after crash is always safe -3. **Per-ledger checkpoint** — `streaming:last_committed_ledger` written only after all three stores durably commit - - On crash: resume from `last_committed_ledger + 1` - - Events system truncates hot segment data beyond checkpoint on startup (prevents duplicate event IDs) -4. **No separate recovery phase** — startup derives state from meta store keys + on-disk artifacts, completes or discards incomplete work - - Same code path for first start, restarts, and crash recovery -5. **Max-1-transitioning per sub-flow** — previous transition must complete before next starts - - Applies to both steady-state and crash recovery -6. **DAG-structured cleanup** — cleanup runs as a separate step after flag is set - - Crash between flag and cleanup: flag is durable, startup retries just the cleanup - - The transition itself is never re-run +**Why index-atomic.** Per-chunk pruning would create a window where `getTransaction` resolves to a ledger sequence whose pack file has already been deleted. Gating every artifact kind on whole-index past-retention closes that window completely. -### Startup Validation Guards +**How much extra data sits on disk.** At most `LEDGERS_PER_INDEX - 1` ledgers past the strict retention line. 
Because `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_INDEX`, the strict retention line itself does not bisect an index — the next-eligible index is exactly `LEDGERS_PER_INDEX` further. -Four validation rules prevent starting in an invalid state: +--- -- **`chunks_per_txhash_index` immutable** — fatal if changed after first run -- **Head index-aligned** — fatal if the lowest chunk is not the first chunk of its index -- **Contiguous flags** — fatal if any gap exists in `lfs` or `events` flags within backfill's range -- **Backfill completeness (first start in streaming mode)** — fatal if any backfill chunk is missing a `txhash` flag in an incomplete index +## Query Contract -### How Invariants Resolve the Hardest Scenarios +Query serving is gated on Phase 4 being reached. `getLedger`, `getTransaction`, `getEvents` all return **HTTP 4xx** during Phases 1–3. -**Compound recovery — orphaned transition from chunk C-1 + missed boundary for chunk C:** -``` -State: streaming:last_committed_ledger = 56_380_001 (= chunk_last_ledger(5637)) - chunk:00005636:events = absent (freeze crashed) - chunk:00005637:lfs = absent (transitions never started) - chunk:00005637:events = absent - Events DB has persisted deltas for chunks 5636 + 5637 - -Invariants applied: - - Flag-after-fsync (#1): chunk 5636 events flag absent → freeze is retried - - No separate recovery (#4): startup detects absent flags, triggers transitions - - Max-1-transitioning (#5): 5636 freeze completes BEFORE 5637 freeze starts - - Per-ledger checkpoint (#3): resume from 56_380_002 after transitions complete -``` +### Readiness Signal -**Checkpoint-boundary gap — crash between checkpoint and boundary transitions:** -``` -State: streaming:last_committed_ledger = 56_370_001 (= chunk_last_ledger(5636)) - chunk:00005636:lfs = absent (swap never happened) - chunk:00005636:events = absent (freeze never started) - -Invariants applied: - - Per-ledger checkpoint (#3): checkpoint is at the boundary, data is in WAL - - No 
separate recovery (#4): startup detects checkpoint at boundary + absent flags → - triggers LFS flush and events freeze before resuming - - Idempotent writes (#2): if any partial data exists from a half-started transition, - re-writing with overwrite=True produces identical results -``` +- An in-memory boolean `daemon_ready` is set by `set_daemon_ready()` at the top of Phase 4, after Phases 1–3 complete and active stores are opened. +- Not persisted. On every startup the flag starts `false`; on every Phase 4 entry it flips to `true`. Clean shutdown discards it implicitly (process exits). +- This means: clients see `HTTP 4xx` from `getLedger`/`getTransaction`/`getEvents` on every startup until Phase 4 is reached, regardless of whether prior runs have served queries. Intentional: catchup and recovery phases must complete before the daemon serves, every time. +- Query handlers check the flag on each request. `false` → HTTP 4xx. `true` → route normally. -**Crash during .bin loading (step 3) — partial load, some .bin files already deleted:** -``` -State: chunks 5000-5399: .bin deleted, txhash flag deleted, data in RocksDB (WAL) - chunks 5400-5633: .bin present, txhash flag present, not yet loaded - streaming:last_committed_ledger = absent - -Invariants applied: - - Per-ledger checkpoint (#3): absence of checkpoint signals "first start in streaming mode" → redo sequence - - Idempotent writes (#2): WAL-recovered data for 5000-5399 survives; loop loads 5400-5633 - - Flag-after-fsync (#1): txhash flags track which .bin files are pending (flag deleted = loaded) - - CRITICAL: must open existing txhash store (WAL recovery), not delete-and-recreate — - chunks 5000-5399 data would be lost since .bin files are gone -``` +### Behavior During Phases 1–3 -### Recovery Decision Tree +- `/getLedger`, `/getTransaction`, `/getEvents` → `HTTP 4xx` with no payload detail. 
+- `/getHealth` → always served; returns `catching_up` + drift when daemon is pre-Phase-4, otherwise `streaming` + drift. +- No partial / incremental serving. The daemon does not serve "whatever is ingested so far" while Phases 1–3 are running. -```python -def recover_on_startup(config, meta_store): - last_committed = meta_store.get("streaming:last_committed_ledger") +### Behavior When an Index Is Being Pruned - if last_committed is None: - # First start in streaming mode — no prior streaming checkpoint exists. - # Validate that backfill left complete data, then load .bin files into RocksDB. - validate_backfill_flags(config, meta_store) - load_backfill_bins(config, meta_store) - last_committed = chunk_last_ledger(durable_tail) - meta_store.put("streaming:last_committed_ledger", last_committed) +- `prune_index` sets `index:{N:08d}:txhash = "deleting"` before touching any files, and deletes the key after all files are gone. Query routing treats `"deleting"` identically to `"absent"` (key-not-present). +- Queries for a ledger in a pruning index return HTTP 4xx (past retention) starting the instant the `"deleting"` marker is set, not when the files actually disappear. No window where queries route into a half-deleted index. - # Reconcile: complete orphaned transitions, delete orphaned artifacts. - # Handles crashes during LFS flush, events freeze, or RecSplit build. - reconcile_orphaned_transitions(config, meta_store) - - # Replay missed boundary transitions. - # If the checkpoint landed exactly on a chunk boundary but the process crashed - # before the boundary transitions (swap store, spawn flush/freeze) fired, - # the flags for that chunk are absent. Detect and trigger them now. 
- current_chunk = (last_committed - 2) // 10_000 # subtract 2: Stellar ledgers start at 2 - if last_committed == chunk_last_ledger(current_chunk): - if not meta_store.has(f"chunk:{current_chunk:08d}:lfs"): - trigger_lfs_flush(current_chunk) - if not meta_store.has(f"chunk:{current_chunk:08d}:events"): - trigger_events_freeze(current_chunk) - - # Spawn background RecSplit for any BUILD_READY indexes. - # An index is BUILD_READY when all chunk lfs+events flags are set but - # index:{N:08d}:txhash is absent (RecSplit not yet built or crashed mid-build). - for index_id in indexes_with_all_chunk_flags_but_no_index_flag(meta_store): - run_in_background(recsplit_transition, index_id) +### Rationale - resume_ledger = last_committed + 1 - return meta_store, resume_ledger -``` +Without an explicit gate, implementations drift toward "best-effort serve whatever is ingested." That produces inconsistent results across operators and breaks client assumptions. An explicit `daemon_ready` flag + HTTP 4xx error gives clients an unambiguous signal, and the `catching_up` health status gives operators visibility into progress. --- -## Meta Store Keys - -### Keys Introduced by Streaming - -| Key | Value | Written When | -|---|---|---| -| `streaming:last_committed_ledger` | `uint32` | After every successfully committed ledger | -| `config:chunks_per_txhash_index` | `uint32` | On first run (backfill or streaming) — immutable | +## Crash Recovery -### Keys Shared with Backfill +No separate recovery phase. Every startup runs Phases 1–4 regardless — already-complete work is detected and skipped via meta store flags. -| Key | Written By | Notes | -|---|---|---| -| `chunk:{C:08d}:lfs` | Both | Backfill: after pack file fsync in `process_chunk`. Streaming: after LFS flush goroutine fsync. | -| `chunk:{C:08d}:events` | Both | Backfill: after cold segment fsync in `process_chunk`. Streaming: after events freeze goroutine fsync. | -| `chunk:{C:08d}:txhash` | Backfill only | After `.bin` file fsync. 
Streaming does NOT write `.bin` files — txhash data goes to RocksDB. Flags deleted during startup step 3 (.bin loading). | -| `index:{N:08d}:txhash` | Both | After all 16 RecSplit CF `.idx` files built + fsynced. Backfill: from `.bin` files. Streaming: from RocksDB txhash store. | +### Invariants -### Key Lifecycle in Streaming +1. **Flag-after-fsync.** A meta store flag is set only after the corresponding file(s) are fsynced. Flag absent = output treated as missing → transition retried from scratch. +2. **Idempotent writes.** The same input ledger always produces the same key-value pairs in all stores. Re-processing after crash is safe. +3. **Per-ledger checkpoint.** `streaming:last_committed_ledger` is written only after all three active stores durably commit. Resume is `last_committed_ledger + 1`. +4. **No separate recovery phase.** Startup is Phases 1–4. Nothing else. +5. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. +6. **DAG-structured cleanup.** Cleanup runs as a separate step after the flag is set. Crash between flag and cleanup = retry just the cleanup on restart. +7. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 itself avoids producing them. +8. **Two-phase prune marker.** `prune_index` writes `index:{N}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `eligible_prune_indexes`. 
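
Invariant 8 can be exercised in isolation. The sketch below is a toy simulation, not the daemon's code: it assumes a plain `dict` as the meta store and a `set` of file names as the disk, and the `crash_after` parameter and the `pack-{C}` naming are hypothetical test hooks introduced only for illustration. It shows that a crash mid-prune leaves the `"deleting"` marker in place, that queries never route into the half-deleted index, and that a second run finishes the job.

```python
def index_is_queryable(meta, n):
    """Queries route into index n only when its flag is exactly "1".

    "deleting" and a missing key are both treated as absent.
    """
    return meta.get(f"index:{n:08d}:txhash") == "1"


def prune_index(meta, files, n, cpi, crash_after=None):
    """Two-phase prune: mark "deleting", delete artifacts, then clear the key.

    crash_after is a hypothetical hook: stop after that many chunks have been
    deleted, simulating a process kill in the middle of stage 2.
    """
    meta[f"index:{n:08d}:txhash"] = "deleting"        # stage 1: queries stop routing here
    for i, c in enumerate(range(n * cpi, (n + 1) * cpi)):
        if crash_after is not None and i == crash_after:
            return False                              # simulated crash mid-prune
        files.discard(f"pack-{c:08d}")                # rm -f semantics: no error if absent
        meta.pop(f"chunk:{c:08d}:lfs", None)          # delete_if_exists semantics
    meta.pop(f"index:{n:08d}:txhash", None)           # stage 3: index is fully gone
    return True


# Build a tiny index 0 with 4 chunks (cpi=4 is an arbitrary toy value).
cpi = 4
meta = {"index:00000000:txhash": "1"}
files = set()
for c in range(cpi):
    meta[f"chunk:{c:08d}:lfs"] = "1"
    files.add(f"pack-{c:08d}")

completed_first = prune_index(meta, files, 0, cpi, crash_after=2)
queryable_after_crash = index_is_queryable(meta, 0)
marker_after_crash = meta.get("index:00000000:txhash")
completed_second = prune_index(meta, files, 0, cpi)
```

The second call re-deletes chunks that are already gone without raising, which is the property the restart path relies on.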
-``` -Per ledger: - streaming:last_committed_ledger = ledger_seq (after all 3 stores commit) +### Compound Recovery Scenarios -Per chunk (background, after chunk boundary): - chunk:{C:08d}:lfs = "1" (after pack file fsync) - chunk:{C:08d}:events = "1" (after cold segment fsync) +The backfill doc's crash recovery model (Section: Crash Recovery in `01-backfill-workflow.md`) handles every Phase 1 crash. Streaming extends it with per-ledger and per-transition recovery: -Per index (background, after index boundary): - index:{N:08d}:txhash = "1" (after all 16 .idx files fsync + verify) - -Startup step 3 (first start in streaming mode only): - chunk:{C:08d}:txhash → DELETED (after .bin loaded into RocksDB) - .bin files → DELETED (after flag deleted) -``` +- **Crash during Phase 2 `.bin` hydration.** On restart, Phase 2 re-runs. Chunks whose `.bin` was loaded and deleted on the first pass have no `:txhash` flag and no `.bin` file — the loop skips them via the flag check. Chunks not yet loaded still have their `:txhash` flag and `.bin` file — picked up by the same loop. +- **Crash between live per-ledger checkpoint and LFS freeze completion.** `streaming:last_committed_ledger = chunk_last_ledger(C)` but `chunk:{C}:lfs` is absent (freeze transition was killed before setting the flag). On restart, Phase 1 sees `:lfs` missing for C and re-runs `process_chunk(C)` against its configured source — idempotent per-artifact. Phase 3 then finds the active ledger store for C still on disk, sees `:lfs` now set, and deletes the orphaned store. Known inefficiency: ~10_000 ledgers of redundant ingestion work per affected chunk. Correctness is preserved. +- **Crash mid-RecSplit.** `index:{N}:txhash` absent. Phase 3 detects all chunks for N have `:lfs` set, re-spawns the RecSplit build. Partial `.idx` files are deleted first. +- **Crash mid-prune.** Some files deleted, some chunk keys cleared, `index:{N}:txhash = "deleting"` still present. 
On restart N is still in `eligible_prune_indexes` (the function picks up `"deleting"` as well as `"1"`), so `prune_index(N)` runs again — idempotent because file deletes are `rm -f` and key deletes are `delete_if_exists`. --- ## Backpressure and Drift Detection -- Streaming ingests ledgers at the Stellar network's production rate (~1 ledger every 6 seconds) -- All transitions (LFS flush, events freeze, RecSplit build) run in background goroutines — the ingestion loop should never stall -- If ingestion falls behind, the cause is typically disk I/O saturation or RocksDB compaction stalls +- Live ingestion runs at the network's production rate (~1 ledger / 6 s). Freeze transitions run in background and must not stall the ingestion loop. +- If ingestion drifts, the cause is typically disk I/O saturation or RocksDB compaction stalls. ### Drift Metric ```python -drift_ledgers = network_tip_ledger - last_committed_ledger +drift_ledgers = ledger_backend.latest_tip() - meta_store.get("streaming:last_committed_ledger") ``` -- `network_tip_ledger` obtained from CaptiveStellarCore metadata -- Exposed as a Prometheus gauge: `streaming_drift_ledgers` -- Threshold: **10 ledgers** (~60 seconds). More than 10 ledgers behind means something is wrong — transitions are background and should not cause drift. - -### Health Endpoint - -- `getHealth` returns unhealthy when `drift_ledgers > drift_warning_ledgers` (default 10) -- Kubernetes readiness probes use `getHealth` to remove the node from the service pool -- No automatic response (no pause, no abort) — the operator investigates and acts +- Exposed as a Prometheus gauge `streaming_drift_ledgers`. +- `getHealth` returns `unhealthy` when `drift_ledgers > DRIFT_WARNING_LEDGERS` (default 10). +- No automatic response (no pause, no abort). Operator investigates. 
--- @@ -808,36 +1083,33 @@ drift_ledgers = network_tip_ledger - last_committed_ledger | Error | Action | |---|---| | CaptiveStellarCore unavailable | RETRY with backoff; ABORT after N retries | -| Ledger store write failure | ABORT — disk full or storage corruption | -| TxHash store write failure | ABORT — disk full or storage corruption | -| Events hot segment write failure | ABORT — disk full or storage corruption | +| Ledger / txhash / events write failure | ABORT — disk full or storage corruption | | Meta store write failure | ABORT — cannot maintain checkpoint | -| LFS flush failure (pack file write/fsync) | Do NOT set `chunk:{C:08d}:lfs`; ABORT transition goroutine; restart re-triggers flush | -| Events freeze failure (cold segment write/fsync) | Do NOT set `chunk:{C:08d}:events`; ABORT transition goroutine; restart re-triggers freeze | -| RecSplit build failure | Do NOT set `index:{N:08d}:txhash`; ABORT transition goroutine; restart deletes partials and rebuilds | -| RecSplit verification mismatch | ABORT; do NOT delete transitioning txhash store; log error; operator investigates | -| Startup: `chunks_per_txhash_index` changed | FATAL — cannot change after first run | -| Startup: head not index-aligned | FATAL — run backfill to complete the head index | -| Startup: gap in chunk flags | FATAL — run backfill to fill the gap | -| Startup: backfill chunk missing txhash flag (first start in streaming mode) | FATAL — run backfill to complete | +| LFS flush failure | Do NOT set `chunk:{C}:lfs`; ABORT transition; restart retries | +| Events freeze failure | Do NOT set `chunk:{C}:events`; ABORT transition; restart retries | +| RecSplit build failure | Do NOT set `index:{N}:txhash`; ABORT transition; restart deletes partials and rebuilds | +| RecSplit verification mismatch | ABORT; do NOT delete transitioning txhash store; operator investigates | +| Startup: immutable key changed | FATAL — wipe datadir to change | +| Startup: `RETENTION_LEDGERS` not a multiple of 
`LEDGERS_PER_INDEX` | FATAL — fix config | +| Startup: head not index-aligned | FATAL — datadir corruption; wipe | +| Startup: gap in chunk flags | FATAL — datadir corruption; wipe | --- -## Query Routing +## Required Backfill Design Changes -- Covered in a separate design document -- Summary routing table for reference: +The unified design requires edits to `01-backfill-workflow.md` (authoritative `03-backfill-workflow.md` on `feature/full-history`): -| Query | Active phase | Transitioning phase | Complete phase | -|---|---|---|---| -| `getLedger` | Active ledger store (or LFS for already-flushed chunks) | LFS (all ledger stores already flushed) | LFS | -| `getTransaction` | Active txhash store (RocksDB CF lookup) | Transitioning txhash store (still open for reads) | RecSplit index | -| `getEvents` | Events hot segment (in-memory bitmap lookup) | Freezing segment (still serves reads) | Cold segment (MPHF + packfile) | +1. **Drop the `stellar-rpc full-history-backfill` cobra subcommand and all its per-run CLI flags** (`--start-ledger`, `--end-ledger`, `--workers`, `--verify-recsplit`, `--max-retries`). Backfill is no longer an operator-facing CLI entry point. `process_chunk`, `build_txhash_index`, `cleanup_txhash`, and the DAG scheduler remain as subroutines invoked by streaming Phase 1. +2. **Extend `process_chunk` with a `source=` parameter** accepting a `LedgerSource`. Default (`BSBSource`) matches today's behavior. Streaming Phase 1 passes the selected `LedgerSource`. +3. **Extend the DAG worker cap to honor `source.max_parallelism()`.** Currently the DAG caps at `--workers` (default GOMAXPROCS). Under `CaptiveCoreSource`, cap at 1. +4. **Move `retention_ledgers` validation + store-on-first-run into shared validation.** Backfill subroutine reads the stored value from meta store; streaming enforces it. +5. 
**Artifact key values stay as `"1"`.** No state-machine extension (`"frozen"` / `"pruning"`) needed — the `chunk:{C}:txhash` key is transient (Phase 2 deletes it) and the remaining keys have simple presence/absence semantics. --- ## Related Documents -- [01-backfill-workflow.md](./01-backfill-workflow.md) — backfill pipeline, DAG tasks, partial index handling +- [01-backfill-workflow.md](./01-backfill-workflow.md) — backfill subroutine: DAG, `process_chunk`, partial index handling - [getEvents full-history design](../../design-docs/getevents-full-history-design.md) — events hot/cold segments, bitmap indexes, freeze process - Query routing — separate design document (TBD) From 044f7578ea00afc7511fc5647ca8ea557c02540a Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Tue, 21 Apr 2026 08:32:16 -0700 Subject: [PATCH 06/34] Streaming doc: grill-me pass fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three correctness fixes + two ambiguity fixes caught during the hands-off grill-me self-review. Correctness: - Phase 3 resume_ledger derivation now matches Phase 4 (uses streaming:last_committed_ledger when present; falls back to derive_phase1_low_water). Previous divergence could leave the active ledger store for the just-rolled-over chunk as an orphan. - on_chunk_boundary now calls notify_lifecycle at the end so the prune loop actually gets woken. - lifecycle_loop runs an initial prune sweep on entry so crashed-mid- prune indexes ("deleting" state) don't sit unserviced until the next chunk-boundary notification (up to ~16h at cpi=1). - Phase 3's RecSplit re-spawn passes an opened store handle, not a directory path — matches recsplit_transition's signature. Doc gaps: - Added validate_config + _enforce_immutable pseudocode under Configuration so the retention + cpi immutability machinery is explicit. - Pre-creation timing note in Active Store Architecture. 
- Required Backfill Design Changes: spelled out run_backfill's new signature (chunks + source= instead of flags.start_ledger / flags.end_ledger) and clarified the :txhash "deleting" value is a streaming-only extension. --- .../design-docs/02-streaming-workflow.md | 108 +++++++++++++++--- 1 file changed, 93 insertions(+), 15 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index dd1d42bc1..4915a52a6 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -135,6 +135,49 @@ See [Ledger Source](#ledger-source) for the full source-selection rule. - `[HISTORY_ARCHIVES].URLS` required in all profiles. - `CAPTIVE_CORE_CONFIG` required in all profiles. +### Validation Pseudocode + +```python +def validate_config(config, meta_store): + """ + Runs once at startup before Phase 1. Enforces: + - Immutable keys (CHUNKS_PER_TXHASH_INDEX, RETENTION_LEDGERS) match meta-store state. + - RETENTION_LEDGERS is 0 or a positive multiple of LEDGERS_PER_INDEX. + - Required config keys are present. + + Any failure is fatal — the daemon exits with a clear error. Operator fixes config + (or wipes the datadir for an immutable-key change) and re-invokes. + """ + cpi = config.backfill.chunks_per_txhash_index + R = config.streaming.retention_ledgers + ledgers_per_index = cpi * 10_000 + + # 1. Retention shape. + if R != 0 and (R <= 0 or R % ledgers_per_index != 0): + fatal(f"RETENTION_LEDGERS={R} must be 0 or a positive multiple of " + f"LEDGERS_PER_INDEX={ledgers_per_index}. Valid values at this cpi: " + f"0, {ledgers_per_index}, {2*ledgers_per_index}, ...") + + # 2. Required keys. + if not config.streaming.captive_core_config: + fatal("STREAMING.CAPTIVE_CORE_CONFIG is required.") + if not config.history_archives.urls: + fatal("HISTORY_ARCHIVES.URLS is required (list of at least one archive URL).") + + # 3. Immutable keys. 
Store on first run; fatal on mismatch thereafter. + _enforce_immutable(meta_store, "config:chunks_per_txhash_index", str(cpi)) + _enforce_immutable(meta_store, "config:retention_ledgers", str(R)) + + +def _enforce_immutable(meta_store, key, current_value): + stored = meta_store.get(key) + if stored is None: + meta_store.put(key, current_value) + elif stored != current_value: + fatal(f"{key} changed: stored={stored}, config={current_value}. " + f"Wipe datadir to change.") +``` + ### Operator Profiles Three profiles emerge from config combinations. No profile flag. @@ -351,7 +394,8 @@ The daemon maintains three active stores for the current ingestion position. All ### Store Pre-creation - The store for the next chunk / index is pre-created before the boundary is reached, so boundary-time work is a pointer swap only. -- On restart, a pre-created store is expected to exist — Phase 3 treats it as active, not an orphan. +- Creation timing: when the ingestion loop commits a ledger within a configurable window before the boundary (e.g., `chunk_last_ledger(C) - 1_000`). The window must be large enough that store initialization (directory mkdir + RocksDB open + column family setup) completes before the boundary ledger arrives, and small enough that pre-creation doesn't run prematurely for chunks the daemon may never reach. +- On restart, a pre-created store is expected to exist — Phase 3 treats `resume_chunk + 1` (and `resume_index + 1`) as active, not an orphan. ### Max Concurrent Stores @@ -662,10 +706,18 @@ def phase3_reconcile_orphans(config, meta_store): On a fresh datadir (no :lfs flags anywhere, Phase 1 had nothing to do) this is a no-op: resume_ledger = GENESIS_LEDGER, resume_chunk = 0, no active stores on disk yet. """ - # Derive resume_ledger the same way Phase 4 will. resume_chunk is the chunk Phase 4 - # ingests into first; its active store is preserved through Phase 3. 
+ # Derive resume_ledger the SAME way Phase 4 will — otherwise Phase 3 and Phase 4 can + # disagree on which chunk's active store to preserve, causing Phase 4 to open a fresh + # store while Phase 3's kept-active-store is left as an orphan. + # + # Priority order (matches phase4_ingest): + # 1. streaming:last_committed_ledger if set (live-path crash mid-chunk or at boundary). + # 2. derive_phase1_low_water otherwise (first-start after Phase 1, or fresh datadir). cpi = config.backfill.chunks_per_txhash_index - resume_ledger = derive_phase1_low_water(meta_store) + 1 + last_committed = meta_store.get("streaming:last_committed_ledger") + if last_committed is None: + last_committed = derive_phase1_low_water(meta_store) + resume_ledger = last_committed + 1 if resume_ledger < GENESIS_LEDGER: resume_ledger = GENESIS_LEDGER resume_chunk = (resume_ledger - 2) // 10_000 @@ -691,7 +743,11 @@ def phase3_reconcile_orphans(config, meta_store): if meta_store.has(f"index:{N:08d}:txhash"): delete_dir(store_dir) # RecSplit done, cleanup lingered elif all_chunks_frozen(meta_store, N, cpi): - run_in_background(recsplit_transition, N, store_dir, meta_store) + # RecSplit build for N was never started or was interrupted. Open the store + # and spawn the build — pass the handle, not the directory path, because + # recsplit_transition reads from the store and closes it on completion. + transitioning_txhash = open_active_txhash_store(config, N) + run_in_background(recsplit_transition, N, transitioning_txhash, meta_store) # Events hot segment: truncate any persisted deltas beyond resume_ledger - 1. # Prevents duplicate event IDs when Phase 4 replays the first live ledger. @@ -824,6 +880,11 @@ def on_chunk_boundary(C, active_stores, meta_store): freezing_segment = active_stores.events active_stores.events = create_events_hot_segment(C + 1) run_in_background(events_transition, C, freezing_segment, meta_store) + + # Wake the lifecycle goroutine — it will check prune eligibility. 
Freeze transitions + # above are NOT dispatched via the lifecycle loop; they run as direct children of the + # ingestion-loop thread. The notification is specifically for pruning. + notify_lifecycle() ``` ### LFS Transition @@ -934,16 +995,32 @@ Retention is enforced by a single background goroutine, woken at chunk boundarie ```python def lifecycle_loop(config, meta_store): """ - Runs as a single background goroutine. Chunk-boundary-notified. Prune gate is - uniform across all artifact kinds — LFS, events, RecSplit — for a given index. + Runs as a single background goroutine. Prune gate is uniform across all artifact + kinds — LFS, events, RecSplit — for a given index. + + Wake-up sources: + - Initial scan at entry — catches any index left in "deleting" state by a prior + crashed prune before the first chunk-boundary notification of this run arrives. + Without this, a crashed prune could sit unserviced for up to 10_000 ledgers + (~16 hours at cpi=1). + - Chunk-boundary notifications from the ingestion loop (see on_chunk_boundary). + + The freeze transitions (lfs_transition, events_transition, recsplit_transition) are + NOT spawned by this loop — the ingestion loop's on_chunk_boundary / on_index_boundary + dispatch them directly. lifecycle_loop is scoped to pruning. 
""" + cpi = config.backfill.chunks_per_txhash_index + R = config.streaming.retention_ledgers + + _do_prune_sweep(meta_store, R, cpi, config) # initial scan while True: wait_for_chunk_boundary_notification() + _do_prune_sweep(meta_store, R, cpi, config) + - cpi = config.backfill.chunks_per_txhash_index - R = config.streaming.retention_ledgers - for N in eligible_prune_indexes(meta_store, R, cpi): - prune_index(N, meta_store, config) +def _do_prune_sweep(meta_store, R, cpi, config): + for N in eligible_prune_indexes(meta_store, R, cpi): + prune_index(N, meta_store, config) def eligible_prune_indexes(meta_store, R, cpi): @@ -1101,10 +1178,11 @@ drift_ledgers = ledger_backend.latest_tip() - meta_store.get("streaming:last_com The unified design requires edits to `01-backfill-workflow.md` (authoritative `03-backfill-workflow.md` on `feature/full-history`): 1. **Drop the `stellar-rpc full-history-backfill` cobra subcommand and all its per-run CLI flags** (`--start-ledger`, `--end-ledger`, `--workers`, `--verify-recsplit`, `--max-retries`). Backfill is no longer an operator-facing CLI entry point. `process_chunk`, `build_txhash_index`, `cleanup_txhash`, and the DAG scheduler remain as subroutines invoked by streaming Phase 1. -2. **Extend `process_chunk` with a `source=` parameter** accepting a `LedgerSource`. Default (`BSBSource`) matches today's behavior. Streaming Phase 1 passes the selected `LedgerSource`. -3. **Extend the DAG worker cap to honor `source.max_parallelism()`.** Currently the DAG caps at `--workers` (default GOMAXPROCS). Under `CaptiveCoreSource`, cap at 1. -4. **Move `retention_ledgers` validation + store-on-first-run into shared validation.** Backfill subroutine reads the stored value from meta store; streaming enforces it. -5. **Artifact key values stay as `"1"`.** No state-machine extension (`"frozen"` / `"pruning"`) needed — the `chunk:{C}:txhash` key is transient (Phase 2 deletes it) and the remaining keys have simple presence/absence semantics. +2. 
**Change `run_backfill`'s signature to `run_backfill(config, range_start_chunk, range_end_chunk, source=...)`.** Previously `run_backfill(config, flags)` with `flags.start_ledger` and `flags.end_ledger`. Phase 1 computes chunk IDs (not ledger sequences) via `compute_backfill_range`, so the subroutine takes chunk IDs directly. `source=` selects BSB vs captive core. +3. **Extend `process_chunk` with the matching `source=` parameter** accepting a `LedgerSource`. Default (`BSBSource`) matches today's behavior. The subroutine no longer creates a BSB connection from config — it uses whatever the caller passed in. +4. **Extend the DAG worker cap to honor `source.max_parallelism()`.** Currently the DAG caps at `--workers` (default GOMAXPROCS). Under `CaptiveCoreSource`, cap at 1. +5. **Move `retention_ledgers` validation + store-on-first-run into shared validation.** Streaming's `validate_config` handles the store+compare. Backfill itself doesn't need to know retention — Phase 1 translates retention into the `[range_start_chunk, range_end_chunk]` it passes into `run_backfill`. +6. **Artifact key values stay as `"1"`.** No state-machine extension (`"frozen"` / `"pruning"`) needed — the `chunk:{C}:txhash` key is transient (Phase 2 deletes it) and the remaining keys have simple presence/absence semantics. Exception: `index:{N:08d}:txhash` uses `"1"` or `"deleting"` for two-phase prune (spec'd in this doc under [Pruning](#pruning); backfill's `build_txhash_index` only ever writes `"1"`). --- From 392e78134f96274727b10eefc17979607f94f312 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Tue, 21 Apr 2026 08:33:35 -0700 Subject: [PATCH 07/34] Streaming doc: close active txhash store in Phase 2 Phase 2's hydrate step opened the active txhash store but never closed it. Phase 4's open_active_stores then tries to re-open the same directory and collides with RocksDB's directory flock. Fix: wrap the Phase 2 load loop in try/finally and close the handle before returning. 
WAL state on disk is preserved; Phase 4 re-opens cleanly. --- .../design-docs/02-streaming-workflow.md | 34 +++++++++++-------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 4915a52a6..3e0c5fd0b 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -664,20 +664,26 @@ def phase2_hydrate_txhash(config, meta_store): return txhash_store = open_active_txhash_store(config, N) # WAL recovery; do NOT recreate - for chunk_id in range(N * cpi, (N + 1) * cpi): - if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - continue # already loaded (flag cleared) - bin_path = raw_txhash_path(chunk_id) - if os.path.exists(bin_path): - load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first - delete_if_exists(bin_path) # delete .bin second - - # 3. Sweep orphan .bin files (flag already gone, .bin lingering from crash between - # flag-delete and file-delete in a prior run). - for bin_file in scan_bin_files_for_index(N): - if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): - os.remove(bin_file) + try: + for chunk_id in range(N * cpi, (N + 1) * cpi): + if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): + continue # already loaded (flag cleared) + bin_path = raw_txhash_path(chunk_id) + if os.path.exists(bin_path): + load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first + delete_if_exists(bin_path) # delete .bin second + + # 3. Sweep orphan .bin files (flag already gone, .bin lingering from crash between + # flag-delete and file-delete in a prior run). 
+ for bin_file in scan_bin_files_for_index(N): + if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): + os.remove(bin_file) + finally: + # Must close before returning — Phase 4's open_active_stores re-opens the same + # directory, and RocksDB's directory flock would collide if this handle is still + # open. WAL remains on disk; reopening is safe. + txhash_store.close() ``` **Why "load then delete" matters.** Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. At `cpi=1000` with frequent restarts over a day, that is thousands of redundant loads. Deleting the `.bin` after the first successful load makes Phase 2 a no-op on every subsequent restart until the next Phase 1 deposits new `.bin` files. From 07310f1bd456d01ec1cf3a09022b75d095a02f13 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Tue, 21 Apr 2026 16:51:14 -0700 Subject: [PATCH 08/34] =?UTF-8?q?Streaming=20doc:=20devil's-advocate=20pas?= =?UTF-8?q?s=20=E2=80=94=20five=20substantive=20fixes?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Five findings substantive enough to fix; rest were non-issues on analysis or implementation-detail footnotes that belong in the source-of-truth, not the design spec. Added pseudocode for previously-implicit functions: - open_active_stores (Phase 4 entry): eagerly opens resume_chunk + resume_chunk+1 stores for ledger / events / txhash, so the first rollover is a pointer swap with zero I/O. - precreate_next_stores (background, from on_chunk_boundary): refills the *_next slots so subsequent rollovers stay pointer-swap-only. - complete_lfs_flush (Phase 3 helper): re-runs an interrupted LFS freeze for an orphan active ledger store. Explicit counterpart to lfs_transition, without the max-1-transitioning gate since Phase 3 runs synchronously before Phase 4. 
New Concurrency Model subsection under Freeze Transitions: - active_stores fields are single-owner (ingestion loop thread). - wait_for_lfs_complete / wait_for_events_complete are per-kind single-flight gates (chan struct{} or mutex), NOT sync.WaitGroup. - Meta-store is single-writer; serialization is RocksDB-enforced. - Pre-creation approach is eager at Phase 4 entry + background refill at each boundary, not a mid-chunk threshold tripwire. on_chunk_boundary rewritten to do pointer swaps using the *_next slots; spawns precreate_next_stores in background for C+2. --- .../design-docs/02-streaming-workflow.md | 95 +++++++++++++++++-- 1 file changed, 89 insertions(+), 6 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 3e0c5fd0b..d59b47e8f 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -784,6 +784,36 @@ def phase4_ingest(config, meta_store): set_daemon_ready() # in-memory flag; unblocks queries run_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) + + +def open_active_stores(config, meta_store, resume_ledger): + """ + Open or create the three active stores for resume_ledger's chunk + index. Also + pre-create the next chunk's / next index's stores up front so the first chunk + rollover doesn't pay creation latency. + + - Ledger active: per-chunk RocksDB for chunk_id(resume_ledger). WAL-recovered + if the directory exists (mid-chunk restart); fresh-created otherwise. + - Events hot segment: in-memory for chunk_id(resume_ledger). If persisted deltas + exist for this chunk (mid-chunk restart), replay them to rebuild bitmaps. + Phase 3 already truncated anything past last_committed_ledger, so replay is safe. + - TxHash active: per-index RocksDB for index_id(chunk_id(resume_ledger)). 
May + already contain data from Phase 2's .bin hydration (which closed the handle + before returning — see Phase 2 pseudocode). WAL-recovered on reopen. + - Pre-created: also open/create chunk_id + 1 and index_id + 1 stores so the + first boundary rollover is a pointer swap only. + """ + resume_chunk = (resume_ledger - 2) // 10_000 + resume_index = resume_chunk // config.backfill.chunks_per_txhash_index + + return ActiveStores( + ledger = open_or_create_ledger_store(config, resume_chunk), + ledger_next = open_or_create_ledger_store(config, resume_chunk + 1), + events = open_or_create_events_hot_segment(config, meta_store, resume_chunk, resume_ledger), + events_next = open_or_create_events_hot_segment(config, meta_store, resume_chunk + 1, None), + txhash = open_or_create_txhash_store(config, resume_index), + txhash_next = open_or_create_txhash_store(config, resume_index + 1), + ) ``` Captive core takes 4–5 minutes to spin up and start emitting at `resume_ledger`. During that window `getHealth` remains in `catching_up` state (see [Query Contract](#query-contract)). @@ -860,6 +890,14 @@ Three independent background transitions per chunk/index boundary. Each has its Streaming's freeze transitions never produce `.bin` files. `.bin` files exist only as transient output of the backfill subroutine (inside Phase 1). +### Concurrency Model + +- **`active_stores` is the ingestion loop's owned state.** Fields (`ledger`, `ledger_next`, `events`, `events_next`, `txhash`, `txhash_next`) are mutated only by the ingestion loop thread — specifically inside `on_chunk_boundary` and `on_index_boundary`. Freeze transitions receive a handle by value at spawn time and never read back through `active_stores`. +- **Meta-store is single-writer.** Meta-store flag writes come from: the ingestion loop (per-ledger checkpoint), freeze transitions (artifact `:lfs` / `:events` / `:txhash` flags after fsync), and the lifecycle loop (`"deleting"` marker + key delete during prune). 
Go's `sync.Mutex` inside the meta-store wrapper + RocksDB's own single-writer semantics keep these serialized. +- **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit). Implementation: a `chan struct{}` of capacity 1 per kind, used as a binary semaphore, or equivalently a `sync.Mutex`. `wait_for_lfs_complete()` acquires; `signal_lfs_complete()` at the end of `lfs_transition` releases. The second transition starts only after the first releases. Not a `sync.WaitGroup` — that would wait for ALL transitions globally, wrong semantics. +- **Query handlers read from the storage-manager layer** (see [01-backfill-workflow.md](./01-backfill-workflow.md)'s sibling docs and the pending query-routing design). Each per-data-type storage manager owns its own state-transition synchronization; the query handler never touches `active_stores` directly. +- **Pre-creation happens at store-open time, not at a mid-chunk tripwire.** `open_active_stores` (Phase 4 entry) opens BOTH `resume_chunk`'s store AND `resume_chunk + 1`'s store up front. Subsequent pre-creation happens inside `on_chunk_boundary` after the rollover — it opens `C + 2` so the NEXT rollover has the pre-created store already waiting. Amortizes creation cost; keeps the ingestion loop's hot path free of store-open latency. + ### Chunk Boundary (every 10_000 ledgers) Triggered when the ingestion loop commits `chunk_last_ledger(C)`. Handoffs to two freeze transitions (LFS + events) that run in background. @@ -869,28 +907,51 @@ def on_chunk_boundary(C, active_stores, meta_store): """ Swap active stores and kick off LFS + events freeze transitions for chunk C. - Ingestion for chunk C+1 continues unimpeded — the ingestion loop's active_stores - reference now points at pre-created stores for C+1, while the transitions below - read from the stores just retired.
+ Ingestion for chunk C+1 continues unimpeded — active_stores.ledger now points at + the ledger_next store that was pre-created at Phase 4 entry (or by the prior chunk's + boundary handler). + + Also pre-creates C+2's stores in background, so the NEXT chunk rollover finds its + pre-created store already opened. """ # LFS transition — drain the last in-flight LFS freeze (max-1-transitioning invariant), - # then swap pointers so the next chunk writes to pre-created stores. + # then swap pointers so the next chunk writes to the pre-created store. wait_for_lfs_complete() transitioning_ledger = active_stores.ledger - active_stores.ledger = open_precreated_ledger_store(C + 1) + active_stores.ledger = active_stores.ledger_next # pointer swap, no I/O run_in_background(lfs_transition, C, transitioning_ledger, meta_store) # Events transition — same shape. Independent goroutine; does NOT wait for LFS. wait_for_events_complete() freezing_segment = active_stores.events - active_stores.events = create_events_hot_segment(C + 1) + active_stores.events = active_stores.events_next # pointer swap run_in_background(events_transition, C, freezing_segment, meta_store) + # Pre-create C+2's ledger + events so the NEXT boundary is also a pointer swap. + # Low priority; not part of the hot path. Runs in background. + run_in_background(precreate_next_stores, active_stores, meta_store, C + 2) + # Wake the lifecycle goroutine — it will check prune eligibility. Freeze transitions # above are NOT dispatched via the lifecycle loop; they run as direct children of the # ingestion-loop thread. The notification is specifically for pruning. notify_lifecycle() + + +def precreate_next_stores(active_stores, meta_store, target_chunk): + """ + Opens / creates the "next-next" ledger store + events hot segment in background so + the NEXT chunk rollover doesn't pay creation latency on the hot path. + + Similarly handles index-next pre-creation when target_chunk crosses an index boundary. 
+ Idempotent — safe to run on a restart where the target stores already exist. + """ + active_stores.ledger_next = open_or_create_ledger_store(config, target_chunk) + active_stores.events_next = open_or_create_events_hot_segment(config, meta_store, target_chunk, None) + cpi = config.backfill.chunks_per_txhash_index + target_index = target_chunk // cpi + if target_index != index_id_of_chunk(target_chunk - 1): + active_stores.txhash_next = open_or_create_txhash_store(config, target_index) ``` ### LFS Transition @@ -921,6 +982,28 @@ def lfs_transition(C, transitioning_ledger_store, meta_store): transitioning_ledger_store.close() # 5 delete_dir(ledger_store_path(C)) signal_lfs_complete() + + +def complete_lfs_flush(store_dir, C, meta_store): + """ + Phase 3 helper. Re-runs lfs_transition for a chunk whose active ledger store exists + on disk but whose :lfs flag is absent — i.e., a crash interrupted the freeze after + the per-ledger checkpoint but before the flag was set. + + Identical to lfs_transition except: + - No signal_lfs_complete call (not running under the max-1-transitioning gate; + Phase 3 is synchronous with startup and runs to completion before Phase 4 starts). + - Opens the existing store (WAL-recovered) rather than receiving a handle. 
+ """ + transitioning_ledger_store = open_or_create_ledger_store(config, C) + pack_path = ledger_pack_path(C) + writer = packfile.create(pack_path, overwrite=True) + for seq in range(chunk_first_ledger(C), chunk_last_ledger(C) + 1): + writer.append(transitioning_ledger_store.get(uint32_big_endian(seq))) + writer.fsync_and_close() + meta_store.put(f"chunk:{C:08d}:lfs", "1") + transitioning_ledger_store.close() + delete_dir(store_dir) ``` ### Events Transition From b61c6fa0d4dade4e4d661092d39e71cdc514c64c Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Tue, 21 Apr 2026 17:01:14 -0700 Subject: [PATCH 09/34] Streaming doc: arithmetic hygiene + off-by-one fix in prune eligibility MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three changes: 1. eligible_prune_indexes off-by-one Old formula: max_eligible_N = (L - R - 2) // ledgers_per_index It rounded UP by one, so at L just past an index's strict retention line, the NEXT index (not yet past retention) was also included in the result. A just-completed RecSplit on the not-yet-past-retention index would have :txhash = "1" and get pruned prematurely. New formula: max_eligible_N = ((L - GENESIS_LEDGER - R) // LEDGERS_PER_INDEX) - 1 Derivation + numeric check now inline in the docstring. 2. Replace bare 2 / 10_000 with GENESIS_LEDGER / LEDGERS_PER_CHUNK throughout pseudocode. Geometry section now self-documents the Stellar-starts-at-ledger-2 offset. Added worked-example expansions (e.g., "chunk 5634 → (5634 * 10_000) + 2 = 56_340_002 — ends in ..._02") so readers see exactly how the formulas produce the ..._02 / ..._01 boundaries. 3. Added explicit parentheses around compound arithmetic (R % ledgers_per_index != 0, (T - L) < LEDGERS_PER_CHUNK, etc.) so readers don't have to remember Python's operator precedence. No behavioral change beyond the prune-eligibility fix itself.
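
Sanity check for the eligible_prune_indexes change (a minimal sketch, not
code taken from the doc). It compares the old and new formulas quoted above
against a brute-force predicate. Assumptions: L is the last committed
ledger, R is RETENTION_LEDGERS, and an index is past its strict retention
line when index_last_ledger(N) < L - R. The helper names and that predicate
are illustrative only.

```python
GENESIS_LEDGER = 2


def index_last_ledger(n, ledgers_per_index):
    # Same shape as the Geometry quick reference: index ends in ..._01.
    return (n + 1) * ledgers_per_index + (GENESIS_LEDGER - 1)


def max_eligible_old(L, R, ledgers_per_index):
    # Old formula as quoted in this commit message.
    return (L - R - 2) // ledgers_per_index


def max_eligible_new(L, R, ledgers_per_index):
    # New formula as quoted in this commit message.
    return ((L - GENESIS_LEDGER - R) // ledgers_per_index) - 1


def past_retention(n, L, R, ledgers_per_index):
    # ASSUMPTION: "strict retention line" means every ledger of index n
    # is older than L - R. This reading is not confirmed by the source.
    return index_last_ledger(n, ledgers_per_index) < L - R


if __name__ == "__main__":
    # Small numbers: cpi=1, so LEDGERS_PER_INDEX = 10_000.
    L, R, lpi = 30_005, 10_000, 10_000
    print(max_eligible_old(L, R, lpi), max_eligible_new(L, R, lpi))    # 2 1
    print(past_retention(1, L, R, lpi), past_retention(2, L, R, lpi))  # True False
```

Under these assumptions the old formula admits index 2 (ledgers 20_002
through 30_001), which still overlaps the retention window, while the new
formula stops at index 1.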
--- .../design-docs/02-streaming-workflow.md | 85 ++++++++++++++----- 1 file changed, 63 insertions(+), 22 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index d59b47e8f..b2fcc5b72 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -55,13 +55,33 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a Chunk and txhash index math are defined in [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Quick reference: ```python -# Stellar ledgers start at 2. All formulas subtract 2 to zero-base. -chunk_id = (ledger_seq - 2) // 10_000 # ledger 56_340_001 → chunk 5633 -chunk_first_ledger(C) = (C * 10_000) + 2 # chunk 5634 → ledger 56_340_002 -chunk_last_ledger(C) = ((C + 1) * 10_000) + 1 # chunk 5634 → ledger 56_350_001 -index_id(C) = C // CHUNKS_PER_TXHASH_INDEX # chunk 5634 → index 5 (at cpi=1000) -index_last_ledger(N) = ((N + 1) * CHUNKS_PER_TXHASH_INDEX * 10_000) + 1 # index 5 → ledger 60_000_001 -LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * 10_000 # derived; at cpi=1000 this is 10_000_000 +# Stellar's first ledger is GENESIS_LEDGER = 2 (not 0 or 1). Every formula that maps +# ledger_seq ↔ chunk_id subtracts GENESIS_LEDGER to zero-base the axis: ledger 2 lands +# in chunk 0, ledger 10_001 is chunk 0's last ledger, ledger 10_002 starts chunk 1. 
+ +GENESIS_LEDGER = 2 +LEDGERS_PER_CHUNK = 10_000 +LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK + # at cpi=1000 this is 10_000_000 + +chunk_id(ledger_seq) = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + # 56_342_637 → (56_342_637 - 2) // 10_000 = 5634 + +chunk_first_ledger(C) = (C * LEDGERS_PER_CHUNK) + GENESIS_LEDGER + # chunk 5634 → (5634 * 10_000) + 2 = 56_340_002 — ends in ..._02 + +chunk_last_ledger(C) = ((C + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) + # chunk 5634 → ((5635) * 10_000) + 1 = 56_350_001 — ends in ..._01 + # (GENESIS_LEDGER - 1) = 1 is what keeps chunk ends in ..._01 + +index_id_of_chunk(C) = C // CHUNKS_PER_TXHASH_INDEX + # chunk 5634 → 5 (at cpi=1000) + +index_first_ledger(N) = (N * LEDGERS_PER_INDEX) + GENESIS_LEDGER + # index 5 → (5 * 10_000_000) + 2 = 50_000_002 + +index_last_ledger(N) = ((N + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) + # index 5 → ((6) * 10_000_000) + 1 = 60_000_001 ``` --- @@ -150,10 +170,10 @@ def validate_config(config, meta_store): """ cpi = config.backfill.chunks_per_txhash_index R = config.streaming.retention_ledgers - ledgers_per_index = cpi * 10_000 + ledgers_per_index = cpi * LEDGERS_PER_CHUNK # 1. Retention shape. - if R != 0 and (R <= 0 or R % ledgers_per_index != 0): + if R != 0 and (R <= 0 or (R % ledgers_per_index) != 0): fatal(f"RETENTION_LEDGERS={R} must be 0 or a positive multiple of " f"LEDGERS_PER_INDEX={ledgers_per_index}. 
Valid values at this cpi: " f"0, {ledgers_per_index}, {2*ledgers_per_index}, ...") @@ -544,7 +564,7 @@ def phase1_catchup(config, meta_store, source): while True: T = source.tip() - if T - L < 10_000: # less than one chunk remaining + if (T - L) < LEDGERS_PER_CHUNK: # less than one chunk remaining break range_start, range_end = compute_backfill_range(L, T, R, cpi) @@ -577,16 +597,20 @@ def compute_backfill_range(L, T, R, cpi): gap_start_ledger = L + 1 if R > 0: target_ledger = max(T - R, GENESIS_LEDGER) - target_chunk = (target_ledger - 2) // 10_000 + target_chunk = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK target_index = target_chunk // cpi - # First ledger of target_index = target_index * LEDGERS_PER_INDEX + GENESIS_LEDGER - leapfrog_start_ledger = target_index * cpi * 10_000 + GENESIS_LEDGER + # First ledger of target_index = (target_index * LEDGERS_PER_INDEX) + GENESIS_LEDGER. + leapfrog_start_ledger = (target_index * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER else: leapfrog_start_ledger = GENESIS_LEDGER range_start_ledger = max(gap_start_ledger, leapfrog_start_ledger) - range_start_chunk = (range_start_ledger - 2) // 10_000 - range_end_chunk = ((T - 1) // 10_000) - 1 # last complete chunk at tip + range_start_chunk = (range_start_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + # range_end_chunk: largest C such that chunk_last_ledger(C) <= T. 
+ # chunk_last_ledger(C) = ((C + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) + # <= T iff (C + 1) <= (T - (GENESIS_LEDGER - 1)) / LEDGERS_PER_CHUNK + # iff C <= ((T - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 + range_end_chunk = ((T - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 return range_start_chunk, range_end_chunk @@ -726,7 +750,7 @@ def phase3_reconcile_orphans(config, meta_store): resume_ledger = last_committed + 1 if resume_ledger < GENESIS_LEDGER: resume_ledger = GENESIS_LEDGER - resume_chunk = (resume_ledger - 2) // 10_000 + resume_chunk = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK # Ledger stores for store_dir in scan_ledger_store_dirs(config): @@ -803,7 +827,7 @@ def open_active_stores(config, meta_store, resume_ledger): - Pre-created: also open/create chunk_id + 1 and index_id + 1 stores so the first boundary rollover is a pointer swap only. """ - resume_chunk = (resume_ledger - 2) // 10_000 + resume_chunk = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK resume_index = resume_chunk // config.backfill.chunks_per_txhash_index return ActiveStores( @@ -864,14 +888,15 @@ def run_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume meta_store.put("streaming:last_committed_ledger", seq) # Chunk rollover: hand off to background LFS + events freeze transitions. - C = (seq - 2) // 10_000 + C = (seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK if seq == chunk_last_ledger(C): on_chunk_boundary(C, active_stores, meta_store) # Index rollover — every index boundary is also a chunk boundary, so this runs # AFTER on_chunk_boundary has already dispatched the last chunk's freeze transitions. - if seq == index_last_ledger(C // cpi): - on_index_boundary(C // cpi, active_stores, meta_store) + N = C // cpi + if seq == index_last_ledger(N): + on_index_boundary(N, active_stores, meta_store) seq += 1 ``` @@ -1121,12 +1146,28 @@ def eligible_prune_indexes(meta_store, R, cpi): - R = 0 → no pruning; archive profile retains everything. 
- R > 0 → index N is eligible when tip > index_last_ledger(N) + R. - tip ledger is streaming:last_committed_ledger (the daemon's own progress). + + Upper bound derivation: + index_last_ledger(N) = ((N + 1) * LPI) + (GENESIS_LEDGER - 1) + Eligible iff L > ((N + 1) * LPI) + (GENESIS_LEDGER - 1) + R + iff L - (GENESIS_LEDGER - 1) - R > (N + 1) * LPI + iff (N + 1) < (L - (GENESIS_LEDGER - 1) - R) / LPI + iff N <= ((L - (GENESIS_LEDGER - 1) - R - 1) // LPI) - 1 (integer floor) + Simplify: L - (GENESIS_LEDGER - 1) - 1 = L - GENESIS_LEDGER. + max_eligible_N = ((L - GENESIS_LEDGER - R) // LPI) - 1 + + Numeric check at L=70_000_002, R=10_000_000, cpi=1000 (LPI=10_000_000): + max_eligible_N = (70_000_002 - 2 - 10_000_000) // 10_000_000 - 1 = 6 - 1 = 5. + Index 5 has index_last_ledger(5) + R = 60_000_001 + 10_000_000 = 70_000_001. + 70_000_002 > 70_000_001 → N=5 eligible. ✓ + Index 6 has index_last_ledger(6) + R = 70_000_001 + 10_000_000 = 80_000_001. + 70_000_002 > 80_000_001 is false → N=6 NOT eligible. ✓ """ if R == 0: return [] L = meta_store.get("streaming:last_committed_ledger") - ledgers_per_index = cpi * 10_000 - max_eligible_N = (L - R - 2) // ledgers_per_index + ledgers_per_index = cpi * LEDGERS_PER_CHUNK + max_eligible_N = ((L - GENESIS_LEDGER - R) // ledgers_per_index) - 1 if max_eligible_N < 0: return [] result = [] From cc417bef3ff3c758e9ae65ae6ece868e3ed7112f Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 11:00:29 -0700 Subject: [PATCH 10/34] Streaming doc: variable + function rename pass MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rename principle: names describe what they return (for functions) or what they hold (for variables). Single-letter shorthand inherited from Tamir's gist retired. 
Variables (locals in pseudocode, snake_case): L → last_committed_ledger T → network_tip_ledger R → retention_ledgers C → chunk_id N → tx_index_id max_eligible_N → max_eligible_tx_index_id resume_chunk → resume_chunk_id resume_index → resume_tx_index_id target_chunk → target_chunk_id target_index → target_tx_index_id min_chunk → min_chunk_id Functions (snake_case, return-value-centric): chunk_id(ledger_seq) → chunk_id_of_ledger(ledger_seq) chunk_first_ledger(C) → first_ledger_in_chunk(chunk_id) chunk_last_ledger(C) → last_ledger_in_chunk(chunk_id) index_id_of_chunk(C) → tx_index_id_of_chunk(chunk_id) index_first_ledger(N) → first_ledger_in_tx_index(tx_index_id) index_last_ledger(N) → last_ledger_in_tx_index(tx_index_id) derive_phase1_low_water → phase1_coverage_end_ledger compute_backfill_range → compute_backfill_chunk_range eligible_prune_indexes → prunable_tx_index_ids current_incomplete_index → current_incomplete_tx_index_id all_chunks_frozen → all_chunks_in_tx_index_have_lfs_flag indexes_with_txhash_flag → tx_index_ids_with_txhash_flag parse_index_id_from_dir → parse_tx_index_id_from_dir on_index_boundary → on_tx_index_boundary lfs_transition → freeze_ledger_chunk_to_pack_file events_transition → freeze_events_chunk_to_cold_segment recsplit_transition → build_tx_index_recsplit_files prune_index → prune_tx_index lifecycle_loop → run_prune_lifecycle_loop _do_prune_sweep → _run_prune_sweep open_active_stores → open_active_stores_for_resume precreate_next_stores → precreate_next_boundary_stores complete_lfs_flush → finish_interrupted_ledger_freeze choose_phase1_source → select_phase1_ledger_source phase4_ingest → phase4_live_ingest run_ingestion_loop → run_live_ingestion_loop run_streaming → run_streaming_daemon Terminology section: retired "Phase 1 low-water" (was semantically a high-water mark), introduced network_tip_ledger explicitly, pointed at phase1_coverage_end_ledger. Meta-store key templates updated: {C:08d} → {chunk_id:08d}, {N:08d} → {tx_index_id:08d}. 
Prose referencing N/C chunks/indexes normalized to the long names. Worked examples in Scenario B now use kwarg-style calls (last_committed_ledger=1, network_tip_ledger=...). Python pseudocode convention (not camelCase); the Go implementation will use camelCase during TDD. No semantic or behavioral changes. Pure rename. --- .../design-docs/02-streaming-workflow.md | 711 +++++++++--------- 1 file changed, 364 insertions(+), 347 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index b2fcc5b72..7a88b5942 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -30,8 +30,9 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - **Phase 1 catchup** — the startup phase that closes the gap between the last-committed ledger and the current network tip. Invokes the backfill subroutine internally. - **Backfill (subroutine)** — a self-contained mechanism that ingests a known `[range_start, range_end]` chunk range via a static DAG of per-chunk tasks (`process_chunk`, `build_txhash_index`, `cleanup_txhash`). Specified in `01-backfill-workflow.md`. In the unified design, backfill is an internal callable only — no CLI entry point exists. - **Leapfrog** — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. -- **Phase 1 low-water** (`derive_phase1_low_water`) — the last ledger of the contiguous prefix of `chunk:{C}:lfs` flags starting from the lowest chunk on disk. Phase 1 uses this to decide what's still left to ingest. **Not the same** as `streaming:last_committed_ledger`. 
-- **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside the Phase 4 ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. +- **`phase1_coverage_end_ledger`** — the last ledger of the contiguous prefix of `chunk:{chunk_id}:lfs` flags starting from the lowest chunk on disk. Phase 1 uses this to decide what's still left to ingest. Returned by the same-named function. **Not the same** as `streaming:last_committed_ledger`. (Prior drafts called this concept "Phase 1 low-water mark"; the term was retired because it's semantically a HIGH-water mark — the newest confirmed ledger in contiguous coverage.) +- **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside the Phase 4 ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. +- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from `source.tip()`. For `BSBSource`: read from BSB's range-end metadata. For `CaptiveCoreSource`: fetched via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVE_URLS`. Different from `last_committed_ledger` (the daemon's own progress). - **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: - Ledger active store — a per-chunk RocksDB (one instance per chunk). - TxHash active store — a per-index RocksDB with 16 column families (one instance per index). @@ -41,7 +42,7 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - RecSplit index `.idx` files (16 per index). - Events cold segment (three files per chunk: `events.pack`, `index.pack`, `index.hash`). - **Freeze transition** — a background goroutine that converts an active store's contents to immutable files and deletes the active store.
Three transitions in total: two per chunk (LFS, events) and one per index (RecSplit). -- **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze. `chunk_first_ledger(C)` always ends in `..._02`; `chunk_last_ledger(C)` always ends in `..._01`. No partial chunks — every chunk on disk is a full 10_000-ledger chunk. +- **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze. `first_ledger_in_chunk(chunk_id)` always ends in `..._02`; `last_ledger_in_chunk(chunk_id)` always ends in `..._01`. No partial chunks — every chunk on disk is a full 10_000-ledger chunk. - **Txhash index** (a.k.a. "index") — `CHUNKS_PER_TXHASH_INDEX` consecutive chunks. Atomic unit of retention pruning. Formulas in [Geometry](#geometry). - **Chunk boundary** — the moment ingestion commits the last ledger of a chunk. Triggers background LFS + events freeze for that chunk. - **Index boundary** — the moment ingestion commits the last ledger of an index. Triggers background RecSplit build for that index. Every index boundary is also a chunk boundary.
@@ -64,24 +65,24 @@ LEDGERS_PER_CHUNK = 10_000 LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK # at cpi=1000 this is 10_000_000 -chunk_id(ledger_seq) = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK +chunk_id_of_ledger(ledger_seq) = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK # 56_342_637 → (56_342_637 - 2) // 10_000 = 5634 -chunk_first_ledger(C) = (C * LEDGERS_PER_CHUNK) + GENESIS_LEDGER - # chunk 5634 → (5634 * 10_000) + 2 = 56_340_002 — ends in ..._02 +first_ledger_in_chunk(chunk_id) = (chunk_id * LEDGERS_PER_CHUNK) + GENESIS_LEDGER + # chunk_id=5634 → (5634 * 10_000) + 2 = 56_340_002 — ends in ..._02 -chunk_last_ledger(C) = ((C + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) - # chunk 5634 → ((5635) * 10_000) + 1 = 56_350_001 — ends in ..._01 +last_ledger_in_chunk(chunk_id) = ((chunk_id + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) + # chunk_id=5634 → ((5635) * 10_000) + 1 = 56_350_001 — ends in ..._01 # (GENESIS_LEDGER - 1) = 1 is what keeps chunk ends in ..._01 -index_id_of_chunk(C) = C // CHUNKS_PER_TXHASH_INDEX - # chunk 5634 → 5 (at cpi=1000) +tx_index_id_of_chunk(chunk_id) = chunk_id // CHUNKS_PER_TXHASH_INDEX + # chunk_id=5634 → tx_index_id=5 (at cpi=1000) -index_first_ledger(N) = (N * LEDGERS_PER_INDEX) + GENESIS_LEDGER - # index 5 → (5 * 10_000_000) + 2 = 50_000_002 +first_ledger_in_tx_index(tx_index_id) = (tx_index_id * LEDGERS_PER_INDEX) + GENESIS_LEDGER + # tx_index_id=5 → (5 * 10_000_000) + 2 = 50_000_002 -index_last_ledger(N) = ((N + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) - # index 5 → ((6) * 10_000_000) + 1 = 60_000_001 +last_ledger_in_tx_index(tx_index_id) = ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) + # tx_index_id=5 → ((6) * 10_000_000) + 1 = 60_000_001 ``` --- @@ -103,7 +104,7 @@ Two keys are stored on first start and enforced on every subsequent start. Chang | `CHUNKS_PER_TXHASH_INDEX` | `config:chunks_per_txhash_index` | first run | Fatal if changed. 
| | `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | -Source selection (BSB vs captive core) is determined per-startup by `[BACKFILL.BSB]` presence. Operators may add or remove BSB between runs without wiping — the daemon extends coverage forward from `derive_phase1_low_water` regardless of source. Retention immutability is what constrains the data envelope; source choice doesn't need its own immutability gate. +Source selection (BSB vs captive core) is determined per-startup by `[BACKFILL.BSB]` presence. Operators may add or remove BSB between runs without wiping — the daemon extends coverage forward from `phase1_coverage_end_ledger` regardless of source. Retention immutability is what constrains the data envelope; source choice doesn't need its own immutability gate. ### Streaming-Specific TOML @@ -169,12 +170,12 @@ def validate_config(config, meta_store): (or wipes the datadir for an immutable-key change) and re-invokes. """ cpi = config.backfill.chunks_per_txhash_index - R = config.streaming.retention_ledgers + retention_ledgers = config.streaming.retention_ledgers ledgers_per_index = cpi * LEDGERS_PER_CHUNK # 1. Retention shape. - if R != 0 and (R <= 0 or (R % ledgers_per_index) != 0): - fatal(f"RETENTION_LEDGERS={R} must be 0 or a positive multiple of " + if retention_ledgers != 0 and (retention_ledgers <= 0 or (retention_ledgers % ledgers_per_index) != 0): + fatal(f"RETENTION_LEDGERS={retention_ledgers} must be 0 or a positive multiple of " f"LEDGERS_PER_INDEX={ledgers_per_index}. Valid values at this cpi: " f"0, {ledgers_per_index}, {2*ledgers_per_index}, ...") @@ -186,7 +187,7 @@ def validate_config(config, meta_store): # 3. Immutable keys. Store on first run; fatal on mismatch thereafter. 
_enforce_immutable(meta_store, "config:chunks_per_txhash_index", str(cpi)) - _enforce_immutable(meta_store, "config:retention_ledgers", str(R)) + _enforce_immutable(meta_store, "config:retention_ledgers", str(retention_ledgers)) def _enforce_immutable(meta_store, key, current_value): @@ -263,10 +264,10 @@ stellar-rpc --config /etc/stellar-rpc/config.toml **Crash recovery**: -- Crash during Phase 1's BSB download at chunk 3_457: on restart, `derive_phase1_low_water` walks `:lfs` flags, returns the end of the contiguous prefix (say, chunk 3_200). `phase1_catchup` re-enters, `compute_backfill_range` produces a new range, backfill re-runs from chunk 3_201 forward. Chunks that already had `:lfs` are skipped via per-chunk idempotency. +- Crash during Phase 1's BSB download at chunk 3_457: on restart, `phase1_coverage_end_ledger` walks `:lfs` flags, returns the end of the contiguous prefix (say, chunk 3_200). `phase1_catchup` re-enters, `compute_backfill_chunk_range` produces a new range, backfill re-runs from chunk 3_201 forward. Chunks that already had `:lfs` are skipped via per-chunk idempotency. - Crash after all chunks written but before index 3's RecSplit built: on restart, Phase 1 sees `index:3:txhash` absent → backfill's DAG re-runs the RecSplit build from the `.bin` files. Succeeds. - Crash while `.bin` files from the trailing index are being loaded into the active txhash store (Phase 2): on restart, Phase 2 re-runs. Chunks that were already loaded had their `:txhash` flag deleted and `.bin` file removed — the loop skips them via the flag check. Chunks not yet loaded retain their `:txhash` flag and `.bin` file — the loop picks them up. -- Crash between low-water commit and chunk freeze during live ingestion: `streaming:last_committed_ledger = chunk_last_ledger(C)` but `chunk:{C}:lfs` absent. Phase 3 triggers the missing transitions when the daemon restarts, before Phase 4 re-enters. 
+- Crash between per-ledger checkpoint commit and chunk freeze during live ingestion: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)` but `chunk:{chunk_id}:lfs` absent. Phase 3 triggers the missing transitions when the daemon restarts, before Phase 4 re-enters. In every case: the daemon reaches a consistent state after one restart. No manual intervention. Dangling `.bin` files from incomplete indexes are cleaned by Phase 2 once the owning index progresses further. @@ -304,10 +305,10 @@ FORMAT = "text" - Daemon starts. Validates config; stores immutable keys. - `[BACKFILL.BSB]` absent → Phase 1 source is `CaptiveCoreSource`. - Source samples tip via HTTP GET against `HISTORY_ARCHIVE_URLS`: tip = `56_342_637`. -- `compute_backfill_range(L=1, T=56_342_637, R=10_000, cpi=1)` — leapfrog lands at `index_first_ledger(index_id(T - R))`: - - `T - R = 56_332_637`. - - `chunk_id(56_332_637) = 5_633`. `index_id_of_chunk(5_633) = 5_633` (cpi=1). - - `index_first_ledger(5_633) = 56_330_002`. +- `compute_backfill_chunk_range(last_committed_ledger=1, network_tip_ledger=56_342_637, retention_ledgers=10_000, cpi=1)` — leapfrog lands at `first_ledger_in_tx_index(tx_index_id_of_chunk(chunk_id_of_ledger(network_tip_ledger - retention_ledgers)))`: + - `network_tip_ledger - retention_ledgers = 56_332_637`. + - `chunk_id_of_ledger(56_332_637) = 5_633`. `tx_index_id_of_chunk(5_633) = 5_633` (cpi=1). + - `first_ledger_in_tx_index(5_633) = 56_330_002`. - Backfill range is chunks `5_633..5_633` (one chunk to close the gap to tip at chunk 5_633, which is `last_complete_chunk_at(56_342_637)`). Up to ~10_000 ledgers of archive-catchup via captive core. Takes ~3–8 minutes. - Phase 2 loads the one `.bin` file into the active txhash store, deletes it. - Phase 4 opens active stores, starts captive core for live streaming from `resume_ledger = 56_340_002`, enters ingestion loop. @@ -325,7 +326,7 @@ FORMAT = "text" **Subsequent restart** (say after 1h downtime): -- Daemon starts. 
`streaming:last_committed_ledger` is present from the prior run. Phase 1 samples tip; `T - L` is ~600 ledgers (10 min at 6s) — less than one chunk → Phase 1 exits immediately. +- Daemon starts. `streaming:last_committed_ledger` is present from the prior run. Phase 1 samples tip; `network_tip_ledger - last_committed_ledger` is ~600 ledgers (1h at 6s per ledger) — less than one chunk → Phase 1 exits immediately. - Phase 2 finds no `.bin` files (deleted on first start). No-op. - Phase 3 reconciles any orphan active stores from the crash — typical case is completing an interrupted chunk freeze. - Phase 4 re-opens active stores, starts captive core, re-enters the ingestion loop. Captive core's own archive-catchup closes the 600-ledger gap in ~seconds, then cadence settles to live closes. @@ -333,7 +334,7 @@ FORMAT = "text" **Crash recovery within Phase 1 (first-ever start)**: - Captive core subprocess crashes mid-archive-catchup: daemon retries spinning captive core up. No persisted state to roll back — the partial chunk's data was in the active store's WAL; captive core re-archive-catches-up from whatever ledger the WAL wasn't past. -- Daemon process itself crashes: on restart, `derive_phase1_low_water` returns whatever contiguous prefix exists. Phase 1 re-enters. Eventually completes. +- Daemon process itself crashes: on restart, `phase1_coverage_end_ledger` returns whatever contiguous prefix exists. Phase 1 re-enters. Eventually completes. **Query behavior during Phase 1**: `HTTP 4xx` for all three query endpoints. `getHealth` reports `catching_up` + the drift. @@ -345,7 +346,7 @@ Single RocksDB instance, WAL always enabled. Authoritative source for every star | Key | Value | Written when | |---|---|---| -| `streaming:last_committed_ledger` | uint32 (big-endian) | First written at top of Phase 4 to `chunk_last_ledger(derive_phase1_low_water)`; subsequently after every committed live ledger.
**Not updated during Phases 1–3.** Phase 1 progress is tracked by `chunk:{C}:lfs` flags alone. | +| `streaming:last_committed_ledger` | uint32 (big-endian) | First written at top of Phase 4 to `last_ledger_in_chunk(phase1_coverage_end_ledger)`; subsequently after every committed live ledger. **Not updated during Phases 1–3.** Phase 1 progress is tracked by `chunk:{chunk_id}:lfs` flags alone. | | `config:retention_ledgers` | decimal string | First run (stored); enforced on subsequent starts. | ### Keys Shared with Backfill @@ -353,41 +354,41 @@ Single RocksDB instance, WAL always enabled. Authoritative source for every star | Key | Semantics | |---|---| | `config:chunks_per_txhash_index` | Set on first run by whichever invocation runs first — here, first daemon start. | -| `chunk:{C:08d}:lfs` | Set after ledger pack file fsync. | -| `chunk:{C:08d}:events` | Set after events cold segment fsync. | -| `chunk:{C:08d}:txhash` | Set by backfill subroutine after `.bin` fsync; deleted during Phase 2 hydration after `.bin` is loaded into RocksDB. Streaming live path does not write this key — streaming writes txhash directly to the active RocksDB txhash store. | -| `index:{N:08d}:txhash` | `"1"` after all 16 RecSplit CF `.idx` files built and fsynced. Transitions to `"deleting"` at the start of `prune_index`, deleted entirely when prune completes. Query routing treats `"deleting"` the same as absent. | +| `chunk:{chunk_id:08d}:lfs` | Set after ledger pack file fsync. | +| `chunk:{chunk_id:08d}:events` | Set after events cold segment fsync. | +| `chunk:{chunk_id:08d}:txhash` | Set by backfill subroutine after `.bin` fsync; deleted during Phase 2 hydration after `.bin` is loaded into RocksDB. Streaming live path does not write this key — streaming writes txhash directly to the active RocksDB txhash store. | +| `index:{tx_index_id:08d}:txhash` | `"1"` after all 16 RecSplit CF `.idx` files built and fsynced. 
Transitions to `"deleting"` at the start of `prune_tx_index`, deleted entirely when prune completes. Query routing treats `"deleting"` the same as absent. | ### Key Lifecycle in Streaming ``` Phase 1 (backfill subroutine): - chunk:{C}:lfs = "1" (after pack fsync) - chunk:{C}:txhash = "1" (after .bin fsync) # only present for chunks that still have .bin on disk - chunk:{C}:events = "1" (after cold segment fsync) - index:{N}:txhash = "1" (after RecSplit, when all chunks of index N are done in Phase 1) + chunk:{chunk_id}:lfs = "1" (after pack fsync) + chunk:{chunk_id}:txhash = "1" (after .bin fsync) # only present for chunks that still have .bin on disk + chunk:{chunk_id}:events = "1" (after cold segment fsync) + index:{tx_index_id}:txhash = "1" (after RecSplit, when all chunks of tx_index_id are done in Phase 1) Phase 2 (.bin hydration — see Startup Sequence): For every chunk with :txhash flag and a .bin file: load .bin into RocksDB txhash store - delete chunk:{C}:txhash flag + delete chunk:{chunk_id}:txhash flag delete .bin file - After Phase 2, no chunk:{C}:txhash flags and no .bin files remain. + After Phase 2, no chunk:{chunk_id}:txhash flags and no .bin files remain. 
Live path (per ledger): streaming:last_committed_ledger = ledger_seq (after all 3 active stores commit) Live path (per chunk, background): - chunk:{C}:lfs = "1" (after pack fsync) - chunk:{C}:events = "1" (after cold segment fsync) + chunk:{chunk_id}:lfs = "1" (after pack fsync) + chunk:{chunk_id}:events = "1" (after cold segment fsync) Live path (per index, background): - index:{N}:txhash = "1" (after RecSplit + verify) + index:{tx_index_id}:txhash = "1" (after RecSplit + verify) -Pruning (background, when index N is past retention): - index:{N}:txhash = "deleting" (FIRST; queries now return 4xx for this index) - [delete all files + per-chunk :lfs + :events keys for index N] - index:{N}:txhash → deleted (LAST) +Pruning (background, when tx_index_id is past retention): + index:{tx_index_id}:txhash = "deleting" (FIRST; queries now return 4xx for this index) + [delete all files + per-chunk :lfs + :events keys for tx_index_id] + index:{tx_index_id}:txhash → deleted (LAST) ``` ### Flag Semantics @@ -403,8 +404,8 @@ The daemon maintains three active stores for the current ingestion position. All | Store | Path | Key | Value | Transition cadence | |---|---|---|---|---| -| Ledger | `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{C:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | -| TxHash | `{ACTIVE_STORAGE.PATH}/txhash-store-index-{N:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every `LEDGERS_PER_INDEX` ledgers (index) | +| Ledger | `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{chunk_id:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | +| TxHash | `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every `LEDGERS_PER_INDEX` ledgers (index) | | Events | In-memory hot segment + persisted index deltas | Sequential event ID | Event XDR + metadata | Every 10_000 ledgers (chunk) | - Ledger and txhash stores are RocksDB. WAL required. 
@@ -414,7 +415,7 @@ The daemon maintains three active stores for the current ingestion position. All ### Store Pre-creation - The store for the next chunk / index is pre-created before the boundary is reached, so boundary-time work is a pointer swap only. -- Creation timing: when the ingestion loop commits a ledger within a configurable window before the boundary (e.g., `chunk_last_ledger(C) - 1_000`). The window must be large enough that store initialization (directory mkdir + RocksDB open + column family setup) completes before the boundary ledger arrives, and small enough that pre-creation doesn't run prematurely for chunks the daemon may never reach. +- Creation timing: when the ingestion loop commits a ledger within a configurable window before the boundary (e.g., `last_ledger_in_chunk(chunk_id) - 1_000`). The window must be large enough that store initialization (directory mkdir + RocksDB open + column family setup) completes before the boundary ledger arrives, and small enough that pre-creation doesn't run prematurely for chunks the daemon may never reach. - On restart, a pre-created store is expected to exist — Phase 3 treats `resume_chunk + 1` (and `resume_index + 1`) as active, not an orphan. ### Max Concurrent Stores @@ -478,7 +479,7 @@ class CaptiveCoreSource(LedgerSource): ### Source Selection Rule ```python -def choose_phase1_source(config): +def select_phase1_ledger_source(config): """Called once at the top of Phase 1. Re-evaluated per startup.""" if config.backfill.bsb is not None: return BSBSource(config.backfill.bsb) @@ -509,12 +510,12 @@ Four sequential phases, same code path for first start and every restart. The fi "Phase" here refers to the startup ordering only. Once Phase 4 is entered, there's no Phase 5 — the daemon is in live-streaming steady state. 
```python -def run_streaming(config): +def run_streaming_daemon(config): meta_store = open_meta_store(config) validate_config(config, meta_store) # immutable key enforcement # ── Phase 1: catch up from last_committed_ledger (or genesis) to tip ── - source = choose_phase1_source(config) + source = select_phase1_ledger_source(config) phase1_catchup(config, meta_store, source) # ── Phase 2: load any .bin files left by Phase 1 into RocksDB; delete them ── @@ -524,7 +525,7 @@ def run_streaming(config): phase3_reconcile_orphans(config, meta_store) # ── Phase 4: open active stores, spawn lifecycle goroutine, start captive core, ingest ── - phase4_ingest(config, meta_store) + phase4_live_ingest(config, meta_store) ``` Query serving is gated on Phase 4 being reached — see [Query Contract](#query-contract). @@ -533,7 +534,7 @@ Query serving is gated on Phase 4 being reached — see [Query Contract](#query- Runs the backfill subroutine (`run_backfill` from `01-backfill-workflow.md`) once per source-tip sample, until the gap closes to less than one chunk. -- Phase 1's unit of work is an entire chunk — never a partial chunk. Backfill's DAG dispatches integer chunk IDs; `process_chunk(C)` ingests ledgers `chunk_first_ledger(C)..chunk_last_ledger(C)` inclusive. Every chunk ever persisted by Phase 1 starts at `..._02` and ends at `..._01`. This is the chunk-alignment invariant the no-gaps guarantee rests on. +- Phase 1's unit of work is an entire chunk — never a partial chunk. Backfill's DAG dispatches integer chunk IDs; `process_chunk(chunk_id)` ingests ledgers `first_ledger_in_chunk(chunk_id)..last_ledger_in_chunk(chunk_id)` inclusive. Every chunk ever persisted by Phase 1 starts at `..._02` and ends at `..._01`. This is the chunk-alignment invariant the no-gaps guarantee rests on. - Works the same whether the source is BSB (parallel) or captive core (sequential) — per-chunk work is atomic in both cases. 
```python @@ -542,112 +543,119 @@ def phase1_catchup(config, meta_store, source): Close the gap between what's already on disk and the current network tip. Control flow (outer loop): - 1. derive L from :lfs flags on disk (NOT from streaming:last_committed_ledger — - that key isn't written during Phases 1–3). - 2. sample the current tip T from the source. - 3. if T - L is less than one chunk, exit (captive core will close the residual - few-thousand-ledger gap in Phase 4 via its own archive-catchup). + 1. derive last_committed_ledger from :lfs flags on disk (NOT from + streaming:last_committed_ledger — that key isn't written during Phases 1–3). + 2. sample the current network_tip_ledger from the source. + 3. if (network_tip_ledger - last_committed_ledger) is less than one chunk, exit + (captive core will close the residual few-thousand-ledger gap in Phase 4 via + its own archive-catchup). 4. compute the chunk range to backfill this iteration. Leapfrog-alignment inside - compute_backfill_range guarantees range_start is the first chunk of an index - when retention is configured. + compute_backfill_chunk_range guarantees range_start is the first chunk of an + index when retention is configured. 5. invoke backfill's static-DAG subroutine. Backfill's own per-chunk idempotency + crash recovery handle mid-iteration crashes. - 6. re-derive L from :lfs flags. Loop. + 6. re-derive last_committed_ledger from :lfs flags. Loop. The while loop is needed because the network tip advances while we catch up — each run_backfill call covers the range known at the start of that iteration, and subsequent iterations close whatever new ledgers accumulated. 
""" cpi = config.backfill.chunks_per_txhash_index - R = config.streaming.retention_ledgers - L = derive_phase1_low_water(meta_store) + retention_ledgers = config.streaming.retention_ledgers + last_committed_ledger = phase1_coverage_end_ledger(meta_store) while True: - T = source.tip() - if (T - L) < LEDGERS_PER_CHUNK: # less than one chunk remaining - break + network_tip_ledger = source.tip() + if (network_tip_ledger - last_committed_ledger) < LEDGERS_PER_CHUNK: + break # less than one chunk remaining - range_start, range_end = compute_backfill_range(L, T, R, cpi) - if range_end < range_start: + range_start_chunk_id, range_end_chunk_id = compute_backfill_chunk_range( + last_committed_ledger, network_tip_ledger, retention_ledgers, cpi) + if range_end_chunk_id < range_start_chunk_id: # Leapfrog landed past the last complete chunk at tip — happens when the # network hasn't produced a full chunk past the retention line yet. Exit. break - # Backfill's DAG ingests [range_start..range_end] inclusive. Per-chunk idempotent: - # chunks with :lfs already set are skipped. Crash here resumes on restart. - run_backfill(config, range_start, range_end, source=source) + # Backfill's DAG ingests [range_start_chunk_id..range_end_chunk_id] inclusive. + # Per-chunk idempotent: chunks with :lfs already set are skipped. Crash here + # resumes on restart. + run_backfill(config, range_start_chunk_id, range_end_chunk_id, source=source) - # Re-derive L — not just range_end — because a mid-iteration crash could leave - # holes in [range_start..range_end] that the contiguous-prefix scan catches. - L = derive_phase1_low_water(meta_store) + # Re-derive last_committed_ledger — not just range_end_chunk_id — because a + # mid-iteration crash could leave holes in [range_start..range_end] that the + # contiguous-prefix scan catches. 
+ last_committed_ledger = phase1_coverage_end_ledger(meta_store) -def compute_backfill_range(L, T, R, cpi): +def compute_backfill_chunk_range(last_committed_ledger, network_tip_ledger, retention_ledgers, cpi): """ - Returns (range_start_chunk, range_end_chunk). Leapfrog aligns DOWN to the first chunk - of the index containing (T - R). No-op when R = 0 (full history archive). - - - R is a multiple of LEDGERS_PER_INDEX (validated at startup), but T itself is arbitrary - — so T - R is NOT on an index boundary in general. Leapfrog must explicitly round - T - R down to the first ledger of its containing index. That rounded value is the - new head of coverage; every earlier ledger is past retention and skipped. - - Worst-case: up to LEDGERS_PER_INDEX - 1 ledgers past the strict retention line are + Returns (range_start_chunk_id, range_end_chunk_id). Leapfrog aligns DOWN to the + first chunk of the tx index containing (network_tip_ledger - retention_ledgers). + No-op when retention_ledgers = 0 (full history archive). + + - retention_ledgers is a multiple of LEDGERS_PER_INDEX (validated at startup), but + network_tip_ledger itself is arbitrary — so (network_tip_ledger - retention_ledgers) + is NOT on a tx-index boundary in general. Leapfrog must explicitly round that value + down to the first ledger of its containing tx index. That rounded value is the new + head of coverage; every earlier ledger is past retention and skipped. + - Worst case: up to LEDGERS_PER_INDEX - 1 ledgers past the strict retention line are ingested and held on disk. At cpi=1000 this is ~10M ledgers; at cpi=1 it is ~10k. """ - gap_start_ledger = L + 1 - if R > 0: - target_ledger = max(T - R, GENESIS_LEDGER) - target_chunk = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - target_index = target_chunk // cpi - # First ledger of target_index = (target_index * LEDGERS_PER_INDEX) + GENESIS_LEDGER. 
- leapfrog_start_ledger = (target_index * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER + gap_start_ledger = last_committed_ledger + 1 + if retention_ledgers > 0: + target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) + target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + target_tx_index_id = target_chunk_id // cpi + # First ledger of target_tx_index_id = (target_tx_index_id * LEDGERS_PER_INDEX) + GENESIS_LEDGER. + leapfrog_start_ledger = (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER else: leapfrog_start_ledger = GENESIS_LEDGER - range_start_ledger = max(gap_start_ledger, leapfrog_start_ledger) - range_start_chunk = (range_start_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - # range_end_chunk: largest C such that chunk_last_ledger(C) <= T. - # chunk_last_ledger(C) = ((C + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) - # <= T iff (C + 1) <= (T - (GENESIS_LEDGER - 1)) / LEDGERS_PER_CHUNK - # iff C <= ((T - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 - range_end_chunk = ((T - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 - return range_start_chunk, range_end_chunk + range_start_ledger = max(gap_start_ledger, leapfrog_start_ledger) + range_start_chunk_id = (range_start_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + # range_end_chunk_id: largest chunkId such that last_ledger_in_chunk(chunkId) + # is <= network_tip_ledger. 
+ # last_ledger_in_chunk(chunkId) = ((chunkId + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) + # <= network_tip_ledger iff (chunkId + 1) <= (network_tip_ledger - (GENESIS_LEDGER - 1)) / LEDGERS_PER_CHUNK + # iff chunkId <= ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 + range_end_chunk_id = ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 + return range_start_chunk_id, range_end_chunk_id -def derive_phase1_low_water(meta_store): +def phase1_coverage_end_ledger(meta_store): """ Returns the last ledger of the contiguous tail of :lfs flags starting at the lowest chunk currently on disk. - - Finds min_chunk = lowest C with chunk:{C}:lfs set. - - Walks forward from min_chunk counting contiguous :lfs flags. Stops at the first gap. + - Finds min_chunk_id = lowest chunkId with chunk:{chunkId}:lfs set. + - Walks forward from min_chunk_id counting contiguous :lfs flags. Stops at the first gap. - Returns GENESIS_LEDGER - 1 if no :lfs flags exist at all. Contiguous-tail semantics matter because: - BSB workers complete chunks in parallel; a mid-Phase-1 crash can leave holes in the middle of the ingested range. Resuming from the highest :lfs would skip those holes and break the no-gaps invariant. - - Lifecycle pruning removes :lfs flags of past-retention indexes. The lowest remaining - :lfs after prune is naturally the head of surviving coverage — no separate tip sample - or leapfrog calculation needed here. + - Lifecycle pruning removes :lfs flags of past-retention tx indexes. The lowest + remaining :lfs after prune is naturally the head of surviving coverage — no separate + tip sample or leapfrog calculation needed here. Leapfrog decisions (where Phase 1 should start ingesting THIS run) are made separately - inside compute_backfill_range, which has access to the current tip sample. + inside compute_backfill_chunk_range, which has access to the current tip sample. 
""" - min_chunk = None + min_chunk_id = None for key in meta_store.iter_prefix("chunk:"): if not key.endswith(":lfs"): continue - C = parse_chunk_id(key) - if min_chunk is None or C < min_chunk: - min_chunk = C - if min_chunk is None: + chunk_id = parse_chunk_id(key) + if min_chunk_id is None or chunk_id < min_chunk_id: + min_chunk_id = chunk_id + if min_chunk_id is None: return GENESIS_LEDGER - 1 - C = min_chunk - while meta_store.has(f"chunk:{C:08d}:lfs"): - C += 1 - return chunk_last_ledger(C - 1) # last contiguous chunk + chunk_id = min_chunk_id + while meta_store.has(f"chunk:{chunk_id:08d}:lfs"): + chunk_id += 1 + return last_ledger_in_chunk(chunk_id - 1) # last contiguous chunk ``` **Worker concurrency**: `run_backfill` honors `source.max_parallelism()` when dispatching `process_chunk` tasks. With BSB this is GOMAXPROCS (unchanged from backfill today). With captive core it is 1 — the DAG dispatches chunks sequentially to avoid spawning multiple captive core subprocesses. @@ -658,7 +666,7 @@ def derive_phase1_low_water(meta_store): ### Phase 2 — Hydrate TxHash Data from `.bin` -Phase 1 may leave `.bin` files for chunks in the last (incomplete) index. Phase 2 loads each into the active txhash store and deletes the `.bin` file + its `chunk:{C}:txhash` flag. After Phase 2, no `.bin` files and no `chunk:{C}:txhash` flags remain. +Phase 1 may leave `.bin` files for chunks in the last (incomplete) index. Phase 2 loads each into the active txhash store and deletes the `.bin` file + its `chunk:{chunk_id}:txhash` flag. After Phase 2, no `.bin` files and no `chunk:{chunk_id}:txhash` flags remain. ```python def phase2_hydrate_txhash(config, meta_store): @@ -674,37 +682,38 @@ def phase2_hydrate_txhash(config, meta_store): """ cpi = config.backfill.chunks_per_txhash_index - # 1. Backfill may have completed an index (index:{N}:txhash = "1") before a crash - # prevented cleanup_txhash from deleting leftover .bin. Sweep those first. 
- for index_id in indexes_with_txhash_flag(meta_store): - for chunk_id in range(index_id * cpi, (index_id + 1) * cpi): + # 1. Backfill may have completed a tx index (index:{tx_index_id}:txhash = "1") before + # a crash prevented cleanup_txhash from deleting leftover .bin. Sweep those first. + for tx_index_id in tx_index_ids_with_txhash_flag(meta_store): + for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_if_exists(raw_txhash_path(chunk_id)) - # 2. Load .bin files for the current incomplete index (if any) into RocksDB txhash store. - N = current_incomplete_index(meta_store) - if N is None: + # 2. Load .bin files for the current incomplete tx index (if any) into the active + # txhash RocksDB. + incomplete_tx_index_id = current_incomplete_tx_index_id(meta_store) + if incomplete_tx_index_id is None: return - txhash_store = open_active_txhash_store(config, N) # WAL recovery; do NOT recreate + txhash_store = open_active_txhash_store(config, incomplete_tx_index_id) # WAL recovery; do NOT recreate try: - for chunk_id in range(N * cpi, (N + 1) * cpi): + for chunk_id in range(incomplete_tx_index_id * cpi, (incomplete_tx_index_id + 1) * cpi): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - continue # already loaded (flag cleared) + continue # already loaded (flag cleared) bin_path = raw_txhash_path(chunk_id) if os.path.exists(bin_path): - load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first - delete_if_exists(bin_path) # delete .bin second + load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first + delete_if_exists(bin_path) # delete .bin second # 3. Sweep orphan .bin files (flag already gone, .bin lingering from crash between # flag-delete and file-delete in a prior run). 
- for bin_file in scan_bin_files_for_index(N): + for bin_file in scan_bin_files_for_tx_index(incomplete_tx_index_id): if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): os.remove(bin_file) finally: - # Must close before returning — Phase 4's open_active_stores re-opens the same + # Must close before returning — Phase 4's open_active_stores_for_resume re-opens the same # directory, and RocksDB's directory flock would collide if this handle is still # open. WAL remains on disk; reopening is safe. txhash_store.close() @@ -723,61 +732,62 @@ def phase3_reconcile_orphans(config, meta_store): """ Finishes any mid-flight LFS flush, events freeze, or RecSplit build from a crashed run. - - Active store for resume_chunk: keep (Phase 4 will open it). - - Pre-created store for resume_chunk + 1: keep. + - Active store for resume_chunk_id: keep (Phase 4 will open it). + - Pre-created store for resume_chunk_id + 1: keep. - Orphaned ledger store: flag present → cleanup lingered; delete the store. - flag absent, chunk below resume_chunk → mid-flush crash; complete the flush. - flag absent, chunk above resume_chunk + 1 → orphan future store; delete. + flag absent, chunk below resume_chunk_id → mid-flush crash; complete the flush. + flag absent, chunk above resume_chunk_id + 1 → orphan future store; delete. - Orphaned txhash store: flag present → cleanup lingered; delete the store. - flag absent, all chunks of index N have :lfs set → spawn RecSplit build. + flag absent, all chunks of tx index tx_index_id have :lfs set → spawn RecSplit build. On a fresh datadir (no :lfs flags anywhere, Phase 1 had nothing to do) this is a no-op: - resume_ledger = GENESIS_LEDGER, resume_chunk = 0, no active stores on disk yet. + resume_ledger = GENESIS_LEDGER, resume_chunk_id = 0, no active stores on disk yet. 
""" # Derive resume_ledger the SAME way Phase 4 will — otherwise Phase 3 and Phase 4 can # disagree on which chunk's active store to preserve, causing Phase 4 to open a fresh # store while Phase 3's kept-active-store is left as an orphan. # - # Priority order (matches phase4_ingest): + # Priority order (matches phase4_live_ingest): # 1. streaming:last_committed_ledger if set (live-path crash mid-chunk or at boundary). - # 2. derive_phase1_low_water otherwise (first-start after Phase 1, or fresh datadir). + # 2. phase1_coverage_end_ledger otherwise (first-start after Phase 1, or fresh datadir). cpi = config.backfill.chunks_per_txhash_index - last_committed = meta_store.get("streaming:last_committed_ledger") - if last_committed is None: - last_committed = derive_phase1_low_water(meta_store) - resume_ledger = last_committed + 1 + last_committed_ledger = meta_store.get("streaming:last_committed_ledger") + if last_committed_ledger is None: + last_committed_ledger = phase1_coverage_end_ledger(meta_store) + resume_ledger = last_committed_ledger + 1 if resume_ledger < GENESIS_LEDGER: resume_ledger = GENESIS_LEDGER - resume_chunk = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK # Ledger stores for store_dir in scan_ledger_store_dirs(config): - C = parse_chunk_id_from_dir(store_dir) - if C == resume_chunk or C == resume_chunk + 1: + chunk_id = parse_chunk_id_from_dir(store_dir) + if chunk_id == resume_chunk_id or chunk_id == resume_chunk_id + 1: continue # active or pre-created - if meta_store.has(f"chunk:{C:08d}:lfs"): + if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): delete_dir(store_dir) # orphaned post-flush cleanup - elif C < resume_chunk: - complete_lfs_flush(store_dir, C, meta_store) # mid-flush crash; finish + elif chunk_id < resume_chunk_id: + finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store) # mid-flush crash; finish else: delete_dir(store_dir) # orphan future store # 
Txhash stores - resume_index = resume_chunk // cpi + resume_tx_index_id = resume_chunk_id // cpi for store_dir in scan_txhash_store_dirs(config): - N = parse_index_id_from_dir(store_dir) - if N == resume_index or N == resume_index + 1: + tx_index_id = parse_tx_index_id_from_dir(store_dir) + if tx_index_id == resume_tx_index_id or tx_index_id == resume_tx_index_id + 1: continue # active or pre-created - if meta_store.has(f"index:{N:08d}:txhash"): + if meta_store.has(f"index:{tx_index_id:08d}:txhash"): delete_dir(store_dir) # RecSplit done, cleanup lingered - elif all_chunks_frozen(meta_store, N, cpi): - # RecSplit build for N was never started or was interrupted. Open the store - # and spawn the build — pass the handle, not the directory path, because - # recsplit_transition reads from the store and closes it on completion. - transitioning_txhash = open_active_txhash_store(config, N) - run_in_background(recsplit_transition, N, transitioning_txhash, meta_store) + elif all_chunks_in_tx_index_have_lfs_flag(meta_store, tx_index_id, cpi): + # RecSplit build for tx_index_id was never started or was interrupted. Open + # the store and spawn the build — pass the handle, not the directory path, + # because build_tx_index_recsplit_files reads from the store and closes it + # on completion. + transitioning_txhash = open_active_txhash_store(config, tx_index_id) + run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash, meta_store) # Events hot segment: truncate any persisted deltas beyond resume_ledger - 1. # Prevents duplicate event IDs when Phase 4 replays the first live ledger. @@ -789,17 +799,17 @@ def phase3_reconcile_orphans(config, meta_store): Opens active stores for the resume position, spawns the lifecycle goroutine, starts captive core, and enters the ingestion loop. Query serving starts here (see [Query Contract](#query-contract)). 
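Phase 3 and Phase 4 must agree on the resume position, so the derivation they share can be sketched as one pure function. This is an illustrative sketch only: `derive_resume_position` is a hypothetical helper name, not part of the design, and `GENESIS_LEDGER = 2` is an assumption (the real value is defined in `01-backfill-workflow.md`). The `coverage_end_ledger` argument stands in for the result of `phase1_coverage_end_ledger(meta_store)`.

```python
from typing import Optional, Tuple

# Assumed constants: not copied from the real codebase.
GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000


def derive_resume_position(
    last_committed_ledger: Optional[int],
    coverage_end_ledger: int,
    chunks_per_txhash_index: int,
) -> Tuple[int, int, int]:
    """Hypothetical helper: returns (resume_ledger, resume_chunk_id, resume_tx_index_id).

    Priority order mirrors the pseudocode: the streaming checkpoint wins when it
    is set; otherwise fall back to the end of Phase 1's contiguous coverage.
    """
    if last_committed_ledger is None:
        last_committed_ledger = coverage_end_ledger
    # Clamp so a fresh datadir (coverage end = GENESIS_LEDGER - 1) resumes at genesis.
    resume_ledger = max(last_committed_ledger + 1, GENESIS_LEDGER)
    resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK
    resume_tx_index_id = resume_chunk_id // chunks_per_txhash_index
    return resume_ledger, resume_chunk_id, resume_tx_index_id


# Fresh datadir: no checkpoint, no :lfs flags.
print(derive_resume_position(None, GENESIS_LEDGER - 1, 1000))   # (2, 0, 0)
# Mid-chunk restart: the checkpoint takes priority over Phase 1 coverage.
print(derive_resume_position(25_001, 20_001, 1000))             # (25002, 2, 0)
# First start after Phase 1 finished exactly one tx index of 1000 chunks.
print(derive_resume_position(None, 10_000_001, 1000))           # (10000002, 1000, 1)
```

Routing both phases through a single function like this is one way to rule out the disagreement the Phase 3 comments warn about; the design itself only requires that the two derivations match.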
```python -def phase4_ingest(config, meta_store): - last_committed = meta_store.get("streaming:last_committed_ledger") - if last_committed is None: +def phase4_live_ingest(config, meta_store): + last_committed_ledger = meta_store.get("streaming:last_committed_ledger") + if last_committed_ledger is None: # First start after Phase 1: set checkpoint to end of Phase 1's coverage. - last_committed = derive_phase1_low_water(meta_store) - meta_store.put("streaming:last_committed_ledger", last_committed) - resume_ledger = last_committed + 1 + last_committed_ledger = phase1_coverage_end_ledger(meta_store) + meta_store.put("streaming:last_committed_ledger", last_committed_ledger) + resume_ledger = last_committed_ledger + 1 - active_stores = open_active_stores(config, meta_store, resume_ledger) + active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) - run_in_background(lifecycle_loop, config, meta_store) + run_in_background(run_prune_lifecycle_loop, config, meta_store) # Prime captive core for unbounded stream from resume_ledger. ledger_backend = make_ledger_backend(config.streaming.captive_core_config) @@ -807,36 +817,36 @@ def phase4_ingest(config, meta_store): set_daemon_ready() # in-memory flag; unblocks queries - run_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) + run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) -def open_active_stores(config, meta_store, resume_ledger): +def open_active_stores_for_resume(config, meta_store, resume_ledger): """ - Open or create the three active stores for resume_ledger's chunk + index. Also - pre-create the next chunk's / next index's stores up front so the first chunk + Open or create the three active stores for resume_ledger's chunk + tx index. Also + pre-create the next chunk's / next tx index's stores up front so the first chunk rollover doesn't pay creation latency. - - Ledger active: per-chunk RocksDB for chunk_id(resume_ledger). 
WAL-recovered + - Ledger active: per-chunk RocksDB for chunk_id_of_ledger(resume_ledger). WAL-recovered if the directory exists (mid-chunk restart); fresh-created otherwise. - - Events hot segment: in-memory for chunk_id(resume_ledger). If persisted deltas + - Events hot segment: in-memory for chunk_id_of_ledger(resume_ledger). If persisted deltas exist for this chunk (mid-chunk restart), replay them to rebuild bitmaps. Phase 3 already truncated anything past last_committed_ledger, so replay is safe. - - TxHash active: per-index RocksDB for index_id(chunk_id(resume_ledger)). May + - TxHash active: per-index RocksDB for tx_index_id_of_chunk(chunk_id_of_ledger(resume_ledger)). May already contain data from Phase 2's .bin hydration (which closed the handle before returning — see Phase 2 pseudocode). WAL-recovered on reopen. - - Pre-created: also open/create chunk_id + 1 and index_id + 1 stores so the + - Pre-created: also open/create (chunk_id + 1) and (tx_index_id + 1) stores so the first boundary rollover is a pointer swap only. 
""" - resume_chunk = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - resume_index = resume_chunk // config.backfill.chunks_per_txhash_index + resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + resume_tx_index_id = resume_chunk_id // config.backfill.chunks_per_txhash_index return ActiveStores( - ledger = open_or_create_ledger_store(config, resume_chunk), - ledger_next = open_or_create_ledger_store(config, resume_chunk + 1), - events = open_or_create_events_hot_segment(config, meta_store, resume_chunk, resume_ledger), - events_next = open_or_create_events_hot_segment(config, meta_store, resume_chunk + 1, None), - txhash = open_or_create_txhash_store(config, resume_index), - txhash_next = open_or_create_txhash_store(config, resume_index + 1), + ledger = open_or_create_ledger_store(config, resume_chunk_id), + ledger_next = open_or_create_ledger_store(config, resume_chunk_id + 1), + events = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id, resume_ledger), + events_next = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id + 1, None), + txhash = open_or_create_txhash_store(config, resume_tx_index_id), + txhash_next = open_or_create_txhash_store(config, resume_tx_index_id + 1), ) ``` @@ -849,7 +859,7 @@ Captive core takes 4–5 minutes to spin up and start emitting at `resume_ledger Single goroutine. Pull-based: the daemon drives sequential `GetLedger(seq)` calls. Same code path drains captive core's internal buffer during catchup and switches cadence to live closes (~5 s per ledger) once caught up. ```python -def run_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): +def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): """ Sequential pull-based live ingestion. The daemon stays here until process exit. 
@@ -862,43 +872,44 @@ def run_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume owns everything up to and including seq' signal. 5. If seq completes a chunk, fire on_chunk_boundary (non-blocking — freeze transitions run in background). - 6. If seq completes an index, fire on_index_boundary — RecSplit build kicks off. + 6. If seq completes an index, fire on_tx_index_boundary — RecSplit build kicks off. 7. seq += 1. Loop. Immutable config values (cpi) are read once outside the loop — never per ledger. """ cpi = config.backfill.chunks_per_txhash_index # immutable; read once at loop entry - seq = resume_ledger + ledger_seq = resume_ledger while True: - lcm = ledger_backend.GetLedger(seq) # blocks until ledger seq available + lcm = ledger_backend.GetLedger(ledger_seq) # blocks until this ledger is available # Write to all three active stores in parallel. Order: fan out, wait for all. # Each store is idempotent on re-write of the same ledger (crash-safe). wait_all( - run_in_background(write_ledger_store, active_stores.ledger, seq, lcm), - run_in_background(write_txhash_store, active_stores.txhash, seq, lcm), - run_in_background(write_events_hot_segment, active_stores.events, seq, lcm), + run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm), + run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm), + run_in_background(write_events_hot_segment, active_stores.events, ledger_seq, lcm), ) # Commit the per-ledger checkpoint (streaming:last_committed_ledger) only AFTER # all three active stores have durably committed the ledger. This is the key # atomic boundary for Phase 4 crash recovery — the checkpoint is the sole # 'the daemon owns everything up to and including this ledger' signal. It's NOT - # the same as Phase 1's low-water (which derives from :lfs flags). - meta_store.put("streaming:last_committed_ledger", seq) + # the same as Phase 1's coverage-end-ledger (which derives from :lfs flags). 
+ meta_store.put("streaming:last_committed_ledger", ledger_seq) # Chunk rollover: hand off to background LFS + events freeze transitions. - C = (seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - if seq == chunk_last_ledger(C): - on_chunk_boundary(C, active_stores, meta_store) - - # Index rollover — every index boundary is also a chunk boundary, so this runs - # AFTER on_chunk_boundary has already dispatched the last chunk's freeze transitions. - N = C // cpi - if seq == index_last_ledger(N): - on_index_boundary(N, active_stores, meta_store) - - seq += 1 + chunk_id = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + if ledger_seq == last_ledger_in_chunk(chunk_id): + on_chunk_boundary(chunk_id, active_stores, meta_store) + + # Tx-index rollover — every tx-index boundary is also a chunk boundary, so this + # runs AFTER on_chunk_boundary has already dispatched the last chunk's freeze + # transitions. + tx_index_id = chunk_id // cpi + if ledger_seq == last_ledger_in_tx_index(tx_index_id): + on_tx_index_boundary(tx_index_id, active_stores, meta_store) + + ledger_seq += 1 ``` Each per-store write is atomic: RocksDB WriteBatch + WAL for ledger and txhash stores; atomic commit of events hot-segment + persisted deltas. Key/value schemas are in [Active Store Architecture](#active-store-architecture). @@ -917,45 +928,45 @@ Streaming's freeze transitions never produce `.bin` files. `.bin` files exist on ### Concurrency Model -- **`active_stores` is the ingestion loop's owned state.** Fields (`ledger`, `ledger_next`, `events`, `events_next`, `txhash`, `txhash_next`) are mutated only by the ingestion loop thread — specifically inside `on_chunk_boundary` and `on_index_boundary`. Freeze transitions receive a handle by value at spawn time and never read back through `active_stores`. 
+- **`active_stores` is the ingestion loop's owned state.** Fields (`ledger`, `ledger_next`, `events`, `events_next`, `txhash`, `txhash_next`) are mutated only by the ingestion loop thread — specifically inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze transitions receive a handle by value at spawn time and never read back through `active_stores`. - **Meta-store is single-writer.** Meta-store flag writes come from: the ingestion loop (per-ledger checkpoint), freeze transitions (artifact `:lfs` / `:events` / `:txhash` flags after fsync), and the lifecycle loop (`"deleting"` marker + key delete during prune). Go's `sync.Mutex` inside the meta-store wrapper + RocksDB's own single-writer semantics keep these serialized. -- **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit). Implementation: an unbuffered `chan struct{}` per kind, or equivalently a `sync.Mutex`. `wait_for_lfs_complete()` acquires; `signal_lfs_complete()` at the end of `lfs_transition` releases. Second transition starts only after the first releases. Not a `sync.WaitGroup` — that would wait for ALL transitions globally, wrong semantics. +- **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit). Implementation: an unbuffered `chan struct{}` per kind, or equivalently a `sync.Mutex`. `wait_for_lfs_complete()` acquires; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases. Second transition starts only after the first releases. Not a `sync.WaitGroup` — that would wait for ALL transitions globally, wrong semantics. - **Query handlers read from storage-manager layer** (see [01-backfill-workflow.md](./01-backfill-workflow.md)'s sibling docs and the pending query-routing design). 
Each per-data-type storage manager owns its own state-transition synchronization; the query handler never touches `active_stores` directly. -- **Pre-creation happens at store-open time, not at a mid-chunk tripwire.** `open_active_stores` (Phase 4 entry) opens BOTH `resume_chunk`'s store AND `resume_chunk + 1`'s store up front. Subsequent pre-creation happens inside `on_chunk_boundary` after the rollover — it opens `C + 2` so the NEXT rollover has the pre-created store already waiting. Amortizes creation cost; keeps the ingestion loop's hot path free of store-open latency. +- **Pre-creation happens at store-open time, not at a mid-chunk tripwire.** `open_active_stores_for_resume` (Phase 4 entry) opens BOTH `resume_chunk_id`'s store AND `resume_chunk_id + 1`'s store up front. Subsequent pre-creation happens inside `on_chunk_boundary` after the rollover — it opens `chunk_id + 2` so the NEXT rollover has the pre-created store already waiting. Amortizes creation cost; keeps the ingestion loop's hot path free of store-open latency. ### Chunk Boundary (every 10_000 ledgers) -Triggered when the ingestion loop commits `chunk_last_ledger(C)`. Handoffs to two freeze transitions (LFS + events) that run in background. +Triggered when the ingestion loop commits `last_ledger_in_chunk(chunk_id)`. Hands off to two freeze transitions (LFS + events) that run in the background. ```python -def on_chunk_boundary(C, active_stores, meta_store): +def on_chunk_boundary(chunk_id, active_stores, meta_store): """ - Swap active stores and kick off LFS + events freeze transitions for chunk C. + Swap active stores and kick off LFS + events freeze transitions for this chunk_id. - Ingestion for chunk C+1 continues unimpeded — active_stores.ledger now points at - the ledger_next store that was pre-created at Phase 4 entry (or by the prior chunk's - boundary handler).
+ Ingestion for (chunk_id + 1) continues unimpeded — active_stores.ledger now points + at the ledger_next store that was pre-created at Phase 4 entry (or by the prior + chunk's boundary handler). - Also pre-creates C+2's stores in background, so the NEXT chunk rollover finds its - pre-created store already opened. + Also pre-creates (chunk_id + 2)'s stores in background, so the NEXT chunk rollover + finds its pre-created store already opened. """ # LFS transition — drain the last in-flight LFS freeze (max-1-transitioning invariant), # then swap pointers so the next chunk writes to the pre-created store. wait_for_lfs_complete() - transitioning_ledger = active_stores.ledger + transitioning_ledger_store = active_stores.ledger active_stores.ledger = active_stores.ledger_next # pointer swap, no I/O - run_in_background(lfs_transition, C, transitioning_ledger, meta_store) + run_in_background(freeze_ledger_chunk_to_pack_file, chunk_id, transitioning_ledger_store, meta_store) # Events transition — same shape. Independent goroutine; does NOT wait for LFS. wait_for_events_complete() - freezing_segment = active_stores.events + freezing_events_segment = active_stores.events active_stores.events = active_stores.events_next # pointer swap - run_in_background(events_transition, C, freezing_segment, meta_store) + run_in_background(freeze_events_chunk_to_cold_segment, chunk_id, freezing_events_segment, meta_store) - # Pre-create C+2's ledger + events so the NEXT boundary is also a pointer swap. - # Low priority; not part of the hot path. Runs in background. - run_in_background(precreate_next_stores, active_stores, meta_store, C + 2) + # Pre-create (chunk_id + 2)'s ledger + events so the NEXT boundary is also a pointer + # swap. Low priority; not part of the hot path. Runs in background. + run_in_background(precreate_next_boundary_stores, active_stores, meta_store, chunk_id + 2) # Wake the lifecycle goroutine — it will check prune eligibility. 
Freeze transitions # above are NOT dispatched via the lifecycle loop; they run as direct children of the @@ -963,20 +974,20 @@ def on_chunk_boundary(C, active_stores, meta_store): notify_lifecycle() -def precreate_next_stores(active_stores, meta_store, target_chunk): +def precreate_next_boundary_stores(active_stores, meta_store, target_chunk_id): """ Opens / creates the "next-next" ledger store + events hot segment in background so the NEXT chunk rollover doesn't pay creation latency on the hot path. - Similarly handles index-next pre-creation when target_chunk crosses an index boundary. - Idempotent — safe to run on a restart where the target stores already exist. + Similarly handles tx-index-next pre-creation when target_chunk_id crosses a tx-index + boundary. Idempotent — safe to run on a restart where the target stores already exist. """ - active_stores.ledger_next = open_or_create_ledger_store(config, target_chunk) - active_stores.events_next = open_or_create_events_hot_segment(config, meta_store, target_chunk, None) + active_stores.ledger_next = open_or_create_ledger_store(config, target_chunk_id) + active_stores.events_next = open_or_create_events_hot_segment(config, meta_store, target_chunk_id, None) cpi = config.backfill.chunks_per_txhash_index - target_index = target_chunk // cpi - if target_index != index_id_of_chunk(target_chunk - 1): - active_stores.txhash_next = open_or_create_txhash_store(config, target_index) + target_tx_index_id = target_chunk_id // cpi + if target_tx_index_id != tx_index_id_of_chunk(target_chunk_id - 1): + active_stores.txhash_next = open_or_create_txhash_store(config, target_tx_index_id) ``` ### LFS Transition @@ -984,10 +995,10 @@ def precreate_next_stores(active_stores, meta_store, target_chunk): Converts the retired ledger RocksDB store to an immutable `.pack` file, then discards the store. 
```python -def lfs_transition(C, transitioning_ledger_store, meta_store): +def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_store): """ - Read all 10_000 ledgers for chunk C from its active store, write the pack file, - fsync, flag, then delete the store. + Read all LEDGERS_PER_CHUNK ledgers for chunk_id from its active store, write the + pack file, fsync, flag, then delete the store. Order matters: 1. Open pack file with overwrite=True so a prior crashed attempt's bytes are discarded. @@ -997,36 +1008,36 @@ def lfs_transition(C, transitioning_ledger_store, meta_store): 5. Close and delete the active store. Crash between (4) and (5) leaves an orphan directory; Phase 3's scan_ledger_store_dirs + :lfs-present check deletes it. """ - pack_path = ledger_pack_path(C) - writer = packfile.create(pack_path, overwrite=True) # 1 - for seq in range(chunk_first_ledger(C), chunk_last_ledger(C) + 1): - writer.append(transitioning_ledger_store.get(uint32_big_endian(seq))) # 2 - writer.fsync_and_close() # 3 - meta_store.put(f"chunk:{C:08d}:lfs", "1") # 4 - - transitioning_ledger_store.close() # 5 - delete_dir(ledger_store_path(C)) + pack_path = ledger_pack_path(chunk_id) + writer = packfile.create(pack_path, overwrite=True) # 1 + for ledger_seq in range(first_ledger_in_chunk(chunk_id), last_ledger_in_chunk(chunk_id) + 1): + writer.append(transitioning_ledger_store.get(uint32_big_endian(ledger_seq))) # 2 + writer.fsync_and_close() # 3 + meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") # 4 + + transitioning_ledger_store.close() # 5 + delete_dir(ledger_store_path(chunk_id)) signal_lfs_complete() -def complete_lfs_flush(store_dir, C, meta_store): +def finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store): """ - Phase 3 helper. Re-runs lfs_transition for a chunk whose active ledger store exists - on disk but whose :lfs flag is absent — i.e., a crash interrupted the freeze after - the per-ledger checkpoint but before the flag was set. 
+ Phase 3 helper. Re-runs freeze_ledger_chunk_to_pack_file for a chunk whose active + ledger store exists on disk but whose :lfs flag is absent — i.e., a crash + interrupted the freeze after the per-ledger checkpoint but before the flag was set. - Identical to lfs_transition except: + Identical to freeze_ledger_chunk_to_pack_file except: - No signal_lfs_complete call (not running under the max-1-transitioning gate; Phase 3 is synchronous with startup and runs to completion before Phase 4 starts). - Opens the existing store (WAL-recovered) rather than receiving a handle. """ - transitioning_ledger_store = open_or_create_ledger_store(config, C) - pack_path = ledger_pack_path(C) + transitioning_ledger_store = open_or_create_ledger_store(config, chunk_id) + pack_path = ledger_pack_path(chunk_id) writer = packfile.create(pack_path, overwrite=True) - for seq in range(chunk_first_ledger(C), chunk_last_ledger(C) + 1): - writer.append(transitioning_ledger_store.get(uint32_big_endian(seq))) + for ledger_seq in range(first_ledger_in_chunk(chunk_id), last_ledger_in_chunk(chunk_id) + 1): + writer.append(transitioning_ledger_store.get(uint32_big_endian(ledger_seq))) writer.fsync_and_close() - meta_store.put(f"chunk:{C:08d}:lfs", "1") + meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") transitioning_ledger_store.close() delete_dir(store_dir) ``` @@ -1036,52 +1047,54 @@ def complete_lfs_flush(store_dir, C, meta_store): Converts the retired events hot segment to three immutable files (events cold segment). ```python -def events_transition(C, freezing_segment, meta_store): +def freeze_events_chunk_to_cold_segment(chunk_id, freezing_events_segment, meta_store): """ - Freeze the events hot segment for chunk C. Same flag-after-fsync + cleanup order - as lfs_transition. + Freeze the events hot segment for chunk_id. Same flag-after-fsync + cleanup order + as freeze_ledger_chunk_to_pack_file. 
""" - events_path = events_segment_path(C) - write_cold_segment(freezing_segment, events_path) # 3 files: events.pack, index.pack, index.hash + events_path = events_segment_path(chunk_id) + write_cold_segment(freezing_events_segment, events_path) # 3 files: events.pack, index.pack, index.hash fsync_all(events_path) - meta_store.put(f"chunk:{C:08d}:events", "1") # flag-after-fsync + meta_store.put(f"chunk:{chunk_id:08d}:events", "1") # flag-after-fsync - freezing_segment.discard() # drops in-memory bitmaps + persisted deltas + freezing_events_segment.discard() # drops in-memory bitmaps + persisted deltas signal_events_complete() ``` -### Index Boundary (every `LEDGERS_PER_INDEX` ledgers) +### Tx-Index Boundary (every `LEDGERS_PER_INDEX` ledgers) -The last chunk of an index has just rolled over. Before RecSplit can start, every chunk in the index must have its `:lfs` and `:events` flags set. +The last chunk of a tx index has just rolled over. Before RecSplit can start, every chunk in the tx index must have its `:lfs` and `:events` flags set. ```python -def on_index_boundary(N, active_stores, meta_store): +def on_tx_index_boundary(tx_index_id, active_stores, meta_store): """ - Dispatch RecSplit build for index N. Prerequisites: - - Every chunk in N has finished its LFS + events freeze transitions. - - No LFS or events transition is in flight for any chunk of N (would racethe RecSplit input). + Dispatch RecSplit build for this tx_index_id. Prerequisites: + - Every chunk in tx_index_id has finished its LFS + events freeze transitions. + - No LFS or events transition is in flight for any chunk of tx_index_id (would + race the RecSplit input). """ - # Drain ALL in-flight LFS + events transitions. On_chunk_boundary dispatches them; - # here we wait for them to finish — the final chunk of N may still be in-flight. + # Drain ALL in-flight LFS + events transitions. 
on_chunk_boundary dispatches them; + # here we wait for them to finish — the final chunk of tx_index_id may still be + # in-flight. wait_for_lfs_complete() wait_for_events_complete() - verify_all_chunk_flags(N, meta_store) # defense-in-depth + verify_all_chunk_flags(tx_index_id, meta_store) # defense-in-depth # Swap the txhash active store. RecSplit reads from the retired store. - transitioning_txhash = active_stores.txhash - active_stores.txhash = open_precreated_txhash_store(N + 1) - run_in_background(recsplit_transition, N, transitioning_txhash, meta_store) + transitioning_txhash_store = active_stores.txhash + active_stores.txhash = active_stores.txhash_next + run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash_store, meta_store) ``` ### RecSplit Transition -Builds the 16 RecSplit `.idx` files for index N from the retired txhash active store. +Builds the 16 RecSplit `.idx` files for tx_index_id from the retired txhash active store. ```python -def recsplit_transition(N, transitioning_txhash_store, meta_store): +def build_tx_index_recsplit_files(tx_index_id, transitioning_txhash_store, meta_store): """ - Same flag-after-fsync pattern as lfs/events: + Same flag-after-fsync pattern as LFS / events freeze: 1. Delete any partial .idx files from a prior crashed attempt. 2. Build the 16 RecSplit indexes (one per CF). 3. fsync all .idx files. @@ -1089,15 +1102,15 @@ def recsplit_transition(N, transitioning_txhash_store, meta_store): 5. Flag. 6. Close + delete the txhash active store. 
""" - idx_path = recsplit_index_path(N) - delete_partial_idx_files(idx_path) # 1 - build_recsplit(transitioning_txhash_store, idx_path) # 2 (16 .idx files) - fsync_all_idx_files(idx_path) # 3 - verify_spot_check(N, idx_path, meta_store) # 4 - meta_store.put(f"index:{N:08d}:txhash", "1") # 5 - - transitioning_txhash_store.close() # 6 - delete_dir(txhash_store_path(N)) + idx_path = recsplit_index_path(tx_index_id) + delete_partial_idx_files(idx_path) # 1 + build_recsplit(transitioning_txhash_store, idx_path) # 2 (16 .idx files) + fsync_all_idx_files(idx_path) # 3 + verify_spot_check(tx_index_id, idx_path, meta_store) # 4 + meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") # 5 + + transitioning_txhash_store.close() # 6 + delete_dir(txhash_store_path(tx_index_id)) ``` --- @@ -1107,7 +1120,7 @@ def recsplit_transition(N, transitioning_txhash_store, meta_store): Retention is enforced by a single background goroutine, woken at chunk boundaries. Prune granularity is the whole txhash index — never per chunk. ```python -def lifecycle_loop(config, meta_store): +def run_prune_lifecycle_loop(config, meta_store): """ Runs as a single background goroutine. Prune gate is uniform across all artifact kinds — LFS, events, RecSplit — for a given index. @@ -1119,93 +1132,97 @@ def lifecycle_loop(config, meta_store): (~16 hours at cpi=1). - Chunk-boundary notifications from the ingestion loop (see on_chunk_boundary). - The freeze transitions (lfs_transition, events_transition, recsplit_transition) are - NOT spawned by this loop — the ingestion loop's on_chunk_boundary / on_index_boundary - dispatch them directly. lifecycle_loop is scoped to pruning. + The freeze transitions (freeze_ledger_chunk_to_pack_file, freeze_events_chunk_to_cold_segment, build_tx_index_recsplit_files) are + NOT spawned by this loop — the ingestion loop's on_chunk_boundary / on_tx_index_boundary + dispatch them directly. run_prune_lifecycle_loop is scoped to pruning. 
""" cpi = config.backfill.chunks_per_txhash_index - R = config.streaming.retention_ledgers + retention_ledgers = config.streaming.retention_ledgers - _do_prune_sweep(meta_store, R, cpi, config) # initial scan + _run_prune_sweep(meta_store, retention_ledgers, cpi, config) # initial scan while True: wait_for_chunk_boundary_notification() - _do_prune_sweep(meta_store, R, cpi, config) + _run_prune_sweep(meta_store, retention_ledgers, cpi, config) -def _do_prune_sweep(meta_store, R, cpi, config): - for N in eligible_prune_indexes(meta_store, R, cpi): - prune_index(N, meta_store, config) +def _run_prune_sweep(meta_store, retention_ledgers, cpi, config): + for tx_index_id in prunable_tx_index_ids(meta_store, retention_ledgers, cpi): + prune_tx_index(tx_index_id, meta_store, config) -def eligible_prune_indexes(meta_store, R, cpi): +def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): """ - Returns indexes whose entire footprint is past the retention window and are still - prune-eligible (either :txhash == "1" meaning prune hasn't started, or "deleting" - meaning a prior run crashed mid-prune). + Returns tx_index_ids whose entire footprint is past the retention window and are + still prune-eligible (either :txhash == "1" meaning prune hasn't started, or + "deleting" meaning a prior run crashed mid-prune). - - R = 0 → no pruning; archive profile retains everything. - - R > 0 → index N is eligible when tip > index_last_ledger(N) + R. - - tip ledger is streaming:last_committed_ledger (the daemon's own progress). + - retention_ledgers = 0 → no pruning; archive profile retains everything. + - retention_ledgers > 0 → tx_index_id is eligible when + last_committed_ledger > last_ledger_in_tx_index(tx_index_id) + retention_ledgers. + - 'tip ledger' used in the check is streaming:last_committed_ledger (the daemon's + own progress), not the source-reported network tip. 
Upper bound derivation: - index_last_ledger(N) = ((N + 1) * LPI) + (GENESIS_LEDGER - 1) - Eligible iff L > ((N + 1) * LPI) + (GENESIS_LEDGER - 1) + R - iff L - (GENESIS_LEDGER - 1) - R > (N + 1) * LPI - iff (N + 1) < (L - (GENESIS_LEDGER - 1) - R) / LPI - iff N <= ((L - (GENESIS_LEDGER - 1) - R - 1) // LPI) - 1 (integer floor) - Simplify: L - (GENESIS_LEDGER - 1) - 1 = L - GENESIS_LEDGER. - max_eligible_N = ((L - GENESIS_LEDGER - R) // LPI) - 1 - - Numeric check at L=70_000_002, R=10_000_000, cpi=1000 (LPI=10_000_000): - max_eligible_N = (70_000_002 - 2 - 10_000_000) // 10_000_000 - 1 = 6 - 1 = 5. - Index 5 has index_last_ledger(5) + R = 60_000_001 + 10_000_000 = 70_000_001. - 70_000_002 > 70_000_001 → N=5 eligible. ✓ - Index 6 has index_last_ledger(6) + R = 70_000_001 + 10_000_000 = 80_000_001. - 70_000_002 > 80_000_001 is false → N=6 NOT eligible. ✓ + last_ledger_in_tx_index(tx_index_id) = ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) + Eligible iff last_committed_ledger > ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) + retention_ledgers + iff last_committed_ledger - (GENESIS_LEDGER - 1) - retention_ledgers > (tx_index_id + 1) * LEDGERS_PER_INDEX + iff (tx_index_id + 1) < (last_committed_ledger - (GENESIS_LEDGER - 1) - retention_ledgers) / LEDGERS_PER_INDEX + iff tx_index_id <= ((last_committed_ledger - (GENESIS_LEDGER - 1) - retention_ledgers - 1) // LEDGERS_PER_INDEX) - 1 + Simplify: last_committed_ledger - (GENESIS_LEDGER - 1) - 1 = last_committed_ledger - GENESIS_LEDGER. + max_eligible_tx_index_id = ((last_committed_ledger - GENESIS_LEDGER - retention_ledgers) // LEDGERS_PER_INDEX) - 1 + + Numeric check at last_committed_ledger=70_000_002, retention_ledgers=10_000_000, + cpi=1000 (LEDGERS_PER_INDEX=10_000_000): + max_eligible_tx_index_id = (70_000_002 - 2 - 10_000_000) // 10_000_000 - 1 = 6 - 1 = 5. + tx_index_id=5 has last_ledger_in_tx_index(5) + retention_ledgers = 60_000_001 + 10_000_000 = 70_000_001. 
+ 70_000_002 > 70_000_001 → tx_index_id=5 eligible. ✓ + tx_index_id=6 has last_ledger_in_tx_index(6) + retention_ledgers = 70_000_001 + 10_000_000 = 80_000_001. + 70_000_002 > 80_000_001 is false → tx_index_id=6 NOT eligible. ✓ """ - if R == 0: + if retention_ledgers == 0: return [] - L = meta_store.get("streaming:last_committed_ledger") + last_committed_ledger = meta_store.get("streaming:last_committed_ledger") ledgers_per_index = cpi * LEDGERS_PER_CHUNK - max_eligible_N = ((L - GENESIS_LEDGER - R) // ledgers_per_index) - 1 - if max_eligible_N < 0: + max_eligible_tx_index_id = ((last_committed_ledger - GENESIS_LEDGER - retention_ledgers) // ledgers_per_index) - 1 + if max_eligible_tx_index_id < 0: return [] result = [] - for N in range(0, max_eligible_N + 1): - val = meta_store.get(f"index:{N:08d}:txhash") + for tx_index_id in range(0, max_eligible_tx_index_id + 1): + val = meta_store.get(f"index:{tx_index_id:08d}:txhash") if val in ("1", "deleting"): - result.append(N) + result.append(tx_index_id) return result -def prune_index(N, meta_store, config): +def prune_tx_index(tx_index_id, meta_store, config): """ - Deletes every artifact for index N and clears its meta store keys. Two-phase marker - for query-routing safety: + Deletes every artifact for tx_index_id and clears its meta store keys. Two-phase + marker for query-routing safety: - Set :txhash = "deleting" FIRST. Queries short-circuit (treat as absent). - Delete files + chunk keys. - Delete :txhash key LAST. - Crash between set-deleting and delete-key leaves :txhash == "deleting"; next startup - re-runs prune_index, which is idempotent (rm -f + delete_if_exists semantics). + Crash between set-deleting and delete-key leaves :txhash == "deleting"; next + startup re-runs prune_tx_index, which is idempotent (rm -f + delete_if_exists + semantics). """ cpi = config.backfill.chunks_per_txhash_index - # Stage 1: commit to pruning. 
Once this lands, queries for any ledger in index N - # return HTTP 4xx (past retention). - meta_store.put(f"index:{N:08d}:txhash", "deleting") + # Stage 1: commit to pruning. Once this lands, queries for any ledger in + # tx_index_id return HTTP 4xx (past retention). + meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting") # Stage 2: delete files and per-chunk keys. Idempotent on re-run. - for C in range(N * cpi, (N + 1) * cpi): - delete_if_exists(ledger_pack_path(C)) - delete_events_segment(C) - meta_store.delete(f"chunk:{C:08d}:lfs") - meta_store.delete(f"chunk:{C:08d}:events") - delete_recsplit_idx_files(N) - - # Stage 3: clear the index key. Index is now fully gone. - meta_store.delete(f"index:{N:08d}:txhash") + for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): + delete_if_exists(ledger_pack_path(chunk_id)) + delete_events_segment(chunk_id) + meta_store.delete(f"chunk:{chunk_id:08d}:lfs") + meta_store.delete(f"chunk:{chunk_id:08d}:events") + delete_recsplit_idx_files(tx_index_id) + + # Stage 3: clear the tx-index key. Tx index is now fully gone. + meta_store.delete(f"index:{tx_index_id:08d}:txhash") ``` **Why index-atomic.** Per-chunk pruning would create a window where `getTransaction` resolves to a ledger sequence whose pack file has already been deleted. Gating every artifact kind on whole-index past-retention closes that window completely. @@ -1233,7 +1250,7 @@ Query serving is gated on Phase 4 being reached. `getLedger`, `getTransaction`, ### Behavior When an Index Is Being Pruned -- `prune_index` sets `index:{N:08d}:txhash = "deleting"` before touching any files, and deletes the key after all files are gone. Query routing treats `"deleting"` identically to `"absent"` (key-not-present). +- `prune_tx_index` sets `index:{tx_index_id:08d}:txhash = "deleting"` before touching any files, and deletes the key after all files are gone. Query routing treats `"deleting"` identically to `"absent"` (key-not-present). 
- Queries for a ledger in a pruning index return HTTP 4xx (past retention) starting the instant the `"deleting"` marker is set, not when the files actually disappear. No window where queries route into a half-deleted index. ### Rationale @@ -1255,16 +1272,16 @@ No separate recovery phase. Every startup runs Phases 1–4 regardless — alrea 5. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. 6. **DAG-structured cleanup.** Cleanup runs as a separate step after the flag is set. Crash between flag and cleanup = retry just the cleanup on restart. 7. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 itself avoids producing them. -8. **Two-phase prune marker.** `prune_index` writes `index:{N}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `eligible_prune_indexes`. +8. **Two-phase prune marker.** `prune_tx_index` writes `index:{tx_index_id}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `prunable_tx_index_ids`. ### Compound Recovery Scenarios The backfill doc's crash recovery model (Section: Crash Recovery in `01-backfill-workflow.md`) handles every Phase 1 crash. Streaming extends it with per-ledger and per-transition recovery: - **Crash during Phase 2 `.bin` hydration.** On restart, Phase 2 re-runs. Chunks whose `.bin` was loaded and deleted on the first pass have no `:txhash` flag and no `.bin` file — the loop skips them via the flag check. 
Chunks not yet loaded still have their `:txhash` flag and `.bin` file — picked up by the same loop. -- **Crash between live per-ledger checkpoint and LFS freeze completion.** `streaming:last_committed_ledger = chunk_last_ledger(C)` but `chunk:{C}:lfs` is absent (freeze transition was killed before setting the flag). On restart, Phase 1 sees `:lfs` missing for C and re-runs `process_chunk(C)` against its configured source — idempotent per-artifact. Phase 3 then finds the active ledger store for C still on disk, sees `:lfs` now set, and deletes the orphaned store. Known inefficiency: ~10_000 ledgers of redundant ingestion work per affected chunk. Correctness is preserved. -- **Crash mid-RecSplit.** `index:{N}:txhash` absent. Phase 3 detects all chunks for N have `:lfs` set, re-spawns the RecSplit build. Partial `.idx` files are deleted first. -- **Crash mid-prune.** Some files deleted, some chunk keys cleared, `index:{N}:txhash = "deleting"` still present. On restart N is still in `eligible_prune_indexes` (the function picks up `"deleting"` as well as `"1"`), so `prune_index(N)` runs again — idempotent because file deletes are `rm -f` and key deletes are `delete_if_exists`. +- **Crash between live per-ledger checkpoint and LFS freeze completion.** `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)` but `chunk:{chunk_id}:lfs` is absent (freeze transition was killed before setting the flag). On restart, Phase 1 sees `:lfs` missing for chunk_id and re-runs `process_chunk(chunk_id)` against its configured source — idempotent per-artifact. Phase 3 then finds the active ledger store for chunk_id still on disk, sees `:lfs` now set, and deletes the orphaned store. Known inefficiency: ~10_000 ledgers of redundant ingestion work per affected chunk. Correctness is preserved. +- **Crash mid-RecSplit.** `index:{tx_index_id}:txhash` absent. Phase 3 detects all chunks for tx_index_id have `:lfs` set, re-spawns the RecSplit build. 
Partial `.idx` files are deleted first. +- **Crash mid-prune.** Some files deleted, some chunk keys cleared, `index:{tx_index_id}:txhash = "deleting"` still present. On restart tx_index_id is still in `prunable_tx_index_ids` (the function picks up `"deleting"` as well as `"1"`), so `prune_tx_index(tx_index_id)` runs again — idempotent because file deletes are `rm -f` and key deletes are `delete_if_exists`. --- @@ -1289,12 +1306,12 @@ drift_ledgers = ledger_backend.latest_tip() - meta_store.get("streaming:last_com | Error | Action | |---|---| -| CaptiveStellarCore unavailable | RETRY with backoff; ABORT after N retries | +| CaptiveStellarCore unavailable | RETRY with backoff; ABORT after `CAPTIVE_CORE_RETRY_MAX` retries (implementation-defined) | | Ledger / txhash / events write failure | ABORT — disk full or storage corruption | | Meta store write failure | ABORT — cannot maintain checkpoint | -| LFS flush failure | Do NOT set `chunk:{C}:lfs`; ABORT transition; restart retries | -| Events freeze failure | Do NOT set `chunk:{C}:events`; ABORT transition; restart retries | -| RecSplit build failure | Do NOT set `index:{N}:txhash`; ABORT transition; restart deletes partials and rebuilds | +| LFS flush failure | Do NOT set `chunk:{chunk_id}:lfs`; ABORT transition; restart retries | +| Events freeze failure | Do NOT set `chunk:{chunk_id}:events`; ABORT transition; restart retries | +| RecSplit build failure | Do NOT set `index:{tx_index_id}:txhash`; ABORT transition; restart deletes partials and rebuilds | | RecSplit verification mismatch | ABORT; do NOT delete transitioning txhash store; operator investigates | | Startup: immutable key changed | FATAL — wipe datadir to change | | Startup: `RETENTION_LEDGERS` not a multiple of `LEDGERS_PER_INDEX` | FATAL — fix config | @@ -1308,11 +1325,11 @@ drift_ledgers = ledger_backend.latest_tip() - meta_store.get("streaming:last_com The unified design requires edits to `01-backfill-workflow.md` (authoritative 
`03-backfill-workflow.md` on `feature/full-history`): 1. **Drop the `stellar-rpc full-history-backfill` cobra subcommand and all its per-run CLI flags** (`--start-ledger`, `--end-ledger`, `--workers`, `--verify-recsplit`, `--max-retries`). Backfill is no longer an operator-facing CLI entry point. `process_chunk`, `build_txhash_index`, `cleanup_txhash`, and the DAG scheduler remain as subroutines invoked by streaming Phase 1. -2. **Change `run_backfill`'s signature to `run_backfill(config, range_start_chunk, range_end_chunk, source=...)`.** Previously `run_backfill(config, flags)` with `flags.start_ledger` and `flags.end_ledger`. Phase 1 computes chunk IDs (not ledger sequences) via `compute_backfill_range`, so the subroutine takes chunk IDs directly. `source=` selects BSB vs captive core. +2. **Change `run_backfill`'s signature to `run_backfill(config, range_start_chunk, range_end_chunk, source=...)`.** Previously `run_backfill(config, flags)` with `flags.start_ledger` and `flags.end_ledger`. Phase 1 computes chunk IDs (not ledger sequences) via `compute_backfill_chunk_range`, so the subroutine takes chunk IDs directly. `source=` selects BSB vs captive core. 3. **Extend `process_chunk` with the matching `source=` parameter** accepting a `LedgerSource`. Default (`BSBSource`) matches today's behavior. The subroutine no longer creates a BSB connection from config — it uses whatever the caller passed in. 4. **Extend the DAG worker cap to honor `source.max_parallelism()`.** Currently the DAG caps at `--workers` (default GOMAXPROCS). Under `CaptiveCoreSource`, cap at 1. 5. **Move `retention_ledgers` validation + store-on-first-run into shared validation.** Streaming's `validate_config` handles the store+compare. Backfill itself doesn't need to know retention — Phase 1 translates retention into the `[range_start_chunk, range_end_chunk]` it passes into `run_backfill`. -6. 
**Artifact key values stay as `"1"`.** No state-machine extension (`"frozen"` / `"pruning"`) needed — the `chunk:{C}:txhash` key is transient (Phase 2 deletes it) and the remaining keys have simple presence/absence semantics. Exception: `index:{N:08d}:txhash` uses `"1"` or `"deleting"` for two-phase prune (spec'd in this doc under [Pruning](#pruning); backfill's `build_txhash_index` only ever writes `"1"`). +6. **Artifact key values stay as `"1"`.** No state-machine extension (`"frozen"` / `"pruning"`) needed — the `chunk:{chunk_id}:txhash` key is transient (Phase 2 deletes it) and the remaining keys have simple presence/absence semantics. Exception: `index:{tx_index_id:08d}:txhash` uses `"1"` or `"deleting"` for two-phase prune (spec'd in this doc under [Pruning](#pruning); backfill's `build_txhash_index` only ever writes `"1"`). --- From 33aa92af7dd2171bd8df6c3b1c23e3bc43233431 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 18:27:17 -0700 Subject: [PATCH 11/34] Streaming doc: narrow-scope alignment edits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two changes under the backfill-doc alignment scope: - Geometry section reduced to a one-line pointer to the backfill doc (01-backfill-workflow.md § Geometry), since backfill now owns the chunk/tx-index math in a single canonical location. Removed the full Python block that duplicated it. - Invariants section lead-in now explicitly cross-references backfill's crash-recovery invariants in 01-backfill-workflow.md. Streaming's listed invariants continue to include items already covered by backfill (flag-after-fsync, idempotent writes, DAG-structured cleanup) for scannability; cutting those duplicates is flagged for review. 
--- .../design-docs/02-streaming-workflow.md | 34 ++----------------- 1 file changed, 3 insertions(+), 31 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 7a88b5942..99f9325b4 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -53,37 +53,7 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a ## Geometry -Chunk and txhash index math are defined in [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Quick reference: - -```python -# Stellar's first ledger is GENESIS_LEDGER = 2 (not 0 or 1). Every formula that maps -# ledger_seq ↔ chunk_id subtracts GENESIS_LEDGER to zero-base the axis: ledger 2 lands -# in chunk 0, ledger 10_001 is chunk 0's last ledger, ledger 10_002 starts chunk 1. - -GENESIS_LEDGER = 2 -LEDGERS_PER_CHUNK = 10_000 -LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK - # at cpi=1000 this is 10_000_000 - -chunk_id_of_ledger(ledger_seq) = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - # 56_342_637 → (56_342_637 - 2) // 10_000 = 5634 - -first_ledger_in_chunk(chunk_id) = (chunk_id * LEDGERS_PER_CHUNK) + GENESIS_LEDGER - # chunk_id=5634 → (5634 * 10_000) + 2 = 56_340_002 — ends in ..._02 - -last_ledger_in_chunk(chunk_id) = ((chunk_id + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) - # chunk_id=5634 → ((5635) * 10_000) + 1 = 56_350_001 — ends in ..._01 - # (GENESIS_LEDGER - 1) = 1 is what keeps chunk ends in ..._01 - -tx_index_id_of_chunk(chunk_id) = chunk_id // CHUNKS_PER_TXHASH_INDEX - # chunk_id=5634 → tx_index_id=5 (at cpi=1000) - -first_ledger_in_tx_index(tx_index_id) = (tx_index_id * LEDGERS_PER_INDEX) + GENESIS_LEDGER - # tx_index_id=5 → (5 * 10_000_000) + 2 = 50_000_002 - -last_ledger_in_tx_index(tx_index_id) = ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) - # tx_index_id=5 → ((6) * 10_000_000) + 1 = 
60_000_001 -``` +See [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Streaming uses the same constants (`GENESIS_LEDGER`, `LEDGERS_PER_CHUNK`, `LEDGERS_PER_INDEX`, `CHUNKS_PER_TXHASH_INDEX`) and the same mapping functions (`chunk_id_of_ledger`, `first_ledger_in_chunk`, `last_ledger_in_chunk`, `tx_index_id_of_chunk`, `first_ledger_in_tx_index`, `last_ledger_in_tx_index`). --- @@ -1265,6 +1235,8 @@ No separate recovery phase. Every startup runs Phases 1–4 regardless — alrea ### Invariants +In addition to the backfill subroutine's invariants in [01-backfill-workflow.md — Crash Recovery](./01-backfill-workflow.md#crash-recovery), streaming adds the following: + 1. **Flag-after-fsync.** A meta store flag is set only after the corresponding file(s) are fsynced. Flag absent = output treated as missing → transition retried from scratch. 2. **Idempotent writes.** The same input ledger always produces the same key-value pairs in all stores. Re-processing after crash is safe. 3. **Per-ledger checkpoint.** `streaming:last_committed_ledger` is written only after all three active stores durably commit. Resume is `last_committed_ledger + 1`. From 6a5fa5c041ab4a404fa87bc112a9d8ae732d55ba Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 18:27:26 -0700 Subject: [PATCH 12/34] Design-docs README: reduce to reading-order guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The pre-session README restated content from the two main docs (mode descriptions, shared concepts, meta-store key summary) and referenced a two-mode design that no longer exists under the unified daemon. Replaced with a minimal one-page guide: - Points at 01-backfill-workflow.md and 02-streaming-workflow.md with one-line scope per doc. - States the reading order: backfill first (defines shared concepts), streaming second (builds on 01's vocabulary). - Cross-references the getEvents design doc as related material. Nothing else. 
90 lines → 17 lines. --- full-history/design-docs/README.md | 71 ++++-------------------------- 1 file changed, 8 insertions(+), 63 deletions(-) diff --git a/full-history/design-docs/README.md b/full-history/design-docs/README.md index af0bc3e76..735e44500 100644 --- a/full-history/design-docs/README.md +++ b/full-history/design-docs/README.md @@ -1,72 +1,17 @@ # Stellar Full History RPC Service — Design Docs -## Quick Context - -The Stellar Full History RPC Service ingests the complete blockchain history and serves queries: - -- `getLedger` — retrieve any ledger from history -- `getTransaction` — retrieve any transaction from history -- `getEvents` — retrieve events with filter matching from history - -Two mutually exclusive modes: - -- **Backfill** — offline bulk import - - Writes directly to immutable files (LFS pack files + RecSplit indexes + events cold segments) - - No RocksDB active stores, no queries during ingestion - - DAG-scheduled with a flat worker pool - - Exits when done -- **Streaming** (default) — real-time ingestion via CaptiveStellarCore - - Writes to RocksDB active stores + events hot segment - - Serves queries concurrently with ingestion - - Transitions completed data to immutable storage in background - - Long-running daemon -- Backfill typically runs first to populate historical data -- Streaming picks up where backfill left off - ## Documents -| Doc | Title | Status | -|-----|-------|--------| -| [events](../../design-docs/getevents-full-history-design.md) | getEvents Full-History Design | Complete | -| [01](./01-backfill-workflow.md) | Backfill Workflow | Complete | -| [02](./02-streaming-workflow.md) | Streaming Workflow | Complete | -| 03 | Query Routing | **Not started** | -| 04 | Operator Guide | **Not started** | - -**What each doc covers:** - -- **Events** — hot/cold segments, roaring bitmap indexes, MPHF, freeze process, query path -- **01 Backfill** — geometry, directory layout, meta store keys, config, DAG tasks, execution model, 
crash recovery -- **02 Streaming** — startup, ingestion loop, three sub-flow transitions, crash recovery invariants, backfill-to-streaming migration -- **03 Query Routing** — routing `getLedger`/`getTransaction`/`getEvents` to correct store during active/transitioning/complete phases -- **04 Operator Guide** — end-to-end setup, hardware sizing, monitoring, troubleshooting - -**What's folded into existing docs (no separate doc needed):** - -- Architecture overview → split across 01 overview, 02 overview, this README -- Meta store keys → defined inline in 01 and 02 -- Directory structure → inline in 01 -- Configuration → inline in 01 and 02 -- Checkpointing math → inline in 01, referenced by 02 -- Crash recovery → inline in 01 (backfill) and 02 (streaming invariants) +| Doc | Scope | +|-----|-------| +| [01-backfill-workflow.md](./01-backfill-workflow.md) | Backfill subroutine internals — DAG, per-chunk tasks, shared TOML config, meta-store key schema, crash recovery | +| [02-streaming-workflow.md](./02-streaming-workflow.md) | Unified daemon end-to-end — startup phases, live ingestion, freeze transitions, pruning, query contract | ## Reading Order -- Read events doc first — standalone, no prerequisites -- Read 01 (backfill) second — defines all shared concepts: geometry, meta store keys, directory layout, flag-after-fsync invariant -- Read 02 (streaming) third — assumes familiarity with 01 - -## Shared Concepts +- Read **01 Backfill** first. It defines shared concepts used by both docs: geometry, meta-store key schema, shared TOML config, flag-after-fsync. +- Read **02 Streaming** second. It builds on 01's vocabulary and describes how the daemon invokes backfill as its Phase 1 subroutine. 
-Defined in the backfill doc, used by all documents: +## See Also -- **Chunk** — 10_000 ledgers, atomic unit of ingestion and file I/O -- **Index** — `chunks_per_txhash_index` chunks (default 1_000 = 10_000_000 ledgers), unit of RecSplit build -- **Meta store** — single RocksDB instance, source of truth for crash recovery - - `chunk:{C:08d}:lfs` — ledger pack file complete - - `chunk:{C:08d}:txhash` — raw txhash `.bin` file complete (backfill only) - - `chunk:{C:08d}:events` — events cold segment complete - - `index:{N:08d}:txhash` — RecSplit index complete - - `streaming:last_committed_ledger` — per-ledger checkpoint (streaming only) - - `config:chunks_per_txhash_index` — immutable after first run -- **Flag-after-fsync** — meta store flags set only after durable file writes, core crash recovery invariant for both modes +- [getEvents full-history design](../../design-docs/getevents-full-history-design.md) — events hot/cold segment layout; consumed by both docs above. From f6b61059e68d0f0c67e3d057866db8944378b166 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 18:27:57 -0700 Subject: [PATCH 13/34] Backfill doc: align to unified daemon + grill-me pass fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reframes the backfill doc as an internal subroutine invoked by Phase 1 of the streaming daemon, matching the post-#700 unified design. All operator-CLI ceremony is removed; naming + style now match 02-streaming-workflow.md. Alignment edits (source-of-truth §9): - Dropped the stellar-rpc full-history-backfill cobra subcommand and all per-run CLI flags (--start-ledger, --end-ledger, --workers, --verify-recsplit, --max-retries). Removed the getStatus endpoint (daemon-level getHealth covers status under unified design). 
- run_backfill signature is now run_backfill(config, range_start_chunk_id, range_end_chunk_id, source) where source is a LedgerSource (BSB or captive core), matching the streaming doc's ledger source abstraction. - process_chunk takes source= and reads via source.get_range(...) instead of constructing its own BSB connection. - DAG worker cap honors source.max_parallelism() (GOMAXPROCS for BSB, 1 for captive core). - Retention handling removed entirely from backfill. The daemon's validate_config owns retention + immutable-key enforcement. - Partial Tx Index Ranges section describes how trailing partial-tx-index chunks (.bin files + :txhash flags) persist until Phase 2 hydrates them on the next daemon start. Naming + style alignment with 02-streaming-workflow.md: - snake_case Python pseudocode throughout. - GENESIS_LEDGER, LEDGERS_PER_CHUNK, LEDGERS_PER_INDEX, and CHUNKS_PER_TXHASH_INDEX defined as SCREAMING_SNAKE constants. - tx_index_id replaces index_id; chunk_id, tx_index_id are the canonical long names (no bare C / N / L / T / R placeholders). - Geometry functions: chunk_id_of_ledger, first_ledger_in_chunk, last_ledger_in_chunk, tx_index_id_of_chunk, first_ledger_in_tx_index, last_ledger_in_tx_index — matching streaming doc exactly. - Meta-store key templates use {chunk_id:08d} and {tx_index_id:08d}. - UPPER_SNAKE_CASE TOML keys and section headers throughout. - build_txhash_index's internal pipeline renamed from Phase 1 / 2 / 3 / 4 to Stage 1 / 2 / 3 / 4, preserving the rule that "Phase" refers only to the daemon's startup phases. Grill-me Pass A (correctness) fixes: - cleanup_txhash now uses delete_if_exists for .bin files — crash between .bin delete and :txhash flag delete is safe to retry. - build_dag schedules build_txhash_index for tx indexes whose LAST chunk falls in the current range, and filters process_chunk scheduling to in-range chunks only. 
Covers the cross-iteration tx-index-completion case where iteration N ingests a tx index's first chunks and iteration N+1 ingests its last chunks: build runs in iteration N+1 using .bin files from both iterations. - validate asserts source.tip() >= last_ledger_in_chunk(range_end_chunk_id) rather than calling source.covers() (not on the LedgerSource interface). - build_txhash_index: added an invariant comment noting that every chunk's .bin is on disk when the task runs (DAG ordering guarantees cleanup can only run after build succeeds). Grill-me Pass B (ambiguity) fixes: - cpi defined inline in Geometry as shorthand for CHUNKS_PER_TXHASH_INDEX. - Stray "txhash index" in prose replaced with "tx index" for consistency with the streaming doc's dominant form. - "LFS" in process_chunk's key-properties bullet spelled as "ledger pack file (:lfs flag)". File now 664 lines (down from 698 at HEAD). ID-leak grep (CC/D/B-INV) and stale-term grep both zero-hit. --- .../design-docs/01-backfill-workflow.md | 674 +++++++++--------- 1 file changed, 320 insertions(+), 354 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index dbb7aa05f..763d8187d 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -2,64 +2,71 @@ ## Overview -Backfill populates the immutable stores for a configured ledger range `[start_ledger, end_ledger]`. +Backfill is a subroutine invoked by Phase 1 of the streaming daemon (see [02-streaming-workflow.md](./02-streaming-workflow.md)). Given an integer chunk range `[range_start_chunk_id, range_end_chunk_id]` and a `LedgerSource`, it produces the immutable output files for those chunks via a static DAG of idempotent per-chunk tasks. + +**Not an operator CLI.** The daemon is the single operator entry point (`stellar-rpc --config path/to/config.toml`); backfill has no `full-history-backfill` subcommand and no per-run CLI flags. 
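As a rough illustration of the chunk-range contract described in the Overview, the sketch below turns the Geometry formulas of this doc into runnable Python and maps a chunk range back to the ledger span it covers. The constants and the first four function names come from the Geometry section; `CHUNKS_PER_TXHASH_INDEX = 1_000` is the documented default, assumed here, and `ledger_span_of_chunk_range` is a hypothetical helper added only for illustration.

```python
GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000
CHUNKS_PER_TXHASH_INDEX = 1_000  # assumption: the documented default
LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK


def chunk_id_of_ledger(ledger_seq: int) -> int:
    # Subtract GENESIS_LEDGER so that ledger 2 lands in chunk 0.
    return (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK


def first_ledger_in_chunk(chunk_id: int) -> int:
    return chunk_id * LEDGERS_PER_CHUNK + GENESIS_LEDGER


def last_ledger_in_chunk(chunk_id: int) -> int:
    return (chunk_id + 1) * LEDGERS_PER_CHUNK + (GENESIS_LEDGER - 1)


def tx_index_id_of_chunk(chunk_id: int) -> int:
    return chunk_id // CHUNKS_PER_TXHASH_INDEX


def ledger_span_of_chunk_range(range_start_chunk_id: int,
                               range_end_chunk_id: int) -> tuple[int, int]:
    # Hypothetical helper: inclusive ledger span covered by an inclusive chunk range.
    return (first_ledger_in_chunk(range_start_chunk_id),
            last_ledger_in_chunk(range_end_chunk_id))
```

With these definitions, a caller holding only ledger numbers can derive the chunk range to pass in, and a reader can check the worked numbers quoted in the Geometry comments.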
**What it does:** -- Ingests historical ledgers offline — no live queries served (only `getHealth` / `getStatus`). `getHealth` is the existing lightweight liveness check; `getStatus` is the new backfill-specific progress endpoint (see [getStatus API Response](#getstatus-api-response) below). -- Writes directly to immutable file formats — no RocksDB active stores -- Schedules work as a DAG of idempotent tasks, dispatched via a flat worker pool (default GOMAXPROCS slots) -- Exits when done; on failure, re-run the same command — completed work is never repeated +- Ingests historical ledgers via the `LedgerSource` passed in by the caller — BSB or captive core (see [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source)). +- Writes directly to immutable file formats — no RocksDB active stores. +- Schedules work as a DAG of idempotent tasks dispatched via a flat worker pool. +- Returns when every chunk in the range is complete; on crash, Phase 1 re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency. **What it produces:** | Query it enables | Immutable output | Scope | |-----------------|-----------------|-------| -| `getLedger` | Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | Per chunk (10K ledgers) | -| `getTransaction` | Txhash index files | Per txhash index (default 10M ledgers) | +| `getLedger` | Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | Per chunk (10_000 ledgers) | +| `getTransaction` | Txhash index files | Per tx index (default 10_000_000 ledgers) | | `getEvents` | [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | Per chunk | --- ## Geometry -The Stellar blockchain starts at ledger 2. Backfill organizes data using two concepts: +Stellar's first ledger is `GENESIS_LEDGER = 2` (not 0 or 1). Every formula that maps `ledger_seq ↔ chunk_id` subtracts `GENESIS_LEDGER` to zero-base the axis. 
In the pseudocode below, `cpi` in inline comments is shorthand for `CHUNKS_PER_TXHASH_INDEX`. + +```python +GENESIS_LEDGER = 2 +LEDGERS_PER_CHUNK = 10_000 # hardcoded; not configurable +CHUNKS_PER_TXHASH_INDEX = # 1 / 10 / 100 / 1_000; default 1_000 +LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK + # at cpi=1_000 this is 10_000_000 -- **Chunk** — 10_000 ledgers (hardcoded, not configurable) - - Atomic unit of ingestion and crash recovery - - Produces: one ledger `.pack` file, one raw txhash `.bin` file, one events cold segment (`events.pack`, `index.pack`, `index.hash`) - - `chunk_id = (ledger_seq - 2) / 10_000` -- **Txhash Index** — `CHUNKS_PER_TXHASH_INDEX` chunks (default 1000 = 10M ledgers) - - One RecSplit index covers all transactions across `CHUNKS_PER_TXHASH_INDEX` chunks (default: 10M ledgers worth of transactions) - - Produces 16 CF (column family) `.idx` files per txhash index - - `index_id = chunk_id / CHUNKS_PER_TXHASH_INDEX` - - Configurable via TOML, but must not change across runs — once set, it is fixed +chunk_id_of_ledger(ledger_seq) = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + # 56_342_637 → (56_342_637 - 2) // 10_000 = 5_634 -### ID Formulas +first_ledger_in_chunk(chunk_id) = (chunk_id * LEDGERS_PER_CHUNK) + GENESIS_LEDGER + # chunk_id=5_634 → (5_634 * 10_000) + 2 = 56_340_002 — ends in ..._02 -``` -chunk_id = (ledger_seq - 2) / 10_000 -index_id = chunk_id / CHUNKS_PER_TXHASH_INDEX +last_ledger_in_chunk(chunk_id) = ((chunk_id + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) + # chunk_id=5_634 → ((5_635) * 10_000) + 1 = 56_350_001 — ends in ..._01 + +tx_index_id_of_chunk(chunk_id) = chunk_id // CHUNKS_PER_TXHASH_INDEX + # chunk_id=5_634 → tx_index_id=5 (at cpi=1_000) + +first_ledger_in_tx_index(tx_index_id) = (tx_index_id * LEDGERS_PER_INDEX) + GENESIS_LEDGER + # tx_index_id=5 → (5 * 10_000_000) + 2 = 50_000_002 + +last_ledger_in_tx_index(tx_index_id) = ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) + # 
tx_index_id=5 → ((6) * 10_000_000) + 1 = 60_000_001 ``` -Example with `CHUNKS_PER_TXHASH_INDEX = 1000` (default): +Example rows at `CHUNKS_PER_TXHASH_INDEX = 1000` (default): -| Txhash Index ID | First Ledger | Last Ledger | Chunks | -|-----------------|-------------|------------|--------| -| 0 | 2 | 10_000_001 | 0–999 | -| 1 | 10_000_002 | 20_000_001 | 1000–1999 | -| 2 | 20_000_002 | 30_000_001 | 2000–2999 | -| N | (N × 10M) + 2 | ((N+1) × 10M) + 1 | N×1000 – (N+1)×1000 - 1 | +| `tx_index_id` | First Ledger | Last Ledger | Chunks | +|---|---|---|---| +| `0` | `2` | `10_000_001` | `0 – 999` | +| `1` | `10_000_002` | `20_000_001` | `1_000 – 1_999` | +| `2` | `20_000_002` | `30_000_001` | `2_000 – 2_999` | -All IDs use uniform `%08d` zero-padding (supports up to 99_999_999). +All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). --- ## Configuration -TOML file, passed via `stellar-rpc full-history-backfill --config path/to/config.toml`. - -- **TOML** defines data layout and storage paths — must be stable across runs -- **CLI flags** define per-run parameters (range, workers, retries) +The streaming daemon loads a single TOML file; backfill reads the subset documented here. Streaming-only sections (`[STREAMING]`, `[HISTORY_ARCHIVES]`) are in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). ### TOML Config @@ -73,7 +80,7 @@ TOML file, passed via `stellar-rpc full-history-backfill --config path/to/config | Key | Type | Default | Description | |-----|------|---------|-------------| -| `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per txhash index. Defines data layout — must be stable across runs. | +| `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per tx index. Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. 
| **[IMMUTABLE_STORAGE.LEDGERS]** @@ -99,103 +106,49 @@ TOML file, passed via `stellar-rpc full-history-backfill --config path/to/config |-----|------|---------|-------------| | `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/index` | Base path for RecSplit index files (permanent). | -The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-backed mutable stores used by the streaming workflow). +The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-backed mutable stores owned by the streaming workflow). -**[BACKFILL.BSB]** — BSB / Buffered Storage Backend (required) +**[BACKFILL.BSB]** — BSB / Buffered Storage Backend (optional at the daemon level; required when Phase 1 selects `BSBSource`) -| Key | Type | Default | Description | -|-----|------|---------|-------------------------------------------------------------------------------------| +| Key | Type | Default | Description | +|-----|------|---------|-------------| | `BUCKET_PATH` | string | **required** | Remote object store path to fetch LedgerCloseMeta (without `gs://` prefix for GCS). | -| `BUFFER_SIZE` | int | `1000` | Prefetch buffer depth per connection. | -| `NUM_WORKERS` | int | `20` | Download workers per connection. | +| `BUFFER_SIZE` | int | `1000` | Prefetch buffer depth per connection. | +| `NUM_WORKERS` | int | `20` | Download workers per connection. | -**[LOGGING]** +Source selection at the daemon level (BSB vs captive core, based on `[BACKFILL.BSB]` presence) is described in [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). When the caller invokes `run_backfill(..., source=CaptiveCoreSource(...))`, this section is not used. -Both keys are optional. When a key is set in both TOML and on the CLI, the CLI flag wins — specifying both is not an error. +**[LOGGING]** | Key | Type | Default | Description | |-----|------|---------|-------------| -| `LEVEL` | string | `"info"` | Minimum log severity. 
Accepted values: `debug` / `info` / `warn` / `error`. | -| `FORMAT` | string | `"text"` | Log output format. Accepted values: `text` / `json`. | - -### CLI Flags - -| Flag | Type | Default | Description | -|------|------|---------|-------------| -| `--start-ledger` | uint32 | **required** | First ledger (inclusive). Must be ≥ 2. | -| `--end-ledger` | uint32 | **required** | Last ledger (inclusive). Must be > `start_ledger`. | -| `--workers` | int | `GOMAXPROCS` | Total concurrent DAG task slots. | -| `--verify-recsplit` | bool | `true` | Run RecSplit verify phase after build. | -| `--max-retries` | int | `3` | Max retries per task before marking it failed. | -| `--log-level` | string | — | Overrides `[LOGGING].LEVEL` when set. | -| `--log-format` | string | — | Overrides `[LOGGING].FORMAT` when set. | +| `LEVEL` | string | `"info"` | Minimum log severity. Accepted values: `debug` / `info` / `warn` / `error`. Daemon CLI flag `--log-level` wins when both are set. | +| `FORMAT` | string | `"text"` | Log output format. Accepted values: `text` / `json`. Daemon CLI flag `--log-format` wins when both are set. | -### Optional TOML Sections +**[META_STORE]** -| Section | Key | Default | Description | -|---------|-----|---------|-------------| -| `[META_STORE]` | `PATH` | `{DEFAULT_DATA_DIR}/meta/rocksdb` | Meta store RocksDB directory | +| Key | Type | Default | Description | +|-----|------|---------|-------------| +| `PATH` | string | `{DEFAULT_DATA_DIR}/meta/rocksdb` | Meta store RocksDB directory. 
| ### Validation Rules -The only hard constraints are: - -- `start_ledger >= 2` -- `end_ledger > start_ledger` -- `[BACKFILL.BSB]` must be present -- `CHUNKS_PER_TXHASH_INDEX` must not change after the first run — changing it invalidates existing txhash index boundaries -- Backfill never prunes existing data — narrowing the range between runs is safe (completed work outside the new range is simply left untouched) -- No txhash-index-alignment required — the operator can pass any arbitrary ledger range -- If gaps remain after backfill, streaming mode validates completeness for all chunks and all txhash indexes at startup, reports any gaps to the operator, and aborts - -#### Chunk Boundary Expansion - -- System expands the requested range **outward** to the nearest chunk boundaries -- Start expands DOWN to the first ledger of its chunk -- End expands UP to the last ledger of its chunk -- Never clamps inward — the effective range is always ≥ the requested range -- Operator doesn't need to manually calculate chunk-aligned values - -``` -Operator requests: --start-ledger 5_000_000 --end-ledger 56_337_842 -Chunk boundary expand: start=5_000_000 falls within chunk 499 (starts at 4_990_002) - → expand start to 4_990_002 - end=56_337_842 falls within chunk 5633 (ends at 56_340_001) - → expand end to 56_340_001 -Effective range: ledgers 4_990_002–56_340_001 = 5_135 chunks -``` - -#### BSB Availability Validation - -After expansion, the system validates that the remote object store referenced by BSB contains all ledgers in the expanded range: - -- Expanded end exceeds BSB availability → error at startup (no silent truncation) -- Operator must either reduce `--end-ledger` or wait for more ledgers to become available in BSB - -#### Partial Txhash Index Ranges - -If the expanded range does not complete a full txhash index: +- `CHUNKS_PER_TXHASH_INDEX` must not change after the first run — the daemon's `validate_config` enforces this at startup (see [02-streaming-workflow.md — 
Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode)). +- When the caller invokes backfill with `source=BSBSource(...)`, `[BACKFILL.BSB]` must be present AND the source must cover the requested chunk range. `run_backfill`'s `validate` asserts `source.tip() >= last_ledger_in_chunk(range_end_chunk_id)` at the start; lower-bound coverage (bucket retention floor, captive core history start) is verified by `source.get_range` during execution. -- Chunks are still backfilled and immediately serve `getLedger`/`getEvents` when the service is started in streaming mode -- Txhash index creation only happens once **all** input chunks for the txhash index are ready -- If txhash index creation does not happen in the current backfill run, the remaining chunks are completed either by a subsequent backfill run (should the operator run backfill again) or when streaming mode starts for the first time (see [Implications for Streaming Workflow](#implications-for-streaming-workflow) below) +### Partial Tx Index Ranges -Ledger and events data are useful per-chunk and should not be blocked by txhash index alignment. Without relaxed validation: +When the caller's chunk range does not span a complete tx index, the trailing chunks have: -- A node at ledger 56_340_000 cannot backfill the latest ~6.3M ledgers because `50_000_002–56_340_001` doesn't align to a 10M txhash index boundary — the operator would have to wait until ledger 60_000_001 -- Incremental backfill (extending coverage from a completed txhash index to recent history) would be blocked unless the chain happens to sit on a txhash index boundary +- Their raw `.bin` files on disk (inside `IMMUTABLE_STORAGE.TXHASH_RAW.PATH`). +- Their `chunk:{chunk_id:08d}:txhash` flags set in the meta store. +- No RecSplit `.idx` files (RecSplit is built only when every chunk of the tx index is ready). -#### Implications for Streaming Workflow +These trailing artifacts persist on disk after `run_backfill` returns. 
Phase 2 of the streaming daemon loads them into the active txhash RocksDB store on startup and then deletes the `.bin` files and `chunk:{chunk_id:08d}:txhash` flags (see [02-streaming-workflow.md — Phase 2](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin)). -When backfill completes at a non-txhash-index-aligned boundary, a partially-filled txhash index remains. The streaming workflow completes the remaining chunks: +Ledger and events data are useful per-chunk and are not blocked by tx-index alignment — `chunk:{chunk_id:08d}:lfs` and `chunk:{chunk_id:08d}:events` flags are set as soon as each chunk's outputs are durable. -- Streaming continues chunk ingestion from where backfill left off, writing the same per-chunk outputs (LFS, txhash, events) using the same flag-based idempotency -- When streaming completes the last chunk needed for a pending txhash index, txhash index creation becomes eligible and runs -- The meta store is the shared coordination point — streaming checks the same chunk flags as backfill, so there is no gap or overlap between backfill and streaming coverage - -See [PR #617 discussion](https://github.com/stellar/stellar-rpc/pull/617#discussion_r2969796337) for the original rationale. - -### Example: GCS Backfill Config +### Example TOML ```toml [SERVICE] @@ -224,25 +177,20 @@ LEVEL = "info" FORMAT = "text" ``` -```bash -stellar-rpc full-history-backfill --config config.toml \ - --start-ledger 2 \ - --end-ledger 30_000_001 \ - --workers 40 -``` +The TOML above is consumed by the streaming daemon entry point (`stellar-rpc --config ...`); backfill is invoked internally by Phase 1 with the chunk range and source it computed. --- ## Directory Structure -With geometry (chunk, txhash index) and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is how they map to the filesystem. +With geometry and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is how they map to the filesystem. 
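A minimal sketch of the filesystem mapping this section describes, assuming the path patterns listed under Path Conventions and the hardcoded grouping of 1_000 chunks per bucket. The function names and the `CHUNKS_PER_BUCKET` constant are hypothetical; the `root` arguments stand in for the corresponding `IMMUTABLE_STORAGE.*.PATH` values.

```python
CHUNKS_PER_BUCKET = 1_000  # hardcoded grouping per the doc; the constant name is hypothetical


def bucket_id_of_chunk(chunk_id: int) -> int:
    # bucket_id is purely a filesystem concern: 1_000 chunks per subdirectory.
    return chunk_id // CHUNKS_PER_BUCKET


def ledger_pack_path(root: str, chunk_id: int) -> str:
    return f"{root}/{bucket_id_of_chunk(chunk_id):05d}/{chunk_id:08d}.pack"


def raw_txhash_path(root: str, chunk_id: int) -> str:
    return f"{root}/{bucket_id_of_chunk(chunk_id):05d}/{chunk_id:08d}.bin"


def events_paths(root: str, chunk_id: int) -> tuple[str, str, str]:
    # Three files per chunk: event blocks, bitmap index, MPHF.
    prefix = f"{root}/{bucket_id_of_chunk(chunk_id):05d}/{chunk_id:08d}"
    return (f"{prefix}-events.pack", f"{prefix}-index.pack", f"{prefix}-index.hash")


def recsplit_cf_path(root: str, tx_index_id: int, txhash: bytes) -> str:
    # The nibble is the high 4 bits of the first hash byte; it selects one of 16 CFs.
    nibble = txhash[0] >> 4
    return f"{root}/{tx_index_id:08d}/cf-{nibble:x}.idx"
```

Only the tx-index path is keyed by `tx_index_id`; every chunk-level path goes through `bucket_id`, which is why changing `CHUNKS_PER_TXHASH_INDEX` leaves the ledger, events, and raw txhash trees untouched.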
-- Each data type has its own directory tree rooted at its `IMMUTABLE_STORAGE.*.PATH` -- Chunk-level files (ledgers, events, raw txhash) are grouped into subdirectories (bucket) of 1_000 chunks: - - `bucket_id = chunk_id / 1000` (hardcoded, not configurable), formatted as `%05d` - - `bucket_id` is purely a filesystem concern — it does not appear in meta store keys, DAG dependencies, or config -- Txhash index output is the only structure that uses `index_id` instead of `bucket_id` -- Directories are created on-demand via `os.MkdirAll` (safe for concurrent writes) +- Each data type has its own directory tree rooted at its `IMMUTABLE_STORAGE.*.PATH`. +- Chunk-level files (ledgers, events, raw txhash) are grouped into subdirectories (`bucket`) of 1_000 chunks: + - `bucket_id = chunk_id // 1000` (hardcoded, not configurable), formatted as `%05d`. + - `bucket_id` is purely a filesystem concern — it does not appear in meta store keys, DAG dependencies, or config. +- Tx-index output is the only structure that uses `tx_index_id` instead of `bucket_id`. +- Directories are created on-demand via `os.MkdirAll` (safe for concurrent writes). ``` {DEFAULT_DATA_DIR}/ @@ -250,16 +198,16 @@ With geometry (chunk, txhash index) and storage paths (`IMMUTABLE_STORAGE.*`) de │ └── rocksdb/ ← Meta store (WAL always enabled) │ ├── ledgers/ ← IMMUTABLE_STORAGE.LEDGERS.PATH -│ ├── 00000/ ← chunks 0–999 (1_000 .pack files) +│ ├── 00000/ ← chunk_ids 0–999 (1_000 .pack files) │ │ ├── 00000000.pack ← ledger pack file (PR #633) │ │ ├── 00000001.pack │ │ └── ... -│ ├── 00001/ ← chunks 1000–1999 +│ ├── 00001/ ← chunk_ids 1_000–1_999 │ │ └── ... 
│ └── .../ │ ├── events/ ← IMMUTABLE_STORAGE.EVENTS.PATH -│ ├── 00000/ ← chunks 0–999 (3_000 files: 3 per chunk) +│ ├── 00000/ ← chunk_ids 0–999 (3_000 files: 3 per chunk) │ │ ├── 00000000-events.pack ← compressed event blocks │ │ ├── 00000000-index.pack ← serialized roaring bitmaps │ │ ├── 00000000-index.hash ← MPHF for term → slot lookup @@ -268,47 +216,47 @@ With geometry (chunk, txhash index) and storage paths (`IMMUTABLE_STORAGE.*`) de │ └── txhash/ ├── raw/ ← IMMUTABLE_STORAGE.TXHASH_RAW.PATH - │ ├── 00000/ ← chunks 0–999 (1_000 .bin files) - │ │ ├── 00000000.bin ← TRANSIENT (deleted after RecSplit) + │ ├── 00000/ ← chunk_ids 0–999 (1_000 .bin files) + │ │ ├── 00000000.bin ← TRANSIENT (deleted after RecSplit + Phase 2) │ │ └── ... │ └── .../ └── index/ ← IMMUTABLE_STORAGE.TXHASH_INDEX.PATH - ├── 00000000/ ← txhash index 0 (16 RecSplit CF files) + ├── 00000000/ ← tx_index_id=0 (16 RecSplit CF files) │ └── cf-{0-f}.idx ← PERMANENT └── .../ ``` -`CHUNKS_PER_TXHASH_INDEX` only affects `txhash/index/` — all other trees use the hardcoded 1_000-chunk `bucket_id` grouping regardless. +`CHUNKS_PER_TXHASH_INDEX` only affects `txhash/index/` — all other trees use the hardcoded 1_000-chunk `bucket_id` grouping regardless. -The directory tree above reflects the default `CHUNKS_PER_TXHASH_INDEX = 1000`. 
Using 20M ledgers (2_000 chunks) as an example: +Directory-count tradeoffs for a 2_000-chunk (20M-ledger) dataset: -| `CHUNKS_PER_TXHASH_INDEX` | Txhash index dirs | Tradeoff | -|---------------------------|-------------------|----------| -| `1000` (default) | 2_000 / 1000 = 2 | Fewer dirs, larger indexes — longer build time per index, fewer files to search at query time | -| `100` | 2_000 / 100 = 20 | More dirs, smaller indexes — faster build time per index, more files to search at query time | -| `1` | 2_000 / 1 = 2_000 | One index per chunk — fastest build, most files to search | +| `CHUNKS_PER_TXHASH_INDEX` | Tx-index dirs | Tradeoff | +|---------------------------|---------------|----------| +| `1000` (default) | `2_000 / 1_000 = 2` | Fewer dirs, larger indexes — longer build time per index, fewer files to search at query time | +| `100` | `2_000 / 100 = 20` | More dirs, smaller indexes — faster build time per index, more files to search at query time | +| `1` | `2_000 / 1 = 2_000` | One index per chunk — fastest build, most files to search | ### Path Conventions | File Type | Pattern | Example | |-----------|---------|---------| -| Ledger pack | `{IMMUTABLE_STORAGE.LEDGERS.PATH}/{bucketID:05d}/{chunkID:08d}.pack` | `ledgers/00000/00000042.pack` | -| Raw txhash | `{IMMUTABLE_STORAGE.TXHASH_RAW.PATH}/{bucketID:05d}/{chunkID:08d}.bin` | `txhash/raw/00000/00000042.bin` | -| RecSplit CF | `{IMMUTABLE_STORAGE.TXHASH_INDEX.PATH}/{indexID:08d}/cf-{nibble}.idx` | `txhash/index/00000000/cf-a.idx` | -| Events data | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-events.pack` | `events/00000/00000042-events.pack` | -| Events index | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-index.pack` | `events/00000/00000042-index.pack` | -| Events hash | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucketID:05d}/{chunkID:08d}-index.hash` | `events/00000/00000042-index.hash` | +| Ledger pack | 
`{IMMUTABLE_STORAGE.LEDGERS.PATH}/{bucket_id:05d}/{chunk_id:08d}.pack` | `ledgers/00000/00000042.pack` | +| Raw txhash | `{IMMUTABLE_STORAGE.TXHASH_RAW.PATH}/{bucket_id:05d}/{chunk_id:08d}.bin` | `txhash/raw/00000/00000042.bin` | +| RecSplit CF | `{IMMUTABLE_STORAGE.TXHASH_INDEX.PATH}/{tx_index_id:08d}/cf-{nibble}.idx` | `txhash/index/00000000/cf-a.idx` | +| Events data | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucket_id:05d}/{chunk_id:08d}-events.pack` | `events/00000/00000042-events.pack` | +| Events index | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucket_id:05d}/{chunk_id:08d}-index.pack` | `events/00000/00000042-index.pack` | +| Events hash | `{IMMUTABLE_STORAGE.EVENTS.PATH}/{bucket_id:05d}/{chunk_id:08d}-index.hash` | `events/00000/00000042-index.hash` | - **Nibble** = high 4 bits of `txhash[0]`, i.e., `txhash[0] >> 4`. Values `0`–`f`. Determines which of 16 CFs a txhash is routed to. -- **Raw txhash format**: 36 bytes per entry, no header: `[txhash: 32 bytes][ledgerSeq: 4 bytes big-endian]` -- **Events cold segment**: See [getEvents full-history design](https://github.com/stellar/stellar-rpc/pull/635) for the full format specification. +- **Raw txhash format**: 36 bytes per entry, no header: `[txhash: 32 bytes][ledger_seq: 4 bytes big-endian]`. +- **Events cold segment**: see [getEvents full-history design](https://github.com/stellar/stellar-rpc/pull/635) for the full format. --- ## Meta Store Keys -- Single RocksDB instance with WAL (Write-Ahead Log) always enabled -- Authoritative source for crash recovery — all resume decisions derive from key presence in this store +- Single RocksDB instance with WAL (Write-Ahead Log) always enabled. +- Authoritative source for crash recovery — all resume decisions derive from key presence. ### Key Schema @@ -316,40 +264,42 @@ All IDs use uniform `%08d` zero-padding, matching the directory structure. 
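The key templates in this section are ordinary format strings, so the resume rule (key presence is the signal; each flag is checked independently) can be sketched in a few lines. This is an illustration under stated assumptions, not the daemon's code: a plain `dict` stands in for the RocksDB meta store, and `chunk_key`, `tx_index_key`, and `missing_outputs` are hypothetical names. The sketch also ignores the post-cleanup case, where `chunk:{chunk_id:08d}:txhash` is deliberately deleted once the tx index has been built.

```python
CHUNK_OUTPUTS = ("lfs", "txhash", "events")


def chunk_key(chunk_id: int, output: str) -> str:
    if output not in CHUNK_OUTPUTS:
        raise ValueError(f"unknown chunk output: {output!r}")
    return f"chunk:{chunk_id:08d}:{output}"


def tx_index_key(tx_index_id: int) -> str:
    return f"index:{tx_index_id:08d}:txhash"


def missing_outputs(meta_store: dict[str, str], chunk_id: int) -> list[str]:
    # Absence of a key means "not started or incomplete"; both are treated the same.
    return [out for out in CHUNK_OUTPUTS
            if chunk_key(chunk_id, out) not in meta_store]
```

For example, a crash after the ledger pack was fsynced and flagged but before the other two outputs were flagged leaves only `chunk:00000000:lfs` present, and a resume would need to regenerate just the `txhash` and `events` outputs for that chunk.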
| Key Pattern | Value | Written When | |-------------|-------|-------------| -| `chunk:{C:08d}:lfs` | `"1"` | After ledger `.pack` file is fsynced | -| `chunk:{C:08d}:txhash` | `"1"` | After raw txhash `.bin` file is fsynced | -| `chunk:{C:08d}:events` | `"1"` | After events cold segment files (`events.pack`, `index.pack`, `index.hash`) are fsynced | -| `index:{N:08d}:txhash` | `"1"` | After all 16 RecSplit CF `.idx` files are built and fsynced | - -- Values are `"1"` (retained for `ldb`/`sst_dump` readability); key presence is the signal -- Key absence means not started or incomplete — treated identically on resume -- Each chunk flag is written independently after its output's fsync — a crash may leave some flags set and others absent for the same chunk -- On resume, each chunk's flags are checked independently — only missing outputs are produced -- WAL is always enabled — disabling it would invalidate all crash recovery -- `chunk:{C}:txhash` keys are deleted after the txhash index is built (the raw `.bin` files they reference are also deleted); all other flags are permanent +| `chunk:{chunk_id:08d}:lfs` | `"1"` | After ledger `.pack` file is fsynced | +| `chunk:{chunk_id:08d}:txhash` | `"1"` | After raw txhash `.bin` file is fsynced | +| `chunk:{chunk_id:08d}:events` | `"1"` | After events cold segment files (`events.pack`, `index.pack`, `index.hash`) are fsynced | +| `index:{tx_index_id:08d}:txhash` | `"1"` | After all 16 RecSplit CF `.idx` files are built and fsynced | + +- Values are `"1"` (retained for `ldb`/`sst_dump` readability); key presence is the signal. +- Key absence means not started or incomplete — treated identically on resume. +- Each chunk flag is written independently after its output's fsync — a crash may leave some flags set and others absent for the same chunk. +- On resume, each chunk's flags are checked independently — only missing outputs are produced. +- WAL is always enabled — disabling it would invalidate all crash recovery. 
+- `chunk:{chunk_id:08d}:txhash` keys are deleted after the tx index is built (the raw `.bin` files they reference are also deleted); all other flags are permanent within backfill's scope. + +**Streaming's extension.** Streaming's prune path may transition `index:{tx_index_id:08d}:txhash` through an intermediate `"deleting"` value before clearing the key entirely. Backfill's `build_txhash_index` only ever writes `"1"`. See [02-streaming-workflow.md — Pruning](./02-streaming-workflow.md#pruning) for the prune mechanism. **Examples:** ``` -chunk:00000000:lfs → "1" chunk 0 ledger pack done -chunk:00000000:txhash → "1" chunk 0 raw txhash done -chunk:00000000:events → "1" chunk 0 events cold segment done -chunk:00000999:events → "1" last chunk of txhash index 0 -index:00000000:txhash → "1" txhash index 0 RecSplit complete -index:00000001:txhash → absent txhash index 1 not yet built +chunk:00000000:lfs → "1" chunk_id=0 ledger pack done +chunk:00000000:txhash → "1" chunk_id=0 raw txhash done +chunk:00000000:events → "1" chunk_id=0 events cold segment done +chunk:00000999:events → "1" last chunk of tx_index_id=0 (at cpi=1_000) +index:00000000:txhash → "1" tx_index_id=0 RecSplit complete +index:00000001:txhash → absent tx_index_id=1 not yet built ``` ### Key Lifecycle ``` -chunk ingestion → sets chunk:{C}:lfs, chunk:{C}:txhash, chunk:{C}:events - (each independently, after its output's fsync) -txhash index build → sets index:{N}:txhash -txhash cleanup → deletes chunk:{C}:txhash keys + raw .bin files +chunk ingestion → sets chunk:{chunk_id:08d}:lfs, chunk:{chunk_id:08d}:txhash, chunk:{chunk_id:08d}:events + (each independently, after its output's fsync) +tx index build → sets index:{tx_index_id:08d}:txhash +txhash cleanup → deletes chunk:{chunk_id:08d}:txhash keys + raw .bin files ``` -After a completed txhash index: -- `chunk:{C}:lfs`, `chunk:{C}:events`, `index:{N}:txhash` — permanent -- `chunk:{C}:txhash` keys + raw `.bin` files — deleted after txhash index is built 
+After a completed tx index: +- `chunk:{chunk_id:08d}:lfs`, `chunk:{chunk_id:08d}:events`, `index:{tx_index_id:08d}:txhash` — permanent within backfill's scope. +- `chunk:{chunk_id:08d}:txhash` keys + raw `.bin` files — deleted after tx index is built. --- @@ -359,221 +309,257 @@ The backfill DAG has three task types: | Task | Cadence | Dependencies | Produces | |------|---------|-------------|----------| -| `process_chunk(chunk_id)` | Per chunk (10K ledgers) | None | Ledger `.pack` + raw txhash `.bin` + events cold segment | -| `build_txhash_index(index_id)` | Per txhash index | All `process_chunk` tasks for this txhash index | 16 RecSplit `.idx` files | -| `cleanup_txhash(index_id)` | Per txhash index | `build_txhash_index` for this txhash index | Deletes raw `.bin` files + `chunk:{C}:txhash` meta keys | +| `process_chunk(chunk_id, source)` | Per chunk (10_000 ledgers) | None | Ledger `.pack` + raw txhash `.bin` + events cold segment | +| `build_txhash_index(tx_index_id)` | Per tx index | All `process_chunk` tasks for this tx index | 16 RecSplit `.idx` files | +| `cleanup_txhash(tx_index_id)` | Per tx index | `build_txhash_index` for this tx index | Deletes raw `.bin` files + `chunk:{chunk_id:08d}:txhash` meta keys | -- Each task is a black box to the DAG scheduler — it calls `Execute()` and waits for return -- What happens inside (goroutines, I/O, parallelism) is up to the task +- Each task is a black box to the DAG scheduler — it calls `execute()` and waits for return. +- What happens inside (goroutines, I/O, parallelism) is up to the task. ### Dependency Diagram -For a single txhash index with N chunks: +For the chunks of one tx index (first chunk through last chunk): ``` -process_chunk(chunk 0) ─┐ -process_chunk(chunk 1) ─┤ -process_chunk(chunk 2) ─┼──→ build_txhash_index(index_id) ──→ cleanup_txhash(index_id) -... 
│ -process_chunk(chunk N) ─┘ +process_chunk(chunk_id=first) ─┐ +process_chunk(chunk_id=first+1) ─┤ +process_chunk(chunk_id=first+2) ─┼──→ build_txhash_index(tx_index_id) ──→ cleanup_txhash(tx_index_id) +... │ +process_chunk(chunk_id=last) ─┘ ``` -- All `process_chunk` tasks for a txhash index must complete before `build_txhash_index` fires -- `cleanup_txhash` runs after `build_txhash_index` succeeds -- Cleanup deletes the raw `.bin` files and their `chunk:{C}:txhash` meta keys +- All `process_chunk` tasks for a tx index must complete before `build_txhash_index` fires. +- `cleanup_txhash` runs after `build_txhash_index` succeeds. +- Cleanup deletes the raw `.bin` files and their `chunk:{chunk_id:08d}:txhash` meta keys. ### Main Flow -```python -def run_backfill(config, flags): - - # 1. Validate — abort before any work if config is incompatible with existing state - validate(config, flags) +`run_backfill` is invoked by Phase 1 of the streaming daemon with an integer chunk range and a `LedgerSource`: - # 2. Build DAG — register all tasks; each task's execute() handles its own no-op check - dag = build_dag(config, flags) - - # 3. Execute — dispatch all tasks concurrently, bounded by worker count - dag.execute(max_workers=flags.workers) # default GOMAXPROCS +```python +def run_backfill(config, range_start_chunk_id, range_end_chunk_id, source): + """ + Ingest chunks [range_start_chunk_id, range_end_chunk_id] inclusive via the given source. + + Called by Phase 1 of the streaming daemon. Idempotent per chunk — already-completed + chunks in the range return early from their task's execute(). On crash, Phase 1 + re-invokes with the same range; previously-completed work is skipped automatically. + """ + # 1. Validate — abort before any work if inputs are inconsistent with existing state. + validate(config, range_start_chunk_id, range_end_chunk_id, source) + + # 2. Build DAG — register all tasks; each task's execute() handles its own no-op check. 
+ dag = build_dag(config, range_start_chunk_id, range_end_chunk_id, source) + + # 3. Execute — dispatch all tasks concurrently, bounded by the source's parallelism. + dag.execute(max_workers=source.max_parallelism()) ``` ### Validation -Validation runs before DAG construction, not as a DAG task. If it were a DAG task, other tasks with no dependencies would start executing concurrently before validation completes — and if validation fails, in-flight work that should never have started would need to be cancelled. Running it first means a clean abort with no partial work. +Validation runs before DAG construction, not as a DAG task — if it were a task, other tasks with no dependencies would start executing concurrently before validation completes, and a failure would leave in-flight work to cancel. Running it first means a clean abort with no partial work. ```python -def validate(config, flags): - # See Validation Rules for the full list of checks. - assert flags.start_ledger >= 2 - assert flags.end_ledger > flags.start_ledger - assert config.backfill.bsb is not None - assert CHUNKS_PER_TXHASH_INDEX unchanged from prior runs (if meta store is non-empty) +def validate(config, range_start_chunk_id, range_end_chunk_id, source): + # Range sanity. + assert range_start_chunk_id >= 0 + assert range_end_chunk_id >= range_start_chunk_id + + # Source tip coverage. source.tip() is the highest ledger the source can serve; + # lower-bound availability (BSB bucket retention floor, captive core history start) + # is source-specific and surfaces as a per-task failure during execution — retried at + # the DAG level and ultimately fatal if unrecoverable. + last_ledger = last_ledger_in_chunk(range_end_chunk_id) + assert source.tip() >= last_ledger + + # CHUNKS_PER_TXHASH_INDEX immutability. The daemon's validate_config (see + # 02-streaming-workflow.md) stores this on first run and enforces it on every + # subsequent start; backfill re-asserts the match defensively. 
+ assert meta_store.get("config:chunks_per_txhash_index") == str(config.backfill.chunks_per_txhash_index) ``` ### DAG Setup ```python -def build_dag(config, flags): +def build_dag(config, range_start_chunk_id, range_end_chunk_id, source): # Wires up tasks and dependency edges — no completion checks or skip logic. - # Each task's execute() handles its own no-op check (early return if already complete). - + # Each task's execute() handles its own no-op check. dag = new DAG() - for index_id in configured_indexes(config, flags): + # For each tx index whose LAST chunk falls in [range_start_chunk_id, range_end_chunk_id], + # schedule process_chunk (for in-range chunks only) + build + cleanup. Prior chunks of + # such a tx index are either also in the current range (first-ever invocation) or + # already flagged :lfs by a prior Phase 1 iteration — either way, build has every + # chunk's .bin file available when it runs. + for tx_index_id in tx_indexes_ending_in_range(range_start_chunk_id, range_end_chunk_id, config): chunk_tasks = [] - for chunk_id in chunks_for_index(index_id): - t = dag.add(ProcessChunkTask(chunk_id), deps=[]) + for chunk_id in chunks_for_tx_index(tx_index_id, config): + if not (range_start_chunk_id <= chunk_id <= range_end_chunk_id): + continue # prior iteration processed this chunk + t = dag.add(ProcessChunkTask(chunk_id, source=source), deps=[]) chunk_tasks.append(t.id) - b = dag.add(BuildTxHashIndexTask(index_id), - deps=chunk_tasks) - dag.add(CleanupTxHashTask(index_id), deps=[b.id]) + b = dag.add(BuildTxHashIndexTask(tx_index_id), deps=chunk_tasks) + dag.add(CleanupTxHashTask(tx_index_id), deps=[b.id]) + + # Trailing partial tx index: its last chunk is past range_end_chunk_id, so no build / + # cleanup this run. Schedule process_chunk for its in-range chunks only; a future + # Phase 1 iteration covering the missing trailing chunks will trigger the build. 
+ for chunk_id in trailing_partial_tx_index_chunks(range_start_chunk_id, range_end_chunk_id, config): + dag.add(ProcessChunkTask(chunk_id, source=source), deps=[]) return dag ``` +A trailing tx index whose last chunks fall past `range_end_chunk_id` has its `process_chunk` tasks scheduled but no `build_txhash_index` / `cleanup_txhash`. Its `.bin` files + `chunk:{chunk_id:08d}:txhash` flags persist until a future `run_backfill` covers the missing chunks OR Phase 2 hydrates them — see [Partial Tx Index Ranges](#partial-tx-index-ranges). + --- ## Task Details -### process_chunk(chunk_id) +### process_chunk(chunk_id, source) -- Processes a single 10K-ledger chunk end-to-end -- Occupies one DAG worker slot -- Only produces missing outputs — checks each flag independently -- Internal concurrency is an implementation detail +- Processes a single 10_000-ledger chunk end-to-end. +- Occupies one DAG worker slot. +- Only produces missing outputs — checks each flag independently. +- Internal concurrency is an implementation detail. **Outputs** (all produced in a single task, only if missing): -- Ledger pack file (`{chunkID:08d}.pack`) — compressed ledger data in [packfile format](https://github.com/stellar/stellar-rpc/pull/633) -- Raw txhash flat file (`{chunkID:08d}.bin`) — 36-byte entries consumed by RecSplit builder -- Events cold segment (`events.pack` + `index.pack` + `index.hash`) — per [getEvents design](https://github.com/stellar/stellar-rpc/pull/635) + +- Ledger pack file (`{chunk_id:08d}.pack`) — compressed ledger data in [packfile format](https://github.com/stellar/stellar-rpc/pull/633). +- Raw txhash flat file (`{chunk_id:08d}.bin`) — 36-byte entries consumed by RecSplit builder. +- Events cold segment (`events.pack` + `index.pack` + `index.hash`) — per [getEvents design](https://github.com/stellar/stellar-rpc/pull/635). 
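
The raw txhash output listed above is compact enough to illustrate directly. The following is a minimal sketch, not the project's implementation: it assumes only the layout stated in this document (36-byte headerless entries of `[txhash: 32 bytes][ledger_seq: 4 bytes big-endian]`, nibble = `txhash[0] >> 4`, `bucket_id = chunk_id // 1000`). The helper names `encode_entry`, `decode_entries`, `nibble_of`, and `raw_txhash_relpath` are hypothetical and exist only for this example.

```python
import struct

ENTRY_SIZE = 36  # 32-byte txhash + 4-byte big-endian ledger sequence


def encode_entry(txhash: bytes, ledger_seq: int) -> bytes:
    """Pack one raw entry: [txhash: 32 bytes][ledger_seq: 4 bytes big-endian]."""
    if len(txhash) != 32:
        raise ValueError("txhash must be exactly 32 bytes")
    return txhash + struct.pack(">I", ledger_seq)


def decode_entries(blob: bytes):
    """Yield (txhash, ledger_seq) pairs from a headerless .bin payload."""
    if len(blob) % ENTRY_SIZE != 0:
        raise ValueError("payload length is not a multiple of 36 bytes")
    for offset in range(0, len(blob), ENTRY_SIZE):
        txhash = blob[offset:offset + 32]
        (ledger_seq,) = struct.unpack(">I", blob[offset + 32:offset + ENTRY_SIZE])
        yield txhash, ledger_seq


def nibble_of(txhash: bytes) -> int:
    """High 4 bits of txhash[0]; selects one of the 16 CFs (0x0 to 0xf)."""
    return txhash[0] >> 4


def raw_txhash_relpath(chunk_id: int) -> str:
    """Path relative to the raw txhash root (hypothetical helper)."""
    bucket_id = chunk_id // 1000
    return f"{bucket_id:05d}/{chunk_id:08d}.bin"


# Usage: one entry whose first byte is 0xA7 routes to CF "a".
example_hash = bytes([0xA7]) + bytes(31)
example_blob = encode_entry(example_hash, 420_001)
```

A fixed-width, headerless layout means the entry count of a `.bin` file is simply its size divided by 36, which is what lets a counting pass size each CF builder before any entries are added.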
**Pseudocode:** ```python -process_chunk(chunk_id): - bucket_id = chunk_id / 1000 # hardcoded subdirectory grouping (see Directory Structure) - first_ledger = chunk_first_ledger(chunk_id) - last_ledger = chunk_last_ledger(chunk_id) +def process_chunk(chunk_id, source): + bucket_id = chunk_id // 1000 # hardcoded subdirectory grouping (see Directory Structure) + first_ledger = first_ledger_in_chunk(chunk_id) + last_ledger = last_ledger_in_chunk(chunk_id) - # 1. Check which outputs are missing + # 1. Check which outputs are missing. need_lfs = not meta_store.has(f"chunk:{chunk_id:08d}:lfs") need_txhash = not meta_store.has(f"chunk:{chunk_id:08d}:txhash") need_events = not meta_store.has(f"chunk:{chunk_id:08d}:events") - if not (need_lfs or need_txhash or need_events): - return # all outputs already present + return # all outputs already present - # 2. Choose data source + # 2. Choose data source for the in-loop ledger read. if not need_lfs: - source = local_packfile(ledger_pack_path(bucket_id, chunk_id)) # NVMe, no BSB + ledger_reader = local_packfile(ledger_pack_path(bucket_id, chunk_id)) # NVMe; no source call else: - source = BSBFactory.create(first_ledger, last_ledger) # BSB connection + ledger_reader = source.get_range(first_ledger, last_ledger) # BSB or captive core - # 3. Open writers only for missing outputs + # 3. Open writers only for missing outputs. ledger_writer = packfile.create(ledger_pack_path(bucket_id, chunk_id), - overwrite=True) if need_lfs else None + overwrite=True) if need_lfs else None txhash_writer = open(raw_txhash_path(bucket_id, chunk_id), - overwrite=True) if need_txhash else None + overwrite=True) if need_txhash else None events_writer = events_segment.create(events_path(bucket_id, chunk_id), - overwrite=True) if need_events else None - - # 4. Process each ledger - for seq in range(first_ledger, last_ledger + 1): - lcm = source.get_ledger(seq) + overwrite=True) if need_events else None + # 4. Process each ledger. 
+ for ledger_seq in range(first_ledger, last_ledger + 1): + lcm = ledger_reader.get_ledger(ledger_seq) if need_lfs: ledger_writer.append(compress(lcm)) - if need_txhash: txhash_writer.append(extract_txhashes(lcm)) # 36 bytes per tx + if need_txhash: txhash_writer.append(extract_txhashes(lcm)) # 36 bytes per tx if need_events: events_writer.append(extract_events(lcm)) - # 5. Fsync + flag each output independently + # 5. Fsync + flag each output independently. if need_lfs: ledger_writer.fsync_and_close() meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") - if need_txhash: txhash_writer.fsync_and_close() meta_store.put(f"chunk:{chunk_id:08d}:txhash", "1") - if need_events: - events_writer.finalize() # flush, build MPHF + bitmap index, fsync + events_writer.finalize() # flush, build MPHF + bitmap index, fsync meta_store.put(f"chunk:{chunk_id:08d}:events", "1") - source.close() + ledger_reader.close() ``` Key properties: -- Only missing outputs are produced — a partially-completed chunk resumes from where it left off -- If LFS is already present, reads from local NVMe instead of BSB (avoids redundant download) -- Each flag is written independently after its output's fsync — no atomic WriteBatch needed -- `packfile.Create()` with `overwrite=True` handles truncation of partial files from prior crashes — no explicit `delete_if_exists` check needed -- Naturally extends to new data types (add a fourth flag) -**BSB** (BufferedStorageBackend): -- Ledger source backed by a remote object store -- Each `process_chunk` task creates its own BSB connection -- Internal prefetch workers: `BUFFER_SIZE` ledgers ahead, `NUM_WORKERS` download goroutines +- Only missing outputs are produced — a partially-completed chunk resumes from where it left off. +- If the ledger pack file (`:lfs` flag) is already present, reads from local NVMe instead of the source (avoids redundant download). +- Each flag is written independently after its output's fsync — no atomic WriteBatch needed. 
+- `packfile.create()` with `overwrite=True` handles truncation of partial files from prior crashes — no explicit `delete_if_exists` check needed. +- Naturally extends to new data types (add a fourth flag). -### build_txhash_index(index_id) +**Source concurrency.** With `source=BSBSource(...)`, many `process_chunk` tasks run in parallel (bounded by `source.max_parallelism() = GOMAXPROCS`). With `source=CaptiveCoreSource(...)`, `source.max_parallelism() = 1` — a single captive core subprocess cannot serve multiple chunk ranges in parallel, so the DAG dispatches chunks sequentially. See [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source) for the interface. -- Builds the RecSplit txhash index for one completed txhash index -- Occupies one DAG worker slot, but spawns several goroutines internally -- The DAG guarantees all chunk `.bin` files exist before this runs +### build_txhash_index(tx_index_id) + +- Builds the RecSplit index for one completed tx index. +- Occupies one DAG worker slot, but spawns several goroutines internally. +- The DAG guarantees all chunk `.bin` files exist before this runs. **Pseudocode:** ```python -build_txhash_index(index_id): - if meta_store.has(f"index:{index_id:08d}:txhash"): - return # already built — no-op +def build_txhash_index(tx_index_id): + if meta_store.has(f"index:{tx_index_id:08d}:txhash"): + return # already built — no-op - bin_files = list_bin_files(index_id) # all .bin files for chunks in this txhash index + # Invariant: every chunk of tx_index_id has its .bin file on disk when this runs. + # Prior-iteration chunks keep their .bin until cleanup_txhash runs, and cleanup only + # runs AFTER this build succeeds (DAG dep). The list below therefore includes every + # chunk's .bin, regardless of which Phase 1 iteration wrote it. 
+ bin_files = list_bin_files(tx_index_id) # all .bin files for chunks in this tx index - # Phase 1: COUNT — scan all .bin files, count entries per CF + # Stage 1: COUNT — scan all .bin files, count entries per CF. cf_counts = parallel_count(bin_files, workers=100) - # cf_counts[nibble] = number of (txhash, ledgerSeq) entries routed to that CF + # cf_counts[nibble] = number of (txhash, ledger_seq) entries routed to that CF - # Phase 2: ADD — re-read .bin files, route entries to CF builders - cf_builders = [RecSplitBuilder(cf_counts[n]) for n in range(16)] + # Stage 2: ADD — re-read .bin files, route entries to CF builders. + cf_builders = [RecSplitBuilder(cf_counts[nibble]) for nibble in range(16)] parallel_add(bin_files, cf_builders, workers=100) # each entry routed to cf_builders[txhash[0] >> 4] (mutex per CF) - # Phase 3: BUILD — build MPH index per CF, one .idx file each + # Stage 3: BUILD — build MPH index per CF, one .idx file each. parallel_build(cf_builders, workers=16) # each CF produces one .idx file; all fsynced - # Phase 4: VERIFY (optional) — look up every key in the built indexes - if verify_recsplit: - parallel_verify(bin_files, cf_builders, workers=100) + # Stage 4: VERIFY — look up every key in the built indexes (full verification, + # since backfill has no wall-clock pressure). + parallel_verify(bin_files, cf_builders, workers=100) - # Mark index complete - meta_store.put(f"index:{index_id:08d}:txhash", "1") + # Flag. Backfill writes "1" only; streaming's prune path may later transition + # this key through "deleting" before clearing it — see 02-streaming-workflow.md. 
+ meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") ``` Key properties: -- COUNT and ADD each read all `.bin` files (two full passes over the data) -- BUILD runs 16 goroutines in parallel (one per CF) — each CF is independent -- VERIFY is skippable via `--verify-recsplit=false` cli flag -- All-or-nothing recovery: if `index:{N}:txhash` is absent on restart → delete partial `.idx` files → rerun entire build -### cleanup_txhash(index_id) +- COUNT and ADD each read all `.bin` files (two full passes over the data). +- BUILD runs 16 goroutines in parallel (one per CF) — each CF is independent. +- VERIFY always runs (there is no `--verify-recsplit=false` escape hatch — backfill trades throughput for correctness every time). +- All-or-nothing recovery: if `index:{tx_index_id:08d}:txhash` is absent on restart → delete partial `.idx` files → rerun entire build. + +### cleanup_txhash(tx_index_id) -- Runs after `build_txhash_index` completes successfully +- Runs after `build_txhash_index` completes successfully. 
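
Cleanup removes a file and then a meta key as two separate steps, so it has to tolerate a crash between them. The sketch below shows one way that ordering stays retry-safe. It is an illustration under stated assumptions: a plain `dict` stands in for the meta store, a temporary directory stands in for the raw txhash root, and `cleanup_chunk` is a hypothetical single-chunk helper rather than the task itself.

```python
import os
import tempfile


def delete_if_exists(path: str) -> None:
    """Remove a file, treating 'already gone' as success."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def cleanup_chunk(meta: dict, root: str, chunk_id: int) -> bool:
    """Delete one chunk's raw .bin file, then its meta key.

    Returns True if any work was done, False if the chunk was already clean.
    """
    key = f"chunk:{chunk_id:08d}:txhash"
    if key not in meta:
        return False  # already cleaned up
    bucket_id = chunk_id // 1000
    delete_if_exists(os.path.join(root, f"{bucket_id:05d}", f"{chunk_id:08d}.bin"))
    # The key is removed last: a crash before this line leaves the key in
    # place, so the next run repeats the (harmless) file delete.
    del meta[key]
    return True


def demo_crash_between_steps() -> tuple:
    """Simulate a crash after the file delete but before the key delete."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "00000"))
        path = os.path.join(root, "00000", "00000042.bin")
        open(path, "wb").close()
        meta = {"chunk:00000042:txhash": "1"}
        os.remove(path)  # the "crashed" run got this far
        first = cleanup_chunk(meta, root, 42)   # restart: key present, file gone
        second = cleanup_chunk(meta, root, 42)  # later run: nothing left to do
        return first, second, meta
```

Deleting the key last keeps the meta store as the single record of outstanding work: a surviving key always means "retry this chunk", whether or not the file is still on disk.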
**Pseudocode:** ```python -cleanup_txhash(index_id): - for chunk_id in chunks_for_index(index_id): +def cleanup_txhash(tx_index_id): + for chunk_id in chunks_for_tx_index(tx_index_id, config): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - continue # already cleaned up — skip - delete(raw_txhash_path(bucket_id, chunk_id)) # remove .bin file - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # remove meta key + continue # already cleaned up — skip + bucket_id = chunk_id // 1000 + delete_if_exists(raw_txhash_path(bucket_id, chunk_id)) # remove .bin (idempotent — crash + # between .bin delete and flag delete + # is safe to retry on restart) + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # remove meta key ``` Key properties: -- Modeled as a separate DAG task (not inline in `build_txhash_index`) so crash recovery works naturally -- Per-chunk idempotency: each chunk checks its own `chunk:{C}:txhash` key before deleting — a crash mid-cleanup resumes from where cleanup left off -- On restart: DAG sees txhash index key present (build complete) but `chunk:{C}:txhash` keys still exist → cleanup runs as a normal task + +- Modeled as a separate DAG task (not inline in `build_txhash_index`) so crash recovery works naturally. +- Per-chunk idempotency: each chunk checks its own `chunk:{chunk_id:08d}:txhash` key before deleting — a crash mid-cleanup resumes from where cleanup left off. +- On restart: DAG sees the tx-index key present (build complete) but `chunk:{chunk_id:08d}:txhash` keys still exist → cleanup runs as a normal task. --- @@ -581,56 +567,58 @@ Key properties: ### DAG Scheduler -- Pipeline builds a single DAG at startup, executes it with bounded concurrency -- The DAG is the only scheduling mechanism — no per-txhash-index coordinators, no secondary worker pools -- Each task's `Execute()` is wrapped with a retry loop bounded by `--max-retries` (default 3). Any transient failure (BSB errors, temporary I/O issues) triggers a retry at the task level. 
+- The subroutine builds a single DAG per invocation and executes it with bounded concurrency. +- The DAG is the only scheduling mechanism — no per-tx-index coordinators, no secondary worker pools. +- Each task's `execute()` is wrapped with a retry loop bounded by `MAX_RETRIES` (implementation-defined constant). Any transient failure (BSB errors, temporary I/O issues) triggers a retry at the task level. ```python -run_dag(dag, max_workers): +def run_dag(dag, max_workers): worker_slots = Semaphore(max_workers) runnable_tasks = ThreadSafeQueue(dag.tasks_with_no_pending_dependencies()) def execute_task(task): """Runs in a background thread — one per dispatched task.""" - for attempt in range(1, max_retries + 1): + for attempt in range(1, MAX_RETRIES + 1): error = task.execute() if error is None: break - if attempt == max_retries: - mark_failed(task, error) # halt all dependents + if attempt == MAX_RETRIES: + mark_failed(task, error) # halt all dependents break log.warn("retry", task, attempt, error) - worker_slots.release() # free worker slot + worker_slots.release() # free worker slot - # Check if completing this task unblocks any downstream tasks - for downstream in dag.dependents_of(task): - downstream.mark_dependency_done(task) - if downstream.all_dependencies_done(): - runnable_tasks.push(downstream) # now eligible to run + # Check if completing this task unblocks any downstream tasks. + for downstream_task in dag.dependents_of(task): + downstream_task.mark_dependency_done(task) + if downstream_task.all_dependencies_done(): + runnable_tasks.push(downstream_task) # now eligible to run - # Main loop — dispatches tasks as they become runnable + # Main loop — dispatches tasks as they become runnable. 
while runnable_tasks: current_task = runnable_tasks.pop() - worker_slots.acquire() # block until a worker slot is free - run_in_background(execute_task, current_task) # launch — returns immediately + worker_slots.acquire() # block until a worker slot is free + run_in_background(execute_task, current_task) # launch — returns immediately ``` ### Worker Pool -- Single flat pool of `workers` slots (default `GOMAXPROCS`) -- Any mix of task types can occupy slots simultaneously -- `process_chunk`: 1 slot per task -- `build_txhash_index`: 1 slot per task (uses many goroutines internally) -- `cleanup_txhash`: 1 slot per task +- Single flat pool of `max_workers` slots, set by `source.max_parallelism()`: + - `BSBSource.max_parallelism() = GOMAXPROCS`. + - `CaptiveCoreSource.max_parallelism() = 1`. +- Any mix of task types can occupy slots simultaneously. +- `process_chunk`: 1 slot per task. +- `build_txhash_index`: 1 slot per task (uses many goroutines internally). +- `cleanup_txhash`: 1 slot per task. ### How Work Flows Through the Pipeline -- All `process_chunk` tasks have no dependencies → DAG dispatches up to `workers` slots immediately at startup -- Chunks from different txhash indexes run side by side — the scheduler does not process txhash indexes sequentially -- When the last chunk of a txhash index completes → `build_txhash_index` becomes eligible, claims a slot -- After build completes → `cleanup_txhash` becomes eligible -- Remaining slots continue processing chunks for other txhash indexes throughout — no special coordination needed +- All `process_chunk` tasks have no dependencies → the DAG dispatches up to `max_workers` at startup. +- Chunks from different tx indexes run side by side — the scheduler does not process tx indexes sequentially. +- When the last chunk of a tx index completes → `build_txhash_index` becomes eligible and claims a slot. +- After build completes → `cleanup_txhash` becomes eligible. 
+- Remaining slots continue processing chunks for other tx indexes throughout — no special coordination needed. --- @@ -638,42 +626,22 @@ run_dag(dag, max_workers): There is no separate crash recovery, reconciliation, or startup triage phase. Recovery happens organically because every task's `execute()` checks its own completion state: -- On every startup, `build_dag()` registers ALL tasks for the configured range — no meta store scanning in DAG setup -- `process_chunk` checks each output flag independently — missing outputs are produced, existing outputs are skipped -- `build_txhash_index` checks `index:{N}:txhash` — if present, returns immediately; if absent, deletes partial `.idx` files and reruns the full build -- `cleanup_txhash` checks `chunk:{C}:txhash` per-chunk — already-cleaned chunks are skipped, remaining chunks are cleaned up +- On every invocation, `build_dag()` registers ALL tasks for the chunk range — no meta store scanning in DAG setup. +- `process_chunk` checks each output flag independently — missing outputs are produced, existing outputs are skipped. +- `build_txhash_index` checks `index:{tx_index_id:08d}:txhash` — if present, returns immediately; if absent, deletes partial `.idx` files and reruns the full build. +- `cleanup_txhash` checks `chunk:{chunk_id:08d}:txhash` per-chunk — already-cleaned chunks are skipped, remaining chunks are cleaned up. -This works because of three invariants: +Three invariants make this work: -1. **Key implies durable file** — a meta store flag is set only after fsync -2. **Tasks are idempotent** — each checks its own outputs and skips or overwrites what exists -3. **DAG registers all tasks on every startup** — completed tasks return immediately from `execute()` +1. **Key implies durable file** — a meta store flag is set only after fsync. +2. **Tasks are idempotent** — each checks its own outputs and skips or overwrites what exists. +3. 
**DAG registers all tasks on every invocation** — completed tasks return immediately from `execute()`. ### Concurrent Access Prevention -- Meta store RocksDB uses kernel-level `flock()` on a `LOCK` file -- A second process attempting to open the same meta store fails immediately -- Released automatically on process exit (including `kill -9`) - - ---- - -## getStatus API Response - -During backfill, `getStatus` returns progress as task-type summaries: -- No per-txhash-index breakdown — just completed/pending/in_progress counts per task type - -```json -{ - "mode": "BACKFILL", - "tasks": { - "process_chunk": {"completed": 288, "pending": 5712, "in_progress": 40}, - "build_txhash_index": {"completed": 0, "pending": 6, "in_progress": 0}, - "cleanup_txhash": {"completed": 0, "pending": 6, "in_progress": 0} - }, - "eta_seconds": 1820 -} -``` +- Meta store RocksDB uses kernel-level `flock()` on a `LOCK` file. +- A second process attempting to open the same meta store fails immediately. +- Released automatically on process exit (including `kill -9`). --- @@ -681,18 +649,16 @@ During backfill, `getStatus` returns progress as task-type summaries: Two layers of retry: -- **BSB retries** — BSB handles transient errors internally (connection resets, throttling, etc). These retries happen within a single task execution and are not visible to the DAG scheduler. -- **Task-level retries** — the DAG scheduler wraps each task's `execute()` with a retry loop bounded by `--max-retries` (default 3). If a task returns an error after BSB has exhausted its own retries, the scheduler retries the entire task. After `--max-retries` exhausted → task marked failed → DAG halts all dependent tasks → process exits non-zero. - -Operator re-runs the same command; completed work is never repeated. +- **Source-internal retries** — the `LedgerSource` handles transient errors internally (BSB connection resets, throttling, captive core subprocess hiccups). 
These retries happen inside a single task execution and are invisible to the DAG scheduler. +- **Task-level retries** — the DAG scheduler wraps each task's `execute()` with a retry loop bounded by `MAX_RETRIES`. After the source has exhausted its own retries, the scheduler retries the entire task. After `MAX_RETRIES` exhausted → task marked failed → DAG halts all dependents → `run_backfill` returns a fatal error → Phase 1 propagates → daemon exits non-zero. Operator investigates, fixes the root cause, and restarts the daemon; restart re-enters Phase 1, which re-invokes `run_backfill` with a fresh chunk range, and already-complete work is skipped via per-chunk idempotency. | Error | Handled by | Action | |-------|-----------|--------| -| BSB transient error (throttle, connection reset) | BSB internal retry | Retried within the task; transparent to DAG | -| BSB persistent error (BSB retries exhausted) | Task-level retry | `--max-retries` attempts; then ABORT | -| Ledger pack write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set | -| TxHash write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set | -| Events write / fsync failure | Task-level retry | `--max-retries` attempts; then ABORT; flag not set | -| RecSplit build failure | Task-level retry | `--max-retries` attempts; then ABORT; txhash index key absent | -| Verify phase mismatch | None | ABORT immediately — data corruption, operator investigates | -| Meta store write failure | None | ABORT immediately — treat as crash, operator re-runs | +| Source transient error (throttle, connection reset) | Source-internal retry | Retried within the task; transparent to DAG | +| Source persistent error (source retries exhausted) | Task-level retry | `MAX_RETRIES` attempts; then ABORT | +| Ledger pack write / fsync failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; flag not set | +| Txhash write / fsync failure | Task-level retry | 
`MAX_RETRIES` attempts; then ABORT; flag not set | +| Events write / fsync failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; flag not set | +| RecSplit build failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; tx-index key absent | +| VERIFY stage mismatch | None | ABORT immediately — data corruption; operator investigates | +| Meta store write failure | None | ABORT immediately — treat as crash; operator re-runs daemon | From 2c8652016b83a6713c4497d5f709dc0e0b151c23 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 19:50:48 -0700 Subject: [PATCH 14/34] Streaming doc: Phase A follow-up fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Remove ## Required Backfill Design Changes section — every item in it was applied in the prior session's backfill-doc rewrite, so the section was obsolete. - Deduplicate ### Invariants: cut the three bullets that overlapped backfill's three crash-recovery invariants (flag-after-fsync, idempotent writes, DAG-structured cleanup). The cross-reference lead-in stays; remaining 5 invariants renumbered 1–5. - Fix Meta Store Keys table entry for streaming:last_committed_ledger: was 'last_ledger_in_chunk(phase1_coverage_end_ledger)', which is a type error (phase1_coverage_end_ledger() returns a ledger seq, not a chunk_id). Actual pseudocode in phase4_live_ingest stores the value directly. Table entry now matches. - Normalize cpi=1000 → cpi=1_000 across the doc (5 occurrences) per the underscore-for-≥1000 style rule. - LedgerSource redesign to mirror the stellar Go SDK's LedgerBackend pattern: replace get_range(start, end) -> Iterator with prepare_range(start, end) -> None + get_ledger(seq) -> LCM. Both BSBSource and CaptiveCoreSource already expose this via the SDK, so random-access reads are native. Updated docstrings for both subclasses. Paired with the corresponding backfill-doc change in the next commit. 
--- .../design-docs/02-streaming-workflow.md | 77 ++++++++++--------- 1 file changed, 39 insertions(+), 38 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 99f9325b4..974943ba9 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -121,7 +121,7 @@ See [Ledger Source](#ledger-source) for the full source-selection rule. - `CHUNKS_PER_TXHASH_INDEX` immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). - `RETENTION_LEDGERS` immutable across runs. -- `RETENTION_LEDGERS` must be `0` OR a positive integer multiple of `LEDGERS_PER_INDEX`. Valid at `cpi=1000`: `0`, `10_000_000`, `20_000_000`, `30_000_000`, etc. Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum). Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. +- `RETENTION_LEDGERS` must be `0` OR a positive integer multiple of `LEDGERS_PER_INDEX`. Valid at `cpi=1_000`: `0`, `10_000_000`, `20_000_000`, `30_000_000`, etc. Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum). Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. - `[BACKFILL.BSB]` optional — presence determines Phase 1 source. May be added or removed between runs. - `[HISTORY_ARCHIVES].URLS` required in all profiles. - `CAPTIVE_CORE_CONFIG` required in all profiles. @@ -316,7 +316,7 @@ Single RocksDB instance, WAL always enabled. Authoritative source for every star | Key | Value | Written when | |---|---|---| -| `streaming:last_committed_ledger` | uint32 (big-endian) | First written at top of Phase 4 to `last_ledger_in_chunk(phase1_coverage_end_ledger)`; subsequently after every committed live ledger. 
**Not updated during Phases 1–3.** Phase 1 progress is tracked by `chunk:{chunk_id}:lfs` flags alone. | +| `streaming:last_committed_ledger` | uint32 (big-endian) | First written at top of Phase 4 to `phase1_coverage_end_ledger(meta_store)` (the end of the contiguous `:lfs` prefix — already a ledger sequence); subsequently after every committed live ledger. **Not updated during Phases 1–3.** Phase 1 progress is tracked by `chunk:{chunk_id}:lfs` flags alone. | | `config:retention_ledgers` | decimal string | First run (stored); enforced on subsequent starts. | ### Keys Shared with Backfill @@ -402,24 +402,38 @@ The daemon maintains three active stores for the current ingestion position. All Phase 1 reads ledgers from a source. Two implementations share one interface. Source is selected per-startup based on `[BACKFILL.BSB]` presence — no stored immutability gate. Operators may add or remove BSB between runs; retention immutability alone constrains the data envelope. +The interface mirrors the stellar Go SDK's `LedgerBackend` pattern (`PrepareRange` + `GetLedger`) — both implementations below (BSB and captive core) already expose that pattern in the SDK, so random-access reads are native and no sequential-iterator shim is needed. + ```python class LedgerSource: """ - Provides a stream of LedgerCloseMeta for a contiguous ledger range. Used by the backfill + Provides random-access LedgerCloseMeta reads for a prepared range. Used by the backfill subroutine inside Phase 1. Live streaming (Phase 4) does NOT go through this abstraction — - it reads directly from CaptiveStellarCore via `ledgerBackend.PrepareRange(UnboundedRange(...))`. + it reads directly from CaptiveStellarCore via `ledgerBackend.PrepareRange(UnboundedRange(...))` + + `ledgerBackend.GetLedger(seq)`. + + Usage pattern: run_backfill calls prepare_range ONCE for the full backfill run, then + process_chunk tasks concurrently call get_ledger(seq) for any seq inside the prepared + range. 
Implementations must be safe under concurrent get_ledger calls (the DAG dispatches + up to max_parallelism() process_chunk workers). """ def tip(self) -> int: - """Current network tip ledger. Used to compute Phase 1 target range.""" + """Current network tip ledger. Used to compute Phase 1 target range. Callable without + a prior prepare_range.""" - def get_range(self, start_ledger, end_ledger) -> Iterator[LedgerCloseMeta]: - """Stream LCMs for [start_ledger, end_ledger] inclusive. Must tolerate re-invocation — - the backfill DAG resumes per-chunk on crash.""" + def prepare_range(self, start_ledger, end_ledger) -> None: + """Prime the source for random-access reads in [start_ledger, end_ledger] inclusive. + Called once per run_backfill invocation (phase1_catchup may invoke run_backfill + multiple times, each with its own range). Must tolerate re-invocation.""" + + def get_ledger(self, ledger_seq) -> LedgerCloseMeta: + """Return the LCM for ledger_seq. Requires prepare_range to have covered ledger_seq. + Thread-safe under concurrent calls from process_chunk workers.""" def max_parallelism(self) -> int: - """Upper bound on concurrent get_range calls the source can sustain. Backfill DAG - honors this when dispatching process_chunk workers.""" + """Upper bound on concurrent get_ledger call chains the source can sustain. Backfill + DAG honors this when dispatching process_chunk workers.""" class BSBSource(LedgerSource): @@ -427,7 +441,9 @@ class BSBSource(LedgerSource): Reads from the BSB (Buffered Storage Backend) bucket configured in [BACKFILL.BSB]. - Tip: queried from BSB's own range-end metadata. Same mechanism backfill uses today. - - get_range: parallel prefetch via BUFFER_SIZE + NUM_WORKERS knobs; same shape as backfill. + - prepare_range: sets the BSB-backed LedgerBackend's range; BSB internal prefetch workers + (BUFFER_SIZE, NUM_WORKERS) fill buffers ahead of get_ledger reads. 
+ - get_ledger: random-access via the SDK's GetLedger(seq); reads from the prefetch buffer. - max_parallelism: GOMAXPROCS (backfill's current default). """ @@ -438,8 +454,9 @@ class CaptiveCoreSource(LedgerSource): - Tip: fetched via HTTP GET on /.well-known/stellar-history.json against HISTORY_ARCHIVE_URLS. Matches the existing ingest service pattern (Service.getNextLedgerSequence → archive.GetRootHAS()). - - get_range: drives captive core with ledgerBackend.PrepareRange(BoundedRange(start, end)), - drains sequential GetLedger(seq) calls. + - prepare_range: spins up (or re-primes) captive core with BoundedRange(start, end). + - get_ledger: random-access via the SDK's GetLedger(seq); blocks until that ledger is + available in the captive-core subprocess's emitted stream. - max_parallelism: 1. Captive core is a single heavy subprocess; parallelism would require multiple subprocesses, each consuming several GB RAM. Backfill DAG dispatches chunks sequentially when source is captive core. @@ -461,7 +478,7 @@ def select_phase1_ledger_source(config): When Phase 1 uses captive core, `RETENTION_LEDGERS` directly determines how many ledgers captive core must archive-catchup on first start: -- `RETENTION_LEDGERS = 10_000_000` at `cpi=1000`: captive core archive-catches-up ~10M ledgers (hours to days). +- `RETENTION_LEDGERS = 10_000_000` at `cpi=1_000`: captive core archive-catches-up ~10M ledgers (hours to days). - `RETENTION_LEDGERS = 10_000` at `cpi=1`: captive core archive-catches-up ~10K ledgers (~3–8 min). This is the main reason tip-tracker operators default to `cpi=1`: at cpi=1 a full index is 10K ledgers, so retention can be set small without violating the "multiple of LEDGERS_PER_INDEX" rule. @@ -569,7 +586,7 @@ def compute_backfill_chunk_range(last_committed_ledger, network_tip_ledger, rete down to the first ledger of its containing tx index. That rounded value is the new head of coverage; every earlier ledger is past retention and skipped. 
- Worst case: up to LEDGERS_PER_INDEX - 1 ledgers past the strict retention line are - ingested and held on disk. At cpi=1000 this is ~10M ledgers; at cpi=1 it is ~10k. + ingested and held on disk. At cpi=1_000 this is ~10M ledgers; at cpi=1 it is ~10k. """ gap_start_ledger = last_committed_ledger + 1 if retention_ledgers > 0: @@ -689,7 +706,7 @@ def phase2_hydrate_txhash(config, meta_store): txhash_store.close() ``` -**Why "load then delete" matters.** Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. At `cpi=1000` with frequent restarts over a day, that is thousands of redundant loads. Deleting the `.bin` after the first successful load makes Phase 2 a no-op on every subsequent restart until the next Phase 1 deposits new `.bin` files. +**Why "load then delete" matters.** Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. At `cpi=1_000` with frequent restarts over a day, that is thousands of redundant loads. Deleting the `.bin` after the first successful load makes Phase 2 a no-op on every subsequent restart until the next Phase 1 deposits new `.bin` files. **Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files — streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 is a trivial no-op in that case. @@ -1142,7 +1159,7 @@ def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): max_eligible_tx_index_id = ((last_committed_ledger - GENESIS_LEDGER - retention_ledgers) // LEDGERS_PER_INDEX) - 1 Numeric check at last_committed_ledger=70_000_002, retention_ledgers=10_000_000, - cpi=1000 (LEDGERS_PER_INDEX=10_000_000): + cpi=1_000 (LEDGERS_PER_INDEX=10_000_000): max_eligible_tx_index_id = (70_000_002 - 2 - 10_000_000) // 10_000_000 - 1 = 6 - 1 = 5. tx_index_id=5 has last_ledger_in_tx_index(5) + retention_ledgers = 60_000_001 + 10_000_000 = 70_000_001. 
70_000_002 > 70_000_001 → tx_index_id=5 eligible. ✓ @@ -1237,14 +1254,11 @@ No separate recovery phase. Every startup runs Phases 1–4 regardless — alrea In addition to the backfill subroutine's invariants in [01-backfill-workflow.md — Crash Recovery](./01-backfill-workflow.md#crash-recovery), streaming adds the following: -1. **Flag-after-fsync.** A meta store flag is set only after the corresponding file(s) are fsynced. Flag absent = output treated as missing → transition retried from scratch. -2. **Idempotent writes.** The same input ledger always produces the same key-value pairs in all stores. Re-processing after crash is safe. -3. **Per-ledger checkpoint.** `streaming:last_committed_ledger` is written only after all three active stores durably commit. Resume is `last_committed_ledger + 1`. -4. **No separate recovery phase.** Startup is Phases 1–4. Nothing else. -5. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. -6. **DAG-structured cleanup.** Cleanup runs as a separate step after the flag is set. Crash between flag and cleanup = retry just the cleanup on restart. -7. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 itself avoids producing them. -8. **Two-phase prune marker.** `prune_tx_index` writes `index:{tx_index_id}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `prunable_tx_index_ids`. +1. **Per-ledger checkpoint.** `streaming:last_committed_ledger` is written only after all three active stores durably commit. Resume is `last_committed_ledger + 1`. +2. **No separate recovery phase.** Startup is Phases 1–4. 
Nothing else. +3. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. +4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 itself avoids producing them. +5. **Two-phase prune marker.** `prune_tx_index` writes `index:{tx_index_id}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `prunable_tx_index_ids`. ### Compound Recovery Scenarios @@ -1292,19 +1306,6 @@ drift_ledgers = ledger_backend.latest_tip() - meta_store.get("streaming:last_com --- -## Required Backfill Design Changes - -The unified design requires edits to `01-backfill-workflow.md` (authoritative `03-backfill-workflow.md` on `feature/full-history`): - -1. **Drop the `stellar-rpc full-history-backfill` cobra subcommand and all its per-run CLI flags** (`--start-ledger`, `--end-ledger`, `--workers`, `--verify-recsplit`, `--max-retries`). Backfill is no longer an operator-facing CLI entry point. `process_chunk`, `build_txhash_index`, `cleanup_txhash`, and the DAG scheduler remain as subroutines invoked by streaming Phase 1. -2. **Change `run_backfill`'s signature to `run_backfill(config, range_start_chunk, range_end_chunk, source=...)`.** Previously `run_backfill(config, flags)` with `flags.start_ledger` and `flags.end_ledger`. Phase 1 computes chunk IDs (not ledger sequences) via `compute_backfill_chunk_range`, so the subroutine takes chunk IDs directly. `source=` selects BSB vs captive core. -3. **Extend `process_chunk` with the matching `source=` parameter** accepting a `LedgerSource`. Default (`BSBSource`) matches today's behavior. 
The subroutine no longer creates a BSB connection from config — it uses whatever the caller passed in. -4. **Extend the DAG worker cap to honor `source.max_parallelism()`.** Currently the DAG caps at `--workers` (default GOMAXPROCS). Under `CaptiveCoreSource`, cap at 1. -5. **Move `retention_ledgers` validation + store-on-first-run into shared validation.** Streaming's `validate_config` handles the store+compare. Backfill itself doesn't need to know retention — Phase 1 translates retention into the `[range_start_chunk, range_end_chunk]` it passes into `run_backfill`. -6. **Artifact key values stay as `"1"`.** No state-machine extension (`"frozen"` / `"pruning"`) needed — the `chunk:{chunk_id}:txhash` key is transient (Phase 2 deletes it) and the remaining keys have simple presence/absence semantics. Exception: `index:{tx_index_id:08d}:txhash` uses `"1"` or `"deleting"` for two-phase prune (spec'd in this doc under [Pruning](#pruning); backfill's `build_txhash_index` only ever writes `"1"`). - ---- - ## Related Documents - [01-backfill-workflow.md](./01-backfill-workflow.md) — backfill subroutine: DAG, `process_chunk`, partial index handling From 16a4fa66664ce593928f2cda3c0c23480ae38ae3 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 19:50:58 -0700 Subject: [PATCH 15/34] Backfill doc: Phase A follow-up fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Integrate LedgerSource interface redesign (prepare_range + get_ledger; see prior streaming-doc commit). run_backfill now calls source.prepare_range(first, last) ONCE at the top, before the DAG dispatches. process_chunk uses the pre-prepared source (no per-task prepare). ledger_reader.close() is now guarded by 'if not need_lfs' so only the local-packfile handle is closed; the shared source stays open for the run's lifetime. - Update Validation Rules bullet to reference prepare_range + get_ledger instead of the removed get_range. 
- Cut ### How Work Flows Through the Pipeline — restated what the DAG Scheduler pseudocode above already encodes. - Replace ### Concurrent Access Prevention's 3-bullet RocksDB flock description with a one-line statement that the daemon acquires a directory flock on the meta-store at startup. --- .../design-docs/01-backfill-workflow.md | 37 ++++++++++--------- 1 file changed, 20 insertions(+), 17 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index 763d8187d..62031ce87 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -134,7 +134,7 @@ Source selection at the daemon level (BSB vs captive core, based on `[BACKFILL.B ### Validation Rules - `CHUNKS_PER_TXHASH_INDEX` must not change after the first run — the daemon's `validate_config` enforces this at startup (see [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode)). -- When the caller invokes backfill with `source=BSBSource(...)`, `[BACKFILL.BSB]` must be present AND the source must cover the requested chunk range. `run_backfill`'s `validate` asserts `source.tip() >= last_ledger_in_chunk(range_end_chunk_id)` at the start; lower-bound coverage (bucket retention floor, captive core history start) is verified by `source.get_range` during execution. +- When the caller invokes backfill with `source=BSBSource(...)`, `[BACKFILL.BSB]` must be present AND the source must cover the requested chunk range. `run_backfill`'s `validate` asserts `source.tip() >= last_ledger_in_chunk(range_end_chunk_id)` at the start; `run_backfill` then calls `source.prepare_range(first_ledger, last_ledger)` once, after which lower-bound coverage (bucket retention floor, captive core history start) is verified per-ledger via `source.get_ledger(seq)` during execution. 
### Partial Tx Index Ranges @@ -348,10 +348,19 @@ def run_backfill(config, range_start_chunk_id, range_end_chunk_id, source): # 1. Validate — abort before any work if inputs are inconsistent with existing state. validate(config, range_start_chunk_id, range_end_chunk_id, source) - # 2. Build DAG — register all tasks; each task's execute() handles its own no-op check. + # 2. Prime the source for random-access reads across the entire run range. Done once + # per run_backfill invocation; process_chunk tasks then call source.get_ledger(seq) + # concurrently for seqs inside this range. Mirrors the stellar Go SDK's LedgerBackend + # pattern (PrepareRange → GetLedger). + source.prepare_range( + first_ledger_in_chunk(range_start_chunk_id), + last_ledger_in_chunk(range_end_chunk_id), + ) + + # 3. Build DAG — register all tasks; each task's execute() handles its own no-op check. dag = build_dag(config, range_start_chunk_id, range_end_chunk_id, source) - # 3. Execute — dispatch all tasks concurrently, bounded by the source's parallelism. + # 4. Execute — dispatch all tasks concurrently, bounded by the source's parallelism. dag.execute(max_workers=source.max_parallelism()) ``` @@ -444,11 +453,12 @@ def process_chunk(chunk_id, source): if not (need_lfs or need_txhash or need_events): return # all outputs already present - # 2. Choose data source for the in-loop ledger read. + # 2. Choose data source for the in-loop ledger read. Both options expose get_ledger(seq) + # for random-access reads; source was already prepared by run_backfill's prepare_range call. if not need_lfs: ledger_reader = local_packfile(ledger_pack_path(bucket_id, chunk_id)) # NVMe; no source call else: - ledger_reader = source.get_range(first_ledger, last_ledger) # BSB or captive core + ledger_reader = source # BSB or captive core; pre-prepared # 3. Open writers only for missing outputs. 
ledger_writer = packfile.create(ledger_pack_path(bucket_id, chunk_id), @@ -476,7 +486,10 @@ def process_chunk(chunk_id, source): events_writer.finalize() # flush, build MPHF + bitmap index, fsync meta_store.put(f"chunk:{chunk_id:08d}:events", "1") - ledger_reader.close() + # Close the local packfile handle when we used one. The source case is shared across + # tasks and stays open until run_backfill returns — no per-task close. + if not need_lfs: + ledger_reader.close() ``` Key properties: @@ -612,14 +625,6 @@ def run_dag(dag, max_workers): - `build_txhash_index`: 1 slot per task (uses many goroutines internally). - `cleanup_txhash`: 1 slot per task. -### How Work Flows Through the Pipeline - -- All `process_chunk` tasks have no dependencies → the DAG dispatches up to `max_workers` at startup. -- Chunks from different tx indexes run side by side — the scheduler does not process tx indexes sequentially. -- When the last chunk of a tx index completes → `build_txhash_index` becomes eligible and claims a slot. -- After build completes → `cleanup_txhash` becomes eligible. -- Remaining slots continue processing chunks for other tx indexes throughout — no special coordination needed. - --- ## Crash Recovery @@ -639,9 +644,7 @@ Three invariants make this work: ### Concurrent Access Prevention -- Meta store RocksDB uses kernel-level `flock()` on a `LOCK` file. -- A second process attempting to open the same meta store fails immediately. -- Released automatically on process exit (including `kill -9`). +The daemon acquires a directory flock on the meta-store at startup. A second process against the same datadir fails immediately. 
--- From 7fe88f7dccaa34ed5b50565fae4942e2138a8d81 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 19:58:25 -0700 Subject: [PATCH 16/34] Backfill doc: Phase B grill-me fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Standardize path-helper signatures to 1-arg (chunk_id), matching 02-streaming-workflow.md. Was 2-arg (bucket_id, chunk_id); bucket_id is derivable as chunk_id // 1000 and now computed inside each helper. Affected call sites: process_chunk + cleanup_txhash. - Rename events_path → events_segment_path to match the streaming doc's name for the same helper. - build_txhash_index: add an explicit cleanup preamble that deletes partial .idx files from a prior crashed attempt. The property list below the pseudocode already documented this as 'All-or-nothing recovery'; now the pseudocode shows it. Numbered stages stay 1-4 (COUNT / ADD / BUILD / VERIFY); cleanup is a preamble, not a numbered stage. - Overview table row for getTransaction: 'Txhash index files' → 'Tx-index files' to match the 'tx index' form used in the scope column of the same row. 
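Illustrative note on the path-helper bullet above: a minimal sketch of 1-arg helpers that derive the bucket internally as chunk_id // 1000. The helper names come from this patch. `DATA_ROOT`, `CHUNKS_PER_BUCKET`, `bucket_id_for`, and the directory and file-name layout are assumptions made for the sketch, not the layout from the Directory Structure section.

```python
from pathlib import PurePosixPath

CHUNKS_PER_BUCKET = 1_000  # the "chunk_id // 1000" grouping from the patch
DATA_ROOT = PurePosixPath("/data/full-history")  # hypothetical root


def bucket_id_for(chunk_id: int) -> int:
    """Derive the bucket from the chunk so callers never pass it explicitly."""
    if chunk_id < 0:
        raise ValueError("chunk_id must be non-negative")
    return chunk_id // CHUNKS_PER_BUCKET


def _chunk_dir(chunk_id: int) -> PurePosixPath:
    # Hypothetical layout: one subdirectory per bucket.
    return DATA_ROOT / f"{bucket_id_for(chunk_id):04d}"


def ledger_pack_path(chunk_id: int) -> PurePosixPath:
    return _chunk_dir(chunk_id) / f"{chunk_id:08d}.pack"


def raw_txhash_path(chunk_id: int) -> PurePosixPath:
    return _chunk_dir(chunk_id) / f"{chunk_id:08d}.bin"


def events_segment_path(chunk_id: int) -> PurePosixPath:
    return _chunk_dir(chunk_id) / f"{chunk_id:08d}-events"
```

With this shape a caller and a helper cannot disagree about which bucket a chunk belongs to, which is the motivation the bullet gives for dropping the second argument.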
--- .../design-docs/01-backfill-workflow.md | 29 +++++++++++-------- 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index 62031ce87..73e4ad5c1 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -17,7 +17,7 @@ Backfill is a subroutine invoked by Phase 1 of the streaming daemon (see [02-str | Query it enables | Immutable output | Scope | |-----------------|-----------------|-------| | `getLedger` | Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | Per chunk (10_000 ledgers) | -| `getTransaction` | Txhash index files | Per tx index (default 10_000_000 ledgers) | +| `getTransaction` | Tx-index files | Per tx index (default 10_000_000 ledgers) | | `getEvents` | [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | Per chunk | --- @@ -442,7 +442,6 @@ A trailing tx index whose last chunks fall past `range_end_chunk_id` has its `pr ```python def process_chunk(chunk_id, source): - bucket_id = chunk_id // 1000 # hardcoded subdirectory grouping (see Directory Structure) first_ledger = first_ledger_in_chunk(chunk_id) last_ledger = last_ledger_in_chunk(chunk_id) @@ -456,17 +455,18 @@ def process_chunk(chunk_id, source): # 2. Choose data source for the in-loop ledger read. Both options expose get_ledger(seq) # for random-access reads; source was already prepared by run_backfill's prepare_range call. if not need_lfs: - ledger_reader = local_packfile(ledger_pack_path(bucket_id, chunk_id)) # NVMe; no source call + ledger_reader = local_packfile(ledger_pack_path(chunk_id)) # NVMe; no source call else: - ledger_reader = source # BSB or captive core; pre-prepared + ledger_reader = source # BSB or captive core; pre-prepared - # 3. Open writers only for missing outputs. - ledger_writer = packfile.create(ledger_pack_path(bucket_id, chunk_id), + # 3. 
Open writers only for missing outputs. Path helpers derive bucket_id = chunk_id // 1000 + # internally (see Directory Structure). + ledger_writer = packfile.create(ledger_pack_path(chunk_id), overwrite=True) if need_lfs else None - txhash_writer = open(raw_txhash_path(bucket_id, chunk_id), - overwrite=True) if need_txhash else None - events_writer = events_segment.create(events_path(bucket_id, chunk_id), - overwrite=True) if need_events else None + txhash_writer = open(raw_txhash_path(chunk_id), + overwrite=True) if need_txhash else None + events_writer = events_segment.create(events_segment_path(chunk_id), + overwrite=True) if need_events else None # 4. Process each ledger. for ledger_seq in range(first_ledger, last_ledger + 1): @@ -515,6 +515,12 @@ def build_txhash_index(tx_index_id): if meta_store.has(f"index:{tx_index_id:08d}:txhash"): return # already built — no-op + # Cleanup: remove any partial .idx files from a prior crashed attempt. All-or-nothing + # recovery — the tx-index flag is absent iff the build hasn't completed, so any .idx + # files on disk are from a failed or interrupted run and must be discarded before the + # rebuild starts. + delete_partial_idx_files(recsplit_index_path(tx_index_id)) + # Invariant: every chunk of tx_index_id has its .bin file on disk when this runs. # Prior-iteration chunks keep their .bin until cleanup_txhash runs, and cleanup only # runs AFTER this build succeeds (DAG dep). 
The list below therefore includes every @@ -561,8 +567,7 @@ def cleanup_txhash(tx_index_id): for chunk_id in chunks_for_tx_index(tx_index_id, config): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): continue # already cleaned up — skip - bucket_id = chunk_id // 1000 - delete_if_exists(raw_txhash_path(bucket_id, chunk_id)) # remove .bin (idempotent — crash + delete_if_exists(raw_txhash_path(chunk_id)) # remove .bin (idempotent — crash # between .bin delete and flag delete # is safe to retry on restart) meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # remove meta key From 4ff7b5c054a62c4db380e73a0b3d6259d5d074a1 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 19:58:35 -0700 Subject: [PATCH 17/34] Streaming doc: Phase B grill-me fix Update the Txhash index Terminology entry to list 'tx index' as an alias. Both docs dominantly use 'tx index' as the narrative form (backfill 18:0, streaming 8:2) while the formal constant names (CHUNKS_PER_TXHASH_INDEX, tx_index_id, :txhash flag) keep 'txhash' / 'tx'. The alias in Terminology closes the ambiguity without touching narrative usage. --- full-history/design-docs/02-streaming-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 974943ba9..0f7f7da77 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -43,7 +43,7 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - Events cold segment (three files per chunk: `events.pack`, `index.pack`, `index.hash`). - **Freeze transition** — a background goroutine that converts an active store's contents to immutable files and deletes the active store. Three transitions total per chunk (LFS, events) and one per index (RecSplit). - **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze. 
`first_ledger_in_chunk(chunk_id)` always ends in `..._02`; `last_ledger_in_chunk(chunk_id)` always ends in `..._01`. No partial chunks — every chunk on disk is a full 10_000-ledger chunk. -- **Txhash index** (a.k.a. "index") — `CHUNKS_PER_TXHASH_INDEX` consecutive chunks. Atomic unit of retention pruning. Formulas in [Geometry](#geometry). +- **Txhash index** (a.k.a. "tx index", "index") — `CHUNKS_PER_TXHASH_INDEX` consecutive chunks. Atomic unit of retention pruning. Formulas in [Geometry](#geometry). Both docs use "tx index" as the dominant narrative form; "txhash index" appears where the output's role as a txhash lookup is the emphasis. - **Chunk boundary** — the moment ingestion commits the last ledger of a chunk. Triggers background LFS + events freeze for that chunk. - **Index boundary** — the moment ingestion commits the last ledger of an index. Triggers background RecSplit build for that index. Every index boundary is also a chunk boundary. - **Catchup** — synonym for "close the gap between last-committed ledger and current tip". Performed inside Phase 1. From c34c7cc8f4c2ee4fdea1fdcc0f8c8f0343187302 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 21:23:12 -0700 Subject: [PATCH 18/34] Design-docs README: add packfile library doc under See Also MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Both backfill and streaming reference the pack file format via inline PR #633 links, but the design-docs README only pointed at the getEvents design doc. Adding the packfile library doc alongside it so a reader landing on the README sees both related design docs without having to follow an inline PR link first. Also expanded the getEvents bullet to name what's in that doc (hot/cold segment layout, roaring bitmap indexes, MPHF) — matches the shape of the new packfile bullet. 
--- full-history/design-docs/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/full-history/design-docs/README.md b/full-history/design-docs/README.md index 735e44500..e53783762 100644 --- a/full-history/design-docs/README.md +++ b/full-history/design-docs/README.md @@ -14,4 +14,5 @@ ## See Also -- [getEvents full-history design](../../design-docs/getevents-full-history-design.md) — events hot/cold segment layout; consumed by both docs above. +- [packfile library design](../../design-docs/packfile-library.md) — binary format for immutable `.pack` files (ledger packs + events cold segments); consumed by both docs above. +- [getEvents full-history design](../../design-docs/getevents-full-history-design.md) — events hot/cold segment layout, roaring bitmap indexes, MPHF; consumed by both docs above. From 7ad0f653fabac8cb157a25a49f8a868079090e2a Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Wed, 22 Apr 2026 22:14:52 -0700 Subject: [PATCH 19/34] Design docs: cuts, bullets-over-prose sweep, pseudocode-comment trim MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three intertwined passes across both in-repo design docs (streaming and backfill). README untouched except for the prior packfile-doc pointer. Cuts (streaming doc): - ## Operator Scenarios (Bob's archive + Alice's tip-tracker, ~126 lines). Design verification for these scenarios is captured in source-of-truth §6 CC1 + CC2; the narrative belongs in the future operator-guide doc. Pointer to the recovery path added as §10 in the off-repo source-of-truth file. - ## Related Documents trailer (~4 lines). Redundant with the README; stale "Query routing — TBD" bullet. Bullets-over-prose sweep (Pass 1, both docs): - Streaming Compound Recovery Scenarios: each multi-sentence bullet → nested per-claim bullets (state / phase action / cost). - Streaming Query Contract § Rationale: 2-sentence prose → 3 bullets. 
- Streaming Pruning "Why index-atomic" + "How much extra data": prose blocks → bullets.
- Streaming Phase 2 "Why 'load then delete' matters": 4-sentence prose → 3 bullets.
- Streaming Error Handling: 11-row table (with overlapping rows) → 3 grouped subsections (Runtime / Freeze transitions / Startup), with flag-after-fsync factored out so the three transition rows collapse to a single rule.
- Streaming Immutable Keys intro: 2 sentences → 1-line intro.
- Streaming Freeze Transitions intro + ".bin files never here": fold into the existing bullet list.
- Streaming Ingestion Loop closing note: prose → bullets.
- Streaming Phase 2 intro: 3 sentences → 3 bullets.
- Streaming Ledger Source intro: 4-sentence prose → 4 bullets.
- Streaming Phase 1 "unit of work" bullet: 4-sentence compound bullet → 3 separate bullets.
- Backfill Validation intro: 2 sentences → 3 bullets.
- Backfill DAG Setup trailing paragraph: prose → 2 bullets.
- Backfill Source concurrency: 3-sentence prose → 3 bullets.
- Backfill Crash Recovery intro: 2 sentences → 1-line lead-in.
- Backfill Error Handling: task-level-retries bullet's multi-sentence paragraph → nested sub-bullets.

Pseudocode comment trim (both docs):
- Removed docstrings that restated the function name (run_streaming_daemon, select_phase1_ledger_source, freeze_events_chunk_to_cold_segment, cleanup_txhash, etc.).
- Replaced numbered-step docstrings + redundant numbered inline comments with a single crisp header comment where the intent isn't obvious from the code (freeze_ledger_chunk_to_pack_file, build_tx_index_recsplit_files, prune_tx_index, build_txhash_index, process_chunk).
- Kept only comments that carry non-obvious context: invariants, cross-function ordering contracts, reason-for-choice notes, crash-recovery gotchas, cross-doc pointers.
- Compressed LedgerSource / BSBSource / CaptiveCoreSource class docstrings from prose paragraphs to bullet-style comments with the per-method behaviors.
Source-of-truth updates: - New §10 documents the operator-guide doc as a future-work item with instructions for recovering the scenario content from git history. Line count: streaming 1313 → 929 (-384, -29%). Backfill 672 → 633 (-39, -6%). Total 1997 → 1580 (-417). No ID leaks (CC/D/B-INV), no stale terms, no semantic changes — only prose trimming and comment pruning. --- .../design-docs/01-backfill-workflow.md | 167 ++-- .../design-docs/02-streaming-workflow.md | 796 +++++------------- 2 files changed, 270 insertions(+), 693 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index 73e4ad5c1..a54741e3d 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -338,52 +338,36 @@ process_chunk(chunk_id=last) ─┘ ```python def run_backfill(config, range_start_chunk_id, range_end_chunk_id, source): - """ - Ingest chunks [range_start_chunk_id, range_end_chunk_id] inclusive via the given source. - - Called by Phase 1 of the streaming daemon. Idempotent per chunk — already-completed - chunks in the range return early from their task's execute(). On crash, Phase 1 - re-invokes with the same range; previously-completed work is skipped automatically. - """ - # 1. Validate — abort before any work if inputs are inconsistent with existing state. validate(config, range_start_chunk_id, range_end_chunk_id, source) - # 2. Prime the source for random-access reads across the entire run range. Done once - # per run_backfill invocation; process_chunk tasks then call source.get_ledger(seq) - # concurrently for seqs inside this range. Mirrors the stellar Go SDK's LedgerBackend - # pattern (PrepareRange → GetLedger). + # Prime source ONCE per invocation. process_chunk tasks then concurrently call + # source.get_ledger(seq) for seqs within this range. Mirrors the stellar Go SDK's + # LedgerBackend pattern (PrepareRange → GetLedger). 
source.prepare_range( first_ledger_in_chunk(range_start_chunk_id), last_ledger_in_chunk(range_end_chunk_id), ) - # 3. Build DAG — register all tasks; each task's execute() handles its own no-op check. dag = build_dag(config, range_start_chunk_id, range_end_chunk_id, source) - - # 4. Execute — dispatch all tasks concurrently, bounded by the source's parallelism. dag.execute(max_workers=source.max_parallelism()) ``` ### Validation -Validation runs before DAG construction, not as a DAG task — if it were a task, other tasks with no dependencies would start executing concurrently before validation completes, and a failure would leave in-flight work to cancel. Running it first means a clean abort with no partial work. +- Runs before DAG construction, not as a DAG task. +- If it were a task: no-dependency tasks would start concurrently; a validation failure would leave in-flight work to cancel. +- Running it first → clean abort, no partial work. ```python def validate(config, range_start_chunk_id, range_end_chunk_id, source): - # Range sanity. assert range_start_chunk_id >= 0 assert range_end_chunk_id >= range_start_chunk_id - # Source tip coverage. source.tip() is the highest ledger the source can serve; - # lower-bound availability (BSB bucket retention floor, captive core history start) - # is source-specific and surfaces as a per-task failure during execution — retried at - # the DAG level and ultimately fatal if unrecoverable. - last_ledger = last_ledger_in_chunk(range_end_chunk_id) - assert source.tip() >= last_ledger + # Upper bound only. Lower-bound availability (BSB bucket floor, captive core history + # start) surfaces as a per-task failure during execution. + assert source.tip() >= last_ledger_in_chunk(range_end_chunk_id) - # CHUNKS_PER_TXHASH_INDEX immutability. The daemon's validate_config (see - # 02-streaming-workflow.md) stores this on first run and enforces it on every - # subsequent start; backfill re-asserts the match defensively. 
+ # Defensive re-assert; daemon's validate_config owns the enforcement. assert meta_store.get("config:chunks_per_txhash_index") == str(config.backfill.chunks_per_txhash_index) ``` @@ -391,35 +375,31 @@ def validate(config, range_start_chunk_id, range_end_chunk_id, source): ```python def build_dag(config, range_start_chunk_id, range_end_chunk_id, source): - # Wires up tasks and dependency edges — no completion checks or skip logic. - # Each task's execute() handles its own no-op check. dag = new DAG() - # For each tx index whose LAST chunk falls in [range_start_chunk_id, range_end_chunk_id], - # schedule process_chunk (for in-range chunks only) + build + cleanup. Prior chunks of - # such a tx index are either also in the current range (first-ever invocation) or - # already flagged :lfs by a prior Phase 1 iteration — either way, build has every - # chunk's .bin file available when it runs. + # Tx indexes whose LAST chunk is in range: schedule process_chunk for in-range chunks + # only (prior chunks are already `:lfs`-flagged from a prior iteration) + build + + # cleanup. build has every chunk's .bin when it runs. for tx_index_id in tx_indexes_ending_in_range(range_start_chunk_id, range_end_chunk_id, config): chunk_tasks = [] for chunk_id in chunks_for_tx_index(tx_index_id, config): if not (range_start_chunk_id <= chunk_id <= range_end_chunk_id): - continue # prior iteration processed this chunk + continue t = dag.add(ProcessChunkTask(chunk_id, source=source), deps=[]) chunk_tasks.append(t.id) b = dag.add(BuildTxHashIndexTask(tx_index_id), deps=chunk_tasks) dag.add(CleanupTxHashTask(tx_index_id), deps=[b.id]) - # Trailing partial tx index: its last chunk is past range_end_chunk_id, so no build / - # cleanup this run. Schedule process_chunk for its in-range chunks only; a future - # Phase 1 iteration covering the missing trailing chunks will trigger the build. 
+ # Trailing partial tx index (last chunk past range_end): process_chunk only; a future + # iteration that covers the missing trailing chunks will schedule the build. for chunk_id in trailing_partial_tx_index_chunks(range_start_chunk_id, range_end_chunk_id, config): dag.add(ProcessChunkTask(chunk_id, source=source), deps=[]) return dag ``` -A trailing tx index whose last chunks fall past `range_end_chunk_id` has its `process_chunk` tasks scheduled but no `build_txhash_index` / `cleanup_txhash`. Its `.bin` files + `chunk:{chunk_id:08d}:txhash` flags persist until a future `run_backfill` covers the missing chunks OR Phase 2 hydrates them — see [Partial Tx Index Ranges](#partial-tx-index-ranges). +- Trailing tx index whose last chunk is past `range_end_chunk_id`: `process_chunk` scheduled for in-range chunks only; no `build_txhash_index` / `cleanup_txhash`. +- `.bin` + `chunk:{chunk_id:08d}:txhash` flags persist until a future `run_backfill` covers the missing chunks OR Phase 2 hydrates them — see [Partial Tx Index Ranges](#partial-tx-index-ranges). --- @@ -445,37 +425,30 @@ def process_chunk(chunk_id, source): first_ledger = first_ledger_in_chunk(chunk_id) last_ledger = last_ledger_in_chunk(chunk_id) - # 1. Check which outputs are missing. need_lfs = not meta_store.has(f"chunk:{chunk_id:08d}:lfs") need_txhash = not meta_store.has(f"chunk:{chunk_id:08d}:txhash") need_events = not meta_store.has(f"chunk:{chunk_id:08d}:events") if not (need_lfs or need_txhash or need_events): - return # all outputs already present + return - # 2. Choose data source for the in-loop ledger read. Both options expose get_ledger(seq) - # for random-access reads; source was already prepared by run_backfill's prepare_range call. + # If :lfs is present, read from the local packfile (NVMe, no source call). Otherwise + # use the already-prepared source. 
if not need_lfs: - ledger_reader = local_packfile(ledger_pack_path(chunk_id)) # NVMe; no source call + ledger_reader = local_packfile(ledger_pack_path(chunk_id)) else: - ledger_reader = source # BSB or captive core; pre-prepared - - # 3. Open writers only for missing outputs. Path helpers derive bucket_id = chunk_id // 1000 - # internally (see Directory Structure). - ledger_writer = packfile.create(ledger_pack_path(chunk_id), - overwrite=True) if need_lfs else None - txhash_writer = open(raw_txhash_path(chunk_id), - overwrite=True) if need_txhash else None - events_writer = events_segment.create(events_segment_path(chunk_id), - overwrite=True) if need_events else None - - # 4. Process each ledger. + ledger_reader = source + + ledger_writer = packfile.create(ledger_pack_path(chunk_id), overwrite=True) if need_lfs else None + txhash_writer = open(raw_txhash_path(chunk_id), overwrite=True) if need_txhash else None + events_writer = events_segment.create(events_segment_path(chunk_id), overwrite=True) if need_events else None + for ledger_seq in range(first_ledger, last_ledger + 1): lcm = ledger_reader.get_ledger(ledger_seq) if need_lfs: ledger_writer.append(compress(lcm)) - if need_txhash: txhash_writer.append(extract_txhashes(lcm)) # 36 bytes per tx + if need_txhash: txhash_writer.append(extract_txhashes(lcm)) # 36 bytes per tx if need_events: events_writer.append(extract_events(lcm)) - # 5. Fsync + flag each output independently. + # Fsync + flag each output independently (flag-after-fsync). if need_lfs: ledger_writer.fsync_and_close() meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") @@ -483,13 +456,11 @@ def process_chunk(chunk_id, source): txhash_writer.fsync_and_close() meta_store.put(f"chunk:{chunk_id:08d}:txhash", "1") if need_events: - events_writer.finalize() # flush, build MPHF + bitmap index, fsync + events_writer.finalize() meta_store.put(f"chunk:{chunk_id:08d}:events", "1") - # Close the local packfile handle when we used one. 
The source case is shared across - # tasks and stays open until run_backfill returns — no per-task close. if not need_lfs: - ledger_reader.close() + ledger_reader.close() # close local packfile handle; source stays open across tasks ``` Key properties: @@ -500,7 +471,10 @@ Key properties: - `packfile.create()` with `overwrite=True` handles truncation of partial files from prior crashes — no explicit `delete_if_exists` check needed. - Naturally extends to new data types (add a fourth flag). -**Source concurrency.** With `source=BSBSource(...)`, many `process_chunk` tasks run in parallel (bounded by `source.max_parallelism() = GOMAXPROCS`). With `source=CaptiveCoreSource(...)`, `source.max_parallelism() = 1` — a single captive core subprocess cannot serve multiple chunk ranges in parallel, so the DAG dispatches chunks sequentially. See [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source) for the interface. +**Source concurrency.** +- `source=BSBSource(...)`: `source.max_parallelism() = GOMAXPROCS`; many `process_chunk` tasks run in parallel. +- `source=CaptiveCoreSource(...)`: `source.max_parallelism() = 1`; single subprocess serializes. DAG dispatches chunks sequentially. +- Interface: see [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). ### build_txhash_index(tx_index_id) @@ -513,39 +487,29 @@ Key properties: ```python def build_txhash_index(tx_index_id): if meta_store.has(f"index:{tx_index_id:08d}:txhash"): - return # already built — no-op + return - # Cleanup: remove any partial .idx files from a prior crashed attempt. All-or-nothing - # recovery — the tx-index flag is absent iff the build hasn't completed, so any .idx - # files on disk are from a failed or interrupted run and must be discarded before the - # rebuild starts. + # All-or-nothing recovery: absent flag ⇒ any .idx on disk is from a crashed attempt. 
delete_partial_idx_files(recsplit_index_path(tx_index_id)) - # Invariant: every chunk of tx_index_id has its .bin file on disk when this runs. - # Prior-iteration chunks keep their .bin until cleanup_txhash runs, and cleanup only - # runs AFTER this build succeeds (DAG dep). The list below therefore includes every - # chunk's .bin, regardless of which Phase 1 iteration wrote it. - bin_files = list_bin_files(tx_index_id) # all .bin files for chunks in this tx index + # Invariant: every chunk's .bin is on disk when this runs. Prior-iteration chunks + # keep their .bin until cleanup_txhash runs; cleanup is DAG-gated on this build. + bin_files = list_bin_files(tx_index_id) - # Stage 1: COUNT — scan all .bin files, count entries per CF. + # Stage 1 (COUNT) — two passes total over .bin; count entries per CF. cf_counts = parallel_count(bin_files, workers=100) - # cf_counts[nibble] = number of (txhash, ledger_seq) entries routed to that CF - # Stage 2: ADD — re-read .bin files, route entries to CF builders. + # Stage 2 (ADD) — route entries into 16 per-CF builders (nibble = txhash[0] >> 4). cf_builders = [RecSplitBuilder(cf_counts[nibble]) for nibble in range(16)] parallel_add(bin_files, cf_builders, workers=100) - # each entry routed to cf_builders[txhash[0] >> 4] (mutex per CF) - # Stage 3: BUILD — build MPH index per CF, one .idx file each. + # Stage 3 (BUILD) — 16 parallel CF builds; each produces one .idx; all fsynced. parallel_build(cf_builders, workers=16) - # each CF produces one .idx file; all fsynced - # Stage 4: VERIFY — look up every key in the built indexes (full verification, - # since backfill has no wall-clock pressure). + # Stage 4 (VERIFY) — full verification; no wall-clock pressure at backfill time. parallel_verify(bin_files, cf_builders, workers=100) - # Flag. Backfill writes "1" only; streaming's prune path may later transition - # this key through "deleting" before clearing it — see 02-streaming-workflow.md. 
+ # Backfill writes "1" only; streaming's prune path may transition through "deleting". meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") ``` @@ -566,11 +530,9 @@ Key properties: def cleanup_txhash(tx_index_id): for chunk_id in chunks_for_tx_index(tx_index_id, config): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - continue # already cleaned up — skip - delete_if_exists(raw_txhash_path(chunk_id)) # remove .bin (idempotent — crash - # between .bin delete and flag delete - # is safe to retry on restart) - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # remove meta key + continue + delete_if_exists(raw_txhash_path(chunk_id)) # idempotent; crash-between is safe + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") ``` Key properties: @@ -595,29 +557,25 @@ def run_dag(dag, max_workers): runnable_tasks = ThreadSafeQueue(dag.tasks_with_no_pending_dependencies()) def execute_task(task): - """Runs in a background thread — one per dispatched task.""" for attempt in range(1, MAX_RETRIES + 1): error = task.execute() if error is None: break if attempt == MAX_RETRIES: - mark_failed(task, error) # halt all dependents + mark_failed(task, error) # halt dependents break log.warn("retry", task, attempt, error) + worker_slots.release() - worker_slots.release() # free worker slot - - # Check if completing this task unblocks any downstream tasks. for downstream_task in dag.dependents_of(task): downstream_task.mark_dependency_done(task) if downstream_task.all_dependencies_done(): - runnable_tasks.push(downstream_task) # now eligible to run + runnable_tasks.push(downstream_task) - # Main loop — dispatches tasks as they become runnable. 
while runnable_tasks: current_task = runnable_tasks.pop() - worker_slots.acquire() # block until a worker slot is free - run_in_background(execute_task, current_task) # launch — returns immediately + worker_slots.acquire() + run_in_background(execute_task, current_task) ``` ### Worker Pool @@ -634,12 +592,12 @@ def run_dag(dag, max_workers): ## Crash Recovery -There is no separate crash recovery, reconciliation, or startup triage phase. Recovery happens organically because every task's `execute()` checks its own completion state: +No separate reconciliation phase — every task's `execute()` checks its own completion state: -- On every invocation, `build_dag()` registers ALL tasks for the chunk range — no meta store scanning in DAG setup. -- `process_chunk` checks each output flag independently — missing outputs are produced, existing outputs are skipped. -- `build_txhash_index` checks `index:{tx_index_id:08d}:txhash` — if present, returns immediately; if absent, deletes partial `.idx` files and reruns the full build. -- `cleanup_txhash` checks `chunk:{chunk_id:08d}:txhash` per-chunk — already-cleaned chunks are skipped, remaining chunks are cleaned up. +- `build_dag()` registers ALL tasks for the chunk range on every invocation; no meta-store scanning in setup. +- `process_chunk` checks each output flag independently — missing produced, existing skipped. +- `build_txhash_index` checks `index:{tx_index_id:08d}:txhash` — present → early return; absent → delete partial `.idx` files, rerun full build. +- `cleanup_txhash` checks `chunk:{chunk_id:08d}:txhash` per-chunk — cleaned skipped, remaining cleaned. Three invariants make this work: @@ -657,8 +615,11 @@ The daemon acquires a directory flock on the meta-store at startup. A second pro Two layers of retry: -- **Source-internal retries** — the `LedgerSource` handles transient errors internally (BSB connection resets, throttling, captive core subprocess hiccups). 
These retries happen inside a single task execution and are invisible to the DAG scheduler. -- **Task-level retries** — the DAG scheduler wraps each task's `execute()` with a retry loop bounded by `MAX_RETRIES`. After the source has exhausted its own retries, the scheduler retries the entire task. After `MAX_RETRIES` exhausted → task marked failed → DAG halts all dependents → `run_backfill` returns a fatal error → Phase 1 propagates → daemon exits non-zero. Operator investigates, fixes the root cause, and restarts the daemon; restart re-enters Phase 1, which re-invokes `run_backfill` with a fresh chunk range, and already-complete work is skipped via per-chunk idempotency. +- **Source-internal retries.** `LedgerSource` handles transient errors (BSB connection resets, throttling, captive-core subprocess hiccups) inside a single task execution. Invisible to the DAG. +- **Task-level retries.** DAG wraps each task's `execute()` in a retry loop bounded by `MAX_RETRIES`. + - Source retries exhausted → task retries whole. + - `MAX_RETRIES` exhausted → task marked failed → DAG halts dependents → `run_backfill` returns fatal → Phase 1 propagates → daemon exits non-zero. + - Operator fixes root cause + restarts → Phase 1 re-enters → `run_backfill` re-invoked with a fresh range → completed work skipped via per-chunk idempotency. | Error | Handled by | Action | |-------|-----------|--------| diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 0f7f7da77..53a709778 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -67,14 +67,16 @@ All of `[SERVICE]`, `[BACKFILL]`, `[IMMUTABLE_STORAGE.*]`, `[META_STORE]`, `[LOG ### Immutable Keys (stored in meta store, fatal if changed) -Two keys are stored on first start and enforced on every subsequent start. Changing either requires wiping the datadir. 
+Stored on first start; fatal on any subsequent start where the config value differs. Changing either requires wiping the datadir. | Key | Stored under | Set by | Rule | |---|---|---|---| | `CHUNKS_PER_TXHASH_INDEX` | `config:chunks_per_txhash_index` | first run | Fatal if changed. | | `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | -Source selection (BSB vs captive core) is determined per-startup by `[BACKFILL.BSB]` presence. Operators may add or remove BSB between runs without wiping — the daemon extends coverage forward from `phase1_coverage_end_ledger` regardless of source. Retention immutability is what constrains the data envelope; source choice doesn't need its own immutability gate. +- Source selection (BSB vs captive core) is determined per-startup by `[BACKFILL.BSB]` presence; not stored as immutable. +- Operators may add or remove BSB between runs; daemon extends coverage forward from `phase1_coverage_end_ledger` regardless. +- Retention immutability alone constrains the data envelope — source choice doesn't need its own gate. ### Streaming-Specific TOML @@ -130,32 +132,19 @@ See [Ledger Source](#ledger-source) for the full source-selection rule. ```python def validate_config(config, meta_store): - """ - Runs once at startup before Phase 1. Enforces: - - Immutable keys (CHUNKS_PER_TXHASH_INDEX, RETENTION_LEDGERS) match meta-store state. - - RETENTION_LEDGERS is 0 or a positive multiple of LEDGERS_PER_INDEX. - - Required config keys are present. - - Any failure is fatal — the daemon exits with a clear error. Operator fixes config - (or wipes the datadir for an immutable-key change) and re-invokes. - """ cpi = config.backfill.chunks_per_txhash_index retention_ledgers = config.streaming.retention_ledgers ledgers_per_index = cpi * LEDGERS_PER_CHUNK - # 1. Retention shape. 
if retention_ledgers != 0 and (retention_ledgers <= 0 or (retention_ledgers % ledgers_per_index) != 0): fatal(f"RETENTION_LEDGERS={retention_ledgers} must be 0 or a positive multiple of " - f"LEDGERS_PER_INDEX={ledgers_per_index}. Valid values at this cpi: " - f"0, {ledgers_per_index}, {2*ledgers_per_index}, ...") + f"LEDGERS_PER_INDEX={ledgers_per_index}.") - # 2. Required keys. if not config.streaming.captive_core_config: fatal("STREAMING.CAPTIVE_CORE_CONFIG is required.") if not config.history_archives.urls: - fatal("HISTORY_ARCHIVES.URLS is required (list of at least one archive URL).") + fatal("HISTORY_ARCHIVES.URLS is required.") - # 3. Immutable keys. Store on first run; fatal on mismatch thereafter. _enforce_immutable(meta_store, "config:chunks_per_txhash_index", str(cpi)) _enforce_immutable(meta_store, "config:retention_ledgers", str(retention_ledgers)) @@ -165,8 +154,7 @@ def _enforce_immutable(meta_store, key, current_value): if stored is None: meta_store.put(key, current_value) elif stored != current_value: - fatal(f"{key} changed: stored={stored}, config={current_value}. " - f"Wipe datadir to change.") + fatal(f"{key} changed: stored={stored}, config={current_value}. Wipe datadir.") ``` ### Operator Profiles @@ -181,133 +169,6 @@ Three profiles emerge from config combinations. No profile flag. --- -## Operator Scenarios - -Worked examples showing what operators configure, what happens at runtime, and how crashes recover. Reference scenarios for PRD / test planning. - -### Scenario A — Fresh full-history archive, seamless cutover to live - -**Setup**: operator Bob wants a public archive node. Full history, retained forever, catchup from BSB, then live streaming. 
- -**Config** (`/etc/stellar-rpc/config.toml`): - -```toml -[SERVICE] -DEFAULT_DATA_DIR = "/data/stellar-rpc" - -[BACKFILL] -CHUNKS_PER_TXHASH_INDEX = 1000 # default; 10M ledgers per index - -[BACKFILL.BSB] -BUCKET_PATH = "sdf-ledger-close-meta/v1/ledgers/pubnet" - -[STREAMING] -RETENTION_LEDGERS = 0 # full history -CAPTIVE_CORE_CONFIG = "/etc/stellar-rpc/captive-core.cfg" - -[HISTORY_ARCHIVES] -URLS = ["https://history.stellar.org/prd/core-live/core_live_001/"] - -[LOGGING] -LEVEL = "info" -FORMAT = "text" -``` - -**Invocation**: - -``` -stellar-rpc --config /etc/stellar-rpc/config.toml -``` - -**Happy path**: - -- Daemon starts. `validate_config` stores `config:chunks_per_txhash_index = "1000"` and `config:retention_ledgers = "0"` on first start. -- Phase 1 picks `BSBSource` (BSB is configured). `run_backfill(0, last_complete_chunk_at_tip, source=BSBSource)`. -- Static DAG over ~5_600 chunks (at tip ~56M). Parallel BSB workers pull ledgers at GOMAXPROCS chunks at a time. Runs ~12h. -- Phase 1 exits when `T - L < 10_000`. All indexes 0..N complete with RecSplit built by the DAG; at most one partial trailing index remains. -- Phase 2 loads any `.bin` files left by the trailing partial index into the active txhash store, deletes `.bin` + `:txhash` flags. -- Phase 3 is a no-op (no orphan stores on a clean first run). -- Phase 4 opens active stores at `resume_ledger = last_phase1_ledger + 1`, starts captive core via `PrepareRange(UnboundedRange(resume_ledger))`, enters live ingestion. -- Queries begin serving at the moment Phase 4 flips `daemon_ready = true`. - -**No operator action between Phase 1 and Phase 4.** The cutover is automatic. - -**Crash recovery**: - -- Crash during Phase 1's BSB download at chunk 3_457: on restart, `phase1_coverage_end_ledger` walks `:lfs` flags, returns the end of the contiguous prefix (say, chunk 3_200). `phase1_catchup` re-enters, `compute_backfill_chunk_range` produces a new range, backfill re-runs from chunk 3_201 forward. 
Chunks that already had `:lfs` are skipped via per-chunk idempotency. -- Crash after all chunks written but before index 3's RecSplit built: on restart, Phase 1 sees `index:3:txhash` absent → backfill's DAG re-runs the RecSplit build from the `.bin` files. Succeeds. -- Crash while `.bin` files from the trailing index are being loaded into the active txhash store (Phase 2): on restart, Phase 2 re-runs. Chunks that were already loaded had their `:txhash` flag deleted and `.bin` file removed — the loop skips them via the flag check. Chunks not yet loaded retain their `:txhash` flag and `.bin` file — the loop picks them up. -- Crash between per-ledger checkpoint commit and chunk freeze during live ingestion: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)` but `chunk:{chunk_id}:lfs` absent. Phase 3 triggers the missing transitions when the daemon restarts, before Phase 4 re-enters. - -In every case: the daemon reaches a consistent state after one restart. No manual intervention. Dangling `.bin` files from incomplete indexes are cleaned by Phase 2 once the owning index progresses further. - -### Scenario B — Alice's tip-tracker (no BSB, small retention) - -**Setup**: Alice is building a wallet app. She wants live events only, starting from the current network tip. She doesn't want to stand up a GCS bucket for BSB. She picks `cpi=1` and `RETENTION_LEDGERS = 10_000` (one index worth, ~16 hours at 6s/ledger). - -**Config**: - -```toml -[SERVICE] -DEFAULT_DATA_DIR = "/data/stellar-rpc" - -[BACKFILL] -CHUNKS_PER_TXHASH_INDEX = 1 # minimum; one chunk per index - -[STREAMING] -RETENTION_LEDGERS = 10_000 # one index = ~16 hours -CAPTIVE_CORE_CONFIG = "/etc/stellar-rpc/captive-core.cfg" - -[HISTORY_ARCHIVES] -URLS = ["https://history.stellar.org/prd/core-live/core_live_001/"] - -[LOGGING] -LEVEL = "info" -FORMAT = "text" - -# No [BACKFILL.BSB] section — Phase 1 uses captive core. -``` - -**Invocation**: same as Scenario A. 
- -**Happy path on first-ever start** (say network tip is `56_342_637`): - -- Daemon starts. Validates config; stores immutable keys. -- `[BACKFILL.BSB]` absent → Phase 1 source is `CaptiveCoreSource`. -- Source samples tip via HTTP GET against `HISTORY_ARCHIVE_URLS`: tip = `56_342_637`. -- `compute_backfill_chunk_range(last_committed_ledger=1, network_tip_ledger=56_342_637, retention_ledgers=10_000, cpi=1)` — leapfrog lands at `first_ledger_in_tx_index(tx_index_id_of_chunk(chunk_id_of_ledger(network_tip_ledger - retention_ledgers)))`: - - `network_tip_ledger - retention_ledgers = 56_332_637`. - - `chunk_id_of_ledger(56_332_637) = 5_633`. `tx_index_id_of_chunk(5_633) = 5_633` (cpi=1). - - `first_ledger_in_tx_index(5_633) = 56_330_002`. -- Backfill range is chunks `5_633..5_633` (one chunk to close the gap to tip at chunk 5_633, which is `last_complete_chunk_at(56_342_637)`). Up to ~10_000 ledgers of archive-catchup via captive core. Takes ~3–8 minutes. -- Phase 2 loads the one `.bin` file into the active txhash store, deletes it. -- Phase 4 opens active stores, starts captive core for live streaming from `resume_ledger = 56_340_002`, enters ingestion loop. - -**Why leapfrog lands ~10_000 ledgers back instead of exactly at tip**: Alice's first chunk must be a complete chunk (starts at `..._02`, ends at `..._01`). If the daemon started ingesting at `tip = 56_342_637` (mid-chunk), chunk 5_634 would be missing ledgers `56_340_002..56_342_636` — the no-gaps invariant would break and RecSplit for chunk 5_634's index could never be built. Leapfrog alignment is what keeps no-gaps intact. - -**What if Alice picks `cpi=10` instead of `cpi=1`?** - -- `LEDGERS_PER_INDEX = 100_000`. Minimum `RETENTION_LEDGERS = 100_000` (~7 days). Alice's `RETENTION_LEDGERS = 10_000` is invalid — `validate_config` fatals at startup with a clear error message. -- If Alice fixes retention to `100_000`, Phase 1 captive-core archive-catchup spans up to 100_000 ledgers (~30–60 min on first start). 
Once live, steady state is the same as cpi=1. - -**What if Alice wants "just start from tip, don't catch up anything"?** - -- Not possible under this design. The no-gaps invariant requires the first chunk to be complete. If tip falls mid-chunk, the daemon must ingest earlier ledgers to round down to an index boundary. Minimum leapfrog catchup at cpi=1 is ≤10_000 ledgers (~minutes via captive core). That's the floor. - -**Subsequent restart** (say after 1h downtime): - -- Daemon starts. `streaming:last_committed_ledger` is present from the prior run. Phase 1 samples tip; `network_tip_ledger - last_committed_ledger` is ~600 ledgers (10 min at 6s) — less than one chunk → Phase 1 exits immediately. -- Phase 2 finds no `.bin` files (deleted on first start). No-op. -- Phase 3 reconciles any orphan active stores from the crash — typical case is completing an interrupted chunk freeze. -- Phase 4 re-opens active stores, starts captive core, re-enters the ingestion loop. Captive core's own archive-catchup closes the 600-ledger gap in ~seconds, then cadence settles to live closes. - -**Crash recovery within Phase 1 (first-ever start)**: - -- Captive core subprocess crashes mid-archive-catchup: daemon retries spinning captive core up. No persisted state to roll back — the partial chunk's data was in the active store's WAL; captive core re-archive-catches-up from whatever ledger the WAL wasn't past. -- Daemon process itself crashes: on restart, `phase1_coverage_end_ledger` returns whatever contiguous prefix exists. Phase 1 re-enters. Eventually completes. - -**Query behavior during Phase 1**: `HTTP 4xx` for all three query endpoints. `getHealth` reports `catching_up` + the drift. - ## Meta Store Keys Single RocksDB instance, WAL always enabled. Authoritative source for every startup decision. @@ -400,74 +261,46 @@ The daemon maintains three active stores for the current ingestion position. All ## Ledger Source -Phase 1 reads ledgers from a source. 
Two implementations share one interface. Source is selected per-startup based on `[BACKFILL.BSB]` presence — no stored immutability gate. Operators may add or remove BSB between runs; retention immutability alone constrains the data envelope. - -The interface mirrors the stellar Go SDK's `LedgerBackend` pattern (`PrepareRange` + `GetLedger`) — both implementations below (BSB and captive core) already expose that pattern in the SDK, so random-access reads are native and no sequential-iterator shim is needed. +- Phase 1 reads ledgers from a `LedgerSource`; two implementations share one interface. +- Source is selected per-startup by `[BACKFILL.BSB]` presence; not stored as immutable. +- Operators may add or remove BSB between runs. Retention immutability alone constrains the data envelope. +- Interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`); both implementations expose that pattern natively, so random-access reads are native and no iterator shim is needed. ```python class LedgerSource: - """ - Provides random-access LedgerCloseMeta reads for a prepared range. Used by the backfill - subroutine inside Phase 1. Live streaming (Phase 4) does NOT go through this abstraction — - it reads directly from CaptiveStellarCore via `ledgerBackend.PrepareRange(UnboundedRange(...))` - + `ledgerBackend.GetLedger(seq)`. - - Usage pattern: run_backfill calls prepare_range ONCE for the full backfill run, then - process_chunk tasks concurrently call get_ledger(seq) for any seq inside the prepared - range. Implementations must be safe under concurrent get_ledger calls (the DAG dispatches - up to max_parallelism() process_chunk workers). - """ - - def tip(self) -> int: - """Current network tip ledger. Used to compute Phase 1 target range. Callable without - a prior prepare_range.""" + # Used by backfill inside Phase 1. 
Live streaming (Phase 4) bypasses this abstraction + # and calls ledgerBackend.PrepareRange(UnboundedRange(...)) + GetLedger(seq) directly. + # Contract: run_backfill calls prepare_range ONCE, then process_chunk tasks call + # get_ledger(seq) concurrently (thread-safe up to max_parallelism()). - def prepare_range(self, start_ledger, end_ledger) -> None: - """Prime the source for random-access reads in [start_ledger, end_ledger] inclusive. - Called once per run_backfill invocation (phase1_catchup may invoke run_backfill - multiple times, each with its own range). Must tolerate re-invocation.""" - - def get_ledger(self, ledger_seq) -> LedgerCloseMeta: - """Return the LCM for ledger_seq. Requires prepare_range to have covered ledger_seq. - Thread-safe under concurrent calls from process_chunk workers.""" - - def max_parallelism(self) -> int: - """Upper bound on concurrent get_ledger call chains the source can sustain. Backfill - DAG honors this when dispatching process_chunk workers.""" + def tip(self) -> int: ... + def prepare_range(self, start_ledger, end_ledger) -> None: ... + def get_ledger(self, ledger_seq) -> LedgerCloseMeta: ... + def max_parallelism(self) -> int: ... class BSBSource(LedgerSource): - """ - Reads from the BSB (Buffered Storage Backend) bucket configured in [BACKFILL.BSB]. - - - Tip: queried from BSB's own range-end metadata. Same mechanism backfill uses today. - - prepare_range: sets the BSB-backed LedgerBackend's range; BSB internal prefetch workers - (BUFFER_SIZE, NUM_WORKERS) fill buffers ahead of get_ledger reads. - - get_ledger: random-access via the SDK's GetLedger(seq); reads from the prefetch buffer. - - max_parallelism: GOMAXPROCS (backfill's current default). - """ + # tip: BSB's range-end metadata. + # prepare_range: sets the BSB-backed LedgerBackend's range; BSB prefetch workers + # (BUFFER_SIZE, NUM_WORKERS) fill buffers ahead of get_ledger. + # get_ledger: SDK GetLedger(seq) reads from the prefetch buffer. 
+ # max_parallelism: GOMAXPROCS. + ... class CaptiveCoreSource(LedgerSource): - """ - Drives a CaptiveStellarCore subprocess to replay ledgers from the history archive + peers. - - - Tip: fetched via HTTP GET on /.well-known/stellar-history.json against HISTORY_ARCHIVE_URLS. - Matches the existing ingest service pattern (Service.getNextLedgerSequence → archive.GetRootHAS()). - - prepare_range: spins up (or re-primes) captive core with BoundedRange(start, end). - - get_ledger: random-access via the SDK's GetLedger(seq); blocks until that ledger is - available in the captive-core subprocess's emitted stream. - - max_parallelism: 1. Captive core is a single heavy subprocess; parallelism would require - multiple subprocesses, each consuming several GB RAM. Backfill DAG dispatches chunks - sequentially when source is captive core. - """ + # tip: HTTP GET on /.well-known/stellar-history.json against HISTORY_ARCHIVE_URLS + # (same pattern as existing ingest service). + # prepare_range: spins up (or re-primes) captive core with BoundedRange(start, end). + # get_ledger: SDK GetLedger(seq); blocks until the subprocess emits that ledger. + # max_parallelism: 1 (single subprocess; each extra one would need several GB of RAM). + ... ``` ### Source Selection Rule ```python def select_phase1_ledger_source(config): - """Called once at the top of Phase 1. Re-evaluated per startup.""" if config.backfill.bsb is not None: return BSBSource(config.backfill.bsb) return CaptiveCoreSource(config.streaming.captive_core_config, @@ -499,19 +332,11 @@ Four sequential phases, same code path for first start and every restart.
The fi ```python def run_streaming_daemon(config): meta_store = open_meta_store(config) - validate_config(config, meta_store) # immutable key enforcement - - # ── Phase 1: catch up from last_committed_ledger (or genesis) to tip ── + validate_config(config, meta_store) source = select_phase1_ledger_source(config) phase1_catchup(config, meta_store, source) - - # ── Phase 2: load any .bin files left by Phase 1 into RocksDB; delete them ── phase2_hydrate_txhash(config, meta_store) - - # ── Phase 3: reconcile orphaned transitions from prior crash ── phase3_reconcile_orphans(config, meta_store) - - # ── Phase 4: open active stores, spawn lifecycle goroutine, start captive core, ingest ── phase4_live_ingest(config, meta_store) ``` @@ -521,114 +346,60 @@ Query serving is gated on Phase 4 being reached — see [Query Contract](#query- Runs the backfill subroutine (`run_backfill` from `01-backfill-workflow.md`) once per source-tip sample, until the gap closes to less than one chunk. -- Phase 1's unit of work is an entire chunk — never a partial chunk. Backfill's DAG dispatches integer chunk IDs; `process_chunk(chunk_id)` ingests ledgers `first_ledger_in_chunk(chunk_id)..last_ledger_in_chunk(chunk_id)` inclusive. Every chunk ever persisted by Phase 1 starts at `..._02` and ends at `..._01`. This is the chunk-alignment invariant the no-gaps guarantee rests on. -- Works the same whether the source is BSB (parallel) or captive core (sequential) — per-chunk work is atomic in both cases. +- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. +- Every chunk Phase 1 persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. +- Works the same with BSB (parallel) or captive core (sequential); per-chunk work is atomic in both. 
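The chunk-alignment invariant above can be checked with a few lines of boundary arithmetic. This is a minimal sketch, not the project's code: `GENESIS_LEDGER = 2` and `LEDGERS_PER_CHUNK = 10_000` are assumptions inferred from the `..._02` / `..._01` bounds and the 10_000-ledger chunk size used in this doc; the authoritative formulas live in `01-backfill-workflow.md`.

```python
# Assumed constants (inferred from this doc, not copied from the codebase).
GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000


def first_ledger_in_chunk(chunk_id: int) -> int:
    """First ledger of a chunk; always ends in ..._02 under the assumed constants."""
    return chunk_id * LEDGERS_PER_CHUNK + GENESIS_LEDGER


def last_ledger_in_chunk(chunk_id: int) -> int:
    """Last ledger of a chunk; always ends in ..._01 under the assumed constants."""
    return (chunk_id + 1) * LEDGERS_PER_CHUNK + (GENESIS_LEDGER - 1)


def chunk_id_of_ledger(ledger_seq: int) -> int:
    """Chunk that contains ledger_seq."""
    return (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK
```

Because `first_ledger_in_chunk(n + 1)` equals `last_ledger_in_chunk(n) + 1` for every `n`, a run of whole chunks can never leave a gap between neighbours, which is what the no-gaps guarantee relies on.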
```python def phase1_catchup(config, meta_store, source): - """ - Close the gap between what's already on disk and the current network tip. - - Control flow (outer loop): - 1. derive last_committed_ledger from :lfs flags on disk (NOT from - streaming:last_committed_ledger — that key isn't written during Phases 1–3). - 2. sample the current network_tip_ledger from the source. - 3. if (network_tip_ledger - last_committed_ledger) is less than one chunk, exit - (captive core will close the residual few-thousand-ledger gap in Phase 4 via - its own archive-catchup). - 4. compute the chunk range to backfill this iteration. Leapfrog-alignment inside - compute_backfill_chunk_range guarantees range_start is the first chunk of an - index when retention is configured. - 5. invoke backfill's static-DAG subroutine. Backfill's own per-chunk idempotency - + crash recovery handle mid-iteration crashes. - 6. re-derive last_committed_ledger from :lfs flags. Loop. - - The while loop is needed because the network tip advances while we catch up — - each run_backfill call covers the range known at the start of that iteration, - and subsequent iterations close whatever new ledgers accumulated. - """ cpi = config.backfill.chunks_per_txhash_index retention_ledgers = config.streaming.retention_ledgers last_committed_ledger = phase1_coverage_end_ledger(meta_store) + # Loop because tip advances during catchup; each iteration closes whatever's + # accumulated since the previous sample. 
while True: network_tip_ledger = source.tip() if (network_tip_ledger - last_committed_ledger) < LEDGERS_PER_CHUNK: - break # less than one chunk remaining + break # remaining gap < 1 chunk; captive core closes it in Phase 4 range_start_chunk_id, range_end_chunk_id = compute_backfill_chunk_range( last_committed_ledger, network_tip_ledger, retention_ledgers, cpi) if range_end_chunk_id < range_start_chunk_id: - # Leapfrog landed past the last complete chunk at tip — happens when the - # network hasn't produced a full chunk past the retention line yet. Exit. - break + break # leapfrog landed past last complete chunk — nothing to ingest yet - # Backfill's DAG ingests [range_start_chunk_id..range_end_chunk_id] inclusive. - # Per-chunk idempotent: chunks with :lfs already set are skipped. Crash here - # resumes on restart. run_backfill(config, range_start_chunk_id, range_end_chunk_id, source=source) - # Re-derive last_committed_ledger — not just range_end_chunk_id — because a - # mid-iteration crash could leave holes in [range_start..range_end] that the - # contiguous-prefix scan catches. + # Re-derive from :lfs flags (not from range_end_chunk_id): a mid-iteration + # crash can leave holes that the contiguous-prefix scan detects. last_committed_ledger = phase1_coverage_end_ledger(meta_store) def compute_backfill_chunk_range(last_committed_ledger, network_tip_ledger, retention_ledgers, cpi): - """ - Returns (range_start_chunk_id, range_end_chunk_id). Leapfrog aligns DOWN to the - first chunk of the tx index containing (network_tip_ledger - retention_ledgers). - No-op when retention_ledgers = 0 (full history archive). - - - retention_ledgers is a multiple of LEDGERS_PER_INDEX (validated at startup), but - network_tip_ledger itself is arbitrary — so (network_tip_ledger - retention_ledgers) - is NOT on a tx-index boundary in general. Leapfrog must explicitly round that value - down to the first ledger of its containing tx index. 
That rounded value is the new - head of coverage; every earlier ledger is past retention and skipped. - - Worst case: up to LEDGERS_PER_INDEX - 1 ledgers past the strict retention line are - ingested and held on disk. At cpi=1_000 this is ~10M ledgers; at cpi=1 it is ~10k. - """ + # Leapfrog aligns DOWN to the first chunk of the tx index containing + # (network_tip_ledger - retention_ledgers). retention_ledgers is a multiple of + # LEDGERS_PER_INDEX but network_tip_ledger is arbitrary, so that subtraction isn't + # tx-index-aligned in general and must be rounded. Worst case: up to + # LEDGERS_PER_INDEX - 1 ledgers past the strict retention line stay on disk. gap_start_ledger = last_committed_ledger + 1 if retention_ledgers > 0: target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK target_tx_index_id = target_chunk_id // cpi - # First ledger of target_tx_index_id = (target_tx_index_id * LEDGERS_PER_INDEX) + GENESIS_LEDGER. leapfrog_start_ledger = (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER else: leapfrog_start_ledger = GENESIS_LEDGER range_start_ledger = max(gap_start_ledger, leapfrog_start_ledger) range_start_chunk_id = (range_start_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - # range_end_chunk_id: largest chunkId such that last_ledger_in_chunk(chunkId) - # is <= network_tip_ledger. 
- # last_ledger_in_chunk(chunkId) = ((chunkId + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) - # <= network_tip_ledger iff (chunkId + 1) <= (network_tip_ledger - (GENESIS_LEDGER - 1)) / LEDGERS_PER_CHUNK - # iff chunkId <= ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 range_end_chunk_id = ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 return range_start_chunk_id, range_end_chunk_id def phase1_coverage_end_ledger(meta_store): - """ - Returns the last ledger of the contiguous tail of :lfs flags starting at the lowest - chunk currently on disk. - - - Finds min_chunk_id = lowest chunkId with chunk:{chunkId}:lfs set. - - Walks forward from min_chunk_id counting contiguous :lfs flags. Stops at the first gap. - - Returns GENESIS_LEDGER - 1 if no :lfs flags exist at all. - - Contiguous-tail semantics matter because: - - BSB workers complete chunks in parallel; a mid-Phase-1 crash can leave holes in the - middle of the ingested range. Resuming from the highest :lfs would skip those holes - and break the no-gaps invariant. - - Lifecycle pruning removes :lfs flags of past-retention tx indexes. The lowest - remaining :lfs after prune is naturally the head of surviving coverage — no separate - tip sample or leapfrog calculation needed here. - - Leapfrog decisions (where Phase 1 should start ingesting THIS run) are made separately - inside compute_backfill_chunk_range, which has access to the current tip sample. - """ + # Last ledger of the contiguous :lfs prefix starting at the lowest on-disk chunk. + # Contiguous-prefix (not max-of-:lfs) because parallel BSB workers can leave mid-range + # holes on crash; max would skip them and break no-gaps. + # Returns GENESIS_LEDGER - 1 when no :lfs flags exist.
min_chunk_id = None for key in meta_store.iter_prefix("chunk:"): if not key.endswith(":lfs"): @@ -642,73 +413,66 @@ def phase1_coverage_end_ledger(meta_store): chunk_id = min_chunk_id while meta_store.has(f"chunk:{chunk_id:08d}:lfs"): chunk_id += 1 - return last_ledger_in_chunk(chunk_id - 1) # last contiguous chunk + return last_ledger_in_chunk(chunk_id - 1) ``` -**Worker concurrency**: `run_backfill` honors `source.max_parallelism()` when dispatching `process_chunk` tasks. With BSB this is GOMAXPROCS (unchanged from backfill today). With captive core it is 1 — the DAG dispatches chunks sequentially to avoid spawning multiple captive core subprocesses. +**Worker concurrency:** `run_backfill` honors `source.max_parallelism()` — GOMAXPROCS for BSB, 1 for captive core (single subprocess, can't parallelize). -**Retention semantics** depend on source: -- With BSB: retention determines the Phase 1 range; catchup time scales with `RETENTION_LEDGERS / (BSB throughput)`. -- With captive core: retention determines the Phase 1 range AND captive core's archive-catchup scope. Operators must size retention against the wall-clock cost of captive-core archive catchup. +**Retention semantics by source:** +- BSB: retention determines Phase 1 range; catchup time ≈ `RETENTION_LEDGERS / (BSB throughput)`. +- Captive core: retention determines both Phase 1 range AND captive-core archive-catchup scope — size retention against the wall-clock cost. ### Phase 2 — Hydrate TxHash Data from `.bin` -Phase 1 may leave `.bin` files for chunks in the last (incomplete) index. Phase 2 loads each into the active txhash store and deletes the `.bin` file + its `chunk:{chunk_id}:txhash` flag. After Phase 2, no `.bin` files and no `chunk:{chunk_id}:txhash` flags remain. +- Phase 1 may leave `.bin` files for chunks in the last (incomplete) tx index. +- Phase 2 loads each into the active txhash store, then deletes the `.bin` + `chunk:{chunk_id:08d}:txhash` flag. 
+- After Phase 2: no `.bin` files and no `:txhash` chunk flags remain. ```python def phase2_hydrate_txhash(config, meta_store): - """ - Loads every remaining .bin into the active txhash store, then deletes the .bin and flag. - - - Runs on every startup for robustness. On a restart where a previous Phase 2 completed, - no :txhash flags remain and this is a no-op. - - After each chunk is loaded: delete the flag FIRST, then delete the .bin. A crash - between these two steps leaves an orphan .bin that the sweep in step 3 handles. - - The txhash store must be opened (not re-created) — prior Phase 2 runs may have loaded - earlier chunks, and their .bin files are already gone. - """ cpi = config.backfill.chunks_per_txhash_index - # 1. Backfill may have completed a tx index (index:{tx_index_id}:txhash = "1") before - # a crash prevented cleanup_txhash from deleting leftover .bin. Sweep those first. + # Step 1: sweep leftover .bin for tx indexes already flagged complete — backfill + # may have set index:N:txhash before cleanup_txhash finished on crash. for tx_index_id in tx_index_ids_with_txhash_flag(meta_store): for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_if_exists(raw_txhash_path(chunk_id)) - # 2. Load .bin files for the current incomplete tx index (if any) into the active - # txhash RocksDB. + # Step 2: load .bin for the trailing incomplete tx index into the active RocksDB. 
incomplete_tx_index_id = current_incomplete_tx_index_id(meta_store) if incomplete_tx_index_id is None: return - txhash_store = open_active_txhash_store(config, incomplete_tx_index_id) # WAL recovery; do NOT recreate + txhash_store = open_active_txhash_store(config, incomplete_tx_index_id) try: for chunk_id in range(incomplete_tx_index_id * cpi, (incomplete_tx_index_id + 1) * cpi): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - continue # already loaded (flag cleared) + continue bin_path = raw_txhash_path(chunk_id) if os.path.exists(bin_path): - load_bin_into_rocksdb(bin_path, txhash_store) # idempotent writes - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # delete flag first - delete_if_exists(bin_path) # delete .bin second + load_bin_into_rocksdb(bin_path, txhash_store) + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # flag first + delete_if_exists(bin_path) # then .bin - # 3. Sweep orphan .bin files (flag already gone, .bin lingering from crash between - # flag-delete and file-delete in a prior run). + # Step 3: sweep orphan .bin (flag gone, .bin lingering from a prior crash + # between the two deletes above). for bin_file in scan_bin_files_for_tx_index(incomplete_tx_index_id): if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): os.remove(bin_file) finally: - # Must close before returning — Phase 4's open_active_stores_for_resume re-opens the same - # directory, and RocksDB's directory flock would collide if this handle is still - # open. WAL remains on disk; reopening is safe. + # Close before returning: Phase 4 re-opens by directory path and the RocksDB + # flock would collide if this handle stayed open. txhash_store.close() ``` -**Why "load then delete" matters.** Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. At `cpi=1_000` with frequent restarts over a day, that is thousands of redundant loads. 
Deleting the `.bin` after the first successful load makes Phase 2 a no-op on every subsequent restart until the next Phase 1 deposits new `.bin` files. +**Why "load then delete" matters.** +- Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. +- At `cpi=1_000` with frequent restarts over a day: thousands of redundant loads. +- Load-then-delete makes Phase 2 a no-op on every subsequent restart until the next Phase 1 deposits new `.bin` files. -**Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files — streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 is a trivial no-op in that case. +**Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files; streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 is a no-op. ### Phase 3 — Reconcile Orphaned Transitions @@ -716,67 +480,39 @@ Completes any in-flight transitions left by a prior crash. All decisions derive ```python def phase3_reconcile_orphans(config, meta_store): - """ - Finishes any mid-flight LFS flush, events freeze, or RecSplit build from a crashed run. - - - Active store for resume_chunk_id: keep (Phase 4 will open it). - - Pre-created store for resume_chunk_id + 1: keep. - - Orphaned ledger store: - flag present → cleanup lingered; delete the store. - flag absent, chunk below resume_chunk_id → mid-flush crash; complete the flush. - flag absent, chunk above resume_chunk_id + 1 → orphan future store; delete. - - Orphaned txhash store: - flag present → cleanup lingered; delete the store. - flag absent, all chunks of tx index tx_index_id have :lfs set → spawn RecSplit build. - - On a fresh datadir (no :lfs flags anywhere, Phase 1 had nothing to do) this is a no-op: - resume_ledger = GENESIS_LEDGER, resume_chunk_id = 0, no active stores on disk yet. 
- """ - # Derive resume_ledger the SAME way Phase 4 will — otherwise Phase 3 and Phase 4 can - # disagree on which chunk's active store to preserve, causing Phase 4 to open a fresh - # store while Phase 3's kept-active-store is left as an orphan. - # - # Priority order (matches phase4_live_ingest): - # 1. streaming:last_committed_ledger if set (live-path crash mid-chunk or at boundary). - # 2. phase1_coverage_end_ledger otherwise (first-start after Phase 1, or fresh datadir). + # resume_ledger derivation must match phase4_live_ingest exactly — if they disagree, + # Phase 4 opens a fresh store while Phase 3's preserved store becomes an orphan. cpi = config.backfill.chunks_per_txhash_index last_committed_ledger = meta_store.get("streaming:last_committed_ledger") if last_committed_ledger is None: last_committed_ledger = phase1_coverage_end_ledger(meta_store) - resume_ledger = last_committed_ledger + 1 - if resume_ledger < GENESIS_LEDGER: - resume_ledger = GENESIS_LEDGER + resume_ledger = max(last_committed_ledger + 1, GENESIS_LEDGER) resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - # Ledger stores for store_dir in scan_ledger_store_dirs(config): chunk_id = parse_chunk_id_from_dir(store_dir) if chunk_id == resume_chunk_id or chunk_id == resume_chunk_id + 1: - continue # active or pre-created + continue # active / pre-created; keep if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): delete_dir(store_dir) # orphaned post-flush cleanup elif chunk_id < resume_chunk_id: - finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store) # mid-flush crash; finish + finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store) else: delete_dir(store_dir) # orphan future store - # Txhash stores resume_tx_index_id = resume_chunk_id // cpi for store_dir in scan_txhash_store_dirs(config): tx_index_id = parse_tx_index_id_from_dir(store_dir) if tx_index_id == resume_tx_index_id or tx_index_id == resume_tx_index_id + 1: - continue # active or pre-created + 
continue if meta_store.has(f"index:{tx_index_id:08d}:txhash"): - delete_dir(store_dir) # RecSplit done, cleanup lingered + delete_dir(store_dir) # RecSplit done; cleanup lingered elif all_chunks_in_tx_index_have_lfs_flag(meta_store, tx_index_id, cpi): - # RecSplit build for tx_index_id was never started or was interrupted. Open - # the store and spawn the build — pass the handle, not the directory path, - # because build_tx_index_recsplit_files reads from the store and closes it - # on completion. + # Re-spawn build. Pass the handle, not the dir path — build_tx_index_recsplit_files + # reads from the store and closes it. transitioning_txhash = open_active_txhash_store(config, tx_index_id) run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash, meta_store) - # Events hot segment: truncate any persisted deltas beyond resume_ledger - 1. # Prevents duplicate event IDs when Phase 4 replays the first live ledger. truncate_events_hot_segment(config, resume_ledger - 1) ``` @@ -789,51 +525,36 @@ Opens active stores for the resume position, spawns the lifecycle goroutine, sta def phase4_live_ingest(config, meta_store): last_committed_ledger = meta_store.get("streaming:last_committed_ledger") if last_committed_ledger is None: - # First start after Phase 1: set checkpoint to end of Phase 1's coverage. + # First start after Phase 1. last_committed_ledger = phase1_coverage_end_ledger(meta_store) meta_store.put("streaming:last_committed_ledger", last_committed_ledger) resume_ledger = last_committed_ledger + 1 active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) - run_in_background(run_prune_lifecycle_loop, config, meta_store) - # Prime captive core for unbounded stream from resume_ledger. 
ledger_backend = make_ledger_backend(config.streaming.captive_core_config) ledger_backend.PrepareRange(UnboundedRange(resume_ledger)) - set_daemon_ready() # in-memory flag; unblocks queries - + set_daemon_ready() # in-memory; unblocks queries run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) def open_active_stores_for_resume(config, meta_store, resume_ledger): - """ - Open or create the three active stores for resume_ledger's chunk + tx index. Also - pre-create the next chunk's / next tx index's stores up front so the first chunk - rollover doesn't pay creation latency. - - - Ledger active: per-chunk RocksDB for chunk_id_of_ledger(resume_ledger). WAL-recovered - if the directory exists (mid-chunk restart); fresh-created otherwise. - - Events hot segment: in-memory for chunk_id_of_ledger(resume_ledger). If persisted deltas - exist for this chunk (mid-chunk restart), replay them to rebuild bitmaps. - Phase 3 already truncated anything past last_committed_ledger, so replay is safe. - - TxHash active: per-index RocksDB for tx_index_id_of_chunk(chunk_id_of_ledger(resume_ledger)). May - already contain data from Phase 2's .bin hydration (which closed the handle - before returning — see Phase 2 pseudocode). WAL-recovered on reopen. - - Pre-created: also open/create (chunk_id + 1) and (tx_index_id + 1) stores so the - first boundary rollover is a pointer swap only. - """ + # Open/WAL-recover the current store for each of ledger/events/txhash AND pre-create + # the "next" stores so the first boundary rollover is a pointer swap. + # Events hot segment replays persisted deltas from disk; safe because Phase 3 already + # truncated anything past last_committed_ledger. 
resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK resume_tx_index_id = resume_chunk_id // config.backfill.chunks_per_txhash_index return ActiveStores( - ledger = open_or_create_ledger_store(config, resume_chunk_id), - ledger_next = open_or_create_ledger_store(config, resume_chunk_id + 1), - events = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id, resume_ledger), - events_next = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id + 1, None), - txhash = open_or_create_txhash_store(config, resume_tx_index_id), - txhash_next = open_or_create_txhash_store(config, resume_tx_index_id + 1), + ledger = open_or_create_ledger_store(config, resume_chunk_id), + ledger_next = open_or_create_ledger_store(config, resume_chunk_id + 1), + events = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id, resume_ledger), + events_next = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id + 1, None), + txhash = open_or_create_txhash_store(config, resume_tx_index_id), + txhash_next = open_or_create_txhash_store(config, resume_tx_index_id + 1), ) ``` @@ -847,51 +568,30 @@ Single goroutine. Pull-based: the daemon drives sequential `GetLedger(seq)` call ```python def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): - """ - Sequential pull-based live ingestion. The daemon stays here until process exit. - - Per-ledger steps: - 1. Block on GetLedger(seq) until the ledger is available. - 2. Fan out writes to all three active stores in parallel. Each write is atomic - + WAL-backed, so each store alone is crash-safe. - 3. wait_all — all three must succeed before the per-ledger checkpoint advances. - 4. Commit streaming:last_committed_ledger = seq. This is the atomic 'the daemon - owns everything up to and including seq' signal. - 5. If seq completes a chunk, fire on_chunk_boundary (non-blocking — freeze - transitions run in background). - 6. 
If seq completes an index, fire on_tx_index_boundary — RecSplit build kicks off. - 7. seq += 1. Loop. - - Immutable config values (cpi) are read once outside the loop — never per ledger. - """ - cpi = config.backfill.chunks_per_txhash_index # immutable; read once at loop entry + cpi = config.backfill.chunks_per_txhash_index # immutable; read once ledger_seq = resume_ledger while True: - lcm = ledger_backend.GetLedger(ledger_seq) # blocks until this ledger is available + lcm = ledger_backend.GetLedger(ledger_seq) # blocks until available - # Write to all three active stores in parallel. Order: fan out, wait for all. - # Each store is idempotent on re-write of the same ledger (crash-safe). + # Fan out to all three active stores; wait for all to durably commit before + # advancing the checkpoint. Each write is idempotent on retry. wait_all( run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm), run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm), run_in_background(write_events_hot_segment, active_stores.events, ledger_seq, lcm), ) - # Commit the per-ledger checkpoint (streaming:last_committed_ledger) only AFTER - # all three active stores have durably committed the ledger. This is the key - # atomic boundary for Phase 4 crash recovery — the checkpoint is the sole - # 'the daemon owns everything up to and including this ledger' signal. It's NOT - # the same as Phase 1's coverage-end-ledger (which derives from :lfs flags). + # Atomic "daemon owns everything up to ledger_seq" signal — written only after + # all three stores have durably committed. Distinct from Phase 1's :lfs-derived + # coverage end. meta_store.put("streaming:last_committed_ledger", ledger_seq) - # Chunk rollover: hand off to background LFS + events freeze transitions. 
chunk_id = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK if ledger_seq == last_ledger_in_chunk(chunk_id): on_chunk_boundary(chunk_id, active_stores, meta_store) - # Tx-index rollover — every tx-index boundary is also a chunk boundary, so this - # runs AFTER on_chunk_boundary has already dispatched the last chunk's freeze - # transitions. + # Tx-index boundary runs AFTER on_chunk_boundary (every tx-index boundary is + # also a chunk boundary). tx_index_id = chunk_id // cpi if ledger_seq == last_ledger_in_tx_index(tx_index_id): on_tx_index_boundary(tx_index_id, active_stores, meta_store) @@ -899,19 +599,19 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r ledger_seq += 1 ``` -Each per-store write is atomic: RocksDB WriteBatch + WAL for ledger and txhash stores; atomic commit of events hot-segment + persisted deltas. Key/value schemas are in [Active Store Architecture](#active-store-architecture). +- Each per-store write is atomic: RocksDB WriteBatch + WAL for ledger and txhash stores; atomic commit of events hot-segment + persisted deltas. +- Key/value schemas are in [Active Store Architecture](#active-store-architecture). --- ## Freeze Transitions -Three independent background transitions per chunk/index boundary. Each has its own goroutine, flag, and cleanup. Live ingestion never waits on them synchronously — they must not stall the ingestion loop. - -- **LFS transition** — per chunk. Converts the retired ledger RocksDB to a `.pack` file. -- **Events transition** — per chunk. Converts the retired events hot segment to a cold segment (three files). -- **RecSplit transition** — per index. Builds 16 `.idx` files from the retired txhash RocksDB. +Three independent background transitions per chunk/index boundary; each has its own goroutine, flag, and cleanup. Live ingestion never blocks on them. -Streaming's freeze transitions never produce `.bin` files. 
`.bin` files exist only as transient output of the backfill subroutine (inside Phase 1). +- **LFS transition** — per chunk. Retired ledger RocksDB → `.pack` file. +- **Events transition** — per chunk. Retired events hot segment → cold segment (3 files). +- **RecSplit transition** — per index. Retired txhash RocksDB → 16 `.idx` files. +- Streaming's freeze transitions never produce `.bin` files; those are transient backfill output only (Phase 1). ### Concurrency Model @@ -927,48 +627,28 @@ Triggered when the ingestion loop commits `last_ledger_in_chunk(chunk_id)`. Hand ```python def on_chunk_boundary(chunk_id, active_stores, meta_store): - """ - Swap active stores and kick off LFS + events freeze transitions for this chunk_id. - - Ingestion for (chunk_id + 1) continues unimpeded — active_stores.ledger now points - at the ledger_next store that was pre-created at Phase 4 entry (or by the prior - chunk's boundary handler). - - Also pre-creates (chunk_id + 2)'s stores in background, so the NEXT chunk rollover - finds its pre-created store already opened. - """ - - # LFS transition — drain the last in-flight LFS freeze (max-1-transitioning invariant), - # then swap pointers so the next chunk writes to the pre-created store. + # LFS: drain the last in-flight LFS freeze (max-1-transitioning), swap pointers, + # spawn the freeze. wait_for_lfs_complete() transitioning_ledger_store = active_stores.ledger - active_stores.ledger = active_stores.ledger_next # pointer swap, no I/O + active_stores.ledger = active_stores.ledger_next run_in_background(freeze_ledger_chunk_to_pack_file, chunk_id, transitioning_ledger_store, meta_store) - # Events transition — same shape. Independent goroutine; does NOT wait for LFS. + # Events: same shape, independent goroutine (does NOT wait on LFS). 
wait_for_events_complete() freezing_events_segment = active_stores.events - active_stores.events = active_stores.events_next # pointer swap + active_stores.events = active_stores.events_next run_in_background(freeze_events_chunk_to_cold_segment, chunk_id, freezing_events_segment, meta_store) - # Pre-create (chunk_id + 2)'s ledger + events so the NEXT boundary is also a pointer - # swap. Low priority; not part of the hot path. Runs in background. + # Pre-create "next-next" so the NEXT boundary is also a pointer swap. Background; + # not on the hot path. run_in_background(precreate_next_boundary_stores, active_stores, meta_store, chunk_id + 2) - # Wake the lifecycle goroutine — it will check prune eligibility. Freeze transitions - # above are NOT dispatched via the lifecycle loop; they run as direct children of the - # ingestion-loop thread. The notification is specifically for pruning. - notify_lifecycle() + notify_lifecycle() # wake prune loop (this notification is ONLY for prune eligibility) def precreate_next_boundary_stores(active_stores, meta_store, target_chunk_id): - """ - Opens / creates the "next-next" ledger store + events hot segment in background so - the NEXT chunk rollover doesn't pay creation latency on the hot path. - - Similarly handles tx-index-next pre-creation when target_chunk_id crosses a tx-index - boundary. Idempotent — safe to run on a restart where the target stores already exist. - """ + # Idempotent — safe to re-run when target stores already exist. 
active_stores.ledger_next = open_or_create_ledger_store(config, target_chunk_id) active_stores.events_next = open_or_create_events_hot_segment(config, meta_store, target_chunk_id, None) cpi = config.backfill.chunks_per_txhash_index @@ -983,41 +663,23 @@ Converts the retired ledger RocksDB store to an immutable `.pack` file, then dis ```python def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_store): - """ - Read all LEDGERS_PER_CHUNK ledgers for chunk_id from its active store, write the - pack file, fsync, flag, then delete the store. - - Order matters: - 1. Open pack file with overwrite=True so a prior crashed attempt's bytes are discarded. - 2. Write all ledgers in order. - 3. fsync_and_close — the pack file is durable on disk after this. - 4. Set :lfs flag — the 'flag-after-fsync' invariant. Queries can now route here. - 5. Close and delete the active store. Crash between (4) and (5) leaves an orphan - directory; Phase 3's scan_ledger_store_dirs + :lfs-present check deletes it. - """ + # Order: overwrite=True (discard any prior partial) → write → fsync → flag → cleanup. + # Flag-after-fsync. Crash between flag and store-delete leaves an orphan dir; Phase 3 + # reconciles via `:lfs` present + store present → delete store. 
pack_path = ledger_pack_path(chunk_id) - writer = packfile.create(pack_path, overwrite=True) # 1 + writer = packfile.create(pack_path, overwrite=True) for ledger_seq in range(first_ledger_in_chunk(chunk_id), last_ledger_in_chunk(chunk_id) + 1): - writer.append(transitioning_ledger_store.get(uint32_big_endian(ledger_seq))) # 2 - writer.fsync_and_close() # 3 - meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") # 4 - - transitioning_ledger_store.close() # 5 + writer.append(transitioning_ledger_store.get(uint32_big_endian(ledger_seq))) + writer.fsync_and_close() + meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") + transitioning_ledger_store.close() delete_dir(ledger_store_path(chunk_id)) signal_lfs_complete() def finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store): - """ - Phase 3 helper. Re-runs freeze_ledger_chunk_to_pack_file for a chunk whose active - ledger store exists on disk but whose :lfs flag is absent — i.e., a crash - interrupted the freeze after the per-ledger checkpoint but before the flag was set. - - Identical to freeze_ledger_chunk_to_pack_file except: - - No signal_lfs_complete call (not running under the max-1-transitioning gate; - Phase 3 is synchronous with startup and runs to completion before Phase 4 starts). - - Opens the existing store (WAL-recovered) rather than receiving a handle. - """ + # Phase 3 helper. Same as freeze_ledger_chunk_to_pack_file but opens the existing + # store (WAL-recovered) and skips signal_lfs_complete (Phase 3 is synchronous). transitioning_ledger_store = open_or_create_ledger_store(config, chunk_id) pack_path = ledger_pack_path(chunk_id) writer = packfile.create(pack_path, overwrite=True) @@ -1035,16 +697,12 @@ Converts the retired events hot segment to three immutable files (events cold se ```python def freeze_events_chunk_to_cold_segment(chunk_id, freezing_events_segment, meta_store): - """ - Freeze the events hot segment for chunk_id. 
Same flag-after-fsync + cleanup order - as freeze_ledger_chunk_to_pack_file. - """ + # Same flag-after-fsync order as freeze_ledger_chunk_to_pack_file. events_path = events_segment_path(chunk_id) - write_cold_segment(freezing_events_segment, events_path) # 3 files: events.pack, index.pack, index.hash + write_cold_segment(freezing_events_segment, events_path) # 3 files: events.pack, index.pack, index.hash fsync_all(events_path) - meta_store.put(f"chunk:{chunk_id:08d}:events", "1") # flag-after-fsync - - freezing_events_segment.discard() # drops in-memory bitmaps + persisted deltas + meta_store.put(f"chunk:{chunk_id:08d}:events", "1") + freezing_events_segment.discard() # drops in-memory bitmaps + persisted deltas signal_events_complete() ``` @@ -1054,21 +712,12 @@ The last chunk of a tx index has just rolled over. Before RecSplit can start, ev ```python def on_tx_index_boundary(tx_index_id, active_stores, meta_store): - """ - Dispatch RecSplit build for this tx_index_id. Prerequisites: - - Every chunk in tx_index_id has finished its LFS + events freeze transitions. - - No LFS or events transition is in flight for any chunk of tx_index_id (would - race the RecSplit input). - """ - - # Drain ALL in-flight LFS + events transitions. on_chunk_boundary dispatches them; - # here we wait for them to finish — the final chunk of tx_index_id may still be - # in-flight. + # Drain ALL in-flight LFS + events (the final chunk's freeze may still be running) + # — RecSplit cannot race its input. wait_for_lfs_complete() wait_for_events_complete() - verify_all_chunk_flags(tx_index_id, meta_store) # defense-in-depth + verify_all_chunk_flags(tx_index_id, meta_store) # defense-in-depth - # Swap the txhash active store. RecSplit reads from the retired store. 
transitioning_txhash_store = active_stores.txhash active_stores.txhash = active_stores.txhash_next run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash_store, meta_store) @@ -1080,23 +729,14 @@ Builds the 16 RecSplit `.idx` files for tx_index_id from the retired txhash acti ```python def build_tx_index_recsplit_files(tx_index_id, transitioning_txhash_store, meta_store): - """ - Same flag-after-fsync pattern as LFS / events freeze: - 1. Delete any partial .idx files from a prior crashed attempt. - 2. Build the 16 RecSplit indexes (one per CF). - 3. fsync all .idx files. - 4. Verify spot-check against the txhash store. - 5. Flag. - 6. Close + delete the txhash active store. - """ + # Same flag-after-fsync pattern as LFS / events freeze; verify before flag. idx_path = recsplit_index_path(tx_index_id) - delete_partial_idx_files(idx_path) # 1 - build_recsplit(transitioning_txhash_store, idx_path) # 2 (16 .idx files) - fsync_all_idx_files(idx_path) # 3 - verify_spot_check(tx_index_id, idx_path, meta_store) # 4 - meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") # 5 - - transitioning_txhash_store.close() # 6 + delete_partial_idx_files(idx_path) + build_recsplit(transitioning_txhash_store, idx_path) # 16 .idx files + fsync_all_idx_files(idx_path) + verify_spot_check(tx_index_id, idx_path, meta_store) + meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") + transitioning_txhash_store.close() delete_dir(txhash_store_path(tx_index_id)) ``` @@ -1108,25 +748,13 @@ Retention is enforced by a single background goroutine, woken at chunk boundarie ```python def run_prune_lifecycle_loop(config, meta_store): - """ - Runs as a single background goroutine. Prune gate is uniform across all artifact - kinds — LFS, events, RecSplit — for a given index. - - Wake-up sources: - - Initial scan at entry — catches any index left in "deleting" state by a prior - crashed prune before the first chunk-boundary notification of this run arrives. 
- Without this, a crashed prune could sit unserviced for up to 10_000 ledgers - (~16 hours at cpi=1). - - Chunk-boundary notifications from the ingestion loop (see on_chunk_boundary). - - The freeze transitions (freeze_ledger_chunk_to_pack_file, freeze_events_chunk_to_cold_segment, build_tx_index_recsplit_files) are - NOT spawned by this loop — the ingestion loop's on_chunk_boundary / on_tx_index_boundary - dispatch them directly. run_prune_lifecycle_loop is scoped to pruning. - """ + # Initial scan at entry catches any `"deleting"` state left by a prior crashed prune; + # without it, a crashed prune could sit unserviced until the next chunk boundary + # (up to ~16 h at cpi=1). Subsequent sweeps fire on chunk-boundary notifications. cpi = config.backfill.chunks_per_txhash_index retention_ledgers = config.streaming.retention_ledgers - _run_prune_sweep(meta_store, retention_ledgers, cpi, config) # initial scan + _run_prune_sweep(meta_store, retention_ledgers, cpi, config) while True: wait_for_chunk_boundary_notification() _run_prune_sweep(meta_store, retention_ledgers, cpi, config) @@ -1138,34 +766,18 @@ def _run_prune_sweep(meta_store, retention_ledgers, cpi, config): def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): - """ - Returns tx_index_ids whose entire footprint is past the retention window and are - still prune-eligible (either :txhash == "1" meaning prune hasn't started, or - "deleting" meaning a prior run crashed mid-prune). - - - retention_ledgers = 0 → no pruning; archive profile retains everything. - - retention_ledgers > 0 → tx_index_id is eligible when - last_committed_ledger > last_ledger_in_tx_index(tx_index_id) + retention_ledgers. - - 'tip ledger' used in the check is streaming:last_committed_ledger (the daemon's - own progress), not the source-reported network tip. 
- - Upper bound derivation: - last_ledger_in_tx_index(tx_index_id) = ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) - Eligible iff last_committed_ledger > ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) + retention_ledgers - iff last_committed_ledger - (GENESIS_LEDGER - 1) - retention_ledgers > (tx_index_id + 1) * LEDGERS_PER_INDEX - iff (tx_index_id + 1) < (last_committed_ledger - (GENESIS_LEDGER - 1) - retention_ledgers) / LEDGERS_PER_INDEX - iff tx_index_id <= ((last_committed_ledger - (GENESIS_LEDGER - 1) - retention_ledgers - 1) // LEDGERS_PER_INDEX) - 1 - Simplify: last_committed_ledger - (GENESIS_LEDGER - 1) - 1 = last_committed_ledger - GENESIS_LEDGER. - max_eligible_tx_index_id = ((last_committed_ledger - GENESIS_LEDGER - retention_ledgers) // LEDGERS_PER_INDEX) - 1 - - Numeric check at last_committed_ledger=70_000_002, retention_ledgers=10_000_000, - cpi=1_000 (LEDGERS_PER_INDEX=10_000_000): - max_eligible_tx_index_id = (70_000_002 - 2 - 10_000_000) // 10_000_000 - 1 = 6 - 1 = 5. - tx_index_id=5 has last_ledger_in_tx_index(5) + retention_ledgers = 60_000_001 + 10_000_000 = 70_000_001. - 70_000_002 > 70_000_001 → tx_index_id=5 eligible. ✓ - tx_index_id=6 has last_ledger_in_tx_index(6) + retention_ledgers = 70_000_001 + 10_000_000 = 80_000_001. - 70_000_002 > 80_000_001 is false → tx_index_id=6 NOT eligible. ✓ - """ + # Returns tx_index_ids fully past retention and still prune-eligible (`:txhash` is + # `"1"` or `"deleting"`). Uses streaming:last_committed_ledger (daemon's own progress), + # not the source-reported tip. + # + # Derivation: tx_index_id eligible iff + # last_committed_ledger > last_ledger_in_tx_index(tx_index_id) + retention_ledgers + # → max_eligible_tx_index_id = ((last_committed_ledger - GENESIS_LEDGER - retention_ledgers) + # // LEDGERS_PER_INDEX) - 1 + # Check at last_committed_ledger=70_000_002, retention_ledgers=10_000_000, cpi=1_000: + # max_eligible = (70_000_002 - 2 - 10_000_000) // 10_000_000 - 1 = 5. 
+ # tx_index 5 ends at 60_000_001 + 10M = 70_000_001; 70_000_002 > that → eligible. ✓ + # tx_index 6 ends at 70_000_001 + 10M = 80_000_001; NOT eligible. ✓ if retention_ledgers == 0: return [] last_committed_ledger = meta_store.get("streaming:last_committed_ledger") @@ -1182,25 +794,13 @@ def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): def prune_tx_index(tx_index_id, meta_store, config): - """ - Deletes every artifact for tx_index_id and clears its meta store keys. Two-phase - marker for query-routing safety: - - - Set :txhash = "deleting" FIRST. Queries short-circuit (treat as absent). - - Delete files + chunk keys. - - Delete :txhash key LAST. - - Crash between set-deleting and delete-key leaves :txhash == "deleting"; next - startup re-runs prune_tx_index, which is idempotent (rm -f + delete_if_exists - semantics). - """ + # Two-phase marker for query-routing safety: set "deleting" BEFORE any file delete; + # clear the key AFTER. Queries short-circuit on "deleting" (treated as absent). + # Idempotent on crash-between-stages retry. cpi = config.backfill.chunks_per_txhash_index - # Stage 1: commit to pruning. Once this lands, queries for any ledger in - # tx_index_id return HTTP 4xx (past retention). meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting") - # Stage 2: delete files and per-chunk keys. Idempotent on re-run. for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): delete_if_exists(ledger_pack_path(chunk_id)) delete_events_segment(chunk_id) @@ -1208,13 +808,16 @@ def prune_tx_index(tx_index_id, meta_store, config): meta_store.delete(f"chunk:{chunk_id:08d}:events") delete_recsplit_idx_files(tx_index_id) - # Stage 3: clear the tx-index key. Tx index is now fully gone. meta_store.delete(f"index:{tx_index_id:08d}:txhash") ``` -**Why index-atomic.** Per-chunk pruning would create a window where `getTransaction` resolves to a ledger sequence whose pack file has already been deleted. 
Gating every artifact kind on whole-index past-retention closes that window completely. +**Why index-atomic.** +- Per-chunk pruning would open a window where `getTransaction` resolves to a ledger seq whose pack has already been deleted. +- Whole-index gating closes that window. -**How much extra data sits on disk.** At most `LEDGERS_PER_INDEX - 1` ledgers past the strict retention line. Because `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_INDEX`, the strict retention line itself does not bisect an index — the next-eligible index is exactly `LEDGERS_PER_INDEX` further. +**How much extra data sits on disk.** +- At most `LEDGERS_PER_INDEX - 1` ledgers past the strict retention line. +- `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_INDEX`, so the line never bisects an index; the next-eligible index is exactly `LEDGERS_PER_INDEX` further. --- @@ -1242,7 +845,9 @@ Query serving is gated on Phase 4 being reached. `getLedger`, `getTransaction`, ### Rationale -Without an explicit gate, implementations drift toward "best-effort serve whatever is ingested." That produces inconsistent results across operators and breaks client assumptions. An explicit `daemon_ready` flag + HTTP 4xx error gives clients an unambiguous signal, and the `catching_up` health status gives operators visibility into progress. +- Without an explicit gate, implementations drift toward "best-effort serve whatever is ingested" — inconsistent across operators, breaks client assumptions. +- Explicit `daemon_ready` + HTTP 4xx gives clients an unambiguous signal. +- `catching_up` health status gives operators visibility into progress. --- @@ -1262,12 +867,25 @@ In addition to the backfill subroutine's invariants in [01-backfill-workflow.md ### Compound Recovery Scenarios -The backfill doc's crash recovery model (Section: Crash Recovery in `01-backfill-workflow.md`) handles every Phase 1 crash. 
Streaming extends it with per-ledger and per-transition recovery: +Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workflow.md#crash-recovery) handles every Phase 1 crash. Streaming adds: + +- **Crash during Phase 2 `.bin` hydration.** + - Chunks loaded pre-crash: no `:txhash` flag, no `.bin` → loop skips via flag check. + - Chunks not yet loaded: `:txhash` + `.bin` present → loop picks them up. + +- **Crash between per-ledger checkpoint and LFS freeze completion.** + - State: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)`; `chunk:{chunk_id}:lfs` absent. + - Phase 1 on restart: `:lfs` missing → re-runs `process_chunk(chunk_id)` against source (idempotent per artifact). + - Phase 3 then: active ledger store present + `:lfs` now set → deletes the orphaned store. + - Cost: ~10_000 ledgers of redundant ingestion per affected chunk. Correctness preserved. -- **Crash during Phase 2 `.bin` hydration.** On restart, Phase 2 re-runs. Chunks whose `.bin` was loaded and deleted on the first pass have no `:txhash` flag and no `.bin` file — the loop skips them via the flag check. Chunks not yet loaded still have their `:txhash` flag and `.bin` file — picked up by the same loop. -- **Crash between live per-ledger checkpoint and LFS freeze completion.** `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)` but `chunk:{chunk_id}:lfs` is absent (freeze transition was killed before setting the flag). On restart, Phase 1 sees `:lfs` missing for chunk_id and re-runs `process_chunk(chunk_id)` against its configured source — idempotent per-artifact. Phase 3 then finds the active ledger store for chunk_id still on disk, sees `:lfs` now set, and deletes the orphaned store. Known inefficiency: ~10_000 ledgers of redundant ingestion work per affected chunk. Correctness is preserved. -- **Crash mid-RecSplit.** `index:{tx_index_id}:txhash` absent. 
Phase 3 detects all chunks for tx_index_id have `:lfs` set, re-spawns the RecSplit build. Partial `.idx` files are deleted first. -- **Crash mid-prune.** Some files deleted, some chunk keys cleared, `index:{tx_index_id}:txhash = "deleting"` still present. On restart tx_index_id is still in `prunable_tx_index_ids` (the function picks up `"deleting"` as well as `"1"`), so `prune_tx_index(tx_index_id)` runs again — idempotent because file deletes are `rm -f` and key deletes are `delete_if_exists`. +- **Crash mid-RecSplit.** + - State: `index:{tx_index_id}:txhash` absent; `:lfs` set for every chunk of the tx index. + - Phase 3: re-spawns the RecSplit build after deleting partial `.idx` files. + +- **Crash mid-prune.** + - State: some files deleted, some chunk keys cleared, `index:{tx_index_id}:txhash = "deleting"` still present. + - `prunable_tx_index_ids` picks up `"deleting"` alongside `"1"` → `prune_tx_index(tx_index_id)` re-runs, idempotent (file deletes `rm -f`, key deletes `delete_if_exists`).
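The mid-prune recovery path can be sketched end to end. This is a minimal illustration under stated assumptions, not the service's implementation: the meta store is a plain `dict`, artifact files are a `set` of names, and `prune_tx_index` takes a hypothetical `crash_after` parameter to simulate a kill partway through. Only the eligibility bound, the `"1"` / `"deleting"` states, and the key formats are taken from this document.

```python
GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000


class SimulatedCrash(Exception):
    """Hypothetical stand-in for the process being killed mid-prune."""


def max_eligible_tx_index_id(last_committed_ledger, retention_ledgers, cpi):
    # Largest tx_index_id with
    #   last_committed_ledger > last_ledger_in_tx_index(id) + retention_ledgers.
    ledgers_per_index = LEDGERS_PER_CHUNK * cpi
    return (last_committed_ledger - GENESIS_LEDGER - retention_ledgers) // ledgers_per_index - 1


def prunable_tx_index_ids(meta, last_committed_ledger, retention_ledgers, cpi):
    if retention_ledgers == 0:  # archive profile: retain everything
        return []
    upper = max_eligible_tx_index_id(last_committed_ledger, retention_ledgers, cpi)
    # "deleting" is picked up alongside "1" so an interrupted prune is retried.
    return [t for t in range(upper + 1)
            if meta.get(f"index:{t:08d}:txhash") in ("1", "deleting")]


def prune_tx_index(tx_index_id, meta, files, cpi, crash_after=None):
    index_key = f"index:{tx_index_id:08d}:txhash"
    meta[index_key] = "deleting"                       # marker set BEFORE any delete
    for n, chunk_id in enumerate(range(tx_index_id * cpi, (tx_index_id + 1) * cpi)):
        if crash_after is not None and n == crash_after:
            raise SimulatedCrash()
        files.discard(f"{chunk_id:08d}.pack")          # no error if already gone (rm -f)
        meta.pop(f"chunk:{chunk_id:08d}:lfs", None)    # delete-if-exists
        meta.pop(f"chunk:{chunk_id:08d}:events", None)
    meta.pop(index_key, None)                          # key cleared AFTER all deletes


def demo():
    cpi = 4  # small value for illustration only
    meta = {"index:00000000:txhash": "1"}
    files = set()
    for chunk_id in range(cpi):
        meta[f"chunk:{chunk_id:08d}:lfs"] = "1"
        meta[f"chunk:{chunk_id:08d}:events"] = "1"
        files.add(f"{chunk_id:08d}.pack")
    try:
        prune_tx_index(0, meta, files, cpi, crash_after=2)
    except SimulatedCrash:
        pass
    marker_after_crash = meta.get("index:00000000:txhash")
    # "Restart": the sweep finds the index again and re-runs the prune.
    for tx_index_id in prunable_tx_index_ids(meta, 80_002, 40_000, cpi):
        prune_tx_index(tx_index_id, meta, files, cpi)
    return marker_after_crash, meta, files
```

In `demo()`, the first attempt stops after two of four chunks and leaves the marker at `"deleting"`. The second sweep selects the same index and finishes the job, since every delete tolerates already-missing state.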
--- @@ -1290,24 +908,22 @@ drift_ledgers = ledger_backend.latest_tip() - meta_store.get("streaming:last_com ## Error Handling -| Error | Action | -|---|---| -| CaptiveStellarCore unavailable | RETRY with backoff; ABORT after `CAPTIVE_CORE_RETRY_MAX` retries (implementation-defined) | -| Ledger / txhash / events write failure | ABORT — disk full or storage corruption | -| Meta store write failure | ABORT — cannot maintain checkpoint | -| LFS flush failure | Do NOT set `chunk:{chunk_id}:lfs`; ABORT transition; restart retries | -| Events freeze failure | Do NOT set `chunk:{chunk_id}:events`; ABORT transition; restart retries | -| RecSplit build failure | Do NOT set `index:{tx_index_id}:txhash`; ABORT transition; restart deletes partials and rebuilds | -| RecSplit verification mismatch | ABORT; do NOT delete transitioning txhash store; operator investigates | -| Startup: immutable key changed | FATAL — wipe datadir to change | -| Startup: `RETENTION_LEDGERS` not a multiple of `LEDGERS_PER_INDEX` | FATAL — fix config | -| Startup: head not index-aligned | FATAL — datadir corruption; wipe | -| Startup: gap in chunk flags | FATAL — datadir corruption; wipe | +Three distinct policies — runtime ABORT, transition retry-via-flag-absence, startup FATAL. ---- +### Runtime (Phase 4 ingestion) + +- **CaptiveStellarCore unavailable.** RETRY with backoff; ABORT after `CAPTIVE_CORE_RETRY_MAX` attempts (implementation-defined). +- **Per-ledger store write failure (ledger / txhash / events).** ABORT — disk full or storage corruption. +- **Meta-store write failure.** ABORT — cannot maintain checkpoint. + +### Freeze transitions (LFS / events / RecSplit) + +All three follow the flag-after-fsync invariant: on failure, don't set the completion flag; abort the transition; restart retries the whole transition from scratch (partial `.idx` files get cleaned by the build's own preamble). 
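A minimal sketch of the flag-after-fsync ordering, assuming a `dict` stands in for the meta store and a temporary directory for the artifact path. `freeze_artifact` and its `fail_before_flag` parameter are hypothetical names used only for illustration.

```python
import os
import tempfile


def freeze_artifact(path, payload, meta, flag_key, fail_before_flag=False):
    # Opening with "wb" truncates, so bytes from a prior failed attempt are discarded.
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())          # file contents are on disk before the flag is set
    if fail_before_flag:
        raise RuntimeError("simulated failure between fsync and flag")
    meta[flag_key] = "1"              # completion flag is written last


def demo():
    meta = {}
    flag_key = "chunk:00000007:lfs"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "00000007.pack")
        try:
            freeze_artifact(path, b"first attempt", meta, flag_key, fail_before_flag=True)
        except RuntimeError:
            pass
        flag_set_after_failure = flag_key in meta
        # The retry redoes the whole transition from scratch.
        freeze_artifact(path, b"second attempt", meta, flag_key)
        with open(path, "rb") as f:
            data = f.read()
    return flag_set_after_failure, meta, data
```

Because the flag is absent after the failed attempt, nothing treats the artifact as complete, and the retry overwrites it. A production version would likely also fsync the parent directory; that step is left out of this sketch.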
+ +- **RecSplit verification mismatch.** ABORT; do NOT delete the transitioning txhash store; operator investigates. -## Related Documents +### Startup (FATAL — datadir / config issues) -- [01-backfill-workflow.md](./01-backfill-workflow.md) — backfill subroutine: DAG, `process_chunk`, partial index handling -- [getEvents full-history design](../../design-docs/getevents-full-history-design.md) — events hot/cold segments, bitmap indexes, freeze process -- Query routing — separate design document (TBD) +- `CHUNKS_PER_TXHASH_INDEX` or `RETENTION_LEDGERS` changed: wipe datadir to change. +- `RETENTION_LEDGERS` not a multiple of `LEDGERS_PER_INDEX`: fix config. +- Head not index-aligned / gap in chunk flags: datadir corruption; wipe. From f4a33c0ffef3fb12577f972f25d875c5e6c89da4 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Thu, 23 Apr 2026 00:22:49 -0700 Subject: [PATCH 20/34] Design docs: BSB-only backfill + flat config + tip simplification MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Backfill is BSB-only. run_backfill takes a make_bsb partial; each process_chunk instantiates its own BSBSource scoped to its chunk's 10_000 ledgers, reads, tears down. DAG cap is GOMAXPROCS. run_backfill fires the DAG without probing the source — task failures surface any coverage issue at runtime. Phase 1 (catchup) invokes backfill only when [BSB] is configured. Without [BSB], Phase 1 is a no-op; Phase 4's captive core catches up from a leapfrog'd resume_ledger as part of its own startup. Captive core is not a backfill source. validate_config fatals on [BSB] absent plus full history. Config flattened. [BACKFILL] and [STREAMING] sections removed. Top-level [SERVICE] (DEFAULT_DATA_DIR, CHUNKS_PER_TXHASH_INDEX, RETENTION_LEDGERS, DRIFT_WARNING_LEDGERS), [BSB], [CAPTIVE_CORE] (new; replaces CAPTIVE_CORE_CONFIG under STREAMING), [ACTIVE_STORAGE] (was [STREAMING.ACTIVE_STORAGE]); path-default sections marked (optional). 
Service name: "RPC service" everywhere (was "streaming daemon"). Tip sampling: history archive when captive core is not yet running (Phase 1 loop, Phase 4 leapfrog-from-tip on fresh start without BSB); captive core itself once it is running (Phase 4 ingestion / drift). No BSB tip calls anywhere. "Backfill vs Phase 1" explainer canonical in 02; 01 points to it. --- .../design-docs/01-backfill-workflow.md | 177 ++++++++------- .../design-docs/02-streaming-workflow.md | 208 +++++++++--------- 2 files changed, 196 insertions(+), 189 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index a54741e3d..eb65c47de 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -2,15 +2,16 @@ ## Overview -Backfill is a subroutine invoked by Phase 1 of the streaming daemon (see [02-streaming-workflow.md](./02-streaming-workflow.md)). Given an integer chunk range `[range_start_chunk_id, range_end_chunk_id]` and a `LedgerSource`, it produces the immutable output files for those chunks via a static DAG of idempotent per-chunk tasks. - -**Not an operator CLI.** The daemon is the single operator entry point (`stellar-rpc --config path/to/config.toml`); backfill has no `full-history-backfill` subcommand and no per-run CLI flags. +- Backfill is a subroutine invoked by the RPC service's **Phase 1 (catchup)** — see [02-streaming-workflow.md — Phase 1](./02-streaming-workflow.md#phase-1--catchup). +- Internal to the daemon; no `full-history-backfill` subcommand, no per-run flags. +- Input: an integer chunk range `[range_start_chunk_id, range_end_chunk_id]` and a `make_bsb` partial function — calling `make_bsb()` returns a fresh [`BSBSource`](./02-streaming-workflow.md#ledger-source) instance. 
**What it does:** -- Ingests historical ledgers via the `LedgerSource` passed in by the caller — BSB or captive core (see [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source)). +- Ingests historical ledgers via per-task BSB instances. Each `process_chunk` calls `make_bsb()` to get its own `BSBSource`, calls `prepare_range` scoped to the chunk's 10_000 ledgers, reads in a loop, and tears down. Independent per chunk — no shared source state. - Writes directly to immutable file formats — no RocksDB active stores. -- Schedules work as a DAG of idempotent tasks dispatched via a flat worker pool. -- Returns when every chunk in the range is complete; on crash, Phase 1 re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency. +- Schedules work as a DAG of idempotent tasks dispatched via a flat worker pool (`GOMAXPROCS` concurrency). +- Returns when every chunk in the range is complete; on crash, Phase 1 (catchup) re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency. +- **BSB-only.** Backfill does not use captive core as a ledger source. Captive core belongs to Phase 4 (live streaming); if BSB isn't configured, backfill is not invoked at all and Phase 4's captive core catches up from a leapfrog'd resume ledger as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion). 
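The per-task source pattern described above can be sketched as follows. `FakeBSBSource` is a hypothetical stand-in: the real `BSBSource` interface lives in 02-streaming-workflow.md and is not reproduced here. Its `get_ledger` and `close` methods, the thread pool, and the `max_workers` value are assumptions for illustration. Only the `make_bsb()` factory shape, `prepare_range`, and the chunk geometry come from this document.

```python
import functools
from concurrent.futures import ThreadPoolExecutor

GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000


def first_ledger_in_chunk(chunk_id):
    return chunk_id * LEDGERS_PER_CHUNK + GENESIS_LEDGER


def last_ledger_in_chunk(chunk_id):
    return (chunk_id + 1) * LEDGERS_PER_CHUNK + GENESIS_LEDGER - 1


class FakeBSBSource:
    """Hypothetical stand-in; serves synthetic ledgers for a prepared range."""

    def __init__(self, bsb_config):
        self.bsb_config = bsb_config
        self.prepared = None

    def prepare_range(self, first_ledger, last_ledger):
        self.prepared = (first_ledger, last_ledger)

    def get_ledger(self, seq):
        first_ledger, last_ledger = self.prepared
        if not first_ledger <= seq <= last_ledger:
            raise ValueError(f"ledger {seq} outside prepared range")
        return f"ledger-{seq}"

    def close(self):
        self.prepared = None


def process_chunk(chunk_id, make_bsb):
    source = make_bsb()               # fresh instance owned by this task only
    first_ledger = first_ledger_in_chunk(chunk_id)
    last_ledger = last_ledger_in_chunk(chunk_id)
    source.prepare_range(first_ledger, last_ledger)
    try:
        count = sum(1 for seq in range(first_ledger, last_ledger + 1)
                    if source.get_ledger(seq))
    finally:
        source.close()                # tear down even if a read fails
    return chunk_id, first_ledger, last_ledger, count


def run_backfill(range_start_chunk_id, range_end_chunk_id, make_bsb, max_workers=4):
    chunk_ids = range(range_start_chunk_id, range_end_chunk_id + 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_chunk, chunk_id, make_bsb) for chunk_id in chunk_ids]
        return [future.result() for future in futures]


make_bsb = functools.partial(
    FakeBSBSource, {"BUCKET_PATH": "sdf-ledger-close-meta/v1/ledgers/pubnet"}
)
results = run_backfill(0, 2, make_bsb)
```

Each task prepares only its own 10_000-ledger range, so no source state is shared between tasks and a failure in one chunk stays inside that task.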
**What it produces:** @@ -20,11 +21,15 @@ Backfill is a subroutine invoked by Phase 1 of the streaming daemon (see [02-str | `getTransaction` | Tx-index files | Per tx index (default 10_000_000 ledgers) | | `getEvents` | [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | Per chunk | +For the distinction between *backfill (this subroutine)* and *Phase 1 (the startup phase that invokes it)* — two terms that get conflated because their scopes overlap — see [02-streaming-workflow.md — Backfill vs Phase 1](./02-streaming-workflow.md#backfill-vs-phase-1). + --- ## Geometry -Stellar's first ledger is `GENESIS_LEDGER = 2` (not 0 or 1). Every formula that maps `ledger_seq ↔ chunk_id` subtracts `GENESIS_LEDGER` to zero-base the axis. In the pseudocode below, `cpi` in inline comments is shorthand for `CHUNKS_PER_TXHASH_INDEX`. +Stellar's first ledger is `GENESIS_LEDGER = 2` (not 0 or 1). +Every formula that maps `ledger_seq ↔ chunk_id` subtracts `GENESIS_LEDGER` to zero-base the axis. +In the pseudocode below, `cpi` in inline comments is shorthand for `CHUNKS_PER_TXHASH_INDEX`. ```python GENESIS_LEDGER = 2 @@ -66,7 +71,8 @@ All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). ## Configuration -The streaming daemon loads a single TOML file; backfill reads the subset documented here. Streaming-only sections (`[STREAMING]`, `[HISTORY_ARCHIVES]`) are in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). +- The service loads a single TOML file; backfill reads the subset documented here. +- Daemon-level sections not consumed by backfill — `[CAPTIVE_CORE]`, `[ACTIVE_STORAGE]`, `[HISTORY_ARCHIVES]`, plus `RETENTION_LEDGERS` / `DRIFT_WARNING_LEDGERS` under `[SERVICE]` — are documented in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). 
### TOML Config @@ -75,32 +81,29 @@ The streaming daemon loads a single TOML file; backfill reads the subset documen | Key | Type | Default | Description | |-----|------|---------|-------------| | `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | - -**[BACKFILL]** - -| Key | Type | Default | Description | -|-----|------|---------|-------------| | `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per tx index. Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. | -**[IMMUTABLE_STORAGE.LEDGERS]** +`[SERVICE]` also carries daemon-level keys not read by backfill — `RETENTION_LEDGERS`, `DRIFT_WARNING_LEDGERS` — see [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). + +**[IMMUTABLE_STORAGE.LEDGERS]** (optional) | Key | Type | Default | Description | |-----|------|---------|-------------| | `PATH` | string | `{DEFAULT_DATA_DIR}/ledgers` | Base path for ledger pack files. | -**[IMMUTABLE_STORAGE.EVENTS]** +**[IMMUTABLE_STORAGE.EVENTS]** (optional) | Key | Type | Default | Description | |-----|------|---------|-------------| | `PATH` | string | `{DEFAULT_DATA_DIR}/events` | Base path for events cold segments. | -**[IMMUTABLE_STORAGE.TXHASH_RAW]** +**[IMMUTABLE_STORAGE.TXHASH_RAW]** (optional) | Key | Type | Default | Description | |-----|------|---------|-------------| | `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/raw` | Base path for raw txhash `.bin` files (transient). | -**[IMMUTABLE_STORAGE.TXHASH_INDEX]** +**[IMMUTABLE_STORAGE.TXHASH_INDEX]** (optional) | Key | Type | Default | Description | |-----|------|---------|-------------| @@ -108,7 +111,7 @@ The streaming daemon loads a single TOML file; backfill reads the subset documen The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-backed mutable stores owned by the streaming workflow). 
-**[BACKFILL.BSB]** — BSB / Buffered Storage Backend (optional at the daemon level; required when Phase 1 selects `BSBSource`) +**[BSB]** — Buffered Storage Backend (optional at the daemon level; required when [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) selects `BSBSource`) | Key | Type | Default | Description | |-----|------|---------|-------------| @@ -116,16 +119,16 @@ The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-back | `BUFFER_SIZE` | int | `1000` | Prefetch buffer depth per connection. | | `NUM_WORKERS` | int | `20` | Download workers per connection. | -Source selection at the daemon level (BSB vs captive core, based on `[BACKFILL.BSB]` presence) is described in [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). When the caller invokes `run_backfill(..., source=CaptiveCoreSource(...))`, this section is not used. +- `[BSB]` is effectively required when backfill runs. If absent, Phase 1 (catchup) does not invoke `run_backfill` at all — Phase 4's captive core handles initial catchup instead (see [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source)). -**[LOGGING]** +**[LOGGING]** (optional) | Key | Type | Default | Description | |-----|------|---------|-------------| | `LEVEL` | string | `"info"` | Minimum log severity. Accepted values: `debug` / `info` / `warn` / `error`. Daemon CLI flag `--log-level` wins when both are set. | | `FORMAT` | string | `"text"` | Log output format. Accepted values: `text` / `json`. Daemon CLI flag `--log-format` wins when both are set. 
| -**[META_STORE]** +**[META_STORE]** (optional) | Key | Type | Default | Description | |-----|------|---------|-------------| @@ -133,8 +136,10 @@ Source selection at the daemon level (BSB vs captive core, based on `[BACKFILL.B ### Validation Rules -- `CHUNKS_PER_TXHASH_INDEX` must not change after the first run — the daemon's `validate_config` enforces this at startup (see [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode)). -- When the caller invokes backfill with `source=BSBSource(...)`, `[BACKFILL.BSB]` must be present AND the source must cover the requested chunk range. `run_backfill`'s `validate` asserts `source.tip() >= last_ledger_in_chunk(range_end_chunk_id)` at the start; `run_backfill` then calls `source.prepare_range(first_ledger, last_ledger)` once, after which lower-bound coverage (bucket retention floor, captive core history start) is verified per-ledger via `source.get_ledger(seq)` during execution. +- `validate` checks argument sanity and defensively re-asserts `CHUNKS_PER_TXHASH_INDEX` against the meta store — the daemon's `validate_config` is the real enforcer; see [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode). +- No source probe. `run_backfill` trusts the caller's range and fires the DAG. Per-chunk idempotency means already-done chunks are no-ops; source-coverage problems surface at runtime as task failures — see [Error Handling](#error-handling). +- `[BSB]` must be configured whenever `run_backfill` is invoked. Phase 1 (catchup) only calls `run_backfill` when `[BSB]` is present. +- DAG worker cap is `GOMAXPROCS`. BSB's `NUM_WORKERS` is a per-BSB internal download pool, not a cross-task concurrency knob. ### Partial Tx Index Ranges @@ -144,7 +149,7 @@ When the caller's chunk range does not span a complete tx index, the trailing ch - Their `chunk:{chunk_id:08d}:txhash` flags set in the meta store. 
- No RecSplit `.idx` files (RecSplit is built only when every chunk of the tx index is ready). -These trailing artifacts persist on disk after `run_backfill` returns. Phase 2 of the streaming daemon loads them into the active txhash RocksDB store on startup and then deletes the `.bin` files and `chunk:{chunk_id:08d}:txhash` flags (see [02-streaming-workflow.md — Phase 2](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin)). +These trailing artifacts persist on disk after `run_backfill` returns. Phase 2 (`.bin` hydration) of the RPC service loads them into the active txhash RocksDB store on startup and then deletes the `.bin` files and `chunk:{chunk_id:08d}:txhash` flags (see [02-streaming-workflow.md — Phase 2](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin)). Ledger and events data are useful per-chunk and are not blocked by tx-index alignment — `chunk:{chunk_id:08d}:lfs` and `chunk:{chunk_id:08d}:events` flags are set as soon as each chunk's outputs are durable. @@ -153,8 +158,6 @@ Ledger and events data are useful per-chunk and are not blocked by tx-index alig ```toml [SERVICE] DEFAULT_DATA_DIR = "/data/stellar-rpc" - -[BACKFILL] CHUNKS_PER_TXHASH_INDEX = 1000 [IMMUTABLE_STORAGE.LEDGERS] @@ -169,7 +172,7 @@ PATH = "/mnt/nvme/txhash/raw" [IMMUTABLE_STORAGE.TXHASH_INDEX] PATH = "/mnt/nvme/txhash/index" -[BACKFILL.BSB] +[BSB] BUCKET_PATH = "sdf-ledger-close-meta/v1/ledgers/pubnet" [LOGGING] @@ -177,7 +180,7 @@ LEVEL = "info" FORMAT = "text" ``` -The TOML above is consumed by the streaming daemon entry point (`stellar-rpc --config ...`); backfill is invoked internally by Phase 1 with the chunk range and source it computed. +The TOML above is consumed by the RPC service entry point (`stellar-rpc --config ...`); backfill is invoked internally by [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) with the chunk range and source it computed. 
--- @@ -217,7 +220,7 @@ With geometry and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is h └── txhash/ ├── raw/ ← IMMUTABLE_STORAGE.TXHASH_RAW.PATH │ ├── 00000/ ← chunk_ids 0–999 (1_000 .bin files) - │ │ ├── 00000000.bin ← TRANSIENT (deleted after RecSplit + Phase 2) + │ │ ├── 00000000.bin ← TRANSIENT (deleted after RecSplit or by Phase 2 hydration) │ │ └── ... │ └── .../ └── index/ ← IMMUTABLE_STORAGE.TXHASH_INDEX.PATH @@ -334,22 +337,17 @@ process_chunk(chunk_id=last) ─┘ ### Main Flow -`run_backfill` is invoked by Phase 1 of the streaming daemon with an integer chunk range and a `LedgerSource`: +`run_backfill` is invoked by the daemon's [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) with an integer chunk range and a `make_bsb` partial: ```python -def run_backfill(config, range_start_chunk_id, range_end_chunk_id, source): - validate(config, range_start_chunk_id, range_end_chunk_id, source) - - # Prime source ONCE per invocation. process_chunk tasks then concurrently call - # source.get_ledger(seq) for seqs within this range. Mirrors the stellar Go SDK's - # LedgerBackend pattern (PrepareRange → GetLedger). - source.prepare_range( - first_ledger_in_chunk(range_start_chunk_id), - last_ledger_in_chunk(range_end_chunk_id), - ) - - dag = build_dag(config, range_start_chunk_id, range_end_chunk_id, source) - dag.execute(max_workers=source.max_parallelism()) +def run_backfill(config, range_start_chunk_id, range_end_chunk_id, make_bsb): + # make_bsb is a partial (e.g. functools.partial(BSBSource, config.bsb)). Each call + # returns a fresh BSBSource. Every process_chunk that needs to download ledgers + # owns its own BSB for its chunk's range — no shared-source state across tasks. 
+ validate(config, range_start_chunk_id, range_end_chunk_id) + + dag = build_dag(config, range_start_chunk_id, range_end_chunk_id, make_bsb) + dag.execute(max_workers=GOMAXPROCS) ``` ### Validation @@ -359,22 +357,20 @@ def run_backfill(config, range_start_chunk_id, range_end_chunk_id, source): - Running it first → clean abort, no partial work. ```python -def validate(config, range_start_chunk_id, range_end_chunk_id, source): +def validate(config, range_start_chunk_id, range_end_chunk_id): + # Argument sanity only. run_backfill trusts the caller's range — any source-coverage + # issue (upper or lower bound) surfaces at runtime as a per-task get_ledger failure. assert range_start_chunk_id >= 0 assert range_end_chunk_id >= range_start_chunk_id - # Upper bound only. Lower-bound availability (BSB bucket floor, captive core history - # start) surfaces as a per-task failure during execution. - assert source.tip() >= last_ledger_in_chunk(range_end_chunk_id) - # Defensive re-assert; daemon's validate_config owns the enforcement. 
- assert meta_store.get("config:chunks_per_txhash_index") == str(config.backfill.chunks_per_txhash_index) + assert meta_store.get("config:chunks_per_txhash_index") == str(config.service.chunks_per_txhash_index) ``` ### DAG Setup ```python -def build_dag(config, range_start_chunk_id, range_end_chunk_id, source): +def build_dag(config, range_start_chunk_id, range_end_chunk_id, make_bsb): dag = new DAG() # Tx indexes whose LAST chunk is in range: schedule process_chunk for in-range chunks @@ -385,7 +381,7 @@ def build_dag(config, range_start_chunk_id, range_end_chunk_id, source): for chunk_id in chunks_for_tx_index(tx_index_id, config): if not (range_start_chunk_id <= chunk_id <= range_end_chunk_id): continue - t = dag.add(ProcessChunkTask(chunk_id, source=source), deps=[]) + t = dag.add(ProcessChunkTask(chunk_id, make_bsb=make_bsb), deps=[]) chunk_tasks.append(t.id) b = dag.add(BuildTxHashIndexTask(tx_index_id), deps=chunk_tasks) dag.add(CleanupTxHashTask(tx_index_id), deps=[b.id]) @@ -393,19 +389,19 @@ def build_dag(config, range_start_chunk_id, range_end_chunk_id, source): # Trailing partial tx index (last chunk past range_end): process_chunk only; a future # iteration that covers the missing trailing chunks will schedule the build. for chunk_id in trailing_partial_tx_index_chunks(range_start_chunk_id, range_end_chunk_id, config): - dag.add(ProcessChunkTask(chunk_id, source=source), deps=[]) + dag.add(ProcessChunkTask(chunk_id, make_bsb=make_bsb), deps=[]) return dag ``` - Trailing tx index whose last chunk is past `range_end_chunk_id`: `process_chunk` scheduled for in-range chunks only; no `build_txhash_index` / `cleanup_txhash`. -- `.bin` + `chunk:{chunk_id:08d}:txhash` flags persist until a future `run_backfill` covers the missing chunks OR Phase 2 hydrates them — see [Partial Tx Index Ranges](#partial-tx-index-ranges). 
+- `.bin` + `chunk:{chunk_id:08d}:txhash` flags persist until a future `run_backfill` covers the missing chunks OR [Phase 2 (`.bin` hydration)](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin) hydrates them — see [Partial Tx Index Ranges](#partial-tx-index-ranges). --- ## Task Details -### process_chunk(chunk_id, source) +### process_chunk(chunk_id, make_bsb) - Processes a single 10_000-ledger chunk end-to-end. - Occupies one DAG worker slot. @@ -421,7 +417,7 @@ def build_dag(config, range_start_chunk_id, range_end_chunk_id, source): **Pseudocode:** ```python -def process_chunk(chunk_id, source): +def process_chunk(chunk_id, make_bsb): first_ledger = first_ledger_in_chunk(chunk_id) last_ledger = last_ledger_in_chunk(chunk_id) @@ -431,36 +427,37 @@ def process_chunk(chunk_id, source): if not (need_lfs or need_txhash or need_events): return - # If :lfs is present, read from the local packfile (NVMe, no source call). Otherwise - # use the already-prepared source. - if not need_lfs: - ledger_reader = local_packfile(ledger_pack_path(chunk_id)) + # If :lfs is already on disk, read from the local packfile — no BSB, no network. + # Otherwise instantiate a per-task BSB scoped to THIS chunk's 10_000 ledgers. 
+ if need_lfs: + ledger_reader = make_bsb() + ledger_reader.prepare_range(first_ledger, last_ledger) else: - ledger_reader = source + ledger_reader = local_packfile(ledger_pack_path(chunk_id)) - ledger_writer = packfile.create(ledger_pack_path(chunk_id), overwrite=True) if need_lfs else None - txhash_writer = open(raw_txhash_path(chunk_id), overwrite=True) if need_txhash else None + ledger_writer = packfile.create(ledger_pack_path(chunk_id), overwrite=True) if need_lfs else None + txhash_writer = open(raw_txhash_path(chunk_id), overwrite=True) if need_txhash else None events_writer = events_segment.create(events_segment_path(chunk_id), overwrite=True) if need_events else None - for ledger_seq in range(first_ledger, last_ledger + 1): - lcm = ledger_reader.get_ledger(ledger_seq) - if need_lfs: ledger_writer.append(compress(lcm)) - if need_txhash: txhash_writer.append(extract_txhashes(lcm)) # 36 bytes per tx - if need_events: events_writer.append(extract_events(lcm)) - - # Fsync + flag each output independently (flag-after-fsync). - if need_lfs: - ledger_writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") - if need_txhash: - txhash_writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:txhash", "1") - if need_events: - events_writer.finalize() - meta_store.put(f"chunk:{chunk_id:08d}:events", "1") - - if not need_lfs: - ledger_reader.close() # close local packfile handle; source stays open across tasks + try: + for ledger_seq in range(first_ledger, last_ledger + 1): + lcm = ledger_reader.get_ledger(ledger_seq) + if need_lfs: ledger_writer.append(compress(lcm)) + if need_txhash: txhash_writer.append(extract_txhashes(lcm)) # 36 bytes per tx + if need_events: events_writer.append(extract_events(lcm)) + + # Fsync + flag each output independently (flag-after-fsync). 
+ if need_lfs: + ledger_writer.fsync_and_close() + meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") + if need_txhash: + txhash_writer.fsync_and_close() + meta_store.put(f"chunk:{chunk_id:08d}:txhash", "1") + if need_events: + events_writer.finalize() + meta_store.put(f"chunk:{chunk_id:08d}:events", "1") + finally: + ledger_reader.close() # BSB: tears down the per-task instance. Local packfile: closes file handle. ``` Key properties: @@ -472,8 +469,8 @@ Key properties: - Naturally extends to new data types (add a fourth flag). **Source concurrency.** -- `source=BSBSource(...)`: `source.max_parallelism() = GOMAXPROCS`; many `process_chunk` tasks run in parallel. -- `source=CaptiveCoreSource(...)`: `source.max_parallelism() = 1`; single subprocess serializes. DAG dispatches chunks sequentially. +- Each `process_chunk` owns its own BSB instance; DAG dispatches up to `GOMAXPROCS` tasks in parallel. +- BSB's internal `NUM_WORKERS` is the per-instance download pool — not a cross-task concurrency knob. `6_000` chunks in the run means `6_000` independent BSB instances over the run's lifetime, up to `GOMAXPROCS` alive at any moment. - Interface: see [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). ### build_txhash_index(tx_index_id) @@ -580,9 +577,7 @@ def run_dag(dag, max_workers): ### Worker Pool -- Single flat pool of `max_workers` slots, set by `source.max_parallelism()`: - - `BSBSource.max_parallelism() = GOMAXPROCS`. - - `CaptiveCoreSource.max_parallelism() = 1`. +- Single flat pool of `max_workers = GOMAXPROCS` slots. - Any mix of task types can occupy slots simultaneously. - `process_chunk`: 1 slot per task. - `build_txhash_index`: 1 slot per task (uses many goroutines internally). @@ -615,16 +610,16 @@ The daemon acquires a directory flock on the meta-store at startup. 
A second pro Two layers of retry: -- **Source-internal retries.** `LedgerSource` handles transient errors (BSB connection resets, throttling, captive-core subprocess hiccups) inside a single task execution. Invisible to the DAG. +- **BSB-internal retries.** `BSBSource` handles transient errors (connection resets, throttling) inside a single task execution. Invisible to the DAG. - **Task-level retries.** DAG wraps each task's `execute()` in a retry loop bounded by `MAX_RETRIES`. - Source retries exhausted → task retries whole. - - `MAX_RETRIES` exhausted → task marked failed → DAG halts dependents → `run_backfill` returns fatal → Phase 1 propagates → daemon exits non-zero. - - Operator fixes root cause + restarts → Phase 1 re-enters → `run_backfill` re-invoked with a fresh range → completed work skipped via per-chunk idempotency. + - `MAX_RETRIES` exhausted → task marked failed → DAG halts dependents → `run_backfill` returns fatal → [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) propagates → daemon exits non-zero. + - Operator fixes root cause + restarts → Phase 1 (catchup) re-enters → `run_backfill` re-invoked with a fresh range → completed work skipped via per-chunk idempotency. 
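The task-level layer described above can be sketched as a bounded loop around a task's `execute()`. This is a minimal illustration, assuming `MAX_RETRIES` counts total attempts (as the error table's "`MAX_RETRIES` attempts" wording suggests); `run_with_retries`, `TaskFailed`, and `make_flaky` are hypothetical names introduced here, not the DAG scheduler's actual API.

```python
class TaskFailed(Exception):
    """Raised once a task has used up all of its attempts."""


def run_with_retries(execute, max_retries):
    # Hypothetical helper: call execute() up to max_retries times in total.
    last_error = None
    for _attempt in range(max_retries):
        try:
            return execute()
        except Exception as err:
            # BSB-internal retries are assumed to be exhausted by the time
            # an exception reaches this layer, so the whole task is retried.
            last_error = err
    # Attempts exhausted: the caller marks the task failed and halts dependents.
    raise TaskFailed(f"gave up after {max_retries} attempts") from last_error


def make_flaky(failures_before_success):
    # Test double standing in for a task that fails a fixed number of times.
    calls = {"count": 0}

    def execute():
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise IOError("simulated persistent BSB error")
        return calls["count"]

    return execute


# Two failures followed by a success fits within three attempts.
result = run_with_retries(make_flaky(2), max_retries=3)
print(result)
```

Because each retry re-runs the whole task, this only works when the task is idempotent, which is what the per-chunk flags provide: outputs already flagged as durable are skipped on the next attempt.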
| Error | Handled by | Action | |-------|-----------|--------| -| Source transient error (throttle, connection reset) | Source-internal retry | Retried within the task; transparent to DAG | -| Source persistent error (source retries exhausted) | Task-level retry | `MAX_RETRIES` attempts; then ABORT | +| BSB transient error (throttle, connection reset) | BSB-internal retry | Retried within the task; transparent to DAG | +| BSB persistent error (BSB retries exhausted) | Task-level retry | `MAX_RETRIES` attempts; then ABORT | | Ledger pack write / fsync failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; flag not set | | Txhash write / fsync failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; flag not set | | Events write / fsync failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; flag not set | diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 53a709778..a49d2b847 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -32,7 +32,7 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - **Leapfrog** — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. - **`phase1_coverage_end_ledger`** — the last ledger of the contiguous prefix of `chunk:{chunkId}:lfs` flags starting from the lowest chunk on disk. Phase 1 uses this to decide what's still left to ingest. Returned by the same-named function. **Not the same** as `streaming:last_committed_ledger`. 
(Prior drafts called this concept "Phase 1 low-water mark"; the term was retired because it's semantically a HIGH-water mark — the newest confirmed ledger in contiguous coverage.) - **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside the Phase 4 ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. -- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from `source.tip()`. For `BSBSource`: read from BSB's range-end metadata. For `CaptiveCoreSource`: fetched via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVE_URLS`. Different from `last_committed_ledger` (the daemon's own progress). +- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS` whenever captive core is NOT yet running (Phase 1 catchup loop, Phase 4 leapfrog-from-tip on fresh start without BSB). Once captive core is running (Phase 4 ingestion loop), the tip comes from `ledger_backend.latest_tip()` against the running subprocess — it's authoritative and cheaper than another HTTP round-trip. Different from `last_committed_ledger` (the daemon's own progress). - **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: - Ledger active store — a per-chunk RocksDB (one instance per chunk). - TxHash active store — a per-index RocksDB with 16 column families (one instance per index). @@ -63,7 +63,7 @@ Streaming reads the same TOML file as backfill, plus additional keys described b ### Shared Config (from backfill) -All of `[SERVICE]`, `[BACKFILL]`, `[IMMUTABLE_STORAGE.*]`, `[META_STORE]`, `[LOGGING]` apply unchanged. 
See [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration) for the full schema. +`[SERVICE]` (for `DEFAULT_DATA_DIR` + `CHUNKS_PER_TXHASH_INDEX`), `[BSB]`, `[IMMUTABLE_STORAGE.*]`, `[META_STORE]`, `[LOGGING]` are detailed in [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration). Streaming adds extra keys to `[SERVICE]` and introduces `[CAPTIVE_CORE]`, `[ACTIVE_STORAGE]`, `[HISTORY_ARCHIVES]` (below). ### Immutable Keys (stored in meta store, fatal if changed) @@ -74,40 +74,45 @@ Stored on first start; fatal on any subsequent start where the config value diff | `CHUNKS_PER_TXHASH_INDEX` | `config:chunks_per_txhash_index` | first run | Fatal if changed. | | `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | -- Source selection (BSB vs captive core) is determined per-startup by `[BACKFILL.BSB]` presence; not stored as immutable. +- Source selection (BSB vs captive core) is determined per-startup by `[BSB]` presence; not stored as immutable. - Operators may add or remove BSB between runs; daemon extends coverage forward from `phase1_coverage_end_ledger` regardless. - Retention immutability alone constrains the data envelope — source choice doesn't need its own gate. -### Streaming-Specific TOML +### TOML Sections Documented Here -**[STREAMING]** +**[SERVICE] — streaming additions** + +Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration) (which covers `DEFAULT_DATA_DIR` and `CHUNKS_PER_TXHASH_INDEX`). | Key | Type | Default | Description | |---|---|---|---| | `RETENTION_LEDGERS` | uint32 | `0` | `0` = full history; otherwise must be a positive multiple of `LEDGERS_PER_INDEX`. See [Validation Rules](#validation-rules). | -| `CAPTIVE_CORE_CONFIG` | string | **required** | Path to CaptiveStellarCore config file. | | `DRIFT_WARNING_LEDGERS` | uint32 | `10` | `getHealth` reports unhealthy when ingestion drift exceeds this. 
~60 seconds at 10 ledgers. | -**[STREAMING.ACTIVE_STORAGE]** +**[CAPTIVE_CORE]** | Key | Type | Default | Description | |---|---|---|---| -| `PATH` | string | `{DEFAULT_DATA_DIR}/active` | Base path for active RocksDB stores (ledger, txhash, events). | +| `CONFIG_PATH` | string | **required** | Path to CaptiveStellarCore config file. | -**[HISTORY_ARCHIVES]** +**[ACTIVE_STORAGE]** (optional) | Key | Type | Default | Description | |---|---|---|---| -| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample tip via `/.well-known/stellar-history.json` when Phase 1 uses captive core. Same key the existing ingest service reads. | +| `PATH` | string | `{DEFAULT_DATA_DIR}/active` | Base path for active RocksDB stores (ledger, txhash, events). | -**[BACKFILL.BSB]** — optional when the daemon runs +**[HISTORY_ARCHIVES]** -Same schema as in the backfill doc. Presence in the config file determines which source Phase 1 uses: +| Key | Type | Default | Description | +|---|---|---|---| +| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample tip via `/.well-known/stellar-history.json` for the Phase 4 leapfrog-from-tip computation (when `[BSB]` is absent on first-ever start). Same key the existing ingest service reads. | -- If `[BACKFILL.BSB]` is present: Phase 1 uses BSB (fast, parallel catchup from GCS). -- If `[BACKFILL.BSB]` is absent: Phase 1 uses captive core (slower, but no GCS dep). +**[BSB]** (optional) -See [Ledger Source](#ledger-source) for the full source-selection rule. +- Same schema as in the backfill doc. Presence in the config file determines Phase 1 behavior: + - Present: Phase 1 invokes backfill over the BSB (fast, parallel per-chunk catchup). + - Absent: Phase 1 is a no-op; Phase 4 captive core archive-catches-up from a leapfrog'd `resume_ledger` (slower, but no object-store dep). 
+- See [Ledger Source](#ledger-source) for the BSB-source details and [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup) for the full split. ### CLI Flags @@ -124,24 +129,30 @@ See [Ledger Source](#ledger-source) for the full source-selection rule. - `CHUNKS_PER_TXHASH_INDEX` immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). - `RETENTION_LEDGERS` immutable across runs. - `RETENTION_LEDGERS` must be `0` OR a positive integer multiple of `LEDGERS_PER_INDEX`. Valid at `cpi=1_000`: `0`, `10_000_000`, `20_000_000`, `30_000_000`, etc. Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum). Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. -- `[BACKFILL.BSB]` optional — presence determines Phase 1 source. May be added or removed between runs. +- `[BSB]` optional. When present → Phase 1 invokes backfill over the BSB; when absent → Phase 1 is a no-op and Phase 4 captive core handles initial catchup. May be added or removed between runs. +- **`[BSB]` absent AND `RETENTION_LEDGERS = 0` is fatal.** Full history requires BSB — captive-core archive-catchup from genesis would take weeks-to-months. Not a supported operating mode. - `[HISTORY_ARCHIVES].URLS` required in all profiles. -- `CAPTIVE_CORE_CONFIG` required in all profiles. +- `[CAPTIVE_CORE].CONFIG_PATH` required in all profiles. 
### Validation Pseudocode ```python def validate_config(config, meta_store): - cpi = config.backfill.chunks_per_txhash_index - retention_ledgers = config.streaming.retention_ledgers + cpi = config.service.chunks_per_txhash_index + retention_ledgers = config.service.retention_ledgers ledgers_per_index = cpi * LEDGERS_PER_CHUNK if retention_ledgers != 0 and (retention_ledgers <= 0 or (retention_ledgers % ledgers_per_index) != 0): fatal(f"RETENTION_LEDGERS={retention_ledgers} must be 0 or a positive multiple of " f"LEDGERS_PER_INDEX={ledgers_per_index}.") - if not config.streaming.captive_core_config: - fatal("STREAMING.CAPTIVE_CORE_CONFIG is required.") + if config.bsb is None and retention_ledgers == 0: + fatal("[BSB] is absent AND RETENTION_LEDGERS=0 (full history). Full history requires " + "BSB — captive-core-from-genesis is not supported. Either add [BSB] or set " + "RETENTION_LEDGERS > 0.") + + if not config.captive_core.config_path: + fatal("CAPTIVE_CORE.CONFIG_PATH is required.") if not config.history_archives.urls: fatal("HISTORY_ARCHIVES.URLS is required.") @@ -161,11 +172,12 @@ def _enforce_immutable(meta_store, key, current_value): Three profiles emerge from config combinations. No profile flag. -| Profile | `RETENTION_LEDGERS` | `[BACKFILL.BSB]` | Phase 1 source | Use case | +| Profile | `RETENTION_LEDGERS` | `[BSB]` | Phase 1 behavior | Use case | |---|---|---|---|---| -| Archive | `0` | present | BSB | Public archive node; full history. | -| Pruning-history | `N * LEDGERS_PER_INDEX`, N ≥ 1 | present | BSB | Windowed history with bulk initial catchup. | -| Tip-tracker | `N * LEDGERS_PER_INDEX`, N ≥ 1 | absent | captive core | App developer; small retention; no GCS dep. | +| Archive | `0` | present | Backfill over full history (chunks `[0, current_chunk − 1]`) | Public archive node; full history. 
| +| Pruning-history | `N × LEDGERS_PER_INDEX`, N ≥ 1 | present | Backfill over retention window (leapfrog-aligned start) | Windowed history with bulk initial catchup. | +| Tip-tracker | `N × LEDGERS_PER_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 captive core archive-catches-up from a leapfrog'd `resume_ledger` | App developer; short retention; no object-store dep. | +| (invalid) | `0` | absent | — | Rejected by `validate_config`: full history requires BSB. | --- @@ -261,80 +273,57 @@ The daemon maintains three active stores for the current ingestion position. All ## Ledger Source -- Phase 1 reads ledgers from a `LedgerSource`; two implementations share one interface. -- Source is selected per-startup by `[BACKFILL.BSB]` presence; not stored as immutable. -- Operators may add or remove BSB between runs. Retention immutability alone constrains the data envelope. -- Interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`); both implementations expose that pattern natively, so random-access reads are native and no iterator shim is needed. +- **Backfill (Phase 1) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk BSB via the `make_bsb` partial, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup). +- **Live streaming (Phase 4) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. ```python -class LedgerSource: - # Used by backfill inside Phase 1. Live streaming (Phase 4) bypasses this abstraction - # and calls ledgerBackend.PrepareRange(UnboundedRange(...)) + GetLedger(seq) directly. 
- # Contract: run_backfill calls prepare_range ONCE, then process_chunk tasks call - # get_ledger(seq) concurrently (thread-safe up to max_parallelism()). - - def tip(self) -> int: ... - def prepare_range(self, start_ledger, end_ledger) -> None: ... - def get_ledger(self, ledger_seq) -> LedgerCloseMeta: ... - def max_parallelism(self) -> int: ... - - -class BSBSource(LedgerSource): - # tip: BSB's range-end metadata. +class BSBSource: + # Used by backfill only. One instance per process_chunk task, torn down at end. + # Interface mirrors the stellar Go SDK's LedgerBackend (PrepareRange + GetLedger). # prepare_range: sets the BSB-backed LedgerBackend's range; BSB prefetch workers # (BUFFER_SIZE, NUM_WORKERS) fill buffers ahead of get_ledger. # get_ledger: SDK GetLedger(seq) reads from the prefetch buffer. - # max_parallelism: GOMAXPROCS. - ... - - -class CaptiveCoreSource(LedgerSource): - # tip: HTTP GET on /.well-known/stellar-history.json against HISTORY_ARCHIVE_URLS - # (same pattern as existing ingest service). - # prepare_range: spins up (or re-primes) captive core with BoundedRange(start, end). - # get_ledger: SDK GetLedger(seq); blocks until the subprocess emits that ledger. - # max_parallelism: 1 (single subprocess; multiple would OOM). + # close: tears down the prefetch workers + connection. ... ``` -### Source Selection Rule +### Make BSB Partial ```python -def select_phase1_ledger_source(config): - if config.backfill.bsb is not None: - return BSBSource(config.backfill.bsb) - return CaptiveCoreSource(config.streaming.captive_core_config, - config.history_archives.urls) +def make_bsb_partial(config): + # Returns a partial that each process_chunk calls to get a fresh BSBSource. + # None means Phase 1 is a no-op; Phase 4 captive core handles catchup. 
+ if config.bsb is None: + return None + return functools.partial(BSBSource, config.bsb) ``` -### Retention Semantics Under Captive Core - -When Phase 1 uses captive core, `RETENTION_LEDGERS` directly determines how many ledgers captive core must archive-catchup on first start: - -- `RETENTION_LEDGERS = 10_000_000` at `cpi=1_000`: captive core archive-catches-up ~10M ledgers (hours to days). -- `RETENTION_LEDGERS = 10_000` at `cpi=1`: captive core archive-catches-up ~10K ledgers (~3–8 min). - -This is the main reason tip-tracker operators default to `cpi=1`: at cpi=1 a full index is 10K ledgers, so retention can be set small without violating the "multiple of LEDGERS_PER_INDEX" rule. - --- ## Startup Sequence Four sequential phases, same code path for first start and every restart. The first three are bounded bootstrap work; Phase 4 is the long-running state the daemon stays in until process exit. -- **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip by invoking the backfill subroutine in a loop. +- **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip **when `[BSB]` is configured**, by invoking the backfill subroutine in a loop. Without `[BSB]`, Phase 1 is a no-op and Phase 4's captive core handles initial catchup naturally via its own `PrepareRange(UnboundedRange(resume_ledger))`. - **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 left (for the trailing partial index) into the active txhash store, then deletes them. - **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. Truncates events hot segment beyond the last committed ledger. - **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle goroutine, flips the `daemon_ready` flag, enters the ingestion loop. Runs until process exit. "Phase" here refers to the startup ordering only. 
Once Phase 4 is entered, there's no Phase 5 — the daemon is in live-streaming steady state. +### Backfill vs Phase 1 + +- **Backfill** is the subroutine (`run_backfill` in [01-backfill-workflow.md](./01-backfill-workflow.md)). BSB-only, runs parallel per-chunk BSB instances. Captive core cannot be a backfill source — its subprocess is serial and expensive to spin up per instantiation. +- **Phase 1 (catchup)** is a startup phase that runs on every daemon start. Its job: close the gap between on-disk state and current network tip before Phase 4 takes over. +- Phase 1 invokes backfill as its mechanism — but only when `[BSB]` is configured. Without `[BSB]`, Phase 1 is a no-op and Phase 4's captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))` as part of its own startup. +- So: "backfill" and "Phase 1" overlap because Phase 1's whole purpose is "invoke backfill when BSB is configured". + ```python def run_streaming_daemon(config): meta_store = open_meta_store(config) validate_config(config, meta_store) - source = select_phase1_ledger_source(config) - phase1_catchup(config, meta_store, source) + make_bsb = make_bsb_partial(config) # None if [BSB] absent + phase1_catchup(config, meta_store, make_bsb) phase2_hydrate_txhash(config, meta_store) phase3_reconcile_orphans(config, meta_store) phase4_live_ingest(config, meta_store) @@ -344,31 +333,32 @@ Query serving is gated on Phase 4 being reached — see [Query Contract](#query- ### Phase 1 — Catchup -Runs the backfill subroutine (`run_backfill` from `01-backfill-workflow.md`) once per source-tip sample, until the gap closes to less than one chunk. - -- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. -- Every chunk Phase 1 persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. 
-- Works the same with BSB (parallel) or captive core (sequential); per-chunk work is atomic in both. +- **No-op path:** if `make_bsb is None` (no `[BSB]` configured), Phase 1 returns immediately. Phase 4's captive core will catch up from a leapfrog'd resume ledger. +- **BSB path:** runs the backfill subroutine (`run_backfill` from [01-backfill-workflow.md](./01-backfill-workflow.md)) once per source-tip sample, until the gap closes to less than one chunk. +- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. ```python -def phase1_catchup(config, meta_store, source): - cpi = config.backfill.chunks_per_txhash_index - retention_ledgers = config.streaming.retention_ledgers +def phase1_catchup(config, meta_store, make_bsb): + if make_bsb is None: + return # No [BSB]; Phase 4's captive core handles catchup. + + cpi = config.service.chunks_per_txhash_index + retention_ledgers = config.service.retention_ledgers last_committed_ledger = phase1_coverage_end_ledger(meta_store) # Loop because tip advances during catchup; each iteration closes whatever's # accumulated since the previous sample. while True: - network_tip_ledger = source.tip() + network_tip_ledger = sample_network_tip(config.history_archives.urls) if (network_tip_ledger - last_committed_ledger) < LEDGERS_PER_CHUNK: - break # remaining gap < 1 chunk; captive core closes it in Phase 4 + break # remaining gap < 1 chunk; Phase 4's captive core closes it. 
range_start_chunk_id, range_end_chunk_id = compute_backfill_chunk_range( last_committed_ledger, network_tip_ledger, retention_ledgers, cpi) if range_end_chunk_id < range_start_chunk_id: break # leapfrog landed past last complete chunk — nothing to ingest yet - run_backfill(config, range_start_chunk_id, range_end_chunk_id, source=source) + run_backfill(config, range_start_chunk_id, range_end_chunk_id, make_bsb) # Re-derive from :lfs flags (not from range_end_chunk_id): a mid-iteration # crash can leave holes that the contiguous-prefix scan detects. @@ -416,11 +406,9 @@ def phase1_coverage_end_ledger(meta_store): return last_ledger_in_chunk(chunk_id - 1) ``` -**Worker concurrency:** `run_backfill` honors `source.max_parallelism()` — GOMAXPROCS for BSB, 1 for captive core (single subprocess, can't parallelize). +**Worker concurrency:** `run_backfill` caps DAG concurrency at `GOMAXPROCS`. Each `process_chunk` owns its own BSB instance (`make_bsb()`), prepares range for its 10_000 ledgers, reads, and tears down — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunkchunk_id-make_bsb). -**Retention semantics by source:** -- BSB: retention determines Phase 1 range; catchup time ≈ `RETENTION_LEDGERS / (BSB throughput)`. -- Captive core: retention determines both Phase 1 range AND captive-core archive-catchup scope — size retention against the wall-clock cost. +**Retention effect:** retention determines Phase 1's chunk range. Catchup time ≈ `retention_window / (BSB throughput)`. ### Phase 2 — Hydrate TxHash Data from `.bin` @@ -430,7 +418,7 @@ def phase1_coverage_end_ledger(meta_store): ```python def phase2_hydrate_txhash(config, meta_store): - cpi = config.backfill.chunks_per_txhash_index + cpi = config.service.chunks_per_txhash_index # Step 1: sweep leftover .bin for tx indexes already flagged complete — backfill # may have set index:N:txhash before cleanup_txhash finished on crash. 
@@ -482,7 +470,7 @@ Completes any in-flight transitions left by a prior crash. All decisions derive def phase3_reconcile_orphans(config, meta_store): # resume_ledger derivation must match phase4_live_ingest exactly — if they disagree, # Phase 4 opens a fresh store while Phase 3's preserved store becomes an orphan. - cpi = config.backfill.chunks_per_txhash_index + cpi = config.service.chunks_per_txhash_index last_committed_ledger = meta_store.get("streaming:last_committed_ledger") if last_committed_ledger is None: last_committed_ledger = phase1_coverage_end_ledger(meta_store) @@ -525,28 +513,52 @@ Opens active stores for the resume position, spawns the lifecycle goroutine, sta def phase4_live_ingest(config, meta_store): last_committed_ledger = meta_store.get("streaming:last_committed_ledger") if last_committed_ledger is None: - # First start after Phase 1. - last_committed_ledger = phase1_coverage_end_ledger(meta_store) + # First start. + coverage_end = phase1_coverage_end_ledger(meta_store) + if coverage_end > GENESIS_LEDGER - 1: + # Phase 1 backfilled something (BSB-configured profile). Resume from the end + # of the contiguous :lfs prefix. + last_committed_ledger = coverage_end + else: + # Alice's path: [BSB] absent → Phase 1 was a no-op. Leapfrog DOWN from + # current tip to an index boundary so the first on-disk chunk will be + # complete. Captive core archive-catches-up from there. 
+ last_committed_ledger = leapfrog_resume_from_tip(config) - 1 meta_store.put("streaming:last_committed_ledger", last_committed_ledger) resume_ledger = last_committed_ledger + 1 active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) run_in_background(run_prune_lifecycle_loop, config, meta_store) - ledger_backend = make_ledger_backend(config.streaming.captive_core_config) + ledger_backend = make_ledger_backend(config.captive_core.config_path) ledger_backend.PrepareRange(UnboundedRange(resume_ledger)) set_daemon_ready() # in-memory; unblocks queries run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) +def leapfrog_resume_from_tip(config): + # Fresh-start, no BSB, no on-disk chunks. Choose the first ledger of the tx index + # containing (tip - RETENTION_LEDGERS). Captive core will archive-catch-up from here. + # Enforced elsewhere: validate_config fatals when [BSB] is absent AND retention_ledgers == 0 + # (captive-core-from-genesis is not a supported operating mode). + network_tip_ledger = sample_network_tip(config.history_archives.urls) + retention_ledgers = config.service.retention_ledgers + cpi = config.service.chunks_per_txhash_index + + target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) + target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + target_tx_index_id = target_chunk_id // cpi + return (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER + + def open_active_stores_for_resume(config, meta_store, resume_ledger): # Open/WAL-recover the current store for each of ledger/events/txhash AND pre-create # the "next" stores so the first boundary rollover is a pointer swap. # Events hot segment replays persisted deltas from disk; safe because Phase 3 already # truncated anything past last_committed_ledger. 
resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - resume_tx_index_id = resume_chunk_id // config.backfill.chunks_per_txhash_index + resume_tx_index_id = resume_chunk_id // config.service.chunks_per_txhash_index return ActiveStores( ledger = open_or_create_ledger_store(config, resume_chunk_id), @@ -568,7 +580,7 @@ Single goroutine. Pull-based: the daemon drives sequential `GetLedger(seq)` call ```python def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): - cpi = config.backfill.chunks_per_txhash_index # immutable; read once + cpi = config.service.chunks_per_txhash_index # immutable; read once ledger_seq = resume_ledger while True: lcm = ledger_backend.GetLedger(ledger_seq) # blocks until available @@ -651,7 +663,7 @@ def precreate_next_boundary_stores(active_stores, meta_store, target_chunk_id): # Idempotent — safe to re-run when target stores already exist. active_stores.ledger_next = open_or_create_ledger_store(config, target_chunk_id) active_stores.events_next = open_or_create_events_hot_segment(config, meta_store, target_chunk_id, None) - cpi = config.backfill.chunks_per_txhash_index + cpi = config.service.chunks_per_txhash_index target_tx_index_id = target_chunk_id // cpi if target_tx_index_id != tx_index_id_of_chunk(target_chunk_id - 1): active_stores.txhash_next = open_or_create_txhash_store(config, target_tx_index_id) @@ -751,8 +763,8 @@ def run_prune_lifecycle_loop(config, meta_store): # Initial scan at entry catches any `"deleting"` state left by a prior crashed prune; # without it, a crashed prune could sit unserviced until the next chunk boundary # (up to ~16 h at cpi=1). Subsequent sweeps fire on chunk-boundary notifications. 
- cpi = config.backfill.chunks_per_txhash_index - retention_ledgers = config.streaming.retention_ledgers + cpi = config.service.chunks_per_txhash_index + retention_ledgers = config.service.retention_ledgers _run_prune_sweep(meta_store, retention_ledgers, cpi, config) while True: @@ -797,7 +809,7 @@ def prune_tx_index(tx_index_id, meta_store, config): # Two-phase marker for query-routing safety: set "deleting" BEFORE any file delete; # clear the key AFTER. Queries short-circuit on "deleting" (treated as absent). # Idempotent on crash-between-stages retry. - cpi = config.backfill.chunks_per_txhash_index + cpi = config.service.chunks_per_txhash_index meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting") @@ -875,7 +887,7 @@ Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workf - **Crash between per-ledger checkpoint and LFS freeze completion.** - State: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)`; `chunk:{chunk_id}:lfs` absent. - - Phase 1 on restart: `:lfs` missing → re-runs `process_chunk(chunk_id)` against source (idempotent per artifact). + - Phase 1 on restart (assumes `[BSB]` configured): `:lfs` missing → re-runs `process_chunk(chunk_id)` with a fresh per-task BSB (idempotent per artifact). - Phase 3 then: active ledger store present + `:lfs` now set → deletes the orphaned store. - Cost: ~10_000 ledgers of redundant ingestion per affected chunk. Correctness preserved. From 1b4c28ac1bd5bc14ce6b8f9318e948d41790058b Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Thu, 23 Apr 2026 02:34:09 -0700 Subject: [PATCH 21/34] Design docs: sync to source-of-truth (compute_resume_ledger + shared resume_ledger) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Following the grill-me session's D-P1/D-P4/D-P7/D-P12/D-P13 updates in the source-of-truth, propagate to the in-repo docs: Phase 1 (catchup) is pure side-effect. 
phase1_catchup returns nothing; loop state is a local last_scheduled_end_chunk instead of a meta-store scan. compute_backfill_chunk_range replaced by a simpler leapfrog_start_chunk helper. phase1_coverage_end_ledger function deleted entirely. compute_resume_ledger is a new shared helper. Runs once per daemon start between Phase 2 (.bin hydration) and Phase 3 (reconcile). Scans every startup (even when streaming:last_committed_ledger is present — the scan doubles as a consistency check). Derivation table: - streaming:last_committed_ledger present → value + 1 - absent + contiguous :lfs chunks → last_ledger_in_chunk(end) + 1 - absent + no chunks → leapfrog_resume_from_tip Validation rules — any violation is fatal ("migration to streaming failed"): - no internal gap in :lfs coverage - start chunk aligns to a tx-index boundary - every :lfs chunk also has :events - every complete tx index in the range has index:txhash - streaming:last_committed_ledger consistent with scan's highest :lfs Phase 3 (reconcile) and Phase 4 (live ingestion) now accept resume_ledger as a parameter rather than re-deriving it — one computation, one source of truth for the handoff point. Phase 4 (live ingestion) no longer writes streaming:last_committed_ledger at bootstrap; invariant #9 preserved (the key is only written by the live ingestion loop on durable commit). Terminology: phase1_coverage_end_ledger entry retired; compute_resume_ledger added. Meta-store-key description for streaming:last_committed_ledger updated to reflect the no-bootstrap-write rule. Style sweep: every "Phase N" mention now carries its parenthetical tag (catchup / .bin hydration / reconcile / live ingestion). 
--- .../design-docs/01-backfill-workflow.md | 43 +-- .../design-docs/02-streaming-workflow.md | 351 ++++++++++-------- 2 files changed, 218 insertions(+), 176 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index eb65c47de..6a624b546 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -11,7 +11,7 @@ - Writes directly to immutable file formats — no RocksDB active stores. - Schedules work as a DAG of idempotent tasks dispatched via a flat worker pool (`GOMAXPROCS` concurrency). - Returns when every chunk in the range is complete; on crash, Phase 1 (catchup) re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency. -- **BSB-only.** Backfill does not use captive core as a ledger source. Captive core belongs to Phase 4 (live streaming); if BSB isn't configured, backfill is not invoked at all and Phase 4's captive core catches up from a leapfrog'd resume ledger as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion). +- **BSB-only.** Backfill does not use captive core as a ledger source. Captive core belongs to Phase 4 (live ingestion); if BSB isn't configured, backfill is not invoked at all and Phase 4 (live ingestion)'s captive core catches up from a leapfrog'd resume ledger as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion). 
**What it produces:** @@ -21,7 +21,7 @@ | `getTransaction` | Tx-index files | Per tx index (default 10_000_000 ledgers) | | `getEvents` | [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | Per chunk | -For the distinction between *backfill (this subroutine)* and *Phase 1 (the startup phase that invokes it)* — two terms that get conflated because their scopes overlap — see [02-streaming-workflow.md — Backfill vs Phase 1](./02-streaming-workflow.md#backfill-vs-phase-1). +For the distinction between *backfill (this subroutine)* and *Phase 1 (catchup) (the startup phase that invokes it)* — two terms that get conflated because their scopes overlap — see [02-streaming-workflow.md — Backfill vs Phase 1 (catchup)](./02-streaming-workflow.md#backfill-vs-phase-1-catchup). --- @@ -134,25 +134,6 @@ The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-back |-----|------|---------|-------------| | `PATH` | string | `{DEFAULT_DATA_DIR}/meta/rocksdb` | Meta store RocksDB directory. | -### Validation Rules - -- `validate` checks argument sanity and defensively re-asserts `CHUNKS_PER_TXHASH_INDEX` against the meta store — the daemon's `validate_config` is the real enforcer; see [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode). -- No source probe. `run_backfill` trusts the caller's range and fires the DAG. Per-chunk idempotency means already-done chunks are no-ops; source-coverage problems surface at runtime as task failures — see [Error Handling](#error-handling). -- `[BSB]` must be configured whenever `run_backfill` is invoked. Phase 1 (catchup) only calls `run_backfill` when `[BSB]` is present. -- DAG worker cap is `GOMAXPROCS`. BSB's `NUM_WORKERS` is a per-BSB internal download pool, not a cross-task concurrency knob. 
- -### Partial Tx Index Ranges - -When the caller's chunk range does not span a complete tx index, the trailing chunks have: - -- Their raw `.bin` files on disk (inside `IMMUTABLE_STORAGE.TXHASH_RAW.PATH`). -- Their `chunk:{chunk_id:08d}:txhash` flags set in the meta store. -- No RecSplit `.idx` files (RecSplit is built only when every chunk of the tx index is ready). - -These trailing artifacts persist on disk after `run_backfill` returns. Phase 2 (`.bin` hydration) of the RPC service loads them into the active txhash RocksDB store on startup and then deletes the `.bin` files and `chunk:{chunk_id:08d}:txhash` flags (see [02-streaming-workflow.md — Phase 2](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin)). - -Ledger and events data are useful per-chunk and are not blocked by tx-index alignment — `chunk:{chunk_id:08d}:lfs` and `chunk:{chunk_id:08d}:events` flags are set as soon as each chunk's outputs are durable. - ### Example TOML ```toml @@ -291,6 +272,26 @@ index:00000000:txhash → "1" tx_index_id=0 RecSplit complete index:00000001:txhash → absent tx_index_id=1 not yet built ``` +### Validation Rules + +- `validate` checks argument sanity and defensively re-asserts `CHUNKS_PER_TXHASH_INDEX` against the meta store — the daemon's `validate_config` is the real enforcer; see [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode). +- No source probe. `run_backfill` trusts the caller's range and fires the DAG. Per-chunk idempotency means already-done chunks are no-ops; source-coverage problems surface at runtime as task failures — see [Error Handling](#error-handling). +- `[BSB]` must be configured whenever `run_backfill` is invoked. Phase 1 (catchup) only calls `run_backfill` when `[BSB]` is present. +- DAG worker cap is `GOMAXPROCS`. BSB's `NUM_WORKERS` is a per-BSB internal download pool, not a cross-task concurrency knob. 
+ +### Partial Tx Index Ranges + +When the caller's chunk range does not span a complete tx index, the trailing chunks have: + +- Their raw `.bin` files on disk (inside `IMMUTABLE_STORAGE.TXHASH_RAW.PATH`). +- Their `chunk:{chunk_id:08d}:txhash` flags set in the meta store. +- No RecSplit `.idx` files (RecSplit is built only when every chunk of the tx index is ready). + +These trailing artifacts persist on disk after `run_backfill` returns. Phase 2 (`.bin` hydration) of the RPC service loads them into the active txhash RocksDB store on startup and then deletes the `.bin` files and `chunk:{chunk_id:08d}:txhash` flags (see [02-streaming-workflow.md — Phase 2](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin)). + +Ledger and events data are useful per-chunk and are not blocked by tx-index alignment — `chunk:{chunk_id:08d}:lfs` and `chunk:{chunk_id:08d}:events` flags are set as soon as each chunk's outputs are durable. + + ### Key Lifecycle ``` diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index a49d2b847..ed7f6e899 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -7,7 +7,7 @@ The stellar-rpc daemon is the full-history RPC service. One binary, one invocati - Operator runs `stellar-rpc --config path/to/config.toml`. No subcommand. No `--mode` flag. No behavior-switching flags. - On every start the daemon runs four sequential startup phases, then enters a live ingestion loop it stays in until killed. - Behavior across the three operator profiles (archive, pruning-history, tip-tracker) is determined entirely by TOML config — no profile flag. -- Backfill (`01-backfill-workflow.md`) is used as an internal subroutine by Startup Phase 1. Operators never invoke backfill directly. +- Backfill (`01-backfill-workflow.md`) is used as an internal subroutine by Startup Phase 1 (catchup). 
Operators never invoke backfill directly. **What the daemon does end-to-end:** - Validates config against immutable meta-store state (`CHUNKS_PER_TXHASH_INDEX` and `RETENTION_LEDGERS`). @@ -26,13 +26,13 @@ The stellar-rpc daemon is the full-history RPC service. One binary, one invocati Terms used repeatedly throughout this doc. Skim on first read, refer back when a term surfaces later. - **Daemon** — the stellar-rpc binary running as one long-lived process. The only operator-facing entry point. -- **Startup phases 1–4** — sequential bootstrap work the daemon runs once per process start, before serving queries. Not a lifecycle concept — once Phase 4 is reached, it stays there until the process exits. [Details](#startup-sequence). -- **Phase 1 catchup** — the startup phase that closes the gap between the last-committed ledger and the current network tip. Invokes the backfill subroutine internally. +- **Startup phases 1–4** — sequential bootstrap work the daemon runs once per process start, before serving queries. Not a lifecycle concept — once Phase 4 (live ingestion) is reached, it stays there until the process exits. [Details](#startup-sequence). +- **Phase 1 (catchup)** — the startup phase that closes the gap between the last-committed ledger and the current network tip. Invokes the backfill subroutine internally. - **Backfill (subroutine)** — a self-contained mechanism that ingests a known `[range_start, range_end]` chunk range via a static DAG of per-chunk tasks (`process_chunk`, `build_txhash_index`, `cleanup_txhash`). Specified in `01-backfill-workflow.md`. In the unified design, backfill is an internal callable only — no CLI entry point exists. -- **Leapfrog** — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. 
Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. -- **`phase1_coverage_end_ledger`** — the last ledger of the contiguous prefix of `chunk:{chunkId}:lfs` flags starting from the lowest chunk on disk. Phase 1 uses this to decide what's still left to ingest. Returned by the same-named function. **Not the same** as `streaming:last_committed_ledger`. (Prior drafts called this concept "Phase 1 low-water mark"; the term was retired because it's semantically a HIGH-water mark — the newest confirmed ledger in contiguous coverage.) -- **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside the Phase 4 ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. -- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS` whenever captive core is NOT yet running (Phase 1 catchup loop, Phase 4 leapfrog-from-tip on fresh start without BSB). Once captive core is running (Phase 4 ingestion loop), the tip comes from `ledger_backend.latest_tip()` against the running subprocess — it's authoritative and cheaper than another HTTP round-trip. Different from `last_committed_ledger` (the daemon's own progress). +- **Leapfrog** — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 (catchup) skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. 
+- **`compute_resume_ledger`** — shared helper called once per daemon start, between Phase 2 (`.bin` hydration) and Phase 3 (reconcile). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger`. Result is consumed by Phase 3 (reconcile) and Phase 4 (live ingestion) — they never derive `resume_ledger` independently. See [Compute Resume Ledger](#compute-resume-ledger). +- **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside Phase 4 (live ingestion)'s ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. +- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS` whenever captive core is NOT yet running (Phase 1 (catchup) loop, Phase 4 (live ingestion) leapfrog-from-tip on fresh start without BSB). Once captive core is running (inside Phase 4 (live ingestion)), the tip comes from `ledger_backend.latest_tip()` against the running subprocess — authoritative and cheaper than another HTTP round-trip. Different from `last_committed_ledger` (the daemon's own progress). - **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: - Ledger active store — a per-chunk RocksDB (one instance per chunk). - TxHash active store — a per-index RocksDB with 16 column families (one instance per index). @@ -46,8 +46,8 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - **Txhash index** (a.k.a. "tx index", "index") — `CHUNKS_PER_TXHASH_INDEX` consecutive chunks. Atomic unit of retention pruning. Formulas in [Geometry](#geometry). Both docs use "tx index" as the dominant narrative form; "txhash index" appears where the output's role as a txhash lookup is the emphasis. 
- **Chunk boundary** — the moment ingestion commits the last ledger of a chunk. Triggers background LFS + events freeze for that chunk. - **Index boundary** — the moment ingestion commits the last ledger of an index. Triggers background RecSplit build for that index. Every index boundary is also a chunk boundary. -- **Catchup** — synonym for "close the gap between last-committed ledger and current tip". Performed inside Phase 1. -- **`.bin` file** — a backfill-produced raw txhash flat file (transient). Exists only for chunks the backfill subroutine has flagged `:txhash` but whose containing index has not yet had its RecSplit built. Deleted by Phase 2 once loaded into the active txhash RocksDB. Streaming's live path never produces `.bin` files. +- **Catchup** — synonym for "close the gap between last-committed ledger and current tip". Performed inside Phase 1 (catchup). +- **`.bin` file** — a backfill-produced raw txhash flat file (transient). Exists only for chunks the backfill subroutine has flagged `:txhash` but whose containing index has not yet had its RecSplit built. Deleted by Phase 2 (`.bin` hydration) once loaded into the active txhash RocksDB. Streaming's live path never produces `.bin` files. --- @@ -75,7 +75,7 @@ Stored on first start; fatal on any subsequent start where the config value diff | `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | - Source selection (BSB vs captive core) is determined per-startup by `[BSB]` presence; not stored as immutable. -- Operators may add or remove BSB between runs; daemon extends coverage forward from `phase1_coverage_end_ledger` regardless. +- Operators may add or remove BSB between runs; on each start, Phase 1 (catchup) either re-runs backfill from the retention-aligned start (BSB present) or no-ops (BSB absent). `compute_resume_ledger` then derives resume from whatever chunks are on disk. 
- Retention immutability alone constrains the data envelope — source choice doesn't need its own gate. ### TOML Sections Documented Here @@ -105,13 +105,13 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 | Key | Type | Default | Description | |---|---|---|---| -| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample tip via `/.well-known/stellar-history.json` for the Phase 4 leapfrog-from-tip computation (when `[BSB]` is absent on first-ever start). Same key the existing ingest service reads. | +| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample tip via `/.well-known/stellar-history.json` for Phase 4 (live ingestion)'s leapfrog-from-tip computation (when `[BSB]` is absent on first-ever start). Same key the existing ingest service reads. | **[BSB]** (optional) -- Same schema as in the backfill doc. Presence in the config file determines Phase 1 behavior: - - Present: Phase 1 invokes backfill over the BSB (fast, parallel per-chunk catchup). - - Absent: Phase 1 is a no-op; Phase 4 captive core archive-catches-up from a leapfrog'd `resume_ledger` (slower, but no object-store dep). +- Same schema as in the backfill doc. Presence in the config file determines Phase 1 (catchup) behavior: + - Present: Phase 1 (catchup) invokes backfill over the BSB (fast, parallel per-chunk catchup). + - Absent: Phase 1 (catchup) is a no-op; Phase 4 (live ingestion)'s captive core archive-catches-up from a leapfrog'd `resume_ledger` (slower, but no object-store dep). - See [Ledger Source](#ledger-source) for the BSB-source details and [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup) for the full split. 
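As a rough illustration of how `[BSB]` presence selects the Phase 1 (catchup) path, the sketch below folds in the one combination the doc treats as fatal (`[BSB]` absent with `RETENTION_LEDGERS = 0`). The function name `phase1_plan` and its string return values are hypothetical, chosen only for this example; the real daemon passes a `make_bsb` partial (or `None`) rather than a label.

```python
def phase1_plan(bsb_configured: bool, retention_ledgers: int) -> str:
    """Return which path Phase 1 (catchup) would take for a given config.

    Hypothetical helper for illustration; not part of the daemon.
    """
    if not bsb_configured and retention_ledgers == 0:
        # Documented as fatal: full history requires BSB, because a
        # captive-core archive catchup from genesis is not supported.
        raise ValueError("[BSB] absent with RETENTION_LEDGERS = 0 is not supported")
    # [BSB] present: backfill runs over the BSB.
    # [BSB] absent: Phase 1 is a no-op and captive core catches up in Phase 4.
    return "backfill" if bsb_configured else "no-op"


if __name__ == "__main__":
    for bsb, retention in [(True, 0), (True, 10_000_000), (False, 10_000_000)]:
        print(bsb, retention, phase1_plan(bsb, retention))
```

The point of returning a label here is only to make the three valid combinations easy to eyeball; removing or adding `[BSB]` between runs changes the result without touching any stored state.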
### CLI Flags @@ -129,7 +129,7 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 - `CHUNKS_PER_TXHASH_INDEX` immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). - `RETENTION_LEDGERS` immutable across runs. - `RETENTION_LEDGERS` must be `0` OR a positive integer multiple of `LEDGERS_PER_INDEX`. Valid at `cpi=1_000`: `0`, `10_000_000`, `20_000_000`, `30_000_000`, etc. Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum). Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. -- `[BSB]` optional. When present → Phase 1 invokes backfill over the BSB; when absent → Phase 1 is a no-op and Phase 4 captive core handles initial catchup. May be added or removed between runs. +- `[BSB]` optional. When present → Phase 1 (catchup) invokes backfill over the BSB; when absent → Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles initial catchup. May be added or removed between runs. - **`[BSB]` absent AND `RETENTION_LEDGERS = 0` is fatal.** Full history requires BSB — captive-core archive-catchup from genesis would take weeks-to-months. Not a supported operating mode. - `[HISTORY_ARCHIVES].URLS` required in all profiles. - `[CAPTIVE_CORE].CONFIG_PATH` required in all profiles. @@ -176,7 +176,7 @@ Three profiles emerge from config combinations. No profile flag. |---|---|---|---|---| | Archive | `0` | present | Backfill over full history (chunks `[0, current_chunk − 1]`) | Public archive node; full history. | | Pruning-history | `N × LEDGERS_PER_INDEX`, N ≥ 1 | present | Backfill over retention window (leapfrog-aligned start) | Windowed history with bulk initial catchup. 
| -| Tip-tracker | `N × LEDGERS_PER_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 captive core archive-catches-up from a leapfrog'd `resume_ledger` | App developer; short retention; no object-store dep. | +| Tip-tracker | `N × LEDGERS_PER_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 (live ingestion)'s captive core archive-catches-up from a leapfrog'd `resume_ledger` | App developer; short retention; no object-store dep. | | (invalid) | `0` | absent | — | Rejected by `validate_config`: full history requires BSB. | --- @@ -189,7 +189,7 @@ Single RocksDB instance, WAL always enabled. Authoritative source for every star | Key | Value | Written when | |---|---|---| -| `streaming:last_committed_ledger` | uint32 (big-endian) | First written at top of Phase 4 to `phase1_coverage_end_ledger(meta_store)` (the end of the contiguous `:lfs` prefix — already a ledger sequence); subsequently after every committed live ledger. **Not updated during Phases 1–3.** Phase 1 progress is tracked by `chunk:{chunk_id}:lfs` flags alone. | +| `streaming:last_committed_ledger` | uint32 (big-endian) | Written only by the live ingestion loop after all three active stores durably commit a ledger. **Never written at bootstrap.** When absent, [`compute_resume_ledger`](#compute-resume-ledger) derives resume from the contiguous `:lfs` prefix (first-ever post-Phase-1) or by leapfrogging down from the current network tip to an index boundary (tip-tracker fresh start). Phase 1 (catchup) progress is tracked by `chunk:{chunk_id}:lfs` flags alone. | | `config:retention_ledgers` | decimal string | First run (stored); enforced on subsequent starts. | ### Keys Shared with Backfill @@ -199,24 +199,24 @@ Single RocksDB instance, WAL always enabled. Authoritative source for every star | `config:chunks_per_txhash_index` | Set on first run by whichever invocation runs first — here, first daemon start. | | `chunk:{chunk_id:08d}:lfs` | Set after ledger pack file fsync. 
| | `chunk:{chunk_id:08d}:events` | Set after events cold segment fsync. | -| `chunk:{chunk_id:08d}:txhash` | Set by backfill subroutine after `.bin` fsync; deleted during Phase 2 hydration after `.bin` is loaded into RocksDB. Streaming live path does not write this key — streaming writes txhash directly to the active RocksDB txhash store. | +| `chunk:{chunk_id:08d}:txhash` | Set by backfill subroutine after `.bin` fsync; deleted during Phase 2 (`.bin` hydration) after `.bin` is loaded into RocksDB. Streaming live path does not write this key — streaming writes txhash directly to the active RocksDB txhash store. | | `index:{tx_index_id:08d}:txhash` | `"1"` after all 16 RecSplit CF `.idx` files built and fsynced. Transitions to `"deleting"` at the start of `prune_tx_index`, deleted entirely when prune completes. Query routing treats `"deleting"` the same as absent. | ### Key Lifecycle in Streaming ``` -Phase 1 (backfill subroutine): +Phase 1 (catchup): chunk:{chunk_id}:lfs = "1" (after pack fsync) chunk:{chunk_id}:txhash = "1" (after .bin fsync) # only present for chunks that still have .bin on disk chunk:{chunk_id}:events = "1" (after cold segment fsync) - index:{tx_index_id}:txhash = "1" (after RecSplit, when all chunks of tx_index_id are done in Phase 1) + index:{tx_index_id}:txhash = "1" (after RecSplit, when all chunks of tx_index_id are done in Phase 1 (catchup)) Phase 2 (.bin hydration — see Startup Sequence): For every chunk with :txhash flag and a .bin file: load .bin into RocksDB txhash store delete chunk:{chunk_id}:txhash flag delete .bin file - After Phase 2, no chunk:{chunk_id}:txhash flags and no .bin files remain. + After Phase 2 (.bin hydration), no chunk:{chunk_id}:txhash flags and no .bin files remain. Live path (per ledger): streaming:last_committed_ledger = ledger_seq (after all 3 active stores commit) @@ -259,7 +259,7 @@ The daemon maintains three active stores for the current ingestion position. 
All - The store for the next chunk / index is pre-created before the boundary is reached, so boundary-time work is a pointer swap only. - Creation timing: when the ingestion loop commits a ledger within a configurable window before the boundary (e.g., `last_ledger_in_chunk(chunk_id) - 1_000`). The window must be large enough that store initialization (directory mkdir + RocksDB open + column family setup) completes before the boundary ledger arrives, and small enough that pre-creation doesn't run prematurely for chunks the daemon may never reach. -- On restart, a pre-created store is expected to exist — Phase 3 treats `resume_chunk + 1` (and `resume_index + 1`) as active, not an orphan. +- On restart, a pre-created store is expected to exist — Phase 3 (reconcile) treats `resume_chunk + 1` (and `resume_index + 1`) as active, not an orphan. ### Max Concurrent Stores @@ -273,8 +273,8 @@ The daemon maintains three active stores for the current ingestion position. All ## Ledger Source -- **Backfill (Phase 1) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk BSB via the `make_bsb` partial, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup). -- **Live streaming (Phase 4) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. +- **Backfill (Phase 1 (catchup)) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk BSB via the `make_bsb` partial, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup). 
+- **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 (live ingestion) calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. ```python class BSBSource: @@ -292,7 +292,7 @@ class BSBSource: ```python def make_bsb_partial(config): # Returns a partial that each process_chunk calls to get a fresh BSBSource. - # None means Phase 1 is a no-op; Phase 4 captive core handles catchup. + # None means Phase 1 (catchup) is a no-op; Phase 4 (live ingestion) captive core handles catchup. if config.bsb is None: return None return functools.partial(BSBSource, config.bsb) @@ -302,21 +302,21 @@ def make_bsb_partial(config): ## Startup Sequence -Four sequential phases, same code path for first start and every restart. The first three are bounded bootstrap work; Phase 4 is the long-running state the daemon stays in until process exit. +Four sequential phases, same code path for first start and every restart. The first three are bounded bootstrap work; Phase 4 (live ingestion) is the long-running state the daemon stays in until process exit. - **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip **when `[BSB]` is configured**, by invoking the backfill subroutine in a loop. Without `[BSB]`, Phase 1 is a no-op and Phase 4's captive core handles initial catchup naturally via its own `PrepareRange(UnboundedRange(resume_ledger))`. - **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 left (for the trailing partial index) into the active txhash store, then deletes them. - **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. Truncates events hot segment beyond the last committed ledger. - **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle goroutine, flips the `daemon_ready` flag, enters the ingestion loop. 
Runs until process exit. -"Phase" here refers to the startup ordering only. Once Phase 4 is entered, there's no Phase 5 — the daemon is in live-streaming steady state. +"Phase" here refers to the startup ordering only. Once Phase 4 (live ingestion) is entered, there's no Phase 5 — the daemon is in live-streaming steady state. -### Backfill vs Phase 1 +### Backfill vs Phase 1 (catchup) - **Backfill** is the subroutine (`run_backfill` in [01-backfill-workflow.md](./01-backfill-workflow.md)). BSB-only, runs parallel per-chunk BSB instances. Captive core cannot be a backfill source — its subprocess is serial and expensive to spin up per instantiation. - **Phase 1 (catchup)** is a startup phase that runs on every daemon start. Its job: close the gap between on-disk state and current network tip before Phase 4 takes over. -- Phase 1 invokes backfill as its mechanism — but only when `[BSB]` is configured. Without `[BSB]`, Phase 1 is a no-op and Phase 4's captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))` as part of its own startup. -- So: "backfill" and "Phase 1" overlap because Phase 1's whole purpose is "invoke backfill when BSB is configured". +- Phase 1 (catchup) invokes backfill as its mechanism — but only when `[BSB]` is configured. Without `[BSB]`, Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))` as part of its own startup. +- So: "backfill" and "Phase 1 (catchup)" overlap because Phase 1 (catchup)'s whole purpose is "invoke backfill when BSB is configured". 
```python def run_streaming_daemon(config): @@ -325,96 +325,69 @@ def run_streaming_daemon(config): make_bsb = make_bsb_partial(config) # None if [BSB] absent phase1_catchup(config, meta_store, make_bsb) phase2_hydrate_txhash(config, meta_store) - phase3_reconcile_orphans(config, meta_store) - phase4_live_ingest(config, meta_store) + resume_ledger = compute_resume_ledger(config, meta_store) + phase3_reconcile_orphans(config, meta_store, resume_ledger) + phase4_live_ingest(config, meta_store, resume_ledger) ``` -Query serving is gated on Phase 4 being reached — see [Query Contract](#query-contract). +Query serving is gated on Phase 4 (live ingestion) being reached — see [Query Contract](#query-contract). ### Phase 1 — Catchup -- **No-op path:** if `make_bsb is None` (no `[BSB]` configured), Phase 1 returns immediately. Phase 4's captive core will catch up from a leapfrog'd resume ledger. +- **No-op path:** if `make_bsb is None` (no `[BSB]` configured), Phase 1 (catchup) returns immediately. Phase 4 (live ingestion)'s captive core will catch up from a leapfrog'd resume ledger. - **BSB path:** runs the backfill subroutine (`run_backfill` from [01-backfill-workflow.md](./01-backfill-workflow.md)) once per source-tip sample, until the gap closes to less than one chunk. -- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. +- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 (catchup) persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. ```python def phase1_catchup(config, meta_store, make_bsb): + # Pure side-effect. 
Re-runs the full retention-aligned range on every start; + # DAG idempotency inside run_backfill handles already-done chunks. if make_bsb is None: - return # No [BSB]; Phase 4's captive core handles catchup. + return # no [BSB] → no-op - cpi = config.service.chunks_per_txhash_index - retention_ledgers = config.service.retention_ledgers - last_committed_ledger = phase1_coverage_end_ledger(meta_store) + cpi = config.service.chunks_per_txhash_index + retention_ledgers = config.service.retention_ledgers + last_scheduled_end_chunk = -1 # Loop because tip advances during catchup; each iteration closes whatever's # accumulated since the previous sample. while True: network_tip_ledger = sample_network_tip(config.history_archives.urls) - if (network_tip_ledger - last_committed_ledger) < LEDGERS_PER_CHUNK: - break # remaining gap < 1 chunk; Phase 4's captive core closes it. - - range_start_chunk_id, range_end_chunk_id = compute_backfill_chunk_range( - last_committed_ledger, network_tip_ledger, retention_ledgers, cpi) - if range_end_chunk_id < range_start_chunk_id: - break # leapfrog landed past last complete chunk — nothing to ingest yet - - run_backfill(config, range_start_chunk_id, range_end_chunk_id, make_bsb) - - # Re-derive from :lfs flags (not from range_end_chunk_id): a mid-iteration - # crash can leave holes that the contiguous-prefix scan detects. - last_committed_ledger = phase1_coverage_end_ledger(meta_store) - - -def compute_backfill_chunk_range(last_committed_ledger, network_tip_ledger, retention_ledgers, cpi): - # Leapfrog aligns DOWN to the first chunk of the tx index containing - # (network_tip_ledger - retention_ledgers). retention_ledgers is a multiple of - # LEDGERS_PER_INDEX but network_tip_ledger is arbitrary, so that subtraction isn't - # tx-index-aligned in general and must be rounded. Worst case: up to - # LEDGERS_PER_INDEX - 1 ledgers past the strict retention line stay on disk. 
- gap_start_ledger = last_committed_ledger + 1 - if retention_ledgers > 0: - target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) - target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - target_tx_index_id = target_chunk_id // cpi - leapfrog_start_ledger = (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER - else: - leapfrog_start_ledger = GENESIS_LEDGER - - range_start_ledger = max(gap_start_ledger, leapfrog_start_ledger) - range_start_chunk_id = (range_start_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - range_end_chunk_id = ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 - return range_start_chunk_id, range_end_chunk_id - - -def phase1_coverage_end_ledger(meta_store): - # Last ledger of the contiguous :lfs prefix starting at the lowest on-disk chunk. - # Contiguous-tail (not max-of-:lfs) because parallel BSB workers can leave mid-range - # holes on crash; max would skip them and break no-gaps. - min_chunk_id = None - for key in meta_store.iter_prefix("chunk:"): - if not key.endswith(":lfs"): - continue - chunk_id = parse_chunk_id(key) - if min_chunk_id is None or chunk_id < min_chunk_id: - min_chunk_id = chunk_id - if min_chunk_id is None: - return GENESIS_LEDGER - 1 - - chunk_id = min_chunk_id - while meta_store.has(f"chunk:{chunk_id:08d}:lfs"): - chunk_id += 1 - return last_ledger_in_chunk(chunk_id - 1) + end_chunk = ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 + if end_chunk <= last_scheduled_end_chunk: + break # no new complete chunks since last iteration + + start_chunk = leapfrog_start_chunk(network_tip_ledger, retention_ledgers, cpi) + if end_chunk < start_chunk: + break # leapfrog landed past tip — pre-first-complete-chunk + + run_backfill(config, start_chunk, end_chunk, make_bsb) + last_scheduled_end_chunk = end_chunk + + +def leapfrog_start_chunk(network_tip_ledger, retention_ledgers, cpi): + # Archive profile (retention=0): start at chunk 0. 
+ # Pruning-history: align DOWN to the first chunk of the tx index containing + # (tip - retention_ledgers). retention_ledgers is a multiple of LEDGERS_PER_INDEX + # but tip is arbitrary, so the subtraction isn't index-aligned and must be rounded. + # Worst case: up to LEDGERS_PER_INDEX - 1 extra ledgers below strict retention. + if retention_ledgers == 0: + return 0 + target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) + target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + target_tx_index_id = target_chunk_id // cpi + return target_tx_index_id * cpi ``` **Worker concurrency:** `run_backfill` caps DAG concurrency at `GOMAXPROCS`. Each `process_chunk` owns its own BSB instance (`make_bsb()`), prepares range for its 10_000 ledgers, reads, and tears down — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunkchunk_id-make_bsb). -**Retention effect:** retention determines Phase 1's chunk range. Catchup time ≈ `retention_window / (BSB throughput)`. +**Retention effect:** retention determines Phase 1 (catchup)'s chunk range. Catchup time ≈ `retention_window / (BSB throughput)`. ### Phase 2 — Hydrate TxHash Data from `.bin` -- Phase 1 may leave `.bin` files for chunks in the last (incomplete) tx index. -- Phase 2 loads each into the active txhash store, then deletes the `.bin` + `chunk:{chunk_id:08d}:txhash` flag. -- After Phase 2: no `.bin` files and no `:txhash` chunk flags remain. +- Phase 1 (catchup) may leave `.bin` files for chunks in the last (incomplete) tx index. +- Phase 2 (`.bin` hydration) loads each into the active txhash store, then deletes the `.bin` + `chunk:{chunk_id:08d}:txhash` flag. +- After Phase 2 (`.bin` hydration): no `.bin` files and no `:txhash` chunk flags remain. 
```python def phase2_hydrate_txhash(config, meta_store): @@ -450,7 +423,7 @@ def phase2_hydrate_txhash(config, meta_store): if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): os.remove(bin_file) finally: - # Close before returning: Phase 4 re-opens by directory path and the RocksDB + # Close before returning: Phase 4 (live ingestion) re-opens by directory path and the RocksDB # flock would collide if this handle stayed open. txhash_store.close() ``` @@ -458,23 +431,118 @@ def phase2_hydrate_txhash(config, meta_store): **Why "load then delete" matters.** - Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. - At `cpi=1_000` with frequent restarts over a day: thousands of redundant loads. -- Load-then-delete makes Phase 2 a no-op on every subsequent restart until the next Phase 1 deposits new `.bin` files. +- Load-then-delete makes Phase 2 (`.bin` hydration) a no-op on every subsequent restart until the next Phase 1 (catchup) deposits new `.bin` files. + +**Pure-streaming restarts** (no recent Phase 1 (catchup) output) never see `.bin` files; streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 (`.bin` hydration) is a no-op. + +### Compute Resume Ledger + +- `compute_resume_ledger` is a shared helper called once per daemon start, between Phase 2 (`.bin` hydration) and Phase 3 (reconcile). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`. +- Consumed by Phase 3 (reconcile) for events-hot-segment truncation at `resume_ledger - 1`, and by Phase 4 (live ingestion) for active-store open + captive-core startup. 
+- **One helper, one call site.** Phase 3 (reconcile) and Phase 4 (live ingestion) never derive `resume_ledger` independently — avoids the bug class where they disagree and leave orphan active stores. +- **Scans every startup, even when `streaming:last_committed_ledger` is already set.** The scan's primary output in the mid-life-restart case is validation, not derivation; catching broken on-disk state before opening active stores is strictly safer than silently resuming on top. +- **Validation failures are fatal.** Any inconsistency aborts startup with "migration to streaming failed" + an operator-readable error naming what's wrong. The daemon exits non-zero; no active stores are opened. + +**Derivation** — first match wins: + +| `streaming:last_committed_ledger` | Scan result | Situation | `resume_ledger` | +|---|---|---|---| +| present | (validated consistent) | Mid-life restart | `value + 1` | +| absent | contiguous `:lfs` chunks `[start..end]` | First-ever post-Phase-1 (catchup), or crash between Phase 1 (catchup) end and first live commit | `last_ledger_in_chunk(end) + 1` | +| absent | no `:lfs` chunks | Tip-tracker fresh start (no `[BSB]`) | `leapfrog_resume_from_tip(config)` | + +**Validation rules** (any violation → fatal): + +- **No internal gap in `:lfs` coverage.** Example FAIL: chunks `[0..90] ∪ [92..N]` with `91` missing. A trailing "no chunks beyond N" is normal end-of-prefix, not a gap. +- **Start aligns to a tx-index boundary.** `start_chunk == 0` (archive) OR `start_chunk % cpi == 0` (pruning-history — first chunk of a tx index). Example FAIL at `cpi=100`: scan yields `[3456..N]`; `3456 % 100 ≠ 0`. Correct start would have been `3500`. +- **Chunk flags consistent.** Every chunk in the contiguous range has both `:lfs` AND `:events`. A chunk with one but not the other means `process_chunk` crashed mid-task and was never re-run. +- **Index flags consistent.** Every complete tx index fully inside `[start, end]` has `index:{tx_index_id:08d}:txhash`. 
Trailing partial indexes do NOT — those wait for Phase 2 (`.bin` hydration) on first start, or become Phase 3 (reconcile) build-respawn candidates on restart. +- **Live checkpoint consistent with scan.** When `streaming:last_committed_ledger = L` is present, chunks through `chunk_id_of_ledger(L) - 1` must all have `:lfs`. Example FAIL: `L = 56_345_672` (chunk 5_634 ingesting), but scan's highest contiguous chunk is 5_632 — chunk 5_633 must have been frozen before chunk 5_634 could be active; its absence means a recent immutable artifact went missing out of band. + +```python +def compute_resume_ledger(config, meta_store): + # Shared helper: single scan, single source of truth for resume_ledger. + cpi = config.service.chunks_per_txhash_index + scan = scan_all_chunk_and_index_keys(meta_store) + validate_scan(scan, cpi) + + last_committed_ledger = meta_store.get("streaming:last_committed_ledger") + if last_committed_ledger is not None: + validate_last_committed_consistency(scan, last_committed_ledger) + return last_committed_ledger + 1 + + if scan.lfs_chunks: + end_chunk = scan.lfs_chunks[-1] # already validated contiguous + return last_ledger_in_chunk(end_chunk) + 1 + + return leapfrog_resume_from_tip(config) # no on-disk chunks: Alice fresh start + + +def validate_scan(scan, cpi): + # Fatal on any violation — "migration to streaming failed". + if not scan.lfs_chunks: + return + start, end = scan.lfs_chunks[0], scan.lfs_chunks[-1] + + expected = set(range(start, end + 1)) + actual = set(scan.lfs_chunks) + if actual != expected: + fatal(f"internal :lfs gap: missing chunks {sorted(expected - actual)}") + + if start != 0 and start % cpi != 0: + fatal(f"start chunk {start} not tx-index aligned (expected multiple of cpi={cpi})") + + if actual != set(scan.events_chunks): + fatal(":lfs / :events mismatch — a process_chunk task crashed mid-run and was never recovered") + + # Complete tx indexes = those whose ALL cpi chunks fall inside [start, end]. 
+ first_complete_tx_index_id = (start + cpi - 1) // cpi + last_complete_tx_index_id = (end + 1) // cpi - 1 + complete = set(range(first_complete_tx_index_id, last_complete_tx_index_id + 1)) + missing = complete - set(scan.txhash_indexes) + if missing: + fatal(f"complete tx indexes {sorted(missing)} missing index:txhash flag") + + +def validate_last_committed_consistency(scan, last_committed_ledger): + # streaming:last_committed_ledger=L implies every chunk up to chunk_id_of_ledger(L)-1 + # must have :lfs. Chunk containing L itself is the currently-ingesting chunk and may + # or may not have :lfs depending on whether L fell on a chunk boundary. + active_chunk_id = (last_committed_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + required_last = active_chunk_id - 1 + if required_last < 0: + return + actual_last = scan.lfs_chunks[-1] if scan.lfs_chunks else -1 + if actual_last < required_last: + fatal(f"streaming:last_committed_ledger={last_committed_ledger} requires :lfs " + f"through chunk {required_last}; scan's highest is {actual_last} — " + f"a recent immutable artifact is missing") + + +def leapfrog_resume_from_tip(config): + # Tip-tracker fresh start: no BSB, no on-disk chunks. First chunk captive core + # ingests will be the first chunk of the tx index containing (tip - retention). + # validate_config already rejected the [BSB]-absent + retention=0 combination, so + # this helper is never called in archive-from-genesis shape. + network_tip_ledger = sample_network_tip(config.history_archives.urls) + retention_ledgers = config.service.retention_ledgers + cpi = config.service.chunks_per_txhash_index -**Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files; streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 is a no-op. 
+ target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) + target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + target_tx_index_id = target_chunk_id // cpi + return (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER +``` ### Phase 3 — Reconcile Orphaned Transitions Completes any in-flight transitions left by a prior crash. All decisions derive from meta store state + on-disk store directories. ```python -def phase3_reconcile_orphans(config, meta_store): - # resume_ledger derivation must match phase4_live_ingest exactly — if they disagree, - # Phase 4 opens a fresh store while Phase 3's preserved store becomes an orphan. - cpi = config.service.chunks_per_txhash_index - last_committed_ledger = meta_store.get("streaming:last_committed_ledger") - if last_committed_ledger is None: - last_committed_ledger = phase1_coverage_end_ledger(meta_store) - resume_ledger = max(last_committed_ledger + 1, GENESIS_LEDGER) +def phase3_reconcile_orphans(config, meta_store, resume_ledger): + # resume_ledger is computed once by the orchestrator (see Compute Resume Ledger) + # and shared with Phase 4 (live ingestion), so the two phases never disagree about the handoff point. + cpi = config.service.chunks_per_txhash_index resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK for store_dir in scan_ledger_store_dirs(config): @@ -501,7 +569,7 @@ def phase3_reconcile_orphans(config, meta_store): transitioning_txhash = open_active_txhash_store(config, tx_index_id) run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash, meta_store) - # Prevents duplicate event IDs when Phase 4 replays the first live ledger. + # Prevents duplicate event IDs when Phase 4 (live ingestion) replays the first live ledger. 
truncate_events_hot_segment(config, resume_ledger - 1) ``` @@ -510,23 +578,11 @@ def phase3_reconcile_orphans(config, meta_store): Opens active stores for the resume position, spawns the lifecycle goroutine, starts captive core, and enters the ingestion loop. Query serving starts here (see [Query Contract](#query-contract)). ```python -def phase4_live_ingest(config, meta_store): - last_committed_ledger = meta_store.get("streaming:last_committed_ledger") - if last_committed_ledger is None: - # First start. - coverage_end = phase1_coverage_end_ledger(meta_store) - if coverage_end > GENESIS_LEDGER - 1: - # Phase 1 backfilled something (BSB-configured profile). Resume from the end - # of the contiguous :lfs prefix. - last_committed_ledger = coverage_end - else: - # Alice's path: [BSB] absent → Phase 1 was a no-op. Leapfrog DOWN from - # current tip to an index boundary so the first on-disk chunk will be - # complete. Captive core archive-catches-up from there. - last_committed_ledger = leapfrog_resume_from_tip(config) - 1 - meta_store.put("streaming:last_committed_ledger", last_committed_ledger) - resume_ledger = last_committed_ledger + 1 - +def phase4_live_ingest(config, meta_store, resume_ledger): + # resume_ledger is already computed by the orchestrator (see Compute Resume Ledger). + # Phase 4 (live ingestion) does NOT write streaming:last_committed_ledger at bootstrap — the first + # write happens inside the live ingestion loop after the first durable commit + # (invariant #9). active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) run_in_background(run_prune_lifecycle_loop, config, meta_store) @@ -537,25 +593,10 @@ def phase4_live_ingest(config, meta_store): run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) -def leapfrog_resume_from_tip(config): - # Fresh-start, no BSB, no on-disk chunks. Choose the first ledger of the tx index - # containing (tip - RETENTION_LEDGERS). 
Captive core will archive-catch-up from here. - # Enforced elsewhere: validate_config fatals when [BSB] is absent AND retention_ledgers == 0 - # (captive-core-from-genesis is not a supported operating mode). - network_tip_ledger = sample_network_tip(config.history_archives.urls) - retention_ledgers = config.service.retention_ledgers - cpi = config.service.chunks_per_txhash_index - - target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) - target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - target_tx_index_id = target_chunk_id // cpi - return (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER - - def open_active_stores_for_resume(config, meta_store, resume_ledger): # Open/WAL-recover the current store for each of ledger/events/txhash AND pre-create # the "next" stores so the first boundary rollover is a pointer swap. - # Events hot segment replays persisted deltas from disk; safe because Phase 3 already + # Events hot segment replays persisted deltas from disk; safe because Phase 3 (reconcile) already # truncated anything past last_committed_ledger. resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK resume_tx_index_id = resume_chunk_id // config.service.chunks_per_txhash_index @@ -594,7 +635,7 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r ) # Atomic "daemon owns everything up to ledger_seq" signal — written only after - # all three stores have durably committed. Distinct from Phase 1's :lfs-derived + # all three stores have durably committed. Distinct from Phase 1 (catchup)'s :lfs-derived # coverage end. meta_store.put("streaming:last_committed_ledger", ledger_seq) @@ -676,7 +717,7 @@ Converts the retired ledger RocksDB store to an immutable `.pack` file, then dis ```python def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_store): # Order: overwrite=True (discard any prior partial) → write → fsync → flag → cleanup. 
- # Flag-after-fsync. Crash between flag and store-delete leaves an orphan dir; Phase 3 + # Flag-after-fsync. Crash between flag and store-delete leaves an orphan dir; Phase 3 (reconcile) # reconciles via `:lfs` present + store present → delete store. pack_path = ledger_pack_path(chunk_id) writer = packfile.create(pack_path, overwrite=True) @@ -690,7 +731,7 @@ def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_ def finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store): - # Phase 3 helper. Same as freeze_ledger_chunk_to_pack_file but opens the existing + # Phase 3 (reconcile) helper. Same as freeze_ledger_chunk_to_pack_file but opens the existing # store (WAL-recovered) and skips signal_lfs_complete (Phase 3 is synchronous). transitioning_ledger_store = open_or_create_ledger_store(config, chunk_id) pack_path = ledger_pack_path(chunk_id) @@ -835,13 +876,13 @@ def prune_tx_index(tx_index_id, meta_store, config): ## Query Contract -Query serving is gated on Phase 4 being reached. `getLedger`, `getTransaction`, `getEvents` all return **HTTP 4xx** during Phases 1–3. +Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, `getTransaction`, `getEvents` all return **HTTP 4xx** during Phases 1–3. ### Readiness Signal -- An in-memory boolean `daemon_ready` is set by `set_daemon_ready()` at the top of Phase 4, after Phases 1–3 complete and active stores are opened. -- Not persisted. On every startup the flag starts `false`; on every Phase 4 entry it flips to `true`. Clean shutdown discards it implicitly (process exits). -- This means: clients see `HTTP 4xx` from `getLedger`/`getTransaction`/`getEvents` on every startup until Phase 4 is reached, regardless of whether prior runs have served queries. Intentional: catchup and recovery phases must complete before the daemon serves, every time. 
+- An in-memory boolean `daemon_ready` is set by `set_daemon_ready()` at the top of Phase 4 (live ingestion), after Phases 1–3 complete and active stores are opened. +- Not persisted. On every startup the flag starts `false`; on every Phase 4 (live ingestion) entry it flips to `true`. Clean shutdown discards it implicitly (process exits). +- This means: clients see `HTTP 4xx` from `getLedger`/`getTransaction`/`getEvents` on every startup until Phase 4 (live ingestion) is reached, regardless of whether prior runs have served queries. Intentional: catchup and recovery phases must complete before the daemon serves, every time. - Query handlers check the flag on each request. `false` → HTTP 4xx. `true` → route normally. ### Behavior During Phases 1–3 @@ -874,26 +915,26 @@ In addition to the backfill subroutine's invariants in [01-backfill-workflow.md 1. **Per-ledger checkpoint.** `streaming:last_committed_ledger` is written only after all three active stores durably commit. Resume is `last_committed_ledger + 1`. 2. **No separate recovery phase.** Startup is Phases 1–4. Nothing else. 3. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. -4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 itself avoids producing them. +4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 (catchup) itself avoids producing them. 5. **Two-phase prune marker.** `prune_tx_index` writes `index:{tx_index_id}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. 
Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `prunable_tx_index_ids`. ### Compound Recovery Scenarios -Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workflow.md#crash-recovery) handles every Phase 1 crash. Streaming adds: +Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workflow.md#crash-recovery) handles every Phase 1 (catchup) crash. Streaming adds: -- **Crash during Phase 2 `.bin` hydration.** +- **Crash during Phase 2 (`.bin` hydration).** - Chunks loaded pre-crash: no `:txhash` flag, no `.bin` → loop skips via flag check. - Chunks not yet loaded: `:txhash` + `.bin` present → loop picks them up. - **Crash between per-ledger checkpoint and LFS freeze completion.** - State: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)`; `chunk:{chunk_id}:lfs` absent. - - Phase 1 on restart (assumes `[BSB]` configured): `:lfs` missing → re-runs `process_chunk(chunk_id)` with a fresh per-task BSB (idempotent per artifact). - - Phase 3 then: active ledger store present + `:lfs` now set → deletes the orphaned store. + - Phase 1 (catchup) on restart (assumes `[BSB]` configured): `:lfs` missing → re-runs `process_chunk(chunk_id)` with a fresh per-task BSB (idempotent per artifact). + - Phase 3 (reconcile) then: active ledger store present + `:lfs` now set → deletes the orphaned store. - Cost: ~10_000 ledgers of redundant ingestion per affected chunk. Correctness preserved. - **Crash mid-RecSplit.** - State: `index:{tx_index_id}:txhash` absent; all `:lfs` chunks of the tx index present. - - Phase 3: re-spawns the RecSplit build after deleting partial `.idx` files. + - Phase 3 (reconcile): re-spawns the RecSplit build after deleting partial `.idx` files. - **Crash mid-prune.** - State: some files deleted, some chunk keys cleared, `index:{tx_index_id}:txhash = "deleting"` still present. 
From c2f6d24844a59b801c0f212d686c426959e7451a Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Thu, 23 Apr 2026 10:17:22 -0700 Subject: [PATCH 22/34] Design docs: rename run_streaming_daemon → run_rpc_service MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Orchestrator function name carried the same 'streaming daemon' smell as the prose mentions we swept earlier. The binary does backfill AND streaming; 'streaming daemon' erases the backfill half. 'RPC service' is the preferred term (per feedback_service_naming memory). Off-repo artifacts (source-of-truth, grill-me checkpoint, streaming-diff summary) updated in sync, but outside this commit. --- full-history/design-docs/02-streaming-workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index ed7f6e899..50f48e789 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -319,7 +319,7 @@ Four sequential phases, same code path for first start and every restart. The fi - So: "backfill" and "Phase 1 (catchup)" overlap because Phase 1 (catchup)'s whole purpose is "invoke backfill when BSB is configured". 
```python -def run_streaming_daemon(config): +def run_rpc_service(config): meta_store = open_meta_store(config) validate_config(config, meta_store) make_bsb = make_bsb_partial(config) # None if [BSB] absent From 12c812caa74309e73cd6e9f3d41dcc8f7245b4df Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Thu, 23 Apr 2026 13:20:36 -0700 Subject: [PATCH 23/34] Design docs: move compute_resume_ledger after Phase 3 (reconcile); rename leapfrog_* helpers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit compute_resume_ledger ordering fix: it now runs AFTER Phase 3 (reconcile), not between Phase 2 (.bin hydration) and Phase 3. Reason: Phase 3's finish_interrupted_ledger_freeze writes :lfs for chunks whose freeze was in flight at a prior crash; running the scan-based strict validation before Phase 3 would see those mid-freeze chunks as internal :lfs gaps and false-positive-fatal on mid-freeze-crash restarts. Consequence: phase3_reconcile_orphans no longer takes resume_ledger as a parameter — it derives resume_chunk_id internally from streaming:last_committed_ledger (key present → use it; absent → short-circuit since no prior live ingestion to reconcile). phase4_live_ingest still takes resume_ledger from compute_resume_ledger. New orchestrator order: phase1 → phase2 → phase3 → compute_resume_ledger → phase4. Rename leapfrog_start_chunk → retention_aligned_start_chunk and leapfrog_resume_from_tip → retention_aligned_resume_ledger. The 'leapfrog' term was misleading without reading Terminology first; the retention-aligned names describe what the functions return rather than the verb. 'Leapfrog' stays in Terminology as a colloquial concept, cross-referenced to the two renamed helpers. 
--- .../design-docs/02-streaming-workflow.md | 129 +++++++++--------- 1 file changed, 68 insertions(+), 61 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 50f48e789..6cd8e4fd2 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -29,8 +29,8 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - **Startup phases 1–4** — sequential bootstrap work the daemon runs once per process start, before serving queries. Not a lifecycle concept — once Phase 4 (live ingestion) is reached, it stays there until the process exits. [Details](#startup-sequence). - **Phase 1 (catchup)** — the startup phase that closes the gap between the last-committed ledger and the current network tip. Invokes the backfill subroutine internally. - **Backfill (subroutine)** — a self-contained mechanism that ingests a known `[range_start, range_end]` chunk range via a static DAG of per-chunk tasks (`process_chunk`, `build_txhash_index`, `cleanup_txhash`). Specified in `01-backfill-workflow.md`. In the unified design, backfill is an internal callable only — no CLI entry point exists. -- **Leapfrog** — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 (catchup) skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. -- **`compute_resume_ledger`** — shared helper called once per daemon start, between Phase 2 (`.bin` hydration) and Phase 3 (reconcile). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger`. 
Result is consumed by Phase 3 (reconcile) and Phase 4 (live ingestion) — they never derive `resume_ledger` independently. See [Compute Resume Ledger](#compute-resume-ledger). +- **Leapfrog** (colloquial) — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 (catchup) skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. Implemented by the `retention_aligned_start_chunk` helper (Phase 1 (catchup) callsite) and the `retention_aligned_resume_ledger` helper (`compute_resume_ledger`'s no-BSB fresh-start branch). +- **`compute_resume_ledger`** — shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` for Phase 4 (live ingestion). Runs post-Phase-3 so any in-flight freezes Phase 3 finished (and their newly-set `:lfs` flags) are visible to the scan. See [Compute Resume Ledger](#compute-resume-ledger). - **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside Phase 4 (live ingestion)'s ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. - **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS` whenever captive core is NOT yet running (Phase 1 (catchup) loop, Phase 4 (live ingestion) leapfrog-from-tip on fresh start without BSB). 
Once captive core is running (inside Phase 4 (live ingestion)), the tip comes from `ledger_backend.latest_tip()` against the running subprocess — authoritative and cheaper than another HTTP round-trip. Different from `last_committed_ledger` (the daemon's own progress). - **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: @@ -325,8 +325,8 @@ def run_rpc_service(config): make_bsb = make_bsb_partial(config) # None if [BSB] absent phase1_catchup(config, meta_store, make_bsb) phase2_hydrate_txhash(config, meta_store) + phase3_reconcile_orphans(config, meta_store) resume_ledger = compute_resume_ledger(config, meta_store) - phase3_reconcile_orphans(config, meta_store, resume_ledger) phase4_live_ingest(config, meta_store, resume_ledger) ``` @@ -357,7 +357,7 @@ def phase1_catchup(config, meta_store, make_bsb): if end_chunk <= last_scheduled_end_chunk: break # no new complete chunks since last iteration - start_chunk = leapfrog_start_chunk(network_tip_ledger, retention_ledgers, cpi) + start_chunk = retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi) if end_chunk < start_chunk: break # leapfrog landed past tip — pre-first-complete-chunk @@ -365,11 +365,13 @@ def phase1_catchup(config, meta_store, make_bsb): last_scheduled_end_chunk = end_chunk -def leapfrog_start_chunk(network_tip_ledger, retention_ledgers, cpi): - # Archive profile (retention=0): start at chunk 0. - # Pruning-history: align DOWN to the first chunk of the tx index containing - # (tip - retention_ledgers). retention_ledgers is a multiple of LEDGERS_PER_INDEX - # but tip is arbitrary, so the subtraction isn't index-aligned and must be rounded. +def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi): + # Called by: phase1_catchup (per loop iteration) to compute range_start_chunk_id. 
+ # Returns the first chunk Phase 1 (catchup) should backfill: + # - Archive profile (retention=0): chunk 0 (full history from genesis). + # - Pruning-history (retention>0): first chunk of the tx index containing + # (tip - retention_ledgers). Aligned DOWN to a tx-index boundary so the first + # persisted chunk starts a complete index (upholds the no-gaps invariant). # Worst case: up to LEDGERS_PER_INDEX - 1 extra ledgers below strict retention. if retention_ledgers == 0: return 0 @@ -435,11 +437,53 @@ def phase2_hydrate_txhash(config, meta_store): **Pure-streaming restarts** (no recent Phase 1 (catchup) output) never see `.bin` files; streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 (`.bin` hydration) is a no-op. +### Phase 3 — Reconcile Orphaned Transitions + +Completes any in-flight transitions left by a prior crash. All decisions derive from meta store state + on-disk store directories. + +```python +def phase3_reconcile_orphans(config, meta_store): + # If no prior live ingestion (streaming:last_committed_ledger absent), no in-flight + # freezes exist — fresh datadir or first-ever start. Short-circuit. 
+ last_committed_ledger = meta_store.get("streaming:last_committed_ledger") + if last_committed_ledger is None: + return + + cpi = config.service.chunks_per_txhash_index + resume_chunk_id = (last_committed_ledger + 1 - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + + for store_dir in scan_ledger_store_dirs(config): + chunk_id = parse_chunk_id_from_dir(store_dir) + if chunk_id == resume_chunk_id or chunk_id == resume_chunk_id + 1: + continue # active / pre-created; keep + if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): + delete_dir(store_dir) # orphaned post-flush cleanup + elif chunk_id < resume_chunk_id: + finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store) + else: + delete_dir(store_dir) # orphan future store + + resume_tx_index_id = resume_chunk_id // cpi + for store_dir in scan_txhash_store_dirs(config): + tx_index_id = parse_tx_index_id_from_dir(store_dir) + if tx_index_id == resume_tx_index_id or tx_index_id == resume_tx_index_id + 1: + continue + if meta_store.has(f"index:{tx_index_id:08d}:txhash"): + delete_dir(store_dir) # RecSplit done; cleanup lingered + elif all_chunks_in_tx_index_have_lfs_flag(meta_store, tx_index_id, cpi): + # Re-spawn build. Pass the handle, not the dir path — build_tx_index_recsplit_files + # reads from the store and closes it. + transitioning_txhash = open_active_txhash_store(config, tx_index_id) + run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash, meta_store) + + # Prevents duplicate event IDs when Phase 4 (live ingestion) replays the first live ledger. + truncate_events_hot_segment(config, last_committed_ledger) +``` + ### Compute Resume Ledger -- `compute_resume_ledger` is a shared helper called once per daemon start, between Phase 2 (`.bin` hydration) and Phase 3 (reconcile). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`. 
-- Consumed by Phase 3 (reconcile) for events-hot-segment truncation at `resume_ledger - 1`, and by Phase 4 (live ingestion) for active-store open + captive-core startup. -- **One helper, one call site.** Phase 3 (reconcile) and Phase 4 (live ingestion) never derive `resume_ledger` independently — avoids the bug class where they disagree and leave orphan active stores. +- `compute_resume_ledger` is a shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`. +- **Runs AFTER Phase 3 (reconcile).** Phase 3's `finish_interrupted_ledger_freeze` writes `:lfs` for chunks whose freeze was in flight at a prior crash; running `compute_resume_ledger` before Phase 3 would see those mid-freeze chunks as internal `:lfs` gaps and false-positive-fatal at startup. - **Scans every startup, even when `streaming:last_committed_ledger` is already set.** The scan's primary output in the mid-life-restart case is validation, not derivation; catching broken on-disk state before opening active stores is strictly safer than silently resuming on top. - **Validation failures are fatal.** Any inconsistency aborts startup with "migration to streaming failed" + an operator-readable error naming what's wrong. The daemon exits non-zero; no active stores are opened. 
@@ -447,9 +491,9 @@ def phase2_hydrate_txhash(config, meta_store): | `streaming:last_committed_ledger` | Scan result | Situation | `resume_ledger` | |---|---|---|---| -| present | (validated consistent) | Mid-life restart | `value + 1` | +| present | (validated consistent) | Mid-life restart (possibly after Phase 3 (reconcile) just finished in-flight freezes) | `value + 1` | | absent | contiguous `:lfs` chunks `[start..end]` | First-ever post-Phase-1 (catchup), or crash between Phase 1 (catchup) end and first live commit | `last_ledger_in_chunk(end) + 1` | -| absent | no `:lfs` chunks | Tip-tracker fresh start (no `[BSB]`) | `leapfrog_resume_from_tip(config)` | +| absent | no `:lfs` chunks | Tip-tracker fresh start (no `[BSB]`) | `retention_aligned_resume_ledger(config)` | **Validation rules** (any violation → fatal): @@ -457,11 +501,11 @@ def phase2_hydrate_txhash(config, meta_store): - **Start aligns to a tx-index boundary.** `start_chunk == 0` (archive) OR `start_chunk % cpi == 0` (pruning-history — first chunk of a tx index). Example FAIL at `cpi=100`: scan yields `[3456..N]`; `3456 % 100 ≠ 0`. Correct start would have been `3500`. - **Chunk flags consistent.** Every chunk in the contiguous range has both `:lfs` AND `:events`. A chunk with one but not the other means `process_chunk` crashed mid-task and was never re-run. - **Index flags consistent.** Every complete tx index fully inside `[start, end]` has `index:{tx_index_id:08d}:txhash`. Trailing partial indexes do NOT — those wait for Phase 2 (`.bin` hydration) on first start, or become Phase 3 (reconcile) build-respawn candidates on restart. -- **Live checkpoint consistent with scan.** When `streaming:last_committed_ledger = L` is present, chunks through `chunk_id_of_ledger(L) - 1` must all have `:lfs`. 
Example FAIL: `L = 56_345_672` (chunk 5_634 ingesting), but scan's highest contiguous chunk is 5_632 — chunk 5_633 must have been frozen before chunk 5_634 could be active; its absence means a recent immutable artifact went missing out of band. +- **Live checkpoint consistent with scan.** When `streaming:last_committed_ledger = L` is present, chunks through `chunk_id_of_ledger(L) - 1` must all have `:lfs`. Example FAIL: `L = 56_345_672` (chunk 5_634 ingesting), but scan's highest contiguous chunk is 5_632 — chunk 5_633 must have been frozen before chunk 5_634 could be active; its absence means a recent immutable artifact went missing out of band. (Mid-freeze state at a prior crash does NOT false-positive this rule because Phase 3 (reconcile) has already finished any in-flight freeze before `compute_resume_ledger` runs.) ```python def compute_resume_ledger(config, meta_store): - # Shared helper: single scan, single source of truth for resume_ledger. + # Called by: run_rpc_service orchestrator, after Phase 3 (reconcile), before Phase 4 (live ingestion). cpi = config.service.chunks_per_txhash_index scan = scan_all_chunk_and_index_keys(meta_store) validate_scan(scan, cpi) @@ -475,10 +519,11 @@ def compute_resume_ledger(config, meta_store): end_chunk = scan.lfs_chunks[-1] # already validated contiguous return last_ledger_in_chunk(end_chunk) + 1 - return leapfrog_resume_from_tip(config) # no on-disk chunks: Alice fresh start + return retention_aligned_resume_ledger(config) # no on-disk chunks: Alice fresh start def validate_scan(scan, cpi): + # Called by: compute_resume_ledger (as part of the pre-derivation validation pass). # Fatal on any violation — "migration to streaming failed". if not scan.lfs_chunks: return @@ -505,6 +550,7 @@ def validate_scan(scan, cpi): def validate_last_committed_consistency(scan, last_committed_ledger): + # Called by: compute_resume_ledger (when streaming:last_committed_ledger is present). 
# streaming:last_committed_ledger=L implies every chunk up to chunk_id_of_ledger(L)-1 # must have :lfs. Chunk containing L itself is the currently-ingesting chunk and may # or may not have :lfs depending on whether L fell on a chunk boundary. @@ -519,11 +565,11 @@ def validate_last_committed_consistency(scan, last_committed_ledger): f"a recent immutable artifact is missing") -def leapfrog_resume_from_tip(config): - # Tip-tracker fresh start: no BSB, no on-disk chunks. First chunk captive core - # ingests will be the first chunk of the tx index containing (tip - retention). - # validate_config already rejected the [BSB]-absent + retention=0 combination, so - # this helper is never called in archive-from-genesis shape. +def retention_aligned_resume_ledger(config): + # Called by: compute_resume_ledger (tip-tracker fresh-start branch; no BSB, no on-disk chunks). + # First chunk captive core ingests will be the first chunk of the tx index containing + # (tip - retention). validate_config already rejected the [BSB]-absent + retention=0 + # combination, so this helper is never called in archive-from-genesis shape. network_tip_ledger = sample_network_tip(config.history_archives.urls) retention_ledgers = config.service.retention_ledgers cpi = config.service.chunks_per_txhash_index @@ -534,45 +580,6 @@ def leapfrog_resume_from_tip(config): return (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER ``` -### Phase 3 — Reconcile Orphaned Transitions - -Completes any in-flight transitions left by a prior crash. All decisions derive from meta store state + on-disk store directories. - -```python -def phase3_reconcile_orphans(config, meta_store, resume_ledger): - # resume_ledger is computed once by the orchestrator (see Compute Resume Ledger) - # and shared with Phase 4 (live ingestion), so the two phases never disagree about the handoff point. 
- cpi = config.service.chunks_per_txhash_index - resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - - for store_dir in scan_ledger_store_dirs(config): - chunk_id = parse_chunk_id_from_dir(store_dir) - if chunk_id == resume_chunk_id or chunk_id == resume_chunk_id + 1: - continue # active / pre-created; keep - if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): - delete_dir(store_dir) # orphaned post-flush cleanup - elif chunk_id < resume_chunk_id: - finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store) - else: - delete_dir(store_dir) # orphan future store - - resume_tx_index_id = resume_chunk_id // cpi - for store_dir in scan_txhash_store_dirs(config): - tx_index_id = parse_tx_index_id_from_dir(store_dir) - if tx_index_id == resume_tx_index_id or tx_index_id == resume_tx_index_id + 1: - continue - if meta_store.has(f"index:{tx_index_id:08d}:txhash"): - delete_dir(store_dir) # RecSplit done; cleanup lingered - elif all_chunks_in_tx_index_have_lfs_flag(meta_store, tx_index_id, cpi): - # Re-spawn build. Pass the handle, not the dir path — build_tx_index_recsplit_files - # reads from the store and closes it. - transitioning_txhash = open_active_txhash_store(config, tx_index_id) - run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash, meta_store) - - # Prevents duplicate event IDs when Phase 4 (live ingestion) replays the first live ledger. - truncate_events_hot_segment(config, resume_ledger - 1) -``` - ### Phase 4 — Live Ingestion Opens active stores for the resume position, spawns the lifecycle goroutine, starts captive core, and enters the ingestion loop. Query serving starts here (see [Query Contract](#query-contract)). 
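The ledger-store branch of `phase3_reconcile_orphans`, as it stands at this patch, can be restated as a pure decision function, which makes the four outcomes easy to check by hand. This is a sketch under assumptions: `GENESIS_LEDGER = 2` is a guess (the real constant is defined in the backfill doc), the `:lfs` flags are modelled as a set of chunk IDs rather than meta-store keys, and the returned labels are hypothetical names for the actions the pseudocode performs. The `resume_chunk_id + 1` keep branch reflects this patch only; the next commit message in the series says it is dropped.

```python
GENESIS_LEDGER = 2           # assumption; see the boundary math in the backfill doc
LEDGERS_PER_CHUNK = 10_000


def classify_ledger_store_dir(chunk_id, last_committed_ledger, lfs_flags):
    """Decide what Phase 3 (reconcile) does with one on-disk ledger store dir.

    lfs_flags: set of chunk IDs whose chunk:{chunk_id}:lfs flag is present.
    """
    resume_chunk_id = (last_committed_ledger + 1 - GENESIS_LEDGER) // LEDGERS_PER_CHUNK
    if chunk_id in (resume_chunk_id, resume_chunk_id + 1):
        return "keep"               # active or pre-created store
    if chunk_id in lfs_flags:
        return "delete_flagged"     # freeze finished, directory delete lingered
    if chunk_id < resume_chunk_id:
        return "finish_freeze"      # freeze was in flight at the prior crash
    return "delete_future"          # orphan store ahead of the resume point
```

With the document's example checkpoint `L = 56_345_672` the resume chunk is 5_634, so a leftover directory for chunk 5_633 is either deleted (flag present) or has its freeze completed (flag absent), and anything beyond chunk 5_635 is treated as an orphan.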
From 513eedefdbb0088aefde0b308bee3f510583731a Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Fri, 24 Apr 2026 01:24:25 -0700 Subject: [PATCH 24/34] Design docs: retire drift, events=RocksDB, on-demand stores, Phase 1 cap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Summary of design decisions from the grill-me session at 2026-04-23 (full decision trail + TDD-ready CC enumeration live off-repo): - Phase 1 (catchup) iteration cap MAX_PHASE1_ITERATIONS=5, hard-coded. Per-iter logs capture backlog trajectory; cap-exceeded fatal points at [BSB].NUM_WORKERS / BUFFER_SIZE. - Drift detection during Phase 4 retired entirely. No tip-sampling goroutine, no DRIFT_WARNING_LEDGERS config, no streaming_drift_ledgers gauge. Backpressure-and-drift section deleted. getHealth payload now matches existing stellar-rpc shape (status / latestLedger / oldestLedger / ledgerRetentionWindow). - Events active store is per-chunk RocksDB (schema per getEvents full-history design). Idempotent per-ledger writes remove Phase 3's truncate_events_hot_segment step. Events Transition freezes the RocksDB store to cold segment (close handle + delete_dir after fsync + flag set). - Active store lifecycle is on-demand at boundary, not pre-created. ActiveStores drops *_next fields; precreate_next_boundary_stores deleted. Each boundary synchronously opens the next store (~100-200 ms, absorbed by the 6s inter-ledger idle at live cadence) and spawns background freeze with the transitioning handle. Phase 3 reconcile simplified — no resume_chunk+1 / resume_tx_index+1 keep branch. - Query-routing read-view invariant made explicit in Concurrency Model: queries during transition see pre-transition or post-transition data, never half-state; flag-is-truth applies to reads. - Active store naming made explicit and uniform: ledger-store-chunk-{chunk_id}/, events-store-chunk-{chunk_id}/, txhash-store-index-{tx_index_id}/. 
- Network identity config: NETWORK_PASSPHRASE under [SERVICE] and STELLAR_CORE_BINARY_PATH under [CAPTIVE_CORE], both required. Aligns with stellar-rpc admin-guide convention. - validate_config consolidated: per-field fatal checks replaced by ensure_required_config_fields_exist(config) — config tables already mark required / optional. - HTTP server binds at daemon startup (before Phase 1) so getHealth is always servable; QueryRouter gates other endpoints on daemon_ready. - Tip helper renamed sample_network_tip → get_latest_network_tip (owns retries + archive mod-64 quirk). Called pre-Phase-4 only. ledger_backend.latest_tip() not used anywhere. - Minor cleanup: Phase N parenthetical tags at README:13, 01-backfill:204, 02-streaming:307-308; removed DRIFT_WARNING_LEDGERS cross-refs from 01-backfill. --- .../design-docs/01-backfill-workflow.md | 6 +- .../design-docs/02-streaming-workflow.md | 183 +++++++++--------- full-history/design-docs/README.md | 2 +- 3 files changed, 94 insertions(+), 97 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index 6a624b546..b99ea0d4b 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -72,7 +72,7 @@ All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). ## Configuration - The service loads a single TOML file; backfill reads the subset documented here. -- Daemon-level sections not consumed by backfill — `[CAPTIVE_CORE]`, `[ACTIVE_STORAGE]`, `[HISTORY_ARCHIVES]`, plus `RETENTION_LEDGERS` / `DRIFT_WARNING_LEDGERS` under `[SERVICE]` — are documented in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). 
+- Daemon-level sections not consumed by backfill — `[CAPTIVE_CORE]`, `[ACTIVE_STORAGE]`, `[HISTORY_ARCHIVES]`, plus `RETENTION_LEDGERS` under `[SERVICE]` — are documented in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). ### TOML Config @@ -83,7 +83,7 @@ All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). | `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | | `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per tx index. Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. | -`[SERVICE]` also carries daemon-level keys not read by backfill — `RETENTION_LEDGERS`, `DRIFT_WARNING_LEDGERS` — see [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). +`[SERVICE]` also carries daemon-level keys not read by backfill — `RETENTION_LEDGERS` — see [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). **[IMMUTABLE_STORAGE.LEDGERS]** (optional) @@ -201,7 +201,7 @@ With geometry and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is h └── txhash/ ├── raw/ ← IMMUTABLE_STORAGE.TXHASH_RAW.PATH │ ├── 00000/ ← chunk_ids 0–999 (1_000 .bin files) - │ │ ├── 00000000.bin ← TRANSIENT (deleted after RecSplit or by Phase 2 hydration) + │ │ ├── 00000000.bin ← TRANSIENT — deleted after RecSplit or by Phase 2 (.bin hydration) │ │ └── ... │ └── .../ └── index/ ← IMMUTABLE_STORAGE.TXHASH_INDEX.PATH diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 6cd8e4fd2..c55c5c358 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -32,11 +32,11 @@ Terms used repeatedly throughout this doc. 
Skim on first read, refer back when a - **Leapfrog** (colloquial) — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 (catchup) skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. Implemented by the `retention_aligned_start_chunk` helper (Phase 1 (catchup) callsite) and the `retention_aligned_resume_ledger` helper (`compute_resume_ledger`'s no-BSB fresh-start branch). - **`compute_resume_ledger`** — shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` for Phase 4 (live ingestion). Runs post-Phase-3 so any in-flight freezes Phase 3 finished (and their newly-set `:lfs` flags) are visible to the scan. See [Compute Resume Ledger](#compute-resume-ledger). - **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside Phase 4 (live ingestion)'s ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. -- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS` whenever captive core is NOT yet running (Phase 1 (catchup) loop, Phase 4 (live ingestion) leapfrog-from-tip on fresh start without BSB). Once captive core is running (inside Phase 4 (live ingestion)), the tip comes from `ledger_backend.latest_tip()` against the running subprocess — authoritative and cheaper than another HTTP round-trip. Different from `last_committed_ledger` (the daemon's own progress). 
+- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Always sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS`, wrapped in the `get_latest_network_tip()` helper (handles retries + the archive-tip-lags-true-tip-by-up-to-63-ledgers quirk). Called only in startup-phase contexts: the Phase 1 (catchup) loop per iter, and `retention_aligned_resume_ledger` for the no-BSB fresh-start case. Phase 4 (live ingestion) steady state does NOT sample tip. Different from `last_committed_ledger` (the daemon's own progress). - **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: - Ledger active store — a per-chunk RocksDB (one instance per chunk). - TxHash active store — a per-index RocksDB with 16 column families (one instance per index). - - Events hot segment — in-memory roaring bitmaps plus persisted per-ledger index deltas (not a RocksDB; see [getEvents design](../../design-docs/getevents-full-history-design.md)). + - Events active store — per-chunk RocksDB (one instance per chunk; schema + column families per [getEvents full-history design](../../design-docs/getevents-full-history-design.md)). - **Immutable store** — on-disk files produced by freezing an active store. Three kinds: - Ledger pack file (one per chunk). - RecSplit index `.idx` files (16 per index). @@ -87,13 +87,14 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 | Key | Type | Default | Description | |---|---|---|---| | `RETENTION_LEDGERS` | uint32 | `0` | `0` = full history; otherwise must be a positive multiple of `LEDGERS_PER_INDEX`. See [Validation Rules](#validation-rules). | -| `DRIFT_WARNING_LEDGERS` | uint32 | `10` | `getHealth` reports unhealthy when ingestion drift exceeds this. ~60 seconds at 10 ledgers. 
| +| `NETWORK_PASSPHRASE` | string | **required** | Stellar network passphrase — for example, `"Public Global Stellar Network ; September 2015"` for pubnet; `"Test SDF Network ; September 2015"` for testnet. Must match the `NETWORK_PASSPHRASE` in the captive-core config file. Surfaced to all daemon code via the runtime config struct. | **[CAPTIVE_CORE]** | Key | Type | Default | Description | |---|---|---|---| -| `CONFIG_PATH` | string | **required** | Path to CaptiveStellarCore config file. | +| `CONFIG_PATH` | string | **required** | Path to the captive-core TOML config file (consumed by the embedded `stellar-core` subprocess). | +| `STELLAR_CORE_BINARY_PATH` | string | **required** | Path to the `stellar-core` binary that captive core spawns as a subprocess. | **[ACTIVE_STORAGE]** (optional) @@ -133,6 +134,8 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 - **`[BSB]` absent AND `RETENTION_LEDGERS = 0` is fatal.** Full history requires BSB — captive-core archive-catchup from genesis would take weeks-to-months. Not a supported operating mode. - `[HISTORY_ARCHIVES].URLS` required in all profiles. - `[CAPTIVE_CORE].CONFIG_PATH` required in all profiles. +- `[CAPTIVE_CORE].STELLAR_CORE_BINARY_PATH` required in all profiles. +- `[SERVICE].NETWORK_PASSPHRASE` required in all profiles. ### Validation Pseudocode @@ -151,10 +154,9 @@ def validate_config(config, meta_store): "BSB — captive-core-from-genesis is not supported. Either add [BSB] or set " "RETENTION_LEDGERS > 0.") - if not config.captive_core.config_path: - fatal("CAPTIVE_CORE.CONFIG_PATH is required.") - if not config.history_archives.urls: - fatal("HISTORY_ARCHIVES.URLS is required.") + # Fatals with a clear "X is required" message for any key marked **required** + # in the [Configuration] tables above that is absent or empty. 
+ ensure_required_config_fields_exist(config) _enforce_immutable(meta_store, "config:chunks_per_txhash_index", str(cpi)) _enforce_immutable(meta_store, "config:retention_ledgers", str(retention_ledgers)) @@ -249,24 +251,33 @@ The daemon maintains three active stores for the current ingestion position. All |---|---|---|---|---| | Ledger | `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{chunk_id:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | | TxHash | `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every `LEDGERS_PER_INDEX` ledgers (index) | -| Events | In-memory hot segment + persisted index deltas | Sequential event ID | Event XDR + metadata | Every 10_000 ledgers (chunk) | +| Events | `{ACTIVE_STORAGE.PATH}/events-store-chunk-{chunk_id:08d}/` | per [getEvents full-history design](../../design-docs/getevents-full-history-design.md) | per [getEvents full-history design](../../design-docs/getevents-full-history-design.md) | Every 10_000 ledgers (chunk) | - Ledger and txhash stores are RocksDB. WAL required. - TxHash store uses 16 column families (`cf-0`..`cf-f`) routed by `txhash[0] >> 4`. -- Events hot segment is in-memory roaring bitmaps plus persisted per-ledger index deltas for crash recovery. See [getEvents full-history design](../../design-docs/getevents-full-history-design.md). - -### Store Pre-creation - -- The store for the next chunk / index is pre-created before the boundary is reached, so boundary-time work is a pointer swap only. -- Creation timing: when the ingestion loop commits a ledger within a configurable window before the boundary (e.g., `last_ledger_in_chunk(chunk_id) - 1_000`). The window must be large enough that store initialization (directory mkdir + RocksDB open + column family setup) completes before the boundary ledger arrives, and small enough that pre-creation doesn't run prematurely for chunks the daemon may never reach. 
-- On restart, a pre-created store is expected to exist — Phase 3 (reconcile) treats `resume_chunk + 1` (and `resume_index + 1`) as active, not an orphan. +- Events active store is a per-chunk RocksDB; schema + column families per [getEvents full-history design](../../design-docs/getevents-full-history-design.md). Per-ledger writes are idempotent. + +### Store Lifecycle + +- **Creation.** Active stores are opened on-demand, synchronously, at the boundary where they're first needed: + - Phase 4 (live ingestion) entry opens exactly one store per data type: `resume_chunk`'s ledger + events stores, and `resume_tx_index`'s txhash store. No pre-creation of a "next" store. + - Each chunk boundary synchronously opens the next chunk's ledger + events stores after capturing the current ones as transitioning handles. + - Each tx-index boundary synchronously opens the next tx-index's txhash store similarly. +- **Synchronous open cost.** mkdir + RocksDB open + column-family setup is ~100–200 ms. At live cadence (6 s/ledger) this fits entirely inside the inter-ledger idle time — zero throughput impact. During archive replay (~500 ledgers/s) the cost is absorbed once per chunk boundary, ~100 ms each; over Alice's 10M-retention fresh start (~1_000 chunks) that's ~100 s of cumulative stall distributed across a ~6 h replay, sub-1%. +- **Transition.** At each boundary, the ingestion loop (a) captures the current store handle as `transitioning`, (b) synchronously opens the next store, (c) spawns the background freeze goroutine with the `transitioning` handle. Ingestion proceeds against the new active store immediately. +- **Deletion.** The freeze goroutine closes the transitioning handle and deletes its RocksDB directory AFTER writing the immutable artifact and setting the meta-store flag (flag-after-fsync). A crash between flag-set and dir-delete leaves an orphan that Phase 3 (reconcile) classifies as flag-is-truth and deletes. 
+- **Crash recovery.** Phase 3 (reconcile) classifies each on-disk active-store directory by chunk/index ID + flag presence: + - Dir is for `resume_chunk` / `resume_tx_index` → keep (the active store the live loop will resume against). + - `:lfs` / `:events` / `:txhash` flag present + dir present → delete dir (flag-is-truth; freeze completed, delete lingered). + - Flag absent + chunk/index ID < resume → `finish_interrupted_ledger_freeze` (or equivalent) — complete the freeze, set flag, delete dir. + - Else → future orphan (from filesystem corruption, stale dirs, or legacy daemon versions that used pre-creation); delete dir. ### Max Concurrent Stores | Store | Max active | Max transitioning | Max total | |---|---|---|---| | Ledger | 1 | 1 | 2 | -| Events | 1 (hot segment) | 1 (freezing cold segment) | 2 | +| Events | 1 | 1 | 2 | | TxHash | 1 | 1 | 2 | --- @@ -304,9 +315,9 @@ def make_bsb_partial(config): Four sequential phases, same code path for first start and every restart. The first three are bounded bootstrap work; Phase 4 (live ingestion) is the long-running state the daemon stays in until process exit. -- **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip **when `[BSB]` is configured**, by invoking the backfill subroutine in a loop. Without `[BSB]`, Phase 1 is a no-op and Phase 4's captive core handles initial catchup naturally via its own `PrepareRange(UnboundedRange(resume_ledger))`. -- **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 left (for the trailing partial index) into the active txhash store, then deletes them. -- **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. Truncates events hot segment beyond the last committed ledger. +- **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip **when `[BSB]` is configured**, by invoking the backfill subroutine in a loop. 
Without `[BSB]`, Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles initial catchup naturally via its own `PrepareRange(UnboundedRange(resume_ledger))`. +- **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 (catchup) left (for the trailing partial index) into the active txhash store, then deletes them. +- **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. - **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle goroutine, flips the `daemon_ready` flag, enters the ingestion loop. Runs until process exit. "Phase" here refers to the startup ordering only. Once Phase 4 (live ingestion) is entered, there's no Phase 5 — the daemon is in live-streaming steady state. @@ -322,6 +333,8 @@ Four sequential phases, same code path for first start and every restart. The fi def run_rpc_service(config): meta_store = open_meta_store(config) validate_config(config, meta_store) + start_http_server(config) # QueryRouter serves getHealth immediately; + # gates getLedger/getTransaction/getEvents on daemon_ready. make_bsb = make_bsb_partial(config) # None if [BSB] absent phase1_catchup(config, meta_store, make_bsb) phase2_hydrate_txhash(config, meta_store) @@ -339,6 +352,12 @@ Query serving is gated on Phase 4 (live ingestion) being reached — see [Query - Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 (catchup) persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. ```python +# Diagnostic safety net for Phase 1 (catchup). BSB throughput >> tip advance rate +# in practice, so 2–3 iters is typical. Hitting the cap means the BSB source is +# degraded, not that 5 is the wrong number — hard-coded, not TOML-configurable. 
+MAX_PHASE1_ITERATIONS = 5 + + def phase1_catchup(config, meta_store, make_bsb): # Pure side-effect. Re-runs the full retention-aligned range on every start; # DAG idempotency inside run_backfill handles already-done chunks. @@ -350,20 +369,34 @@ def phase1_catchup(config, meta_store, make_bsb): last_scheduled_end_chunk = -1 # Loop because tip advances during catchup; each iteration closes whatever's - # accumulated since the previous sample. - while True: - network_tip_ledger = sample_network_tip(config.history_archives.urls) + # accumulated since the previous sample. Capped at MAX_PHASE1_ITERATIONS to + # surface degraded-BSB scenarios as a fatal instead of a silent-hang. + for iter_count in range(1, MAX_PHASE1_ITERATIONS + 1): + network_tip_ledger = get_latest_network_tip(config.history_archives.urls) end_chunk = ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 if end_chunk <= last_scheduled_end_chunk: - break # no new complete chunks since last iteration + log.info(f"phase1_catchup converged after iter={iter_count - 1}") + return # no new complete chunks since last iteration start_chunk = retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi) if end_chunk < start_chunk: - break # leapfrog landed past tip — pre-first-complete-chunk + log.info(f"phase1_catchup: leapfrog landed past tip on iter={iter_count} — no-op") + return # leapfrog landed past tip — pre-first-complete-chunk + + backlog_ledgers = (network_tip_ledger - last_ledger_in_chunk(last_scheduled_end_chunk) + if last_scheduled_end_chunk >= 0 else network_tip_ledger) + log.info(f"phase1_catchup iter={iter_count}/{MAX_PHASE1_ITERATIONS} " + f"tip={network_tip_ledger} start_chunk={start_chunk} end_chunk={end_chunk} " + f"backlog_ledgers={backlog_ledgers}") run_backfill(config, start_chunk, end_chunk, make_bsb) last_scheduled_end_chunk = end_chunk + fatal(f"Phase 1 (catchup) did not converge within MAX_PHASE1_ITERATIONS={MAX_PHASE1_ITERATIONS}. 
" + f"BSB throughput is likely slower than network tip advance rate — check " + f"[BSB].NUM_WORKERS / [BSB].BUFFER_SIZE and network latency to the BSB bucket. " + f"Per-iter backlog_ledgers is in the prior log lines.") + def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi): # Called by: phase1_catchup (per loop iteration) to compute range_start_chunk_id. @@ -454,8 +487,8 @@ def phase3_reconcile_orphans(config, meta_store): for store_dir in scan_ledger_store_dirs(config): chunk_id = parse_chunk_id_from_dir(store_dir) - if chunk_id == resume_chunk_id or chunk_id == resume_chunk_id + 1: - continue # active / pre-created; keep + if chunk_id == resume_chunk_id: + continue # active; keep if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): delete_dir(store_dir) # orphaned post-flush cleanup elif chunk_id < resume_chunk_id: @@ -466,8 +499,8 @@ def phase3_reconcile_orphans(config, meta_store): resume_tx_index_id = resume_chunk_id // cpi for store_dir in scan_txhash_store_dirs(config): tx_index_id = parse_tx_index_id_from_dir(store_dir) - if tx_index_id == resume_tx_index_id or tx_index_id == resume_tx_index_id + 1: - continue + if tx_index_id == resume_tx_index_id: + continue # active; keep if meta_store.has(f"index:{tx_index_id:08d}:txhash"): delete_dir(store_dir) # RecSplit done; cleanup lingered elif all_chunks_in_tx_index_have_lfs_flag(meta_store, tx_index_id, cpi): @@ -476,8 +509,6 @@ def phase3_reconcile_orphans(config, meta_store): transitioning_txhash = open_active_txhash_store(config, tx_index_id) run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash, meta_store) - # Prevents duplicate event IDs when Phase 4 (live ingestion) replays the first live ledger. - truncate_events_hot_segment(config, last_committed_ledger) ``` ### Compute Resume Ledger @@ -570,7 +601,7 @@ def retention_aligned_resume_ledger(config): # First chunk captive core ingests will be the first chunk of the tx index containing # (tip - retention). 
validate_config already rejected the [BSB]-absent + retention=0 # combination, so this helper is never called in archive-from-genesis shape. - network_tip_ledger = sample_network_tip(config.history_archives.urls) + network_tip_ledger = get_latest_network_tip(config.history_archives.urls) retention_ledgers = config.service.retention_ledgers cpi = config.service.chunks_per_txhash_index @@ -588,8 +619,7 @@ Opens active stores for the resume position, spawns the lifecycle goroutine, sta def phase4_live_ingest(config, meta_store, resume_ledger): # resume_ledger is already computed by the orchestrator (see Compute Resume Ledger). # Phase 4 (live ingestion) does NOT write streaming:last_committed_ledger at bootstrap — the first - # write happens inside the live ingestion loop after the first durable commit - # (invariant #9). + # write happens inside the live ingestion loop after the first durable commit. active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) run_in_background(run_prune_lifecycle_loop, config, meta_store) @@ -601,20 +631,16 @@ def phase4_live_ingest(config, meta_store, resume_ledger): def open_active_stores_for_resume(config, meta_store, resume_ledger): - # Open/WAL-recover the current store for each of ledger/events/txhash AND pre-create - # the "next" stores so the first boundary rollover is a pointer swap. - # Events hot segment replays persisted deltas from disk; safe because Phase 3 (reconcile) already - # truncated anything past last_committed_ledger. + # Open exactly one store per data type for the resume position. Subsequent + # chunks / indexes are opened on-demand at boundary time — see on_chunk_boundary + # and on_tx_index_boundary. 
resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK resume_tx_index_id = resume_chunk_id // config.service.chunks_per_txhash_index return ActiveStores( - ledger = open_or_create_ledger_store(config, resume_chunk_id), - ledger_next = open_or_create_ledger_store(config, resume_chunk_id + 1), - events = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id, resume_ledger), - events_next = open_or_create_events_hot_segment(config, meta_store, resume_chunk_id + 1, None), - txhash = open_or_create_txhash_store(config, resume_tx_index_id), - txhash_next = open_or_create_txhash_store(config, resume_tx_index_id + 1), + ledger = open_or_create_ledger_store(config, resume_chunk_id), + events = open_or_create_events_store(config, meta_store, resume_chunk_id), + txhash = open_or_create_txhash_store(config, resume_tx_index_id), ) ``` @@ -638,7 +664,7 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r wait_all( run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm), run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm), - run_in_background(write_events_hot_segment, active_stores.events, ledger_seq, lcm), + run_in_background(write_events_store, active_stores.events, ledger_seq, lcm), ) # Atomic "daemon owns everything up to ledger_seq" signal — written only after @@ -659,7 +685,7 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r ledger_seq += 1 ``` -- Each per-store write is atomic: RocksDB WriteBatch + WAL for ledger and txhash stores; atomic commit of events hot-segment + persisted deltas. +- Each per-store write is atomic: RocksDB WriteBatch + WAL across all three active stores (ledger / txhash / events). - Key/value schemas are in [Active Store Architecture](#active-store-architecture). 
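The key and routing conventions referenced above (big-endian ledger keys, 16 txhash column families routed by the high nibble of the first hash byte, zero-padded per-chunk directory names) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and `GENESIS_LEDGER = 2` is inferred from the statement that chunks start at `..._02` and end at `..._01`, not taken from the geometry section of the backfill doc.

```python
import struct

LEDGERS_PER_CHUNK = 10_000
GENESIS_LEDGER = 2  # assumption: inferred from chunks starting at ..._02


def ledger_key(ledger_seq: int) -> bytes:
    # uint32BE(ledgerSeq): big-endian so byte order matches numeric order.
    return struct.pack(">I", ledger_seq)


def txhash_column_family(txhash: bytes) -> str:
    # Route by the high nibble of the first byte: cf-0 .. cf-f.
    if len(txhash) != 32:
        raise ValueError("txhash must be 32 bytes")
    return f"cf-{txhash[0] >> 4:x}"


def chunk_id_for_ledger(ledger_seq: int) -> int:
    # Same arithmetic as resume_chunk_id in open_active_stores_for_resume.
    return (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK


def ledger_store_dir(active_storage_path: str, chunk_id: int) -> str:
    # Hypothetical helper mirroring the path template in the store table.
    return f"{active_storage_path}/ledger-store-chunk-{chunk_id:08d}/"
```

Because the ledger key is big-endian, a lexicographic RocksDB iterator walks ledgers in sequence order, which is presumably what the freeze-to-pack step relies on when it reads a retired store front to back.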
--- @@ -669,17 +695,17 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r Three independent background transitions per chunk/index boundary; each has its own goroutine, flag, and cleanup. Live ingestion never blocks on them. - **LFS transition** — per chunk. Retired ledger RocksDB → `.pack` file. -- **Events transition** — per chunk. Retired events hot segment → cold segment (3 files). +- **Events transition** — per chunk. Retired events RocksDB store → cold segment (3 files). - **RecSplit transition** — per index. Retired txhash RocksDB → 16 `.idx` files. - Streaming's freeze transitions never produce `.bin` files; those are transient backfill output only (Phase 1). ### Concurrency Model -- **`active_stores` is the ingestion loop's owned state.** Fields (`ledger`, `ledger_next`, `events`, `events_next`, `txhash`, `txhash_next`) are mutated only by the ingestion loop thread — specifically inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze transitions receive a handle by value at spawn time and never read back through `active_stores`. +- **`active_stores` is the ingestion loop's owned state.** Fields (`ledger`, `events`, `txhash` — one handle per data type, no `*_next`) are mutated only by the ingestion loop thread — specifically inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze transitions receive a handle by value at spawn time and never read back through `active_stores`. - **Meta-store is single-writer.** Meta-store flag writes come from: the ingestion loop (per-ledger checkpoint), freeze transitions (artifact `:lfs` / `:events` / `:txhash` flags after fsync), and the lifecycle loop (`"deleting"` marker + key delete during prune). Go's `sync.Mutex` inside the meta-store wrapper + RocksDB's own single-writer semantics keep these serialized. - **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit). 
Implementation: an unbuffered `chan struct{}` per kind, or equivalently a `sync.Mutex`. `wait_for_lfs_complete()` acquires; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases. Second transition starts only after the first releases. Not a `sync.WaitGroup` — that would wait for ALL transitions globally, wrong semantics. -- **Query handlers read from storage-manager layer** (see [01-backfill-workflow.md](./01-backfill-workflow.md)'s sibling docs and the pending query-routing design). Each per-data-type storage manager owns its own state-transition synchronization; the query handler never touches `active_stores` directly. -- **Pre-creation happens at store-open time, not at a mid-chunk tripwire.** `open_active_stores_for_resume` (Phase 4 entry) opens BOTH `resume_chunk_id`'s store AND `resume_chunk_id + 1`'s store up front. Subsequent pre-creation happens inside `on_chunk_boundary` after the rollover — it opens `chunk_id + 2` so the NEXT rollover has the pre-created store already waiting. Amortizes creation cost; keeps the ingestion loop's hot path free of store-open latency. +- **Query handlers read from storage-manager layer** — each per-data-type storage manager (ledger / events / txhash) owns its own state-transition synchronization; the query handler never touches `active_stores` directly. **Read-view invariant:** during a transition, a query sees either pre-transition data (routed to the transitioning store) or post-transition data (routed to the new active store + the newly-flagged immutable artifact) — never a half-state mix. **Flag-is-truth applies to reads too:** a query never routes to an immutable artifact whose `:lfs` / `:events` / `:txhash` flag isn't set. Concrete lock primitives + routing logic belong in a separate query-routing design doc. 
+- **Stores are opened on-demand at boundary, not pre-created.** `open_active_stores_for_resume` (Phase 4 entry) opens exactly one store per data type (`resume_chunk` ledger + events, `resume_tx_index` txhash). Each chunk/tx-index boundary synchronously opens the next store (~100-200 ms) AFTER capturing the current one as transitioning, then spawns the background freeze. At live cadence the sync open fits inside the 6 s inter-ledger idle — zero throughput impact. No `precreate_next_boundary_stores` helper, no `*_next` fields on `active_stores`. ### Chunk Boundary (every 10_000 ledgers) @@ -687,34 +713,20 @@ Triggered when the ingestion loop commits `last_ledger_in_chunk(chunk_id)`. Hand ```python def on_chunk_boundary(chunk_id, active_stores, meta_store): - # LFS: drain the last in-flight LFS freeze (max-1-transitioning), swap pointers, - # spawn the freeze. + # LFS: drain the last in-flight LFS freeze (max-1-transitioning), capture the current + # handle, synchronously open the next chunk's store, spawn the freeze. wait_for_lfs_complete() transitioning_ledger_store = active_stores.ledger - active_stores.ledger = active_stores.ledger_next + active_stores.ledger = open_or_create_ledger_store(config, chunk_id + 1) # ~100-200 ms synchronous run_in_background(freeze_ledger_chunk_to_pack_file, chunk_id, transitioning_ledger_store, meta_store) # Events: same shape, independent goroutine (does NOT wait on LFS). wait_for_events_complete() - freezing_events_segment = active_stores.events - active_stores.events = active_stores.events_next - run_in_background(freeze_events_chunk_to_cold_segment, chunk_id, freezing_events_segment, meta_store) - - # Pre-create "next-next" so the NEXT boundary is also a pointer swap. Background; - # not on the hot path. 
- run_in_background(precreate_next_boundary_stores, active_stores, meta_store, chunk_id + 2) + transitioning_events_store = active_stores.events + active_stores.events = open_or_create_events_store(config, meta_store, chunk_id + 1) # ~100-200 ms + run_in_background(freeze_events_chunk_to_cold_segment, chunk_id, transitioning_events_store, meta_store) notify_lifecycle() # wake prune loop (this notification is ONLY for prune eligibility) - - -def precreate_next_boundary_stores(active_stores, meta_store, target_chunk_id): - # Idempotent — safe to re-run when target stores already exist. - active_stores.ledger_next = open_or_create_ledger_store(config, target_chunk_id) - active_stores.events_next = open_or_create_events_hot_segment(config, meta_store, target_chunk_id, None) - cpi = config.service.chunks_per_txhash_index - target_tx_index_id = target_chunk_id // cpi - if target_tx_index_id != tx_index_id_of_chunk(target_chunk_id - 1): - active_stores.txhash_next = open_or_create_txhash_store(config, target_tx_index_id) ``` ### LFS Transition @@ -753,16 +765,17 @@ def finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store): ### Events Transition -Converts the retired events hot segment to three immutable files (events cold segment). +Converts the retired events RocksDB store to three immutable files (events cold segment). ```python -def freeze_events_chunk_to_cold_segment(chunk_id, freezing_events_segment, meta_store): +def freeze_events_chunk_to_cold_segment(chunk_id, transitioning_events_store, meta_store): # Same flag-after-fsync order as freeze_ledger_chunk_to_pack_file. 
events_path = events_segment_path(chunk_id) - write_cold_segment(freezing_events_segment, events_path) # 3 files: events.pack, index.pack, index.hash + write_cold_segment(transitioning_events_store, events_path) # 3 files: events.pack, index.pack, index.hash fsync_all(events_path) meta_store.put(f"chunk:{chunk_id:08d}:events", "1") - freezing_events_segment.discard() # drops in-memory bitmaps + persisted deltas + transitioning_events_store.close() + delete_dir(events_store_path(chunk_id)) signal_events_complete() ``` @@ -779,7 +792,7 @@ def on_tx_index_boundary(tx_index_id, active_stores, meta_store): verify_all_chunk_flags(tx_index_id, meta_store) # defense-in-depth transitioning_txhash_store = active_stores.txhash - active_stores.txhash = active_stores.txhash_next + active_stores.txhash = open_or_create_txhash_store(config, tx_index_id + 1) # ~100-200 ms synchronous run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash_store, meta_store) ``` @@ -889,13 +902,14 @@ Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, ` - An in-memory boolean `daemon_ready` is set by `set_daemon_ready()` at the top of Phase 4 (live ingestion), after Phases 1–3 complete and active stores are opened. - Not persisted. On every startup the flag starts `false`; on every Phase 4 (live ingestion) entry it flips to `true`. Clean shutdown discards it implicitly (process exits). +- The HTTP server binds its port at daemon startup (before Phase 1 (catchup)) so `getHealth` is always servable regardless of current phase. The QueryRouter routes `getHealth` unconditionally and gates `getLedger` / `getTransaction` / `getEvents` on `daemon_ready`. - This means: clients see `HTTP 4xx` from `getLedger`/`getTransaction`/`getEvents` on every startup until Phase 4 (live ingestion) is reached, regardless of whether prior runs have served queries. Intentional: catchup and recovery phases must complete before the daemon serves, every time. 
- Query handlers check the flag on each request. `false` → HTTP 4xx. `true` → route normally. ### Behavior During Phases 1–3 - `/getLedger`, `/getTransaction`, `/getEvents` → `HTTP 4xx` with no payload detail. -- `/getHealth` → always served; returns `catching_up` + drift when daemon is pre-Phase-4, otherwise `streaming` + drift. +- `/getHealth` → always served. Response payload matches the existing stellar-rpc shape: `status` (`catching_up` during Phases 1–3, `healthy` during Phase 4 (live ingestion)), `latestLedger` (= `streaming:last_committed_ledger`, or `0` if absent), `oldestLedger` (first ingested ledger), `ledgerRetentionWindow`. No drift field, no network-tip field. - No partial / incremental serving. The daemon does not serve "whatever is ingested so far" while Phases 1–3 are running. ### Behavior When an Index Is Being Pruned @@ -949,23 +963,6 @@ Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workf --- -## Backpressure and Drift Detection - -- Live ingestion runs at the network's production rate (~1 ledger / 6 s). Freeze transitions run in background and must not stall the ingestion loop. -- If ingestion drifts, the cause is typically disk I/O saturation or RocksDB compaction stalls. - -### Drift Metric - -```python -drift_ledgers = ledger_backend.latest_tip() - meta_store.get("streaming:last_committed_ledger") -``` - -- Exposed as a Prometheus gauge `streaming_drift_ledgers`. -- `getHealth` returns `unhealthy` when `drift_ledgers > DRIFT_WARNING_LEDGERS` (default 10). -- No automatic response (no pause, no abort). Operator investigates. - ---- - ## Error Handling Three distinct policies — runtime ABORT, transition retry-via-flag-absence, startup FATAL. diff --git a/full-history/design-docs/README.md b/full-history/design-docs/README.md index e53783762..8b545047f 100644 --- a/full-history/design-docs/README.md +++ b/full-history/design-docs/README.md @@ -10,7 +10,7 @@ ## Reading Order - Read **01 Backfill** first. 
It defines shared concepts used by both docs: geometry, meta-store key schema, shared TOML config, flag-after-fsync. -- Read **02 Streaming** second. It builds on 01's vocabulary and describes how the daemon invokes backfill as its Phase 1 subroutine. +- Read **02 Streaming** second. It builds on 01's vocabulary and describes how the daemon invokes backfill as its Phase 1 (catchup) subroutine. ## See Also From 1ba6904b63aa7920a7b929b9964237d59f5ebfb3 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Fri, 24 Apr 2026 01:36:25 -0700 Subject: [PATCH 25/34] Design docs: Phase 3 reconciles events store + trim retired-term negations MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Emerged from the 5-pass consistency review at session close: - Phase 3 (reconcile) gains a scan_events_store_dirs loop mirroring the existing ledger-store loop. Drift introduced during the events=RocksDB rewrite — ledger + txhash had reconciliation loops, events did not, so a mid-events-freeze crash would leave an unreconciled RocksDB dir. Added finish_interrupted_events_freeze helper in §Events Transition (mirrors finish_interrupted_ledger_freeze: re-runs write_cold_segment + fsync + flag + close + delete_dir synchronously). - §Store Lifecycle + §Concurrency Model: dropped retired-term negations ("No pre-creation of a next store", "legacy daemon versions that used pre-creation", ", not pre-created" title qualifier, "No precreate_next_boundary_stores helper, no *_next fields"). Design doc should describe the current design; retired concepts live in the decision store. 
--- .../design-docs/02-streaming-workflow.md | 30 +++++++++++++++++-- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index c55c5c358..9b3624ff1 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -260,7 +260,7 @@ The daemon maintains three active stores for the current ingestion position. All ### Store Lifecycle - **Creation.** Active stores are opened on-demand, synchronously, at the boundary where they're first needed: - - Phase 4 (live ingestion) entry opens exactly one store per data type: `resume_chunk`'s ledger + events stores, and `resume_tx_index`'s txhash store. No pre-creation of a "next" store. + - Phase 4 (live ingestion) entry opens exactly one store per data type: `resume_chunk`'s ledger + events stores, and `resume_tx_index`'s txhash store. - Each chunk boundary synchronously opens the next chunk's ledger + events stores after capturing the current ones as transitioning handles. - Each tx-index boundary synchronously opens the next tx-index's txhash store similarly. - **Synchronous open cost.** mkdir + RocksDB open + column-family setup is ~100–200 ms. At live cadence (6 s/ledger) this fits entirely inside the inter-ledger idle time — zero throughput impact. During archive replay (~500 ledgers/s) the cost is absorbed once per chunk boundary, ~100 ms each; over Alice's 10M-retention fresh start (~1_000 chunks) that's ~100 s of cumulative stall distributed across a ~6 h replay, sub-1%. @@ -270,7 +270,7 @@ The daemon maintains three active stores for the current ingestion position. All - Dir is for `resume_chunk` / `resume_tx_index` → keep (the active store the live loop will resume against). - `:lfs` / `:events` / `:txhash` flag present + dir present → delete dir (flag-is-truth; freeze completed, delete lingered). 
- Flag absent + chunk/index ID < resume → `finish_interrupted_ledger_freeze` (or equivalent) — complete the freeze, set flag, delete dir. - - Else → future orphan (from filesystem corruption, stale dirs, or legacy daemon versions that used pre-creation); delete dir. + - Else → future orphan; delete dir. ### Max Concurrent Stores @@ -496,6 +496,17 @@ def phase3_reconcile_orphans(config, meta_store): else: delete_dir(store_dir) # orphan future store + for store_dir in scan_events_store_dirs(config): + chunk_id = parse_chunk_id_from_dir(store_dir) + if chunk_id == resume_chunk_id: + continue # active; keep + if meta_store.has(f"chunk:{chunk_id:08d}:events"): + delete_dir(store_dir) # orphaned post-flush cleanup + elif chunk_id < resume_chunk_id: + finish_interrupted_events_freeze(store_dir, chunk_id, meta_store) + else: + delete_dir(store_dir) # orphan future store + resume_tx_index_id = resume_chunk_id // cpi for store_dir in scan_txhash_store_dirs(config): tx_index_id = parse_tx_index_id_from_dir(store_dir) @@ -705,7 +716,7 @@ Three independent background transitions per chunk/index boundary; each has its - **Meta-store is single-writer.** Meta-store flag writes come from: the ingestion loop (per-ledger checkpoint), freeze transitions (artifact `:lfs` / `:events` / `:txhash` flags after fsync), and the lifecycle loop (`"deleting"` marker + key delete during prune). Go's `sync.Mutex` inside the meta-store wrapper + RocksDB's own single-writer semantics keep these serialized. - **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit). Implementation: an unbuffered `chan struct{}` per kind, or equivalently a `sync.Mutex`. `wait_for_lfs_complete()` acquires; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases. Second transition starts only after the first releases. 
Not a `sync.WaitGroup` — that would wait for ALL transitions globally, wrong semantics. - **Query handlers read from storage-manager layer** — each per-data-type storage manager (ledger / events / txhash) owns its own state-transition synchronization; the query handler never touches `active_stores` directly. **Read-view invariant:** during a transition, a query sees either pre-transition data (routed to the transitioning store) or post-transition data (routed to the new active store + the newly-flagged immutable artifact) — never a half-state mix. **Flag-is-truth applies to reads too:** a query never routes to an immutable artifact whose `:lfs` / `:events` / `:txhash` flag isn't set. Concrete lock primitives + routing logic belong in a separate query-routing design doc. -- **Stores are opened on-demand at boundary, not pre-created.** `open_active_stores_for_resume` (Phase 4 entry) opens exactly one store per data type (`resume_chunk` ledger + events, `resume_tx_index` txhash). Each chunk/tx-index boundary synchronously opens the next store (~100-200 ms) AFTER capturing the current one as transitioning, then spawns the background freeze. At live cadence the sync open fits inside the 6 s inter-ledger idle — zero throughput impact. No `precreate_next_boundary_stores` helper, no `*_next` fields on `active_stores`. +- **Stores are opened on-demand at boundary.** `open_active_stores_for_resume` (Phase 4 entry) opens exactly one store per data type (`resume_chunk` ledger + events, `resume_tx_index` txhash). Each chunk/tx-index boundary synchronously opens the next store (~100-200 ms) AFTER capturing the current one as transitioning, then spawns the background freeze. At live cadence the sync open fits inside the 6 s inter-ledger idle — zero throughput impact. 
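The per-kind single-flight gate described in the bullets above can be illustrated with a small Python sketch. It is a stand-in, not the daemon's Go implementation: `SingleFlightGate` and `run_boundaries` are hypothetical names, a binary semaphore plays the role of the unbuffered channel or mutex, and a short sleep stands in for the write, fsync and flag steps of a freeze.

```python
import threading
import time


class SingleFlightGate:
    """At most one outstanding transition of a given kind (hypothetical helper)."""

    def __init__(self):
        # A semaphore may be released by a thread other than the one that
        # acquired it, which the boundary-to-freeze handoff requires.
        self._slot = threading.Semaphore(1)

    def wait_for_complete(self):
        # Ingestion loop, at a boundary: blocks while the previous
        # transition of this kind is still running.
        self._slot.acquire()

    def signal_complete(self):
        # Freeze worker, as its final step.
        self._slot.release()


def run_boundaries(chunk_ids, freeze_seconds=0.01):
    gate = SingleFlightGate()
    log = []  # (event, chunk_id) in observed order
    log_lock = threading.Lock()
    workers = []

    def freeze(chunk_id):
        with log_lock:
            log.append(("start", chunk_id))
        time.sleep(freeze_seconds)  # stand-in for write + fsync + flag
        with log_lock:
            log.append(("end", chunk_id))
        gate.signal_complete()

    for chunk_id in chunk_ids:
        gate.wait_for_complete()  # drain the previous freeze first
        worker = threading.Thread(target=freeze, args=(chunk_id,))
        worker.start()
        workers.append(worker)

    for worker in workers:
        worker.join()
    return log
```

Running `run_boundaries([0, 1, 2])` yields strictly alternating start and end events per chunk, since each boundary waits for the previous freeze of the same kind. A separate gate instance per kind (LFS, events, RecSplit) keeps the kinds independent of one another, which a single global wait group would not.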
### Chunk Boundary (every 10_000 ledgers) @@ -777,6 +788,19 @@ def freeze_events_chunk_to_cold_segment(chunk_id, transitioning_events_store, me transitioning_events_store.close() delete_dir(events_store_path(chunk_id)) signal_events_complete() + + +def finish_interrupted_events_freeze(store_dir, chunk_id, meta_store): + # Phase 3 (reconcile) helper. Same as freeze_events_chunk_to_cold_segment but opens + # the existing transitioning store (WAL-recovered) and runs synchronously (no + # signal_events_complete since Phase 3 is not parallel with ingestion). + transitioning_events_store = open_or_create_events_store(config, meta_store, chunk_id) + events_path = events_segment_path(chunk_id) + write_cold_segment(transitioning_events_store, events_path) + fsync_all(events_path) + meta_store.put(f"chunk:{chunk_id:08d}:events", "1") + transitioning_events_store.close() + delete_dir(store_dir) ``` ### Tx-Index Boundary (every `LEDGERS_PER_INDEX` ledgers) From c015a978ae23b4868a17a740f5d6f0bca6690306 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Fri, 24 Apr 2026 11:09:08 -0700 Subject: [PATCH 26/34] =?UTF-8?q?Design=20docs:=20aggressive=20pseudocode?= =?UTF-8?q?=20prune=20=E2=80=94=20main()=20+=20bullet-collapse=20+=20comme?= =?UTF-8?q?nt=20trim?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two-pass pruning over 02-streaming-workflow.md at end of session. - Deleted BSBSource class + make_bsb_partial pseudocode blocks (all comments were garbage restating SDK concepts); replaced with 2 prose bullets in Ledger Source. - Added main() above run_rpc_service — ties CLI parse + config load + logging init + orchestrator invocation together as the true entry point. - Collapsed finish_interrupted_ledger_freeze + finish_interrupted_events_freeze pseudocodes into 1-line bullets (same shape as freeze_*, called synchronously from Phase 3, no signal_*_complete). 
- Collapsed phase3_reconcile_orphans from 35+ lines of 3 parallel loops to 10 lines + 3 prose bullets describing each classifier helper. - Trimmed verbose "# Called by:" / "# Step 1/2/3:" / worked-example comments across validate_scan, validate_last_committed_consistency, prunable_tx_index_ids, phase2_hydrate_txhash, on_chunk_boundary, on_tx_index_boundary, run_live_ingestion_loop, run_prune_lifecycle_loop, prune_tx_index. - Backfill vs Phase 1 bullets: dropped redundant recap bullet (4 → 3). 02-streaming: 1010 → 888 lines (-12%). Python pseudocode blocks: 502 → 373 (-26%). Core logic preserved; intent over enumeration. --- .../design-docs/02-streaming-workflow.md | 258 +++++------------- 1 file changed, 68 insertions(+), 190 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 9b3624ff1..070cfcbb2 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -287,27 +287,8 @@ The daemon maintains three active stores for the current ingestion position. All - **Backfill (Phase 1 (catchup)) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk BSB via the `make_bsb` partial, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup). - **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 (live ingestion) calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. -```python -class BSBSource: - # Used by backfill only. One instance per process_chunk task, torn down at end. - # Interface mirrors the stellar Go SDK's LedgerBackend (PrepareRange + GetLedger). 
- # prepare_range: sets the BSB-backed LedgerBackend's range; BSB prefetch workers - # (BUFFER_SIZE, NUM_WORKERS) fill buffers ahead of get_ledger. - # get_ledger: SDK GetLedger(seq) reads from the prefetch buffer. - # close: tears down the prefetch workers + connection. - ... -``` - -### Make BSB Partial - -```python -def make_bsb_partial(config): - # Returns a partial that each process_chunk calls to get a fresh BSBSource. - # None means Phase 1 (catchup) is a no-op; Phase 4 (live ingestion) captive core handles catchup. - if config.bsb is None: - return None - return functools.partial(BSBSource, config.bsb) -``` +- **`BSBSource`** is the backfill-only ledger source — one instance per `process_chunk`, interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`), torn down at end-of-task. +- **`make_bsb_partial(config)`** returns a partial that instantiates `BSBSource(config.bsb)` per call; returns `None` when `[BSB]` is absent so `phase1_catchup` can short-circuit. --- @@ -324,18 +305,22 @@ Four sequential phases, same code path for first start and every restart. The fi ### Backfill vs Phase 1 (catchup) -- **Backfill** is the subroutine (`run_backfill` in [01-backfill-workflow.md](./01-backfill-workflow.md)). BSB-only, runs parallel per-chunk BSB instances. Captive core cannot be a backfill source — its subprocess is serial and expensive to spin up per instantiation. -- **Phase 1 (catchup)** is a startup phase that runs on every daemon start. Its job: close the gap between on-disk state and current network tip before Phase 4 takes over. -- Phase 1 (catchup) invokes backfill as its mechanism — but only when `[BSB]` is configured. Without `[BSB]`, Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))` as part of its own startup. 
-- So: "backfill" and "Phase 1 (catchup)" overlap because Phase 1 (catchup)'s whole purpose is "invoke backfill when BSB is configured". +- **Backfill** is the subroutine (`run_backfill` in [01-backfill-workflow.md](./01-backfill-workflow.md)). BSB-only; parallel per-chunk BSB instances. Captive core cannot be a backfill source — its subprocess is serial and expensive to spin up per instantiation. +- **Phase 1 (catchup)** is the startup phase that runs on every daemon start. Its job: close the gap between on-disk state and current network tip before Phase 4 (live ingestion) takes over. Invokes backfill as its mechanism when `[BSB]` is configured; otherwise no-op and Phase 4 (live ingestion)'s captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))`. ```python +def main(): + args = parse_cli_flags() # --config, --log-level, --log-format + config = load_config_toml(args.config) + init_logging(config.logging, cli_overrides=args) + run_rpc_service(config) + + def run_rpc_service(config): meta_store = open_meta_store(config) validate_config(config, meta_store) - start_http_server(config) # QueryRouter serves getHealth immediately; - # gates getLedger/getTransaction/getEvents on daemon_ready. - make_bsb = make_bsb_partial(config) # None if [BSB] absent + start_http_server(config) + make_bsb = make_bsb_partial(config) phase1_catchup(config, meta_store, make_bsb) phase2_hydrate_txhash(config, meta_store) phase3_reconcile_orphans(config, meta_store) @@ -352,50 +337,32 @@ Query serving is gated on Phase 4 (live ingestion) being reached — see [Query - Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 (catchup) persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. ```python -# Diagnostic safety net for Phase 1 (catchup). 
BSB throughput >> tip advance rate -# in practice, so 2–3 iters is typical. Hitting the cap means the BSB source is -# degraded, not that 5 is the wrong number — hard-coded, not TOML-configurable. -MAX_PHASE1_ITERATIONS = 5 +MAX_PHASE1_ITERATIONS = 5 # safety-net cap; hitting it means BSB is degraded. def phase1_catchup(config, meta_store, make_bsb): - # Pure side-effect. Re-runs the full retention-aligned range on every start; - # DAG idempotency inside run_backfill handles already-done chunks. if make_bsb is None: - return # no [BSB] → no-op + return # [BSB] absent → no-op cpi = config.service.chunks_per_txhash_index retention_ledgers = config.service.retention_ledgers last_scheduled_end_chunk = -1 - # Loop because tip advances during catchup; each iteration closes whatever's - # accumulated since the previous sample. Capped at MAX_PHASE1_ITERATIONS to - # surface degraded-BSB scenarios as a fatal instead of a silent-hang. for iter_count in range(1, MAX_PHASE1_ITERATIONS + 1): network_tip_ledger = get_latest_network_tip(config.history_archives.urls) end_chunk = ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 if end_chunk <= last_scheduled_end_chunk: - log.info(f"phase1_catchup converged after iter={iter_count - 1}") - return # no new complete chunks since last iteration - + return # converged start_chunk = retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi) if end_chunk < start_chunk: - log.info(f"phase1_catchup: leapfrog landed past tip on iter={iter_count} — no-op") - return # leapfrog landed past tip — pre-first-complete-chunk - - backlog_ledgers = (network_tip_ledger - last_ledger_in_chunk(last_scheduled_end_chunk) - if last_scheduled_end_chunk >= 0 else network_tip_ledger) + return # leapfrog past tip log.info(f"phase1_catchup iter={iter_count}/{MAX_PHASE1_ITERATIONS} " - f"tip={network_tip_ledger} start_chunk={start_chunk} end_chunk={end_chunk} " - f"backlog_ledgers={backlog_ledgers}") - + 
f"tip={network_tip_ledger} range=[{start_chunk}, {end_chunk}]") run_backfill(config, start_chunk, end_chunk, make_bsb) last_scheduled_end_chunk = end_chunk - fatal(f"Phase 1 (catchup) did not converge within MAX_PHASE1_ITERATIONS={MAX_PHASE1_ITERATIONS}. " - f"BSB throughput is likely slower than network tip advance rate — check " - f"[BSB].NUM_WORKERS / [BSB].BUFFER_SIZE and network latency to the BSB bucket. " - f"Per-iter backlog_ledgers is in the prior log lines.") + fatal(f"phase1_catchup exceeded {MAX_PHASE1_ITERATIONS} iters; " + f"check [BSB].NUM_WORKERS / BUFFER_SIZE (backlog trail in logs).") def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi): @@ -428,15 +395,15 @@ def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi): def phase2_hydrate_txhash(config, meta_store): cpi = config.service.chunks_per_txhash_index - # Step 1: sweep leftover .bin for tx indexes already flagged complete — backfill - # may have set index:N:txhash before cleanup_txhash finished on crash. + # Sweep leftover .bin + flag for tx indexes already flagged complete (crash between + # index:N:txhash set and cleanup_txhash finish). for tx_index_id in tx_index_ids_with_txhash_flag(meta_store): for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_if_exists(raw_txhash_path(chunk_id)) - # Step 2: load .bin for the trailing incomplete tx index into the active RocksDB. + # Load .bin for the trailing incomplete tx index into the active RocksDB. 
incomplete_tx_index_id = current_incomplete_tx_index_id(meta_store) if incomplete_tx_index_id is None: return @@ -449,18 +416,15 @@ def phase2_hydrate_txhash(config, meta_store): bin_path = raw_txhash_path(chunk_id) if os.path.exists(bin_path): load_bin_into_rocksdb(bin_path, txhash_store) - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # flag first - delete_if_exists(bin_path) # then .bin + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") + delete_if_exists(bin_path) - # Step 3: sweep orphan .bin (flag gone, .bin lingering from a prior crash - # between the two deletes above). + # Sweep orphan .bin (flag already deleted, .bin lingered from a prior crash). for bin_file in scan_bin_files_for_tx_index(incomplete_tx_index_id): if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): os.remove(bin_file) finally: - # Close before returning: Phase 4 (live ingestion) re-opens by directory path and the RocksDB - # flock would collide if this handle stayed open. - txhash_store.close() + txhash_store.close() # Phase 4 re-opens by directory path; flock would collide otherwise. ``` **Why "load then delete" matters.** @@ -476,52 +440,24 @@ Completes any in-flight transitions left by a prior crash. All decisions derive ```python def phase3_reconcile_orphans(config, meta_store): - # If no prior live ingestion (streaming:last_committed_ledger absent), no in-flight - # freezes exist — fresh datadir or first-ever start. Short-circuit. 
last_committed_ledger = meta_store.get("streaming:last_committed_ledger") if last_committed_ledger is None: - return + return # fresh start — nothing in flight cpi = config.service.chunks_per_txhash_index resume_chunk_id = (last_committed_ledger + 1 - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - for store_dir in scan_ledger_store_dirs(config): - chunk_id = parse_chunk_id_from_dir(store_dir) - if chunk_id == resume_chunk_id: - continue # active; keep - if meta_store.has(f"chunk:{chunk_id:08d}:lfs"): - delete_dir(store_dir) # orphaned post-flush cleanup - elif chunk_id < resume_chunk_id: - finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store) - else: - delete_dir(store_dir) # orphan future store - - for store_dir in scan_events_store_dirs(config): - chunk_id = parse_chunk_id_from_dir(store_dir) - if chunk_id == resume_chunk_id: - continue # active; keep - if meta_store.has(f"chunk:{chunk_id:08d}:events"): - delete_dir(store_dir) # orphaned post-flush cleanup - elif chunk_id < resume_chunk_id: - finish_interrupted_events_freeze(store_dir, chunk_id, meta_store) - else: - delete_dir(store_dir) # orphan future store - - resume_tx_index_id = resume_chunk_id // cpi - for store_dir in scan_txhash_store_dirs(config): - tx_index_id = parse_tx_index_id_from_dir(store_dir) - if tx_index_id == resume_tx_index_id: - continue # active; keep - if meta_store.has(f"index:{tx_index_id:08d}:txhash"): - delete_dir(store_dir) # RecSplit done; cleanup lingered - elif all_chunks_in_tx_index_have_lfs_flag(meta_store, tx_index_id, cpi): - # Re-spawn build. Pass the handle, not the dir path — build_tx_index_recsplit_files - # reads from the store and closes it. 
- transitioning_txhash = open_active_txhash_store(config, tx_index_id) - run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash, meta_store) - + reconcile_ledger_store_dirs(config, meta_store, resume_chunk_id) + reconcile_events_store_dirs(config, meta_store, resume_chunk_id) + reconcile_txhash_store_dirs(config, meta_store, resume_chunk_id // cpi, cpi) ``` +Each `reconcile_*_store_dirs` helper scans its own active-store directory type and classifies each dir it finds: + +- **`reconcile_ledger_store_dirs`** — per `chunk_id` found: `== resume_chunk_id` → keep (active); `chunk:{chunk_id}:lfs` flag present → delete dir (flag-is-truth, freeze completed); `< resume_chunk_id` and flag absent → call `finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store)`; else delete as future orphan. +- **`reconcile_events_store_dirs`** — identical classification with `:events` flag and `finish_interrupted_events_freeze`. +- **`reconcile_txhash_store_dirs`** — per `tx_index_id` found: `== resume_tx_index_id` → keep; `index:{tx_index_id:08d}:txhash` present → delete dir (RecSplit done); flag absent and all chunks of `tx_index_id` have `:lfs` → open the store synchronously and spawn `build_tx_index_recsplit_files` in background (the builder reads from the handle and closes it). + ### Compute Resume Ledger - `compute_resume_ledger` is a shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`. @@ -547,7 +483,6 @@ def phase3_reconcile_orphans(config, meta_store): ```python def compute_resume_ledger(config, meta_store): - # Called by: run_rpc_service orchestrator, after Phase 3 (reconcile), before Phase 4 (live ingestion). 
cpi = config.service.chunks_per_txhash_index scan = scan_all_chunk_and_index_keys(meta_store) validate_scan(scan, cpi) @@ -556,16 +491,12 @@ def compute_resume_ledger(config, meta_store): if last_committed_ledger is not None: validate_last_committed_consistency(scan, last_committed_ledger) return last_committed_ledger + 1 - if scan.lfs_chunks: - end_chunk = scan.lfs_chunks[-1] # already validated contiguous - return last_ledger_in_chunk(end_chunk) + 1 - - return retention_aligned_resume_ledger(config) # no on-disk chunks: Alice fresh start + return last_ledger_in_chunk(scan.lfs_chunks[-1]) + 1 # first-ever post-Phase-1 + return retention_aligned_resume_ledger(config) # Alice fresh start def validate_scan(scan, cpi): - # Called by: compute_resume_ledger (as part of the pre-derivation validation pass). # Fatal on any violation — "migration to streaming failed". if not scan.lfs_chunks: return @@ -575,14 +506,11 @@ def validate_scan(scan, cpi): actual = set(scan.lfs_chunks) if actual != expected: fatal(f"internal :lfs gap: missing chunks {sorted(expected - actual)}") - if start != 0 and start % cpi != 0: fatal(f"start chunk {start} not tx-index aligned (expected multiple of cpi={cpi})") - if actual != set(scan.events_chunks): - fatal(":lfs / :events mismatch — a process_chunk task crashed mid-run and was never recovered") + fatal(":lfs / :events mismatch — process_chunk crashed mid-run, unrecovered") - # Complete tx indexes = those whose ALL cpi chunks fall inside [start, end]. first_complete_tx_index_id = (start + cpi - 1) // cpi last_complete_tx_index_id = (end + 1) // cpi - 1 complete = set(range(first_complete_tx_index_id, last_complete_tx_index_id + 1)) @@ -592,10 +520,8 @@ def validate_scan(scan, cpi): def validate_last_committed_consistency(scan, last_committed_ledger): - # Called by: compute_resume_ledger (when streaming:last_committed_ledger is present). - # streaming:last_committed_ledger=L implies every chunk up to chunk_id_of_ledger(L)-1 - # must have :lfs. 
Chunk containing L itself is the currently-ingesting chunk and may - # or may not have :lfs depending on whether L fell on a chunk boundary. + # `streaming:last_committed_ledger = L` implies every chunk up to chunk_id_of_ledger(L)-1 + # must have :lfs (the chunk containing L itself is the currently-ingesting one). active_chunk_id = (last_committed_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK required_last = active_chunk_id - 1 if required_last < 0: @@ -603,8 +529,7 @@ def validate_last_committed_consistency(scan, last_committed_ledger): actual_last = scan.lfs_chunks[-1] if scan.lfs_chunks else -1 if actual_last < required_last: fatal(f"streaming:last_committed_ledger={last_committed_ledger} requires :lfs " - f"through chunk {required_last}; scan's highest is {actual_last} — " - f"a recent immutable artifact is missing") + f"through chunk {required_last}; scan's highest is {actual_last}") def retention_aligned_resume_ledger(config): @@ -665,30 +590,24 @@ Single goroutine. Pull-based: the daemon drives sequential `GetLedger(seq)` call ```python def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): - cpi = config.service.chunks_per_txhash_index # immutable; read once + cpi = config.service.chunks_per_txhash_index ledger_seq = resume_ledger while True: lcm = ledger_backend.GetLedger(ledger_seq) # blocks until available - # Fan out to all three active stores; wait for all to durably commit before - # advancing the checkpoint. Each write is idempotent on retry. + # All three writes durably commit before advancing the checkpoint. 
wait_all( - run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm), - run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm), - run_in_background(write_events_store, active_stores.events, ledger_seq, lcm), + run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm), + run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm), + run_in_background(write_events_store, active_stores.events, ledger_seq, lcm), ) - - # Atomic "daemon owns everything up to ledger_seq" signal — written only after - # all three stores have durably committed. Distinct from Phase 1 (catchup)'s :lfs-derived - # coverage end. meta_store.put("streaming:last_committed_ledger", ledger_seq) chunk_id = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK if ledger_seq == last_ledger_in_chunk(chunk_id): on_chunk_boundary(chunk_id, active_stores, meta_store) - # Tx-index boundary runs AFTER on_chunk_boundary (every tx-index boundary is - # also a chunk boundary). + # Every tx-index boundary is also a chunk boundary; index handler runs after chunk handler. tx_index_id = chunk_id // cpi if ledger_seq == last_ledger_in_tx_index(tx_index_id): on_tx_index_boundary(tx_index_id, active_stores, meta_store) @@ -724,20 +643,19 @@ Triggered when the ingestion loop commits `last_ledger_in_chunk(chunk_id)`. Hand ```python def on_chunk_boundary(chunk_id, active_stores, meta_store): - # LFS: drain the last in-flight LFS freeze (max-1-transitioning), capture the current - # handle, synchronously open the next chunk's store, spawn the freeze. + # LFS + events each: drain prior freeze, capture current handle, open chunk+1 sync (~100-200 ms), + # spawn background freeze. LFS and events run independently (events doesn't wait on LFS). 
wait_for_lfs_complete() transitioning_ledger_store = active_stores.ledger - active_stores.ledger = open_or_create_ledger_store(config, chunk_id + 1) # ~100-200 ms synchronous + active_stores.ledger = open_or_create_ledger_store(config, chunk_id + 1) run_in_background(freeze_ledger_chunk_to_pack_file, chunk_id, transitioning_ledger_store, meta_store) - # Events: same shape, independent goroutine (does NOT wait on LFS). wait_for_events_complete() transitioning_events_store = active_stores.events - active_stores.events = open_or_create_events_store(config, meta_store, chunk_id + 1) # ~100-200 ms + active_stores.events = open_or_create_events_store(config, meta_store, chunk_id + 1) run_in_background(freeze_events_chunk_to_cold_segment, chunk_id, transitioning_events_store, meta_store) - notify_lifecycle() # wake prune loop (this notification is ONLY for prune eligibility) + notify_lifecycle() # wake prune loop ``` ### LFS Transition @@ -746,9 +664,8 @@ Converts the retired ledger RocksDB store to an immutable `.pack` file, then dis ```python def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_store): - # Order: overwrite=True (discard any prior partial) → write → fsync → flag → cleanup. - # Flag-after-fsync. Crash between flag and store-delete leaves an orphan dir; Phase 3 (reconcile) - # reconciles via `:lfs` present + store present → delete store. + # overwrite=True discards any prior partial; flag-after-fsync. Crash between flag + # and store-delete leaves an orphan that Phase 3 (reconcile) picks up (flag-is-truth). pack_path = ledger_pack_path(chunk_id) writer = packfile.create(pack_path, overwrite=True) for ledger_seq in range(first_ledger_in_chunk(chunk_id), last_ledger_in_chunk(chunk_id) + 1): @@ -760,27 +677,16 @@ def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_ signal_lfs_complete() -def finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store): - # Phase 3 (reconcile) helper. 
Same as freeze_ledger_chunk_to_pack_file but opens the existing - # store (WAL-recovered) and skips signal_lfs_complete (Phase 3 is synchronous). - transitioning_ledger_store = open_or_create_ledger_store(config, chunk_id) - pack_path = ledger_pack_path(chunk_id) - writer = packfile.create(pack_path, overwrite=True) - for ledger_seq in range(first_ledger_in_chunk(chunk_id), last_ledger_in_chunk(chunk_id) + 1): - writer.append(transitioning_ledger_store.get(uint32_big_endian(ledger_seq))) - writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") - transitioning_ledger_store.close() - delete_dir(store_dir) ``` +`finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the store at `store_dir`, runs the same write + fsync + flag + close + `delete_dir(store_dir)` sequence, no `signal_lfs_complete`. + ### Events Transition Converts the retired events RocksDB store to three immutable files (events cold segment). ```python def freeze_events_chunk_to_cold_segment(chunk_id, transitioning_events_store, meta_store): - # Same flag-after-fsync order as freeze_ledger_chunk_to_pack_file. events_path = events_segment_path(chunk_id) write_cold_segment(transitioning_events_store, events_path) # 3 files: events.pack, index.pack, index.hash fsync_all(events_path) @@ -790,33 +696,22 @@ def freeze_events_chunk_to_cold_segment(chunk_id, transitioning_events_store, me signal_events_complete() -def finish_interrupted_events_freeze(store_dir, chunk_id, meta_store): - # Phase 3 (reconcile) helper. Same as freeze_events_chunk_to_cold_segment but opens - # the existing transitioning store (WAL-recovered) and runs synchronously (no - # signal_events_complete since Phase 3 is not parallel with ingestion). 
- transitioning_events_store = open_or_create_events_store(config, meta_store, chunk_id) - events_path = events_segment_path(chunk_id) - write_cold_segment(transitioning_events_store, events_path) - fsync_all(events_path) - meta_store.put(f"chunk:{chunk_id:08d}:events", "1") - transitioning_events_store.close() - delete_dir(store_dir) ``` +`finish_interrupted_events_freeze(store_dir, chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the store at `store_dir`, runs the same write + fsync + flag + close + `delete_dir(store_dir)` sequence, no `signal_events_complete`. + ### Tx-Index Boundary (every `LEDGERS_PER_INDEX` ledgers) The last chunk of a tx index has just rolled over. Before RecSplit can start, every chunk in the tx index must have its `:lfs` and `:events` flags set. ```python def on_tx_index_boundary(tx_index_id, active_stores, meta_store): - # Drain ALL in-flight LFS + events (the final chunk's freeze may still be running) - # — RecSplit cannot race its input. + # Drain all in-flight chunk-level freezes for this tx index before RecSplit. wait_for_lfs_complete() wait_for_events_complete() - verify_all_chunk_flags(tx_index_id, meta_store) # defense-in-depth - + verify_all_chunk_flags(tx_index_id, meta_store) transitioning_txhash_store = active_stores.txhash - active_stores.txhash = open_or_create_txhash_store(config, tx_index_id + 1) # ~100-200 ms synchronous + active_stores.txhash = open_or_create_txhash_store(config, tx_index_id + 1) run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash_store, meta_store) ``` @@ -845,12 +740,10 @@ Retention is enforced by a single background goroutine, woken at chunk boundarie ```python def run_prune_lifecycle_loop(config, meta_store): - # Initial scan at entry catches any `"deleting"` state left by a prior crashed prune; - # without it, a crashed prune could sit unserviced until the next chunk boundary - # (up to ~16 h at cpi=1). 
Subsequent sweeps fire on chunk-boundary notifications. + # Initial sweep catches `"deleting"` state left by a prior crashed prune; + # subsequent sweeps fire on chunk-boundary notifications. cpi = config.service.chunks_per_txhash_index retention_ledgers = config.service.retention_ledgers - _run_prune_sweep(meta_store, retention_ledgers, cpi, config) while True: wait_for_chunk_boundary_notification() @@ -863,18 +756,8 @@ def _run_prune_sweep(meta_store, retention_ledgers, cpi, config): def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): - # Returns tx_index_ids fully past retention and still prune-eligible (`:txhash` is - # `"1"` or `"deleting"`). Uses streaming:last_committed_ledger (daemon's own progress), - # not the source-reported tip. - # - # Derivation: tx_index_id eligible iff - # last_committed_ledger > last_ledger_in_tx_index(tx_index_id) + retention_ledgers - # → max_eligible_tx_index_id = ((last_committed_ledger - GENESIS_LEDGER - retention_ledgers) - # // LEDGERS_PER_INDEX) - 1 - # Check at last_committed_ledger=70_000_002, retention_ledgers=10_000_000, cpi=1_000: - # max_eligible = (70_000_002 - 2 - 10_000_000) // 10_000_000 - 1 = 5. - # tx_index 5 ends at 60_000_001 + 10M = 70_000_001; 70_000_002 > that → eligible. ✓ - # tx_index 6 ends at 70_000_001 + 10M = 80_000_001; NOT eligible. ✓ + # Returns tx_index_ids fully past retention and prune-eligible (`:txhash` is `"1"` or + # `"deleting"`). Eligibility: last_committed_ledger > last_ledger_in_tx_index(N) + retention_ledgers. if retention_ledgers == 0: return [] last_committed_ledger = meta_store.get("streaming:last_committed_ledger") @@ -891,20 +774,15 @@ def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): def prune_tx_index(tx_index_id, meta_store, config): - # Two-phase marker for query-routing safety: set "deleting" BEFORE any file delete; - # clear the key AFTER. Queries short-circuit on "deleting" (treated as absent). - # Idempotent on crash-between-stages retry. 
+ # Two-phase marker: set "deleting" first, clear the key last. Idempotent on retry. cpi = config.service.chunks_per_txhash_index - meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting") - for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): delete_if_exists(ledger_pack_path(chunk_id)) delete_events_segment(chunk_id) meta_store.delete(f"chunk:{chunk_id:08d}:lfs") meta_store.delete(f"chunk:{chunk_id:08d}:events") delete_recsplit_idx_files(tx_index_id) - meta_store.delete(f"index:{tx_index_id:08d}:txhash") ``` From a25e95ea7d9e3f2034a9ffca1db8db1fd87af51f Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Fri, 24 Apr 2026 16:47:54 -0700 Subject: [PATCH 27/34] Design docs: Overview restructure + pseudocode helper-ification - Backfill Overview: scene-setting blurb; produces-first then how-it-does-it; meta-store concept introduced inline - Geometry: trimmed to constants + helper-name mentions; math bodies + worked examples + cpi=1000 row table relocated - Streaming pseudocode: inline arithmetic replaced with named helpers (chunk_id_of_ledger, last_completed_chunk_id, chunks_for_tx_index, first_chunk_id_of_tx_index_containing, first_ledger_of_tx_index_containing, first/last_fully_covered_tx_index_id, max_prunable_tx_index_id, tx_index_id_of_chunk) - First-use inline definitions for BSB / LFS / WAL / MPHF / RecSplit / GCS / captive core / leapfrog'd resume ledger / operator profiles - Configuration section: single-bullet lead-in; drop two negation-style cross-doc callouts - Fix broken intra-doc links --- .../design-docs/01-backfill-workflow.md | 79 ++++++---------- .../design-docs/02-streaming-workflow.md | 94 ++++++++----------- 2 files changed, 67 insertions(+), 106 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index b99ea0d4b..9b6765237 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -2,24 +2,29 @@ 
## Overview -- Backfill is a subroutine invoked by the RPC service's **Phase 1 (catchup)** — see [02-streaming-workflow.md — Phase 1](./02-streaming-workflow.md#phase-1--catchup). -- Internal to the daemon; no `full-history-backfill` subcommand, no per-run flags. -- Input: an integer chunk range `[range_start_chunk_id, range_end_chunk_id]` and a `make_bsb` partial function — calling `make_bsb()` returns a fresh [`BSBSource`](./02-streaming-workflow.md#ledger-source) instance. - -**What it does:** -- Ingests historical ledgers via per-task BSB instances. Each `process_chunk` calls `make_bsb()` to get its own `BSBSource`, calls `prepare_range` scoped to the chunk's 10_000 ledgers, reads in a loop, and tears down. Independent per chunk — no shared source state. -- Writes directly to immutable file formats — no RocksDB active stores. -- Schedules work as a DAG of idempotent tasks dispatched via a flat worker pool (`GOMAXPROCS` concurrency). -- Returns when every chunk in the range is complete; on crash, Phase 1 (catchup) re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency. -- **BSB-only.** Backfill does not use captive core as a ledger source. Captive core belongs to Phase 4 (live ingestion); if BSB isn't configured, backfill is not invoked at all and Phase 4 (live ingestion)'s captive core catches up from a leapfrog'd resume ledger as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion). +Backfill is the RPC service's historical-ingestion subroutine — it pulls ledgers from a remote object store (BSB) and writes them as immutable, query-ready artifacts on local disk. It runs once per daemon start, as part of Phase 1 (catchup), to close the gap between on-disk state and the current network tip before live ingestion takes over. Interruption at any point leaves recoverable state; on restart, already-complete work is skipped. 
**What it produces:** -| Query it enables | Immutable output | Scope | +Three immutable artifact types, one per full-history RPC query, scoped to **chunks** (10_000-ledger blocks) and **tx indexes** (consecutive chunks, default 1_000 = 10_000_000 ledgers each — see [Geometry](#geometry)). + +| Immutable output | Query it enables | Scope | |-----------------|-----------------|-------| -| `getLedger` | Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | Per chunk (10_000 ledgers) | -| `getTransaction` | Tx-index files | Per tx index (default 10_000_000 ledgers) | -| `getEvents` | [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | Per chunk | +| Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | `getLedger` | Per chunk (10_000 ledgers) | +| Tx-index files | `getTransaction` | Per tx index (default 10_000_000 ledgers) | +| [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | `getEvents` | Per chunk | + +**How it does it:** + +- Backfill is a subroutine invoked by the RPC service's **Phase 1 (catchup)** — see [02-streaming-workflow.md — Phase 1](./02-streaming-workflow.md#phase-1--catchup). Internal to the daemon; no `full-history-backfill` subcommand, no per-run flags. +- Ledger source is **BSB** (Buffered Storage Backend) — a remote object-store reader for `LedgerCloseMeta`, configured under `[BSB]` in the TOML config. Interface details in [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). +- Input: an integer chunk range `[range_start_chunk_id, range_end_chunk_id]` and a `make_bsb` partial function — calling `make_bsb()` returns a fresh `BSBSource` instance. +- Ingests historical ledgers via per-task BSB instances. 
Each `process_chunk` task (the per-chunk unit of work; full pseudocode in [Task Details](#task-details) below) calls `make_bsb()` to get its own `BSBSource`, calls `prepare_range` scoped to the chunk's 10_000 ledgers, reads in a loop, and tears down. Independent per chunk — no shared source state. +- Writes directly to immutable file formats — no RocksDB active stores (mutable RocksDB instances holding in-flight live-ingestion data; streaming's concern, see [02-streaming-workflow.md — Active Store Architecture](./02-streaming-workflow.md#active-store-architecture)). +- Tracks per-chunk and per-tx-index completion in a small **meta store** — a dedicated RocksDB with WAL always on, separate from streaming's active stores. Each flag is written after its artifact's `fsync`, and flag presence drives all resume decisions. +- Schedules work as a DAG of idempotent tasks dispatched via a flat worker pool capped at `GOMAXPROCS`. +- Returns when every chunk in the range is complete; on crash, Phase 1 (catchup) re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency. +- **BSB-only.** Backfill does not use captive core (the embedded `stellar-core` subprocess the daemon runs for live ingestion) as a ledger source. Captive core belongs to Phase 4 (live ingestion); if BSB isn't configured, backfill is not invoked at all and Phase 4 (live ingestion)'s captive core catches up from a leapfrog'd resume ledger — a start ledger chosen forward of genesis so ingestion stays within the retention window — as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion) and [Ledger Source](./02-streaming-workflow.md#ledger-source). 
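
The resume behavior described in the bullets above (a flag written only after the artifact's `fsync`, and flag presence driving the skip decision) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the daemon's implementation: `DictMetaStore` is a hypothetical in-memory stand-in for the RocksDB meta store, `read_ledger` is a hypothetical callable standing in for a per-chunk BSB reader, the `{chunk_id:08d}.pack` file name and the `"1"` flag value are assumed for illustration, and chunks run sequentially rather than through the DAG worker pool.

```python
import os

GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000


class DictMetaStore:
    """Hypothetical in-memory stand-in for the RocksDB meta store."""

    def __init__(self):
        self._flags = {}

    def has(self, key):
        return key in self._flags

    def put(self, key, value):
        self._flags[key] = value


def write_and_fsync(path, payload):
    """Write payload to path and fsync the file before returning."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def process_chunk(chunk_id, data_dir, meta_store, read_ledger):
    """Idempotent per-chunk task; returns False when the chunk was already complete."""
    flag = f"chunk:{chunk_id:08d}:lfs"
    if meta_store.has(flag):
        return False  # completed by an earlier run, so skip it
    first = chunk_id * LEDGERS_PER_CHUNK + GENESIS_LEDGER
    payload = b"".join(read_ledger(seq) for seq in range(first, first + LEDGERS_PER_CHUNK))
    # Assumed file name; the real pack layout is defined by the packfile library.
    write_and_fsync(os.path.join(data_dir, f"{chunk_id:08d}.pack"), payload)
    meta_store.put(flag, "1")  # the flag is set only after the artifact is durable
    return True


def run_backfill(range_start_chunk_id, range_end_chunk_id, data_dir, meta_store, read_ledger):
    """Process an inclusive chunk range; return the chunk ids actually written."""
    written = []
    for chunk_id in range(range_start_chunk_id, range_end_chunk_id + 1):
        if process_chunk(chunk_id, data_dir, meta_store, read_ledger):
            written.append(chunk_id)
    return written
```

Re-invoking `run_backfill` with the same or a wider range after a crash only touches chunks whose flag is missing. A crash between `write_and_fsync` and `meta_store.put` leaves a file without a flag, and the next run simply overwrites it.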
For the distinction between *backfill (this subroutine)* and *Phase 1 (catchup) (the startup phase that invokes it)* — two terms that get conflated because their scopes overlap — see [02-streaming-workflow.md — Backfill vs Phase 1 (catchup)](./02-streaming-workflow.md#backfill-vs-phase-1-catchup). @@ -27,52 +32,24 @@ For the distinction between *backfill (this subroutine)* and *Phase 1 (catchup) ## Geometry -Stellar's first ledger is `GENESIS_LEDGER = 2` (not 0 or 1). -Every formula that maps `ledger_seq ↔ chunk_id` subtracts `GENESIS_LEDGER` to zero-base the axis. -In the pseudocode below, `cpi` in inline comments is shorthand for `CHUNKS_PER_TXHASH_INDEX`. +Stellar's first ledger is `GENESIS_LEDGER = 2`. Mapping functions subtract it to zero-base the `ledger_seq ↔ chunk_id` axis. ```python GENESIS_LEDGER = 2 LEDGERS_PER_CHUNK = 10_000 # hardcoded; not configurable -CHUNKS_PER_TXHASH_INDEX = # 1 / 10 / 100 / 1_000; default 1_000 +CHUNKS_PER_TXHASH_INDEX = 1000 # read from config, immutable after first run. 
Acceptable values - 1 / 10 / 100 / 1_000; default 1_000 LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK # at cpi=1_000 this is 10_000_000 - -chunk_id_of_ledger(ledger_seq) = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - # 56_342_637 → (56_342_637 - 2) // 10_000 = 5_634 - -first_ledger_in_chunk(chunk_id) = (chunk_id * LEDGERS_PER_CHUNK) + GENESIS_LEDGER - # chunk_id=5_634 → (5_634 * 10_000) + 2 = 56_340_002 — ends in ..._02 - -last_ledger_in_chunk(chunk_id) = ((chunk_id + 1) * LEDGERS_PER_CHUNK) + (GENESIS_LEDGER - 1) - # chunk_id=5_634 → ((5_635) * 10_000) + 1 = 56_350_001 — ends in ..._01 - -tx_index_id_of_chunk(chunk_id) = chunk_id // CHUNKS_PER_TXHASH_INDEX - # chunk_id=5_634 → tx_index_id=5 (at cpi=1_000) - -first_ledger_in_tx_index(tx_index_id) = (tx_index_id * LEDGERS_PER_INDEX) + GENESIS_LEDGER - # tx_index_id=5 → (5 * 10_000_000) + 2 = 50_000_002 - -last_ledger_in_tx_index(tx_index_id) = ((tx_index_id + 1) * LEDGERS_PER_INDEX) + (GENESIS_LEDGER - 1) - # tx_index_id=5 → ((6) * 10_000_000) + 1 = 60_000_001 ``` -Example rows at `CHUNKS_PER_TXHASH_INDEX = 1000` (default): - -| `tx_index_id` | First Ledger | Last Ledger | Chunks | -|---|---|---|---| -| `0` | `2` | `10_000_001` | `0 – 999` | -| `1` | `10_000_002` | `20_000_001` | `1_000 – 1_999` | -| `2` | `20_000_002` | `30_000_001` | `2_000 – 2_999` | - -All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). +- In pseudocode, `cpi` in inline comments is shorthand for `CHUNKS_PER_TXHASH_INDEX`. +- All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). --- ## Configuration -- The service loads a single TOML file; backfill reads the subset documented here. -- Daemon-level sections not consumed by backfill — `[CAPTIVE_CORE]`, `[ACTIVE_STORAGE]`, `[HISTORY_ARCHIVES]`, plus `RETENTION_LEDGERS` under `[SERVICE]` — are documented in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). 
+- Backfill reads the subset of the unified TOML config described below. Daemon-level keys unused by backfill are specified in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). ### TOML Config @@ -83,8 +60,6 @@ All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). | `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | | `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per tx index. Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. | -`[SERVICE]` also carries daemon-level keys not read by backfill — `RETENTION_LEDGERS` — see [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). - **[IMMUTABLE_STORAGE.LEDGERS]** (optional) | Key | Type | Default | Description | @@ -107,7 +82,7 @@ All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). | Key | Type | Default | Description | |-----|------|---------|-------------| -| `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/index` | Base path for RecSplit index files (permanent). | +| `PATH` | string | `{DEFAULT_DATA_DIR}/txhash/index` | Base path for RecSplit (minimal-perfect-hash index library) `.idx` files (permanent). | The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-backed mutable stores owned by the streaming workflow). 
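
Since the chunk and tx-index IDs that appear in these paths all derive from the geometry constants, it may help to see the mapping functions as runnable Python. The formulas below are the ones from the Geometry section as it stood before this patch trimmed them. Passing `cpi` (`CHUNKS_PER_TXHASH_INDEX`) as a parameter and the `format_id` helper are illustrative assumptions rather than part of the design.

```python
GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000  # hardcoded; not configurable


def chunk_id_of_ledger(ledger_seq):
    return (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK


def first_ledger_in_chunk(chunk_id):
    return chunk_id * LEDGERS_PER_CHUNK + GENESIS_LEDGER


def last_ledger_in_chunk(chunk_id):
    return (chunk_id + 1) * LEDGERS_PER_CHUNK + (GENESIS_LEDGER - 1)


def tx_index_id_of_chunk(chunk_id, cpi=1_000):
    return chunk_id // cpi


def first_ledger_in_tx_index(tx_index_id, cpi=1_000):
    return tx_index_id * cpi * LEDGERS_PER_CHUNK + GENESIS_LEDGER


def last_ledger_in_tx_index(tx_index_id, cpi=1_000):
    return (tx_index_id + 1) * cpi * LEDGERS_PER_CHUNK + (GENESIS_LEDGER - 1)


def format_id(any_id):
    """Hypothetical helper: uniform %08d zero-padding for chunk and tx-index IDs."""
    return f"{any_id:08d}"


# Worked example from the design doc: ledger 56_342_637 at cpi=1_000.
chunk = chunk_id_of_ledger(56_342_637)      # 5_634
print(first_ledger_in_chunk(chunk))         # 56340002
print(last_ledger_in_chunk(chunk))          # 56350001
print(tx_index_id_of_chunk(chunk))          # 5
print(format_id(chunk))                     # 00005634
```

Because every formula subtracts or adds `GENESIS_LEDGER`, chunk boundaries land on ledgers ending in `..._02` and `..._01` rather than on round numbers.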
@@ -179,7 +154,7 @@ With geometry and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is h ``` {DEFAULT_DATA_DIR}/ ├── meta/ -│ └── rocksdb/ ← Meta store (WAL always enabled) +│ └── rocksdb/ ← Meta store (WAL = Write-Ahead Log; always enabled) │ ├── ledgers/ ← IMMUTABLE_STORAGE.LEDGERS.PATH │ ├── 00000/ ← chunk_ids 0–999 (1_000 .pack files) @@ -194,7 +169,7 @@ With geometry and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is h │ ├── 00000/ ← chunk_ids 0–999 (3_000 files: 3 per chunk) │ │ ├── 00000000-events.pack ← compressed event blocks │ │ ├── 00000000-index.pack ← serialized roaring bitmaps -│ │ ├── 00000000-index.hash ← MPHF for term → slot lookup +│ │ ├── 00000000-index.hash ← MPHF (Minimal Perfect Hash Function) for term → slot lookup │ │ └── ... │ └── .../ │ diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 070cfcbb2..92e5700f1 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -6,15 +6,15 @@ The stellar-rpc daemon is the full-history RPC service. One binary, one invocati - Operator runs `stellar-rpc --config path/to/config.toml`. No subcommand. No `--mode` flag. No behavior-switching flags. - On every start the daemon runs four sequential startup phases, then enters a live ingestion loop it stays in until killed. -- Behavior across the three operator profiles (archive, pruning-history, tip-tracker) is determined entirely by TOML config — no profile flag. +- Behavior across the three operator profiles — **archive** (full history), **pruning-history** (retention-windowed history with BSB catchup), **tip-tracker** (retention-windowed history, no object store; captive-core-only) — is determined entirely by TOML config; no profile flag. Full matrix: [Operator Profiles](#operator-profiles). - Backfill (`01-backfill-workflow.md`) is used as an internal subroutine by Startup Phase 1 (catchup). 
Operators never invoke backfill directly. **What the daemon does end-to-end:** -- Validates config against immutable meta-store state (`CHUNKS_PER_TXHASH_INDEX` and `RETENTION_LEDGERS`). -- Catches up to the current network tip using BSB or captive core, whichever is configured. +- Validates config against immutable meta-store state: `CHUNKS_PER_TXHASH_INDEX` (chunks-per-tx-index constant; defines on-disk layout) and `RETENTION_LEDGERS` (history window in ledgers, or `0` for full history). Both detailed in [Configuration](#configuration). +- Catches up to the current **network tip** (most recent ledger the Stellar network has produced, sampled from the history archive — defined in [Terminology](#terminology)) using **BSB** (Buffered Storage Backend — remote object-store reader for `LedgerCloseMeta`; see [Ledger Source](#ledger-source)) or captive core (embedded `stellar-core` subprocess; see [Ledger Source](#ledger-source)), whichever is configured. - Hydrates any in-flight state left by a prior run. -- Ingests live ledgers from `CaptiveStellarCore` at ~1 per 6 seconds. -- Writes each live ledger to three active stores (ledger, txhash, events). +- Ingests live ledgers from `CaptiveStellarCore` (the stellar Go SDK's captive-core client type — wraps the embedded `stellar-core` subprocess) at ~1 per 6 seconds. +- Writes each live ledger to three **active stores** — mutable per-chunk or per-index RocksDB instances for ledger, txhash, events — detailed in [Active Store Architecture](#active-store-architecture). - Freezes active stores to immutable files at chunk and index boundaries in background. - Prunes past-retention indexes atomically when retention is configured. - Serves `getLedger`, `getTransaction`, `getEvents` only after startup phases complete. Returns HTTP 4xx during startup. @@ -28,7 +28,7 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - **Daemon** — the stellar-rpc binary running as one long-lived process. 
The only operator-facing entry point. - **Startup phases 1–4** — sequential bootstrap work the daemon runs once per process start, before serving queries. Not a lifecycle concept — once Phase 4 (live ingestion) is reached, it stays there until the process exits. [Details](#startup-sequence). - **Phase 1 (catchup)** — the startup phase that closes the gap between the last-committed ledger and the current network tip. Invokes the backfill subroutine internally. -- **Backfill (subroutine)** — a self-contained mechanism that ingests a known `[range_start, range_end]` chunk range via a static DAG of per-chunk tasks (`process_chunk`, `build_txhash_index`, `cleanup_txhash`). Specified in `01-backfill-workflow.md`. In the unified design, backfill is an internal callable only — no CLI entry point exists. +- **Backfill (subroutine)** — a self-contained mechanism that ingests a known `[range_start, range_end]` chunk range via a static DAG of per-chunk tasks (`process_chunk`, `build_txhash_index`, `cleanup_txhash`). Specified in `01-backfill-workflow.md`. In the unified design, backfill is an internal callable only — no CLI (command-line) entry point exists. - **Leapfrog** (colloquial) — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 (catchup) skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. Implemented by the `retention_aligned_start_chunk` helper (Phase 1 (catchup) callsite) and the `retention_aligned_resume_ledger` helper (`compute_resume_ledger`'s no-BSB fresh-start branch). - **`compute_resume_ledger`** — shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). 
Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` for Phase 4 (live ingestion). Runs post-Phase-3 so any in-flight freezes Phase 3 finished (and their newly-set `:lfs` flags) are visible to the scan. See [Compute Resume Ledger](#compute-resume-ledger). - **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside Phase 4 (live ingestion)'s ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. @@ -39,9 +39,9 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a - Events active store — per-chunk RocksDB (one instance per chunk; schema + column families per [getEvents full-history design](../../design-docs/getevents-full-history-design.md)). - **Immutable store** — on-disk files produced by freezing an active store. Three kinds: - Ledger pack file (one per chunk). - - RecSplit index `.idx` files (16 per index). + - **RecSplit** index `.idx` files (16 per index) — minimal-perfect-hash files for `txhash → ledger_seq` lookup. - Events cold segment (three files per chunk: `events.pack`, `index.pack`, `index.hash`). -- **Freeze transition** — a background goroutine that converts an active store's contents to immutable files and deletes the active store. Three transitions total per chunk (LFS, events) and one per index (RecSplit). +- **Freeze transition** — a background goroutine that converts an active store's contents to immutable files and deletes the active store. Three flavors: **LFS** (shorthand for the ledger active store → `.pack` file freeze) and **events** (events active store → cold segment) run per chunk; **RecSplit** (txhash active store → 16 `.idx` files) runs per index. - **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze. 
`first_ledger_in_chunk(chunk_id)` always ends in `..._02`; `last_ledger_in_chunk(chunk_id)` always ends in `..._01`. No partial chunks — every chunk on disk is a full 10_000-ledger chunk. - **Txhash index** (a.k.a. "tx index", "index") — `CHUNKS_PER_TXHASH_INDEX` consecutive chunks. Atomic unit of retention pruning. Formulas in [Geometry](#geometry). Both docs use "tx index" as the dominant narrative form; "txhash index" appears where the output's role as a txhash lookup is the emphasis. - **Chunk boundary** — the moment ingestion commits the last ledger of a chunk. Triggers background LFS + events freeze for that chunk. @@ -53,7 +53,7 @@ Terms used repeatedly throughout this doc. Skim on first read, refer back when a ## Geometry -See [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Streaming uses the same constants (`GENESIS_LEDGER`, `LEDGERS_PER_CHUNK`, `LEDGERS_PER_INDEX`, `CHUNKS_PER_TXHASH_INDEX`) and the same mapping functions (`chunk_id_of_ledger`, `first_ledger_in_chunk`, `last_ledger_in_chunk`, `tx_index_id_of_chunk`, `first_ledger_in_tx_index`, `last_ledger_in_tx_index`). +See [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Streaming uses the same constants (`GENESIS_LEDGER`, `LEDGERS_PER_CHUNK`, `LEDGERS_PER_INDEX`, `CHUNKS_PER_TXHASH_INDEX`), mapping functions, and derived helpers. --- @@ -63,7 +63,7 @@ Streaming reads the same TOML file as backfill, plus additional keys described b ### Shared Config (from backfill) -`[SERVICE]` (for `DEFAULT_DATA_DIR` + `CHUNKS_PER_TXHASH_INDEX`), `[BSB]`, `[IMMUTABLE_STORAGE.*]`, `[META_STORE]`, `[LOGGING]` are detailed in [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration). Streaming adds extra keys to `[SERVICE]` and introduces `[CAPTIVE_CORE]`, `[ACTIVE_STORAGE]`, `[HISTORY_ARCHIVES]` (below). 
+`[SERVICE]` (daemon-wide settings — `DEFAULT_DATA_DIR`, `CHUNKS_PER_TXHASH_INDEX`), `[BSB]` (Buffered Storage Backend source settings), `[IMMUTABLE_STORAGE.*]` (on-disk paths for immutable artifacts — ledger packs, events, raw txhash, txhash index), `[META_STORE]` (meta-store RocksDB path), `[LOGGING]` (log level + format) are detailed in [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration). Streaming adds extra keys to `[SERVICE]` and introduces `[CAPTIVE_CORE]` (embedded `stellar-core` subprocess settings), `[ACTIVE_STORAGE]` (active RocksDB paths), `[HISTORY_ARCHIVES]` (Stellar history-archive URLs for tip sampling) — all defined below. ### Immutable Keys (stored in meta store, fatal if changed) @@ -113,7 +113,7 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 - Same schema as in the backfill doc. Presence in the config file determines Phase 1 (catchup) behavior: - Present: Phase 1 (catchup) invokes backfill over the BSB (fast, parallel per-chunk catchup). - Absent: Phase 1 (catchup) is a no-op; Phase 4 (live ingestion)'s captive core archive-catches-up from a leapfrog'd `resume_ledger` (slower, but no object-store dep). -- See [Ledger Source](#ledger-source) for the BSB-source details and [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup) for the full split. +- See [Ledger Source](#ledger-source) for the BSB-source details and [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup) for the full split. 
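
To make the leapfrog'd `resume_ledger` concrete, here is a small sketch of the retention alignment, using the inline arithmetic the doc's pseudocode used before the named helpers were introduced. It assumes the network tip is passed in as a plain integer instead of being sampled from the history archives, and `is_valid_retention` is a hypothetical name for the check that `validate_config` performs.

```python
GENESIS_LEDGER = 2
LEDGERS_PER_CHUNK = 10_000


def is_valid_retention(retention_ledgers, cpi=1_000):
    """RETENTION_LEDGERS must be 0 or a positive multiple of LEDGERS_PER_INDEX."""
    ledgers_per_index = cpi * LEDGERS_PER_CHUNK
    return retention_ledgers == 0 or (
        retention_ledgers > 0 and retention_ledgers % ledgers_per_index == 0
    )


def _target_tx_index_id(tip_ledger, retention_ledgers, cpi):
    # Tx index containing (tip - retention), clamped at genesis.
    target_ledger = max(tip_ledger - retention_ledgers, GENESIS_LEDGER)
    target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK
    return target_chunk_id // cpi


def retention_aligned_start_chunk(tip_ledger, retention_ledgers, cpi=1_000):
    """First chunk Phase 1 should backfill; chunk 0 for full history."""
    if retention_ledgers == 0:
        return 0
    return _target_tx_index_id(tip_ledger, retention_ledgers, cpi) * cpi


def retention_aligned_resume_ledger(tip_ledger, retention_ledgers, cpi=1_000):
    """First ledger of the tx index containing (tip - retention)."""
    tx_index_id = _target_tx_index_id(tip_ledger, retention_ledgers, cpi)
    return tx_index_id * cpi * LEDGERS_PER_CHUNK + GENESIS_LEDGER


print(retention_aligned_resume_ledger(56_342_637, 10_000_000))  # 40000002
print(retention_aligned_start_chunk(56_342_637, 10_000_000))    # 4000
```

With a tip of 56_342_637 and a retention of 10_000_000 ledgers, the strict cutoff is ledger 46_342_637, but ingestion starts at 40_000_002 because the start always snaps down to an index boundary. The cost is at most `LEDGERS_PER_INDEX - 1` extra ledgers below strict retention.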
### CLI Flags @@ -143,11 +143,10 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 def validate_config(config, meta_store): cpi = config.service.chunks_per_txhash_index retention_ledgers = config.service.retention_ledgers - ledgers_per_index = cpi * LEDGERS_PER_CHUNK - if retention_ledgers != 0 and (retention_ledgers <= 0 or (retention_ledgers % ledgers_per_index) != 0): + if retention_ledgers != 0 and (retention_ledgers <= 0 or (retention_ledgers % LEDGERS_PER_INDEX) != 0): fatal(f"RETENTION_LEDGERS={retention_ledgers} must be 0 or a positive multiple of " - f"LEDGERS_PER_INDEX={ledgers_per_index}.") + f"LEDGERS_PER_INDEX={LEDGERS_PER_INDEX}.") if config.bsb is None and retention_ledgers == 0: fatal("[BSB] is absent AND RETENTION_LEDGERS=0 (full history). Full history requires " @@ -185,7 +184,7 @@ Three profiles emerge from config combinations. No profile flag. ## Meta Store Keys -Single RocksDB instance, WAL always enabled. Authoritative source for every startup decision. +Single RocksDB instance, WAL (Write-Ahead Log) always enabled. Authoritative source for every startup decision. ### Keys Introduced by Streaming @@ -284,7 +283,7 @@ The daemon maintains three active stores for the current ingestion position. All ## Ledger Source -- **Backfill (Phase 1 (catchup)) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk BSB via the `make_bsb` partial, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [01-backfill-workflow.md — Backfill vs Phase 1](./01-backfill-workflow.md#backfill-vs-phase-1-catchup). +- **Backfill (Phase 1 (catchup)) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk BSB via the `make_bsb` partial, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). 
- **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 (live ingestion) calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. - **`BSBSource`** is the backfill-only ledger source — one instance per `process_chunk`, interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`), torn down at end-of-task. @@ -344,16 +343,15 @@ def phase1_catchup(config, meta_store, make_bsb): if make_bsb is None: return # [BSB] absent → no-op - cpi = config.service.chunks_per_txhash_index retention_ledgers = config.service.retention_ledgers last_scheduled_end_chunk = -1 for iter_count in range(1, MAX_PHASE1_ITERATIONS + 1): network_tip_ledger = get_latest_network_tip(config.history_archives.urls) - end_chunk = ((network_tip_ledger - (GENESIS_LEDGER - 1)) // LEDGERS_PER_CHUNK) - 1 + end_chunk = last_completed_chunk_id(network_tip_ledger) if end_chunk <= last_scheduled_end_chunk: return # converged - start_chunk = retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi) + start_chunk = retention_aligned_start_chunk(network_tip_ledger, retention_ledgers) if end_chunk < start_chunk: return # leapfrog past tip log.info(f"phase1_catchup iter={iter_count}/{MAX_PHASE1_ITERATIONS} " @@ -365,7 +363,7 @@ def phase1_catchup(config, meta_store, make_bsb): f"check [BSB].NUM_WORKERS / BUFFER_SIZE (backlog trail in logs).") -def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi): +def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers): # Called by: phase1_catchup (per loop iteration) to compute range_start_chunk_id. # Returns the first chunk Phase 1 (catchup) should backfill: # - Archive profile (retention=0): chunk 0 (full history from genesis). 
@@ -375,10 +373,8 @@ def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi): # Worst case: up to LEDGERS_PER_INDEX - 1 extra ledgers below strict retention. if retention_ledgers == 0: return 0 - target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) - target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - target_tx_index_id = target_chunk_id // cpi - return target_tx_index_id * cpi + target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) + return first_chunk_id_of_tx_index_containing(target_ledger) ``` **Worker concurrency:** `run_backfill` caps DAG concurrency at `GOMAXPROCS`. Each `process_chunk` owns its own BSB instance (`make_bsb()`), prepares range for its 10_000 ledgers, reads, and tears down — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunkchunk_id-make_bsb). @@ -393,12 +389,10 @@ def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers, cpi): ```python def phase2_hydrate_txhash(config, meta_store): - cpi = config.service.chunks_per_txhash_index - # Sweep leftover .bin + flag for tx indexes already flagged complete (crash between # index:N:txhash set and cleanup_txhash finish). 
for tx_index_id in tx_index_ids_with_txhash_flag(meta_store): - for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): + for chunk_id in chunks_for_tx_index(tx_index_id): if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_if_exists(raw_txhash_path(chunk_id)) @@ -410,7 +404,7 @@ def phase2_hydrate_txhash(config, meta_store): txhash_store = open_active_txhash_store(config, incomplete_tx_index_id) try: - for chunk_id in range(incomplete_tx_index_id * cpi, (incomplete_tx_index_id + 1) * cpi): + for chunk_id in chunks_for_tx_index(incomplete_tx_index_id): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): continue bin_path = raw_txhash_path(chunk_id) @@ -444,12 +438,11 @@ def phase3_reconcile_orphans(config, meta_store): if last_committed_ledger is None: return # fresh start — nothing in flight - cpi = config.service.chunks_per_txhash_index - resume_chunk_id = (last_committed_ledger + 1 - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + resume_chunk_id = chunk_id_of_ledger(last_committed_ledger + 1) reconcile_ledger_store_dirs(config, meta_store, resume_chunk_id) reconcile_events_store_dirs(config, meta_store, resume_chunk_id) - reconcile_txhash_store_dirs(config, meta_store, resume_chunk_id // cpi, cpi) + reconcile_txhash_store_dirs(config, meta_store, tx_index_id_of_chunk(resume_chunk_id)) ``` Each `reconcile_*_store_dirs` helper scans its own active-store directory type and classifies each dir it finds: @@ -511,8 +504,8 @@ def validate_scan(scan, cpi): if actual != set(scan.events_chunks): fatal(":lfs / :events mismatch — process_chunk crashed mid-run, unrecovered") - first_complete_tx_index_id = (start + cpi - 1) // cpi - last_complete_tx_index_id = (end + 1) // cpi - 1 + first_complete_tx_index_id = first_fully_covered_tx_index_id(start) + last_complete_tx_index_id = last_fully_covered_tx_index_id(end) complete = set(range(first_complete_tx_index_id, last_complete_tx_index_id + 1)) missing = complete - 
set(scan.txhash_indexes) if missing: @@ -522,7 +515,7 @@ def validate_scan(scan, cpi): def validate_last_committed_consistency(scan, last_committed_ledger): # `streaming:last_committed_ledger = L` implies every chunk up to chunk_id_of_ledger(L)-1 # must have :lfs (the chunk containing L itself is the currently-ingesting one). - active_chunk_id = (last_committed_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + active_chunk_id = chunk_id_of_ledger(last_committed_ledger) required_last = active_chunk_id - 1 if required_last < 0: return @@ -539,12 +532,9 @@ def retention_aligned_resume_ledger(config): # combination, so this helper is never called in archive-from-genesis shape. network_tip_ledger = get_latest_network_tip(config.history_archives.urls) retention_ledgers = config.service.retention_ledgers - cpi = config.service.chunks_per_txhash_index - target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) - target_chunk_id = (target_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - target_tx_index_id = target_chunk_id // cpi - return (target_tx_index_id * cpi * LEDGERS_PER_CHUNK) + GENESIS_LEDGER + target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) + return first_ledger_of_tx_index_containing(target_ledger) ``` ### Phase 4 — Live Ingestion @@ -570,8 +560,8 @@ def open_active_stores_for_resume(config, meta_store, resume_ledger): # Open exactly one store per data type for the resume position. Subsequent # chunks / indexes are opened on-demand at boundary time — see on_chunk_boundary # and on_tx_index_boundary. - resume_chunk_id = (resume_ledger - GENESIS_LEDGER) // LEDGERS_PER_CHUNK - resume_tx_index_id = resume_chunk_id // config.service.chunks_per_txhash_index + resume_chunk_id = chunk_id_of_ledger(resume_ledger) + resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) return ActiveStores( ledger = open_or_create_ledger_store(config, resume_chunk_id), @@ -590,7 +580,6 @@ Single goroutine. 
Pull-based: the daemon drives sequential `GetLedger(seq)` call ```python def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): - cpi = config.service.chunks_per_txhash_index ledger_seq = resume_ledger while True: lcm = ledger_backend.GetLedger(ledger_seq) # blocks until available @@ -603,12 +592,12 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r ) meta_store.put("streaming:last_committed_ledger", ledger_seq) - chunk_id = (ledger_seq - GENESIS_LEDGER) // LEDGERS_PER_CHUNK + chunk_id = chunk_id_of_ledger(ledger_seq) if ledger_seq == last_ledger_in_chunk(chunk_id): on_chunk_boundary(chunk_id, active_stores, meta_store) # Every tx-index boundary is also a chunk boundary; index handler runs after chunk handler. - tx_index_id = chunk_id // cpi + tx_index_id = tx_index_id_of_chunk(chunk_id) if ledger_seq == last_ledger_in_tx_index(tx_index_id): on_tx_index_boundary(tx_index_id, active_stores, meta_store) @@ -742,27 +731,25 @@ Retention is enforced by a single background goroutine, woken at chunk boundarie def run_prune_lifecycle_loop(config, meta_store): # Initial sweep catches `"deleting"` state left by a prior crashed prune; # subsequent sweeps fire on chunk-boundary notifications. 
- cpi = config.service.chunks_per_txhash_index retention_ledgers = config.service.retention_ledgers - _run_prune_sweep(meta_store, retention_ledgers, cpi, config) + _run_prune_sweep(meta_store, retention_ledgers, config) while True: wait_for_chunk_boundary_notification() - _run_prune_sweep(meta_store, retention_ledgers, cpi, config) + _run_prune_sweep(meta_store, retention_ledgers, config) -def _run_prune_sweep(meta_store, retention_ledgers, cpi, config): - for tx_index_id in prunable_tx_index_ids(meta_store, retention_ledgers, cpi): +def _run_prune_sweep(meta_store, retention_ledgers, config): + for tx_index_id in prunable_tx_index_ids(meta_store, retention_ledgers): prune_tx_index(tx_index_id, meta_store, config) -def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): +def prunable_tx_index_ids(meta_store, retention_ledgers): # Returns tx_index_ids fully past retention and prune-eligible (`:txhash` is `"1"` or # `"deleting"`). Eligibility: last_committed_ledger > last_ledger_in_tx_index(N) + retention_ledgers. if retention_ledgers == 0: return [] last_committed_ledger = meta_store.get("streaming:last_committed_ledger") - ledgers_per_index = cpi * LEDGERS_PER_CHUNK - max_eligible_tx_index_id = ((last_committed_ledger - GENESIS_LEDGER - retention_ledgers) // ledgers_per_index) - 1 + max_eligible_tx_index_id = max_prunable_tx_index_id(last_committed_ledger, retention_ledgers) if max_eligible_tx_index_id < 0: return [] result = [] @@ -775,9 +762,8 @@ def prunable_tx_index_ids(meta_store, retention_ledgers, cpi): def prune_tx_index(tx_index_id, meta_store, config): # Two-phase marker: set "deleting" first, clear the key last. Idempotent on retry. 
- cpi = config.service.chunks_per_txhash_index meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting") - for chunk_id in range(tx_index_id * cpi, (tx_index_id + 1) * cpi): + for chunk_id in chunks_for_tx_index(tx_index_id): delete_if_exists(ledger_pack_path(chunk_id)) delete_events_segment(chunk_id) meta_store.delete(f"chunk:{chunk_id:08d}:lfs") From 911f43129f97e84ab857bd027b554a523331f0a9 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Sat, 25 Apr 2026 10:22:29 -0700 Subject: [PATCH 28/34] Design docs: backfill 3-arc restructure + cross-doc cleanup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Backfill: restructured to Shape (Overview/Geometry/Configuration/Directory/Meta Store Keys) → How Backfill Runs → Resilience. Overview now produces-first with scene-setting blurb; meta store introduced inline. Pseudocode uses named helpers (chunk_id_of_ledger, last_chunk_in_tx_index, max_prunable_tx_index_id, etc.) instead of inline GENESIS_LEDGER / LEDGERS_PER_CHUNK arithmetic. make_bsb partial removed (process_chunk now constructs BSBSource(config.bsb) directly). MAX_PHASE1_ITERATIONS retired (loop runs unbounded). build_dag inlined with worked examples; trailing-partial completion paths cross-referenced to streaming. Task Details deduplicated. PR URLs replaced with local design-doc paths. - Streaming: phase1_catchup reads BSB-tip directly via bsb_latest_complete_chunk_id (no network-tip sampling); make_bsb removed; GOMAXPROCS renamed to MAX_CPU_THREADS; retention_aligned_start_chunk param generalized to tip_ledger; broken cross-doc anchors to backfill fixed. - First-use inline definitions tightened across both docs (BSB / LFS / MPHF / WAL / RecSplit / GCS expanded; DAG / GOMAXPROCS / LCM / RPC etc. left bare per readability). 
--- .../design-docs/01-backfill-workflow.md | 308 ++++++++---------- .../design-docs/02-streaming-workflow.md | 45 +-- 2 files changed, 161 insertions(+), 192 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index 9b6765237..a3f789282 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -10,23 +10,25 @@ Three immutable artifact types, one per full-history RPC query, scoped to **chun | Immutable output | Query it enables | Scope | |-----------------|-----------------|-------| -| Ledger [pack file](https://github.com/stellar/stellar-rpc/pull/633) | `getLedger` | Per chunk (10_000 ledgers) | +| Ledger [pack file](../../design-docs/packfile-library.md) | `getLedger` | Per chunk (10_000 ledgers) | | Tx-index files | `getTransaction` | Per tx index (default 10_000_000 ledgers) | -| [Events cold segment](https://github.com/stellar/stellar-rpc/pull/635) | `getEvents` | Per chunk | +| [Events cold segment](../../design-docs/getevents-full-history-design.md) | `getEvents` | Per chunk | **How it does it:** - Backfill is a subroutine invoked by the RPC service's **Phase 1 (catchup)** — see [02-streaming-workflow.md — Phase 1](./02-streaming-workflow.md#phase-1--catchup). Internal to the daemon; no `full-history-backfill` subcommand, no per-run flags. - Ledger source is **BSB** (Buffered Storage Backend) — a remote object-store reader for `LedgerCloseMeta`, configured under `[BSB]` in the TOML config. Interface details in [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). -- Input: an integer chunk range `[range_start_chunk_id, range_end_chunk_id]` and a `make_bsb` partial function — calling `make_bsb()` returns a fresh `BSBSource` instance. -- Ingests historical ledgers via per-task BSB instances. 
Each `process_chunk` task (the per-chunk unit of work; full pseudocode in [Task Details](#task-details) below) calls `make_bsb()` to get its own `BSBSource`, calls `prepare_range` scoped to the chunk's 10_000 ledgers, reads in a loop, and tears down. Independent per chunk — no shared source state. +- Ingests historical ledgers one chunk at a time. Each chunk uses its own BSB reader scoped to that chunk's 10_000 ledgers; no shared source state across chunks. - Writes directly to immutable file formats — no RocksDB active stores (mutable RocksDB instances holding in-flight live-ingestion data; streaming's concern, see [02-streaming-workflow.md — Active Store Architecture](./02-streaming-workflow.md#active-store-architecture)). - Tracks per-chunk and per-tx-index completion in a small **meta store** — a dedicated RocksDB with WAL always on, separate from streaming's active stores. Each flag is written after its artifact's `fsync`, and flag presence drives all resume decisions. -- Schedules work as a DAG of idempotent tasks dispatched via a flat worker pool capped at `GOMAXPROCS`. -- Returns when every chunk in the range is complete; on crash, Phase 1 (catchup) re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency. -- **BSB-only.** Backfill does not use captive core (the embedded `stellar-core` subprocess the daemon runs for live ingestion) as a ledger source. Captive core belongs to Phase 4 (live ingestion); if BSB isn't configured, backfill is not invoked at all and Phase 4 (live ingestion)'s captive core catches up from a leapfrog'd resume ledger — a start ledger chosen forward of genesis so ingestion stays within the retention window — as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion) and [Ledger Source](./02-streaming-workflow.md#ledger-source). +- Schedules work as a DAG of idempotent tasks dispatched via a bounded worker pool. 
+- Returns when every chunk in the range is complete; on crash, Phase 1 (catchup) re-invokes with the same range and already-complete chunks are skipped via per-chunk idempotency primitives. +- **BSB-only.** + - Backfill does not use captive core (the embedded `stellar-core` subprocess that the service runs during live ingestion) as a ledger source. + - Captive core belongs to Phase 4 (live ingestion). + - If BSB isn't configured, backfill is not invoked at all and Phase 4 (live ingestion)'s captive core catches up from a leapfrog'd resume ledger — a start ledger chosen forward of genesis so ingestion stays within the retention window — as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion) and [Ledger Source](./02-streaming-workflow.md#ledger-source). -For the distinction between *backfill (this subroutine)* and *Phase 1 (catchup) (the startup phase that invokes it)* — two terms that get conflated because their scopes overlap — see [02-streaming-workflow.md — Backfill vs Phase 1 (catchup)](./02-streaming-workflow.md#backfill-vs-phase-1-catchup). +For the distinction between **backfill (this subroutine)** and **Phase 1 (catchup) (the startup phase that invokes it)** — two terms that get conflated because their scopes overlap, see [02-streaming-workflow.md — Backfill vs Phase 1 (catchup)](./02-streaming-workflow.md#backfill-vs-phase-1-catchup). --- @@ -37,7 +39,7 @@ Stellar's first ledger is `GENESIS_LEDGER = 2`. Mapping functions subtract it to ```python GENESIS_LEDGER = 2 LEDGERS_PER_CHUNK = 10_000 # hardcoded; not configurable -CHUNKS_PER_TXHASH_INDEX = 1000 # read from config, immutable after first run. Acceptable values - 1 / 10 / 100 / 1_000; default 1_000 +CHUNKS_PER_TXHASH_INDEX = 1_000 # read from config, immutable after first run.
Acceptable values - 1 / 10 / 100 / 1_000; default 1_000 LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK # at cpi=1_000 this is 10_000_000 ``` @@ -55,10 +57,10 @@ LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK **[SERVICE]** -| Key | Type | Default | Description | -|-----|------|---------|-------------| -| `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | -| `CHUNKS_PER_TXHASH_INDEX` | int | `1000` | Chunks per tx index. Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. | +| Key | Type | Default | Description | +|--------------------------------------|------|---------|-------------| +| `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | +| `CHUNKS_PER_TXHASH_INDEX` (optional) | int | `1000` | Chunks per tx index. Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. | **[IMMUTABLE_STORAGE.LEDGERS]** (optional) @@ -158,7 +160,7 @@ With geometry and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is h │ ├── ledgers/ ← IMMUTABLE_STORAGE.LEDGERS.PATH │ ├── 00000/ ← chunk_ids 0–999 (1_000 .pack files) -│ │ ├── 00000000.pack ← ledger pack file (PR #633) +│ │ ├── 00000000.pack ← ledger pack file for chunk_id=0 (ledgers 2–10_001) │ │ ├── 00000001.pack │ │ └── ... │ ├── 00001/ ← chunk_ids 1_000–1_999 @@ -208,14 +210,16 @@ Directory-count tradeoffs for a 2_000-chunk (20M-ledger) dataset: - **Nibble** = high 4 bits of `txhash[0]`, i.e., `txhash[0] >> 4`. Values `0`–`f`. Determines which of 16 CFs a txhash is routed to. - **Raw txhash format**: 36 bytes per entry, no header: `[txhash: 32 bytes][ledger_seq: 4 bytes big-endian]`. -- **Events cold segment**: see [getEvents full-history design](https://github.com/stellar/stellar-rpc/pull/635) for the full format. 
+- **Events cold segment**: see [getEvents full-history design](../../design-docs/getevents-full-history-design.md) for the full format. --- ## Meta Store Keys +*This section is a reference for the key schema and lifecycle. It reads more naturally after [How Backfill Runs](#how-backfill-runs) below, which defines the tasks that write and consume these keys.* + - Single RocksDB instance with WAL (Write-Ahead Log) always enabled. -- Authoritative source for crash recovery — all resume decisions derive from key presence. +- Authoritative for everything backfill decides: which chunks and tx indexes are done (progress tracker), which config values can't change across runs (e.g., `CHUNKS_PER_TXHASH_INDEX`, stored on first run and fatal if changed), and where to resume after a crash (every resume decision derives from key presence). ### Key Schema @@ -233,9 +237,6 @@ All IDs use uniform `%08d` zero-padding, matching the directory structure. - Each chunk flag is written independently after its output's fsync — a crash may leave some flags set and others absent for the same chunk. - On resume, each chunk's flags are checked independently — only missing outputs are produced. - WAL is always enabled — disabling it would invalidate all crash recovery. -- `chunk:{chunk_id:08d}:txhash` keys are deleted after the tx index is built (the raw `.bin` files they reference are also deleted); all other flags are permanent within backfill's scope. - -**Streaming's extension.** Streaming's prune path may transition `index:{tx_index_id:08d}:txhash` through an intermediate `"deleting"` value before clearing the key entirely. Backfill's `build_txhash_index` only ever writes `"1"`. See [02-streaming-workflow.md — Pruning](./02-streaming-workflow.md#pruning) for the prune mechanism. 
**Examples:** ``` @@ -247,26 +248,6 @@ index:00000000:txhash → "1" tx_index_id=0 RecSplit complete index:00000001:txhash → absent tx_index_id=1 not yet built ``` -### Validation Rules - -- `validate` checks argument sanity and defensively re-asserts `CHUNKS_PER_TXHASH_INDEX` against the meta store — the daemon's `validate_config` is the real enforcer; see [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode). -- No source probe. `run_backfill` trusts the caller's range and fires the DAG. Per-chunk idempotency means already-done chunks are no-ops; source-coverage problems surface at runtime as task failures — see [Error Handling](#error-handling). -- `[BSB]` must be configured whenever `run_backfill` is invoked. Phase 1 (catchup) only calls `run_backfill` when `[BSB]` is present. -- DAG worker cap is `GOMAXPROCS`. BSB's `NUM_WORKERS` is a per-BSB internal download pool, not a cross-task concurrency knob. - -### Partial Tx Index Ranges - -When the caller's chunk range does not span a complete tx index, the trailing chunks have: - -- Their raw `.bin` files on disk (inside `IMMUTABLE_STORAGE.TXHASH_RAW.PATH`). -- Their `chunk:{chunk_id:08d}:txhash` flags set in the meta store. -- No RecSplit `.idx` files (RecSplit is built only when every chunk of the tx index is ready). - -These trailing artifacts persist on disk after `run_backfill` returns. Phase 2 (`.bin` hydration) of the RPC service loads them into the active txhash RocksDB store on startup and then deletes the `.bin` files and `chunk:{chunk_id:08d}:txhash` flags (see [02-streaming-workflow.md — Phase 2](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin)). - -Ledger and events data are useful per-chunk and are not blocked by tx-index alignment — `chunk:{chunk_id:08d}:lfs` and `chunk:{chunk_id:08d}:events` flags are set as soon as each chunk's outputs are durable. 
- - ### Key Lifecycle ``` @@ -282,7 +263,11 @@ After a completed tx index: --- -## Tasks and Dependencies +## How Backfill Runs + +Backfill's work is a static DAG. The `run_backfill` orchestrator validates the caller's range, builds the DAG over that range, and dispatches it with a bounded-concurrency worker pool (`MAX_CPU_THREADS`-capped). Each task is idempotent and checks its own completion state — the scheduler is just the dispatcher. + +### Task Types and Dependencies The backfill DAG has three task types: @@ -295,8 +280,6 @@ The backfill DAG has three task types: - Each task is a black box to the DAG scheduler — it calls `execute()` and waits for return. - What happens inside (goroutines, I/O, parallelism) is up to the task. -### Dependency Diagram - For the chunks of one tx index (first chunk through last chunk): ``` @@ -313,87 +296,135 @@ process_chunk(chunk_id=last) ─┘ ### Main Flow -`run_backfill` is invoked by the daemon's [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) with an integer chunk range and a `make_bsb` partial: +`run_backfill` is invoked by the daemon's [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) with an integer chunk range: ```python -def run_backfill(config, range_start_chunk_id, range_end_chunk_id, make_bsb): - # make_bsb is a partial (e.g. functools.partial(BSBSource, config.bsb)). Each call - # returns a fresh BSBSource. Every process_chunk that needs to download ledgers - # owns its own BSB for its chunk's range — no shared-source state across tasks. 
- validate(config, range_start_chunk_id, range_end_chunk_id) - - dag = build_dag(config, range_start_chunk_id, range_end_chunk_id, make_bsb) - dag.execute(max_workers=GOMAXPROCS) +def run_backfill(config, range_start_chunk_id, range_end_chunk_id): + validate(range_start_chunk_id, range_end_chunk_id) + + dag = build_dag(config, range_start_chunk_id, range_end_chunk_id) + dag.execute(max_workers=MAX_CPU_THREADS) ``` -### Validation +### Pre-DAG Validation -- Runs before DAG construction, not as a DAG task. -- If it were a task: no-dependency tasks would start concurrently; a validation failure would leave in-flight work to cancel. -- Running it first → clean abort, no partial work. +- Validation runs in two layers, both pre-DAG: + 1. **Daemon startup** (`validate_config`) — runs once per process start, before any backfill is invoked. Authoritative enforcer of config-immutability: `CHUNKS_PER_TXHASH_INDEX` (and `RETENTION_LEDGERS`) cannot change across runs. Defined in [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode). + 2. **Per `run_backfill` call** (`validate`) — runs before DAG construction. Argument sanity only (chunk-range bounds). +- Why pre-DAG and not a DAG task: no-dependency tasks would start concurrently; a validation failure would leave in-flight work to cancel. Pre-DAG = clean abort, no partial work. +- No source probe. `run_backfill` trusts the caller's range; source-coverage problems surface at runtime as task failures — see [Error Handling](#error-handling). +- `[BSB]` must be configured. Phase 1 (catchup) only calls `run_backfill` when `[BSB]` is present. ```python -def validate(config, range_start_chunk_id, range_end_chunk_id): +def validate(range_start_chunk_id, range_end_chunk_id): # Argument sanity only. run_backfill trusts the caller's range — any source-coverage # issue (upper or lower bound) surfaces at runtime as a per-task get_ledger failure. 
assert range_start_chunk_id >= 0 assert range_end_chunk_id >= range_start_chunk_id - - # Defensive re-assert; daemon's validate_config owns the enforcement. - assert meta_store.get("config:chunks_per_txhash_index") == str(config.service.chunks_per_txhash_index) ``` ### DAG Setup ```python -def build_dag(config, range_start_chunk_id, range_end_chunk_id, make_bsb): +def build_dag(config, range_start_chunk_id, range_end_chunk_id): + # Invariant: range_start_chunk_id is always tx-index-aligned. + # Phase 1 (catchup) is the only caller of this function, and it aligns the start chunk ID to the nearest tx index boundary. + # So, there is never a partial-at-start that would create an unbuildable index. + # A partial-at-end (trailing partial) is normal: BSB-tip lands wherever network production is; mid-index is typical. + dag = new DAG() + first_index = tx_index_id_of_chunk(range_start_chunk_id) + last_index = tx_index_id_of_chunk(range_end_chunk_id) - # Tx indexes whose LAST chunk is in range: schedule process_chunk for in-range chunks - # only (prior chunks are already `:lfs`-flagged from a prior iteration) + build + - # cleanup. build has every chunk's .bin when it runs. - for tx_index_id in tx_indexes_ending_in_range(range_start_chunk_id, range_end_chunk_id, config): + for tx_index_id in range(first_index, last_index + 1): chunk_tasks = [] - for chunk_id in chunks_for_tx_index(tx_index_id, config): - if not (range_start_chunk_id <= chunk_id <= range_end_chunk_id): - continue - t = dag.add(ProcessChunkTask(chunk_id, make_bsb=make_bsb), deps=[]) - chunk_tasks.append(t.id) - b = dag.add(BuildTxHashIndexTask(tx_index_id), deps=chunk_tasks) - dag.add(CleanupTxHashTask(tx_index_id), deps=[b.id]) - - # Trailing partial tx index (last chunk past range_end): process_chunk only; a future - # iteration that covers the missing trailing chunks will schedule the build.
- for chunk_id in trailing_partial_tx_index_chunks(range_start_chunk_id, range_end_chunk_id, config): - dag.add(ProcessChunkTask(chunk_id, make_bsb=make_bsb), deps=[]) + for chunk_id in chunks_for_tx_index(tx_index_id): + if range_start_chunk_id <= chunk_id <= range_end_chunk_id: + t = dag.add(ProcessChunkTask(chunk_id, config), deps=[]) + chunk_tasks.append(t.id) + + # If the tx_index is fully covered (i.e., last chunk ≤ range_end), + # schedule build_txhash_index + cleanup_txhash. + # Otherwise the tx_index is the trailing partial — its last chunk isn't available for + # consumption yet — and only the process_chunk tasks above run; + # tx-build is skipped in that case. + if last_chunk_in_tx_index(tx_index_id) <= range_end_chunk_id: + build_task = dag.add(BuildTxHashIndexTask(tx_index_id), deps=chunk_tasks) + dag.add(CleanupTxHashTask(tx_index_id), deps=[build_task.id]) return dag ``` -- Trailing tx index whose last chunk is past `range_end_chunk_id`: `process_chunk` scheduled for in-range chunks only; no `build_txhash_index` / `cleanup_txhash`. -- `.bin` + `chunk:{chunk_id:08d}:txhash` flags persist until a future `run_backfill` covers the missing chunks OR [Phase 2 (`.bin` hydration)](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin) hydrates them — see [Partial Tx Index Ranges](#partial-tx-index-ranges). +**Examples:** ---- +- `input chunk range = [0, 5_999]`, `cpi = 1_000` → tx_indexes 0..5 fully covered and created; no trailing partial. +- `input chunk range = [0, 6_100]`, `cpi = 1_000` → tx_indexes 0..5 fully covered; tx_index 6 trailing partial with only chunks 6_000..6_100 created; `build_txhash_index` skipped for tx_index 6. + +**Trailing partial:** + +- On disk: chunks have `.bin` files + `:lfs` + `:events` + `:txhash` flags; `index:{tx_index_id:08d}:txhash` absent. +- Ledger and events data are not blocked by tx-index alignment — their flags land as each chunk's outputs are durable.
+- The deferred build runs later: via a subsequent `run_backfill` call when BSB covers the tail, or via streaming's live-ingestion path — see [02-streaming-workflow.md — Phase 2 (`.bin` hydration)](./02-streaming-workflow.md#phase-2--hydrate-txhash-data-from-bin) and [02-streaming-workflow.md — RecSplit Transition](./02-streaming-workflow.md#recsplit-transition). + +### DAG Scheduler + +- The subroutine builds a single DAG per invocation and executes it with bounded concurrency. +- The DAG is the only scheduling mechanism — no per-tx-index coordinators, no secondary worker pools. +- Each task's `execute()` is wrapped with a retry loop bounded by `MAX_RETRIES` (implementation-defined constant). Any transient failure (BSB errors, temporary I/O issues) triggers a retry at the task level. + +```python +def run_dag(dag, max_workers): + worker_slots = Semaphore(max_workers) + runnable_tasks = ThreadSafeQueue(dag.tasks_with_no_pending_dependencies()) + + def execute_task(task): + for attempt in range(1, MAX_RETRIES + 1): + error = task.execute() + if error is None: + break + if attempt == MAX_RETRIES: + mark_failed(task, error) # halt dependents + break + log.warn("retry", task, attempt, error) + worker_slots.release() + + for downstream_task in dag.dependents_of(task): + downstream_task.mark_dependency_done(task) + if downstream_task.all_dependencies_done(): + runnable_tasks.push(downstream_task) + + while runnable_tasks: + current_task = runnable_tasks.pop() + worker_slots.acquire() + run_in_background(execute_task, current_task) +``` + +### Worker Pool + +- Single flat pool of `max_workers = MAX_CPU_THREADS` slots. +- Any mix of task types can occupy slots simultaneously. +- `process_chunk`: 1 slot per task. +- `build_txhash_index`: 1 slot per task (uses many goroutines internally). +- `cleanup_txhash`: 1 slot per task. +- BSB's `NUM_WORKERS` is a per-BSB internal download pool, not a cross-task concurrency knob. 
-## Task Details +### Task Details -### process_chunk(chunk_id, make_bsb) +#### `process_chunk` - Processes a single 10_000-ledger chunk end-to-end. -- Occupies one DAG worker slot. -- Only produces missing outputs — checks each flag independently. -- Internal concurrency is an implementation detail. +- Idempotent at flag granularity — produces only outputs whose flag is missing; a partially-completed chunk resumes from where it left off. -**Outputs** (all produced in a single task, only if missing): +**Outputs** (each only if its flag is missing): -- Ledger pack file (`{chunk_id:08d}.pack`) — compressed ledger data in [packfile format](https://github.com/stellar/stellar-rpc/pull/633). -- Raw txhash flat file (`{chunk_id:08d}.bin`) — 36-byte entries consumed by RecSplit builder. -- Events cold segment (`events.pack` + `index.pack` + `index.hash`) — per [getEvents design](https://github.com/stellar/stellar-rpc/pull/635). +- Ledger pack file (`{chunk_id:08d}.pack`) — see [packfile format](../../design-docs/packfile-library.md). +- Raw txhash flat file (`{chunk_id:08d}.bin`) — 36-byte entries (`txhash[32]` + `ledgerSeq[4]`) consumed by the RecSplit builder. +- Events cold segment (`events.pack` + `index.pack` + `index.hash`) — see [getEvents design](../../design-docs/getevents-full-history-design.md). **Pseudocode:** ```python -def process_chunk(chunk_id, make_bsb): +def process_chunk(chunk_id, config): first_ledger = first_ledger_in_chunk(chunk_id) last_ledger = last_ledger_in_chunk(chunk_id) @@ -406,13 +437,13 @@ def process_chunk(chunk_id, make_bsb): # If :lfs is already on disk, read from the local packfile — no BSB, no network. # Otherwise instantiate a per-task BSB scoped to THIS chunk's 10_000 ledgers. 
if need_lfs: - ledger_reader = make_bsb() + ledger_reader = BSBSource(config.bsb) ledger_reader.prepare_range(first_ledger, last_ledger) else: ledger_reader = local_packfile(ledger_pack_path(chunk_id)) - ledger_writer = packfile.create(ledger_pack_path(chunk_id), overwrite=True) if need_lfs else None - txhash_writer = open(raw_txhash_path(chunk_id), overwrite=True) if need_txhash else None + ledger_writer = packfile.create(ledger_pack_path(chunk_id), overwrite=True) if need_lfs else None + txhash_writer = open(raw_txhash_path(chunk_id), overwrite=True) if need_txhash else None events_writer = events_segment.create(events_segment_path(chunk_id), overwrite=True) if need_events else None try: @@ -436,24 +467,18 @@ def process_chunk(chunk_id, make_bsb): ledger_reader.close() # BSB: tears down the per-task instance. Local packfile: closes file handle. ``` -Key properties: +**Notes:** -- Only missing outputs are produced — a partially-completed chunk resumes from where it left off. -- If the ledger pack file (`:lfs` flag) is already present, reads from local NVMe instead of the source (avoids redundant download). -- Each flag is written independently after its output's fsync — no atomic WriteBatch needed. -- `packfile.create()` with `overwrite=True` handles truncation of partial files from prior crashes — no explicit `delete_if_exists` check needed. -- Naturally extends to new data types (add a fourth flag). +- If `:lfs` is set (pack file already on disk), reads from local NVMe instead of BSB — avoids redundant downloads on restart. +- Each flag is written independently after its output's `fsync`; no atomic WriteBatch needed. +- `packfile.create(..., overwrite=True)` handles truncation of partial files from prior crashes; no explicit cleanup before write. +- Each `process_chunk` owns its own BSB instance, scoped to the chunk's 10_000 ledgers and torn down at task exit. 
Cross-task concurrency cap is the DAG [Worker Pool](#worker-pool); the BSB interface is documented in [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). +- Adding a new data type = adding a fourth flag + writer; no other task changes. -**Source concurrency.** -- Each `process_chunk` owns its own BSB instance; DAG dispatches up to `GOMAXPROCS` tasks in parallel. -- BSB's internal `NUM_WORKERS` is the per-instance download pool — not a cross-task concurrency knob. `6_000` chunks in the run means `6_000` independent BSB instances over the run's lifetime, up to `GOMAXPROCS` alive at any moment. -- Interface: see [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). - -### build_txhash_index(tx_index_id) +#### `build_txhash_index` - Builds the RecSplit index for one completed tx index. -- Occupies one DAG worker slot, but spawns several goroutines internally. -- The DAG guarantees all chunk `.bin` files exist before this runs. +- Occupies one DAG worker slot but spawns multiple goroutines internally (per-stage worker counts are in the pseudocode). **Pseudocode:** @@ -486,82 +511,37 @@ def build_txhash_index(tx_index_id): meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") ``` -Key properties: +**Notes:** -- COUNT and ADD each read all `.bin` files (two full passes over the data). -- BUILD runs 16 goroutines in parallel (one per CF) — each CF is independent. -- VERIFY always runs (there is no `--verify-recsplit=false` escape hatch — backfill trades throughput for correctness every time). -- All-or-nothing recovery: if `index:{tx_index_id:08d}:txhash` is absent on restart → delete partial `.idx` files → rerun entire build. +- VERIFY always runs — no `--verify-recsplit=false` escape hatch; backfill trades throughput for correctness every time. +- All-or-nothing recovery on restart: absent `index:{tx_index_id:08d}:txhash` ⇒ delete partial `.idx` files and re-run the full build. 
-### cleanup_txhash(tx_index_id) +#### `cleanup_txhash` -- Runs after `build_txhash_index` completes successfully. +- Runs after `build_txhash_index` completes successfully. Modeled as a separate DAG task (not inline in `build_txhash_index`) so crash recovery falls out naturally — on restart, the DAG sees the tx-index flag set but per-chunk `:txhash` flags still present, and cleanup re-runs as a normal task. **Pseudocode:** ```python def cleanup_txhash(tx_index_id): - for chunk_id in chunks_for_tx_index(tx_index_id, config): + for chunk_id in chunks_for_tx_index(tx_index_id): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): continue delete_if_exists(raw_txhash_path(chunk_id)) # idempotent; crash-between is safe meta_store.delete(f"chunk:{chunk_id:08d}:txhash") ``` -Key properties: +**Notes:** -- Modeled as a separate DAG task (not inline in `build_txhash_index`) so crash recovery works naturally. -- Per-chunk idempotency: each chunk checks its own `chunk:{chunk_id:08d}:txhash` key before deleting — a crash mid-cleanup resumes from where cleanup left off. -- On restart: DAG sees the tx-index key present (build complete) but `chunk:{chunk_id:08d}:txhash` keys still exist → cleanup runs as a normal task. +- Per-chunk idempotency: each chunk checks its own `chunk:{chunk_id:08d}:txhash` flag before deleting; a crash mid-cleanup resumes safely. --- -## Execution Model +## Resilience -### DAG Scheduler +Crash recovery and error handling share one foundation: flag-after-fsync makes the meta store authoritative, and every task checks its own flags before doing work. Transient failures retry at BSB-internal and task-level layers; persistent failures abort the run, and on restart already-complete work is skipped. -- The subroutine builds a single DAG per invocation and executes it with bounded concurrency. -- The DAG is the only scheduling mechanism — no per-tx-index coordinators, no secondary worker pools. 
-- Each task's `execute()` is wrapped with a retry loop bounded by `MAX_RETRIES` (implementation-defined constant). Any transient failure (BSB errors, temporary I/O issues) triggers a retry at the task level. - -```python -def run_dag(dag, max_workers): - worker_slots = Semaphore(max_workers) - runnable_tasks = ThreadSafeQueue(dag.tasks_with_no_pending_dependencies()) - - def execute_task(task): - for attempt in range(1, MAX_RETRIES + 1): - error = task.execute() - if error is None: - break - if attempt == MAX_RETRIES: - mark_failed(task, error) # halt dependents - break - log.warn("retry", task, attempt, error) - worker_slots.release() - - for downstream_task in dag.dependents_of(task): - downstream_task.mark_dependency_done(task) - if downstream_task.all_dependencies_done(): - runnable_tasks.push(downstream_task) - - while runnable_tasks: - current_task = runnable_tasks.pop() - worker_slots.acquire() - run_in_background(execute_task, current_task) -``` - -### Worker Pool - -- Single flat pool of `max_workers = GOMAXPROCS` slots. -- Any mix of task types can occupy slots simultaneously. -- `process_chunk`: 1 slot per task. -- `build_txhash_index`: 1 slot per task (uses many goroutines internally). -- `cleanup_txhash`: 1 slot per task. - ---- - -## Crash Recovery +### Crash Recovery No separate reconciliation phase — every task's `execute()` checks its own completion state: @@ -580,9 +560,7 @@ Three invariants make this work: The daemon acquires a directory flock on the meta-store at startup. A second process against the same datadir fails immediately. ---- - -## Error Handling +### Error Handling Two layers of retry: diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 92e5700f1..4033a8dc1 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -32,7 +32,7 @@ Terms used repeatedly throughout this doc. 
Skim on first read, refer back when a - **Leapfrog** (colloquial) — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 (catchup) skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. Implemented by the `retention_aligned_start_chunk` helper (Phase 1 (catchup) callsite) and the `retention_aligned_resume_ledger` helper (`compute_resume_ledger`'s no-BSB fresh-start branch). - **`compute_resume_ledger`** — shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` for Phase 4 (live ingestion). Runs post-Phase-3 so any in-flight freezes Phase 3 finished (and their newly-set `:lfs` flags) are visible to the scan. See [Compute Resume Ledger](#compute-resume-ledger). - **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside Phase 4 (live ingestion)'s ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. -- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Always sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS`, wrapped in the `get_latest_network_tip()` helper (handles retries + the archive-tip-lags-true-tip-by-up-to-63-ledgers quirk). Called only in startup-phase contexts: the Phase 1 (catchup) loop per iter, and `retention_aligned_resume_ledger` for the no-BSB fresh-start case. Phase 4 (live ingestion) steady state does NOT sample tip. Different from `last_committed_ledger` (the daemon's own progress). 
+- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. Sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS`, wrapped in the `get_latest_network_tip()` helper (handles retries + the archive-tip-lags-true-tip-by-up-to-63-ledgers quirk). Called only by `retention_aligned_resume_ledger` (no-BSB tip-tracker fresh-start case). Phase 1 (catchup) reads BSB directly via `bsb_latest_complete_chunk_id` and never samples the network tip. Phase 4 (live ingestion) steady state does NOT sample tip either. Different from `last_committed_ledger` (the daemon's own progress). - **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: - Ledger active store — a per-chunk RocksDB (one instance per chunk). - TxHash active store — a per-index RocksDB with 16 column families (one instance per index). @@ -283,11 +283,10 @@ The daemon maintains three active stores for the current ingestion position. All ## Ledger Source -- **Backfill (Phase 1 (catchup)) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk BSB via the `make_bsb` partial, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). +- **Backfill (Phase 1 (catchup)) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk `BSBSource` from `[BSB]` config, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). - **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 (live ingestion) calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. 
- **`BSBSource`** is the backfill-only ledger source — one instance per `process_chunk`, interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`), torn down at end-of-task. -- **`make_bsb_partial(config)`** returns a partial that instantiates `BSBSource(config.bsb)` per call; returns `None` when `[BSB]` is absent so `phase1_catchup` can short-circuit. --- @@ -319,8 +318,7 @@ def run_rpc_service(config): meta_store = open_meta_store(config) validate_config(config, meta_store) start_http_server(config) - make_bsb = make_bsb_partial(config) - phase1_catchup(config, meta_store, make_bsb) + phase1_catchup(config, meta_store) phase2_hydrate_txhash(config, meta_store) phase3_reconcile_orphans(config, meta_store) resume_ledger = compute_resume_ledger(config, meta_store) @@ -331,39 +329,32 @@ Query serving is gated on Phase 4 (live ingestion) being reached — see [Query ### Phase 1 — Catchup -- **No-op path:** if `make_bsb is None` (no `[BSB]` configured), Phase 1 (catchup) returns immediately. Phase 4 (live ingestion)'s captive core will catch up from a leapfrog'd resume ledger. -- **BSB path:** runs the backfill subroutine (`run_backfill` from [01-backfill-workflow.md](./01-backfill-workflow.md)) once per source-tip sample, until the gap closes to less than one chunk. -- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 (catchup) persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. +- **No-op path:** if `config.bsb is None` (no `[BSB]` configured), Phase 1 (catchup) returns immediately. Phase 4 (live ingestion)'s captive core will catch up from a leapfrog'd resume ledger. 
+- **BSB path:** runs the backfill subroutine (`run_backfill` from [01-backfill-workflow.md](./01-backfill-workflow.md)) once per BSB-tip sample, until BSB has no new complete chunks beyond the last scheduled range. +- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id, config)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 (catchup) persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. +- Phase 1 reads from BSB, so the relevant horizon is BSB's latest chunk-aligned position — not the network tip. The gap between BSB's tip and the actual network tip (typically minutes of upload lag) is closed by Phase 4 (live ingestion)'s captive core. ```python -MAX_PHASE1_ITERATIONS = 5 # safety-net cap; hitting it means BSB is degraded. - - -def phase1_catchup(config, meta_store, make_bsb): - if make_bsb is None: +def phase1_catchup(config, meta_store): + if config.bsb is None: return # [BSB] absent → no-op retention_ledgers = config.service.retention_ledgers last_scheduled_end_chunk = -1 - for iter_count in range(1, MAX_PHASE1_ITERATIONS + 1): - network_tip_ledger = get_latest_network_tip(config.history_archives.urls) - end_chunk = last_completed_chunk_id(network_tip_ledger) + while True: + end_chunk = bsb_latest_complete_chunk_id(config.bsb) if end_chunk <= last_scheduled_end_chunk: - return # converged - start_chunk = retention_aligned_start_chunk(network_tip_ledger, retention_ledgers) + return # BSB has no new complete chunks + start_chunk = retention_aligned_start_chunk(last_ledger_in_chunk(end_chunk), retention_ledgers) if end_chunk < start_chunk: return # leapfrog past tip - log.info(f"phase1_catchup iter={iter_count}/{MAX_PHASE1_ITERATIONS} " - f"tip={network_tip_ledger} range=[{start_chunk}, {end_chunk}]") - run_backfill(config, start_chunk, end_chunk, make_bsb) + log.info(f"phase1_catchup bsb_tip_chunk={end_chunk} 
range=[{start_chunk}, {end_chunk}]") + run_backfill(config, start_chunk, end_chunk) last_scheduled_end_chunk = end_chunk - fatal(f"phase1_catchup exceeded {MAX_PHASE1_ITERATIONS} iters; " - f"check [BSB].NUM_WORKERS / BUFFER_SIZE (backlog trail in logs).") - -def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers): +def retention_aligned_start_chunk(tip_ledger, retention_ledgers): # Called by: phase1_catchup (per loop iteration) to compute range_start_chunk_id. # Returns the first chunk Phase 1 (catchup) should backfill: # - Archive profile (retention=0): chunk 0 (full history from genesis). @@ -373,11 +364,11 @@ def retention_aligned_start_chunk(network_tip_ledger, retention_ledgers): # Worst case: up to LEDGERS_PER_INDEX - 1 extra ledgers below strict retention. if retention_ledgers == 0: return 0 - target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) + target_ledger = max(tip_ledger - retention_ledgers, GENESIS_LEDGER) return first_chunk_id_of_tx_index_containing(target_ledger) ``` -**Worker concurrency:** `run_backfill` caps DAG concurrency at `GOMAXPROCS`. Each `process_chunk` owns its own BSB instance (`make_bsb()`), prepares range for its 10_000 ledgers, reads, and tears down — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunkchunk_id-make_bsb). +**Worker concurrency:** `run_backfill` caps DAG concurrency at `MAX_CPU_THREADS`. Each `process_chunk` owns its own `BSBSource` instance, prepares range for its 10_000 ledgers, reads, and tears down — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunk). **Retention effect:** retention determines Phase 1 (catchup)'s chunk range. Catchup time ≈ `retention_window / (BSB throughput)`. 
From a94c92b3f5a75e298d58ec638f7b959ff0fff2ba Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Sat, 25 Apr 2026 16:27:55 -0700 Subject: [PATCH 29/34] Design docs: streaming editorial pass + cleanup-ordering invariant MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Streaming doc: ≥6 cold-read passes — Resilience consolidation (Crash Recovery + Concurrent Access + Error Handling under one umbrella), Terminology rewritten in plain English, dropped impl leaks (CaptiveStellarCore, RocksDB-as-noun, sync.Mutex / chan struct{} / sync.WaitGroup, daemon vocabulary → "service", Alice persona, function-name jargon in prose). - Backfill doc: H4 → H3 + horizontal-rule separators for the three task functions (process_chunk / build_txhash_index / cleanup_txhash); cleanup_txhash comment tightened to capture the file-then-flag rule plainly. - Both docs: rename CHUNKS_PER_TXHASH_INDEX → CHUNKS_PER_TX_INDEX, LEDGERS_PER_INDEX → LEDGERS_PER_TX_INDEX (and lowercase meta-store key forms). - Cleanup-ordering invariant added to Flag Semantics (file-before-flag-delete) and applied uniformly to Phase 2 hydration (Sweep 1 + Sweep 2 reordered, Sweep 3 filesystem scan retired), prune_tx_index (defense-in-depth on .bin + per-chunk :txhash flag), and cleanup_txhash. - Phase 2 + Resilience: expand Compound Recovery Scenarios with test-actionable Phase 2 crash points; pseudocode comments trimmed to a single block-level note per cleanup site. - Active Store Architecture pruned: 3-row Max Concurrent Stores table → bold-italic one-liner; redundant RocksDB / WAL bullets collapsed into the lead-in; Synchronous open cost trimmed to a single line. - README: streaming-doc scope row mentions resilience. Anchors revalidated across all design-doc files; no breakage. 
--- .../design-docs/01-backfill-workflow.md | 86 ++-- .../design-docs/02-streaming-workflow.md | 381 +++++++++--------- full-history/design-docs/README.md | 4 +- 3 files changed, 247 insertions(+), 224 deletions(-) diff --git a/full-history/design-docs/01-backfill-workflow.md b/full-history/design-docs/01-backfill-workflow.md index a3f789282..e09385a1c 100644 --- a/full-history/design-docs/01-backfill-workflow.md +++ b/full-history/design-docs/01-backfill-workflow.md @@ -2,7 +2,8 @@ ## Overview -Backfill is the RPC service's historical-ingestion subroutine — it pulls ledgers from a remote object store (BSB) and writes them as immutable, query-ready artifacts on local disk. It runs once per daemon start, as part of Phase 1 (catchup), to close the gap between on-disk state and the current network tip before live ingestion takes over. Interruption at any point leaves recoverable state; on restart, already-complete work is skipped. +Backfill is the RPC service's historical-ingestion subroutine — it pulls ledgers from a configured remote object store (GCS or S3) and writes them as **immutable, query-ready artifacts** on local disk. +It runs once per service start, as part of Phase 1 (catchup), to close the gap between on-disk state and the current network tip before live ingestion takes over. Interruption at any point leaves recoverable state; on restart, already-complete work is skipped. **What it produces:** @@ -16,7 +17,7 @@ Three immutable artifact types, one per full-history RPC query, scoped to **chun **How it does it:** -- Backfill is a subroutine invoked by the RPC service's **Phase 1 (catchup)** — see [02-streaming-workflow.md — Phase 1](./02-streaming-workflow.md#phase-1--catchup). Internal to the daemon; no `full-history-backfill` subcommand, no per-run flags. +- Backfill is a subroutine invoked by the RPC service's **Phase 1 (catchup)** — see [02-streaming-workflow.md — Phase 1](./02-streaming-workflow.md#phase-1--catchup). 
Internal to the service; no `full-history-backfill` subcommand, no per-run flags. - Ledger source is **BSB** (Buffered Storage Backend) — a remote object-store reader for `LedgerCloseMeta`, configured under `[BSB]` in the TOML config. Interface details in [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). - Ingests historical ledgers one chunk at a time. Each chunk uses its own BSB reader scoped to that chunk's 10_000 ledgers; no shared source state across chunks. - Writes directly to immutable file formats — no RocksDB active stores (mutable RocksDB instances holding in-flight live-ingestion data; streaming's concern, see [02-streaming-workflow.md — Active Store Architecture](./02-streaming-workflow.md#active-store-architecture)). @@ -28,7 +29,7 @@ Three immutable artifact types, one per full-history RPC query, scoped to **chun - Captive core belongs to Phase 4 (live ingestion) - if BSB isn't configured, backfill is not invoked at all and Phase 4 (live ingestion)'s captive core catches up from a leapfrog'd resume ledger — a start ledger chosen forward of genesis so ingestion stays within the retention window — as part of normal startup. See [02-streaming-workflow.md — Phase 4](./02-streaming-workflow.md#phase-4--live-ingestion) and [Ledger Source](./02-streaming-workflow.md#ledger-source). -For the distinction between **backfill (this subroutine)** and **Phase 1 (catchup) (the startup phase that invokes it)** — two terms that get conflated because their scopes overlap, refer [02-streaming-workflow.md — Backfill vs Phase 1 (catchup)](./02-streaming-workflow.md#backfill-vs-phase-1-catchup). +For the distinction between **backfill (this subroutine)** and **Phase 1 (catchup) (the startup phase that invokes backfill)**, two terms that get conflated because their scopes overlap, see [02-streaming-workflow.md — Backfill vs Phase 1 (catchup)](./02-streaming-workflow.md#backfill-vs-phase-1-catchup). 
--- @@ -38,20 +39,19 @@ Stellar's first ledger is `GENESIS_LEDGER = 2`. Mapping functions subtract it to ```python GENESIS_LEDGER = 2 -LEDGERS_PER_CHUNK = 10_000 # hardcoded; not configurable -CHUNKS_PER_TXHASH_INDEX = 1_000 # read from config, immutable after first run. Acceptable values - 1 / 10 / 100 / 1_000; default 1_000 -LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK - # at cpi=1_000 this is 10_000_000 +LEDGERS_PER_CHUNK = 10_000 # hardcoded; not configurable +CHUNKS_PER_TX_INDEX = 1_000 # read from config, immutable after first run. Acceptable values - 1, 10, 100, 1_000; default 1_000 +LEDGERS_PER_TX_INDEX = CHUNKS_PER_TX_INDEX * LEDGERS_PER_CHUNK # at cpi=1_000 this is 10_000_000 ``` -- In pseudocode, `cpi` in inline comments is shorthand for `CHUNKS_PER_TXHASH_INDEX`. +- In pseudocode, `cpi` in inline comments is shorthand for `CHUNKS_PER_TX_INDEX`. - All IDs use uniform `%08d` zero-padding (supports up to `99_999_999`). --- ## Configuration - -- Backfill reads the subset of the unified TOML config described below. Daemon-level keys unused by backfill are specified in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration). +Backfill reads the subset of the unified TOML config described below. +_Service-level keys, used by the streaming flow, are specified in [02-streaming-workflow.md — Configuration](./02-streaming-workflow.md#configuration)._ ### TOML Config @@ -60,7 +60,7 @@ LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK | Key | Type | Default | Description | |--------------------------------------|------|---------|-------------| | `DEFAULT_DATA_DIR` | string | **required** | Base directory for meta store and default storage paths. | -| `CHUNKS_PER_TXHASH_INDEX` (optional) | int | `1000` | Chunks per tx index. Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. | +| `CHUNKS_PER_TX_INDEX` (optional) | int | `1000` | Chunks per tx index. 
Defines data layout; stored in the meta store on first run and fatal if changed on any subsequent run. | **[IMMUTABLE_STORAGE.LEDGERS]** (optional) @@ -88,7 +88,7 @@ LEDGERS_PER_INDEX = CHUNKS_PER_TXHASH_INDEX * LEDGERS_PER_CHUNK The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-backed mutable stores owned by the streaming workflow). -**[BSB]** — Buffered Storage Backend (optional at the daemon level; required when [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) selects `BSBSource`) +**[BSB]** — Buffered Storage Backend (optional at the service level; required when [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) selects `BSBSource`) | Key | Type | Default | Description | |-----|------|---------|-------------| @@ -102,8 +102,8 @@ The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-back | Key | Type | Default | Description | |-----|------|---------|-------------| -| `LEVEL` | string | `"info"` | Minimum log severity. Accepted values: `debug` / `info` / `warn` / `error`. Daemon CLI flag `--log-level` wins when both are set. | -| `FORMAT` | string | `"text"` | Log output format. Accepted values: `text` / `json`. Daemon CLI flag `--log-format` wins when both are set. | +| `LEVEL` | string | `"info"` | Minimum log severity. Accepted values: `debug` / `info` / `warn` / `error`. Service CLI flag `--log-level` wins when both are set. | +| `FORMAT` | string | `"text"` | Log output format. Accepted values: `text` / `json`. Service CLI flag `--log-format` wins when both are set. 
| **[META_STORE]** (optional) @@ -116,7 +116,7 @@ The `IMMUTABLE_STORAGE` prefix disambiguates from `ACTIVE_STORAGE` (RocksDB-back ```toml [SERVICE] DEFAULT_DATA_DIR = "/data/stellar-rpc" -CHUNKS_PER_TXHASH_INDEX = 1000 +CHUNKS_PER_TX_INDEX = 1000 [IMMUTABLE_STORAGE.LEDGERS] PATH = "/mnt/nvme/ledgers" @@ -187,11 +187,11 @@ With geometry and storage paths (`IMMUTABLE_STORAGE.*`) defined above, here is h └── .../ ``` -`CHUNKS_PER_TXHASH_INDEX` only affects `txhash/index/` — all other trees use the hardcoded 1_000-chunk `bucket_id` grouping regardless. +`CHUNKS_PER_TX_INDEX` only affects `txhash/index/` — all other trees use the hardcoded 1_000-chunk `bucket_id` grouping regardless. Directory-count tradeoffs for a 2_000-chunk (20M-ledger) dataset: -| `CHUNKS_PER_TXHASH_INDEX` | Tx-index dirs | Tradeoff | +| `CHUNKS_PER_TX_INDEX` | Tx-index dirs | Tradeoff | |---------------------------|---------------|----------| | `1000` (default) | `2_000 / 1_000 = 2` | Fewer dirs, larger indexes — longer build time per index, fewer files to search at query time | | `100` | `2_000 / 100 = 20` | More dirs, smaller indexes — faster build time per index, more files to search at query time | @@ -219,7 +219,7 @@ Directory-count tradeoffs for a 2_000-chunk (20M-ledger) dataset: *This section is a reference for the key schema and lifecycle. It reads more naturally after [How Backfill Runs](#how-backfill-runs) below, which defines the tasks that write and consume these keys.* - Single RocksDB instance with WAL (Write-Ahead Log) always enabled. -- Authoritative for everything backfill decides: which chunks and tx indexes are done (progress tracker), which config values can't change across runs (e.g., `CHUNKS_PER_TXHASH_INDEX`, stored on first run and fatal if changed), and where to resume after a crash (every resume decision derives from key presence). 
+- Authoritative for everything backfill decides: which chunks and tx indexes are done (progress tracker), which config values can't change across runs (e.g., `CHUNKS_PER_TX_INDEX`, stored on first run and fatal if changed), and where to resume after a crash (every resume decision derives from key presence). ### Key Schema @@ -278,9 +278,9 @@ The backfill DAG has three task types: | `cleanup_txhash(tx_index_id)` | Per tx index | `build_txhash_index` for this tx index | Deletes raw `.bin` files + `chunk:{chunk_id:08d}:txhash` meta keys | - Each task is a black box to the DAG scheduler — it calls `execute()` and waits for return. -- What happens inside (goroutines, I/O, parallelism) is up to the task. +- What happens inside (concurrency, I/O, parallelism) is up to the task. -For the chunks of one tx index (first chunk through last chunk): +For the chunks of a single tx index (first chunk through last chunk, inclusive), the dependencies look like this: ``` process_chunk(chunk_id=first) ─┐ @@ -296,7 +296,7 @@ process_chunk(chunk_id=last) ─┘ ### Main Flow -`run_backfill` is invoked by the daemon's [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) with an integer chunk range: +`run_backfill` is invoked by the service's [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) with an integer chunk range: ```python def run_backfill(config, range_start_chunk_id, range_end_chunk_id): @@ -309,8 +309,8 @@ def run_backfill(config, range_start_chunk_id, range_end_chunk_id): ### Pre-DAG Validation - Validation runs in two layers, both pre-DAG: - 1. **Daemon startup** (`validate_config`) — runs once per process start, before any backfill is invoked. Authoritative enforcer of config-immutability: `CHUNKS_PER_TXHASH_INDEX` (and `RETENTION_LEDGERS`) cannot change across runs. Defined in [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode). - 2. 
**Per `run_backfill` call** (`validate`) — runs before DAG construction. Argument sanity only (chunk-range bounds). + 1. **Service startup** (`validate_config`) — runs once per process start, before any backfill is invoked. Authoritative enforcer of config-immutability: `CHUNKS_PER_TX_INDEX` (and `RETENTION_LEDGERS`) cannot change across runs. Defined in [02-streaming-workflow.md — Validation Pseudocode](./02-streaming-workflow.md#validation-pseudocode). + 2. **Per `run_backfill` call** - `validate` runs before DAG construction. Argument sanity only (chunk-range bounds). - Why pre-DAG and not a DAG task: no-dependency tasks would start concurrently; a validation failure would leave in-flight work to cancel. Pre-DAG = clean abort, no partial work. - No source probe. `run_backfill` trusts the caller's range; source-coverage problems surface at runtime as task failures — see [Error Handling](#error-handling). - `[BSB]` must be configured. Phase 1 (catchup) only calls `run_backfill` when `[BSB]` is present. @@ -328,8 +328,8 @@ def validate(range_start_chunk_id, range_end_chunk_id): ```python def build_dag(config, range_start_chunk_id, range_end_chunk_id): # Invariant: range_start_chunk_id is always tx-index-aligned - # Phase 1 (catchup) is the only caller of this function, and it aligns the start chunk ID to the nearest tx index boundary) - # So, there is never a partial-at-start that would create an unbuildable index. + # Phase 1 (catchup) is the only caller of this function, and it aligns the start chunk ID to the nearest tx index boundary, which is why validate() doesn't check for that. + # This means the first tx index in the range is always fully covered by the chunk range, and thus always buildable. # A partial-at-end (trailing partial) is normal: BSB-tip lands wherever network production is, mid-index is typical. 
dag = new DAG() @@ -357,10 +357,10 @@ def build_dag(config, range_start_chunk_id, range_end_chunk_id): **Examples:** -- `input chunk range = [0, 5_999]`, `cpi = 1_000` → tx_indexes 0..5 fully covered and created; no trailing partial. -- `input chunk range = [0, 6_100]`, `cpi = 1_000` → tx_indexes 0..5 fully covered; tx_index 6 trailing partial with only chunk 6_000 created; `build_txhash_index` skipped for tx_index 6. +- `input chunk range = [0, 5_999]`, `cpi = 1_000`, starting chunk is 0 - already tx-index aligned → tx_indexes 0..5 fully covered and created; no trailing partial. +- `input chunk range = [3_000, 6_100]`, `cpi = 1_000`, starting chunk is 3_000 - already tx-index aligned → tx_indexes 3..5 fully covered; tx_index 6 trailing partial with only chunks 6_000..6_100 created; `build_txhash_index` skipped for tx_index 6. -**Trailing partial:** +**Trailing partial tx-index:** - On disk: chunks have `.bin` files + `:lfs` + `:events` + `:txhash` flags; `index:{tx_index_id:08d}:txhash` absent. - Ledger and events data are not blocked by tx-index alignment — their flags land as each chunk's outputs are durable. @@ -404,13 +404,13 @@ def run_dag(dag, max_workers): - Single flat pool of `max_workers = MAX_CPU_THREADS` slots. - Any mix of task types can occupy slots simultaneously. - `process_chunk`: 1 slot per task. -- `build_txhash_index`: 1 slot per task (uses many goroutines internally). +- `build_txhash_index`: 1 slot per task (uses internal parallelism across many concurrent workers). - `cleanup_txhash`: 1 slot per task. - BSB's `NUM_WORKERS` is a per-BSB internal download pool, not a cross-task concurrency knob. -### Task Details +--- -#### `process_chunk` +### `process_chunk` - Processes a single 10_000-ledger chunk end-to-end. - Idempotent at flag granularity — produces only outputs whose flag is missing; a partially-completed chunk resumes from where it left off. 
@@ -434,7 +434,7 @@ def process_chunk(chunk_id, config): if not (need_lfs or need_txhash or need_events): return - # If :lfs is already on disk, read from the local packfile — no BSB, no network. + # If :lfs is already on disk, read from the local packfile — no need to use BSB. # Otherwise instantiate a per-task BSB scoped to THIS chunk's 10_000 ledgers. if need_lfs: ledger_reader = BSBSource(config.bsb) @@ -475,10 +475,12 @@ def process_chunk(chunk_id, config): - Each `process_chunk` owns its own BSB instance, scoped to the chunk's 10_000 ledgers and torn down at task exit. Cross-task concurrency cap is the DAG [Worker Pool](#worker-pool); the BSB interface is documented in [02-streaming-workflow.md — Ledger Source](./02-streaming-workflow.md#ledger-source). - Adding a new data type = adding a fourth flag + writer; no other task changes. -#### `build_txhash_index` +--- + +### `build_txhash_index` - Builds the RecSplit index for one completed tx index. -- Occupies one DAG worker slot but spawns multiple goroutines internally (per-stage worker counts are in the pseudocode). +- Occupies one DAG worker slot but spawns multiple concurrent workers internally (per-stage worker counts are in the pseudocode). **Pseudocode:** @@ -516,7 +518,9 @@ def build_txhash_index(tx_index_id): - VERIFY always runs — no `--verify-recsplit=false` escape hatch; backfill trades throughput for correctness every time. - All-or-nothing recovery on restart: absent `index:{tx_index_id:08d}:txhash` ⇒ delete partial `.idx` files and re-run the full build. -#### `cleanup_txhash` +--- + +### `cleanup_txhash` - Runs after `build_txhash_index` completes successfully. Modeled as a separate DAG task (not inline in `build_txhash_index`) so crash recovery falls out naturally — on restart, the DAG sees the tx-index flag set but per-chunk `:txhash` flags still present, and cleanup re-runs as a normal task. 
@@ -524,10 +528,12 @@ def build_txhash_index(tx_index_id): ```python def cleanup_txhash(tx_index_id): + # File-before-flag-delete on every cleanup pair (see 02-streaming-workflow.md — Flag Semantics). + # On any crash mid-pair, the flag is the recovery signal — never an orphan file with no record. for chunk_id in chunks_for_tx_index(tx_index_id): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): continue - delete_if_exists(raw_txhash_path(chunk_id)) # idempotent; crash-between is safe + delete_if_exists(raw_txhash_path(chunk_id)) meta_store.delete(f"chunk:{chunk_id:08d}:txhash") ``` @@ -546,7 +552,7 @@ Crash recovery and error handling share one foundation: flag-after-fsync makes t No separate reconciliation phase — every task's `execute()` checks its own completion state: - `build_dag()` registers ALL tasks for the chunk range on every invocation; no meta-store scanning in setup. -- `process_chunk` checks each output flag independently — missing produced, existing skipped. +- `process_chunk` checks each output flag independently — missing output is produced; existing output is skipped. - `build_txhash_index` checks `index:{tx_index_id:08d}:txhash` — present → early return; absent → delete partial `.idx` files, rerun full build. - `cleanup_txhash` checks `chunk:{chunk_id:08d}:txhash` per-chunk — cleaned skipped, remaining cleaned. @@ -558,7 +564,7 @@ Three invariants make this work: ### Concurrent Access Prevention -The daemon acquires a directory flock on the meta-store at startup. A second process against the same datadir fails immediately. +The service acquires a directory flock on the meta-store at startup. A second process against the same datadir fails immediately. ### Error Handling @@ -567,7 +573,7 @@ Two layers of retry: - **BSB-internal retries.** `BSBSource` handles transient errors (connection resets, throttling) inside a single task execution. Invisible to the DAG. 
- **Task-level retries.** DAG wraps each task's `execute()` in a retry loop bounded by `MAX_RETRIES`. - Source retries exhausted → task retries whole. - - `MAX_RETRIES` exhausted → task marked failed → DAG halts dependents → `run_backfill` returns fatal → [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) propagates → daemon exits non-zero. + - `MAX_RETRIES` exhausted → task marked failed → DAG halts dependents → `run_backfill` returns fatal → [Phase 1 (catchup)](./02-streaming-workflow.md#phase-1--catchup) propagates error to the service → service exits non-zero. - Operator fixes root cause + restarts → Phase 1 (catchup) re-enters → `run_backfill` re-invoked with a fresh range → completed work skipped via per-chunk idempotency. | Error | Handled by | Action | @@ -579,4 +585,4 @@ Two layers of retry: | Events write / fsync failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; flag not set | | RecSplit build failure | Task-level retry | `MAX_RETRIES` attempts; then ABORT; tx-index key absent | | VERIFY stage mismatch | None | ABORT immediately — data corruption; operator investigates | -| Meta store write failure | None | ABORT immediately — treat as crash; operator re-runs daemon | +| Meta store write failure | None | ABORT immediately — treat as crash; operator re-runs service | diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 4033a8dc1..0e8106c2d 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -2,19 +2,19 @@ ## Overview -The stellar-rpc daemon is the full-history RPC service. One binary, one invocation, one long-running process. +stellar-rpc is the **unified full-history RPC service** — historical backfill and live streaming under one binary, one invocation, one long-running process. - Operator runs `stellar-rpc --config path/to/config.toml`. No subcommand. No `--mode` flag. 
No behavior-switching flags. -- On every start the daemon runs four sequential startup phases, then enters a live ingestion loop it stays in until killed. -- Behavior across the three operator profiles — **archive** (full history), **pruning-history** (retention-windowed history with BSB catchup), **tip-tracker** (retention-windowed history, no object store; captive-core-only) — is determined entirely by TOML config; no profile flag. Full matrix: [Operator Profiles](#operator-profiles). -- Backfill (`01-backfill-workflow.md`) is used as an internal subroutine by Startup Phase 1 (catchup). Operators never invoke backfill directly. +- On every start, the service runs four sequential startup phases, then enters a live ingestion loop it stays in until killed. +- Behavior across the three operator profiles — **archive** (full history), **pruning-history** (retention-windowed history with bulk catchup from a remote object store), **tip-tracker** (retention-windowed history, no object store; captive-core-only) — is determined entirely by TOML config; no profile flag. Full matrix: [Operator Profiles](#operator-profiles). +- Backfill (specified in [01-backfill-workflow.md](./01-backfill-workflow.md)) is used as an internal subroutine by Phase 1 (catchup). Operators never invoke backfill directly. -**What the daemon does end-to-end:** -- Validates config against immutable meta-store state: `CHUNKS_PER_TXHASH_INDEX` (chunks-per-tx-index constant; defines on-disk layout) and `RETENTION_LEDGERS` (history window in ledgers, or `0` for full history). Both detailed in [Configuration](#configuration). +**What the service does end-to-end:** +- Validates config against immutable meta-store state: `CHUNKS_PER_TX_INDEX` (chunks-per-tx-index constant; defines on-disk layout) and `RETENTION_LEDGERS` (history window in ledgers, or `0` for full history). Both detailed in [Configuration](#configuration). 
- Catches up to the current **network tip** (most recent ledger the Stellar network has produced, sampled from the history archive — defined in [Terminology](#terminology)) using **BSB** (Buffered Storage Backend — remote object-store reader for `LedgerCloseMeta`; see [Ledger Source](#ledger-source)) or captive core (embedded `stellar-core` subprocess; see [Ledger Source](#ledger-source)), whichever is configured. - Hydrates any in-flight state left by a prior run. -- Ingests live ledgers from `CaptiveStellarCore` (the stellar Go SDK's captive-core client type — wraps the embedded `stellar-core` subprocess) at ~1 per 6 seconds. -- Writes each live ledger to three **active stores** — mutable per-chunk or per-index RocksDB instances for ledger, txhash, events — detailed in [Active Store Architecture](#active-store-architecture). +- Ingests live ledgers from captive core. +- Writes each live ledger to three **active RocksDB stores** — mutable per-chunk or per-index RocksDB instances for ledger, txhash, events — detailed in [Active Store Architecture](#active-store-architecture). - Freezes active stores to immutable files at chunk and index boundaries in background. - Prunes past-retention indexes atomically when retention is configured. - Serves `getLedger`, `getTransaction`, `getEvents` only after startup phases complete. Returns HTTP 4xx during startup. @@ -23,37 +23,51 @@ The stellar-rpc daemon is the full-history RPC service. One binary, one invocati ## Terminology -Terms used repeatedly throughout this doc. Skim on first read, refer back when a term surfaces later. - -- **Daemon** — the stellar-rpc binary running as one long-lived process. The only operator-facing entry point. -- **Startup phases 1–4** — sequential bootstrap work the daemon runs once per process start, before serving queries. Not a lifecycle concept — once Phase 4 (live ingestion) is reached, it stays there until the process exits. [Details](#startup-sequence). 
-- **Phase 1 (catchup)** — the startup phase that closes the gap between the last-committed ledger and the current network tip. Invokes the backfill subroutine internally. -- **Backfill (subroutine)** — a self-contained mechanism that ingests a known `[range_start, range_end]` chunk range via a static DAG of per-chunk tasks (`process_chunk`, `build_txhash_index`, `cleanup_txhash`). Specified in `01-backfill-workflow.md`. In the unified design, backfill is an internal callable only — no CLI (command-line) entry point exists. -- **Leapfrog** (colloquial) — when retention is configured (`RETENTION_LEDGERS > 0`), Phase 1 (catchup) skips past ledgers older than `tip - RETENTION_LEDGERS` by starting ingestion at the first ledger of the txhash index that contains `tip - RETENTION_LEDGERS`. Always lands on an index boundary — upholds the invariant that every persisted chunk is the first chunk of its index or a forward-contiguous extension of one. Implemented by the `retention_aligned_start_chunk` helper (Phase 1 (catchup) callsite) and the `retention_aligned_resume_ledger` helper (`compute_resume_ledger`'s no-BSB fresh-start branch). -- **`compute_resume_ledger`** — shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` for Phase 4 (live ingestion). Runs post-Phase-3 so any in-flight freezes Phase 3 finished (and their newly-set `:lfs` flags) are visible to the scan. See [Compute Resume Ledger](#compute-resume-ledger). -- **`streaming:last_committed_ledger` (per-ledger checkpoint)** — meta-store key written once per live ledger inside Phase 4 (live ingestion)'s ingestion loop. Tracks live-streaming progress. Never touched during Phases 1–3. Bound locally as `last_committed_ledger` in pseudocode. -- **`network_tip_ledger`** — the most recent ledger the Stellar network has produced. 
Sampled from the history archive via HTTP GET on `/.well-known/stellar-history.json` against `HISTORY_ARCHIVES.URLS`, wrapped in the `get_latest_network_tip()` helper (handles retries + the archive-tip-lags-true-tip-by-up-to-63-ledgers quirk). Called only by `retention_aligned_resume_ledger` (no-BSB tip-tracker fresh-start case). Phase 1 (catchup) reads BSB directly via `bsb_latest_complete_chunk_id` and never samples the network tip. Phase 4 (live ingestion) steady state does NOT sample tip either. Different from `last_committed_ledger` (the daemon's own progress). -- **Active store** — a mutable store holding in-flight ledger data for the chunk or index currently being ingested. Three kinds: - - Ledger active store — a per-chunk RocksDB (one instance per chunk). - - TxHash active store — a per-index RocksDB with 16 column families (one instance per index). - - Events active store — per-chunk RocksDB (one instance per chunk; schema + column families per [getEvents full-history design](../../design-docs/getevents-full-history-design.md)). -- **Immutable store** — on-disk files produced by freezing an active store. Three kinds: - - Ledger pack file (one per chunk). - - **RecSplit** index `.idx` files (16 per index) — minimal-perfect-hash files for `txhash → ledger_seq` lookup. - - Events cold segment (three files per chunk: `events.pack`, `index.pack`, `index.hash`). -- **Freeze transition** — a background goroutine that converts an active store's contents to immutable files and deletes the active store. Three flavors: **LFS** (shorthand for the ledger active store → `.pack` file freeze) and **events** (events active store → cold segment) run per chunk; **RecSplit** (txhash active store → 16 `.idx` files) runs per index. -- **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze. `first_ledger_in_chunk(chunk_id)` always ends in `..._02`; `last_ledger_in_chunk(chunk_id)` always ends in `..._01`. 
No partial chunks — every chunk on disk is a full 10_000-ledger chunk. -- **Txhash index** (a.k.a. "tx index", "index") — `CHUNKS_PER_TXHASH_INDEX` consecutive chunks. Atomic unit of retention pruning. Formulas in [Geometry](#geometry). Both docs use "tx index" as the dominant narrative form; "txhash index" appears where the output's role as a txhash lookup is the emphasis. -- **Chunk boundary** — the moment ingestion commits the last ledger of a chunk. Triggers background LFS + events freeze for that chunk. -- **Index boundary** — the moment ingestion commits the last ledger of an index. Triggers background RecSplit build for that index. Every index boundary is also a chunk boundary. -- **Catchup** — synonym for "close the gap between last-committed ledger and current tip". Performed inside Phase 1 (catchup). -- **`.bin` file** — a backfill-produced raw txhash flat file (transient). Exists only for chunks the backfill subroutine has flagged `:txhash` but whose containing index has not yet had its RecSplit built. Deleted by Phase 2 (`.bin` hydration) once loaded into the active txhash RocksDB. Streaming's live path never produces `.bin` files. +Vocabulary used throughout this doc. Skim on first read; refer back as terms come up. + +- **Service** — the stellar-rpc binary running as one long-lived process. The only thing an operator starts. + +- **Startup phases 1–4** — the four steps the service runs at every start before it begins serving queries. Phase 1 catches up history, Phase 2 hydrates leftover state, Phase 3 reconciles anything left mid-flight by a prior crash, Phase 4 takes over for live streaming. Once Phase 4 is reached, the service stays there until it exits — there is no Phase 5. + +- **Phase 1 (catchup)** — the startup step that closes the gap between what's already on disk and what the Stellar network has produced so far. Uses backfill as its mechanism. 
+ +- **Backfill** — the process of pulling historical ledgers from a remote object store and writing them to disk as immutable artifacts. Backfill is internal to the service — operators never invoke it directly. Specified in [01-backfill-workflow.md](./01-backfill-workflow.md). + +- **Leapfrog** (colloquial) — how the service picks a starting ledger when retention is configured: the start always lands on a tx-index boundary, never mid-index, so the first tx index ingested is complete. Without this rounding, the chunks before the start would fall below the retention floor and never be ingested, leaving the tx index broken and the ingest-work on its later chunks wasted. Used in two places: Phase 1 (catchup) when there's a remote object store to read from, and at Phase 4 (live ingestion) entry on a no-object-store fresh start. + +- **Network tip** — the most recent ledger the Stellar network has produced. The service learns this from a public Stellar history archive over HTTP, not from its own state. + +- **Resume ledger** — at every start, the service decides which ledger it should resume live ingestion at, based on what's already on disk plus anything a prior crash left mid-flight. The first ledger ingested in the new run is the resume ledger. + +- **`streaming:last_committed_ledger`** — the local state-store key that records the last ledger the service successfully wrote during live streaming. Updated once per live ledger; never written during the startup phases. + +- **Active store** — a writable store that holds in-flight data for whatever chunk or txhash index is currently being ingested. Three kinds, one per data type: + - **Ledger active store** — one instance per chunk. + - **TxHash active store** — one instance per txhash index. + - **Events active store** — one instance per chunk. + +- **Immutable store** — on-disk files produced when an active store is frozen. Three kinds, paired with the active stores above: + - **Ledger pack file** — one per chunk. 
+ - **TxHash lookup files** — multiple per txhash index, for fast `txhash → ledger` lookup. + - **Events cold segment** — three files per chunk. + +- **Freeze transition** — the background work of converting an active store into its immutable counterpart, then deleting the active store. Three kinds: **ledger freeze (LFS)** and **events freeze** happen at every chunk boundary; **txhash freeze** happens at every index boundary. + +- **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze: every chunk on disk is a complete 10_000-ledger chunk, never partial. + +- **Txhash index** (a.k.a. "tx index" or just "index") — a group of consecutive chunks (default: 1_000 chunks = 10_000_000 ledgers). Atomic unit of retention pruning: a tx index is pruned as a whole, never per chunk. Formulas in [Geometry](#geometry). + +- **Chunk boundary** — the moment ingestion finishes a chunk. Triggers the chunk's ledger and events freezes in the background. + +- **Index boundary** — the moment ingestion finishes a tx index. Triggers the tx index's txhash freeze in the background. Every index boundary is also a chunk boundary. + +- **`.bin` file** — a transient on-disk file produced by backfill while a tx index is still being filled in. Holds the raw txhashes for one chunk. Deleted once the tx index is complete (or once its contents are loaded into the active txhash store at startup). --- ## Geometry -See [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Streaming uses the same constants (`GENESIS_LEDGER`, `LEDGERS_PER_CHUNK`, `LEDGERS_PER_INDEX`, `CHUNKS_PER_TXHASH_INDEX`), mapping functions, and derived helpers. +See [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry). Streaming uses the same constants (`GENESIS_LEDGER`, `LEDGERS_PER_CHUNK`, `LEDGERS_PER_TX_INDEX`, `CHUNKS_PER_TX_INDEX`), mapping functions, and derived helpers. 
--- @@ -63,7 +77,15 @@ Streaming reads the same TOML file as backfill, plus additional keys described b ### Shared Config (from backfill) -`[SERVICE]` (daemon-wide settings — `DEFAULT_DATA_DIR`, `CHUNKS_PER_TXHASH_INDEX`), `[BSB]` (Buffered Storage Backend source settings), `[IMMUTABLE_STORAGE.*]` (on-disk paths for immutable artifacts — ledger packs, events, raw txhash, txhash index), `[META_STORE]` (meta-store RocksDB path), `[LOGGING]` (log level + format) are detailed in [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration). Streaming adds extra keys to `[SERVICE]` and introduces `[CAPTIVE_CORE]` (embedded `stellar-core` subprocess settings), `[ACTIVE_STORAGE]` (active RocksDB paths), `[HISTORY_ARCHIVES]` (Stellar history-archive URLs for tip sampling) — all defined below. +These sections come from backfill — see [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration) for the full schemas: + +- `[SERVICE]` — service-wide settings (`DEFAULT_DATA_DIR`, `CHUNKS_PER_TX_INDEX`). +- `[BSB]` — Buffered Storage Backend source settings. +- `[IMMUTABLE_STORAGE.*]` — on-disk paths for immutable artifacts (ledger packs, events, raw txhash, txhash index). +- `[META_STORE]` — meta-store RocksDB path. +- `[LOGGING]` — log level + format. + +Streaming extends `[SERVICE]` with extra keys and introduces `[CAPTIVE_CORE]` (embedded `stellar-core` subprocess settings), `[ACTIVE_STORAGE]` (active RocksDB paths), and `[HISTORY_ARCHIVES]` (Stellar history-archive URLs for tip sampling) — all defined in [TOML Sections Documented Here](#toml-sections-documented-here) below. ### Immutable Keys (stored in meta store, fatal if changed) @@ -71,23 +93,23 @@ Stored on first start; fatal on any subsequent start where the config value diff | Key | Stored under | Set by | Rule | |---|---|---|---| -| `CHUNKS_PER_TXHASH_INDEX` | `config:chunks_per_txhash_index` | first run | Fatal if changed. 
| +| `CHUNKS_PER_TX_INDEX` | `config:chunks_per_tx_index` | first run | Fatal if changed. | | `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | - Source selection (BSB vs captive core) is determined per-startup by `[BSB]` presence; not stored as immutable. - Operators may add or remove BSB between runs; on each start, Phase 1 (catchup) either re-runs backfill from the retention-aligned start (BSB present) or no-ops (BSB absent). `compute_resume_ledger` then derives resume from whatever chunks are on disk. -- Retention immutability alone constrains the data envelope — source choice doesn't need its own gate. +- Locking the source choice would add nothing — `RETENTION_LEDGERS` already pins down what range of ledgers ends up on disk, and that's what actually has to stay consistent across runs. Whether each ledger arrived via BSB or captive core doesn't change anything on disk. ### TOML Sections Documented Here **[SERVICE] — streaming additions** -Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration) (which covers `DEFAULT_DATA_DIR` and `CHUNKS_PER_TXHASH_INDEX`). +Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./01-backfill-workflow.md#configuration). -| Key | Type | Default | Description | -|---|---|---|---| -| `RETENTION_LEDGERS` | uint32 | `0` | `0` = full history; otherwise must be a positive multiple of `LEDGERS_PER_INDEX`. See [Validation Rules](#validation-rules). | -| `NETWORK_PASSPHRASE` | string | **required** | Stellar network passphrase — for example, `"Public Global Stellar Network ; September 2015"` for pubnet; `"Test SDF Network ; September 2015"` for testnet. Must match the `NETWORK_PASSPHRASE` in the captive-core config file. Surfaced to all daemon code via the runtime config struct.
| +| Key | Type | Default | Description | +|---|---|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `RETENTION_LEDGERS` | uint32 | `0` | `0` = full history; otherwise must be a positive multiple of `LEDGERS_PER_TX_INDEX`. See [Validation Rules](#validation-rules). | +| `NETWORK_PASSPHRASE` | string | **required** | Stellar network passphrase. Must match the `NETWORK_PASSPHRASE` in the captive-core config file. | **[CAPTIVE_CORE]** @@ -104,9 +126,9 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 **[HISTORY_ARCHIVES]** -| Key | Type | Default | Description | -|---|---|---|---| -| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample tip via `/.well-known/stellar-history.json` for Phase 4 (live ingestion)'s leapfrog-from-tip computation (when `[BSB]` is absent on first-ever start). Same key the existing ingest service reads. | +| Key | Type | Default | Description | +|---|---|---|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample network tip for Phase 4 (live ingestion)'s leapfrog-from-tip computation (when `[BSB]` is absent on first-ever start).| **[BSB]** (optional) @@ -123,13 +145,15 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 | `--log-level` | string | from `[LOGGING].LEVEL` | Override log level. | | `--log-format` | string | from `[LOGGING].FORMAT` | Override log format. | -**No other flags.** No `--mode`, no `--start-ledger`, no `--end-ledger`, no subcommand. Any per-run behavior is either driven by config or derived at runtime from meta store + tip. 
+**No other flags.** - No `--mode`; no `--start-ledger`, `--end-ledger`; no separate subcommand for backfill or streaming. Any per-run behavior is either driven by config or derived at runtime from meta store + tip. ### Validation Rules -- `CHUNKS_PER_TXHASH_INDEX` immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). -- `RETENTION_LEDGERS` immutable across runs. -- `RETENTION_LEDGERS` must be `0` OR a positive integer multiple of `LEDGERS_PER_INDEX`. Valid at `cpi=1_000`: `0`, `10_000_000`, `20_000_000`, `30_000_000`, etc. Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum). Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. +- `CHUNKS_PER_TX_INDEX` - immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). +- `RETENTION_LEDGERS` - immutable across runs. Must be `0` OR a positive integer multiple of `LEDGERS_PER_TX_INDEX` (defined in [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry)). + - Valid values of `RETENTION_LEDGERS` at `cpi=1_000`: `0`, `10_000_000`, `20_000_000`, `30_000_000`, etc. + - Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum/not a multiple). + - Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. - `[BSB]` optional. When present → Phase 1 (catchup) invokes backfill over the BSB; when absent → Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles initial catchup. May be added or removed between runs. - **`[BSB]` absent AND `RETENTION_LEDGERS = 0` is fatal.** Full history requires BSB — captive-core archive-catchup from genesis would take weeks-to-months. Not a supported operating mode. - `[HISTORY_ARCHIVES].URLS` required in all profiles.
@@ -139,32 +163,22 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 ### Validation Pseudocode +`validate_config` applies the rules above and then enforces immutability for the two immutable keys. The non-obvious mechanism is the immutable-key check itself — store on first run, compare on every subsequent run: + ```python def validate_config(config, meta_store): - cpi = config.service.chunks_per_txhash_index - retention_ledgers = config.service.retention_ledgers - - if retention_ledgers != 0 and (retention_ledgers <= 0 or (retention_ledgers % LEDGERS_PER_INDEX) != 0): - fatal(f"RETENTION_LEDGERS={retention_ledgers} must be 0 or a positive multiple of " - f"LEDGERS_PER_INDEX={LEDGERS_PER_INDEX}.") + apply_static_rules(config) # required-field presence, RETENTION_LEDGERS + # multiple-of-LEDGERS_PER_TX_INDEX, [BSB]+retention=0 fatal + # — see "Validation Rules" above for the full contract. - if config.bsb is None and retention_ledgers == 0: - fatal("[BSB] is absent AND RETENTION_LEDGERS=0 (full history). Full history requires " - "BSB — captive-core-from-genesis is not supported. Either add [BSB] or set " - "RETENTION_LEDGERS > 0.") - - # Fatals with a clear "X is required" message for any key marked **required** - # in the [Configuration] tables above that is absent or empty. 
- ensure_required_config_fields_exist(config) - - _enforce_immutable(meta_store, "config:chunks_per_txhash_index", str(cpi)) - _enforce_immutable(meta_store, "config:retention_ledgers", str(retention_ledgers)) + _enforce_immutable(meta_store, "config:chunks_per_tx_index", str(config.service.chunks_per_tx_index)) + _enforce_immutable(meta_store, "config:retention_ledgers", str(config.service.retention_ledgers)) def _enforce_immutable(meta_store, key, current_value): stored = meta_store.get(key) if stored is None: - meta_store.put(key, current_value) + meta_store.put(key, current_value) # first-run snapshot elif stored != current_value: fatal(f"{key} changed: stored={stored}, config={current_value}. Wipe datadir.") ``` @@ -176,14 +190,16 @@ Three profiles emerge from config combinations. No profile flag. | Profile | `RETENTION_LEDGERS` | `[BSB]` | Phase 1 behavior | Use case | |---|---|---|---|---| | Archive | `0` | present | Backfill over full history (chunks `[0, current_chunk − 1]`) | Public archive node; full history. | -| Pruning-history | `N × LEDGERS_PER_INDEX`, N ≥ 1 | present | Backfill over retention window (leapfrog-aligned start) | Windowed history with bulk initial catchup. | -| Tip-tracker | `N × LEDGERS_PER_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 (live ingestion)'s captive core archive-catches-up from a leapfrog'd `resume_ledger` | App developer; short retention; no object-store dep. | +| Pruning-history | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | present | Backfill over retention window (leapfrog-aligned start) | Windowed history with bulk initial catchup. | +| Tip-tracker | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 (live ingestion)'s captive core archive-catches-up from a leapfrog'd `resume_ledger` | App developer; short retention; no object-store dep. | | (invalid) | `0` | absent | — | Rejected by `validate_config`: full history requires BSB. | --- ## Meta Store Keys +*This section is a reference for the key schema and lifecycle. 
It reads more naturally after [Startup Sequence](#startup-sequence) below, which defines the phases that write and consume these keys.* + Single RocksDB instance, WAL (Write-Ahead Log) always enabled. Authoritative source for every startup decision. ### Keys Introduced by Streaming @@ -195,13 +211,15 @@ Single RocksDB instance, WAL (Write-Ahead Log) always enabled. Authoritative sou ### Keys Shared with Backfill -| Key | Semantics | -|---|---| -| `config:chunks_per_txhash_index` | Set on first run by whichever invocation runs first — here, first daemon start. | -| `chunk:{chunk_id:08d}:lfs` | Set after ledger pack file fsync. | -| `chunk:{chunk_id:08d}:events` | Set after events cold segment fsync. | -| `chunk:{chunk_id:08d}:txhash` | Set by backfill subroutine after `.bin` fsync; deleted during Phase 2 (`.bin` hydration) after `.bin` is loaded into RocksDB. Streaming live path does not write this key — streaming writes txhash directly to the active RocksDB txhash store. | -| `index:{tx_index_id:08d}:txhash` | `"1"` after all 16 RecSplit CF `.idx` files built and fsynced. Transitions to `"deleting"` at the start of `prune_tx_index`, deleted entirely when prune completes. Query routing treats `"deleting"` the same as absent. | +Defined in [01-backfill-workflow.md — Meta Store Keys](./01-backfill-workflow.md#meta-store-keys); streaming uses the same contract: + +- `config:chunks_per_tx_index` +- `chunk:{chunk_id:08d}:lfs` +- `chunk:{chunk_id:08d}:events` +- `chunk:{chunk_id:08d}:txhash` +- `index:{tx_index_id:08d}:txhash` + +Streaming-specific use of these keys (which paths write them when, and the `"deleting"` marker on `index:txhash`) is shown in [Key Lifecycle in Streaming](#key-lifecycle-in-streaming) below. 
### Key Lifecycle in Streaming @@ -213,10 +231,10 @@ Phase 1 (catchup): index:{tx_index_id}:txhash = "1" (after RecSplit, when all chunks of tx_index_id are done in Phase 1 (catchup)) Phase 2 (.bin hydration — see Startup Sequence): - For every chunk with :txhash flag and a .bin file: - load .bin into RocksDB txhash store - delete chunk:{chunk_id}:txhash flag - delete .bin file + For every chunk with :txhash flag (and possibly a .bin file): + if .bin exists: load .bin into the active txhash RocksDB + delete .bin file (file FIRST — see Flag Semantics) + delete chunk:{chunk_id}:txhash flag (flag LAST) After Phase 2 (.bin hydration), no chunk:{chunk_id}:txhash flags and no .bin files remain. Live path (per ledger): @@ -230,81 +248,72 @@ Live path (per index, background): index:{tx_index_id}:txhash = "1" (after RecSplit + verify) Pruning (background, when tx_index_id is past retention): - index:{tx_index_id}:txhash = "deleting" (FIRST; queries now return 4xx for this index) + index:{tx_index_id}:txhash = "deleting" (set BEFORE any files are deleted; queries return 4xx from here on) [delete all files + per-chunk :lfs + :events keys for tx_index_id] - index:{tx_index_id}:txhash → deleted (LAST) + index:{tx_index_id}:txhash → deleted (cleared AFTER all files are gone) ``` ### Flag Semantics -- **Flag-after-fsync.** A flag is set only after the artifact it represents has been fsynced. Flag absent = artifact missing (or incomplete). +- **Flag-after-fsync (creation order).** A flag is set only AFTER the artifact it represents has been fsynced. Flag absent ⇒ artifact missing or incomplete; flag present ⇒ artifact is durable. +- **File-before-flag-delete (cleanup order).** When deleting, the file is removed FIRST and the flag is cleared LAST. Flag present ⇒ cleanup may not be complete; flag absent ⇒ cleanup is done and no file exists. The reverse order (flag-then-file) would orphan a file with no meta-store record on a crash mid-pair, recoverable only by filesystem scan. 
- **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility — derives from meta store key presence. No filesystem-scan-and-infer. +These three rules together mean: at any point during creation OR cleanup, the meta-store flag is the always-correct signal of the artifact's state on disk. A crash anywhere in the sequence leaves a state the next start can recover from by checking flag presence alone. + --- ## Active Store Architecture -The daemon maintains three active stores for the current ingestion position. All per-chunk and per-index lifecycle is driven by the [freeze transitions](#freeze-transitions). +The service maintains three RocksDB-backed active stores for the current ingestion position; WAL must always be enabled. All per-chunk and per-index lifecycle is driven by the [freeze transitions](#freeze-transitions). | Store | Path | Key | Value | Transition cadence | |---|---|---|---|---| | Ledger | `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{chunk_id:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) | -| TxHash | `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every `LEDGERS_PER_INDEX` ledgers (index) | +| TxHash | `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every `LEDGERS_PER_TX_INDEX` ledgers (index) | | Events | `{ACTIVE_STORAGE.PATH}/events-store-chunk-{chunk_id:08d}/` | per [getEvents full-history design](../../design-docs/getevents-full-history-design.md) | per [getEvents full-history design](../../design-docs/getevents-full-history-design.md) | Every 10_000 ledgers (chunk) | -- Ledger and txhash stores are RocksDB. WAL required. -- TxHash store uses 16 column families (`cf-0`..`cf-f`) routed by `txhash[0] >> 4`. 
-- Events active store is a per-chunk RocksDB; schema + column families per [getEvents full-history design](../../design-docs/getevents-full-history-design.md). Per-ledger writes are idempotent. +- TxHash store uses 16 column families (`cf-0`..`cf-f`) routed by the high nibble of the txhash (`txhash[0] >> 4`); each CF pairs 1:1 with one of the 16 RecSplit `.idx` files at the index boundary. +- Events writes are idempotent at per-ledger granularity — a re-write of the same ledger sequence overwrites cleanly, so crash-replay is corruption-free. ### Store Lifecycle - **Creation.** Active stores are opened on-demand, synchronously, at the boundary where they're first needed: - - Phase 4 (live ingestion) entry opens exactly one store per data type: `resume_chunk`'s ledger + events stores, and `resume_tx_index`'s txhash store. - - Each chunk boundary synchronously opens the next chunk's ledger + events stores after capturing the current ones as transitioning handles. - - Each tx-index boundary synchronously opens the next tx-index's txhash store similarly. -- **Synchronous open cost.** mkdir + RocksDB open + column-family setup is ~100–200 ms. At live cadence (6 s/ledger) this fits entirely inside the inter-ledger idle time — zero throughput impact. During archive replay (~500 ledgers/s) the cost is absorbed once per chunk boundary, ~100 ms each; over Alice's 10M-retention fresh start (~1_000 chunks) that's ~100 s of cumulative stall distributed across a ~6 h replay, sub-1%. -- **Transition.** At each boundary, the ingestion loop (a) captures the current store handle as `transitioning`, (b) synchronously opens the next store, (c) spawns the background freeze goroutine with the `transitioning` handle. Ingestion proceeds against the new active store immediately. -- **Deletion.** The freeze goroutine closes the transitioning handle and deletes its RocksDB directory AFTER writing the immutable artifact and setting the meta-store flag (flag-after-fsync). 
A crash between flag-set and dir-delete leaves an orphan that Phase 3 (reconcile) classifies as flag-is-truth and deletes. -- **Crash recovery.** Phase 3 (reconcile) classifies each on-disk active-store directory by chunk/index ID + flag presence: - - Dir is for `resume_chunk` / `resume_tx_index` → keep (the active store the live loop will resume against). - - `:lfs` / `:events` / `:txhash` flag present + dir present → delete dir (flag-is-truth; freeze completed, delete lingered). - - Flag absent + chunk/index ID < resume → `finish_interrupted_ledger_freeze` (or equivalent) — complete the freeze, set flag, delete dir. - - Else → future orphan; delete dir. - -### Max Concurrent Stores - -| Store | Max active | Max transitioning | Max total | -|---|---|---|---| -| Ledger | 1 | 1 | 2 | -| Events | 1 | 1 | 2 | -| TxHash | 1 | 1 | 2 | + - At every chunk boundary, the next chunk's ledger and events stores open synchronously, while the just-finished ones are handed off to the background freeze. + - At every tx-index boundary, the next tx index's txhash store opens the same way. +- **Synchronous open cost.** Opening a new active store doesn't take long enough to matter — about 100 ms at most. +- **Transition.** At each boundary, the ingestion loop hands off the just-finished store to a background freeze task and continues writing into the freshly-opened next store. Ingestion never blocks on the freeze. +- **Deletion.** The freeze task deletes the just-finished store's directory only AFTER writing the immutable artifact and setting its meta-store flag (flag-after-fsync). A crash between flag-set and directory-delete leaves an orphan that Phase 3 (reconcile) classifies as "flag-is-truth" and deletes, so no orphans remain. +- **Crash recovery.** Active-store directories that survive a crash are reconciled organically on the next start — see [Phase 3 — Reconcile Orphaned Transitions](#phase-3--reconcile-orphaned-transitions) for the full classification.
+ +***Max concurrency:*** each store kind (ledger / events / txhash) holds at most **one active + one transitioning at a time** — capped at 2 instances per kind, enforced by the per-kind single-flight gates in [Concurrency Model](#concurrency-model). --- ## Ledger Source -- **Backfill (Phase 1 (catchup)) uses `BSBSource` only.** Each `process_chunk` instantiates its own per-chunk `BSBSource` from `[BSB]` config, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). -- **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 (live ingestion) calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. +Two ledger sources, scoped to different phases: -- **`BSBSource`** is the backfill-only ledger source — one instance per `process_chunk`, interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`), torn down at end-of-task. +- **Backfill (Phase 1 (catchup)) uses `BSBSource`** — the backfill-only reader; interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`). Each `process_chunk` instantiates its own from `[BSB]` config, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). +- **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 (live ingestion) calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. --- ## Startup Sequence -Four sequential phases, same code path for first start and every restart. 
The first three are bounded bootstrap work; Phase 4 (live ingestion) is the long-running state the daemon stays in until process exit. +Four sequential phases, same code path for first start and every restart. The first three are bounded bootstrap work; Phase 4 (live ingestion) is the long-running state the service stays in until process exit. - **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip **when `[BSB]` is configured**, by invoking the backfill subroutine in a loop. Without `[BSB]`, Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles initial catchup naturally via its own `PrepareRange(UnboundedRange(resume_ledger))`. - **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 (catchup) left (for the trailing partial index) into the active txhash store, then deletes them. - **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. -- **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle goroutine, flips the `daemon_ready` flag, enters the ingestion loop. Runs until process exit. +- **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle task, flips the `service_ready` flag, enters the ingestion loop. Runs until process exit. -"Phase" here refers to the startup ordering only. Once Phase 4 (live ingestion) is entered, there's no Phase 5 — the daemon is in live-streaming steady state. +"Phase" here refers to the startup ordering only. Once Phase 4 (live ingestion) is entered, there's no Phase 5 — the service is in live-streaming steady state. ### Backfill vs Phase 1 (catchup) - **Backfill** is the subroutine (`run_backfill` in [01-backfill-workflow.md](./01-backfill-workflow.md)). BSB-only; parallel per-chunk BSB instances. Captive core cannot be a backfill source — its subprocess is serial and expensive to spin up per instantiation. 
-- **Phase 1 (catchup)** is the startup phase that runs on every daemon start. Its job: close the gap between on-disk state and current network tip before Phase 4 (live ingestion) takes over. Invokes backfill as its mechanism when `[BSB]` is configured; otherwise no-op and Phase 4 (live ingestion)'s captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))`. +- **Phase 1 (catchup)** is the startup phase that runs on every service start. Its job: close the gap between on-disk state and current network tip before Phase 4 (live ingestion) takes over. Invokes backfill as its mechanism when `[BSB]` is configured; otherwise no-op and Phase 4 (live ingestion)'s captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))`. ```python def main(): @@ -355,13 +364,8 @@ def phase1_catchup(config, meta_store): def retention_aligned_start_chunk(tip_ledger, retention_ledgers): - # Called by: phase1_catchup (per loop iteration) to compute range_start_chunk_id. - # Returns the first chunk Phase 1 (catchup) should backfill: - # - Archive profile (retention=0): chunk 0 (full history from genesis). - # - Pruning-history (retention>0): first chunk of the tx index containing - # (tip - retention_ledgers). Aligned DOWN to a tx-index boundary so the first - # persisted chunk starts a complete index (upholds the no-gaps invariant). - # Worst case: up to LEDGERS_PER_INDEX - 1 extra ledgers below strict retention. + # Aligns DOWN to a tx-index boundary (no-gaps invariant); costs up to + # LEDGERS_PER_TX_INDEX - 1 extra ledgers below strict retention. if retention_ledgers == 0: return 0 target_ledger = max(tip_ledger - retention_ledgers, GENESIS_LEDGER) @@ -380,15 +384,20 @@ def retention_aligned_start_chunk(tip_ledger, retention_ledgers): ```python def phase2_hydrate_txhash(config, meta_store): - # Sweep leftover .bin + flag for tx indexes already flagged complete (crash between - # index:N:txhash set and cleanup_txhash finish). 
+ # Both sweeps below delete the .bin file BEFORE deleting its :txhash flag (see Flag Semantics). + # On any crash mid-pair, the flag is the recovery signal — never an orphan file with no record. + + # Sweep 1: once an index's RecSplit is built, the per-chunk .bin files become + # redundant (their data is now in the index). Backfill deletes them via + # cleanup_txhash; this sweep finishes the job if backfill crashed mid-cleanup + # and left some chunks with their .bin file + :txhash flag still around. for tx_index_id in tx_index_ids_with_txhash_flag(meta_store): for chunk_id in chunks_for_tx_index(tx_index_id): if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_if_exists(raw_txhash_path(chunk_id)) + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") - # Load .bin for the trailing incomplete tx index into the active RocksDB. + # Sweep 2: hydrate the trailing incomplete tx index into the active RocksDB. incomplete_tx_index_id = current_incomplete_tx_index_id(meta_store) if incomplete_tx_index_id is None: return @@ -401,13 +410,8 @@ def phase2_hydrate_txhash(config, meta_store): bin_path = raw_txhash_path(chunk_id) if os.path.exists(bin_path): load_bin_into_rocksdb(bin_path, txhash_store) - meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_if_exists(bin_path) - - # Sweep orphan .bin (flag already deleted, .bin lingered from a prior crash). - for bin_file in scan_bin_files_for_tx_index(incomplete_tx_index_id): - if not meta_store.has(f"chunk:{parse_chunk_id(bin_file):08d}:txhash"): - os.remove(bin_file) + meta_store.delete(f"chunk:{chunk_id:08d}:txhash") finally: txhash_store.close() # Phase 4 re-opens by directory path; flock would collide otherwise. 
 ```
@@ -444,10 +448,10 @@ Each `reconcile_*_store_dirs` helper scans its own active-store directory type a

 ### Compute Resume Ledger

-- `compute_resume_ledger` is a shared helper called once per daemon start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`.
+- `compute_resume_ledger` is a shared helper called once per service start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`.
 - **Runs AFTER Phase 3 (reconcile).** Phase 3's `finish_interrupted_ledger_freeze` writes `:lfs` for chunks whose freeze was in flight at a prior crash; running `compute_resume_ledger` before Phase 3 would see those mid-freeze chunks as internal `:lfs` gaps and false-positive-fatal at startup.
 - **Scans every startup, even when `streaming:last_committed_ledger` is already set.** The scan's primary output in the mid-life-restart case is validation, not derivation; catching broken on-disk state before opening active stores is strictly safer than silently resuming on top.
-- **Validation failures are fatal.** Any inconsistency aborts startup with "migration to streaming failed" + an operator-readable error naming what's wrong. The daemon exits non-zero; no active stores are opened.
+- **Validation failures are fatal.** Any inconsistency aborts startup with "migration to streaming failed" + an operator-readable error naming what's wrong. The service exits non-zero; no active stores are opened.
**Derivation** — first match wins: @@ -467,7 +471,7 @@ Each `reconcile_*_store_dirs` helper scans its own active-store directory type a ```python def compute_resume_ledger(config, meta_store): - cpi = config.service.chunks_per_txhash_index + cpi = config.service.chunks_per_tx_index scan = scan_all_chunk_and_index_keys(meta_store) validate_scan(scan, cpi) @@ -477,7 +481,7 @@ def compute_resume_ledger(config, meta_store): return last_committed_ledger + 1 if scan.lfs_chunks: return last_ledger_in_chunk(scan.lfs_chunks[-1]) + 1 # first-ever post-Phase-1 - return retention_aligned_resume_ledger(config) # Alice fresh start + return retention_aligned_resume_ledger(config) # tip-tracker fresh start (no BSB) def validate_scan(scan, cpi): @@ -517,10 +521,8 @@ def validate_last_committed_consistency(scan, last_committed_ledger): def retention_aligned_resume_ledger(config): - # Called by: compute_resume_ledger (tip-tracker fresh-start branch; no BSB, no on-disk chunks). - # First chunk captive core ingests will be the first chunk of the tx index containing - # (tip - retention). validate_config already rejected the [BSB]-absent + retention=0 - # combination, so this helper is never called in archive-from-genesis shape. + # Tip-tracker fresh-start branch (no BSB, no on-disk chunks). validate_config + # rejects [BSB]-absent + retention=0, so GENESIS_LEDGER is only a defensive floor. network_tip_ledger = get_latest_network_tip(config.history_archives.urls) retention_ledgers = config.service.retention_ledgers @@ -530,27 +532,23 @@ def retention_aligned_resume_ledger(config): ### Phase 4 — Live Ingestion -Opens active stores for the resume position, spawns the lifecycle goroutine, starts captive core, and enters the ingestion loop. Query serving starts here (see [Query Contract](#query-contract)). +Opens active stores for the resume position, spawns the lifecycle task, starts captive core, and enters the ingestion loop. 
Query serving starts here (see [Query Contract](#query-contract)). ```python def phase4_live_ingest(config, meta_store, resume_ledger): - # resume_ledger is already computed by the orchestrator (see Compute Resume Ledger). - # Phase 4 (live ingestion) does NOT write streaming:last_committed_ledger at bootstrap — the first - # write happens inside the live ingestion loop after the first durable commit. + # streaming:last_committed_ledger is NOT written at bootstrap — first write happens + # inside the live ingestion loop after the first durable commit. active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) run_in_background(run_prune_lifecycle_loop, config, meta_store) ledger_backend = make_ledger_backend(config.captive_core.config_path) ledger_backend.PrepareRange(UnboundedRange(resume_ledger)) - set_daemon_ready() # in-memory; unblocks queries + set_service_ready() # in-memory; unblocks queries run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) def open_active_stores_for_resume(config, meta_store, resume_ledger): - # Open exactly one store per data type for the resume position. Subsequent - # chunks / indexes are opened on-demand at boundary time — see on_chunk_boundary - # and on_tx_index_boundary. resume_chunk_id = chunk_id_of_ledger(resume_ledger) resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) @@ -567,7 +565,7 @@ Captive core takes 4–5 minutes to spin up and start emitting at `resume_ledger ## Ingestion Loop -Single goroutine. Pull-based: the daemon drives sequential `GetLedger(seq)` calls. Same code path drains captive core's internal buffer during catchup and switches cadence to live closes (~5 s per ledger) once caught up. +Single background task. Pull-based: the service drives sequential `GetLedger(seq)` calls. Same code path drains captive core's internal buffer during catchup and switches cadence to live closes (~5 s per ledger) once caught up. 
```python def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger): @@ -602,20 +600,26 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r ## Freeze Transitions -Three independent background transitions per chunk/index boundary; each has its own goroutine, flag, and cleanup. Live ingestion never blocks on them. +Three independent background transitions per chunk/index boundary; each has its own task, flag, and cleanup. Live ingestion never blocks on them. - **LFS transition** — per chunk. Retired ledger RocksDB → `.pack` file. - **Events transition** — per chunk. Retired events RocksDB store → cold segment (3 files). - **RecSplit transition** — per index. Retired txhash RocksDB → 16 `.idx` files. -- Streaming's freeze transitions never produce `.bin` files; those are transient backfill output only (Phase 1). +- Streaming's freeze transitions never produce `.bin` files; those are transient backfill output only, produced during Phase 1 (catchup). ### Concurrency Model - **`active_stores` is the ingestion loop's owned state.** Fields (`ledger`, `events`, `txhash` — one handle per data type, no `*_next`) are mutated only by the ingestion loop thread — specifically inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze transitions receive a handle by value at spawn time and never read back through `active_stores`. -- **Meta-store is single-writer.** Meta-store flag writes come from: the ingestion loop (per-ledger checkpoint), freeze transitions (artifact `:lfs` / `:events` / `:txhash` flags after fsync), and the lifecycle loop (`"deleting"` marker + key delete during prune). Go's `sync.Mutex` inside the meta-store wrapper + RocksDB's own single-writer semantics keep these serialized. -- **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit). 
Implementation: an unbuffered `chan struct{}` per kind, or equivalently a `sync.Mutex`. `wait_for_lfs_complete()` acquires; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases. Second transition starts only after the first releases. Not a `sync.WaitGroup` — that would wait for ALL transitions globally, wrong semantics. -- **Query handlers read from storage-manager layer** — each per-data-type storage manager (ledger / events / txhash) owns its own state-transition synchronization; the query handler never touches `active_stores` directly. **Read-view invariant:** during a transition, a query sees either pre-transition data (routed to the transitioning store) or post-transition data (routed to the new active store + the newly-flagged immutable artifact) — never a half-state mix. **Flag-is-truth applies to reads too:** a query never routes to an immutable artifact whose `:lfs` / `:events` / `:txhash` flag isn't set. Concrete lock primitives + routing logic belong in a separate query-routing design doc. -- **Stores are opened on-demand at boundary.** `open_active_stores_for_resume` (Phase 4 entry) opens exactly one store per data type (`resume_chunk` ledger + events, `resume_tx_index` txhash). Each chunk/tx-index boundary synchronously opens the next store (~100-200 ms) AFTER capturing the current one as transitioning, then spawns the background freeze. At live cadence the sync open fits inside the 6 s inter-ledger idle — zero throughput impact. +- **Meta-store is single-writer.** Meta-store flag writes come from: the ingestion loop (per-ledger checkpoint), freeze transitions (artifact `:lfs` / `:events` / `:txhash` flags after fsync), and the lifecycle loop (`"deleting"` marker + key delete during prune). The meta-store wrapper serializes them with internal locking on top of RocksDB's own single-writer semantics. 
+- **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.**
+  - One outstanding transition per kind (LFS / events / RecSplit); the second starts only after the first releases.
+  - `wait_for_lfs_complete()` acquires the gate; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases it.
+  - Not a global wait barrier — that would block until every in-flight transition across all kinds finished, defeating per-kind independence.
+- **Query handlers read from storage-manager layer** — each per-data-type storage manager (ledger / events / txhash) owns its own state-transition synchronization; the query handler never touches `active_stores` directly.
+  - **Read-view invariant:** during a transition, a query sees either pre-transition data (routed to the transitioning store) or post-transition data (routed to the new active store + the newly-flagged immutable artifact) — never a half-state mix.
+  - **Flag-is-truth applies to reads too:** a query never routes to an immutable artifact whose `:lfs` / `:events` / `:txhash` flag isn't set.
+  - Concrete lock primitives + routing logic belong in a separate query-routing design doc.
+- **Stores are opened on-demand at boundary** — see [Store Lifecycle](#store-lifecycle) for the open + transition sequence and the synchronous-open cost analysis.

 ### Chunk Boundary (every 10_000 ledgers)

@@ -655,8 +659,6 @@ def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_
     transitioning_ledger_store.close()
     delete_dir(ledger_store_path(chunk_id))
     signal_lfs_complete()
-
-
 ```

 `finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the store at `store_dir`, runs the same write + fsync + flag + close + `delete_dir(store_dir)` sequence, no `signal_lfs_complete`.
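
The per-kind single-flight gates described under Concurrency Model above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `SingleFlightGates` class, its method names, and the `try_start` helper are hypothetical, and the design doc deliberately leaves the concrete primitive open. A bounded semaphore is used here because, unlike a re-entrant lock, it may be released by a thread other than the one that acquired it.

```python
import threading


class SingleFlightGates:
    """One gate per transition kind; at most one freeze in flight per kind."""

    def __init__(self, kinds=("lfs", "events", "recsplit")):
        # Each kind gets its own gate, so an LFS freeze never waits on an
        # events or RecSplit freeze (per-kind independence).
        self._gates = {kind: threading.BoundedSemaphore(1) for kind in kinds}

    def wait_for_complete(self, kind):
        # Blocks until the previous transition of this kind has signalled,
        # then takes the gate for the new transition.
        self._gates[kind].acquire()

    def try_start(self, kind):
        # Non-blocking variant: True if the gate was free and is now held.
        return self._gates[kind].acquire(blocking=False)

    def signal_complete(self, kind):
        # Called at the very end of the freeze task for this kind.
        self._gates[kind].release()
```

In this shape the ingestion loop would call `wait_for_complete("lfs")` before spawning a freeze, and the freeze task would call `signal_complete("lfs")` as its last step. A single global barrier would not give the same behaviour, since it would couple the three kinds together.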
@@ -674,13 +676,11 @@ def freeze_events_chunk_to_cold_segment(chunk_id, transitioning_events_store, me
     transitioning_events_store.close()
     delete_dir(events_store_path(chunk_id))
     signal_events_complete()
-
-
 ```

 `finish_interrupted_events_freeze(store_dir, chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the store at `store_dir`, runs the same write + fsync + flag + close + `delete_dir(store_dir)` sequence, no `signal_events_complete`.

-### Tx-Index Boundary (every `LEDGERS_PER_INDEX` ledgers)
+### Tx-Index Boundary (every `LEDGERS_PER_TX_INDEX` ledgers)

 The last chunk of a tx index has just rolled over. Before RecSplit can start, every chunk in the tx index must have its `:lfs` and `:events` flags set.

@@ -716,7 +716,7 @@ def build_tx_index_recsplit_files(tx_index_id, transitioning_txhash_store, meta_

 ## Pruning

-Retention is enforced by a single background goroutine, woken at chunk boundaries. Prune granularity is the whole txhash index — never per chunk.
+Retention is enforced by a single background task, woken at chunk boundaries. Prune granularity is the whole txhash index — never per chunk.

 ```python
 def run_prune_lifecycle_loop(config, meta_store):
@@ -735,8 +735,7 @@ def _run_prune_sweep(meta_store, retention_ledgers, config):


 def prunable_tx_index_ids(meta_store, retention_ledgers):
-    # Returns tx_index_ids fully past retention and prune-eligible (`:txhash` is `"1"` or
-    # `"deleting"`). Eligibility: last_committed_ledger > last_ledger_in_tx_index(N) + retention_ledgers.
+    # Eligible: tx_index fully past retention AND `:txhash` is `"1"` or `"deleting"`.
     if retention_ledgers == 0:
         return []
     last_committed_ledger = meta_store.get("streaming:last_committed_ledger")
@@ -755,10 +754,19 @@ def prune_tx_index(tx_index_id, meta_store, config):
     # Two-phase marker: set "deleting" first, clear the key last. Idempotent on retry.
     meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting")
     for chunk_id in chunks_for_tx_index(tx_index_id):
+        # Files first, flags last (same invariant as Phase 2 hydration: flag presence ⇒
+        # cleanup not yet done). All deletes are idempotent (rm -f / delete-if-exists).
         delete_if_exists(ledger_pack_path(chunk_id))
         delete_events_segment(chunk_id)
+        # Defense-in-depth: also clean the transient txhash artifacts. In normal flow
+        # these are gone long before retention catches up — Phase 2 hydration deletes
+        # them on the trailing partial, and cleanup_txhash deletes them on completed
+        # indexes. Belt-and-suspenders here so prune is self-contained against any
+        # upstream cleanup gap.
+        delete_if_exists(raw_txhash_path(chunk_id))
         meta_store.delete(f"chunk:{chunk_id:08d}:lfs")
         meta_store.delete(f"chunk:{chunk_id:08d}:events")
+        meta_store.delete(f"chunk:{chunk_id:08d}:txhash")
     delete_recsplit_idx_files(tx_index_id)
     meta_store.delete(f"index:{tx_index_id:08d}:txhash")
 ```
@@ -768,8 +776,8 @@ def prune_tx_index(tx_index_id, meta_store, config):
 - Whole-index gating closes that window.

 **How much extra data sits on disk.**
-- At most `LEDGERS_PER_INDEX - 1` ledgers past the strict retention line.
-- `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_INDEX`, so the line never bisects an index; the next-eligible index is exactly `LEDGERS_PER_INDEX` further.
+- At most `LEDGERS_PER_TX_INDEX - 1` ledgers past the strict retention line.
+- `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_TX_INDEX`, so the line never bisects an index; the next-eligible index is exactly `LEDGERS_PER_TX_INDEX` further.

 ---

@@ -779,17 +787,17 @@ Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, `

 ### Readiness Signal

-- An in-memory boolean `daemon_ready` is set by `set_daemon_ready()` at the top of Phase 4 (live ingestion), after Phases 1–3 complete and active stores are opened.
+- An in-memory boolean `service_ready` is set by `set_service_ready()` at the top of Phase 4 (live ingestion), after Phases 1–3 complete and active stores are opened. - Not persisted. On every startup the flag starts `false`; on every Phase 4 (live ingestion) entry it flips to `true`. Clean shutdown discards it implicitly (process exits). -- The HTTP server binds its port at daemon startup (before Phase 1 (catchup)) so `getHealth` is always servable regardless of current phase. The QueryRouter routes `getHealth` unconditionally and gates `getLedger` / `getTransaction` / `getEvents` on `daemon_ready`. -- This means: clients see `HTTP 4xx` from `getLedger`/`getTransaction`/`getEvents` on every startup until Phase 4 (live ingestion) is reached, regardless of whether prior runs have served queries. Intentional: catchup and recovery phases must complete before the daemon serves, every time. +- The HTTP server binds its port at service startup (before Phase 1 (catchup)) so `getHealth` is always servable regardless of current phase. The QueryRouter routes `getHealth` unconditionally and gates `getLedger` / `getTransaction` / `getEvents` on `service_ready`. +- This means: clients see `HTTP 4xx` from `getLedger`/`getTransaction`/`getEvents` on every startup until Phase 4 (live ingestion) is reached, regardless of whether prior runs have served queries. Intentional: catchup and recovery phases must complete before the service serves, every time. - Query handlers check the flag on each request. `false` → HTTP 4xx. `true` → route normally. ### Behavior During Phases 1–3 - `/getLedger`, `/getTransaction`, `/getEvents` → `HTTP 4xx` with no payload detail. - `/getHealth` → always served. Response payload matches the existing stellar-rpc shape: `status` (`catching_up` during Phases 1–3, `healthy` during Phase 4 (live ingestion)), `latestLedger` (= `streaming:last_committed_ledger`, or `0` if absent), `oldestLedger` (first ingested ledger), `ledgerRetentionWindow`. 
No drift field, no network-tip field. -- No partial / incremental serving. The daemon does not serve "whatever is ingested so far" while Phases 1–3 are running. +- No partial / incremental serving. The service does not serve "whatever is ingested so far" while Phases 1–3 are running. ### Behavior When an Index Is Being Pruned @@ -799,16 +807,20 @@ Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, ` ### Rationale - Without an explicit gate, implementations drift toward "best-effort serve whatever is ingested" — inconsistent across operators, breaks client assumptions. -- Explicit `daemon_ready` + HTTP 4xx gives clients an unambiguous signal. +- Explicit `service_ready` + HTTP 4xx gives clients an unambiguous signal. - `catching_up` health status gives operators visibility into progress. --- -## Crash Recovery +## Resilience + +Crash recovery and error handling share one foundation: flag-after-fsync makes the meta store authoritative, and every startup decision derives from flag presence alone — never filesystem scanning. Streaming extends backfill's resilience model with per-ledger checkpoint discipline, max-1-transitioning freeze gates, and a two-phase prune marker. + +### Crash Recovery No separate recovery phase. Every startup runs Phases 1–4 regardless — already-complete work is detected and skipped via meta store flags. -### Invariants +#### Invariants In addition to the backfill subroutine's invariants in [01-backfill-workflow.md — Crash Recovery](./01-backfill-workflow.md#crash-recovery), streaming adds the following: @@ -818,13 +830,16 @@ In addition to the backfill subroutine's invariants in [01-backfill-workflow.md 4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 (catchup) itself avoids producing them. 5. 
**Two-phase prune marker.** `prune_tx_index` writes `index:{tx_index_id}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `prunable_tx_index_ids`. -### Compound Recovery Scenarios +#### Compound Recovery Scenarios Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workflow.md#crash-recovery) handles every Phase 1 (catchup) crash. Streaming adds: -- **Crash during Phase 2 (`.bin` hydration).** - - Chunks loaded pre-crash: no `:txhash` flag, no `.bin` → loop skips via flag check. - - Chunks not yet loaded: `:txhash` + `.bin` present → loop picks them up. +- **Crash during Phase 2 (`.bin` hydration).** All sub-cases are recoverable because every cleanup pair runs file-delete BEFORE flag-delete (see [Flag Semantics](#flag-semantics)). + - **Sweep 1, mid-loop.** Already-cleaned chunks: flag absent → skipped on retry. Pending chunks: flag + file still present → cleaned on retry. + - **Sweep 1, between file-delete and flag-delete.** Flag set, file already gone. Restart: flag triggers retry, `delete_if_exists` is a no-op on the missing file, flag deleted. + - **Sweep 2, between `load_bin_into_rocksdb` and file-delete.** Flag set, file present, data already durable in the active txhash RocksDB. Restart: re-loads (RocksDB put is idempotent on the same key/value), then deletes file, then flag. + - **Sweep 2, between file-delete and flag-delete.** Flag set, file gone, data durable. Restart: flag triggers retry, `os.path.exists(bin_path)` is False so load is skipped, file delete is a no-op, flag deleted. + - **No filesystem scan needed in any case** — the meta-store flag is the only signal the next start consults. - **Crash between per-ledger checkpoint and LFS freeze completion.** - State: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)`; `chunk:{chunk_id}:lfs` absent. 
@@ -840,26 +855,28 @@ Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workf
   - State: some files deleted, some chunk keys cleared, `index:{tx_index_id}:txhash = "deleting"` still present.
   - `prunable_tx_index_ids` picks up `"deleting"` alongside `"1"` → `prune_tx_index(tx_index_id)` re-runs, idempotent (file deletes `rm -f`, key deletes `delete_if_exists`).

----
+### Concurrent Access Prevention
+
+The service acquires a directory flock on the meta-store at startup. A second service process against the same datadir fails immediately. Same mechanism as backfill — see [01-backfill-workflow.md — Concurrent Access Prevention](./01-backfill-workflow.md#concurrent-access-prevention).

-## Error Handling
+### Error Handling

 Three distinct policies — runtime ABORT, transition retry-via-flag-absence, startup FATAL.

-### Runtime (Phase 4 ingestion)
+#### Runtime — Phase 4 (live ingestion)

 - **CaptiveStellarCore unavailable.** RETRY with backoff; ABORT after `CAPTIVE_CORE_RETRY_MAX` attempts (implementation-defined).
 - **Per-ledger store write failure (ledger / txhash / events).** ABORT — disk full or storage corruption.
 - **Meta-store write failure.** ABORT — cannot maintain checkpoint.

-### Freeze transitions (LFS / events / RecSplit)
+#### Freeze transitions (LFS / events / RecSplit)

 All three follow the flag-after-fsync invariant: on failure, don't set the completion flag; abort the transition; restart retries the whole transition from scratch (partial `.idx` files get cleaned by the build's own preamble).

 - **RecSplit verification mismatch.** ABORT; do NOT delete the transitioning txhash store; operator investigates.

-### Startup (FATAL — datadir / config issues)
+#### Startup (FATAL — datadir / config issues)

-- `CHUNKS_PER_TXHASH_INDEX` or `RETENTION_LEDGERS` changed: wipe datadir to change.
-- `RETENTION_LEDGERS` not a multiple of `LEDGERS_PER_INDEX`: fix config.
+- `CHUNKS_PER_TX_INDEX` or `RETENTION_LEDGERS` changed: wipe datadir to change.
+- `RETENTION_LEDGERS` not a multiple of `LEDGERS_PER_TX_INDEX`: fix config.
 - Head not index-aligned / gap in chunk flags: datadir corruption; wipe.
diff --git a/full-history/design-docs/README.md b/full-history/design-docs/README.md
index 8b545047f..8b3816547 100644
--- a/full-history/design-docs/README.md
+++ b/full-history/design-docs/README.md
@@ -5,12 +5,12 @@
 | Doc | Scope |
 |-----|-------|
 | [01-backfill-workflow.md](./01-backfill-workflow.md) | Backfill subroutine internals — DAG, per-chunk tasks, shared TOML config, meta-store key schema, crash recovery |
-| [02-streaming-workflow.md](./02-streaming-workflow.md) | Unified daemon end-to-end — startup phases, live ingestion, freeze transitions, pruning, query contract |
+| [02-streaming-workflow.md](./02-streaming-workflow.md) | Unified service end-to-end — startup phases, live ingestion, freeze transitions, pruning, query contract, resilience (crash recovery + concurrent-access guards + error handling) |

 ## Reading Order

 - Read **01 Backfill** first. It defines shared concepts used by both docs: geometry, meta-store key schema, shared TOML config, flag-after-fsync.
-- Read **02 Streaming** second. It builds on 01's vocabulary and describes how the daemon invokes backfill as its Phase 1 (catchup) subroutine.
+- Read **02 Streaming** second. It builds on 01's vocabulary and describes how the service invokes backfill as its Phase 1 (catchup) subroutine.

 ## See Also

From 39ca22c85aa84eb57b825f08d4de259a134ae91c Mon Sep 17 00:00:00 2001
From: Karthik Iyer
Date: Sat, 25 Apr 2026 20:35:02 -0700
Subject: [PATCH 30/34] Streaming doc: hot-key tracking for active stores (D-S3)

Closes the only remaining filesystem-scan in the design. Phase 3 (reconcile)
now iterates meta-store keys instead of scanning {ACTIVE_STORAGE.PATH}/.

- New keys: hot:chunk:{C:08d}:lfs, hot:chunk:{C:08d}:events,
  hot:index:{N:08d}:txhash. Binary "1"/absent. Set BEFORE mkdir (inside
  open_active_*_store), cleared AFTER delete_dir (inside the freeze tasks).
- Phase 3 reconcile: replaces 3 filesystem-scanning helpers with one
  scan_prefix("hot:") loop dispatching by store_kind + chunk_or_tx_index_id.
  Clean if/elif/else chain with 4 scenarios (resume target / flag-is-truth
  cleanup / interrupted freeze / future-orphan defensive).
  delete_dir_if_exists everywhere; log.warn on the future-orphan branch.
- Phase 2 hydration: open_active_txhash_store now takes meta_store and
  encapsulates the hot-key write. No new code lines in phase2 body.
- Naming unification: open_or_create_*_store -> open_active_*_store across
  all callsites. finish_interrupted_*_freeze drops the redundant store_dir
  parameter (derivable from chunk_id under hot-key tracking).
- Flag Semantics: third rule (flag-driven recovery) now applies universally;
  no implicit "data artifacts only" scope.
- Resilience invariants: new invariant #6 (hot-key tracking).
- Compound Recovery Scenarios: refreshed Phase 3 entries; new entries for
  crash-mid-hot-store-creation and crash-between-dir-delete-and-
  hot-key-clear.

Source-of-truth + handoff updated in lockstep (untracked, in .claude/ and
~/.claude/projects/.../streaming-artifacts/). Anchors revalidated.
---
 .../design-docs/02-streaming-workflow.md | 131 ++++++++++++------
 1 file changed, 91 insertions(+), 40 deletions(-)

diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md
index 0e8106c2d..1a63d81ac 100644
--- a/full-history/design-docs/02-streaming-workflow.md
+++ b/full-history/design-docs/02-streaming-workflow.md
@@ -208,6 +208,9 @@ Single RocksDB instance, WAL (Write-Ahead Log) always enabled. Authoritative sou
 |---|---|---|
 | `streaming:last_committed_ledger` | uint32 (big-endian) | Written only by the live ingestion loop after all three active stores durably commit a ledger. **Never written at bootstrap.** When absent, [`compute_resume_ledger`](#compute-resume-ledger) derives resume from the contiguous `:lfs` prefix (first-ever post-Phase-1) or by leapfrogging down from the current network tip to an index boundary (tip-tracker fresh start). Phase 1 (catchup) progress is tracked by `chunk:{chunk_id}:lfs` flags alone. |
 | `config:retention_ledgers` | decimal string | First run (stored); enforced on subsequent starts. |
+| `hot:chunk:{chunk_id:08d}:lfs` | `"1"` | Written **before** the active ledger store directory is created; deleted **after** that directory is removed by the freeze task. Presence indicates the directory exists or its lifecycle is incomplete (creation in flight, or freeze cleanup not yet finished). |
+| `hot:chunk:{chunk_id:08d}:events` | `"1"` | Same pattern as `hot:chunk:lfs`, scoped to the active events store directory. |
+| `hot:index:{tx_index_id:08d}:txhash` | `"1"` | Same pattern, scoped to the active txhash store directory. Per-index cadence (one per tx index, not per chunk). |

 ### Keys Shared with Backfill

@@ -237,15 +240,29 @@ Phase 2 (.bin hydration — see Startup Sequence):
   delete chunk:{chunk_id}:txhash flag (flag LAST)
 After Phase 2 (.bin hydration), no chunk:{chunk_id}:txhash flags and no .bin files remain.

+Active store open (Phase 2 / Phase 4 entry / boundary handlers):
+  hot:* keys are set BEFORE mkdir, one per active store kind:
+    hot:chunk:{chunk_id}:lfs = "1"
+    hot:chunk:{chunk_id}:events = "1"
+    hot:index:{tx_index_id}:txhash = "1"
+
 Live path (per ledger):
   streaming:last_committed_ledger = ledger_seq (after all 3 active stores commit)

-Live path (per chunk, background):
-  chunk:{chunk_id}:lfs = "1" (after pack fsync)
-  chunk:{chunk_id}:events = "1" (after cold segment fsync)
+Live path (per chunk, background freeze):
+  Freeze flag set AFTER fsync; hot key cleared AFTER dir removed (file-before-flag-delete).
+  chunk:{chunk_id}:lfs = "1"
+  [delete rocksdb ledger store dir]
+  hot:chunk:{chunk_id}:lfs → deleted

-Live path (per index, background):
-  index:{tx_index_id}:txhash = "1" (after RecSplit + verify)
+  chunk:{chunk_id}:events = "1"
+  [delete rocksdb events store dir]
+  hot:chunk:{chunk_id}:events → deleted
+
+Live path (per index, background freeze):
+  index:{tx_index_id}:txhash = "1"
+  [delete rocksdb txhash store dir]
+  hot:index:{tx_index_id}:txhash → deleted

 Pruning (background, when tx_index_id is past retention):
   index:{tx_index_id}:txhash = "deleting" (set BEFORE any files are deleted; queries return 4xx from here on)
@@ -257,15 +274,15 @@ Pruning (background, when tx_index_id is past retention):

 - **Flag-after-fsync (creation order).** A flag is set only AFTER the artifact it represents has been fsynced. Flag absent ⇒ artifact missing or incomplete; flag present ⇒ artifact is durable.
 - **File-before-flag-delete (cleanup order).** When deleting, the file is removed FIRST and the flag is cleared LAST. Flag present ⇒ cleanup may not be complete; flag absent ⇒ cleanup is done and no file exists. The reverse order (flag-then-file) would orphan a file with no meta-store record on a crash mid-pair, recoverable only by filesystem scan.
-- **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility — derives from meta store key presence. No filesystem-scan-and-infer.
+- **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility, active-store directory reconciliation — derives from meta store key presence. No filesystem-scan-and-infer anywhere.

-These three rules together mean: at any point during creation OR cleanup, the meta-store flag is the always-correct signal of the artifact's state on disk. A crash anywhere in the sequence leaves a state the next start can recover from by checking flag presence alone.
+These three rules together mean: at any point during creation OR cleanup of any artifact (immutable file OR active store directory), the meta-store flag is the always-correct signal of the artifact's state on disk. A crash anywhere in the sequence leaves a state the next start can recover from by checking flag presence alone. --- ## Active Store Architecture -The service maintains three RocksDB-backed active stores for the current ingestion position; WAL must always be enabled. All per-chunk and per-index lifecycle is driven by the [freeze transitions](#freeze-transitions). +The service maintains three RocksDB-backed active stores for the current ingestion position; WAL must always be enabled. Each active store directory's existence is tracked in the meta store via a `hot:*` key (set before the directory is created, cleared after it is removed) — Phase 3 (reconcile) uses these keys to find directories that need recovery without ever scanning the filesystem. All per-chunk and per-index lifecycle is driven by the [freeze transitions](#freeze-transitions). | Store | Path | Key | Value | Transition cadence | |---|---|---|---|---| @@ -278,12 +295,12 @@ The service maintains three RocksDB-backed active stores for the current ingesti ### Store Lifecycle -- **Creation.** Active stores are opened on-demand, synchronously, at the boundary where they're first needed: +- **Creation.** Active stores are opened on-demand, synchronously, at the boundary where they're first needed. - At every chunk boundary, the next chunk's ledger and events stores open synchronously, while the just-finished ones are handed off to the background freeze. - At every tx-index boundary, the next tx index's txhash store opens the same way. - **Synchronous open cost.** Opening a new active store doesn't take long enough to matter — about 100 ms, at max. 
 - **Transition.** At each boundary, the ingestion loop hands off the just-finished store to a background freeze task and continues writing into the freshly-opened next store. Ingestion never blocks on the freeze.
-- **Deletion.** The freeze task deletes the just-finished store's directory only AFTER writing the immutable artifact and setting its meta-store flag (flag-after-fsync). A crash between flag-set and directory-delete leaves an orphan that Phase 3 (reconcile) classifies as "flag-is-truth" and deletes, thereby leaving no orphans.
+- **Deletion.** The freeze task deletes the just-finished rocksdb active store's directory only AFTER writing the immutable artifact and setting its meta-store freeze flag (flag-after-fsync).
 - **Crash recovery.** Active-store directories that survive a crash are reconciled organically on the next start — see [Phase 3 — Reconcile Orphaned Transitions](#phase-3--reconcile-orphaned-transitions) for the full classification.

 ***Max concurrency:*** each store kind (ledger / events / txhash) holds at most **one active + one transitioning at a time** — capped at 2 instances per kind, enforced by the per-kind single-flight gates in [Concurrency Model](#concurrency-model).
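The cap of one active plus one transitioning store per kind can be pictured as a gate that the ingestion loop waits on before each hand-off and that the freeze task releases as its last step. The sketch below is only an assumption about how `wait_for_lfs_complete` / `signal_lfs_complete` in the pseudocode might behave. `SingleFlightGate` and `run_boundary` are hypothetical names, and Python threads stand in for the service's goroutines.

```python
import threading


class SingleFlightGate:
    """Allows at most one in-flight freeze for a single store kind."""

    def __init__(self):
        self._idle = threading.Event()
        self._idle.set()  # no freeze is in flight at startup

    def wait_for_complete(self, timeout=None):
        # Called only by the single ingestion thread at a boundary:
        # block until the previous freeze (if any) has finished, then
        # mark the gate busy for the freeze about to be spawned.
        if not self._idle.wait(timeout):
            raise TimeoutError("previous freeze still running")
        self._idle.clear()

    def signal_complete(self):
        # Called by the background freeze task as its last step.
        self._idle.set()


def run_boundary(gate, freeze_fn):
    """Hand one finished store to a background freeze, single-flight."""
    gate.wait_for_complete()

    def task():
        try:
            freeze_fn()
        finally:
            # Released even on error so this sketch never deadlocks; the
            # real service may instead treat a failed freeze as fatal.
            gate.signal_complete()

    worker = threading.Thread(target=task)
    worker.start()
    return worker
```

Because only the ingestion thread ever calls `wait_for_complete`, the wait-then-clear pair needs no extra lock in this sketch. A design with several waiters per kind would need a semaphore or a lock instead.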
@@ -402,7 +419,7 @@ def phase2_hydrate_txhash(config, meta_store): if incomplete_tx_index_id is None: return - txhash_store = open_active_txhash_store(config, incomplete_tx_index_id) + txhash_store = open_active_txhash_store(config, meta_store, incomplete_tx_index_id) try: for chunk_id in chunks_for_tx_index(incomplete_tx_index_id): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): @@ -433,18 +450,39 @@ def phase3_reconcile_orphans(config, meta_store): if last_committed_ledger is None: return # fresh start — nothing in flight - resume_chunk_id = chunk_id_of_ledger(last_committed_ledger + 1) + resume_chunk_id = chunk_id_of_ledger(last_committed_ledger + 1) + resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) - reconcile_ledger_store_dirs(config, meta_store, resume_chunk_id) - reconcile_events_store_dirs(config, meta_store, resume_chunk_id) - reconcile_txhash_store_dirs(config, meta_store, tx_index_id_of_chunk(resume_chunk_id)) + # Iterate hot:* keys (no filesystem scan); each branch acts on the parsed (store_kind, id). + # chunk_or_tx_index_id holds chunk_id for "chunk:..." kinds, tx_index_id for "index:txhash". + for hot_key in meta_store.scan_prefix("hot:"): + store_kind, chunk_or_tx_index_id = parse_hot_key(hot_key) + resume_chunk_or_tx_index_id = ( + resume_chunk_id if store_kind.startswith("chunk:") else resume_tx_index_id + ) + store_path = active_store_path_for(store_kind, chunk_or_tx_index_id) + freeze_flag_key = freeze_flag_key_for(store_kind, chunk_or_tx_index_id) + + if chunk_or_tx_index_id == resume_chunk_or_tx_index_id: + continue # A: resume target — Phase 4 reopens. + + elif meta_store.has(freeze_flag_key): + # B: freeze done; dir-delete or hot-key clear didn't finish. Flag-is-truth. store_path is orphaned but frozen; safe to delete and clear. 
+ delete_dir_if_exists(store_path) + meta_store.delete(hot_key) + + elif chunk_or_tx_index_id < resume_chunk_or_tx_index_id: + # C: freeze was interrupted (data durable in store, artifact not yet written/flagged). Restart the freeze to completion. + finish_interrupted_freeze(store_kind, chunk_or_tx_index_id, meta_store) + + else: + # D: future-orphan — should not occur in normal flow. Log + defensive cleanup. + log.warn(f"phase3: future-orphan {store_kind}/{chunk_or_tx_index_id:08d} > resume {resume_chunk_or_tx_index_id:08d}") + delete_dir_if_exists(store_path) + meta_store.delete(hot_key) ``` -Each `reconcile_*_store_dirs` helper scans its own active-store directory type and classifies each dir it finds: - -- **`reconcile_ledger_store_dirs`** — per `chunk_id` found: `== resume_chunk_id` → keep (active); `chunk:{chunk_id}:lfs` flag present → delete dir (flag-is-truth, freeze completed); `< resume_chunk_id` and flag absent → call `finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store)`; else delete as future orphan. -- **`reconcile_events_store_dirs`** — identical classification with `:events` flag and `finish_interrupted_events_freeze`. -- **`reconcile_txhash_store_dirs`** — per `tx_index_id` found: `== resume_tx_index_id` → keep; `index:{tx_index_id:08d}:txhash` present → delete dir (RecSplit done); flag absent and all chunks of `tx_index_id` have `:lfs` → open the store synchronously and spawn `build_tx_index_recsplit_files` in background (the builder reads from the handle and closes it). +`finish_interrupted_freeze(store_kind, chunk_or_tx_index_id, meta_store)` dispatches by `store_kind` to the per-kind synchronous form: `finish_interrupted_ledger_freeze` (for `chunk:lfs`), `finish_interrupted_events_freeze` (for `chunk:events`), or `finish_interrupted_recsplit_build` (for `index:txhash`). 
Each opens the active store via the matching `open_active_*_store` helper (idempotent on existing or partial dirs), then runs the same write + fsync + flag-set + close + `delete_dir_if_exists` + clear-hot-key sequence as the live-path freeze. ### Compute Resume Ledger @@ -552,10 +590,11 @@ def open_active_stores_for_resume(config, meta_store, resume_ledger): resume_chunk_id = chunk_id_of_ledger(resume_ledger) resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) + # Each open_active_*_store sets its hot:* key before mkdir (see Flag Semantics). return ActiveStores( - ledger = open_or_create_ledger_store(config, resume_chunk_id), - events = open_or_create_events_store(config, meta_store, resume_chunk_id), - txhash = open_or_create_txhash_store(config, resume_tx_index_id), + ledger = open_active_ledger_store(config, meta_store, resume_chunk_id), + events = open_active_events_store(config, meta_store, resume_chunk_id), + txhash = open_active_txhash_store(config, meta_store, resume_tx_index_id), ) ``` @@ -631,12 +670,12 @@ def on_chunk_boundary(chunk_id, active_stores, meta_store): # spawn background freeze. LFS and events run independently (events doesn't wait on LFS). 
wait_for_lfs_complete() transitioning_ledger_store = active_stores.ledger - active_stores.ledger = open_or_create_ledger_store(config, chunk_id + 1) + active_stores.ledger = open_active_ledger_store(config, meta_store, chunk_id + 1) run_in_background(freeze_ledger_chunk_to_pack_file, chunk_id, transitioning_ledger_store, meta_store) wait_for_events_complete() transitioning_events_store = active_stores.events - active_stores.events = open_or_create_events_store(config, meta_store, chunk_id + 1) + active_stores.events = open_active_events_store(config, meta_store, chunk_id + 1) run_in_background(freeze_events_chunk_to_cold_segment, chunk_id, transitioning_events_store, meta_store) notify_lifecycle() # wake prune loop @@ -655,13 +694,14 @@ def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_ for ledger_seq in range(first_ledger_in_chunk(chunk_id), last_ledger_in_chunk(chunk_id) + 1): writer.append(transitioning_ledger_store.get(uint32_big_endian(ledger_seq))) writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") + meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") # freeze flag (artifact is durable) transitioning_ledger_store.close() - delete_dir(ledger_store_path(chunk_id)) + delete_dir(ledger_store_path(chunk_id)) # remove the active store dir + meta_store.delete(f"hot:chunk:{chunk_id:08d}:lfs") # clear hot key (file-before-flag-delete) signal_lfs_complete() ``` -`finish_interrupted_ledger_freeze(store_dir, chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the store at `store_dir`, runs the same write + fsync + flag + close + `delete_dir(store_dir)` sequence, no `signal_lfs_complete`. +`finish_interrupted_ledger_freeze(chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the active store via `open_active_ledger_store`, runs the same write + fsync + flag + close + `delete_dir_if_exists` + clear-hot-key sequence, no `signal_lfs_complete`. 
### Events Transition @@ -672,13 +712,14 @@ def freeze_events_chunk_to_cold_segment(chunk_id, transitioning_events_store, me events_path = events_segment_path(chunk_id) write_cold_segment(transitioning_events_store, events_path) # 3 files: events.pack, index.pack, index.hash fsync_all(events_path) - meta_store.put(f"chunk:{chunk_id:08d}:events", "1") + meta_store.put(f"chunk:{chunk_id:08d}:events", "1") # freeze flag transitioning_events_store.close() - delete_dir(events_store_path(chunk_id)) + delete_dir(events_store_path(chunk_id)) # remove the active store dir + meta_store.delete(f"hot:chunk:{chunk_id:08d}:events") # clear hot key signal_events_complete() ``` -`finish_interrupted_events_freeze(store_dir, chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the store at `store_dir`, runs the same write + fsync + flag + close + `delete_dir(store_dir)` sequence, no `signal_events_complete`. +`finish_interrupted_events_freeze(chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the active store via `open_active_events_store`, runs the same write + fsync + flag + close + `delete_dir_if_exists` + clear-hot-key sequence, no `signal_events_complete`. ### Tx-Index Boundary (every `LEDGERS_PER_TX_INDEX` ledgers) @@ -691,7 +732,7 @@ def on_tx_index_boundary(tx_index_id, active_stores, meta_store): wait_for_events_complete() verify_all_chunk_flags(tx_index_id, meta_store) transitioning_txhash_store = active_stores.txhash - active_stores.txhash = open_or_create_txhash_store(config, tx_index_id + 1) + active_stores.txhash = open_active_txhash_store(config, meta_store, tx_index_id + 1) run_in_background(build_tx_index_recsplit_files, tx_index_id, transitioning_txhash_store, meta_store) ``` @@ -704,12 +745,13 @@ def build_tx_index_recsplit_files(tx_index_id, transitioning_txhash_store, meta_ # Same flag-after-fsync pattern as LFS / events freeze; verify before flag. 
idx_path = recsplit_index_path(tx_index_id) delete_partial_idx_files(idx_path) - build_recsplit(transitioning_txhash_store, idx_path) # 16 .idx files + build_recsplit(transitioning_txhash_store, idx_path) # 16 .idx files fsync_all_idx_files(idx_path) verify_spot_check(tx_index_id, idx_path, meta_store) - meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") + meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") # freeze flag transitioning_txhash_store.close() - delete_dir(txhash_store_path(tx_index_id)) + delete_dir(txhash_store_path(tx_index_id)) # remove the active store dir + meta_store.delete(f"hot:index:{tx_index_id:08d}:txhash") # clear hot key ``` --- @@ -829,6 +871,7 @@ In addition to the backfill subroutine's invariants in [01-backfill-workflow.md 3. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. 4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 (catchup) itself avoids producing them. 5. **Two-phase prune marker.** `prune_tx_index` writes `index:{tx_index_id}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `prunable_tx_index_ids`. +6. **Hot-key tracking.** Every active store directory has a corresponding `hot:*` key, set BEFORE `mkdir` and cleared AFTER `delete_dir`. Phase 3 (reconcile) iterates `hot:*` keys to find directories that need recovery — no filesystem scan anywhere in the design. 
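The Phase 3 reconcile loop calls `parse_hot_key` and `freeze_flag_key_for` without defining them. One possible shape, assuming the key layouts `hot:{scope}:{id:08d}:{artifact}` and `{scope}:{id:08d}:{artifact}` shown in the key tables, is sketched here. `classify_hot_key` is a hypothetical helper that restates the A/B/C/D branch order of the loop so it can be checked in isolation.

```python
from typing import Tuple

# Store kinds as they appear in the Phase 3 pseudocode.
_KNOWN_KINDS = {"chunk:lfs", "chunk:events", "index:txhash"}


def parse_hot_key(hot_key: str) -> Tuple[str, int]:
    """Split 'hot:chunk:00000042:lfs' into ('chunk:lfs', 42)."""
    parts = hot_key.split(":")
    if len(parts) != 4 or parts[0] != "hot":
        raise ValueError(f"not a hot key: {hot_key!r}")
    _, scope, raw_id, artifact = parts
    store_kind = f"{scope}:{artifact}"
    if store_kind not in _KNOWN_KINDS:
        raise ValueError(f"unknown store kind in {hot_key!r}")
    return store_kind, int(raw_id)


def freeze_flag_key_for(store_kind: str, chunk_or_tx_index_id: int) -> str:
    """Map ('chunk:lfs', 42) to the freeze flag 'chunk:00000042:lfs'."""
    scope, artifact = store_kind.split(":")
    return f"{scope}:{chunk_or_tx_index_id:08d}:{artifact}"


def classify_hot_key(chunk_or_tx_index_id: int, resume_id: int,
                     freeze_flag_present: bool) -> str:
    """Return the scenario letter, in the same order as the reconcile loop."""
    if chunk_or_tx_index_id == resume_id:
        return "A"  # resume target: keep, Phase 4 reopens it
    if freeze_flag_present:
        return "B"  # freeze done: delete dir if present, clear hot key
    if chunk_or_tx_index_id < resume_id:
        return "C"  # interrupted freeze: run it to completion
    return "D"      # future orphan: warn, then clean up defensively
```

The branch order matters: the resume-target check comes first, so a hot key whose id equals the resume id is always kept, whatever the freeze flag says.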
#### Compound Recovery Scenarios @@ -841,15 +884,23 @@ Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workf - **Sweep 2, between file-delete and flag-delete.** Flag set, file gone, data durable. Restart: flag triggers retry, `os.path.exists(bin_path)` is False so load is skipped, file delete is a no-op, flag deleted. - **No filesystem scan needed in any case** — the meta-store flag is the only signal the next start consults. -- **Crash between per-ledger checkpoint and LFS freeze completion.** - - State: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)`; `chunk:{chunk_id}:lfs` absent. +- **Crash between per-ledger checkpoint and freeze completion (LFS / events).** + - State: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)`; `chunk:{chunk_id}:lfs` absent; `hot:chunk:{chunk_id}:lfs` set; active ledger store dir present. - Phase 1 (catchup) on restart (assumes `[BSB]` configured): `:lfs` missing → re-runs `process_chunk(chunk_id)` with a fresh per-task BSB (idempotent per artifact). - - Phase 3 (reconcile) then: active ledger store present + `:lfs` now set → deletes the orphaned store. + - Phase 3 (reconcile) iterates `hot:*` keys. Hits SCENARIO B (freeze flag now set + chunk_id < resume): `delete_dir_if_exists` + clear hot key. Cleanup is idempotent. - Cost: ~10_000 ledgers of redundant ingestion per affected chunk. Correctness preserved. - **Crash mid-RecSplit.** - - State: `index:{tx_index_id}:txhash` absent; all `:lfs` chunks of the tx index present. - - Phase 3 (reconcile): re-spawns the RecSplit build after deleting partial `.idx` files. + - State: `index:{tx_index_id}:txhash` absent; `hot:index:{tx_index_id}:txhash` set; all `:lfs` chunks of the tx index present; partial `.idx` files possibly on disk. + - Phase 3 (reconcile) iterates `hot:*` keys. 
Hits SCENARIO C (no freeze flag, `chunk_or_tx_index_id < resume_chunk_or_tx_index_id`): `finish_interrupted_freeze("index:txhash", ...)` runs `build_tx_index_recsplit_files` synchronously. The build's preamble deletes any partial `.idx` files, rebuilds, sets the flag, deletes the dir, clears the hot key. + +- **Crash mid hot-store creation.** + - State: `hot:chunk:{chunk_id}:lfs` (or events / txhash) set, but `mkdir` / RocksDB open didn't complete. Dir might be absent or partially set up. Freeze flag absent. + - Phase 3 (reconcile): if `chunk_id == resume`, SCENARIO A — keep; Phase 4 reopens via `open_active_*_store` which is idempotent (mkdir is no-op on existing dir, RocksDB recovers from any partial WAL state). If `chunk_id < resume`, SCENARIO C — `finish_interrupted_freeze` reopens and re-runs the freeze (handles empty/partial RocksDB the same way). No special-case handling needed. + +- **Crash between hot-store dir-delete and `meta_store.delete(hot:*)`.** + - State: freeze flag set, dir already gone, hot key still set. + - Phase 3 (reconcile) hits SCENARIO B. `delete_dir_if_exists` no-ops on the missing dir; clears the hot key. Consistent with the file-before-flag-delete invariant: the hot key is the recovery signal, never an orphan dir without a key. - **Crash mid-prune.** - State: some files deleted, some chunk keys cleared, `index:{tx_index_id}:txhash = "deleting"` still present. From 10268c403b52bead610202bbdf4702a8a850ca18 Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Sat, 25 Apr 2026 21:09:42 -0700 Subject: [PATCH 31/34] Streaming doc: length + clarity prune MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 933 → 835 lines (-98, -10.5%). Semantics preserved. - Hoist repeated parentheticals in Key Lifecycle in Streaming block. 
- Collapse Active Store Architecture table → 3 store bullets; fold the 16-CF txhash detail and idempotent-events note into their store rows; merge the Max-concurrency one-liner into Boundary swap. - delete_dir → delete_dir_if_exists in all three freeze pseudocode blocks (matches source-of-truth invariant #11 for cleanup paths). - Drop self-invented CAPTIVE_CORE_RETRY_MAX bullet. - Drop per-kind finish_interrupted_*_freeze enumeration in narrative; the dispatcher finish_interrupted_freeze just reopens the active store and runs the corresponding live-path freeze. - Delete the Error Handling section (most bullets restated invariants / Validation Rules; the RecSplit-verify-mismatch corner case stays in source-of-truth CC39 for TDD). - Drop Crash mid-prune and Crash-between-hot-store-dir-delete-and-hot-key compound recovery scenarios (subsumed by invariant #5 + Phase 3 SCENARIO B). - Compress Concurrency Model bullets, Pruning rationale, Query Contract Readiness Signal, Resilience preface; collapse multi-line wrapping comments in pseudocode. - Rename CaptiveStellarCore → captive core in narrative. --- .../design-docs/02-streaming-workflow.md | 294 ++++++------------ 1 file changed, 98 insertions(+), 196 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 1a63d81ac..327f5687a 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -97,8 +97,7 @@ Stored on first start; fatal on any subsequent start where the config value diff | `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | - Source selection (BSB vs captive core) is determined per-startup by `[BSB]` presence; not stored as immutable. -- Operators may add or remove BSB between runs; on each start, Phase 1 (catchup) either re-runs backfill from the retention-aligned start (BSB present) or no-ops (BSB absent). 
`compute_resume_ledger` then derives resume from whatever chunks are on disk. -- Locking the source choice would add nothing — `RETENTION_LEDGERS` already pins down what range of ledgers ends up on disk, and that's what actually has to stay consistent across runs. Whether each ledger arrived via BSB or captive core doesn't change anything on disk.. +- Operators may add or remove `[BSB]` between runs; `compute_resume_ledger` derives resume from on-disk chunks regardless of which source produced them. `RETENTION_LEDGERS` already pins the retained ledger window — locking the source choice would add nothing. ### TOML Sections Documented Here @@ -167,10 +166,7 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 ```python def validate_config(config, meta_store): - apply_static_rules(config) # required-field presence, RETENTION_LEDGERS - # multiple-of-LEDGERS_PER_TX_INDEX, [BSB]+retention=0 fatal - # — see "Validation Rules" above for the full contract. - + apply_static_rules(config) # see "Validation Rules" above _enforce_immutable(meta_store, "config:chunks_per_tx_index", str(config.service.chunks_per_tx_index)) _enforce_immutable(meta_store, "config:retention_ledgers", str(config.service.retention_ledgers)) @@ -178,7 +174,7 @@ def validate_config(config, meta_store): def _enforce_immutable(meta_store, key, current_value): stored = meta_store.get(key) if stored is None: - meta_store.put(key, current_value) # first-run snapshot + meta_store.put(key, current_value) elif stored != current_value: fatal(f"{key} changed: stored={stored}, config={current_value}. 
Wipe datadir.") ``` @@ -227,83 +223,65 @@ Streaming-specific use of these keys (which paths write them when, and the `"del ### Key Lifecycle in Streaming ``` -Phase 1 (catchup): - chunk:{chunk_id}:lfs = "1" (after pack fsync) - chunk:{chunk_id}:txhash = "1" (after .bin fsync) # only present for chunks that still have .bin on disk - chunk:{chunk_id}:events = "1" (after cold segment fsync) - index:{tx_index_id}:txhash = "1" (after RecSplit, when all chunks of tx_index_id are done in Phase 1 (catchup)) - -Phase 2 (.bin hydration — see Startup Sequence): - For every chunk with :txhash flag (and possibly a .bin file): - if .bin exists: load .bin into the active txhash RocksDB - delete .bin file (file FIRST — see Flag Semantics) - delete chunk:{chunk_id}:txhash flag (flag LAST) - After Phase 2 (.bin hydration), no chunk:{chunk_id}:txhash flags and no .bin files remain. - -Active store open (Phase 2 / Phase 4 entry / boundary handlers): - hot:* keys are set BEFORE mkdir, one per active store kind: +Phase 1 (catchup) — every freeze flag set AFTER its artifact's fsync: + chunk:{chunk_id}:lfs = "1" + chunk:{chunk_id}:txhash = "1" # only while .bin is still on disk + chunk:{chunk_id}:events = "1" + index:{tx_index_id}:txhash = "1" # set when all chunks of tx_index_id are done + +Phase 2 (.bin hydration — see Startup Sequence) — file-before-flag-delete: + for each chunk with :txhash flag: + if .bin exists: load into active txhash RocksDB + delete .bin file + delete chunk:{chunk_id}:txhash flag + After Phase 2, no :txhash chunk flags and no .bin files remain. 
+ +Active store open (Phase 2 / Phase 4 entry / boundary handlers) — +hot:* keys set BEFORE mkdir, one per active store kind: hot:chunk:{chunk_id}:lfs = "1" hot:chunk:{chunk_id}:events = "1" hot:index:{tx_index_id}:txhash = "1" -Live path (per ledger): - streaming:last_committed_ledger = ledger_seq (after all 3 active stores commit) - -Live path (per chunk, background freeze): - Freeze flag set AFTER fsync; hot key cleared AFTER dir removed (file-before-flag-delete). - chunk:{chunk_id}:lfs = "1" - [delete rocksdb ledger store dir] - hot:chunk:{chunk_id}:lfs → deleted +Live path (per ledger, after all 3 active stores commit): + streaming:last_committed_ledger = ledger_seq - chunk:{chunk_id}:events = "1" - [delete rocksdb events store dir] - hot:chunk:{chunk_id}:events → deleted +Live path (per chunk, background freeze) — flag AFTER fsync, hot key AFTER dir delete: + chunk:{chunk_id}:lfs = "1" → delete ledger store dir → clear hot:chunk:{chunk_id}:lfs + chunk:{chunk_id}:events = "1" → delete events store dir → clear hot:chunk:{chunk_id}:events -Live path (per index, background freeze): - index:{tx_index_id}:txhash = "1" - [delete rocksdb txhash store dir] - hot:index:{tx_index_id}:txhash → deleted +Live path (per index, background freeze) — same pattern: + index:{tx_index_id}:txhash = "1" → delete txhash store dir → clear hot:index:{tx_index_id}:txhash -Pruning (background, when tx_index_id is past retention): - index:{tx_index_id}:txhash = "deleting" (set BEFORE any files are deleted; queries return 4xx from here on) - [delete all files + per-chunk :lfs + :events keys for tx_index_id] - index:{tx_index_id}:txhash → deleted (cleared AFTER all files are gone) +Pruning (background, when tx_index_id is past retention) — two-phase marker: + index:{tx_index_id}:txhash = "deleting" (queries return 4xx from here on) + delete all files + per-chunk :lfs + :events keys for tx_index_id + index:{tx_index_id}:txhash → deleted ``` ### Flag Semantics - **Flag-after-fsync (creation 
order).** A flag is set only AFTER the artifact it represents has been fsynced. Flag absent ⇒ artifact missing or incomplete; flag present ⇒ artifact is durable.
-- **File-before-flag-delete (cleanup order).** When deleting, the file is removed FIRST and the flag is cleared LAST. Flag present ⇒ cleanup may not be complete; flag absent ⇒ cleanup is done and no file exists. The reverse order (flag-then-file) would orphan a file with no meta-store record on a crash mid-pair, recoverable only by filesystem scan.
-- **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility, active-store directory reconciliation — everything derives from meta store key presence. No filesystem-scan-and-infer anywhere.
+- **File-before-flag-delete (cleanup order).** The file is removed FIRST; the flag is cleared LAST. Flag present ⇒ cleanup may be incomplete; flag absent ⇒ cleanup done, no file exists. Reverse order would orphan a file with no meta-store record on a crash mid-pair, recoverable only by filesystem scan.
+- **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility, active-store directory reconciliation — derives from meta-store key presence. No filesystem-scan-and-infer anywhere.

-These three rules together mean: at any point during creation OR cleanup of any artifact (immutable file OR active store directory), the meta-store flag is the always-correct signal of the artifact's state on disk. A crash anywhere in the sequence leaves a state the next start can recover from by checking flag presence alone.
+Together: the meta-store flag is the always-correct signal of artifact state on disk, both for immutable files and for active-store directories. A crash anywhere leaves a state the next start recovers from by flag presence alone.
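The two ordering rules can be exercised with a toy model. In this minimal sketch a plain `dict` stands in for the RocksDB meta store, and the file name `00000001.bin` is made up for illustration; none of the function names come from the design. It shows only the ordering: the flag is written after the file's fsync, and cleared after the file is gone, so a retry driven by flag presence is idempotent.

```python
import os
import tempfile


def create_artifact(meta, flag_key, path, payload):
    """Creation order: write and fsync the file, then set the flag."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    meta[flag_key] = "1"  # flag present now implies a durable artifact


def cleanup_artifact(meta, flag_key, path):
    """Cleanup order: remove the file first, clear the flag last."""
    if os.path.exists(path):
        os.remove(path)
    meta.pop(flag_key, None)


def simulate_crash_mid_cleanup():
    """Crash after the file delete but before the flag delete, then retry."""
    meta = {}
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "00000001.bin")  # hypothetical name
        key = "chunk:00000001:txhash"
        create_artifact(meta, key, path, b"payload")
        os.remove(path)                     # the "crash" point
        flag_survived = key in meta         # the flag is the recovery signal
        cleanup_artifact(meta, key, path)   # restart: flag present, so retry
        return flag_survived, meta, os.path.exists(path)
```

A real implementation would probably also need to make the directory entry durable (for example by fsyncing the parent directory), which this sketch leaves out.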
 ---

 ## Active Store Architecture

-The service maintains three RocksDB-backed active stores for the current ingestion position; WAL must always be enabled. Each active store directory's existence is tracked in the meta store via a `hot:*` key (set before the directory is created, cleared after it is removed) — Phase 3 (reconcile) uses these keys to find directories that need recovery without ever scanning the filesystem. All per-chunk and per-index lifecycle is driven by the [freeze transitions](#freeze-transitions).
-
-| Store | Path | Key | Value | Transition cadence |
-|---|---|---|---|---|
-| Ledger | `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{chunk_id:08d}/` | `uint32BE(ledgerSeq)` | `zstd(LCM bytes)` | Every 10_000 ledgers (chunk) |
-| TxHash | `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/` | `txhash[32]` | `uint32BE(ledgerSeq)` | Every `LEDGERS_PER_TX_INDEX` ledgers (index) |
-| Events | `{ACTIVE_STORAGE.PATH}/events-store-chunk-{chunk_id:08d}/` | per [getEvents full-history design](../../design-docs/getevents-full-history-design.md) | per [getEvents full-history design](../../design-docs/getevents-full-history-design.md) | Every 10_000 ledgers (chunk) |
+Three RocksDB-backed active stores; WAL always enabled. Each directory has a `hot:*` key (set before mkdir, cleared after dir removal); Phase 3 (reconcile) finds directories needing recovery via meta-store scan, never filesystem scan. Lifecycle driven by [freeze transitions](#freeze-transitions).

-- TxHash store uses 16 column families (`cf-0`..`cf-f`) routed by the high nibble of the txhash (`txhash[0] >> 4`); each CF pairs 1:1 with one of the 16 RecSplit `.idx` files at the index boundary.
-- Events writes are idempotent at per-ledger granularity — a re-write of the same ledger sequence overwrites cleanly, so crash-replay is corruption-free.
+- **Ledger** — one per chunk at `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{chunk_id:08d}/`. Key `uint32BE(ledgerSeq)`, value `zstd(LCM bytes)`.
+- **TxHash** — one per tx index at `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/`. Key `txhash[32]`, value `uint32BE(ledgerSeq)`. 16 column families (`cf-0`..`cf-f`) routed by `txhash[0] >> 4`; each CF pairs 1:1 with one of the 16 RecSplit `.idx` files at the index boundary. +- **Events** — one per chunk at `{ACTIVE_STORAGE.PATH}/events-store-chunk-{chunk_id:08d}/`. Schema per [getEvents full-history design](../../design-docs/getevents-full-history-design.md). Per-ledger writes are idempotent — re-write of the same ledger overwrites cleanly, so crash-replay is corruption-free. ### Store Lifecycle -- **Creation.** Active stores are opened on-demand, synchronously, at the boundary where they're first needed. - - At every chunk boundary, the next chunk's ledger and events stores open synchronously, while the just-finished ones are handed off to the background freeze. - - At every tx-index boundary, the next tx index's txhash store opens the same way. -- **Synchronous open cost.** Opening a new active store doesn't take long enough to matter — about 100 ms, at max. -- **Transition.** At each boundary, the ingestion loop hands off the just-finished store to a background freeze task and continues writing into the freshly-opened next store. Ingestion never blocks on the freeze. -- **Deletion.** The freeze task deletes the just-finished rocksdb active store's directory only AFTER writing the immutable artifact and setting its meta-store freeze flag (flag-after-fsync). -- **Crash recovery.** Active-store directories that survive a crash are reconciled organically on the next start — see [Phase 3 — Reconcile Orphaned Transitions](#phase-3--reconcile-orphaned-transitions) for the full classification. - -***Max concurrency:*** each store kind (ledger / events / txhash) holds at most **one active + one transitioning at a time** — capped at 2 instances per kind, enforced by the per-kind single-flight gates in [Concurrency Model](#concurrency-model). 
+- **Boundary swap.** At every chunk boundary the next chunk's ledger + events stores open synchronously while the just-finished ones are handed to background freeze tasks; tx-index boundaries do the same for txhash. Each kind therefore holds at most one active + one transitioning at a time. Ingestion never blocks on the freeze. +- **Synchronous open cost.** ~100 ms maximum — small enough to ignore. +- **Deletion.** The freeze task deletes the active store's directory only after the immutable artifact is fsynced and its freeze flag is set. +- **Crash recovery.** Active-store directories surviving a crash are reconciled on the next start — see [Phase 3 — Reconcile Orphaned Transitions](#phase-3--reconcile-orphaned-transitions). --- @@ -311,8 +289,8 @@ The service maintains three RocksDB-backed active stores for the current ingesti Two ledger sources, scoped to different phases: -- **Backfill (Phase 1 (catchup)) uses `BSBSource`** — the backfill-only reader; interface mirrors the stellar Go SDK's `LedgerBackend` (`PrepareRange` + `GetLedger`). Each `process_chunk` instantiates its own from `[BSB]` config, prepares range for its 10_000 ledgers, reads, tears down. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). -- **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — no `LedgerSource` wrapper. Phase 4 (live ingestion) calls the stellar Go SDK's `ledgerBackend.PrepareRange(UnboundedRange(resume_ledger)) + GetLedger(seq)` against the captive-core subprocess. +- **Backfill (Phase 1 (catchup)) uses `BSBSource`** — backfill-only reader (`PrepareRange` + `GetLedger`). Each `process_chunk` constructs its own scoped to its chunk's 10_000 ledgers. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). 
+- **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — `PrepareRange(UnboundedRange(resume_ledger))` + per-ledger `GetLedger(seq)` against the captive-core subprocess. --- @@ -325,8 +303,6 @@ Four sequential phases, same code path for first start and every restart. The fi - **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. - **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle task, flips the `service_ready` flag, enters the ingestion loop. Runs until process exit. -"Phase" here refers to the startup ordering only. Once Phase 4 (live ingestion) is entered, there's no Phase 5 — the service is in live-streaming steady state. - ### Backfill vs Phase 1 (catchup) - **Backfill** is the subroutine (`run_backfill` in [01-backfill-workflow.md](./01-backfill-workflow.md)). BSB-only; parallel per-chunk BSB instances. Captive core cannot be a backfill source — its subprocess is serial and expensive to spin up per instantiation. @@ -381,17 +357,14 @@ def phase1_catchup(config, meta_store): def retention_aligned_start_chunk(tip_ledger, retention_ledgers): - # Aligns DOWN to a tx-index boundary (no-gaps invariant); costs up to - # LEDGERS_PER_TX_INDEX - 1 extra ledgers below strict retention. + # Aligns DOWN to a tx-index boundary (no-gaps); up to LEDGERS_PER_TX_INDEX - 1 ledgers below strict retention. if retention_ledgers == 0: return 0 target_ledger = max(tip_ledger - retention_ledgers, GENESIS_LEDGER) return first_chunk_id_of_tx_index_containing(target_ledger) ``` -**Worker concurrency:** `run_backfill` caps DAG concurrency at `MAX_CPU_THREADS`. Each `process_chunk` owns its own `BSBSource` instance, prepares range for its 10_000 ledgers, reads, and tears down — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunk). - -**Retention effect:** retention determines Phase 1 (catchup)'s chunk range. 
Catchup time ≈ `retention_window / (BSB throughput)`. +**Worker concurrency:** `run_backfill` caps DAG concurrency at `MAX_CPU_THREADS` — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunk). Catchup time ≈ `retention_window / (BSB throughput)`. ### Phase 2 — Hydrate TxHash Data from `.bin` @@ -401,20 +374,16 @@ def retention_aligned_start_chunk(tip_ledger, retention_ledgers): ```python def phase2_hydrate_txhash(config, meta_store): - # Both sweeps below delete the .bin file BEFORE deleting its :txhash flag (see Flag Semantics). - # On any crash mid-pair, the flag is the recovery signal — never an orphan file with no record. + # Both sweeps: file-before-flag-delete (see Flag Semantics). - # Sweep 1: once an index's RecSplit is built, the per-chunk .bin files become - # redundant (their data is now in the index). Backfill deletes them via - # cleanup_txhash; this sweep finishes the job if backfill crashed mid-cleanup - # and left some chunks with their .bin file + :txhash flag still around. + # Sweep 1: clean leftover .bin from completed indexes (cleanup_txhash crashed mid-pair). for tx_index_id in tx_index_ids_with_txhash_flag(meta_store): for chunk_id in chunks_for_tx_index(tx_index_id): if meta_store.has(f"chunk:{chunk_id:08d}:txhash"): delete_if_exists(raw_txhash_path(chunk_id)) meta_store.delete(f"chunk:{chunk_id:08d}:txhash") - # Sweep 2: hydrate the trailing incomplete tx index into the active RocksDB. + # Sweep 2: hydrate the trailing incomplete tx index into the active txhash store. incomplete_tx_index_id = current_incomplete_tx_index_id(meta_store) if incomplete_tx_index_id is None: return @@ -430,15 +399,11 @@ def phase2_hydrate_txhash(config, meta_store): delete_if_exists(bin_path) meta_store.delete(f"chunk:{chunk_id:08d}:txhash") finally: - txhash_store.close() # Phase 4 re-opens by directory path; flock would collide otherwise. + txhash_store.close() # Phase 4 re-opens by path; flock would collide otherwise. 
``` -**Why "load then delete" matters.** -- Without immediate deletion, every restart during the incomplete-index lifetime would re-load the same `.bin` files into RocksDB. -- At `cpi=1_000` with frequent restarts over a day: thousands of redundant loads. -- Load-then-delete makes Phase 2 (`.bin` hydration) a no-op on every subsequent restart until the next Phase 1 (catchup) deposits new `.bin` files. - -**Pure-streaming restarts** (no recent Phase 1 (catchup) output) never see `.bin` files; streaming's live path writes txhash directly to the active RocksDB txhash store. Phase 2 (`.bin` hydration) is a no-op. +- **Why "load then delete".** Without it, every restart during the incomplete-index lifetime would re-load the same `.bin` files. Load-then-delete makes Phase 2 a no-op on every subsequent restart until Phase 1 (catchup) deposits new `.bin` files. +- **Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files; the live path writes txhash directly to the active store. Phase 2 is a no-op. ### Phase 3 — Reconcile Orphaned Transitions @@ -453,8 +418,7 @@ def phase3_reconcile_orphans(config, meta_store): resume_chunk_id = chunk_id_of_ledger(last_committed_ledger + 1) resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) - # Iterate hot:* keys (no filesystem scan); each branch acts on the parsed (store_kind, id). - # chunk_or_tx_index_id holds chunk_id for "chunk:..." kinds, tx_index_id for "index:txhash". + # Iterate hot:* keys; each branch acts on the parsed (store_kind, id). for hot_key in meta_store.scan_prefix("hot:"): store_kind, chunk_or_tx_index_id = parse_hot_key(hot_key) resume_chunk_or_tx_index_id = ( @@ -464,30 +428,30 @@ def phase3_reconcile_orphans(config, meta_store): freeze_flag_key = freeze_flag_key_for(store_kind, chunk_or_tx_index_id) if chunk_or_tx_index_id == resume_chunk_or_tx_index_id: - continue # A: resume target — Phase 4 reopens. + continue # A: resume target — Phase 4 reopens. 
elif meta_store.has(freeze_flag_key): - # B: freeze done; dir-delete or hot-key clear didn't finish. Flag-is-truth. store_path is orphaned but frozen; safe to delete and clear. + # B: flag-is-truth. Frozen, but cleanup didn't finish. delete_dir_if_exists(store_path) meta_store.delete(hot_key) elif chunk_or_tx_index_id < resume_chunk_or_tx_index_id: - # C: freeze was interrupted (data durable in store, artifact not yet written/flagged). Restart the freeze to completion. + # C: freeze interrupted; restart it to completion. finish_interrupted_freeze(store_kind, chunk_or_tx_index_id, meta_store) else: - # D: future-orphan — should not occur in normal flow. Log + defensive cleanup. + # D: future-orphan — shouldn't occur. Log + defensive cleanup. log.warn(f"phase3: future-orphan {store_kind}/{chunk_or_tx_index_id:08d} > resume {resume_chunk_or_tx_index_id:08d}") delete_dir_if_exists(store_path) meta_store.delete(hot_key) ``` -`finish_interrupted_freeze(store_kind, chunk_or_tx_index_id, meta_store)` dispatches by `store_kind` to the per-kind synchronous form: `finish_interrupted_ledger_freeze` (for `chunk:lfs`), `finish_interrupted_events_freeze` (for `chunk:events`), or `finish_interrupted_recsplit_build` (for `index:txhash`). Each opens the active store via the matching `open_active_*_store` helper (idempotent on existing or partial dirs), then runs the same write + fsync + flag-set + close + `delete_dir_if_exists` + clear-hot-key sequence as the live-path freeze. +`finish_interrupted_freeze` reopens the active store (idempotent on existing or partial dirs) and runs the corresponding live-path freeze ([LFS](#lfs-transition), [events](#events-transition), or [RecSplit](#recsplit-transition)) to produce the artifact. ### Compute Resume Ledger - `compute_resume_ledger` is a shared helper called once per service start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). 
Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`. -- **Runs AFTER Phase 3 (reconcile).** Phase 3's `finish_interrupted_ledger_freeze` writes `:lfs` for chunks whose freeze was in flight at a prior crash; running `compute_resume_ledger` before Phase 3 would see those mid-freeze chunks as internal `:lfs` gaps and false-positive-fatal at startup. +- **Runs AFTER Phase 3 (reconcile).** Phase 3 writes the `:lfs` flag for chunks whose freeze was in flight at a prior crash; running `compute_resume_ledger` before Phase 3 would see those mid-freeze chunks as internal `:lfs` gaps and false-positive-fatal at startup. - **Scans every startup, even when `streaming:last_committed_ledger` is already set.** The scan's primary output in the mid-life-restart case is validation, not derivation; catching broken on-disk state before opening active stores is strictly safer than silently resuming on top. - **Validation failures are fatal.** Any inconsistency aborts startup with "migration to streaming failed" + an operator-readable error naming what's wrong. The service exits non-zero; no active stores are opened. @@ -546,8 +510,7 @@ def validate_scan(scan, cpi): def validate_last_committed_consistency(scan, last_committed_ledger): - # `streaming:last_committed_ledger = L` implies every chunk up to chunk_id_of_ledger(L)-1 - # must have :lfs (the chunk containing L itself is the currently-ingesting one). + # L = last_committed_ledger ⇒ all chunks up to chunk_id_of_ledger(L)-1 must have :lfs (L's own chunk is still ingesting). 
active_chunk_id = chunk_id_of_ledger(last_committed_ledger) required_last = active_chunk_id - 1 if required_last < 0: @@ -559,11 +522,9 @@ def validate_last_committed_consistency(scan, last_committed_ledger): def retention_aligned_resume_ledger(config): - # Tip-tracker fresh-start branch (no BSB, no on-disk chunks). validate_config - # rejects [BSB]-absent + retention=0, so GENESIS_LEDGER is only a defensive floor. + # Tip-tracker fresh start (no BSB, no on-disk chunks). [BSB]-absent + retention=0 is rejected by validate_config. network_tip_ledger = get_latest_network_tip(config.history_archives.urls) retention_ledgers = config.service.retention_ledgers - target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) return first_ledger_of_tx_index_containing(target_ledger) ``` @@ -574,8 +535,7 @@ Opens active stores for the resume position, spawns the lifecycle task, starts c ```python def phase4_live_ingest(config, meta_store, resume_ledger): - # streaming:last_committed_ledger is NOT written at bootstrap — first write happens - # inside the live ingestion loop after the first durable commit. + # streaming:last_committed_ledger is first written by the live loop, not at bootstrap. active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) run_in_background(run_prune_lifecycle_loop, config, meta_store) @@ -611,9 +571,7 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r ledger_seq = resume_ledger while True: lcm = ledger_backend.GetLedger(ledger_seq) # blocks until available - - # All three writes durably commit before advancing the checkpoint. 
- wait_all( + wait_all( # all three writes durably commit before advancing the checkpoint run_in_background(write_ledger_store, active_stores.ledger, ledger_seq, lcm), run_in_background(write_txhash_store, active_stores.txhash, ledger_seq, lcm), run_in_background(write_events_store, active_stores.events, ledger_seq, lcm), @@ -648,17 +606,14 @@ Three independent background transitions per chunk/index boundary; each has its ### Concurrency Model -- **`active_stores` is the ingestion loop's owned state.** Fields (`ledger`, `events`, `txhash` — one handle per data type, no `*_next`) are mutated only by the ingestion loop thread — specifically inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze transitions receive a handle by value at spawn time and never read back through `active_stores`. -- **Meta-store is single-writer.** Meta-store flag writes come from: the ingestion loop (per-ledger checkpoint), freeze transitions (artifact `:lfs` / `:events` / `:txhash` flags after fsync), and the lifecycle loop (`"deleting"` marker + key delete during prune). The meta-store wrapper serializes them with internal locking on top of RocksDB's own single-writer semantics. -- **`wait_for_lfs_complete()` / `wait_for_events_complete()` are per-kind single-flight gates.** - - One outstanding transition per kind (LFS / events / RecSplit); the second starts only after the first releases. - - `wait_for_lfs_complete()` acquires the gate; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases it. - - Not a global wait barrier — that would block until every in-flight transition across all kinds finished, defeating per-kind independence. -- **Query handlers read from storage-manager layer** — each per-data-type storage manager (ledger / events / txhash) owns its own state-transition synchronization; the query handler never touches `active_stores` directly. 
- - **Read-view invariant:** during a transition, a query sees either pre-transition data (routed to the transitioning store) or post-transition data (routed to the new active store + the newly-flagged immutable artifact) — never a half-state mix. - - **Flag-is-truth applies to reads too:** a query never routes to an immutable artifact whose `:lfs` / `:events` / `:txhash` flag isn't set. +- **`active_stores` is the ingestion loop's owned state.** Fields `ledger` / `events` / `txhash` (one handle per data type, no `*_next`) are mutated only inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze tasks receive a handle by value at spawn and never read back through `active_stores`. +- **Meta-store is single-writer.** Writers are the ingestion loop (per-ledger checkpoint), freeze tasks (`:lfs` / `:events` / `:txhash` flags), and the lifecycle loop (`"deleting"` marker + prune key delete). The meta-store wrapper serializes them on top of RocksDB's single-writer semantics. +- **Per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit); the next starts only after the previous releases. `wait_for_lfs_complete()` acquires the LFS gate; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases it (events / RecSplit follow the same shape). Not a global barrier — kinds remain independent. +- **Query handlers read from a storage-manager layer.** Per-data-type managers (ledger / events / txhash) own their own state-transition synchronization; query handlers never touch `active_stores` directly. + - **Read-view invariant:** a query sees either pre-transition data (routed to the transitioning store) or post-transition data (new active store + newly-flagged immutable artifact) — never a half-state mix. + - **Flag-is-truth applies to reads:** a query never routes to an immutable artifact whose `:lfs` / `:events` / `:txhash` flag is unset. 
- Concrete lock primitives + routing logic belong in a separate query-routing design doc. -- **Stores are opened on-demand at boundary** — see [Store Lifecycle](#store-lifecycle) for the open + transition sequence and the synchronous-open cost analysis. +- **Stores are opened on-demand at boundary** — see [Store Lifecycle](#store-lifecycle) for the open + transition sequence and synchronous-open cost. ### Chunk Boundary (every 10_000 ledgers) @@ -666,8 +621,7 @@ Triggered when the ingestion loop commits `last_ledger_in_chunk(chunk_id)`. Hand ```python def on_chunk_boundary(chunk_id, active_stores, meta_store): - # LFS + events each: drain prior freeze, capture current handle, open chunk+1 sync (~100-200 ms), - # spawn background freeze. LFS and events run independently (events doesn't wait on LFS). + # LFS + events run independently — events doesn't wait on LFS. wait_for_lfs_complete() transitioning_ledger_store = active_stores.ledger active_stores.ledger = open_active_ledger_store(config, meta_store, chunk_id + 1) @@ -687,22 +641,20 @@ Converts the retired ledger RocksDB store to an immutable `.pack` file, then dis ```python def freeze_ledger_chunk_to_pack_file(chunk_id, transitioning_ledger_store, meta_store): - # overwrite=True discards any prior partial; flag-after-fsync. Crash between flag - # and store-delete leaves an orphan that Phase 3 (reconcile) picks up (flag-is-truth). + # overwrite=True discards any prior partial. Crash after the freeze flag but + # before delete_dir leaves an orphan; Phase 3 (reconcile) picks it up. 
pack_path = ledger_pack_path(chunk_id) writer = packfile.create(pack_path, overwrite=True) for ledger_seq in range(first_ledger_in_chunk(chunk_id), last_ledger_in_chunk(chunk_id) + 1): writer.append(transitioning_ledger_store.get(uint32_big_endian(ledger_seq))) writer.fsync_and_close() - meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") # freeze flag (artifact is durable) + meta_store.put(f"chunk:{chunk_id:08d}:lfs", "1") transitioning_ledger_store.close() - delete_dir(ledger_store_path(chunk_id)) # remove the active store dir - meta_store.delete(f"hot:chunk:{chunk_id:08d}:lfs") # clear hot key (file-before-flag-delete) + delete_dir_if_exists(ledger_store_path(chunk_id)) + meta_store.delete(f"hot:chunk:{chunk_id:08d}:lfs") # cleared AFTER dir removed signal_lfs_complete() ``` -`finish_interrupted_ledger_freeze(chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the active store via `open_active_ledger_store`, runs the same write + fsync + flag + close + `delete_dir_if_exists` + clear-hot-key sequence, no `signal_lfs_complete`. - ### Events Transition Converts the retired events RocksDB store to three immutable files (events cold segment). 
@@ -712,15 +664,13 @@ def freeze_events_chunk_to_cold_segment(chunk_id, transitioning_events_store, me events_path = events_segment_path(chunk_id) write_cold_segment(transitioning_events_store, events_path) # 3 files: events.pack, index.pack, index.hash fsync_all(events_path) - meta_store.put(f"chunk:{chunk_id:08d}:events", "1") # freeze flag + meta_store.put(f"chunk:{chunk_id:08d}:events", "1") transitioning_events_store.close() - delete_dir(events_store_path(chunk_id)) # remove the active store dir - meta_store.delete(f"hot:chunk:{chunk_id:08d}:events") # clear hot key + delete_dir_if_exists(events_store_path(chunk_id)) + meta_store.delete(f"hot:chunk:{chunk_id:08d}:events") signal_events_complete() ``` -`finish_interrupted_events_freeze(chunk_id, meta_store)` is the Phase 3 (reconcile) synchronous form: opens the active store via `open_active_events_store`, runs the same write + fsync + flag + close + `delete_dir_if_exists` + clear-hot-key sequence, no `signal_events_complete`. - ### Tx-Index Boundary (every `LEDGERS_PER_TX_INDEX` ledgers) The last chunk of a tx index has just rolled over. Before RecSplit can start, every chunk in the tx index must have its `:lfs` and `:events` flags set. @@ -742,16 +692,16 @@ Builds the 16 RecSplit `.idx` files for tx_index_id from the retired txhash acti ```python def build_tx_index_recsplit_files(tx_index_id, transitioning_txhash_store, meta_store): - # Same flag-after-fsync pattern as LFS / events freeze; verify before flag. + # Verify before flag; flag-after-fsync as in LFS / events. 
idx_path = recsplit_index_path(tx_index_id) delete_partial_idx_files(idx_path) build_recsplit(transitioning_txhash_store, idx_path) # 16 .idx files fsync_all_idx_files(idx_path) verify_spot_check(tx_index_id, idx_path, meta_store) - meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") # freeze flag + meta_store.put(f"index:{tx_index_id:08d}:txhash", "1") transitioning_txhash_store.close() - delete_dir(txhash_store_path(tx_index_id)) # remove the active store dir - meta_store.delete(f"hot:index:{tx_index_id:08d}:txhash") # clear hot key + delete_dir_if_exists(txhash_store_path(tx_index_id)) + meta_store.delete(f"hot:index:{tx_index_id:08d}:txhash") ``` --- @@ -762,8 +712,7 @@ Retention is enforced by a single background task, woken at chunk boundaries. Pr ```python def run_prune_lifecycle_loop(config, meta_store): - # Initial sweep catches `"deleting"` state left by a prior crashed prune; - # subsequent sweeps fire on chunk-boundary notifications. + # Initial sweep catches "deleting" state from a prior crashed prune; later sweeps fire on boundary. retention_ledgers = config.service.retention_ledgers _run_prune_sweep(meta_store, retention_ledgers, config) while True: @@ -793,19 +742,13 @@ def prunable_tx_index_ids(meta_store, retention_ledgers): def prune_tx_index(tx_index_id, meta_store, config): - # Two-phase marker: set "deleting" first, clear the key last. Idempotent on retry. + # Two-phase marker: "deleting" set first, key cleared last. Idempotent on retry. meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting") for chunk_id in chunks_for_tx_index(tx_index_id): - # Files first, flags last (same invariant as Phase 2 hydration: flag presence ⇒ - # cleanup not yet done). All deletes are idempotent (rm -f / delete-if-exists). + # Files-first, flags-last; all deletes idempotent. delete_if_exists(ledger_pack_path(chunk_id)) delete_events_segment(chunk_id) - # Defense-in-depth: also clean the transient txhash artifacts. 
In normal flow - # these are gone long before retention catches up — Phase 2 hydration deletes - # them on the trailing partial, and cleanup_txhash deletes them on completed - # indexes. Belt-and-suspenders here so prune is self-contained against any - # upstream cleanup gap. - delete_if_exists(raw_txhash_path(chunk_id)) + delete_if_exists(raw_txhash_path(chunk_id)) # defence-in-depth; normally already gone meta_store.delete(f"chunk:{chunk_id:08d}:lfs") meta_store.delete(f"chunk:{chunk_id:08d}:events") meta_store.delete(f"chunk:{chunk_id:08d}:txhash") @@ -813,13 +756,8 @@ def prune_tx_index(tx_index_id, meta_store, config): meta_store.delete(f"index:{tx_index_id:08d}:txhash") ``` -**Why index-atomic.** -- Per-chunk pruning would open a window where `getTransaction` resolves to a ledger seq whose pack has already been deleted. -- Whole-index gating closes that window. - -**How much extra data sits on disk.** -- At most `LEDGERS_PER_TX_INDEX - 1` ledgers past the strict retention line. -- `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_TX_INDEX`, so the line never bisects an index; the next-eligible index is exactly `LEDGERS_PER_TX_INDEX` further. +- **Why index-atomic.** Per-chunk pruning would open a window where `getTransaction` resolves to a ledger seq whose pack has been deleted; whole-index gating closes it. +- **Extra data on disk.** Up to `LEDGERS_PER_TX_INDEX - 1` ledgers past strict retention. `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_TX_INDEX`, so the next-eligible index is exactly `LEDGERS_PER_TX_INDEX` further. --- @@ -829,11 +767,9 @@ Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, ` ### Readiness Signal -- An in-memory boolean `service_ready` is set by `set_service_ready()` at the top of Phase 4 (live ingestion), after Phases 1–3 complete and active stores are opened. -- Not persisted. On every startup the flag starts `false`; on every Phase 4 (live ingestion) entry it flips to `true`. 
Clean shutdown discards it implicitly (process exits). -- The HTTP server binds its port at service startup (before Phase 1 (catchup)) so `getHealth` is always servable regardless of current phase. The QueryRouter routes `getHealth` unconditionally and gates `getLedger` / `getTransaction` / `getEvents` on `service_ready`. -- This means: clients see `HTTP 4xx` from `getLedger`/`getTransaction`/`getEvents` on every startup until Phase 4 (live ingestion) is reached, regardless of whether prior runs have served queries. Intentional: catchup and recovery phases must complete before the service serves, every time. -- Query handlers check the flag on each request. `false` → HTTP 4xx. `true` → route normally. +- In-memory boolean `service_ready`, flipped to `true` by `set_service_ready()` at Phase 4 (live ingestion) entry, after Phases 1–3 complete and active stores are opened. Not persisted; every startup begins `false`. +- HTTP server binds at service startup (before Phase 1 (catchup)), so `getHealth` is always servable. The QueryRouter routes `getHealth` unconditionally and gates `getLedger` / `getTransaction` / `getEvents` on `service_ready`: `false` → HTTP 4xx; `true` → route normally. +- Clients see `HTTP 4xx` from the three read endpoints on every startup until Phase 4 is reached, regardless of prior runs. Intentional — catchup and recovery phases must complete before the service serves, every time. ### Behavior During Phases 1–3 @@ -856,11 +792,11 @@ Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, ` ## Resilience -Crash recovery and error handling share one foundation: flag-after-fsync makes the meta store authoritative, and every startup decision derives from flag presence alone — never filesystem scanning. Streaming extends backfill's resilience model with per-ledger checkpoint discipline, max-1-transitioning freeze gates, and a two-phase prune marker. 
+Flag-after-fsync makes the meta store authoritative for every startup decision — never filesystem scanning. Streaming extends backfill's resilience model with per-ledger checkpoint discipline, per-kind single-flight freeze gates, and a two-phase prune marker. ### Crash Recovery -No separate recovery phase. Every startup runs Phases 1–4 regardless — already-complete work is detected and skipped via meta store flags. +No separate recovery phase. Every startup runs Phases 1–4 — already-complete work is detected and skipped via meta-store flags. #### Invariants @@ -870,8 +806,8 @@ In addition to the backfill subroutine's invariants in [01-backfill-workflow.md 2. **No separate recovery phase.** Startup is Phases 1–4. Nothing else. 3. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. 4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 (catchup) itself avoids producing them. -5. **Two-phase prune marker.** `prune_tx_index` writes `index:{tx_index_id}:txhash = "deleting"` before any file delete and clears the key after. Queries treat `"deleting"` as absent. Crash mid-prune resumes idempotently on restart because `"deleting"` is still picked up by `prunable_tx_index_ids`. -6. **Hot-key tracking.** Every active store directory has a corresponding `hot:*` key, set BEFORE `mkdir` and cleared AFTER `delete_dir`. Phase 3 (reconcile) iterates `hot:*` keys to find directories that need recovery — no filesystem scan anywhere in the design. +5. **Two-phase prune marker.** `index:{tx_index_id}:txhash = "deleting"` is set before any file delete; the key is cleared last. Queries treat `"deleting"` as absent. Idempotent on crash (`prunable_tx_index_ids` picks `"deleting"` back up). 
See [Pruning](#pruning). +6. **Hot-key tracking.** Every active store directory has a `hot:*` key, set BEFORE `mkdir` and cleared AFTER dir removal. Phase 3 (reconcile) iterates `hot:*` keys to find directories needing recovery — no filesystem scan anywhere. #### Compound Recovery Scenarios @@ -891,43 +827,9 @@ Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workf - Cost: ~10_000 ledgers of redundant ingestion per affected chunk. Correctness preserved. - **Crash mid-RecSplit.** - - State: `index:{tx_index_id}:txhash` absent; `hot:index:{tx_index_id}:txhash` set; all `:lfs` chunks of the tx index present; partial `.idx` files possibly on disk. - - Phase 3 (reconcile) iterates `hot:*` keys. Hits SCENARIO C (no freeze flag, `chunk_or_tx_index_id < resume_chunk_or_tx_index_id`): `finish_interrupted_freeze("index:txhash", ...)` runs `build_tx_index_recsplit_files` synchronously. The build's preamble deletes any partial `.idx` files, rebuilds, sets the flag, deletes the dir, clears the hot key. + - State: `index:{tx_index_id}:txhash` absent; `hot:index:{tx_index_id}:txhash` set; all `:lfs` chunks present; partial `.idx` files possibly on disk. + - Phase 3 (reconcile) hits SCENARIO C → `finish_interrupted_freeze` re-runs the build (its preamble blanket-deletes partial `.idx` files). - **Crash mid hot-store creation.** - - State: `hot:chunk:{chunk_id}:lfs` (or events / txhash) set, but `mkdir` / RocksDB open didn't complete. Dir might be absent or partially set up. Freeze flag absent. - - Phase 3 (reconcile): if `chunk_id == resume`, SCENARIO A — keep; Phase 4 reopens via `open_active_*_store` which is idempotent (mkdir is no-op on existing dir, RocksDB recovers from any partial WAL state). If `chunk_id < resume`, SCENARIO C — `finish_interrupted_freeze` reopens and re-runs the freeze (handles empty/partial RocksDB the same way). No special-case handling needed. 
- -- **Crash between hot-store dir-delete and `meta_store.delete(hot:*)`.** - - State: freeze flag set, dir already gone, hot key still set. - - Phase 3 (reconcile) hits SCENARIO B. `delete_dir_if_exists` no-ops on the missing dir; clears the hot key. Consistent with the file-before-flag-delete invariant: the hot key is the recovery signal, never an orphan dir without a key. - -- **Crash mid-prune.** - - State: some files deleted, some chunk keys cleared, `index:{tx_index_id}:txhash = "deleting"` still present. - - `prunable_tx_index_ids` picks up `"deleting"` alongside `"1"` → `prune_tx_index(tx_index_id)` re-runs, idempotent (file deletes `rm -f`, key deletes `delete_if_exists`). - -### Concurrent Access Prevention - -The service acquires a directory flock on the meta-store at startup. A second service process against the same datadir fails immediately. Same mechanism as backfill — see [01-backfill-workflow.md — Concurrent Access Prevention](./01-backfill-workflow.md#concurrent-access-prevention). - -### Error Handling - -Three distinct policies — runtime ABORT, transition retry-via-flag-absence, startup FATAL. - -#### Runtime — Phase 4 (live ingestion) - -- **CaptiveStellarCore unavailable.** RETRY with backoff; ABORT after `CAPTIVE_CORE_RETRY_MAX` attempts (implementation-defined). -- **Per-ledger store write failure (ledger / txhash / events).** ABORT — disk full or storage corruption. -- **Meta-store write failure.** ABORT — cannot maintain checkpoint. - -#### Freeze transitions (LFS / events / RecSplit) - -All three follow the flag-after-fsync invariant: on failure, don't set the completion flag; abort the transition; restart retries the whole transition from scratch (partial `.idx` files get cleaned by the build's own preamble). - -- **RecSplit verification mismatch.** ABORT; do NOT delete the transitioning txhash store; operator investigates. 
- -#### Startup (FATAL — datadir / config issues) - -- `CHUNKS_PER_TX_INDEX` or `RETENTION_LEDGERS` changed: wipe datadir to change. -- `RETENTION_LEDGERS` not a multiple of `LEDGERS_PER_TX_INDEX`: fix config. -- Head not index-aligned / gap in chunk flags: datadir corruption; wipe. + - State: `hot:chunk:{chunk_id}:lfs` (or events / txhash) set, but `mkdir` / RocksDB open didn't complete. Dir absent or partially set up; freeze flag absent. + - Phase 3 (reconcile): if `chunk_id == resume`, SCENARIO A — keep; Phase 4 reopens via `open_active_*_store` (idempotent — mkdir no-ops on existing dir, RocksDB recovers from partial WAL). If `chunk_id < resume`, SCENARIO C — `finish_interrupted_freeze` reopens and re-runs the freeze. From d96ace62c4c1d30ac5272f4233c6dedbbb01cfdb Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Sun, 26 Apr 2026 15:49:28 -0700 Subject: [PATCH 32/34] Streaming doc: long-downtime correctness + pruning marker family MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the long-downtime restart bug. Adds two invariants on no permanent partial state. Replaces the "deleting" value with a separate pruning:index:* key family (binary schema everywhere). Repositions service_ready before captive-core spinup. - Phase 1 advances streaming:last_committed_ledger via advance_progress_marker at end of catchup. Fixes the bug where long-downtime + retention restart would re-ingest weeks of past-retention ledgers via captive core. - Phase 3 split into two passes: Pass 1 discards past-retention orphans (active-store dirs, freeze flags, pruning markers below the floor); Pass 2 reconciles in-flight transitions for entries above the floor. Pass 1 closes the "incomplete past-retention tx index" gap that the lifecycle prune loop can't reach. - compute_resume_ledger simplified to 3-case dispatch with no validation; no-BSB long-downtime stale-marker check leapfrogs forward when the marker is below the current retention floor. 
- Pruning marker moved to a separate pruning:index:{N} key family. Every meta-store key is now strictly binary ("1" or absent); no special values. Schema uniformity, simpler reader code. - service_ready flips after compute_resume_ledger and before phase4_live_ingest. Historical queries unblock immediately after Phase 3 instead of waiting for the 4-5 minute captive-core spinup. - Two new invariants: no permanently-partial tx index (every index reaches complete or discarded); no permanent orphans (every flag has an artifact; every active-store dir has a hot key). - Vocabulary cleanup: "watermark" → "progress marker"; "leapfrog" → "retention-aligned start" / "skip past stale"; "false-positive fatal" rewritten as plain English. --- .../design-docs/02-streaming-workflow.md | 629 +++++++++++++----- 1 file changed, 476 insertions(+), 153 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 327f5687a..7d40ec6b3 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -33,7 +33,7 @@ Vocabulary used throughout this doc. Skim on first read; refer back as terms com - **Backfill** — the process of pulling historical ledgers from a remote object store and writing them to disk as immutable artifacts. Backfill is internal to the service — operators never invoke it directly. Specified in [01-backfill-workflow.md](./01-backfill-workflow.md). -- **Leapfrog** (colloquial) — how the service picks a starting ledger when retention is configured: the start always lands on a tx-index boundary, never mid-index, so the first tx index ingested is complete. Without this rounding, the chunks before the start would fall below the retention floor and never be ingested, leaving the tx index broken and the ingest-work on its later chunks wasted. 
Used in two places: Phase 1 (catchup) when there's a remote object store to read from, and at Phase 4 (live ingestion) entry on a no-object-store fresh start. +- **Retention-aligned start** — how the service picks the starting chunk when retention is configured: the start always lands on a tx-index boundary, never mid-index, so the first tx index ingested is complete. Without this rounding, the chunks before the start would fall below the retention floor and never be ingested, leaving the tx index broken and the ingest-work on its later chunks wasted. Used in two places: Phase 1 (catchup) range-start computation when BSB is configured, and `compute_resume_ledger`'s no-BSB path (fresh start or stale-marker recovery). - **Network tip** — the most recent ledger the Stellar network has produced. The service learns this from a public Stellar history archive over HTTP, not from its own state. @@ -127,13 +127,13 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 | Key | Type | Default | Description | |---|---|---|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample network tip for Phase 4 (live ingestion)'s leapfrog-from-tip computation (when `[BSB]` is absent on first-ever start).| +| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample network tip for the no-BSB resume-cursor calculation (when `[BSB]` is absent and the service needs a tip reference for retention floor / fresh-start alignment). | **[BSB]** (optional) - Same schema as in the backfill doc. Presence in the config file determines Phase 1 (catchup) behavior: - Present: Phase 1 (catchup) invokes backfill over the BSB (fast, parallel per-chunk catchup). 
- - Absent: Phase 1 (catchup) is a no-op; Phase 4 (live ingestion)'s captive core archive-catches-up from a leapfrog'd `resume_ledger` (slower, but no object-store dep). + - Absent: Phase 1 (catchup) is a no-op; Phase 4 (live ingestion)'s captive core archive-catches-up from a `resume_ledger` aligned to the retention-aligned tx-index boundary (slower, but no object-store dependency). - See [Ledger Source](#ledger-source) for the BSB-source details and [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup) for the full split. ### CLI Flags @@ -186,8 +186,8 @@ Three profiles emerge from config combinations. No profile flag. | Profile | `RETENTION_LEDGERS` | `[BSB]` | Phase 1 behavior | Use case | |---|---|---|---|---| | Archive | `0` | present | Backfill over full history (chunks `[0, current_chunk − 1]`) | Public archive node; full history. | -| Pruning-history | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | present | Backfill over retention window (leapfrog-aligned start) | Windowed history with bulk initial catchup. | -| Tip-tracker | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 (live ingestion)'s captive core archive-catches-up from a leapfrog'd `resume_ledger` | App developer; short retention; no object-store dep. | +| Pruning-history | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | present | Backfill over retention window (start aligned to first chunk of the tx index containing the retention floor) | Windowed history with bulk initial catchup. | +| Tip-tracker | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 (live ingestion)'s captive core archive-catches-up from a `resume_ledger` aligned to the retention-aligned tx-index boundary | App developer; short retention; no object-store dep. | | (invalid) | `0` | absent | — | Rejected by `validate_config`: full history requires BSB. | --- @@ -202,11 +202,12 @@ Single RocksDB instance, WAL (Write-Ahead Log) always enabled. 
Authoritative sou | Key | Value | Written when | |---|---|---| -| `streaming:last_committed_ledger` | uint32 (big-endian) | Written only by the live ingestion loop after all three active stores durably commit a ledger. **Never written at bootstrap.** When absent, [`compute_resume_ledger`](#compute-resume-ledger) derives resume from the contiguous `:lfs` prefix (first-ever post-Phase-1) or by leapfrogging down from the current network tip to an index boundary (tip-tracker fresh start). Phase 1 (catchup) progress is tracked by `chunk:{chunk_id}:lfs` flags alone. | +| `streaming:last_committed_ledger` | uint32 (big-endian) | Monotonic progress marker — highest ledger that has been persisted. Two writers (both via `advance_progress_marker`, never regressing): (1) Phase 1 (catchup) at end of catchup, advancing to `last_ledger_in_chunk(highest completed chunk)`; (2) live ingestion loop, per ledger, after all three active stores durably commit. When absent, [`compute_resume_ledger`](#compute-resume-ledger) falls back to the highest `:lfs` chunk (Phase 1 crashed before the post-catchup write) or to `retention_aligned_resume_ledger` (tip-tracker fresh start, no BSB, never ingested). | | `config:retention_ledgers` | decimal string | First run (stored); enforced on subsequent starts. | | `hot:chunk:{chunk_id:08d}:lfs` | `"1"` | Written **before** the active ledger store directory is created; deleted **after** that directory is removed by the freeze task. Presence indicates the directory exists or its lifecycle is incomplete (creation in flight, or freeze cleanup not yet finished). | | `hot:chunk:{chunk_id:08d}:events` | `"1"` | Same pattern as `hot:chunk:lfs`, scoped to the active events store directory. | | `hot:index:{tx_index_id:08d}:txhash` | `"1"` | Same pattern, scoped to the active txhash store directory. Per-index cadence (one per tx index, not per chunk). 
| +| `pruning:index:{tx_index_id:08d}` | `"1"` | Set by `prune_tx_index` BEFORE any file delete; cleared AFTER all of tx_index_id's artifacts and the `index:{tx_index_id}:txhash` key are gone. Presence means a prune is in progress (or was interrupted by a crash and needs to be re-attempted). QueryRouter treats presence as "this tx index is unservable" and returns 4xx. Lifecycle loop's `prunable_tx_index_ids` includes any index with this key set, so a crashed prune resumes idempotently on restart. | ### Keys Shared with Backfill @@ -218,7 +219,7 @@ Defined in [01-backfill-workflow.md — Meta Store Keys](./01-backfill-workflow. - `chunk:{chunk_id:08d}:txhash` - `index:{tx_index_id:08d}:txhash` -Streaming-specific use of these keys (which paths write them when, and the `"deleting"` marker on `index:txhash`) is shown in [Key Lifecycle in Streaming](#key-lifecycle-in-streaming) below. +Streaming-specific use of these keys (which paths write them when) is shown in [Key Lifecycle in Streaming](#key-lifecycle-in-streaming) below. All values are binary (`"1"` or absent); prune-in-progress is tracked via the separate `pruning:index:*` key family rather than overloading the value space. 
### Key Lifecycle in Streaming @@ -252,10 +253,11 @@ Live path (per chunk, background freeze) — flag AFTER fsync, hot key AFTER dir Live path (per index, background freeze) — same pattern: index:{tx_index_id}:txhash = "1" → delete txhash store dir → clear hot:index:{tx_index_id}:txhash -Pruning (background, when tx_index_id is past retention) — two-phase marker: - index:{tx_index_id}:txhash = "deleting" (queries return 4xx from here on) +Pruning (background, when tx_index_id is past retention) — separate pruning marker: + pruning:index:{tx_index_id} = "1" (queries return 4xx from here on) delete all files + per-chunk :lfs + :events keys for tx_index_id index:{tx_index_id}:txhash → deleted + pruning:index:{tx_index_id} → deleted (cleared LAST; survives crashes for recovery) ``` ### Flag Semantics @@ -264,7 +266,7 @@ Pruning (background, when tx_index_id is past retention) — two-phase marker: - **File-before-flag-delete (cleanup order).** The file is removed FIRST; the flag is cleared LAST. Flag present ⇒ cleanup may be incomplete; flag absent ⇒ cleanup done, no file exists. Reverse order would orphan a file with no meta-store record on a crash mid-pair, recoverable only by filesystem scan. - **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility, active-store directory reconciliation — derives from meta-store key presence. No filesystem-scan-and-infer anywhere. -Together: the meta-store flag is the always-correct signal of artifact state on disk, both for immutable files and for active-store directories. A crash anywhere leaves a state the next start recovers from by flag presence alone. +_**The meta-store flag is the always-correct signal of artifact state on disk, both for immutable files and for active-store directories. A crash anywhere leaves a state the next start recovers from by flag presence alone.**_ --- @@ -281,7 +283,7 @@ Three RocksDB-backed active stores; WAL always enabled. 
Each directory has a `ho - **Boundary swap.** At every chunk boundary the next chunk's ledger + events stores open synchronously while the just-finished ones are handed to background freeze tasks; tx-index boundaries do the same for txhash. Each kind therefore holds at most one active + one transitioning at a time. Ingestion never blocks on the freeze. - **Synchronous open cost.** ~100 ms maximum — small enough to ignore. - **Deletion.** The freeze task deletes the active store's directory only after the immutable artifact is fsynced and its freeze flag is set. -- **Crash recovery.** Active-store directories surviving a crash are reconciled on the next start — see [Phase 3 — Reconcile Orphaned Transitions](#phase-3--reconcile-orphaned-transitions). +- **Crash recovery.** Active-store directories surviving a crash are reconciled on the next start — see [Phase 3 — Reconcile](#phase-3--reconcile). --- @@ -301,7 +303,7 @@ Four sequential phases, same code path for first start and every restart. The fi - **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip **when `[BSB]` is configured**, by invoking the backfill subroutine in a loop. Without `[BSB]`, Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles initial catchup naturally via its own `PrepareRange(UnboundedRange(resume_ledger))`. - **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 (catchup) left (for the trailing partial index) into the active txhash store, then deletes them. - **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. -- **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle task, flips the `service_ready` flag, enters the ingestion loop. Runs until process exit. +- **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle task, enters the ingestion loop. Runs until process exit. 
Note: `service_ready` is flipped by `run_rpc_service` BEFORE Phase 4 entry — historical queries are served during captive core's 4-5 minute spinup. ### Backfill vs Phase 1 (catchup) @@ -319,11 +321,25 @@ def main(): def run_rpc_service(config): meta_store = open_meta_store(config) validate_config(config, meta_store) - start_http_server(config) - phase1_catchup(config, meta_store) - phase2_hydrate_txhash(config, meta_store) - phase3_reconcile_orphans(config, meta_store) + start_http_server(config) # /getHealth servable; getLedger/Tx/Events 4xx until set_service_ready + + # Phases 1-3 + compute_resume_ledger: bring on-disk state into consistency. + # No query traffic during this window. + last_phase1_chunk_id = phase1_catchup(config, meta_store) + phase2_hydrate_txhash(config, meta_store, last_phase1_chunk_id) + phase3_reconcile(config, meta_store) resume_ledger = compute_resume_ledger(config, meta_store) + + # On-disk state is now consistent. Queries against frozen artifacts can be + # served immediately — they don't depend on captive core having started. + # Flipping service_ready here (rather than after captive-core spinup) cuts + # the 4xx window by the captive-core startup time (~4-5 min). + set_service_ready() + + # Phase 4 opens active stores and starts captive core. Live ingestion begins + # asynchronously. Queries for ledgers > streaming:last_committed_ledger return + # "not yet available" until ingestion catches up; that's the same client-visible + # behavior whether captive core has spun up or not. phase4_live_ingest(config, meta_store, resume_ledger) ``` @@ -331,15 +347,41 @@ Query serving is gated on Phase 4 (live ingestion) being reached — see [Query ### Phase 1 — Catchup -- **No-op path:** if `config.bsb is None` (no `[BSB]` configured), Phase 1 (catchup) returns immediately. Phase 4 (live ingestion)'s captive core will catch up from a leapfrog'd resume ledger. 
-- **BSB path:** runs the backfill subroutine (`run_backfill` from [01-backfill-workflow.md](./01-backfill-workflow.md)) once per BSB-tip sample, until BSB has no new complete chunks beyond the last scheduled range. -- Unit of work = one whole chunk, never partial. DAG dispatches chunk IDs; `process_chunk(chunk_id, config)` ingests `first_ledger_in_chunk..last_ledger_in_chunk` inclusive. Every chunk Phase 1 (catchup) persists starts at `..._02`, ends at `..._01` — the chunk-alignment invariant the no-gaps guarantee rests on. -- Phase 1 reads from BSB, so the relevant horizon is BSB's latest chunk-aligned position — not the network tip. The gap between BSB's tip and the actual network tip (typically minutes of upload lag) is closed by Phase 4 (live ingestion)'s captive core. +- **No-op path:** if `config.bsb is None` (tip-tracker profile), Phase 1 returns `None` immediately. Phase 4's captive core does archive-catchup from `retention_aligned_resume_ledger`. +- **BSB path:** runs `run_backfill` from [01-backfill-workflow.md](./01-backfill-workflow.md) in a loop. Each iteration samples BSB's latest complete chunk and backfills `[retention_aligned_start_chunk, end_chunk]` inclusive. Loop exits when BSB has no new complete chunks. Phase 1 reads from BSB, so its horizon is BSB's chunk-aligned tip; the residual gap to network tip is closed by Phase 4's captive core. +- **Side effects on the meta store:** + 1. Backfill writes `:lfs` / `:events` / `chunk:*:txhash` / `index:*:txhash` flags as it materializes artifacts (per backfill design). + 2. After the loop, **Phase 1 advances `streaming:last_committed_ledger`** to `last_ledger_in_chunk(highest completed chunk)`. This is the durable record that Phase 1 catchup actually progressed past the prior value — used by Phase 3 to compute the retention floor and by `compute_resume_ledger` to derive the resume cursor. +- **Return value:** the highest chunk_id Phase 1 completed, or `None` for no-op. 
Phase 2 consumes this to find the trailing partial tx index without re-scanning meta-store. ```python -def phase1_catchup(config, meta_store): +def phase1_catchup(config, meta_store) -> Optional[int]: + """ + Catch up history via BSB (no-op when [BSB] absent). + + Scenarios handled: + - First-ever start (archive / pruning-history) — backfill from + retention_aligned_start_chunk to BSB tip. + - First-ever start (tip-tracker, no BSB) — early return None. + - Quick restart, BSB unchanged — loop runs once, run_backfill is a + full no-op (every chunk's flags already set), loop exits on iter 2. + - Mid-life restart, BSB advanced — backfill the new chunks; idempotent + skip on already-frozen chunks. + - Long-downtime restart with retention — backfill range starts at the new + retention-aligned position (computed from current BSB tip), skipping + past chunks now below the retention floor. Those stale chunks are left + for Phase 3 to clean up. + - Crash during Phase 1 — backfill is per-chunk-idempotent. On restart, + completed chunks skip; in-flight ones re-run. last_scheduled_end_chunk + is local-only and resets on every start; Phase 1 always re-runs from + scratch (cheap because of idempotent skips). + + Returns: highest chunk_id Phase 1 completed (Phase 2's input), + or None for no-op. + """ if config.bsb is None: - return # [BSB] absent → no-op + # Tip-tracker profile. Nothing to backfill. 
+ return None retention_ledgers = config.service.retention_ledgers last_scheduled_end_chunk = -1 @@ -347,14 +389,26 @@ def phase1_catchup(config, meta_store): while True: end_chunk = bsb_latest_complete_chunk_id(config.bsb) if end_chunk <= last_scheduled_end_chunk: - return # BSB has no new complete chunks + break # BSB has no new complete chunks since last iter start_chunk = retention_aligned_start_chunk(last_ledger_in_chunk(end_chunk), retention_ledgers) if end_chunk < start_chunk: - return # leapfrog past tip + break # retention-aligned start landed past BSB's tip; Phase 4 picks up log.info(f"phase1_catchup bsb_tip_chunk={end_chunk} range=[{start_chunk}, {end_chunk}]") run_backfill(config, start_chunk, end_chunk) last_scheduled_end_chunk = end_chunk + if last_scheduled_end_chunk < 0: + return None # no chunks completed (e.g., BSB hasn't published any yet) + + # Bump the streaming:last_committed_ledger key to reflect Phase 1's catchup. + # This pushes the key past any stale value left by a prior run that's now + # below the retention floor. Without this advance, Phase 3 would compute the + # retention floor from the stale prior value, and compute_resume_ledger would + # tell captive core to resume at a stale ledger — re-ingesting chunks pruning + # is about to delete. + advance_progress_marker(meta_store, last_ledger_in_chunk(last_scheduled_end_chunk)) + return last_scheduled_end_chunk + def retention_aligned_start_chunk(tip_ledger, retention_ledgers): # Aligns DOWN to a tx-index boundary (no-gaps); up to LEDGERS_PER_TX_INDEX - 1 ledgers below strict retention. 
@@ -362,6 +416,26 @@ def retention_aligned_start_chunk(tip_ledger, retention_ledgers): return 0 target_ledger = max(tip_ledger - retention_ledgers, GENESIS_LEDGER) return first_chunk_id_of_tx_index_containing(target_ledger) + + +def advance_progress_marker(meta_store, candidate_ledger): + """ + Move streaming:last_committed_ledger forward to candidate_ledger, but only + if that's an advance — never regress. + + Two callers: + - phase1_catchup, once at end of catchup, with last_ledger_in_chunk(highest + completed chunk). + - run_live_ingestion_loop, once per ledger, after all three active stores + durably commit that ledger. + + Monotonicity matters: a regression would cause compute_resume_ledger to + point captive core at already-durable ledgers (re-ingest waste), and Phase 3 + to compute a stale retention floor. + """ + prior = meta_store.get("streaming:last_committed_ledger") + if prior is None or candidate_ledger > prior: + meta_store.put("streaming:last_committed_ledger", candidate_ledger) ``` **Worker concurrency:** `run_backfill` caps DAG concurrency at `MAX_CPU_THREADS` — see [01-backfill-workflow.md — process_chunk](./01-backfill-workflow.md#process_chunk). Catchup time ≈ `retention_window / (BSB throughput)`. @@ -373,7 +447,7 @@ def retention_aligned_start_chunk(tip_ledger, retention_ledgers): - After Phase 2 (`.bin` hydration): no `.bin` files and no `:txhash` chunk flags remain. ```python -def phase2_hydrate_txhash(config, meta_store): +def phase2_hydrate_txhash(config, meta_store, last_phase1_chunk_id): # Both sweeps: file-before-flag-delete (see Flag Semantics). # Sweep 1: clean leftover .bin from completed indexes (cleanup_txhash crashed mid-pair). @@ -384,13 +458,17 @@ def phase2_hydrate_txhash(config, meta_store): meta_store.delete(f"chunk:{chunk_id:08d}:txhash") # Sweep 2: hydrate the trailing incomplete tx index into the active txhash store. 
- incomplete_tx_index_id = current_incomplete_tx_index_id(meta_store) - if incomplete_tx_index_id is None: - return - - txhash_store = open_active_txhash_store(config, meta_store, incomplete_tx_index_id) + # Phase 1 returns the highest chunk it completed; the trailing tx index is + # the one containing that chunk — no separate scan needed. + if last_phase1_chunk_id is None: + return # no Phase 1 work → no .bin files + tx_index_id = tx_index_id_of_chunk(last_phase1_chunk_id) + if meta_store.has(f"index:{tx_index_id:08d}:txhash"): + return # last touched index already complete (RecSplit done) + + txhash_store = open_active_txhash_store(config, meta_store, tx_index_id) try: - for chunk_id in chunks_for_tx_index(incomplete_tx_index_id): + for chunk_id in chunks_for_tx_index(tx_index_id): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): continue bin_path = raw_txhash_path(chunk_id) @@ -399,158 +477,353 @@ def phase2_hydrate_txhash(config, meta_store): delete_if_exists(bin_path) meta_store.delete(f"chunk:{chunk_id:08d}:txhash") finally: - txhash_store.close() # Phase 4 re-opens by path; flock would collide otherwise. + txhash_store.close() ``` - **Why "load then delete".** Without it, every restart during the incomplete-index lifetime would re-load the same `.bin` files. Load-then-delete makes Phase 2 a no-op on every subsequent restart until Phase 1 (catchup) deposits new `.bin` files. - **Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files; the live path writes txhash directly to the active store. Phase 2 is a no-op. -### Phase 3 — Reconcile Orphaned Transitions +### Phase 3 — Reconcile + +Two passes, both strictly meta-store-driven (no filesystem scan): -Completes any in-flight transitions left by a prior crash. All decisions derive from meta store state + on-disk store directories. 
+- **Pass 1 — discard past-retention orphans.** After a long downtime, some active-store dirs and immutable artifacts from prior runs may now be below the new retention floor (because Phase 1's retention-aligned start chunk has moved forward to track the network tip). They have no meaningful next transition — can't be frozen if their hot DB is partial, shouldn't be kept because they're past retention. Discarded outright. +- **Pass 2 — recover in-flight transitions.** For active stores ABOVE the retention floor whose freeze was interrupted by the prior crash, complete the freeze (or clean up if the freeze flag was already set but cleanup didn't finish). + +Order matters: Pass 1 first so Pass 2's resume-relative classification (A/B/C/D) only sees in-range entries. ```python -def phase3_reconcile_orphans(config, meta_store): - last_committed_ledger = meta_store.get("streaming:last_committed_ledger") - if last_committed_ledger is None: - return # fresh start — nothing in flight +def phase3_reconcile(config, meta_store): + """ + Reconciles state left by the prior process exit. + + Scenarios handled: + - Fresh first-ever start — both passes early-return (streaming:last_committed_ledger + absent, no hot:* keys yet). + - Quick restart, no crash mid-freeze — Pass 1 finds nothing past retention; + Pass 2 keeps the resume position's hot:* keys via SCENARIO A. + - Long-downtime restart with retention (BSB present) — Phase 1's catchup + already advanced streaming:last_committed_ledger past stale chunks; Pass 1 + discards prior-run hot:* keys + flags now below the floor; Pass 2 sees + nothing left to do. + - Long-downtime restart with retention (no BSB) — Phase 1 was a no-op; + the streaming:last_committed_ledger key is stale; Pass 1 samples network + tip from history archive and uses that as the floor reference. 
+ - Crash mid-LFS / mid-events / mid-RecSplit freeze — Pass 1 unaffected + (the in-flight chunk is at the resume position, well above retention); + Pass 2 SCENARIO C re-runs the freeze. + - Crash between freeze flag set and active-dir delete — Pass 2 SCENARIO B + cleans up the orphan dir. + - Future-orphan (defensive) — Pass 2 SCENARIO D logs + cleans up. + """ + pass1_discard_past_retention_orphans(config, meta_store) + pass2_recover_in_flight_transitions(config, meta_store) + + +def pass1_discard_past_retention_orphans(config, meta_store): + """ + Find every artifact (hot DB dir, immutable file, freeze flag) below the + retention floor and discard it. + + Why this is needed: when a long downtime advances the network tip past + where the prior run left off, the retention floor moves forward. Active-store + dirs and freeze flags from the prior run can end up below the new floor. + The pruning lifecycle handles COMPLETE tx indexes (`prunable_tx_index_ids` + requires `index:N:txhash` set), but an INCOMPLETE tx index from a prior run + (its `index:N:txhash` was never written, since RecSplit never ran) is + invisible to that path. Pass 1 catches those plus a few related cases. + + Determining the floor: + - retention_ledgers == 0 (archive) — no floor, nothing past retention. + - BSB present — Phase 1 just advanced streaming:last_committed_ledger to + BSB-tip's last ledger; that's the authoritative tip reference. + - No BSB — streaming:last_committed_ledger reflects only the prior run + (potentially weeks stale); sample current tip from history archive. + If unreachable, skip Pass 1 (the prune lifecycle will catch up later + as boundaries fire). 
+ """ + retention_ledgers = config.service.retention_ledgers + if retention_ledgers == 0: + return # archive profile — no floor + + current_tip = estimate_current_tip(config, meta_store) + if current_tip is None: + log.warn("phase3 pass1: no tip reference available; skipping past-retention cleanup") + return + + floor_ledger = max(current_tip - retention_ledgers, GENESIS_LEDGER) + floor_chunk = first_chunk_id_of_tx_index_containing(floor_ledger) # aligned DOWN to the tx-index boundary, matching retention_aligned_start_chunk, so chunks Phase 1 just backfilled are kept - resume_chunk_id = chunk_id_of_ledger(last_committed_ledger + 1) + # Discard hot:chunk:* below floor (active LFS / events store dirs). + for hot_key in meta_store.scan_prefix("hot:chunk:"): + store_kind, chunk_id = parse_hot_key(hot_key) + if chunk_id < floor_chunk: + delete_dir_if_exists(active_store_path_for(store_kind, chunk_id)) + meta_store.delete(hot_key) + log.info(f"phase3 pass1: discarded past-retention {hot_key}") + + # Discard hot:index:* below floor (active txhash store dirs). + for hot_key in meta_store.scan_prefix("hot:index:"): + _, tx_index_id = parse_hot_key(hot_key) + if last_ledger_in_tx_index(tx_index_id) < floor_ledger: + delete_dir_if_exists(active_store_path_for("index:txhash", tx_index_id)) + meta_store.delete(hot_key) + log.info(f"phase3 pass1: discarded past-retention {hot_key}") + + # Discard chunk:*:lfs / :events / :txhash freeze flags + their files for + # chunks below floor. Covers chunks of an INCOMPLETE prior-run tx index + # that pruning lifecycle can't reach (because index:N:txhash was never + # written). Per-chunk delete is idempotent (delete_if_exists). + for chunk_key in meta_store.scan_prefix("chunk:"): + chunk_id, kind = parse_chunk_key(chunk_key) + if chunk_id < floor_chunk: + delete_immutable_artifact(kind, chunk_id) # .pack / events cold segment / .bin + meta_store.delete(chunk_key) + + # Discard index:*:txhash freeze flags + RecSplit files for past-retention + # complete indexes. 
(Covered by prune lifecycle in steady state, but at + # startup we run this for completeness so the first prune sweep has no + # backlog.) + for index_key in meta_store.scan_prefix("index:"): + _, tx_index_id = parse_index_key(index_key) + if last_ledger_in_tx_index(tx_index_id) < floor_ledger: + delete_recsplit_idx_files(tx_index_id) + meta_store.delete(index_key) + + # Discard pruning:index:* markers for past-retention indexes that the + # prior run was already mid-prune on. The above loops have already + # taken care of files + index keys; this loop just clears the marker. + # (Markers above the floor are left alone — the lifecycle loop's initial + # sweep will pick them up and finish the prune.) + for pruning_key in meta_store.scan_prefix("pruning:index:"): + tx_index_id = parse_pruning_index_id(pruning_key) + if last_ledger_in_tx_index(tx_index_id) < floor_ledger: + meta_store.delete(pruning_key) + + +def pass2_recover_in_flight_transitions(config, meta_store): + """ + For every hot:* key still set after Pass 1, classify against the resume + position and dispatch the right recovery action. + + Why "still set after Pass 1": Pass 1 already removed past-retention hot + keys, so every entry seen here is at or above the retention floor — i.e., + legitimately a candidate for either Phase 4 reopen (SCENARIO A), freeze + cleanup (B), freeze re-run (C), or defensive cleanup (D). + """ + last_committed = meta_store.get("streaming:last_committed_ledger") + if last_committed is None: + return # no prior live commits → no in-flight work + + resume_chunk_id = chunk_id_of_ledger(last_committed + 1) resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) - # Iterate hot:* keys; each branch acts on the parsed (store_kind, id). 
for hot_key in meta_store.scan_prefix("hot:"): - store_kind, chunk_or_tx_index_id = parse_hot_key(hot_key) - resume_chunk_or_tx_index_id = ( - resume_chunk_id if store_kind.startswith("chunk:") else resume_tx_index_id - ) - store_path = active_store_path_for(store_kind, chunk_or_tx_index_id) - freeze_flag_key = freeze_flag_key_for(store_kind, chunk_or_tx_index_id) + store_kind, scope_id = parse_hot_key(hot_key) + resume_id = resume_chunk_id if store_kind.startswith("chunk:") else resume_tx_index_id + store_path = active_store_path_for(store_kind, scope_id) + freeze_flag_key = freeze_flag_key_for(store_kind, scope_id) - if chunk_or_tx_index_id == resume_chunk_or_tx_index_id: + if scope_id == resume_id: continue # A: resume target — Phase 4 reopens. - elif meta_store.has(freeze_flag_key): # B: flag-is-truth. Frozen, but cleanup didn't finish. delete_dir_if_exists(store_path) meta_store.delete(hot_key) - - elif chunk_or_tx_index_id < resume_chunk_or_tx_index_id: - # C: freeze interrupted; restart it to completion. - finish_interrupted_freeze(store_kind, chunk_or_tx_index_id, meta_store) - + elif scope_id < resume_id: + # C: freeze interrupted; restart to completion. + finish_interrupted_freeze(store_kind, scope_id, meta_store) else: - # D: future-orphan — shouldn't occur. Log + defensive cleanup. - log.warn(f"phase3: future-orphan {store_kind}/{chunk_or_tx_index_id:08d} > resume {resume_chunk_or_tx_index_id:08d}") + # D: future-orphan — shouldn't occur in normal flow. Log + cleanup. + log.warn(f"phase3 pass2: future-orphan {store_kind}/{scope_id:08d} > resume {resume_id:08d}") delete_dir_if_exists(store_path) meta_store.delete(hot_key) -``` -`finish_interrupted_freeze` reopens the active store (idempotent on existing or partial dirs) and runs the corresponding live-path freeze ([LFS](#lfs-transition), [events](#events-transition), or [RecSplit](#recsplit-transition)) to produce the artifact. 
-### Compute Resume Ledger - -- `compute_resume_ledger` is a shared helper called once per service start, AFTER Phase 3 (reconcile) and BEFORE Phase 4 (live ingestion). Scans meta-store state end-to-end, validates on-disk consistency, and returns `resume_ledger` — the ledger sequence captive core is told to start emitting at via `PrepareRange(UnboundedRange(resume_ledger))`. -- **Runs AFTER Phase 3 (reconcile).** Phase 3 writes the `:lfs` flag for chunks whose freeze was in flight at a prior crash; running `compute_resume_ledger` before Phase 3 would see those mid-freeze chunks as internal `:lfs` gaps and false-positive-fatal at startup. -- **Scans every startup, even when `streaming:last_committed_ledger` is already set.** The scan's primary output in the mid-life-restart case is validation, not derivation; catching broken on-disk state before opening active stores is strictly safer than silently resuming on top. -- **Validation failures are fatal.** Any inconsistency aborts startup with "migration to streaming failed" + an operator-readable error naming what's wrong. The service exits non-zero; no active stores are opened. +def estimate_current_tip(config, meta_store) -> Optional[int]: + """ + Best estimate of the current network tip at startup. Used as the reference + for retention floor calculations. 
-**Derivation** — first match wins: - -| `streaming:last_committed_ledger` | Scan result | Situation | `resume_ledger` | -|---|---|---|---| -| present | (validated consistent) | Mid-life restart (possibly after Phase 3 (reconcile) just finished in-flight freezes) | `value + 1` | -| absent | contiguous `:lfs` chunks `[start..end]` | First-ever post-Phase-1 (catchup), or crash between Phase 1 (catchup) end and first live commit | `last_ledger_in_chunk(end) + 1` | -| absent | no `:lfs` chunks | Tip-tracker fresh start (no `[BSB]`) | `retention_aligned_resume_ledger(config)` | + BSB present: Phase 1 just advanced streaming:last_committed_ledger to + BSB-tip's last ledger, which is within minutes of network tip (BSB upload + lag). Use it directly. -**Validation rules** (any violation → fatal): + No BSB: streaming:last_committed_ledger is from the prior run (potentially + weeks stale). Sample the network tip from history archive (same helper + retention_aligned_resume_ledger uses). None if archive unreachable, in + which case Pass 1 skips and pruning lifecycle handles cleanup later. + """ + if config.bsb is not None: + return meta_store.get("streaming:last_committed_ledger") -- **No internal gap in `:lfs` coverage.** Example FAIL: chunks `[0..90] ∪ [92..N]` with `91` missing. A trailing "no chunks beyond N" is normal end-of-prefix, not a gap. -- **Start aligns to a tx-index boundary.** `start_chunk == 0` (archive) OR `start_chunk % cpi == 0` (pruning-history — first chunk of a tx index). Example FAIL at `cpi=100`: scan yields `[3456..N]`; `3456 % 100 ≠ 0`. Correct start would have been `3500`. -- **Chunk flags consistent.** Every chunk in the contiguous range has both `:lfs` AND `:events`. A chunk with one but not the other means `process_chunk` crashed mid-task and was never re-run. -- **Index flags consistent.** Every complete tx index fully inside `[start, end]` has `index:{tx_index_id:08d}:txhash`. 
Trailing partial indexes do NOT — those wait for Phase 2 (`.bin` hydration) on first start, or become Phase 3 (reconcile) build-respawn candidates on restart. -- **Live checkpoint consistent with scan.** When `streaming:last_committed_ledger = L` is present, chunks through `chunk_id_of_ledger(L) - 1` must all have `:lfs`. Example FAIL: `L = 56_345_672` (chunk 5_634 ingesting), but scan's highest contiguous chunk is 5_632 — chunk 5_633 must have been frozen before chunk 5_634 could be active; its absence means a recent immutable artifact went missing out of band. (Mid-freeze state at a prior crash does NOT false-positive this rule because Phase 3 (reconcile) has already finished any in-flight freeze before `compute_resume_ledger` runs.) + try: + return get_latest_network_tip(config.history_archives.urls) + except NetworkTipUnreachable: + return None +``` -```python -def compute_resume_ledger(config, meta_store): - cpi = config.service.chunks_per_tx_index - scan = scan_all_chunk_and_index_keys(meta_store) - validate_scan(scan, cpi) +`finish_interrupted_freeze` reopens the active store (idempotent on existing or partial dirs) and runs the corresponding live-path freeze ([LFS](#lfs-transition), [events](#events-transition), or [RecSplit](#recsplit-transition)) to produce the artifact. - last_committed_ledger = meta_store.get("streaming:last_committed_ledger") - if last_committed_ledger is not None: - validate_last_committed_consistency(scan, last_committed_ledger) - return last_committed_ledger + 1 - if scan.lfs_chunks: - return last_ledger_in_chunk(scan.lfs_chunks[-1]) + 1 # first-ever post-Phase-1 - return retention_aligned_resume_ledger(config) # tip-tracker fresh start (no BSB) +### Compute Resume Ledger +`compute_resume_ledger(config, meta_store) -> ledger_seq`. Runs once per service start, after Phase 3 (reconcile), before Phase 4 (live ingestion). Returns the ledger sequence Phase 4's captive core resumes from via `PrepareRange(UnboundedRange(resume_ledger))`. 
-def validate_scan(scan, cpi): - # Fatal on any violation — "migration to streaming failed". - if not scan.lfs_chunks: - return - start, end = scan.lfs_chunks[0], scan.lfs_chunks[-1] - - expected = set(range(start, end + 1)) - actual = set(scan.lfs_chunks) - if actual != expected: - fatal(f"internal :lfs gap: missing chunks {sorted(expected - actual)}") - if start != 0 and start % cpi != 0: - fatal(f"start chunk {start} not tx-index aligned (expected multiple of cpi={cpi})") - if actual != set(scan.events_chunks): - fatal(":lfs / :events mismatch — process_chunk crashed mid-run, unrecovered") - - first_complete_tx_index_id = first_fully_covered_tx_index_id(start) - last_complete_tx_index_id = last_fully_covered_tx_index_id(end) - complete = set(range(first_complete_tx_index_id, last_complete_tx_index_id + 1)) - missing = complete - set(scan.txhash_indexes) - if missing: - fatal(f"complete tx indexes {sorted(missing)} missing index:txhash flag") - - -def validate_last_committed_consistency(scan, last_committed_ledger): - # L = last_committed_ledger ⇒ all chunks up to chunk_id_of_ledger(L)-1 must have :lfs (L's own chunk is still ingesting). 
- active_chunk_id = chunk_id_of_ledger(last_committed_ledger) - required_last = active_chunk_id - 1 - if required_last < 0: - return - actual_last = scan.lfs_chunks[-1] if scan.lfs_chunks else -1 - if actual_last < required_last: - fatal(f"streaming:last_committed_ledger={last_committed_ledger} requires :lfs " - f"through chunk {required_last}; scan's highest is {actual_last}") +**Four cases, first match wins:** +| State | Situation | `resume_ledger` | +|---|---|---| +| `streaming:last_committed_ledger` present + at or above the retention floor | Mid-life restart (Phase 1 advanced the key, or live loop committed ledgers in this run / a recent prior run) | `value + 1` | +| `streaming:last_committed_ledger` present but stale (no-BSB tip-tracker after long downtime, value below retention floor) | Re-ingesting via captive core from the stale value would replay days of past-retention chunks for nothing | Delete the stale key, return `retention_aligned_resume_ledger(config)` (skips forward to the new retention-aligned tx-index boundary) | +| `streaming:last_committed_ledger` absent + `:lfs` chunks present | Edge case — Phase 1 wrote `:lfs` flags but crashed before `advance_progress_marker` | `last_ledger_in_chunk(highest_lfs_chunk) + 1` | +| `streaming:last_committed_ledger` absent + no `:lfs` chunks | Tip-tracker fresh start (no BSB, never ingested) | `retention_aligned_resume_ledger(config)` | -def retention_aligned_resume_ledger(config): - # Tip-tracker fresh start (no BSB, no on-disk chunks). [BSB]-absent + retention=0 is rejected by validate_config. - network_tip_ledger = get_latest_network_tip(config.history_archives.urls) - retention_ledgers = config.service.retention_ledgers +```python +def compute_resume_ledger(config, meta_store) -> int: + """ + Decide the ledger sequence captive core's PrepareRange resumes at. + + Scenarios handled: + - Fresh first-ever start (BSB present) — Phase 1 already advanced the + progress marker; trivially resumes at progress_marker + 1.
+ - Fresh first-ever start (no BSB) — no progress marker, no :lfs; falls + through to retention_aligned_resume_ledger (samples tip from history + archive, aligns down to a tx-index boundary). + - Quick restart, BSB unchanged — progress marker is fresh; resume at marker + 1. + - Long-downtime restart, BSB present — Phase 1's catchup already advanced + the progress marker past any stale prior value; resume at the new marker + 1. + - Long-downtime restart, no BSB (tip-tracker) — Phase 1 was a no-op; + the progress marker is from the prior run (potentially weeks stale). + Detect by comparing against the current network tip; if the marker is + below the retention floor, delete it and skip forward to the + retention-aligned start. + - Phase 1 crashed after writing :lfs but before advance_progress_marker + — fall back to deriving resume from the highest :lfs chunk. + + No consistency validation is performed here. Phase 1 backfill self-heals + incomplete chunks within its range; Phase 3 recovers in-flight freezes; + pruning lifecycle handles past-retention state. + """ + progress_marker = meta_store.get("streaming:last_committed_ledger") + + if progress_marker is not None: + # Stale-marker check — only matters when there's no BSB (tip-tracker + # profile after long downtime). In BSB-present paths, Phase 1 has + # already advanced the marker past any stale value before we got here. + if config.bsb is None and config.service.retention_ledgers > 0: + current_tip = try_sample_network_tip(config) + if current_tip is not None: + floor = max(current_tip - config.service.retention_ledgers, GENESIS_LEDGER) + if progress_marker < floor: + log.info(f"compute_resume_ledger: marker {progress_marker} below retention floor {floor}; skipping forward") + meta_store.delete("streaming:last_committed_ledger") + return retention_aligned_resume_ledger_with_tip(config, current_tip) + return progress_marker + 1 + + # Marker absent.
Phase 1 may have crashed after writing :lfs but before + # advance_progress_marker. Recover via the highest :lfs chunk. + highest_lfs = scan_max_lfs_chunk(meta_store) + if highest_lfs is not None: + return last_ledger_in_chunk(highest_lfs) + 1 + + # No marker, no :lfs — tip-tracker fresh start. + return retention_aligned_resume_ledger(config) + + +def scan_max_lfs_chunk(meta_store) -> Optional[int]: + """ + Returns the highest chunk_id with `:lfs` set, or None if no :lfs key exists. + + Reverse-iterates the chunk: prefix and stops at the first :lfs key found. + O(suffix variants per chunk) — typically 1–2 reads, regardless of total + chunks on disk. Sub-millisecond at any cpi / archive size. + """ + for key in meta_store.iter_prefix_reverse("chunk:"): + if key.endswith(":lfs"): + return parse_chunk_id_from_chunk_key(key) + return None + + +def retention_aligned_resume_ledger(config) -> int: + """ + No-BSB resume cursor: align to the first ledger of the tx index containing + the retention floor. Captive core archive-catches-up from that point. + + Two callers: + - compute_resume_ledger fresh-start branch (no BSB, no prior commits). + - compute_resume_ledger stale-marker branch (no BSB long downtime — + prior progress marker is below the new retention floor). + + Helper retention_aligned_resume_ledger_with_tip is the same with an + explicit tip parameter, for callers that already sampled. + """ + network_tip = get_latest_network_tip(config.history_archives.urls) + return retention_aligned_resume_ledger_with_tip(config, network_tip) + + +def retention_aligned_resume_ledger_with_tip(config, network_tip_ledger) -> int: + retention_ledgers = config.service.retention_ledgers target_ledger = max(network_tip_ledger - retention_ledgers, GENESIS_LEDGER) return first_ledger_of_tx_index_containing(target_ledger) + + +def try_sample_network_tip(config) -> Optional[int]: + """ + Wrapper around get_latest_network_tip that returns None on failure + instead of raising. 
Used where a missing tip is recoverable (e.g., + compute_resume_ledger's stale-marker check — if we can't confirm the + marker is stale, fall through to using it). + """ + try: + return get_latest_network_tip(config.history_archives.urls) + except NetworkTipUnreachable: + return None ``` ### Phase 4 — Live Ingestion -Opens active stores for the resume position, spawns the lifecycle task, starts captive core, and enters the ingestion loop. Query serving starts here (see [Query Contract](#query-contract)). +Opens active stores for the resume position, spawns the lifecycle task, starts captive core, and enters the ingestion loop. **Query serving has already been enabled by `run_rpc_service`** — Phase 4 is purely about ingestion. ```python def phase4_live_ingest(config, meta_store, resume_ledger): - # streaming:last_committed_ledger is first written by the live loop, not at bootstrap. + # service_ready was set by run_rpc_service before this call. Queries against + # frozen artifacts (chunks <= streaming:last_committed_ledger) are already + # being served. Phase 4 starts the live ingestion path so new ledgers begin + # to flow. + + # Open active stores at the resume position. open_active_*_store writes the + # hot:* key BEFORE mkdir (see Flag Semantics) — Phase 3 has already cleaned + # any stale hot keys for this chunk/index, so these mkdir calls land on an + # empty filesystem path (or, in the SCENARIO A "keep" case, on the prior-run + # active dir that's still on disk + idempotently re-opened). active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) + + # Background pruning loop. Runs an initial sweep on entry to handle any + # pruning:index:* markers left by a crash mid-prune in the prior run. run_in_background(run_prune_lifecycle_loop, config, meta_store) + # Captive core spinup. PrepareRange tells the SDK what range we want; actual + # spinup takes 4-5 minutes during which GetLedger blocks. 
/getHealth shows + # status="catching_up" during this window because streaming:last_committed_ledger + # hasn't moved yet, but historical queries still work against frozen artifacts. ledger_backend = make_ledger_backend(config.captive_core.config_path) ledger_backend.PrepareRange(UnboundedRange(resume_ledger)) - set_service_ready() # in-memory; unblocks queries run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, resume_ledger) def open_active_stores_for_resume(config, meta_store, resume_ledger): + """ + Open the three active stores for the chunk/tx-index that ingestion will + resume into. Idempotent on existing dirs (mkdir no-ops; RocksDB recovers + from any partial state). + + Called once from phase4_live_ingest. Per-kind stores share no state and + can be opened in any order. + """ resume_chunk_id = chunk_id_of_ledger(resume_ledger) resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) - # Each open_active_*_store sets its hot:* key before mkdir (see Flag Semantics). return ActiveStores( ledger = open_active_ledger_store(config, meta_store, resume_chunk_id), events = open_active_events_store(config, meta_store, resume_chunk_id), @@ -558,7 +831,7 @@ def open_active_stores_for_resume(config, meta_store, resume_ledger): ) ``` -Captive core takes 4–5 minutes to spin up and start emitting at `resume_ledger`. During that window `getHealth` remains in `catching_up` state (see [Query Contract](#query-contract)). +Captive core takes 4–5 minutes to spin up. During that window the service is already serving historical queries (everything up through `streaming:last_committed_ledger`). `/getHealth` reports `status = "catching_up"` until the live loop commits its first ledger; queries for ledgers above the marker return "not yet available" via the QueryRouter. 
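
The spin-up window behavior described above can be sketched as a small routing decision. This is a minimal illustration, not the service's actual QueryRouter: `route_read_query`, `is_catching_up`, and the returned strings are hypothetical names, and the sketch deliberately ignores retention and `pruning:index:*` checks.

```python
from typing import Optional


def route_read_query(requested_ledger: int,
                     service_ready: bool,
                     last_committed_ledger: Optional[int]) -> str:
    """Hypothetical QueryRouter decision for one getLedger-style request."""
    if not service_ready:
        # Startup phases have not finished, so read endpoints are gated.
        return "http_4xx"
    if last_committed_ledger is None or requested_ledger > last_committed_ledger:
        # Above the progress marker: captive core has not delivered it yet.
        return "not_yet_available"
    # At or below the marker: answerable without captive core running.
    return "serve"


def is_catching_up(marker_at_phase4_entry: Optional[int],
                   marker_now: Optional[int]) -> bool:
    """True until the live loop moves the marker past its Phase 4 entry value."""
    return marker_now == marker_at_phase4_entry
```

Under these assumptions, a request at or below `streaming:last_committed_ledger` is served during captive-core spin-up, while a request one ledger above it is reported as not yet available.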
--- @@ -607,7 +880,7 @@ Three independent background transitions per chunk/index boundary; each has its ### Concurrency Model - **`active_stores` is the ingestion loop's owned state.** Fields `ledger` / `events` / `txhash` (one handle per data type, no `*_next`) are mutated only inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze tasks receive a handle by value at spawn and never read back through `active_stores`. -- **Meta-store is single-writer.** Writers are the ingestion loop (per-ledger checkpoint), freeze tasks (`:lfs` / `:events` / `:txhash` flags), and the lifecycle loop (`"deleting"` marker + prune key delete). The meta-store wrapper serializes them on top of RocksDB's single-writer semantics. +- **Meta-store is single-writer.** Writers are the ingestion loop (per-ledger checkpoint), freeze tasks (`:lfs` / `:events` / `:txhash` flags), and the lifecycle loop (`pruning:index:*` marker + prune key deletes). The meta-store wrapper serializes them on top of RocksDB's single-writer semantics. - **Per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit); the next starts only after the previous releases. `wait_for_lfs_complete()` acquires the LFS gate; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases it (events / RecSplit follow the same shape). Not a global barrier — kinds remain independent. - **Query handlers read from a storage-manager layer.** Per-data-type managers (ledger / events / txhash) own their own state-transition synchronization; query handlers never touch `active_stores` directly. - **Read-view invariant:** a query sees either pre-transition data (routed to the transitioning store) or post-transition data (new active store + newly-flagged immutable artifact) — never a half-state mix. @@ -712,11 +985,20 @@ Retention is enforced by a single background task, woken at chunk boundaries. 
Pr ```python def run_prune_lifecycle_loop(config, meta_store): - # Initial sweep catches "deleting" state from a prior crashed prune; later sweeps fire on boundary. + """ + Background goroutine spawned at the top of phase4_live_ingest. Runs an + initial sweep on entry (catches in-progress prunes from a prior-run crash), + then loops on chunk-boundary notifications. + + This is the ONLY caller of prune_tx_index. Phase 1, 2, 3 don't call it. + Phase 3 Pass 1 cleans up past-retention artifacts at startup but does so + directly (no pruning marker needed — service_ready = false during startup, + no queries to gate). + """ retention_ledgers = config.service.retention_ledgers - _run_prune_sweep(meta_store, retention_ledgers, config) + _run_prune_sweep(meta_store, retention_ledgers, config) # initial: handles crash-recovered in-progress prunes while True: - wait_for_chunk_boundary_notification() + wait_for_chunk_boundary_notification() # woken by on_chunk_boundary's notify_lifecycle() _run_prune_sweep(meta_store, retention_ledgers, config) @@ -726,38 +1008,77 @@ def _run_prune_sweep(meta_store, retention_ledgers, config): def prunable_tx_index_ids(meta_store, retention_ledgers): - # Eligible: tx_index fully past retention AND `:txhash` is `"1"` or `"deleting"`. + """ + Two sources of work, unioned: + 1. Crash recovery — any pruning:index:N key set means a prior prune was + interrupted; re-attempt regardless of retention status. + 2. Steady-state — past-retention indexes whose index:N:txhash = "1". + """ if retention_ledgers == 0: return [] + + result = set() + + # Source 1: in-progress prunes from a prior crash. + # pruning:index:N is set BEFORE any file delete and cleared AFTER all files + # + the index key are gone, so its presence at startup unambiguously means + # "this prune was interrupted, finish it." + for key in meta_store.scan_prefix("pruning:index:"): + result.add(parse_pruning_index_id(key)) + + # Source 2: newly past-retention indexes that need a fresh prune. 
last_committed_ledger = meta_store.get("streaming:last_committed_ledger") max_eligible_tx_index_id = max_prunable_tx_index_id(last_committed_ledger, retention_ledgers) - if max_eligible_tx_index_id < 0: - return [] - result = [] for tx_index_id in range(0, max_eligible_tx_index_id + 1): - val = meta_store.get(f"index:{tx_index_id:08d}:txhash") - if val in ("1", "deleting"): - result.append(tx_index_id) - return result + if tx_index_id in result: + continue # already added from Source 1 + if meta_store.get(f"index:{tx_index_id:08d}:txhash") == "1": + result.add(tx_index_id) + + return sorted(result) def prune_tx_index(tx_index_id, meta_store, config): - # Two-phase marker: "deleting" set first, key cleared last. Idempotent on retry. - meta_store.put(f"index:{tx_index_id:08d}:txhash", "deleting") + """ + Tear down all artifacts for tx_index_id. Called only from _run_prune_sweep. + + Two ordering constraints bookend the work: + - pruning:index:N is set FIRST (before any file delete) — atomic gate + that flips queries to 4xx for any tx in index N. + - pruning:index:N is cleared LAST (after every other op) — survives + crashes so the next sweep's prunable_tx_index_ids picks N back up. + + Everything between is individually idempotent: file deletes use + delete_if_exists, key deletes are no-ops on absent keys. Crash anywhere + in between leaves a state the re-run cleans up. + """ + # OP 1: gate queries off + mark "prune in progress" for crash recovery. + meta_store.put(f"pruning:index:{tx_index_id:08d}", "1") + + # WORK: per-chunk file + key deletion. File-before-flag-delete preserved + # for each chunk's keys. for chunk_id in chunks_for_tx_index(tx_index_id): - # Files-first, flags-last; all deletes idempotent. 
delete_if_exists(ledger_pack_path(chunk_id)) delete_events_segment(chunk_id) - delete_if_exists(raw_txhash_path(chunk_id)) # defence-in-depth; normally already gone + delete_if_exists(raw_txhash_path(chunk_id)) # defence-in-depth; normally already gone via cleanup_txhash meta_store.delete(f"chunk:{chunk_id:08d}:lfs") meta_store.delete(f"chunk:{chunk_id:08d}:events") meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_recsplit_idx_files(tx_index_id) + + # OP 2: index:N:txhash transitions from "1" to absent. With the pruning + # marker still set, queries continue to see 4xx via the marker check. meta_store.delete(f"index:{tx_index_id:08d}:txhash") + + # OP 3: clear the prune marker. Tx index N is now fully gone. + # Past this point, prunable_tx_index_ids no longer sees N. + meta_store.delete(f"pruning:index:{tx_index_id:08d}") ``` - **Why index-atomic.** Per-chunk pruning would open a window where `getTransaction` resolves to a ledger seq whose pack has been deleted; whole-index gating closes it. - **Extra data on disk.** Up to `LEDGERS_PER_TX_INDEX - 1` ledgers past strict retention. `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_TX_INDEX`, so the next-eligible index is exactly `LEDGERS_PER_TX_INDEX` further. +- **Why a separate `pruning:index:*` key family** (instead of overloading `index:N:txhash` with a `"deleting"` value). Keeps every meta-store key binary (present-or-absent), so reader code never has to decode special values. The marker's "set first, cleared last" pattern is the same in either encoding; using a separate key family makes the prune-intent explicit and self-documenting in meta-store dumps. Cost: one extra meta-store op per prune (a sub-microsecond RocksDB point write) and a slightly longer `prunable_tx_index_ids` (union of the in-progress set with the past-retention-eligible set). Worth it for schema uniformity. 
+- **Pruning marker is steady-state-only.** Phase 3 Pass 1 also deletes past-retention `index:N:txhash` keys (and any `.idx` / `.pack` / events / `.bin` files for chunks below the retention floor), but does NOT set the pruning marker — at startup `service_ready = false`, so there are no queries to gate. Pass 1 also clears any `pruning:index:N` it finds during its sweep, so the lifecycle loop's initial sweep doesn't re-attempt work Pass 1 already finished. --- @@ -767,7 +1088,7 @@ Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, ` ### Readiness Signal -- In-memory boolean `service_ready`, flipped to `true` by `set_service_ready()` at Phase 4 (live ingestion) entry, after Phases 1–3 complete and active stores are opened. Not persisted; every startup begins `false`. +- In-memory boolean `service_ready`, flipped to `true` by `set_service_ready()` once on-disk state is consistent — that means after Phases 1–3 complete and `compute_resume_ledger` returns, but BEFORE Phase 4's captive-core spinup. Reasoning: queries against frozen artifacts (chunks `<=` `streaming:last_committed_ledger`) don't depend on captive core having started, so 4xx-during-spinup adds no correctness; it only adds an unnecessary 4-5 minute query outage on every restart. Not persisted; every startup begins `false`. - HTTP server binds at service startup (before Phase 1 (catchup)), so `getHealth` is always servable. The QueryRouter routes `getHealth` unconditionally and gates `getLedger` / `getTransaction` / `getEvents` on `service_ready`: `false` → HTTP 4xx; `true` → route normally. - Clients see `HTTP 4xx` from the three read endpoints on every startup until Phase 4 is reached, regardless of prior runs. Intentional — catchup and recovery phases must complete before the service serves, every time. @@ -779,8 +1100,8 @@ Query serving is gated on Phase 4 (live ingestion) being reached. 
`getLedger`, ` ### Behavior When an Index Is Being Pruned -- `prune_tx_index` sets `index:{tx_index_id:08d}:txhash = "deleting"` before touching any files, and deletes the key after all files are gone. Query routing treats `"deleting"` identically to `"absent"` (key-not-present). -- Queries for a ledger in a pruning index return HTTP 4xx (past retention) starting the instant the `"deleting"` marker is set, not when the files actually disappear. No window where queries route into a half-deleted index. +- `prune_tx_index` sets `pruning:index:{tx_index_id:08d} = "1"` before touching any files, deletes `index:{tx_index_id:08d}:txhash` after all artifacts are gone, and clears `pruning:index:{tx_index_id:08d}` last. QueryRouter checks the `pruning:index:*` key first; if set, returns HTTP 4xx as if the index were past retention. +- Queries for a ledger in a pruning index return HTTP 4xx the instant `pruning:index:N` is set, not when files actually disappear. No window where queries route into a half-deleted index. ### Rationale @@ -802,12 +1123,14 @@ No separate recovery phase. Every startup runs Phases 1–4 — already-complete In addition to the backfill subroutine's invariants in [01-backfill-workflow.md — Crash Recovery](./01-backfill-workflow.md#crash-recovery), streaming adds the following: -1. **Per-ledger checkpoint.** `streaming:last_committed_ledger` is written only after all three active stores durably commit. Resume is `last_committed_ledger + 1`. +1. **Monotonic progress marker.** `streaming:last_committed_ledger` advances only via `advance_progress_marker` (never regresses). Two writers: Phase 1 catchup (post-catchup, to `last_ledger_in_chunk(highest completed chunk)`) and live ingestion loop (per ledger, after all three active stores durably commit). 
Resume is `last_committed_ledger + 1` — except in the no-BSB long-downtime case where the marker is stale and below the new retention floor, in which case `compute_resume_ledger` deletes it and skips forward to the retention-aligned start. 2. **No separate recovery phase.** Startup is Phases 1–4. Nothing else. 3. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. -4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. Past-retention orphans can only arise from leapfrog — and leapfrog is deterministic, so Phase 1 (catchup) itself avoids producing them. -5. **Two-phase prune marker.** `index:{tx_index_id}:txhash = "deleting"` is set before any file delete; the key is cleared last. Queries treat `"deleting"` as absent. Idempotent on crash (`prunable_tx_index_ids` picks `"deleting"` back up). See [Pruning](#pruning). -6. **Hot-key tracking.** Every active store directory has a `hot:*` key, set BEFORE `mkdir` and cleared AFTER dir removal. Phase 3 (reconcile) iterates `hot:*` keys to find directories needing recovery — no filesystem scan anywhere. +4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. +5. **Pruning intent marker.** `pruning:index:{N} = "1"` is set BEFORE any file delete and cleared AFTER all artifacts and the `index:{N}:txhash` key are gone. QueryRouter treats its presence as "tx index N is unservable" → 4xx. Crashes mid-prune leave the marker set; the lifecycle loop's `prunable_tx_index_ids` picks it up via `scan_prefix("pruning:index:")` and re-runs the prune idempotently. See [Pruning](#pruning). +6. **Hot-key tracking + retention-aware cleanup.** Every active store directory has a `hot:*` key, set BEFORE `mkdir` and cleared AFTER dir removal. 
Phase 3's two passes are both meta-store-driven (no filesystem scan anywhere): pass 1 discards past-retention orphans (active dirs and stale freeze flags below the retention floor — common after long downtime where the floor moves forward); pass 2 reconciles in-flight transitions for entries above the floor. +7. **No permanently-partial tx index.** Every persisted tx index reaches a terminal state — either *complete* (`index:N:txhash = "1"`, RecSplit built) or *fully discarded* (all chunks + `index:N:txhash` deleted). The intermediate "trailing partial" state (chunks with `:lfs+:events` but no `index:N:txhash`) only persists transiently — it is completed by either: (a) a future Phase 1 invocation extending the backfill range past `last_chunk_in_tx_index(N)`, (b) Phase 4 ingestion reaching `last_chunk_in_tx_index(N)` (which fires `on_tx_index_boundary` → `build_tx_index_recsplit_files`), or (c) Phase 3 Pass 1 discarding it as past-retention. No fourth path; no tx index ever stays partial forever. +8. **No permanent orphans.** Every meta-store flag has a corresponding artifact (or is mid-cleanup, recoverable via the file-before-flag-delete invariant). Every active-store directory has a `hot:*` key (set before mkdir). Every immutable file has a freeze flag (set after fsync). The only states without a recovery path are operator-introduced (manually deleted a meta-store key while files survive, or restored a filesystem snapshot inconsistently) — explicitly out of scope per the meta-store-driven-recovery principle. In all other cases, every artifact on disk traces back to a meta-store record that some recovery path (backfill self-heal, Phase 3 Pass 1, Phase 3 Pass 2, or pruning lifecycle) will eventually act on. 
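
The ordering in invariant 5 (pruning intent marker) can be sketched as a small in-memory model. This is only an illustration under stated assumptions: a plain `dict` stands in for the meta-store RocksDB, a `set` of path strings stands in for on-disk artifacts, and the `index-{N:08d}/...` path layout, the `is_servable` helper, and the `crash_after_deletes` knob are hypothetical names introduced here. Only the order of operations (marker first, artifacts next, `index:{N}:txhash` after, marker last) and the `pruning:index:` prefix scan are taken from the text above; per-chunk `:lfs` / `:events` key cleanup is omitted for brevity.

```python
# Hypothetical in-memory model of the pruning intent marker ordering.
# `meta` (dict) stands in for the meta store; `files` (set) for disk artifacts.


class SimulatedCrash(Exception):
    """Raised to model a process crash partway through a prune."""


def pruning_key(n):
    return f"pruning:index:{n:08d}"


def index_key(n):
    return f"index:{n:08d}:txhash"


def prune_tx_index(meta, files, n, crash_after_deletes=None):
    # 1. Intent marker is set BEFORE any file delete.
    meta[pruning_key(n)] = "1"

    # 2. Delete artifacts (the path layout here is an assumption).
    deleted = 0
    for path in sorted(p for p in files if p.startswith(f"index-{n:08d}/")):
        if crash_after_deletes is not None and deleted == crash_after_deletes:
            raise SimulatedCrash(path)
        files.discard(path)
        deleted += 1

    # 3. Completion flag goes only after all artifacts are gone.
    meta.pop(index_key(n), None)

    # 4. Marker is cleared LAST, so a crash anywhere above leaves it set.
    meta.pop(pruning_key(n), None)


def is_servable(meta, n):
    # The router checks the pruning marker first; presence means "4xx".
    if pruning_key(n) in meta:
        return False
    return meta.get(index_key(n)) == "1"


def prunable_tx_index_ids(meta):
    # Models scan_prefix("pruning:index:"): any surviving marker is re-enqueued.
    prefix = "pruning:index:"
    return sorted(int(k[len(prefix):]) for k in meta if k.startswith(prefix))
```

In this model a crash between steps 1 and 4 leaves the marker in place, so the index is unservable immediately and a second call to `prune_tx_index` finishes the job without needing a filesystem scan.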
#### Compound Recovery Scenarios From 474958361b59ce9c396e10073df8769a8bb4438b Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Sun, 26 Apr 2026 22:42:09 -0700 Subject: [PATCH 33/34] Streaming doc: Phase 3 Pass 1 categorization + tip-from-archive refactor MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Trim 1158 → ~930 lines (~27%): drop Terminology section, "Key Lifecycle in Streaming" code-style block, redundant Resilience invariants + Compound Recovery Scenarios subsection, pre/post-pseudocode prose, and cross-section service_ready / Concurrency Model duplication. - Phase 3 Pass 1 renamed pass1_drop_past_retention_state and restructured to spell out three categories explicitly: (1) lifecycle-unreachable orphans (hot:* keys + chunks of incomplete prior-run tx_indexes that the lifecycle's past-retention scan can't find), (2) lifecycle-reachable past-retention complete state (drained eagerly so the first lifecycle sweep starts with no backlog), (3) mid-prune markers below floor. - estimate_current_tip removed. Pass 1 now calls try_sample_network_tip directly to read the live tip from the history archive, with the marker as a degraded fallback only when the archive is unreachable. The BSB-marker-as-tip shortcut was unreliable under operator scenarios where BSB stops advancing while the network keeps producing. - retention_floor_ledger extracted as a shared helper, used by both Phase 1's retention_aligned_start_chunk and Phase 3 Pass 1's floor calc. - Inline pseudocode added for tx_index_ids_with_txhash_flag, max_prunable_tx_index_id, and open_active_txhash_store (with the idempotency contract — hot-key-before-mkdir, mkdir-exist-ok, RocksDB WAL recovery — spelled out). Sister stubs for the ledger and events variants flagged via "..." with one-line shape note. 
- Per-if-condition comment audit across Phases 1, 2, 3: callous comments rewritten to describe the exact scenario they check, no clubbing (e.g., phase1_catchup's "no chunks completed" expanded to "BSB returned -1; the other loop break is unreachable under current geometry"). - Service Entry Point promoted to its own heading; Backfill vs Phase 1 (catchup) collapsed to a bold-prefix paragraph with an inline HTML anchor preserving the cross-doc link from 01-backfill-workflow.md. --- .../design-docs/02-streaming-workflow.md | 821 +++++++----------- 1 file changed, 297 insertions(+), 524 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 7d40ec6b3..888ef8ed9 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -10,58 +10,13 @@ stellar-rpc is the **unified full-history RPC service** — historical backfill - Backfill (specified in [01-backfill-workflow.md](./01-backfill-workflow.md)) is used as an internal subroutine by Phase 1 (catchup). Operators never invoke backfill directly. **What the service does end-to-end:** -- Validates config against immutable meta-store state: `CHUNKS_PER_TX_INDEX` (chunks-per-tx-index constant; defines on-disk layout) and `RETENTION_LEDGERS` (history window in ledgers, or `0` for full history). Both detailed in [Configuration](#configuration). -- Catches up to the current **network tip** (most recent ledger the Stellar network has produced, sampled from the history archive — defined in [Terminology](#terminology)) using **BSB** (Buffered Storage Backend — remote object-store reader for `LedgerCloseMeta`; see [Ledger Source](#ledger-source)) or captive core (embedded `stellar-core` subprocess; see [Ledger Source](#ledger-source)), whichever is configured. +- Validates config against immutable meta-store state (`CHUNKS_PER_TX_INDEX`, `RETENTION_LEDGERS`). 
+- Catches up to the current network tip using **BSB** (Buffered Storage Backend — remote object-store reader for `LedgerCloseMeta`) or captive core (embedded `stellar-core` subprocess), whichever is configured. See [Ledger Source](#ledger-source). - Hydrates any in-flight state left by a prior run. -- Ingests live ledgers from captive core. -- Writes each live ledger to three **active Rocksdb stores** — mutable per-chunk or per-index RocksDB instances for ledger, txhash, events — detailed in [Active Store Architecture](#active-store-architecture). +- Ingests live ledgers from captive core into three **active RocksDB stores** (per-chunk ledger + events, per-index txhash). See [Active Store Architecture](#active-store-architecture). - Freezes active stores to immutable files at chunk and index boundaries in background. - Prunes past-retention indexes atomically when retention is configured. -- Serves `getLedger`, `getTransaction`, `getEvents` only after startup phases complete. Returns HTTP 4xx during startup. - ---- - -## Terminology - -Vocabulary used throughout this doc. Skim on first read; refer back as terms come up. - -- **Service** — the stellar-rpc binary running as one long-lived process. The only thing an operator starts. - -- **Startup phases 1–4** — the four steps the service runs at every start before it begins serving queries. Phase 1 catches up history, Phase 2 hydrates leftover state, Phase 3 reconciles anything left mid-flight by a prior crash, Phase 4 takes over for live streaming. Once Phase 4 is reached, the service stays there until it exits — there is no Phase 5. - -- **Phase 1 (catchup)** — the startup step that closes the gap between what's already on disk and what the Stellar network has produced so far. Uses backfill as its mechanism. - -- **Backfill** — the process of pulling historical ledgers from a remote object store and writing them to disk as immutable artifacts. 
Backfill is internal to the service — operators never invoke it directly. Specified in [01-backfill-workflow.md](./01-backfill-workflow.md). - -- **Retention-aligned start** — how the service picks the starting chunk when retention is configured: the start always lands on a tx-index boundary, never mid-index, so the first tx index ingested is complete. Without this rounding, the chunks before the start would fall below the retention floor and never be ingested, leaving the tx index broken and the ingest-work on its later chunks wasted. Used in two places: Phase 1 (catchup) range-start computation when BSB is configured, and `compute_resume_ledger`'s no-BSB path (fresh start or stale-marker recovery). - -- **Network tip** — the most recent ledger the Stellar network has produced. The service learns this from a public Stellar history archive over HTTP, not from its own state. - -- **Resume ledger** — at every start, the service decides which ledger it should resume live ingestion at, based on what's already on disk plus anything a prior crash left mid-flight. The first ledger ingested in the new run is the resume ledger. - -- **`streaming:last_committed_ledger`** — the local state-store key that records the last ledger the service successfully wrote during live streaming. Updated once per live ledger; never written during the startup phases. - -- **Active store** — a writable store that holds in-flight data for whatever chunk or txhash index is currently being ingested. Three kinds, one per data type: - - **Ledger active store** — one instance per chunk. - - **TxHash active store** — one instance per txhash index. - - **Events active store** — one instance per chunk. - -- **Immutable store** — on-disk files produced when an active store is frozen. Three kinds, paired with the active stores above: - - **Ledger pack file** — one per chunk. - - **TxHash lookup files** — multiple per txhash index, for fast `txhash → ledger` lookup. 
- - **Events cold segment** — three files per chunk. - -- **Freeze transition** — the background work of converting an active store into its immutable counterpart, then deleting the active store. Three kinds: **ledger freeze (LFS)** and **events freeze** happen at every chunk boundary; **txhash freeze** happens at every index boundary. - -- **Chunk** — a block of 10_000 consecutive ledgers. Atomic unit of ingestion and freeze: every chunk on disk is a complete 10_000-ledger chunk, never partial. - -- **Txhash index** (a.k.a. "tx index" or just "index") — a group of consecutive chunks (default: 1_000 chunks = 10_000_000 ledgers). Atomic unit of retention pruning: a tx index is pruned as a whole, never per chunk. Formulas in [Geometry](#geometry). - -- **Chunk boundary** — the moment ingestion finishes a chunk. Triggers the chunk's ledger and events freezes in the background. - -- **Index boundary** — the moment ingestion finishes a tx index. Triggers the tx index's txhash freeze in the background. Every index boundary is also a chunk boundary. - -- **`.bin` file** — a transient on-disk file produced by backfill while a tx index is still being filled in. Holds the raw txhashes for one chunk. Deleted once the tx index is complete (or once its contents are loaded into the active txhash store at startup). +- Serves `getLedger` / `getTransaction` / `getEvents` only after startup phases complete; returns HTTP 4xx during startup. --- @@ -85,7 +40,7 @@ These sections come from backfill — see [01-backfill-workflow.md — Configura - `[META_STORE]` — meta-store RocksDB path. - `[LOGGING]` — log level + format. -Streaming extends `[SERVICE]` with extra keys and introduces `[CAPTIVE_CORE]` (embedded `stellar-core` subprocess settings), `[ACTIVE_STORAGE]` (active RocksDB paths), and `[HISTORY_ARCHIVES]` (Stellar history-archive URLs for tip sampling) — all defined in [TOML Sections Documented Here](#toml-sections-documented-here) below. 
+Streaming extends `[SERVICE]` with extra keys and introduces `[CAPTIVE_CORE]` (embedded `stellar-core` subprocess settings), `[ACTIVE_STORAGE]` (active RocksDB paths), and `[HISTORY_ARCHIVES]` (Stellar history-archive URLs for tip sampling). ### Immutable Keys (stored in meta store, fatal if changed) @@ -96,10 +51,9 @@ Stored on first start; fatal on any subsequent start where the config value diff | `CHUNKS_PER_TX_INDEX` | `config:chunks_per_tx_index` | first run | Fatal if changed. | | `RETENTION_LEDGERS` | `config:retention_ledgers` | first run | Fatal if changed. | -- Source selection (BSB vs captive core) is determined per-startup by `[BSB]` presence; not stored as immutable. -- Operators may add or remove `[BSB]` between runs; `compute_resume_ledger` derives resume from on-disk chunks regardless of which source produced them. `RETENTION_LEDGERS` already pins the retained ledger window — locking the source choice would add nothing. +Source selection (BSB vs captive core) is determined per-startup by `[BSB]` presence; operators _may add or remove_ `[BSB]` between runs. `RETENTION_LEDGERS` already pins the retained window, so locking the source choice would add nothing. -### TOML Sections Documented Here +### Streaming TOML Config **[SERVICE] — streaming additions** @@ -125,45 +79,30 @@ Extends the `[SERVICE]` table in [01-backfill-workflow.md — Configuration](./0 **[HISTORY_ARCHIVES]** -| Key | Type | Default | Description | -|---|---|---|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample network tip for the no-BSB resume-cursor calculation (when `[BSB]` is absent and the service needs a tip reference for retention floor / fresh-start alignment). 
| +| Key | Type | Default | Description | +|---|---|---|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `URLS` | []string | **required** | List of Stellar history archive URLs. Used to sample network tip for the no-BSB resume-from-ledger calculation | -**[BSB]** (optional) - -- Same schema as in the backfill doc. Presence in the config file determines Phase 1 (catchup) behavior: - - Present: Phase 1 (catchup) invokes backfill over the BSB (fast, parallel per-chunk catchup). - - Absent: Phase 1 (catchup) is a no-op; Phase 4 (live ingestion)'s captive core archive-catches-up from a `resume_ledger` aligned to the retention-aligned tx-index boundary (slower, but no object-store dependency). -- See [Ledger Source](#ledger-source) for the BSB-source details and [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup) for the full split. +**[BSB]** (optional) — same schema as in the backfill doc; presence determines Phase 1 (catchup) behavior. See [Operator Profiles](#operator-profiles). ### CLI Flags | Flag | Type | Default | Description | |---|---|---|---| | `--config` | string | **required** | Path to TOML config file. | -| `--log-level` | string | from `[LOGGING].LEVEL` | Override log level. | -| `--log-format` | string | from `[LOGGING].FORMAT` | Override log format. | -**No other flags.** - No `--mode`; no `--start-ledger`, `--end-ledger`; no separate subcommand for backfill or streaming. Any per-run behavior is either driven by config or derived at runtime from meta store + tip. +**No other flags.** No `--mode`; no `--start-ledger`, `--end-ledger`; no subcommand for backfill or streaming. +Per-run behavior is driven by config or derived at runtime from meta store + tip. 
### Validation Rules -- `CHUNKS_PER_TX_INDEX` - immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). -- [`RETENTION_LEDGERS` - immutable across runs. Must be `0` OR a positive integer multiple of `LEDGERS_PER_TX_INDEX` (defined in [01-backfill-workflow.md — Geometry](./01-backfill-workflow.md#geometry)). - - Valid values of `RETENTION_LEDGERS` at `cpi=1_000`: `0`, `10_000_000`, `20_000_000`, `30_000_000` etc. - - Invalid: `15_000_000` (not a multiple), `5_000_000` (below minimum/not a multiple). - - Rationale: pruning runs at whole-index granularity; retention windows that don't align to index boundaries would leave partial indexes perpetually on disk. -- `[BSB]` optional. When present → Phase 1 (catchup) invokes backfill over the BSB; when absent → Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles initial catchup. May be added or removed between runs. -- **`[BSB]` absent AND `RETENTION_LEDGERS = 0` is fatal.** Full history requires BSB — captive-core archive-catchup from genesis would take weeks-to-months. Not a supported operating mode. -- `[HISTORY_ARCHIVES].URLS` required in all profiles. -- `[CAPTIVE_CORE].CONFIG_PATH` required in all profiles. -- `[CAPTIVE_CORE].STELLAR_CORE_BINARY_PATH` required in all profiles. -- `[SERVICE].NETWORK_PASSPHRASE` required in all profiles. +- `CHUNKS_PER_TX_INDEX` and `RETENTION_LEDGERS` are immutable across runs (see [Immutable Keys](#immutable-keys-stored-in-meta-store-fatal-if-changed)). +- `RETENTION_LEDGERS` must be `0` OR a positive integer multiple of `LEDGERS_PER_TX_INDEX`. Valid at cpi=1_000: `0`, `10_000_000`, `20_000_000`, ...; invalid: `5_000_000`, `15_000_000`. Pruning is whole-index — non-aligned windows would leave partial indexes perpetually on disk. +- **`[BSB]` absent AND `RETENTION_LEDGERS = 0` is fatal.** Full history requires BSB — captive-core archive-catchup from genesis would take weeks-to-months. 
+- `[HISTORY_ARCHIVES].URLS`, `[CAPTIVE_CORE].CONFIG_PATH`, `[CAPTIVE_CORE].STELLAR_CORE_BINARY_PATH`, `[SERVICE].NETWORK_PASSPHRASE` are required in all profiles. ### Validation Pseudocode -`validate_config` applies the rules above and then enforces immutability for the two immutable keys. The non-obvious mechanism is the immutable-key check itself — store on first run, compare on every subsequent run: - ```python def validate_config(config, meta_store): apply_static_rules(config) # see "Validation Rules" above @@ -181,33 +120,31 @@ def _enforce_immutable(meta_store, key, current_value): ### Operator Profiles -Three profiles emerge from config combinations. No profile flag. +Three profiles emerge from config combinations. No separate profile flag. -| Profile | `RETENTION_LEDGERS` | `[BSB]` | Phase 1 behavior | Use case | +| Profile | `RETENTION_LEDGERS` | `[BSB]` | Use case | Backfill behavior | |---|---|---|---|---| -| Archive | `0` | present | Backfill over full history (chunks `[0, current_chunk − 1]`) | Public archive node; full history. | -| Pruning-history | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | present | Backfill over retention window (start aligned to first chunk of the tx index containing the retention floor) | Windowed history with bulk initial catchup. | -| Tip-tracker | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | absent | **No-op.** Phase 4 (live ingestion)'s captive core archive-catches-up from a `resume_ledger` aligned to the retention-aligned tx-index boundary | App developer; short retention; no object-store dep. | -| (invalid) | `0` | absent | — | Rejected by `validate_config`: full history requires BSB. | +| Archive | `0` | present | Public archive node; full history. | Backfill over full history (chunks `[0, bsb_tip_chunk]`) | +| Pruning-history | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | present | Windowed history with bulk initial catchup. 
| Backfill over retention window (start aligned to first chunk of the tx index containing the retention floor) | +| Tip-tracker | `N × LEDGERS_PER_TX_INDEX`, N ≥ 1 | absent | App developer; short retention; no object-store dep. | **No-op.** Phase 4 (live ingestion)'s captive core archive-catches-up from a `resume_ledger` aligned to the retention-aligned tx-index boundary | +| (invalid) | `0` | absent | Rejected by `validate_config`: full history requires BSB. | — | --- ## Meta Store Keys -*This section is a reference for the key schema and lifecycle. It reads more naturally after [Startup Sequence](#startup-sequence) below, which defines the phases that write and consume these keys.* - -Single RocksDB instance, WAL (Write-Ahead Log) always enabled. Authoritative source for every startup decision. +Single RocksDB instance, WAL always enabled. Authoritative source for every startup decision. Reference for the schema and lifecycle below; reads more naturally after [Startup Sequence](#startup-sequence) defines the phases that write and consume these keys. ### Keys Introduced by Streaming | Key | Value | Written when | -|---|---|---| -| `streaming:last_committed_ledger` | uint32 (big-endian) | Monotonic progress marker — highest ledger that has been persisted. Two writers (both via `advance_progress_marker`, never regressing): (1) Phase 1 (catchup) at end of catchup, advancing to `last_ledger_in_chunk(highest completed chunk)`; (2) live ingestion loop, per ledger, after all three active stores durably commit. When absent, [`compute_resume_ledger`](#compute-resume-ledger) falls back to the highest `:lfs` chunk (Phase 1 crashed before the post-catchup write) or to `retention_aligned_resume_ledger` (tip-tracker fresh start, no BSB, never ingested). | -| `config:retention_ledgers` | decimal string | First run (stored); enforced on subsequent starts. 
| -| `hot:chunk:{chunk_id:08d}:lfs` | `"1"` | Written **before** the active ledger store directory is created; deleted **after** that directory is removed by the freeze task. Presence indicates the directory exists or its lifecycle is incomplete (creation in flight, or freeze cleanup not yet finished). | -| `hot:chunk:{chunk_id:08d}:events` | `"1"` | Same pattern as `hot:chunk:lfs`, scoped to the active events store directory. | -| `hot:index:{tx_index_id:08d}:txhash` | `"1"` | Same pattern, scoped to the active txhash store directory. Per-index cadence (one per tx index, not per chunk). | -| `pruning:index:{tx_index_id:08d}` | `"1"` | Set by `prune_tx_index` BEFORE any file delete; cleared AFTER all of tx_index_id's artifacts and the `index:{tx_index_id}:txhash` key are gone. Presence means a prune is in progress (or was interrupted by a crash and needs to be re-attempted). QueryRouter treats presence as "this tx index is unservable" and returns 4xx. Lifecycle loop's `prunable_tx_index_ids` includes any index with this key set, so a crashed prune resumes idempotently on restart. | +|--|-|--| +| `streaming:last_committed_ledger` | uint32 | Monotonic progress marker; written via `advance_progress_marker`. Two writers: Phase 1 (post-catchup) and the live ingestion loop (per ledger). | +| `config:retention_ledgers` | uint32 | First run (stored); enforced on subsequent starts. | +| `hot:chunk:{chunk_id:08d}:lfs` | `"1"` | Set BEFORE active ledger store dir is created; cleared AFTER dir is removed by freeze. Presence ⇒ dir exists or its lifecycle is incomplete. | +| `hot:chunk:{chunk_id:08d}:events` | `"1"` | Same pattern, active events store dir. | +| `hot:index:{tx_index_id:08d}:txhash` | `"1"` | Same pattern, active txhash store dir. Per-index cadence (one per tx index). | +| `pruning:index:{tx_index_id:08d}` | `"1"` | Set by `prune_tx_index` BEFORE any file delete; cleared AFTER everything else (artifacts + `index:{N}:txhash`). 
QueryRouter returns 4xx while present; `prunable_tx_index_ids` re-enqueues N if the marker survives a crash. | ### Keys Shared with Backfill @@ -219,71 +156,31 @@ Defined in [01-backfill-workflow.md — Meta Store Keys](./01-backfill-workflow. - `chunk:{chunk_id:08d}:txhash` - `index:{tx_index_id:08d}:txhash` -Streaming-specific use of these keys (which paths write them when) is shown in [Key Lifecycle in Streaming](#key-lifecycle-in-streaming) below. All values are binary (`"1"` or absent); prune-in-progress is tracked via the separate `pruning:index:*` key family rather than overloading the value space. - -### Key Lifecycle in Streaming - -``` -Phase 1 (catchup) — every freeze flag set AFTER its artifact's fsync: - chunk:{chunk_id}:lfs = "1" - chunk:{chunk_id}:txhash = "1" # only while .bin is still on disk - chunk:{chunk_id}:events = "1" - index:{tx_index_id}:txhash = "1" # set when all chunks of tx_index_id are done - -Phase 2 (.bin hydration — see Startup Sequence) — file-before-flag-delete: - for each chunk with :txhash flag: - if .bin exists: load into active txhash RocksDB - delete .bin file - delete chunk:{chunk_id}:txhash flag - After Phase 2, no :txhash chunk flags and no .bin files remain. 
- -Active store open (Phase 2 / Phase 4 entry / boundary handlers) — -hot:* keys set BEFORE mkdir, one per active store kind: - hot:chunk:{chunk_id}:lfs = "1" - hot:chunk:{chunk_id}:events = "1" - hot:index:{tx_index_id}:txhash = "1" - -Live path (per ledger, after all 3 active stores commit): - streaming:last_committed_ledger = ledger_seq - -Live path (per chunk, background freeze) — flag AFTER fsync, hot key AFTER dir delete: - chunk:{chunk_id}:lfs = "1" → delete ledger store dir → clear hot:chunk:{chunk_id}:lfs - chunk:{chunk_id}:events = "1" → delete events store dir → clear hot:chunk:{chunk_id}:events - -Live path (per index, background freeze) — same pattern: - index:{tx_index_id}:txhash = "1" → delete txhash store dir → clear hot:index:{tx_index_id}:txhash - -Pruning (background, when tx_index_id is past retention) — separate pruning marker: - pruning:index:{tx_index_id} = "1" (queries return 4xx from here on) - delete all files + per-chunk :lfs + :events keys for tx_index_id - index:{tx_index_id}:txhash → deleted - pruning:index:{tx_index_id} → deleted (cleared LAST; survives crashes for recovery) -``` +All values are binary (`"1"` or absent); prune-in-progress is tracked via the separate `pruning:index:*` key family rather than overloading the value space. ### Flag Semantics -- **Flag-after-fsync (creation order).** A flag is set only AFTER the artifact it represents has been fsynced. Flag absent ⇒ artifact missing or incomplete; flag present ⇒ artifact is durable. -- **File-before-flag-delete (cleanup order).** The file is removed FIRST; the flag is cleared LAST. Flag present ⇒ cleanup may be incomplete; flag absent ⇒ cleanup done, no file exists. Reverse order would orphan a file with no meta-store record on a crash mid-pair, recoverable only by filesystem scan. 
-- **Flag-driven recovery.** Every startup decision — hydration, transition replay, RecSplit spawn, prune eligibility, active-store directory reconciliation — derives from meta-store key presence. No filesystem-scan-and-infer anywhere. +- **Flag-after-fsync** — a freeze flag is set only AFTER its artifact is fsynced. Present ⇒ artifact durable. +- **File-before-flag-delete** — cleanup paths delete the file/dir FIRST, clear the key LAST. Reverse order would orphan a file with no meta-store record on a crash mid-pair, recoverable only by filesystem scan. +- **Hot keys before mkdir, cleared after dir-delete** — every active store dir has a `hot:*` key set BEFORE `mkdir` and cleared AFTER `delete_dir_if_exists`. -_**The meta-store flag is the always-correct signal of artifact state on disk, both for immutable files and for active-store directories. A crash anywhere leaves a state the next start recovers from by flag presence alone.**_ +_**The meta-store flag is the always-correct signal of artifact state on disk, both for immutable files and for active-store directories. A crash anywhere leaves a state the next start recovers from by flag presence alone — no filesystem scan anywhere.**_ --- ## Active Store Architecture -Three RocksDB-backed active stores; WAL always enabled. Each directory has a `hot:*` key (set before mkdir, cleared after dir removal); Phase 3 (reconcile) finds directories needing recovery via meta-store scan, never filesystem scan. Lifecycle driven by [freeze transitions](#freeze-transitions). +Three RocksDB-backed active stores; WAL always enabled. - **Ledger** — one per chunk at `{ACTIVE_STORAGE.PATH}/ledger-store-chunk-{chunk_id:08d}/`. Key `uint32BE(ledgerSeq)`, value `zstd(LCM bytes)`. -- **TxHash** — one per tx index at `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/`. Key `txhash[32]`, value `uint32BE(ledgerSeq)`. 
16 column families (`cf-0`..`cf-f`) routed by `txhash[0] >> 4`; each CF pairs 1:1 with one of the 16 RecSplit `.idx` files at the index boundary. -- **Events** — one per chunk at `{ACTIVE_STORAGE.PATH}/events-store-chunk-{chunk_id:08d}/`. Schema per [getEvents full-history design](../../design-docs/getevents-full-history-design.md). Per-ledger writes are idempotent — re-write of the same ledger overwrites cleanly, so crash-replay is corruption-free. +- **TxHash** — one per tx index at `{ACTIVE_STORAGE.PATH}/txhash-store-index-{tx_index_id:08d}/`. Key `txhash[32]`, value `uint32BE(ledgerSeq)`. 16 column families (`cf-0`..`cf-f`) routed by `txhash[0] >> 4`; each CF pairs 1:1 with one of the 16 RecSplit `.idx` files. +- **Events** — one per chunk at `{ACTIVE_STORAGE.PATH}/events-store-chunk-{chunk_id:08d}/`. Schema per [getEvents full-history design](../../design-docs/getevents-full-history-design.md). Per-ledger writes are idempotent. ### Store Lifecycle -- **Boundary swap.** At every chunk boundary the next chunk's ledger + events stores open synchronously while the just-finished ones are handed to background freeze tasks; tx-index boundaries do the same for txhash. Each kind therefore holds at most one active + one transitioning at a time. Ingestion never blocks on the freeze. -- **Synchronous open cost.** ~100 ms maximum — small enough to ignore. -- **Deletion.** The freeze task deletes the active store's directory only after the immutable artifact is fsynced and its freeze flag is set. -- **Crash recovery.** Active-store directories surviving a crash are reconciled on the next start — see [Phase 3 — Reconcile](#phase-3--reconcile). +- At every chunk boundary, the next chunk's ledger + events stores open synchronously (~100 ms) while the just-finished ones are handed to background freeze tasks; tx-index boundaries do the same for txhash. Each kind holds at most one active + one transitioning. Ingestion never blocks on the freeze. 
+- The freeze task deletes the active dir only after the immutable artifact is fsynced and its freeze flag is set. +- Active-store dirs surviving a crash are reconciled by [Phase 3](#phase-3--reconcile). --- @@ -291,33 +188,30 @@ Three RocksDB-backed active stores; WAL always enabled. Each directory has a `ho Two ledger sources, scoped to different phases: -- **Backfill (Phase 1 (catchup)) uses `BSBSource`** — backfill-only reader (`PrepareRange` + `GetLedger`). Each `process_chunk` constructs its own scoped to its chunk's 10_000 ledgers. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). +- **Backfill (Phase 1) uses `BSBSource`** — backfill-only reader (`PrepareRange` + `GetLedger`). Each `process_chunk` constructs its own instance scoped to its chunk's 10_000 ledgers. Captive core cannot be a backfill source — see [Backfill vs Phase 1 (catchup)](#backfill-vs-phase-1-catchup). - **Live streaming (Phase 4 (live ingestion)) uses captive core directly** — `PrepareRange(UnboundedRange(resume_ledger))` + per-ledger `GetLedger(seq)` against the captive-core subprocess. --- ## Startup Sequence -Four sequential phases, same code path for first start and every restart. The first three are bounded bootstrap work; Phase 4 (live ingestion) is the long-running state the service stays in until process exit. +Four sequential phases on every start. The first three are bounded bootstrap; Phase 4 is the long-running ingestion state. -- **Phase 1 — catchup.** Closes the gap between on-disk `:lfs` flags and current network tip **when `[BSB]` is configured**, by invoking the backfill subroutine in a loop. Without `[BSB]`, Phase 1 (catchup) is a no-op and Phase 4 (live ingestion)'s captive core handles initial catchup naturally via its own `PrepareRange(UnboundedRange(resume_ledger))`. 
-- **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 (catchup) left (for the trailing partial index) into the active txhash store, then deletes them. -- **Phase 3 — reconcile orphans.** Completes any in-flight freeze transitions left by a prior crash. -- **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle task, enters the ingestion loop. Runs until process exit. Note: `service_ready` is flipped by `run_rpc_service` BEFORE Phase 4 entry — historical queries are served during captive core's 4-5 minute spinup. +- **Phase 1 — catchup.** When `[BSB]` is configured, invokes the backfill subroutine in a loop to close the gap between on-disk artifacts and current network tip. Without `[BSB]`, Phase 1 is a no-op and Phase 4's captive core handles initial catchup via `PrepareRange(UnboundedRange(resume_ledger))`. +- **Phase 2 — hydrate txhash.** Loads any `.bin` files Phase 1 left (trailing partial index) into the active txhash store, then deletes them. +- **Phase 3 — reconcile.** Two passes: drop past-retention state, then recover in-flight freezes left by a prior crash. +- **Phase 4 — live ingestion.** Opens active stores, starts captive core, spawns the lifecycle task, enters the ingestion loop. Runs until process exit. -### Backfill vs Phase 1 (catchup) +**Backfill vs Phase 1 (catchup):** `run_backfill` (subroutine, [01-backfill-workflow.md](./01-backfill-workflow.md)) is BSB-only — captive core's serial subprocess can't be sharded per-chunk like BSB. `phase1_catchup` (startup phase) invokes backfill when `[BSB]` is configured, no-ops otherwise. -- **Backfill** is the subroutine (`run_backfill` in [01-backfill-workflow.md](./01-backfill-workflow.md)). BSB-only; parallel per-chunk BSB instances. Captive core cannot be a backfill source — its subprocess is serial and expensive to spin up per instantiation. -- **Phase 1 (catchup)** is the startup phase that runs on every service start. 
Its job: close the gap between on-disk state and current network tip before Phase 4 (live ingestion) takes over. Invokes backfill as its mechanism when `[BSB]` is configured; otherwise no-op and Phase 4 (live ingestion)'s captive core handles catchup via `PrepareRange(UnboundedRange(resume_ledger))`. +### Service Entry Point ```python def main(): - args = parse_cli_flags() # --config, --log-level, --log-format + args = parse_cli_flags() config = load_config_toml(args.config) - init_logging(config.logging, cli_overrides=args) run_rpc_service(config) - def run_rpc_service(config): meta_store = open_meta_store(config) validate_config(config, meta_store) @@ -330,54 +224,33 @@ def run_rpc_service(config): phase3_reconcile(config, meta_store) resume_ledger = compute_resume_ledger(config, meta_store) - # On-disk state is now consistent. Queries against frozen artifacts can be - # served immediately — they don't depend on captive core having started. - # Flipping service_ready here (rather than after captive-core spinup) cuts - # the 4xx window by the captive-core startup time (~4-5 min). + # Frozen-artifact queries don't need captive core; flip service_ready here + # (not after spinup) to avoid an unnecessary 4-5 min outage per restart. set_service_ready() - - # Phase 4 opens active stores and starts captive core. Live ingestion begins - # asynchronously. Queries for ledgers > streaming:last_committed_ledger return - # "not yet available" until ingestion catches up; that's the same client-visible - # behavior whether captive core has spun up or not. phase4_live_ingest(config, meta_store, resume_ledger) ``` -Query serving is gated on Phase 4 (live ingestion) being reached — see [Query Contract](#query-contract). +See [Query Contract](#query-contract) for the query-gating contract. ### Phase 1 — Catchup -- **No-op path:** if `config.bsb is None` (tip-tracker profile), Phase 1 returns `None` immediately. 
Phase 4's captive core does archive-catchup from `retention_aligned_resume_ledger`. -- **BSB path:** runs `run_backfill` from [01-backfill-workflow.md](./01-backfill-workflow.md) in a loop. Each iteration samples BSB's latest complete chunk and backfills `[retention_aligned_start_chunk, end_chunk]` inclusive. Loop exits when BSB has no new complete chunks. Phase 1 reads from BSB, so its horizon is BSB's chunk-aligned tip; the residual gap to network tip is closed by Phase 4's captive core. -- **Side effects on the meta store:** - 1. Backfill writes `:lfs` / `:events` / `chunk:*:txhash` / `index:*:txhash` flags as it materializes artifacts (per backfill design). - 2. After the loop, **Phase 1 advances `streaming:last_committed_ledger`** to `last_ledger_in_chunk(highest completed chunk)`. This is the durable record that Phase 1 catchup actually progressed past the prior value — used by Phase 3 to compute the retention floor and by `compute_resume_ledger` to derive the resume cursor. -- **Return value:** the highest chunk_id Phase 1 completed, or `None` for no-op. Phase 2 consumes this to find the trailing partial tx index without re-scanning meta-store. - ```python def phase1_catchup(config, meta_store) -> Optional[int]: """ - Catch up history via BSB (no-op when [BSB] absent). - - Scenarios handled: - - First-ever start (archive / pruning-history) — backfill from - retention_aligned_start_chunk to BSB tip. - - First-ever start (tip-tracker, no BSB) — early return None. - - Quick restart, BSB unchanged — loop runs once, run_backfill is a - full no-op (every chunk's flags already set), loop exits on iter 2. - - Mid-life restart, BSB advanced — backfill the new chunks; idempotent - skip on already-frozen chunks. - - Long-downtime restart with retention — backfill range starts at the new - retention-aligned position (computed from current BSB tip), skipping - past chunks now below the retention floor. Those stale chunks are left - for Phase 3 to clean up. 
- Crash during Phase 1 — backfill is per-chunk-idempotent. On restart, - completed chunks skip; in-flight ones re-run. last_scheduled_end_chunk - is local-only and resets on every start; Phase 1 always re-runs from - scratch (cheap because of idempotent skips). - - Returns: highest chunk_id Phase 1 completed (Phase 2's input), - or None for no-op. + Catch up history via BSB; no-op when [BSB] absent. + + Loop samples BSB tip, computes retention-aligned start, runs backfill, and + repeats until BSB stops advancing. The loop exists because the remote + object store lags the live network tip — each iteration may surface new + chunks that landed while the previous backfill was running. After the + loop, advances the progress marker so Phase 3 / compute_resume_ledger + see the post-catchup position rather than a stale prior-run value + (long-downtime correctness). + + Idempotent across restarts — backfill skips already-flagged chunks. + last_scheduled_end_chunk is local-only and resets every start. + + Returns: highest chunk_id completed (input to Phase 2), or None on no-op. """ if config.bsb is None: # Tip-tracker profile. Nothing to backfill. @@ -392,46 +265,49 @@ def phase1_catchup(config, meta_store) -> Optional[int]: break # BSB has no new complete chunks since last iter start_chunk = retention_aligned_start_chunk(last_ledger_in_chunk(end_chunk), retention_ledgers) if end_chunk < start_chunk: - break # retention-aligned start landed past BSB's tip; Phase 4 picks up + # Defensive only — unreachable under current geometry. retention is 0 + # or N×LEDGERS_PER_TX_INDEX and floor is clamped to GENESIS, so + # first_chunk_id_of_tx_index_containing(floor) <= end_chunk always.
+ break log.info(f"phase1_catchup bsb_tip_chunk={end_chunk} range=[{start_chunk}, {end_chunk}]") run_backfill(config, start_chunk, end_chunk) last_scheduled_end_chunk = end_chunk if last_scheduled_end_chunk < 0: - return None # no chunks completed (e.g., BSB hasn't published any yet) - - # Bump the streaming:last_committed_ledger key to reflect Phase 1's catchup. - # This pushes the key past any stale value left by a prior run that's now - # below the retention floor. Without this advance, Phase 3 would compute the - # retention floor from the stale prior value, and compute_resume_ledger would - # tell captive core to resume at a stale ledger — re-ingesting chunks pruning - # is about to delete. + # Reached only when iter 1 broke on `end_chunk <= last_scheduled_end_chunk` + # with last_scheduled_end_chunk still at -1, i.e., bsb_latest_complete_chunk_id + # returned -1 (BSB has zero complete chunks — fresh bucket or transient + # empty state). No end_chunk to advance the marker to; Phase 4's captive + # core handles resume from whatever state exists. + # The other loop break (end_chunk < start_chunk) cannot land here under + # the current retention geometry — start_chunk <= end_chunk always. + return None + + # Advance marker past any stale prior-run value so Phase 3's floor calc and + # compute_resume_ledger don't replay chunks that are about to be pruned. advance_progress_marker(meta_store, last_ledger_in_chunk(last_scheduled_end_chunk)) return last_scheduled_end_chunk +def retention_floor_ledger(tip_ledger, retention_ledgers) -> int: + # Bottom edge of the retention window — oldest ledger to keep. Clamped to + # GENESIS for the early-bootstrap case where tip < retention_ledgers + # (tip - retention would otherwise be a non-existent negative ledger). + # Shared by Phase 1 (start-chunk computation) and Phase 3 Pass 1 (past-retention floor). 
+ return max(tip_ledger - retention_ledgers, GENESIS_LEDGER) + def retention_aligned_start_chunk(tip_ledger, retention_ledgers): # Aligns DOWN to a tx-index boundary (no-gaps); up to LEDGERS_PER_TX_INDEX - 1 ledgers below strict retention. if retention_ledgers == 0: return 0 - target_ledger = max(tip_ledger - retention_ledgers, GENESIS_LEDGER) - return first_chunk_id_of_tx_index_containing(target_ledger) - + return first_chunk_id_of_tx_index_containing(retention_floor_ledger(tip_ledger, retention_ledgers)) def advance_progress_marker(meta_store, candidate_ledger): """ - Move streaming:last_committed_ledger forward to candidate_ledger, but only - if that's an advance — never regress. - - Two callers: - - phase1_catchup, once at end of catchup, with last_ledger_in_chunk(highest - completed chunk). - - run_live_ingestion_loop, once per ledger, after all three active stores - durably commit that ledger. - - Monotonicity matters: a regression would cause compute_resume_ledger to - point captive core at already-durable ledgers (re-ingest waste), and Phase 3 - to compute a stale retention floor. + Monotonic write to streaming:last_committed_ledger. Two callers — Phase 1 + (post-catchup) and the live ingestion loop (per ledger, after all three + active stores commit). Regression would cause re-ingest waste and a stale + retention floor. """ prior = meta_store.get("streaming:last_committed_ledger") if prior is None or candidate_ledger > prior: @@ -442,9 +318,10 @@ def advance_progress_marker(meta_store, candidate_ledger): ### Phase 2 — Hydrate TxHash Data from `.bin` -- Phase 1 (catchup) may leave `.bin` files for chunks in the last (incomplete) tx index. -- Phase 2 (`.bin` hydration) loads each into the active txhash store, then deletes the `.bin` + `chunk:{chunk_id:08d}:txhash` flag. -- After Phase 2 (`.bin` hydration): no `.bin` files and no `:txhash` chunk flags remain. 
+Phase 1's backfill range almost always ends mid-tx-index — BSB tip lands wherever the live network is, rarely on an index boundary. +For that trailing partial tx index, per-chunk `.bin` files are on disk but `index:N:txhash` is not yet written (RecSplit waits until every chunk of N is complete). +Phase 2 loads each surviving `.bin` into the active txhash store, then deletes the `.bin` and `chunk:{chunk_id:08d}:txhash` flag. +After Phase 2: no `.bin` files and no `:txhash` chunk flags remain. ```python def phase2_hydrate_txhash(config, meta_store, last_phase1_chunk_id): @@ -470,91 +347,123 @@ def phase2_hydrate_txhash(config, meta_store, last_phase1_chunk_id): try: for chunk_id in chunks_for_tx_index(tx_index_id): if not meta_store.has(f"chunk:{chunk_id:08d}:txhash"): + # Phase 1 didn't reach this chunk in the trailing tx_index + # (chunk_id > last_phase1_chunk_id) — no .bin file to hydrate. continue bin_path = raw_txhash_path(chunk_id) + # .bin absent + :txhash flag set ⇒ a prior Phase 2 deleted the file + # but crashed before clearing the flag (file-before-flag-delete). + # The data is already durable in the active txhash store from that + # prior load; skip the re-load and just finish flag cleanup below. if os.path.exists(bin_path): load_bin_into_rocksdb(bin_path, txhash_store) delete_if_exists(bin_path) meta_store.delete(f"chunk:{chunk_id:08d}:txhash") finally: txhash_store.close() -``` -- **Why "load then delete".** Without it, every restart during the incomplete-index lifetime would re-load the same `.bin` files. Load-then-delete makes Phase 2 a no-op on every subsequent restart until Phase 1 (catchup) deposits new `.bin` files. -- **Pure-streaming restarts** (no recent Phase 1 output) never see `.bin` files; the live path writes txhash directly to the active store. Phase 2 is a no-op. +def tx_index_ids_with_txhash_flag(meta_store) -> Set[int]: + # Scans chunk:*:txhash and returns the unique tx_index_ids those chunks + # belong to. 
Used by Sweep 1 to find indexes whose cleanup_txhash crashed + # mid-pair (some chunks of N still carry :txhash flags + .bin files). + result = set() + for key in meta_store.scan_prefix("chunk:"): + if not key.endswith(":txhash"): + continue + chunk_id = parse_chunk_id_from_chunk_key(key) + result.add(tx_index_id_of_chunk(chunk_id)) + return result + +def open_active_txhash_store(config, meta_store, tx_index_id) -> RocksDBStore: + """ + Idempotent. Safe to call repeatedly on the same tx_index_id. + + Hot key is written BEFORE mkdir (see Flag Semantics). + A crash in between leaves hot:* set + dir absent, + which Phase 3 Pass 2 reconciles. The reverse order would orphan the + dir with no meta-store record — only recoverable via filesystem scan, which + violates the meta-store-driven-recovery invariant. + """ + meta_store.put(f"hot:index:{tx_index_id:08d}:txhash", "1") + path = active_store_path_for("index:txhash", tx_index_id) + os.makedirs(path, exist_ok=True) + column_families = [f"cf-{nibble:x}" for nibble in range(16)] + # open_rocksdb_store recovers from any partial WAL on an existing DB. + return open_rocksdb_store(path, txhash_rocksdb_settings(column_families)) -### Phase 3 — Reconcile -Two passes, both strictly meta-store-driven (no filesystem scan): +def open_active_ledger_store(config, meta_store, chunk_id) -> RocksDBStore: + ... # same shape: put hot:chunk:{C}:lfs (idempotent), mkdir, open RocksDB. + +def open_active_events_store(config, meta_store, chunk_id) -> RocksDBStore: + ... # same shape: put hot:chunk:{C}:events (idempotent), mkdir, open RocksDB. +``` + +Pure-streaming restarts (no recent Phase 1 output) never see `.bin` files; the live path writes txhash directly to the active store. Phase 2 is a no-op.
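
The file-before-flag ordering in Phase 2 can be exercised with a small in-memory sketch. Everything here is a stand-in chosen for illustration, not the service's API: `hydrate_chunk` is a hypothetical helper, and the dicts `meta`, `bins`, and `store` play the roles of the meta store, the `.bin` files on disk, and the active txhash store. The sample entries are arbitrary values, not real ledger geometry.

```python
def hydrate_chunk(chunk_id, meta, bins, store, crash_before_flag_delete=False):
    """Hypothetical single-chunk version of the Phase 2 loop body.

    meta  : dict standing in for the meta store (flag key -> "1")
    bins  : dict standing in for .bin files (chunk_id -> [(txhash, ledger_seq)])
    store : dict standing in for the active txhash store
    """
    flag = f"chunk:{chunk_id:08d}:txhash"
    if flag not in meta:
        return "skipped"                 # no flag: nothing to hydrate
    if chunk_id in bins:
        for txhash, ledger_seq in bins[chunk_id]:
            store[txhash] = ledger_seq   # puts are idempotent
        del bins[chunk_id]               # delete the file first
        outcome = "loaded"
    else:
        # File already gone but flag still set: a prior run loaded the data
        # and crashed before clearing the flag. Only flag cleanup remains.
        outcome = "flag-only cleanup"
    if crash_before_flag_delete:
        raise RuntimeError("simulated crash before flag delete")
    del meta[flag]                       # delete the flag last
    return outcome


# Arbitrary sample data for one chunk.
meta = {"chunk:00000007:txhash": "1"}
bins = {7: [("aa", 123), ("bb", 124)]}
store = {}

try:
    hydrate_chunk(7, meta, bins, store, crash_before_flag_delete=True)
except RuntimeError:
    pass  # file is gone, flag survives, data is already in the store

second = hydrate_chunk(7, meta, bins, store)  # finishes the flag cleanup
third = hydrate_chunk(7, meta, bins, store)   # nothing left to do
```

Under these assumptions the second call only clears the leftover flag and the third is a no-op, which is the convergence property the comments in `phase2_hydrate_txhash` describe.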
-- **Pass 1 — discard past-retention orphans.** After a long downtime, some active-store dirs and immutable artifacts from prior runs may now be below the new retention floor (because Phase 1's retention-aligned start chunk has moved forward to track the network tip). They have no meaningful next transition — can't be frozen if their hot DB is partial, shouldn't be kept because they're past retention. Discarded outright. -- **Pass 2 — recover in-flight transitions.** For active stores ABOVE the retention floor whose freeze was interrupted by the prior crash, complete the freeze (or clean up if the freeze flag was already set but cleanup didn't finish). +### Phase 3 — Reconcile -Order matters: Pass 1 first so Pass 2's resume-relative classification (A/B/C/D) only sees in-range entries. +Two passes, both meta-store-driven. Pass 1 runs first so Pass 2's resume-relative classification only sees in-range entries. ```python def phase3_reconcile(config, meta_store): - """ - Reconciles state left by the prior process exit. - - Scenarios handled: - - Fresh first-ever start — both passes early-return (streaming:last_committed_ledger - absent, no hot:* keys yet). - - Quick restart, no crash mid-freeze — Pass 1 finds nothing past retention; - Pass 2 keeps the resume position's hot:* keys via SCENARIO A. - - Long-downtime restart with retention (BSB present) — Phase 1's catchup - already advanced streaming:last_committed_ledger past stale chunks; Pass 1 - discards prior-run hot:* keys + flags now below the floor; Pass 2 sees - nothing left to do. - - Long-downtime restart with retention (no BSB) — Phase 1 was a no-op; - the streaming:last_committed_ledger key is stale; Pass 1 samples network - tip from history archive and uses that as the floor reference. - - Crash mid-LFS / mid-events / mid-RecSplit freeze — Pass 1 unaffected - (the in-flight chunk is at the resume position, well above retention); - Pass 2 SCENARIO C re-runs the freeze. 
- - Crash between freeze flag set and active-dir delete — Pass 2 SCENARIO B - cleans up the orphan dir. - - Future-orphan (defensive) — Pass 2 SCENARIO D logs + cleans up. - """ - pass1_discard_past_retention_orphans(config, meta_store) + pass1_drop_past_retention_state(config, meta_store) pass2_recover_in_flight_transitions(config, meta_store) -def pass1_discard_past_retention_orphans(config, meta_store): +def pass1_drop_past_retention_state(config, meta_store): """ - Find every artifact (hot DB dir, immutable file, freeze flag) below the - retention floor and discard it. - - Why this is needed: when a long downtime advances the network tip past - where the prior run left off, the retention floor moves forward. Active-store - dirs and freeze flags from the prior run can end up below the new floor. - The pruning lifecycle handles COMPLETE tx indexes (`prunable_tx_index_ids` - requires `index:N:txhash` set), but an INCOMPLETE tx index from a prior run - (its `index:N:txhash` was never written, since RecSplit never ran) is - invisible to that path. Pass 1 catches those plus a few related cases. - - Determining the floor: - - retention_ledgers == 0 (archive) — no floor, nothing past retention. - - BSB present — Phase 1 just advanced streaming:last_committed_ledger to - BSB-tip's last ledger; that's the authoritative tip reference. - - No BSB — streaming:last_committed_ledger reflects only the prior run - (potentially weeks stale); sample current tip from history archive. - If unreachable, skip Pass 1 (the prune lifecycle will catch up later - as boundaries fire). + Drop every meta-store key and on-disk artifact whose ledger range falls + below the retention floor. Three distinct categories of state get cleaned: + + 1. Lifecycle-unreachable orphans. + - hot:* keys + active-store dirs (LFS / events / txhash) for chunks + and indexes below the floor. The freeze can't run on past-retention + data, so these dirs have no path forward. 
+ - chunk:{C}:lfs / :events / :txhash flags + their immutable files for + chunks of an INCOMPLETE prior-run tx index. Because index:N:txhash + was never written for that tx index, the pruning lifecycle's + past-retention scan (which keys off index:N:txhash = "1") cannot + find them — Pass 1 is their only cleanup path. + + 2. Lifecycle-reachable complete state, drained eagerly. + - index:N:txhash flags + the 16 RecSplit .idx files for fully-built + indexes below the floor. The pruning lifecycle would pick these up + on its next sweep; Pass 1 drops them at startup so the first sweep + starts with no backlog. + + 3. Mid-prune markers from a prior-run crash, below floor. + - pruning:index:N markers below floor. The chunk/index loops above + have already deleted N's files + flags, so this loop just clears + the marker. Above-floor markers are left alone — the pruning + lifecycle's crash-recovery scan picks them up on its next sweep. + + Floor reference: sample the live network tip from the history archive. + The progress marker is NOT a reliable tip proxy — `[BSB]` may be configured + but no longer advancing (operator stopped updating the remote store; BSB + outage; etc.), in which case the marker reflects BSB's last seen ledger, + not the live network tip. Falling back to the marker on archive-unreachable + is degraded best-effort cleanup; the pruning lifecycle catches up later as + captive core advances the marker through the gap. """ retention_ledgers = config.service.retention_ledgers if retention_ledgers == 0: return # archive profile — no floor - current_tip = estimate_current_tip(config, meta_store) + current_tip = try_sample_network_tip(config) if current_tip is None: - log.warn("phase3 pass1: no tip reference available; skipping past-retention cleanup") + # Archive unreachable. Degraded fallback to the local marker — Pass 1 + # under-cleans for now; pruning lifecycle catches up after captive core + # advances the marker. 
+ current_tip = meta_store.get("streaming:last_committed_ledger") + if current_tip is None: + log.warn("phase3 pass1: no tip reference available (archive unreachable + marker absent); skipping past-retention cleanup") return - floor_ledger = max(current_tip - retention_ledgers, GENESIS_LEDGER) + floor_ledger = retention_floor_ledger(current_tip, retention_ledgers) floor_chunk = chunk_id_of_ledger(floor_ledger) - # Discard hot:chunk:* below floor (active LFS / events store dirs). + # Category 1 — lifecycle-unreachable orphans (hot keys + chunks of incomplete prior-run tx indexes). for hot_key in meta_store.scan_prefix("hot:chunk:"): store_kind, chunk_id = parse_hot_key(hot_key) if chunk_id < floor_chunk: @@ -562,7 +471,6 @@ def pass1_discard_past_retention_orphans(config, meta_store): meta_store.delete(hot_key) log.info(f"phase3 pass1: discarded past-retention {hot_key}") - # Discard hot:index:* below floor (active txhash store dirs). for hot_key in meta_store.scan_prefix("hot:index:"): _, tx_index_id = parse_hot_key(hot_key) if last_ledger_in_tx_index(tx_index_id) < floor_ledger: @@ -570,31 +478,20 @@ def pass1_discard_past_retention_orphans(config, meta_store): meta_store.delete(hot_key) log.info(f"phase3 pass1: discarded past-retention {hot_key}") - # Discard chunk:*:lfs / :events / :txhash freeze flags + their files for - # chunks below floor. Covers chunks of an INCOMPLETE prior-run tx index - # that pruning lifecycle can't reach (because index:N:txhash was never - # written). Per-chunk delete is idempotent (delete_if_exists). for chunk_key in meta_store.scan_prefix("chunk:"): chunk_id, kind = parse_chunk_key(chunk_key) if chunk_id < floor_chunk: delete_immutable_artifact(kind, chunk_id) # .pack / events cold segment / .bin meta_store.delete(chunk_key) - # Discard index:*:txhash freeze flags + RecSplit files for past-retention - # complete indexes. 
(Covered by prune lifecycle in steady state, but at - # startup we run this for completeness so the first prune sweep has no - # backlog.) + # Category 2 — past-retention complete indexes; lifecycle could reach but Pass 1 drains eagerly. for index_key in meta_store.scan_prefix("index:"): _, tx_index_id = parse_index_key(index_key) if last_ledger_in_tx_index(tx_index_id) < floor_ledger: delete_recsplit_idx_files(tx_index_id) meta_store.delete(index_key) - # Discard pruning:index:* markers for past-retention indexes that the - # prior run was already mid-prune on. The above loops have already - # taken care of files + index keys; this loop just clears the marker. - # (Markers above the floor are left alone — the lifecycle loop's initial - # sweep will pick them up and finish the prune.) + # Category 3 — clear below-floor pruning:index:* markers; files + flags already removed above. for pruning_key in meta_store.scan_prefix("pruning:index:"): tx_index_id = parse_pruning_index_id(pruning_key) if last_ledger_in_tx_index(tx_index_id) < floor_ledger: @@ -602,18 +499,15 @@ def pass1_discard_past_retention_orphans(config, meta_store): def pass2_recover_in_flight_transitions(config, meta_store): - """ - For every hot:* key still set after Pass 1, classify against the resume - position and dispatch the right recovery action. - - Why "still set after Pass 1": Pass 1 already removed past-retention hot - keys, so every entry seen here is at or above the retention floor — i.e., - legitimately a candidate for either Phase 4 reopen (SCENARIO A), freeze - cleanup (B), freeze re-run (C), or defensive cleanup (D). - """ + # Pass 1 has already removed past-retention hot keys; every entry here is + # at or above the retention floor. last_committed = meta_store.get("streaming:last_committed_ledger") if last_committed is None: - return # no prior live commits → no in-flight work + # No progress marker means neither Phase 1 nor the live loop has ever + # advanced it. 
Without a resume position, hot:* keys can't be classified + # into scenarios A/B/C/D described below — bail. Any stranded hot:* keys from a prior abnormal + # exit get classified on a later restart once the marker is set. + return resume_chunk_id = chunk_id_of_ledger(last_committed + 1) resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) @@ -640,69 +534,40 @@ def pass2_recover_in_flight_transitions(config, meta_store): meta_store.delete(hot_key) -def estimate_current_tip(config, meta_store) -> Optional[int]: - """ - Best estimate of the current network tip at startup. Used as the reference - for retention floor calculations. - - BSB present: Phase 1 just advanced streaming:last_committed_ledger to - BSB-tip's last ledger, which is within minutes of network tip (BSB upload - lag). Use it directly. - - No BSB: streaming:last_committed_ledger is from the prior run (potentially - weeks stale). Sample the network tip from history archive (same helper - retention_aligned_resume_ledger uses). None if archive unreachable, in - which case Pass 1 skips and pruning lifecycle handles cleanup later. - """ - if config.bsb is not None: - return meta_store.get("streaming:last_committed_ledger") - - try: - return get_latest_network_tip(config.history_archives.urls) - except NetworkTipUnreachable: - return None ``` `finish_interrupted_freeze` reopens the active store (idempotent on existing or partial dirs) and runs the corresponding live-path freeze ([LFS](#lfs-transition), [events](#events-transition), or [RecSplit](#recsplit-transition)) to produce the artifact. ### Compute Resume Ledger -`compute_resume_ledger(config, meta_store) -> ledger_seq`. Runs once per service start, after Phase 3 (reconcile), before Phase 4 (live ingestion). Returns the ledger sequence Phase 4's captive core resumes from via `PrepareRange(UnboundedRange(resume_ledger))`. 
- -**Three cases, first match wins:** - -| State | Situation | `resume_ledger` | -|---|---|---| -| `streaming:last_committed_ledger` present + within retention floor | Mid-life restart (Phase 1 advanced the key, or live loop committed ledgers in this run / a recent prior run) | `value + 1` | -| `streaming:last_committed_ledger` present but stale (no-BSB tip-tracker after long downtime, value below retention floor) | Re-ingesting via captive core from the stale value would replay days of past-retention chunks for nothing | Delete the stale key, return `retention_aligned_resume_ledger(config)` (skips forward to the new retention-aligned tx-index boundary) | -| `streaming:last_committed_ledger` absent + `:lfs` chunks present | Edge case — Phase 1 wrote `:lfs` flags but crashed before `advance_progress_marker` | `last_ledger_in_chunk(highest_lfs_chunk) + 1` | -| `streaming:last_committed_ledger` absent + no `:lfs` chunks | Tip-tracker fresh start (no BSB, never ingested) | `retention_aligned_resume_ledger(config)` | +Runs once per service start, after Phase 3, before Phase 4. Returns the ledger sequence Phase 4's captive core resumes from via `PrepareRange(UnboundedRange(resume_ledger))`. ```python def compute_resume_ledger(config, meta_store) -> int: """ - Decide the ledger sequence captive core's PrepareRange resumes at. - - Scenarios handled: - - Fresh first-ever start (BSB present) — Phase 1 already advanced the - progress marker; trivially resumes at progress_marker + 1. - - Fresh first-ever start (no BSB) — no progress marker, no :lfs; falls - through to retention_aligned_resume_ledger (samples tip from history - archive, aligns down to a tx-index boundary). - - Quick restart, BSB unchanged — progress marker is fresh; resume at + 1. - - Long-downtime restart, BSB present — Phase 1's catchup already advanced - the progress marker past any stale prior value; resume at the new + 1. 
- - Long-downtime restart, no BSB (tip-tracker) — Phase 1 was a no-op; - the progress marker is from the prior run (potentially weeks stale). - Detect by comparing against the current network tip; if the marker is - below the retention floor, delete it and skip forward to the - retention-aligned start. - - Phase 1 crashed after writing :lfs but before advance_progress_marker - — fall back to deriving resume from the highest :lfs chunk. - - No consistency validation is performed here. Phase 1 backfill self-heals - incomplete chunks within its range; Phase 3 recovers in-flight freezes; - pruning lifecycle handles past-retention state. + Decide where captive core resumes ingestion. Three branches: + + 1. Marker present + (BSB or retention=0 or marker not stale) → marker + 1. + Covers fresh starts (Phase 1 just advanced), quick restarts, and + BSB-present long-downtime (Phase 1 advanced past the stale value). + + 2. Marker present + no BSB + below retention floor → stale (no-BSB + long downtime). Delete marker and return retention_aligned_resume_ledger. + Without this, captive core would re-ingest weeks of about-to-be-pruned + ledgers. + + 3. Marker absent → defensive fallback. Highest :lfs chunk + 1 if any + exist; otherwise retention_aligned_resume_ledger. + + Degraded-path note: in branch 2, if the history archive is unreachable + (try_sample_network_tip returns None), we cannot confirm staleness and + fall through to marker + 1. The live loop then re-ingests potentially + weeks of past-retention ledgers via captive core; the pruning lifecycle + eventually catches up. Documented as expected behavior under archive + unavailability — correctness holds, throughput is degraded. + + No consistency validation. Phase 1 self-heals incomplete chunks; Phase 3 + recovers in-flight freezes; pruning lifecycle handles past-retention state. 
""" progress_marker = meta_store.get("streaming:last_committed_ledger") @@ -720,24 +585,18 @@ def compute_resume_ledger(config, meta_store) -> int: return retention_aligned_resume_ledger_with_tip(config, current_tip) return progress_marker + 1 - # Marker absent. Phase 1 may have crashed after writing :lfs but before - # advance_progress_marker. Recover via the highest :lfs chunk. + # Marker absent — defensive fallback. Recover via the highest :lfs chunk + # if any exist; otherwise tip-tracker fresh start. highest_lfs = scan_max_lfs_chunk(meta_store) if highest_lfs is not None: return last_ledger_in_chunk(highest_lfs) + 1 - # No marker, no :lfs — tip-tracker fresh start. return retention_aligned_resume_ledger(config) def scan_max_lfs_chunk(meta_store) -> Optional[int]: - """ - Returns the highest chunk_id with `:lfs` set, or None if no :lfs key exists. - - Reverse-iterates the chunk: prefix and stops at the first :lfs key found. - O(suffix variants per chunk) — typically 1–2 reads, regardless of total - chunks on disk. Sub-millisecond at any cpi / archive size. - """ + # Reverse-iterates chunk: prefix; stops at first :lfs key. ~1-2 reads + # regardless of archive size. for key in meta_store.iter_prefix_reverse("chunk:"): if key.endswith(":lfs"): return parse_chunk_id_from_chunk_key(key) @@ -745,18 +604,8 @@ def scan_max_lfs_chunk(meta_store) -> Optional[int]: def retention_aligned_resume_ledger(config) -> int: - """ - No-BSB resume cursor: align to the first ledger of the tx index containing - the retention floor. Captive core archive-catches-up from that point. - - Two callers: - - compute_resume_ledger fresh-start branch (no BSB, no prior commits). - - compute_resume_ledger stale-marker branch (no BSB long downtime — - prior progress marker is below the new retention floor). - - Helper retention_aligned_resume_ledger_with_tip is the same with an - explicit tip parameter, for callers that already sampled. 
- """ + # No-BSB resume cursor: align to the first ledger of the tx index containing + # the retention floor; captive core archive-catches-up from that point. network_tip = get_latest_network_tip(config.history_archives.urls) return retention_aligned_resume_ledger_with_tip(config, network_tip) @@ -768,12 +617,8 @@ def retention_aligned_resume_ledger_with_tip(config, network_tip_ledger) -> int: def try_sample_network_tip(config) -> Optional[int]: - """ - Wrapper around get_latest_network_tip that returns None on failure - instead of raising. Used where a missing tip is recoverable (e.g., - compute_resume_ledger's stale-marker check — if we can't confirm the - marker is stale, fall through to using it). - """ + # Returns None on archive failure instead of raising. Used where a missing + # tip is recoverable (compute_resume_ledger's stale-marker check). try: return get_latest_network_tip(config.history_archives.urls) except NetworkTipUnreachable: @@ -782,30 +627,19 @@ def try_sample_network_tip(config) -> Optional[int]: ### Phase 4 — Live Ingestion -Opens active stores for the resume position, spawns the lifecycle task, starts captive core, and enters the ingestion loop. **Query serving has already been enabled by `run_rpc_service`** — Phase 4 is purely about ingestion. +Opens active stores for the resume position, spawns the lifecycle task, starts captive core, enters the ingestion loop. Query serving was already enabled by `run_rpc_service`; Phase 4 is purely about ingestion. Captive core takes 4–5 minutes to spin up — historical queries continue to be served against frozen artifacts during that window. See [Query Contract](#query-contract). ```python def phase4_live_ingest(config, meta_store, resume_ledger): - # service_ready was set by run_rpc_service before this call. Queries against - # frozen artifacts (chunks <= streaming:last_committed_ledger) are already - # being served. Phase 4 starts the live ingestion path so new ledgers begin - # to flow. 
- - # Open active stores at the resume position. open_active_*_store writes the - # hot:* key BEFORE mkdir (see Flag Semantics) — Phase 3 has already cleaned - # any stale hot keys for this chunk/index, so these mkdir calls land on an - # empty filesystem path (or, in the SCENARIO A "keep" case, on the prior-run - # active dir that's still on disk + idempotently re-opened). + # open_active_*_store writes the hot:* key BEFORE mkdir; Phase 3 has cleaned + # stale hot keys, so mkdir lands on an empty path or (SCENARIO A) on the + # prior-run active dir, idempotently re-opened. active_stores = open_active_stores_for_resume(config, meta_store, resume_ledger) - # Background pruning loop. Runs an initial sweep on entry to handle any - # pruning:index:* markers left by a crash mid-prune in the prior run. + # Initial sweep handles any pruning:index:* markers left by a prior crash. run_in_background(run_prune_lifecycle_loop, config, meta_store) - # Captive core spinup. PrepareRange tells the SDK what range we want; actual - # spinup takes 4-5 minutes during which GetLedger blocks. /getHealth shows - # status="catching_up" during this window because streaming:last_committed_ledger - # hasn't moved yet, but historical queries still work against frozen artifacts. + # PrepareRange blocks ~4-5 min during captive-core spinup. ledger_backend = make_ledger_backend(config.captive_core.config_path) ledger_backend.PrepareRange(UnboundedRange(resume_ledger)) @@ -813,14 +647,8 @@ def phase4_live_ingest(config, meta_store, resume_ledger): def open_active_stores_for_resume(config, meta_store, resume_ledger): - """ - Open the three active stores for the chunk/tx-index that ingestion will - resume into. Idempotent on existing dirs (mkdir no-ops; RocksDB recovers - from any partial state). - - Called once from phase4_live_ingest. Per-kind stores share no state and - can be opened in any order. - """ + # Per-kind stores share no state; opening order is free. 
Idempotent on + # existing dirs (mkdir no-ops; RocksDB recovers from partial WAL). resume_chunk_id = chunk_id_of_ledger(resume_ledger) resume_tx_index_id = tx_index_id_of_chunk(resume_chunk_id) @@ -831,8 +659,6 @@ def open_active_stores_for_resume(config, meta_store, resume_ledger): ) ``` -Captive core takes 4–5 minutes to spin up. During that window the service is already serving historical queries (everything up through `streaming:last_committed_ledger`). `/getHealth` reports `status = "catching_up"` until the live loop commits its first ledger; queries for ledgers above the marker return "not yet available" via the QueryRouter. - --- ## Ingestion Loop @@ -863,30 +689,26 @@ def run_live_ingestion_loop(config, ledger_backend, active_stores, meta_store, r ledger_seq += 1 ``` -- Each per-store write is atomic: RocksDB WriteBatch + WAL across all three active stores (ledger / txhash / events). -- Key/value schemas are in [Active Store Architecture](#active-store-architecture). +Per-store writes are atomic via RocksDB WriteBatch + WAL. --- ## Freeze Transitions -Three independent background transitions per chunk/index boundary; each has its own task, flag, and cleanup. Live ingestion never blocks on them. +Three independent background transitions per boundary; each has its own task, flag, and cleanup. Live ingestion never blocks on them. + +- **LFS transition** (per chunk) — retired ledger RocksDB → `.pack` file. +- **Events transition** (per chunk) — retired events RocksDB → cold segment (3 files). +- **RecSplit transition** (per index) — retired txhash RocksDB → 16 `.idx` files. -- **LFS transition** — per chunk. Retired ledger RocksDB → `.pack` file. -- **Events transition** — per chunk. Retired events RocksDB store → cold segment (3 files). -- **RecSplit transition** — per index. Retired txhash RocksDB → 16 `.idx` files. -- Streaming's freeze transitions never produce `.bin` files; those are transient backfill output only, produced during Phase 1 (catchup). 
+Streaming's freezes never produce `.bin` files; those are transient backfill output (Phase 1 only). ### Concurrency Model -- **`active_stores` is the ingestion loop's owned state.** Fields `ledger` / `events` / `txhash` (one handle per data type, no `*_next`) are mutated only inside `on_chunk_boundary` and `on_tx_index_boundary`. Freeze tasks receive a handle by value at spawn and never read back through `active_stores`. -- **Meta-store is single-writer.** Writers are the ingestion loop (per-ledger checkpoint), freeze tasks (`:lfs` / `:events` / `:txhash` flags), and the lifecycle loop (`pruning:index:*` marker + prune key deletes). The meta-store wrapper serializes them on top of RocksDB's single-writer semantics. -- **Per-kind single-flight gates.** One outstanding transition per kind (LFS / events / RecSplit); the next starts only after the previous releases. `wait_for_lfs_complete()` acquires the LFS gate; `signal_lfs_complete()` at the end of `freeze_ledger_chunk_to_pack_file` releases it (events / RecSplit follow the same shape). Not a global barrier — kinds remain independent. -- **Query handlers read from a storage-manager layer.** Per-data-type managers (ledger / events / txhash) own their own state-transition synchronization; query handlers never touch `active_stores` directly. - - **Read-view invariant:** a query sees either pre-transition data (routed to the transitioning store) or post-transition data (new active store + newly-flagged immutable artifact) — never a half-state mix. - - **Flag-is-truth applies to reads:** a query never routes to an immutable artifact whose `:lfs` / `:events` / `:txhash` flag is unset. - - Concrete lock primitives + routing logic belong in a separate query-routing design doc. -- **Stores are opened on-demand at boundary** — see [Store Lifecycle](#store-lifecycle) for the open + transition sequence and synchronous-open cost. 
+- **`active_stores` is owned by the ingestion loop.** Fields `ledger` / `events` / `txhash` are mutated only inside the boundary handlers. Freeze tasks receive a handle by value at spawn and never read back through `active_stores`. +- **Meta-store is single-writer.** Serialized across the ingestion loop (per-ledger checkpoint), freeze tasks (flags), and lifecycle loop (prune marker + key deletes). +- **Per-kind single-flight gates.** One outstanding LFS / events / RecSplit transition each; the next starts only after the previous releases. Not a global barrier — kinds remain independent. +- **Query routing.** Per-data-type storage managers own state-transition synchronization; query handlers never touch `active_stores` directly. A query sees either pre-transition or post-transition data, never a mix; never routes to an immutable artifact whose freeze flag is unset. Concrete lock primitives + routing logic are deferred to a separate query-routing doc. ### Chunk Boundary (every 10_000 ledgers) @@ -985,20 +807,13 @@ Retention is enforced by a single background task, woken at chunk boundaries. Pr ```python def run_prune_lifecycle_loop(config, meta_store): - """ - Background goroutine spawned at the top of phase4_live_ingest. Runs an - initial sweep on entry (catches in-progress prunes from a prior-run crash), - then loops on chunk-boundary notifications. - - This is the ONLY caller of prune_tx_index. Phase 1, 2, 3 don't call it. - Phase 3 Pass 1 cleans up past-retention artifacts at startup but does so - directly (no pruning marker needed — service_ready = false during startup, - no queries to gate). - """ + # Sole caller of prune_tx_index. Initial sweep on entry catches any + # in-progress prunes from a prior-run crash; subsequent sweeps fire on + # chunk-boundary notifications. 
retention_ledgers = config.service.retention_ledgers - _run_prune_sweep(meta_store, retention_ledgers, config) # initial: handles crash-recovered in-progress prunes + _run_prune_sweep(meta_store, retention_ledgers, config) while True: - wait_for_chunk_boundary_notification() # woken by on_chunk_boundary's notify_lifecycle() + wait_for_chunk_boundary_notification() # set by on_chunk_boundary's notify_lifecycle() _run_prune_sweep(meta_store, retention_ledgers, config) @@ -1009,54 +824,57 @@ def _run_prune_sweep(meta_store, retention_ledgers, config): def prunable_tx_index_ids(meta_store, retention_ledgers): """ - Two sources of work, unioned: - 1. Crash recovery — any pruning:index:N key set means a prior prune was - interrupted; re-attempt regardless of retention status. - 2. Steady-state — past-retention indexes whose index:N:txhash = "1". + Returns tx_index_ids that need pruning, from two sources unioned together: + + - Crash-recovery scan — any pruning:index:N key set means a prior prune + was interrupted (marker survives crashes by design); re-run regardless + of retention status. + - Past-retention scan — indexes with index:N:txhash = "1" whose last + ledger is below the retention floor. """ if retention_ledgers == 0: return [] result = set() - # Source 1: in-progress prunes from a prior crash. - # pruning:index:N is set BEFORE any file delete and cleared AFTER all files - # + the index key are gone, so its presence at startup unambiguously means - # "this prune was interrupted, finish it." + # Crash-recovery scan: in-progress prunes from a prior crash. for key in meta_store.scan_prefix("pruning:index:"): result.add(parse_pruning_index_id(key)) - # Source 2: newly past-retention indexes that need a fresh prune. + # Past-retention scan: newly-eligible indexes since the last sweep. 
last_committed_ledger = meta_store.get("streaming:last_committed_ledger") max_eligible_tx_index_id = max_prunable_tx_index_id(last_committed_ledger, retention_ledgers) for tx_index_id in range(0, max_eligible_tx_index_id + 1): if tx_index_id in result: - continue # already added from Source 1 + continue if meta_store.get(f"index:{tx_index_id:08d}:txhash") == "1": result.add(tx_index_id) return sorted(result) -def prune_tx_index(tx_index_id, meta_store, config): - """ - Tear down all artifacts for tx_index_id. Called only from _run_prune_sweep. +def max_prunable_tx_index_id(last_committed_ledger, retention_ledgers) -> int: + # Highest tx_index_id whose last_ledger is strictly below the retention floor. + # Returns -1 when no index is past-retention (range(0, 0) below is empty). + if last_committed_ledger is None: + return -1 + floor_ledger = last_committed_ledger - retention_ledgers + if floor_ledger <= GENESIS_LEDGER: + return -1 + # last_ledger_in_tx_index(N) < floor_ledger ⇔ N < tx_index_id_of_ledger(floor_ledger) + return tx_index_id_of_ledger(floor_ledger) - 1 - Two ordering constraints bookend the work: - - pruning:index:N is set FIRST (before any file delete) — atomic gate - that flips queries to 4xx for any tx in index N. - - pruning:index:N is cleared LAST (after every other op) — survives - crashes so the next sweep's prunable_tx_index_ids picks N back up. - Everything between is individually idempotent: file deletes use - delete_if_exists, key deletes are no-ops on absent keys. Crash anywhere - in between leaves a state the re-run cleans up. +def prune_tx_index(tx_index_id, meta_store, config): """ - # OP 1: gate queries off + mark "prune in progress" for crash recovery. + Tear down all artifacts for tx_index_id. Marker set FIRST (atomic 4xx gate + for any tx in N) and cleared LAST (survives crashes so the next sweep + picks N back up). Everything between is individually idempotent. 
+ """ + # OP 1: gate queries off; mark "prune in progress" for crash recovery. meta_store.put(f"pruning:index:{tx_index_id:08d}", "1") - # WORK: per-chunk file + key deletion. File-before-flag-delete preserved - # for each chunk's keys. + # WORK: per-chunk file + key deletion (file-before-flag-delete). for chunk_id in chunks_for_tx_index(tx_index_id): delete_if_exists(ledger_pack_path(chunk_id)) delete_events_segment(chunk_id) @@ -1066,93 +884,48 @@ def prune_tx_index(tx_index_id, meta_store, config): meta_store.delete(f"chunk:{chunk_id:08d}:txhash") delete_recsplit_idx_files(tx_index_id) - # OP 2: index:N:txhash transitions from "1" to absent. With the pruning - # marker still set, queries continue to see 4xx via the marker check. + # OP 2: queries still 4xx via the marker check. meta_store.delete(f"index:{tx_index_id:08d}:txhash") - # OP 3: clear the prune marker. Tx index N is now fully gone. - # Past this point, prunable_tx_index_ids no longer sees N. + # OP 3: tx index N is now fully gone. meta_store.delete(f"pruning:index:{tx_index_id:08d}") ``` -- **Why index-atomic.** Per-chunk pruning would open a window where `getTransaction` resolves to a ledger seq whose pack has been deleted; whole-index gating closes it. -- **Extra data on disk.** Up to `LEDGERS_PER_TX_INDEX - 1` ledgers past strict retention. `RETENTION_LEDGERS` is a multiple of `LEDGERS_PER_TX_INDEX`, so the next-eligible index is exactly `LEDGERS_PER_TX_INDEX` further. -- **Why a separate `pruning:index:*` key family** (instead of overloading `index:N:txhash` with a `"deleting"` value). Keeps every meta-store key binary (present-or-absent), so reader code never has to decode special values. The marker's "set first, cleared last" pattern is the same in either encoding; using a separate key family makes the prune-intent explicit and self-documenting in meta-store dumps. 
Cost: one extra meta-store op per prune (a sub-microsecond RocksDB point write) and a slightly longer `prunable_tx_index_ids` (union of the in-progress set with the past-retention-eligible set). Worth it for schema uniformity. -- **Pruning marker is steady-state-only.** Phase 3 Pass 1 also deletes past-retention `index:N:txhash` keys (and any `.idx` / `.pack` / events / `.bin` files for chunks below the retention floor), but does NOT set the pruning marker — at startup `service_ready = false`, so there are no queries to gate. Pass 1 also clears any `pruning:index:N` it finds during its sweep, so the lifecycle loop's initial sweep doesn't re-attempt work Pass 1 already finished. +- **Why index-atomic.** Per-chunk pruning would open a window where `getTransaction` resolves but its `getLedger` 4xxs (pack deleted); whole-index gating closes it. +- **Extra data on disk.** Up to `LEDGERS_PER_TX_INDEX - 1` ledgers past strict retention. `RETENTION_LEDGERS` is always a multiple of `LEDGERS_PER_TX_INDEX`, so the next-eligible index is exactly `LEDGERS_PER_TX_INDEX` further. +- **Separate marker family vs overloading `index:N:txhash` with a `"deleting"` value.** Keeps every meta-store key binary (present-or-absent) so reader code never decodes special values; cost is one extra sub-microsecond write per prune. +- **Marker is steady-state-only.** Phase 3 Pass 1 deletes past-retention state directly (no marker needed — `service_ready = false`, no queries to gate) and clears any below-floor `pruning:index:*` it finds. --- ## Query Contract -Query serving is gated on Phase 4 (live ingestion) being reached. `getLedger`, `getTransaction`, `getEvents` all return **HTTP 4xx** during Phases 1–3. +`getLedger` / `getTransaction` / `getEvents` are gated on `service_ready`; during startup phases they return **HTTP 4xx**. 
### Readiness Signal -- In-memory boolean `service_ready`, flipped to `true` by `set_service_ready()` once on-disk state is consistent — that means after Phases 1–3 complete and `compute_resume_ledger` returns, but BEFORE Phase 4's captive-core spinup. Reasoning: queries against frozen artifacts (chunks `<=` `streaming:last_committed_ledger`) don't depend on captive core having started, so 4xx-during-spinup adds no correctness; it only adds an unnecessary 4-5 minute query outage on every restart. Not persisted; every startup begins `false`. -- HTTP server binds at service startup (before Phase 1 (catchup)), so `getHealth` is always servable. The QueryRouter routes `getHealth` unconditionally and gates `getLedger` / `getTransaction` / `getEvents` on `service_ready`: `false` → HTTP 4xx; `true` → route normally. -- Clients see `HTTP 4xx` from the three read endpoints on every startup until Phase 4 is reached, regardless of prior runs. Intentional — catchup and recovery phases must complete before the service serves, every time. +- In-memory boolean. Flipped `true` by `set_service_ready()` after Phases 1–3 complete and `compute_resume_ledger` returns, BEFORE Phase 4's captive-core spinup. Frozen-artifact queries don't need captive core; flipping after spinup would add an unnecessary 4–5 minute outage per restart. +- Not persisted; every startup begins `false`. Clients see 4xx on every startup until Phase 4 is reached, regardless of prior runs. +- HTTP server binds before Phase 1, so `/getHealth` is always servable. ### Behavior During Phases 1–3 -- `/getLedger`, `/getTransaction`, `/getEvents` → `HTTP 4xx` with no payload detail. -- `/getHealth` → always served. Response payload matches the existing stellar-rpc shape: `status` (`catching_up` during Phases 1–3, `healthy` during Phase 4 (live ingestion)), `latestLedger` (= `streaming:last_committed_ledger`, or `0` if absent), `oldestLedger` (first ingested ledger), `ledgerRetentionWindow`. 
No drift field, no network-tip field. -- No partial / incremental serving. The service does not serve "whatever is ingested so far" while Phases 1–3 are running. +- `/getLedger`, `/getTransaction`, `/getEvents` → HTTP 4xx, no payload. +- `/getHealth` → served. Response shape matches existing stellar-rpc: `status` (`catching_up` during Phases 1–3, `healthy` during Phase 4), `latestLedger` (= `streaming:last_committed_ledger`, or `0` if absent), `oldestLedger`, `ledgerRetentionWindow`. +- No partial / incremental serving while Phases 1–3 run. ### Behavior When an Index Is Being Pruned -- `prune_tx_index` sets `pruning:index:{tx_index_id:08d} = "1"` before touching any files, deletes `index:{tx_index_id:08d}:txhash` after all artifacts are gone, and clears `pruning:index:{tx_index_id:08d}` last. QueryRouter checks the `pruning:index:*` key first; if set, returns HTTP 4xx as if the index were past retention. -- Queries for a ledger in a pruning index return HTTP 4xx the instant `pruning:index:N` is set, not when files actually disappear. No window where queries route into a half-deleted index. - -### Rationale - -- Without an explicit gate, implementations drift toward "best-effort serve whatever is ingested" — inconsistent across operators, breaks client assumptions. -- Explicit `service_ready` + HTTP 4xx gives clients an unambiguous signal. -- `catching_up` health status gives operators visibility into progress. +- QueryRouter checks `pruning:index:N` first; if set, returns 4xx as if the index were past retention. Queries flip to 4xx the instant the marker is set, not when files actually disappear — no window where queries route into a half-deleted index. --- ## Resilience -Flag-after-fsync makes the meta store authoritative for every startup decision — never filesystem scanning. Streaming extends backfill's resilience model with per-ledger checkpoint discipline, per-kind single-flight freeze gates, and a two-phase prune marker. 
- -### Crash Recovery - -No separate recovery phase. Every startup runs Phases 1–4 — already-complete work is detected and skipped via meta-store flags. - -#### Invariants - -In addition to the backfill subroutine's invariants in [01-backfill-workflow.md — Crash Recovery](./01-backfill-workflow.md#crash-recovery), streaming adds the following: - -1. **Monotonic progress marker.** `streaming:last_committed_ledger` advances only via `advance_progress_marker` (never regresses). Two writers: Phase 1 catchup (post-catchup, to `last_ledger_in_chunk(highest completed chunk)`) and live ingestion loop (per ledger, after all three active stores durably commit). Resume is `last_committed_ledger + 1` — except in the no-BSB long-downtime case where the marker is stale and below the new retention floor, in which case `compute_resume_ledger` deletes it and skips forward to the retention-aligned start. -2. **No separate recovery phase.** Startup is Phases 1–4. Nothing else. -3. **Max-1-transitioning per freeze.** A freeze transition must complete before the next one starts, per kind (LFS, events, RecSplit). Applies in steady state and crash recovery. -4. **Retention immutable.** `config:retention_ledgers` is stored on first run and compared thereafter. No mid-run retention change. -5. **Pruning intent marker.** `pruning:index:{N} = "1"` is set BEFORE any file delete and cleared AFTER all artifacts and the `index:{N}:txhash` key are gone. QueryRouter treats its presence as "tx index N is unservable" → 4xx. Crashes mid-prune leave the marker set; the lifecycle loop's `prunable_tx_index_ids` picks it up via `scan_prefix("pruning:index:")` and re-runs the prune idempotently. See [Pruning](#pruning). -6. **Hot-key tracking + retention-aware cleanup.** Every active store directory has a `hot:*` key, set BEFORE `mkdir` and cleared AFTER dir removal. 
Phase 3's two passes are both meta-store-driven (no filesystem scan anywhere): pass 1 discards past-retention orphans (active dirs and stale freeze flags below the retention floor — common after long downtime where the floor moves forward); pass 2 reconciles in-flight transitions for entries above the floor. -7. **No permanently-partial tx index.** Every persisted tx index reaches a terminal state — either *complete* (`index:N:txhash = "1"`, RecSplit built) or *fully discarded* (all chunks + `index:N:txhash` deleted). The intermediate "trailing partial" state (chunks with `:lfs+:events` but no `index:N:txhash`) only persists transiently — it is completed by either: (a) a future Phase 1 invocation extending the backfill range past `last_chunk_in_tx_index(N)`, (b) Phase 4 ingestion reaching `last_chunk_in_tx_index(N)` (which fires `on_tx_index_boundary` → `build_tx_index_recsplit_files`), or (c) Phase 3 Pass 1 discarding it as past-retention. No fourth path; no tx index ever stays partial forever. -8. **No permanent orphans.** Every meta-store flag has a corresponding artifact (or is mid-cleanup, recoverable via the file-before-flag-delete invariant). Every active-store directory has a `hot:*` key (set before mkdir). Every immutable file has a freeze flag (set after fsync). The only states without a recovery path are operator-introduced (manually deleted a meta-store key while files survive, or restored a filesystem snapshot inconsistently) — explicitly out of scope per the meta-store-driven-recovery principle. In all other cases, every artifact on disk traces back to a meta-store record that some recovery path (backfill self-heal, Phase 3 Pass 1, Phase 3 Pass 2, or pruning lifecycle) will eventually act on. - -#### Compound Recovery Scenarios - -Backfill's crash-recovery model in [01-backfill-workflow.md](./01-backfill-workflow.md#crash-recovery) handles every Phase 1 (catchup) crash. 
Streaming adds: - -- **Crash during Phase 2 (`.bin` hydration).** All sub-cases are recoverable because every cleanup pair runs file-delete BEFORE flag-delete (see [Flag Semantics](#flag-semantics)). - - **Sweep 1, mid-loop.** Already-cleaned chunks: flag absent → skipped on retry. Pending chunks: flag + file still present → cleaned on retry. - - **Sweep 1, between file-delete and flag-delete.** Flag set, file already gone. Restart: flag triggers retry, `delete_if_exists` is a no-op on the missing file, flag deleted. - - **Sweep 2, between `load_bin_into_rocksdb` and file-delete.** Flag set, file present, data already durable in the active txhash RocksDB. Restart: re-loads (RocksDB put is idempotent on the same key/value), then deletes file, then flag. - - **Sweep 2, between file-delete and flag-delete.** Flag set, file gone, data durable. Restart: flag triggers retry, `os.path.exists(bin_path)` is False so load is skipped, file delete is a no-op, flag deleted. - - **No filesystem scan needed in any case** — the meta-store flag is the only signal the next start consults. - -- **Crash between per-ledger checkpoint and freeze completion (LFS / events).** - - State: `streaming:last_committed_ledger = last_ledger_in_chunk(chunk_id)`; `chunk:{chunk_id}:lfs` absent; `hot:chunk:{chunk_id}:lfs` set; active ledger store dir present. - - Phase 1 (catchup) on restart (assumes `[BSB]` configured): `:lfs` missing → re-runs `process_chunk(chunk_id)` with a fresh per-task BSB (idempotent per artifact). - - Phase 3 (reconcile) iterates `hot:*` keys. Hits SCENARIO B (freeze flag now set + chunk_id < resume): `delete_dir_if_exists` + clear hot key. Cleanup is idempotent. - - Cost: ~10_000 ledgers of redundant ingestion per affected chunk. Correctness preserved. 
+Streaming extends backfill's resilience model ([01-backfill-workflow.md — Crash Recovery](./01-backfill-workflow.md#crash-recovery)) with per-ledger checkpoint discipline, per-kind single-flight freeze gates, and the `pruning:index:*` marker family. No separate recovery phase — every startup runs Phases 1–4 and skips already-complete work via meta-store flags. -- **Crash mid-RecSplit.** - - State: `index:{tx_index_id}:txhash` absent; `hot:index:{tx_index_id}:txhash` set; all `:lfs` chunks present; partial `.idx` files possibly on disk. - - Phase 3 (reconcile) hits SCENARIO C → `finish_interrupted_freeze` re-runs the build (its preamble blanket-deletes partial `.idx` files). +### Streaming-Specific Invariants -- **Crash mid hot-store creation.** - - State: `hot:chunk:{chunk_id}:lfs` (or events / txhash) set, but `mkdir` / RocksDB open didn't complete. Dir absent or partially set up; freeze flag absent. - - Phase 3 (reconcile): if `chunk_id == resume`, SCENARIO A — keep; Phase 4 reopens via `open_active_*_store` (idempotent — mkdir no-ops on existing dir, RocksDB recovers from partial WAL). If `chunk_id < resume`, SCENARIO C — `finish_interrupted_freeze` reopens and re-runs the freeze. +1. **No permanently-partial tx index.** Every persisted tx index reaches a terminal state — either *complete* (`index:N:txhash = "1"`, RecSplit built) or *fully discarded* (all chunks + `index:N:txhash` deleted). The intermediate "trailing partial" state (chunks with `:lfs+:events` but no `index:N:txhash`) persists only transiently, and is completed by either (a) a future Phase 1 invocation extending the backfill range past `last_chunk_in_tx_index(N)`, (b) Phase 4 ingestion reaching `last_chunk_in_tx_index(N)` (which fires `on_tx_index_boundary` → `build_tx_index_recsplit_files`), or (c) Phase 3 Pass 1 discarding it as past-retention. +2. **No permanent orphans.** Every meta-store flag has a corresponding artifact (or is mid-cleanup, recoverable via file-before-flag-delete). 
Every active-store dir has a `hot:*` key. Every immutable file has a freeze flag. +3. **Pruning intent marker.** `pruning:index:{N}` is set BEFORE any file delete and cleared AFTER everything else; the lifecycle loop's `prunable_tx_index_ids` picks up surviving markers via `scan_prefix("pruning:index:")` and re-runs the prune idempotently. See [Pruning](#pruning). From 47990358cafa629097584cf772403bc925ff204c Mon Sep 17 00:00:00 2001 From: Karthik Iyer Date: Sun, 26 Apr 2026 23:40:05 -0700 Subject: [PATCH 34/34] Streaming doc: compute_resume_ledger 3-branch dispatch + Pass 1 sibling blanket-delete MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - compute_resume_ledger restructured into three explicit branches with the shorthand `marker = streaming:last_committed_ledger`. Branch 1 returns marker + 1 (current/unconfirmed). Branch 2 detects staleness via direct history-archive sampling and returns retention_aligned_resume_ledger_with_tip; the BSB-presence gate was removed so the BSB-stale-but-configured operator scenario gets the same skip-forward as no-BSB long-downtime. Branch 3 simplifies to a fresh-operator-start return; the prior highest-:lfs fallback and scan_max_lfs_chunk helper are dropped (corner-case optimization not worth the complexity, captive core re-ingests at most retention_window worth of ledgers). - Branch 2 no longer deletes streaming:last_committed_ledger. The live loop's first commit overwrites the stale value monotonically via advance_progress_marker (candidate > prior). - Pass 1's chunk: loop now applies a sibling blanket-delete: on first encounter of any chunk K below floor, delete_if_exists on all three artifact paths (.pack, .bin, events cold segment) — not just the one this key represents. Catches partial files left by a prior crashed process_chunk where one kind completed (flag set) but the next kind hadn't (no flag, partial file with no key tracking it). 
A blanket_cleaned set deduplicates per chunk_id so 3-flag chunks don't redo the work. - Per-if-condition comment audit across Phases 1, 2, 3: callous "clubs multiple scenarios into one comment" cases rewritten to state precisely the scenario each condition checks. Notable: phase1_catchup's terminal early-return comment now states "BSB returned -1" with a side-note that the other loop break (end_chunk < start_chunk) is unreachable under current geometry. Phase 2 Sweep 2's continue + os.path.exists checks now name their respective scenarios (chunks Phase 1 hadn't reached, and the prior-Phase-2-crash-mid-pair file-already-deleted case). - pass2_recover_in_flight_transitions's `if last_committed is None` comment now spells out why bailing is correct: any orphan hot:* keys from a prior run's abnormal exit get classified on a later restart once the marker is set. --- .../design-docs/02-streaming-workflow.md | 130 ++++++++++-------- 1 file changed, 72 insertions(+), 58 deletions(-) diff --git a/full-history/design-docs/02-streaming-workflow.md b/full-history/design-docs/02-streaming-workflow.md index 888ef8ed9..f62938607 100644 --- a/full-history/design-docs/02-streaming-workflow.md +++ b/full-history/design-docs/02-streaming-workflow.md @@ -403,7 +403,7 @@ Pure-streaming restarts (no recent Phase 1 output) never see `.bin` files; the l ### Phase 3 — Reconcile -Two passes, both meta-store-driven. Pass 1 runs first so Pass 2's resume-relative classification only sees in-range entries. +Two passes, both meta-store-driven. Pass 1 drops past-retention state first so Pass 2 only has to handle hot:* keys that are still within retention. ```python def phase3_reconcile(config, meta_store): @@ -424,7 +424,13 @@ def pass1_drop_past_retention_state(config, meta_store): chunks of an INCOMPLETE prior-run tx index. 
Because index:N:txhash was never written for that tx index, the pruning lifecycle's past-retention scan (which keys off index:N:txhash = "1") cannot - find them — Pass 1 is their only cleanup path. + find them — Pass 1 is their only cleanup path. The chunk loop also + applies a sibling blanket-delete: for every chunk K below floor that + carries any chunk:K:* key, delete_if_exists is called on ALL three + artifact paths (.pack, .bin, events cold segment), not just the one + this key represents. Catches partial files left by a prior crashed + process_chunk where one kind completed (flag set) but the next kind + hadn't — its partial file on disk has no key tracking it. 2. Lifecycle-reachable complete state, drained eagerly. - index:N:txhash flags + the 16 RecSplit .idx files for fully-built @@ -478,11 +484,24 @@ def pass1_drop_past_retention_state(config, meta_store): meta_store.delete(hot_key) log.info(f"phase3 pass1: discarded past-retention {hot_key}") + # Sibling blanket-delete: the first time we see any chunk:K:* key for K + # below floor, delete_if_exists ALL three artifact paths for K (.pack, .bin, + # events cold segment) — not just the one this key represents. Catches + # partial files from a prior crashed process_chunk where some kinds + # completed (flag set) but others didn't (flag absent + partial file on + # disk with no key tracking it). Idempotent — extra delete_if_exists calls + # on already-clean paths are no-ops. 
+ blanket_cleaned = set() for chunk_key in meta_store.scan_prefix("chunk:"): - chunk_id, kind = parse_chunk_key(chunk_key) - if chunk_id < floor_chunk: - delete_immutable_artifact(kind, chunk_id) # .pack / events cold segment / .bin - meta_store.delete(chunk_key) + chunk_id, _ = parse_chunk_key(chunk_key) + if chunk_id >= floor_chunk: + continue + if chunk_id not in blanket_cleaned: + delete_if_exists(ledger_pack_path(chunk_id)) + delete_events_segment(chunk_id) + delete_if_exists(raw_txhash_path(chunk_id)) + blanket_cleaned.add(chunk_id) + meta_store.delete(chunk_key) # Category 2 — past-retention complete indexes; lifecycle could reach but Pass 1 drains eagerly. for index_key in meta_store.scan_prefix("index:"): @@ -545,62 +564,57 @@ Runs once per service start, after Phase 3, before Phase 4. Returns the ledger s ```python def compute_resume_ledger(config, meta_store) -> int: """ - Decide where captive core resumes ingestion. Three branches: - - 1. Marker present + (BSB or retention=0 or marker not stale) → marker + 1. - Covers fresh starts (Phase 1 just advanced), quick restarts, and - BSB-present long-downtime (Phase 1 advanced past the stale value). - - 2. Marker present + no BSB + below retention floor → stale (no-BSB - long downtime). Delete marker and return retention_aligned_resume_ledger. - Without this, captive core would re-ingest weeks of about-to-be-pruned - ledgers. - - 3. Marker absent → defensive fallback. Highest :lfs chunk + 1 if any - exist; otherwise retention_aligned_resume_ledger. - - Degraded-path note: in branch 2, if the history archive is unreachable - (try_sample_network_tip returns None), we cannot confirm staleness and - fall through to marker + 1. The live loop then re-ingests potentially - weeks of past-retention ledgers via captive core; the pruning lifecycle - eventually catches up. Documented as expected behavior under archive - unavailability — correctness holds, throughput is degraded. 
+    Where should captive core resume ingestion?
+    Shorthand: `marker` = `streaming:last_committed_ledger`.
+
+    Why not just "if marker exists, return marker + 1"? The marker is our
+    local progress checkpoint, not the live network's tip. In long-downtime
+    or BSB-stale scenarios it can be tens of millions of ledgers behind the
+    live tip; resuming at marker + 1 there would archive-catchup from the
+    stale point and re-ingest weeks of ledgers that the pruning lifecycle
+    would delete moments later. Branch 2 sidesteps that by sampling the
+    live tip and skipping forward when the marker has fallen behind.
+
+    Three branches:
+
+    Branch 1 — marker present and current (or staleness unconfirmed):
+        return marker + 1. Covers archive profile (no floor), normal
+        restarts, and the degraded path where the history archive is
+        unreachable.
+
+    Branch 2 — marker present, retention > 0, marker below the retention
+        floor: stale. Return retention_aligned_resume_ledger_with_tip. Two
+        cases land here: (1) no-BSB long-downtime — prior live loop hasn't
+        committed in a while; (2) BSB-stale-but-configured — operator
+        stopped advancing BSB while the network kept producing, so Phase 1
+        only advanced the marker to BSB's stale tip. The first live commit
+        on resume monotonically overwrites the stale marker via
+        advance_progress_marker.
+
+    Branch 3 — marker absent (fresh operator start):
+        return retention_aligned_resume_ledger.
 
     No consistency validation. Phase 1 self-heals incomplete chunks; Phase 3
     recovers in-flight freezes; pruning lifecycle handles past-retention state.
     """
-    progress_marker = meta_store.get("streaming:last_committed_ledger")
-
-    if progress_marker is not None:
-        # Stale-marker check — only matters when there's no BSB (tip-tracker
-        # profile after long downtime). In BSB-present paths, Phase 1 has
-        # already advanced the marker past any stale value before we got here.
-        if config.bsb is None and config.service.retention_ledgers > 0:
-            current_tip = try_sample_network_tip(config)
-            if current_tip is not None:
-                floor = max(current_tip - config.service.retention_ledgers, GENESIS_LEDGER)
-                if progress_marker < floor:
-                    log.info(f"compute_resume_ledger: marker {progress_marker} below retention floor {floor}; skipping forward")
-                    meta_store.delete("streaming:last_committed_ledger")
-                    return retention_aligned_resume_ledger_with_tip(config, current_tip)
-        return progress_marker + 1
-
-    # Marker absent — defensive fallback. Recover via the highest :lfs chunk
-    # if any exist; otherwise tip-tracker fresh start.
-    highest_lfs = scan_max_lfs_chunk(meta_store)
-    if highest_lfs is not None:
-        return last_ledger_in_chunk(highest_lfs) + 1
-
-    return retention_aligned_resume_ledger(config)
-
-
-def scan_max_lfs_chunk(meta_store) -> Optional[int]:
-    # Reverse-iterates chunk: prefix; stops at first :lfs key. ~1-2 reads
-    # regardless of archive size.
-    for key in meta_store.iter_prefix_reverse("chunk:"):
-        if key.endswith(":lfs"):
-            return parse_chunk_id_from_chunk_key(key)
-    return None
+    marker = meta_store.get("streaming:last_committed_ledger")
+
+    # Branch 3 — fresh operator start.
+    if marker is None:
+        return retention_aligned_resume_ledger(config)
+
+    # Branch 2 — staleness check. Skipped when retention=0 (no floor) or
+    # when the archive is unreachable (best-effort fall-through to Branch 1).
+    if config.service.retention_ledgers > 0:
+        current_tip = try_sample_network_tip(config)
+        if current_tip is not None:
+            floor = retention_floor_ledger(current_tip, config.service.retention_ledgers)
+            if marker < floor:
+                log.info(f"compute_resume_ledger: marker ({marker}) below retention floor ({floor}); skipping forward")
+                return retention_aligned_resume_ledger_with_tip(config, current_tip)
+
+    # Branch 1 — marker is current (or staleness unconfirmed).
+    return marker + 1
 
 
 def retention_aligned_resume_ledger(config) -> int: