Generalize cold DB: ColdStore trait + slot-keyed static archive #75
dapplion wants to merge 24 commits into
Add a slot-keyed durable archive (`StaticBlockStore`) for finalized blinded blocks, integrated into `migrate_database` as a second pass that runs alongside the existing cold-state migration. File format and manifest persistence remain `todo!()`; this is the wiring scaffold.
- New `DBColumn::BeaconBlockSlot` reverse index (root → slot).
- `HotColdDB::get_block_with` and `block_exists` fall through to the archive after a hot-KV miss.
- Archival driven inside `migrate_database`: cold ops (BeaconBlockRoots + BeaconBlockSlot) commit atomically, hot deletes after split commit.
- Skip-slot dedup seeded from `BeaconBlockRoots[current_split.slot - 1]`, with `Hash256::ZERO` for the genesis case.
- Spec at `specs/static-blocks.md`.
Companion document describing the static-file backend for `BlobSidecar` archival via `.erb` files. Initialization via genesis sync or imported era files; checkpoint sync and P2P blob backfill rejected at startup.
Replaces the byte-keyed Cold: ItemStore<E> bound on HotColdDB with a slot-typed
ColdStore<E> trait: get/put_batch/exists/iter_from for slot-keyed columns plus
get_index/put_index_batch over a tight DBColumnColdIndex enum (BlockSlot,
ColdStateSummary). KV backends (BeaconNodeBackend, MemoryStore) implement it
by translating slot/root keys into the existing KeyValueStore byte API.
StaticBlockStore generalised to StaticColdStore: one type, columns dispatched
on each call. Per-column subdirectory; per-column settings (record_type,
compression, max_decompressed) come from a build-time column_config table on
first creation and are persisted in each column's conf so future builds with
different defaults stay compatible. Conf magic bumped to LHSTBLK2.
Removes prune_historic_states + the lighthouse db prune-states CLI: the mode
they produce ("cold blocks present, cold states absent") isn't in the
startup-path table in specs/static-cold-backend.md and the spec doesn't
support runtime mode transitions. full_state_pruning_enabled goes with it.
Other: store_cold_state* helpers take separate slot-keyed and root-index
buffers; migration writes slot-keyed cold data first, root indices after, so
a crash leaves no dangling indices.
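As a reading aid for the commit above, here is a minimal sketch of how a byte-keyed store might be given a slot-typed view. `Column`, `ByteStore`, and `SlotStore` are hypothetical stand-ins rather than the real `DBColumnCold`, `KeyValueStore`, or `ColdStore<E>`, the method signatures are assumed, and the big-endian slot key is an assumption chosen so that byte order matches slot order.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a tight slot-keyed column enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Column {
    BlockRoots,
    StateRoots,
}

/// Hypothetical byte-keyed store standing in for a KV backend.
#[derive(Default)]
pub struct ByteStore {
    map: BTreeMap<(Column, Vec<u8>), Vec<u8>>,
}

/// Slot-typed view. Method names follow the commit message; signatures are assumed.
pub trait SlotStore {
    fn put_batch(&mut self, col: Column, items: &[(u64, Vec<u8>)]);
    fn get(&self, col: Column, slot: u64) -> Option<Vec<u8>>;
    fn exists(&self, col: Column, slot: u64) -> bool;
    fn iter_from(&self, col: Column, start: u64) -> Vec<(u64, Vec<u8>)>;
}

/// Assumption: big-endian keys, so lexicographic byte order equals slot order.
fn slot_key(slot: u64) -> Vec<u8> {
    slot.to_be_bytes().to_vec()
}

impl SlotStore for ByteStore {
    fn put_batch(&mut self, col: Column, items: &[(u64, Vec<u8>)]) {
        for (slot, value) in items {
            self.map.insert((col, slot_key(*slot)), value.clone());
        }
    }

    fn get(&self, col: Column, slot: u64) -> Option<Vec<u8>> {
        self.map.get(&(col, slot_key(slot))).cloned()
    }

    fn exists(&self, col: Column, slot: u64) -> bool {
        self.map.contains_key(&(col, slot_key(slot)))
    }

    fn iter_from(&self, col: Column, start: u64) -> Vec<(u64, Vec<u8>)> {
        self.map
            .range((col, slot_key(start))..)
            // Stop as soon as the scan leaves the requested column.
            .take_while(|((c, _), _)| *c == col)
            .map(|((_, key), value)| {
                let mut raw = [0u8; 8];
                raw.copy_from_slice(key);
                (u64::from_be_bytes(raw), value.clone())
            })
            .collect()
    }
}
```

The point of the slot-typed boundary is that callers never build byte keys themselves, so a file-based backend can implement the same methods without any byte-key translation.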
- Move beacon_node/store/src/static_blocks.rs to static_cold.rs (the type is no longer block-specific).
- Add DBColumnCold (slot-keyed cold columns) alongside DBColumnColdIndex. StaticColdStore is keyed by DBColumnCold all the way through; no DBColumn conversion happens inside static_cold.rs. column_config returns a plain ColumnConfig (was Option) and UnsupportedColumn errors go away: the tighter enum makes them unrepresentable.
- Eager-open every cold column at boot, freeze the columns map. No outer Mutex/RwLock; the per-column writer state mutex is the only sync point.
- Rename ColumnConfig::max_decompressed -> max_value_bytes (it bounds the raw payload size on uncompressed reads too, defending against corrupt headers).
- BeaconStateDiff: compression: false. HDiff is already compressed internally (zstd'd validator/balance chunks) so snappy on top is wasteful.
The slot-keyed methods on ColdStore (get/put_batch/contains/iter_from) now take the tight DBColumnCold enum instead of DBColumn, mirroring the existing DBColumnColdIndex shape on the index methods. This drops DBColumn from static_cold.rs entirely. KV backend impls (BeaconNodeBackend, MemoryStore) translate via column.db_column(). FrozenForwardsIterator::new still accepts DBColumn at the public boundary and converts at the call to cold_db.iter_from.

Also: delete static_blobs.rs (was a stub returning Unsupported on every call, with no callers).

Revert noise renames (io_batch, cold_db_block_ops, cold_db_state_ops, ops, .map_err(|e| e.into())) to keep the diff against unstable focused on real semantic changes.
`BeaconBlockSlot` (and the `DBColumnColdIndex::BlockSlot` variant that wrapped it) was added for a static-archive read-fallback path that was removed earlier in this branch. Nothing writes or reads it now, so drop the variant from the DBColumn enum, the matching DBColumnColdIndex variant, the `MissingFrozenBlockSlot` error, and the corresponding key_size match arm. Rewrite TODO-static-block-storage.md to reflect the current branch state: the static-cold generalization is in, the prune-states removal is in, and the remaining work is cold-backend selection (flag), review of block read/write paths now that BeaconBlockSlot is gone, an invariants review, and tests.
The two explicit impls (BeaconNodeBackend, MemoryStore) were identical boilerplate translating slot/root keys into the underlying byte-keyed KeyValueStore. Replace with a single blanket impl in lib.rs. Forecloses a future ColdStore impl that isn't a KeyValueStore (e.g. wiring StaticColdStore directly as the Cold parameter); reversible if/when that becomes wanted.
The blanket `ColdStore` impl writes `slot.as_ssz_bytes()` for
`BeaconColdStateSummary`, where older releases wrote SSZ-encoded
`ColdStateSummary { slot }`. The two encodings are byte-identical (an SSZ
container of one fixed-size field equals the field), but the equality is
load-bearing for read compatibility with existing databases. Add a
regression test that pins it.
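The byte-equality claim can be illustrated without the real SSZ crate. The sketch below hand-rolls the two encodings under the standard SSZ rules (a `uint64` is 8 little-endian bytes; a container of only fixed-size fields is the concatenation of its fields with no offsets). `Summary` is a hypothetical mirror of `ColdStateSummary { slot }`, not the actual type, and this is not the regression test from the commit.

```rust
/// Hand-rolled SSZ encoding of a `uint64`: 8 little-endian bytes.
pub fn ssz_u64(value: u64) -> Vec<u8> {
    value.to_le_bytes().to_vec()
}

/// Hypothetical mirror of `ColdStateSummary { slot }`.
pub struct Summary {
    pub slot: u64,
}

impl Summary {
    /// A container whose fields are all fixed-size serializes as the plain
    /// concatenation of its fields; no offset table is emitted.
    pub fn as_ssz_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8);
        out.extend_from_slice(&ssz_u64(self.slot));
        out
    }
}
```

With a single fixed-size field the concatenation is just that field, which is why a bare slot and the one-field container are interchangeable on disk.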
The slot-walk rewrite of `check_cold_state_diff_consistency` was forced by not having an index iterator on the trait. Add `iter_index(col)` (yields `(Hash256, Slot)`) and restore the invariant to iterating `BeaconColdStateSummary` directly, matching unstable's structure modulo the slot-typed API.
Replace the two-buffer (slot-keyed data + state-root index) helper signatures with a single `&mut ColdBatch` and add `commit_cold_batch` that flushes data, syncs, then commits the index — encoding the data-before-index ordering at the API. `put_state` and `reconstruct.rs` collapse to "build batch, commit batch." The migration loop keeps a top-level summary index that accumulates across states and is flushed at end-of-migration; per-iteration data still goes through `commit_cold_data` (renamed from `commit_cold_items`).
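A minimal sketch of the data-before-index ordering that `commit_cold_batch` is described as encoding. Only the names `ColdBatch` and `commit_cold_batch` come from the commit message; the field layout, the `Cold` trait, the `String` error type, and the `Recorder` test double are assumptions made for illustration.

```rust
pub type Root = [u8; 32];

/// Assumed shape: slot-keyed data plus a root -> slot index.
#[derive(Default)]
pub struct ColdBatch {
    pub data: Vec<(u64, Vec<u8>)>,
    pub index: Vec<(Root, u64)>,
}

/// Hypothetical minimal cold-store surface used by the sketch.
pub trait Cold {
    fn put_batch(&mut self, items: &[(u64, Vec<u8>)]) -> Result<(), String>;
    fn sync(&mut self) -> Result<(), String>;
    fn put_index_batch(&mut self, items: &[(Root, u64)]) -> Result<(), String>;
}

/// Flush data, sync, then commit the index. An early failure returns before
/// the index is touched, so no index entry can point at missing data.
pub fn commit_cold_batch<C: Cold>(cold: &mut C, batch: &ColdBatch) -> Result<(), String> {
    cold.put_batch(&batch.data)?;
    cold.sync()?;
    cold.put_index_batch(&batch.index)
}

#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    PutData,
    Sync,
    PutIndex,
}

/// Test double that records the call order and can simulate a failed sync.
#[derive(Default)]
pub struct Recorder {
    pub steps: Vec<Step>,
    pub fail_on_sync: bool,
}

impl Cold for Recorder {
    fn put_batch(&mut self, _items: &[(u64, Vec<u8>)]) -> Result<(), String> {
        self.steps.push(Step::PutData);
        Ok(())
    }

    fn sync(&mut self) -> Result<(), String> {
        if self.fail_on_sync {
            return Err("simulated failure during sync".to_string());
        }
        self.steps.push(Step::Sync);
        Ok(())
    }

    fn put_index_batch(&mut self, _items: &[(Root, u64)]) -> Result<(), String> {
        self.steps.push(Step::PutIndex);
        Ok(())
    }
}
```

Putting the ordering in one function means call sites cannot accidentally commit an index entry before its data is durable.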
Drops the `KeyValueStore -> ColdStore` blanket and replaces it with an explicit per-backend impl. `BeaconNodeBackend` no longer impls `ColdStore` directly; its byte-translation is inlined inside the `ColdBackend::Kv` arm where it's actually used. `MemoryStore` keeps an explicit impl (still used as the Cold parameter in tests via `EphemeralHarnessType`).

`ColdBackend<E>` is a new enum with `Kv(BeaconNodeBackend)` / `Static(StaticColdStore)` variants, picked at startup from `StoreConfig::cold_backend` (default `Kv`). Production type signatures swap the second `BeaconNodeBackend<E>` slot to `ColdBackend<E>` (3 production sites, 6 test sites, 3 database_manager sites).

`StaticColdBackend<E>` wrapper from the previous commit collapsed into a direct `impl<E> ColdStore<E> for StaticColdStore`. Index methods stub `Unsupported` for now; wiring the embedded KV is the next piece.
Genesis sync against the static cold backend was failing for two reasons:
1. `BeaconColdStateSummary` and friends are root-keyed indices; the static files are slot-keyed. The previous `Unsupported` stubs blocked the very first migration. Embed a `BeaconNodeBackend<E>` at `<root>/index/` and serve `get_index` / `put_index_batch` / `iter_index` from it. Forwards iteration over slot-keyed columns (`iter_from`) is now also implemented by walking the column's `.off` sidecar.
2. `BeaconChainBuilder::genesis` pre-writes the genesis block_root to cold `BlockRoots` at slot 0, then the first migration writes the same (slot, root) again. KV cold accepts the overwrite; the static backend's strict-ascending check rejected it. `Column::put` now treats a re-put of an identical value at the current highest slot as a no-op, and errors only on a value mismatch (a real bug).

Threads `StoreConfig` into `StaticColdStore::open` so the embedded KV picks up the same backend (`leveldb` / `redb`) and tuning as the hot/blobs DBs.

Adds `genesis_sync_static_cold` covering ~1000 finalized blocks with the static backend and a load of every cold state through the new index.
Drops the bespoke 1000-block static-cold test and instead has get_store read the cold backend from COLD_BACKEND=static|kv. CI / local can now run the existing store_tests suite against either backend without duplicating test bodies. Also trims ColdBackendKind to the derives actually exercised today. Display, EnumString, VariantNames, Copy were forward-looking for the not-yet-wired --cold-backend CLI flag - re-add when that lands.
The static cold backend is append-only in ascending slot order, so checkpoint/weak-subjectivity sync (which backfills slots below the anchor) is fundamentally incompatible. Refuse the combination explicitly in BeaconChainBuilder::weak_subjectivity_state instead of failing later with an opaque 'static cold put out of order' error. The 6 weak_subjectivity_sync_* tests early-return under COLD_BACKEND=static so the test suite passes against either backend. Adds the --cold-backend CLI flag (kv|static, default kv) so operators can opt into the static backend at startup. Re-adds EnumString and VariantNames on ColdBackendKind for clap parsing.
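To make the flag parsing and the early refusal concrete, here is a hedged sketch. `ColdBackendKind` and the `kv|static` values come from the commit message; the hand-written `FromStr` (the commit uses derive macros for clap instead), the function name `check_checkpoint_sync_allowed`, and the error strings are all hypothetical.

```rust
use std::str::FromStr;

/// Backend kinds named in the commit; `Kv` is the stated default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ColdBackendKind {
    #[default]
    Kv,
    Static,
}

/// Hand-written parser standing in for the derive-based parsing used for the CLI.
impl FromStr for ColdBackendKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "kv" => Ok(Self::Kv),
            "static" => Ok(Self::Static),
            other => Err(format!("unknown cold backend: {other}")),
        }
    }
}

/// Hypothetical guard: refuse checkpoint sync up front under the static
/// backend rather than failing later on an out-of-order write.
pub fn check_checkpoint_sync_allowed(kind: ColdBackendKind) -> Result<(), String> {
    match kind {
        ColdBackendKind::Kv => Ok(()),
        ColdBackendKind::Static => Err(
            "checkpoint sync is not supported with the static cold backend".to_string(),
        ),
    }
}
```

Failing at builder time gives the operator a message that names the incompatible combination instead of a storage-layer error several layers down.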
Idempotent put at any committed slot makes `migrate_database` retries safe after a mid-loop crash. The previous put accepted re-puts only at exactly `highest_written_slot`; on retry, slot 0 < highest fired out-of-order. Now any committed slot accepts an identical-value re-put; mismatched values and skipped-slot fills still error.

New `COLD_BACKEND_KEY` in `BeaconMeta` pins the backend kind on first open and refuses mismatched re-opens (Static and Kv on-disk layouts are incompatible).

`reconstruct_historic_states` refuses to run under static cold: the slots it would write are below every column's high-water mark.

`max_value_bytes` ratchets upward on open if the build default exceeds disk, so a newer build can write larger records than an older one persisted, and re-persists immediately for stable re-opens.

Per-column files renamed `static_blocks_*` -> `data_*`, `static_blocks.conf` -> `column.conf`; the literal prefix was misleading after the per-column generalisation.

`kv_cold_store` helper module dropped; `MemoryStore`'s `ColdStore` impl inlined to match `ColdBackend::Kv`. Two impls, no shared helper.

`decompress_record` returns `Result<Vec<u8>>` (was `Result<Option<Vec<u8>>>` with `Some` on every success path).

`TODO(static)` markers added for `iter_from` perf, the migrate-vs-index transient invariant 11 window, invariants 10/11/12 re-review under static cold, and the missing test set.

Spec cleanup: delete `specs/static-blocks.md` (stale, ~60% contradicted the code) and `TODO-static-block-storage.md`. Rewrite the `static_cold.rs` module header as the canonical byte-level format reference (layout, data file, `column.conf`, put contract, recovery).
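The put contract from the first paragraph of this commit can be sketched as follows. This is an in-memory model only: `Column` here keeps records in a `BTreeMap` rather than in data and offset files, and the `PutError` variant names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical error variants for the two rejected cases.
#[derive(Debug, PartialEq, Eq)]
pub enum PutError {
    /// Re-put at a committed slot with different bytes.
    ValueMismatch { slot: u64 },
    /// Write to a skipped slot at or below the high-water mark.
    SkippedSlotFill { slot: u64 },
}

/// In-memory model of one append-only column.
#[derive(Default)]
pub struct Column {
    records: BTreeMap<u64, Vec<u8>>,
}

impl Column {
    pub fn highest_written_slot(&self) -> Option<u64> {
        self.records.keys().next_back().copied()
    }

    pub fn put(&mut self, slot: u64, value: &[u8]) -> Result<(), PutError> {
        match self.highest_written_slot() {
            // At or below the high-water mark: only identical re-puts pass.
            Some(highest) if slot <= highest => match self.records.get(&slot) {
                Some(existing) if existing.as_slice() == value => Ok(()),
                Some(_) => Err(PutError::ValueMismatch { slot }),
                None => Err(PutError::SkippedSlotFill { slot }),
            },
            // Strictly above the high-water mark (or empty column): append.
            _ => {
                self.records.insert(slot, value.to_vec());
                Ok(())
            }
        }
    }
}
```

Under this contract a retried migration that replays slots it already committed is a sequence of no-ops, while a replay that disagrees with what is on disk surfaces as an error.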
Adds a sibling job to `beacon-chain-tests` that runs `beacon_chain::store_tests::*` with `COLD_BACKEND=static` (and `FORK_NAME=fulu`) to exercise the static slot-keyed cold-DB backend on every CI run. Mirrors the existing job's runner, toolchain, cache, and feature flags (`fork_from_env,slasher/lmdb,portable`). Added to `test-suite-success` so the merge queue blocks on it.
Adds the missing pieces so the static cold archive can serve block-by-root reads without keeping a duplicate in hot indefinitely.

Schema (re-adds what f671da1 dropped):
- `DBColumn::BeaconBlockSlot` (tag `bbs`, 32-byte key, 8-byte SSZ Slot)
- `DBColumnColdIndex::BlockSlot` variant

Migrate (`migrate_database`):
- alongside the existing block-bulk push to `cold.Block`, push the matching `(block_root, slot)` to `cold_block_slot_index` and the `block_root` to `hot_block_delete_roots`
- end-of-loop: `put_index_batch(BlockSlot, ...)` after `ColdStateSummary`, before split commit
- post split commit: `hot_db.do_atomically(deletes)` reclaims hot space for the just-migrated blocks.

Hot delete only runs after cold bytes + cold index are durable, so a crash here leaves cold canonical and reads fall through. KV mode keeps `move_blocks_to_static_cold` false → all the new buffers stay empty → status quo.

Read fallback (`get_block_with`, `block_exists`):
- hot first; on miss, `cold.get_index(BlockSlot, root)` then `cold.get(Block, slot)`.

Missing bulk for an indexed slot raises `MissingFrozenBlock` (corruption). KV mode's empty BlockSlot index makes the fallback always return None on hot miss, identical to before.

Invariant 10 (`check_cold_block_root_indices`):
- now uses `self.block_exists(&block_root)` (the public read with cold fallback) instead of the bare `hot_db.key_exists(...)`.

Required because hot-delete makes the bare hot check fire spuriously for every migrated slot under Static cold.

Init-path coverage:
- Genesis + KV: cold writes gated off, BlockSlot empty, fallback always None on hot miss. Status quo.
- Genesis + Static: migrate writes block + index to cold, deletes from hot. Reads ≥ split.slot hit hot; < split.slot hit cold via fallback.
- Era + Static: hot has only post-anchor blocks. cold has 0..S from era (future era-import path) + post-S from migrate. Fallback is the read path for slot < S.
- Ckpt + KV: BlockSlot empty as in Genesis + KV. Backfill fills hot.
- Ckpt + Static (no era): rejected by the existing WSS guard.
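The read fallback described in this commit can be sketched with three in-memory maps. `Store`, its fields, and `ReadError` are hypothetical stand-ins for the hot DB, the cold `BlockSlot` index, and the cold `Block` column; only the lookup order and the `MissingFrozenBlock` corruption case are taken from the commit message.

```rust
use std::collections::HashMap;

pub type Root = [u8; 32];

#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    /// The index names a slot whose block bytes are absent from cold.
    MissingFrozenBlock { slot: u64 },
}

/// Hypothetical in-memory model of the three lookups involved.
#[derive(Default)]
pub struct Store {
    pub hot: HashMap<Root, Vec<u8>>,
    pub cold_block_slot: HashMap<Root, u64>,
    pub cold_blocks: HashMap<u64, Vec<u8>>,
}

impl Store {
    pub fn get_block(&self, root: &Root) -> Result<Option<Vec<u8>>, ReadError> {
        // 1. Hot first.
        if let Some(bytes) = self.hot.get(root) {
            return Ok(Some(bytes.clone()));
        }
        // 2. On a hot miss, resolve root -> slot through the cold index.
        let slot = match self.cold_block_slot.get(root) {
            Some(slot) => *slot,
            // Empty index (for example KV mode): plain miss, as before.
            None => return Ok(None),
        };
        // 3. Fetch the slot-keyed bytes; an indexed slot without bytes is corruption.
        match self.cold_blocks.get(&slot) {
            Some(bytes) => Ok(Some(bytes.clone())),
            None => Err(ReadError::MissingFrozenBlock { slot }),
        }
    }
}
```

Because the index is only written after the block bytes are durable, the error arm should be unreachable in normal operation, which is why it is treated as corruption rather than as a miss.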
`make cli-local` after `e259a5157b` introduced `--cold-backend` without touching `book/src/help_bn.md`, so `cli-check` failed on every push.
Re-added in `bbc3badfd2` (`BeaconBlockSlot`); the hardcoded snapshot in `check_db_columns` wasn't updated, so the test asserted on a stale list.
End-to-end mainnet ERA import: 51.27 h → 1.22 h (42× speedup)
Combined this PR (#75) with the ERA-import lcli (sigp#9273 / #69) plus the
Compared against the tuned KV-cold-backend run (
The custom blinder is the second large win on top of the
Pre-Bellatrix: trivial passthrough since
Per-phase tracing breakdown across all 1260 eras:
Per-era CSV in same shape as the KV reference:
Architectural follow-up: phase-2 reconstruction is incompatible with monotonic-forward writes
After phase 1 of the ERA importer finishes (era-boundary states written to
Sequentialising the era loop (this branch tries that) doesn't help: the conflict is between phase-1 boundary writes and phase-2 intermediate writes, not between parallel workers.
Two viable directions, both real design changes:
(1) is the smaller change and preserves the existing API. Just a flag in the
This blocker doesn't affect the headline 42× phase-1 speedup (block + state-boundary writes work fine), but it's the gating issue for the static backend reaching feature parity with the KV cold backend.
Custom transactions tree-hasher:
Follow-up to the end-to-end ERA-import experiment above. Inside the custom blinder, the dominant remaining cost is hashing the transactions list. Replaced with a direct-byte hasher that walks the SSZ
Microbench (
Withdrawals were also benchmarked: the typed path is already in the noise (~5 ns difference), so a hand-rolled withdrawals hasher isn't worth the maintenance burden. Only the transactions one ships.
End-to-end blinder bench after integrating the new hasher into
The PR branch
PR-creation URL: https://github.com/dapplion/lighthouse/pull/new/transactions-tree-hash-from-ssz-bytes
Replace the per-slot fsync loop in `put_batch` with one fsync per file: items are grouped by file_id, all records appended through a BufWriter, then a single sync_all for the data file, all offsets written, single sync_all for the offset file, and a single atomic config commit per batch. Same caller-visible "batch durable on return" contract. For an 8192-item batch (one ERA's worth of slot-keyed writes) this drops fsync count from ~32k (4 per slot) to ~3, with measured speedups between 155x and 775x per column on /mnt/ssd NVMe. Spec updated to reflect the batched semantics.
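A rough sketch of the batching idea, assuming a much simpler on-disk layout than the real one. The `data_{file_id}` and `.off` names echo the file names mentioned elsewhere in this PR, but the record framing, the `(slot, offset)` sidecar entries, and `SLOTS_PER_FILE` are hypothetical, and the final atomic conf commit is left out. The function returns the number of fsyncs issued so the effect of grouping is visible.

```rust
use std::collections::BTreeMap;
use std::fs::OpenOptions;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical number of slots covered by one data file.
pub const SLOTS_PER_FILE: u64 = 8192;

/// Appends every record of a batch with one fsync per touched file
/// (data file, then offset sidecar). Returns the fsync count.
pub fn put_batch(dir: &Path, items: &[(u64, Vec<u8>)]) -> io::Result<usize> {
    // Group items by the file they belong to.
    let mut by_file: BTreeMap<u64, Vec<&(u64, Vec<u8>)>> = BTreeMap::new();
    for item in items {
        by_file.entry(item.0 / SLOTS_PER_FILE).or_default().push(item);
    }

    let mut fsyncs = 0;
    for (file_id, group) in by_file {
        // Append all records through a buffer, then sync the data file once.
        let data_path = dir.join(format!("data_{file_id}"));
        let data = OpenOptions::new().create(true).append(true).open(data_path)?;
        let mut pos = data.metadata()?.len();
        let mut writer = BufWriter::new(data);
        let mut offsets = Vec::with_capacity(group.len());
        for (slot, bytes) in group {
            writer.write_all(bytes)?;
            offsets.push((*slot, pos));
            pos += bytes.len() as u64;
        }
        let data = writer.into_inner().map_err(|e| e.into_error())?;
        data.sync_all()?;
        fsyncs += 1;

        // Only after the data is durable, write all offsets and sync once.
        let off_path = dir.join(format!("data_{file_id}.off"));
        let off = OpenOptions::new().create(true).append(true).open(off_path)?;
        let mut writer = BufWriter::new(off);
        for (slot, offset) in &offsets {
            writer.write_all(&slot.to_le_bytes())?;
            writer.write_all(&offset.to_le_bytes())?;
        }
        let off = writer.into_inner().map_err(|e| e.into_error())?;
        off.sync_all()?;
        fsyncs += 1;
    }
    Ok(fsyncs)
}
```

In this simplified model a batch that falls inside one file costs two fsyncs regardless of how many records it carries; the conf commit described in the commit message would add one more.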
I've added unit tests for StaticColdStore covering the open/get/put/put_batch paths, crash recovery via heal_current_file, and the idempotent re-put invariant (PR #78).

I'm looking at the monotonic-write blocker for ERA reconstruction next. Your Option A (allow random-slot writes within existing file_ids, with a per-slot "is this offset already populated?" check and a conf flag for the new mode) makes sense to me as the smaller change. But I want to make sure I understand the crash window before implementing.

Currently heal_current_file truncates the data file to current_data_len from the conf, then clears offsets beyond highest_written_slot. Under backfill mode, backfilled data is appended past current_data_len for slots below highest_written_slot. If a crash happens after the data append and offset write but before the conf update, heal will truncate the data file, but the backfilled slot's offset entry is below highest_written_slot so it

Is the intended design to track the backfill extent separately in the conf (e.g. a backfill_data_len field), or to change heal_current_file to scan offsets and determine the true data extent in backfill mode? Or is there another approach I'm not seeing?
I've implemented the backfill mode discussed in the architectural follow-up (Option A) as PR #79 against this branch.

The crash window I asked about earlier is handled by tracking the backfill file separately in the conf (backfill_file_id + backfill_data_len), plus a scan_and_zero_dangling_offsets pass on heal that catches any offset pointing past the committed data length. This covers the case where data + offset are written but the conf isn't updated before a crash.

Conf format bumped from LHSTBLK2 (36 bytes) to LHSTBLK3 (52 bytes) with backward-compatible open of old-format stores. 19 tests covering backfill behavior and crash recovery.

This is storage-layer only; the reconstruct.rs integration depends on sigp#9273 being rebased onto this.
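For readers following the crash-window discussion, a sketch of what a dangling-offset scan could look like. The function name comes from the comment above; the in-memory `&mut [u64]` offset table and the convention that offset 0 means "slot empty" are assumptions, not the layout used in PR #79.

```rust
/// Zeroes every populated offset that points at or beyond the committed data
/// length and returns how many entries were cleared.
/// Assumption: an offset of 0 marks an empty slot.
pub fn scan_and_zero_dangling_offsets(offsets: &mut [u64], committed_data_len: u64) -> usize {
    let mut cleared = 0;
    for offset in offsets.iter_mut() {
        if *offset != 0 && *offset >= committed_data_len {
            // The record behind this entry was never committed; after the data
            // file is truncated it would point at nothing.
            *offset = 0;
            cleared += 1;
        }
    }
    cleared
}
```

Scanning the whole table rather than only entries above the highest written slot is what lets heal catch backfilled slots, whose offset entries sit below the high-water mark.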
Summary
EDIT: Notes from Geth on how they achieve crash safety for static files. I believe my solution already impls something similar, but would be good to double check https://gist.github.com/rjl493456442/39ca1b37c4fdaa4e71e6e9c5eba34050
Generalises the cold DB so the same `HotColdDB` can sit on either the existing KV cold backend or a new file-based static archive. Lays out the static-cold spec and the file format; ships the file backend (`StaticColdStore`) but does not wire it into `HotColdDB` yet; that integration lands in a follow-up.

`ColdStore<E>` trait (replaces the byte-keyed `Cold: ItemStore<E>` bound)
- Slot-keyed columns: `get`, `put_batch`, `contains`, `iter_from`, `sync`.
- Tight `DBColumnColdIndex` enum (`BlockSlot`, `ColdStateSummary`): `get_index`, `put_index_batch`. Owned by the cold backend, not spilled into the hot DB.
- KV backends (`BeaconNodeBackend`, `MemoryStore`) implement `ColdStore` by translating `Slot`/`Hash256` keys into the underlying `KeyValueStore` byte API.
- The `store_cold_state*` helpers take separate slot-keyed and root-index buffers; cold bulk lands first, indices after, so a crash leaves no dangling indices.

`StaticColdStore`
- Keyed by the tight `DBColumnCold` enum (`Block`, `BlockRoots`, `StateRoots`, `StateSnapshot`, `StateDiff`). No `DBColumn` references inside `static_cold.rs`.
- Per-column subdirectory (`<root>/{blk,bbr,bsr,bss,bsd}/`), opened eagerly at boot. Frozen `HashMap<DBColumnCold, Column>` after construction; only the per-column writer-state mutex on the hot path.
- Per-column settings (`record_type`, `compression`, `max_value_bytes`) come from a build-time `column_config` table on first creation, then are persisted in each column's conf, and the on-disk values win on re-open. Conf magic bumped to `LHSTBLK2`. Future builds with different defaults stay backward-compatible with existing data.
- `BeaconStateDiff` uses `compression: false`: `HDiff` is already compressed internally (zstd'd validator and balance chunks; xdelta3 state diff), so snappy on top is wasteful.

Removes `lighthouse db prune-states` / `prune_historic_states`
Per `specs/static-cold-backend.md`, the mode they produce ("cold blocks present, cold states absent") isn't in the startup-path table and the spec doesn't support runtime mode transitions in either direction. `full_state_pruning_enabled` goes with it.

Specs
- `specs/static-cold-backend.md`: pluggable cold backend design (modes, ownership, writers, availability rules, removed APIs, backend API surface).
- `specs/static-blocks.md`: slot-keyed file format for blocks (kept as the baseline format; generalised by `column_config` for the other cold columns).
- `specs/era-storage.md`: slot-keyed blob archive design (no implementation in this PR).

Test plan
- `cargo check --workspace --tests` passes (verified locally).
- `cargo nextest run -p store`: exercises `MemoryStore` as a `ColdStore` via the existing tests.
- `cargo nextest run -p beacon_chain --test store_tests`: migration paths still pass without the `prune_historic_states` test.
- `StaticColdStore`: write blocks/state diffs/snapshots/roots at ascending slots in each column, re-open, read them back. Crash-recovery exercise: kill mid-`put`, verify `heal_current_file` truncates uncommitted data.