Generalize cold DB: ColdStore trait + slot-keyed static archive#75

Open
dapplion wants to merge 24 commits into unstable from static-files-generalization-spec

@dapplion dapplion commented May 8, 2026

Summary

EDIT: Notes from Geth on how they achieve crash safety for static files: https://gist.github.com/rjl493456442/39ca1b37c4fdaa4e71e6e9c5eba34050. I believe my solution already implements something similar, but it would be good to double-check.

Generalises the cold DB so the same HotColdDB can sit on either the existing
KV cold backend or a new file-based static archive. Lays out the static-cold
spec and the file format; ships the file backend (StaticColdStore) but does
not wire it into HotColdDB yet — that integration lands in a follow-up.

ColdStore<E> trait (replaces the byte-keyed Cold: ItemStore<E> bound)

  • Slot-typed bulk: get, put_batch, contains, iter_from, sync.
  • Root-keyed indices via a tight DBColumnColdIndex enum (BlockSlot,
    ColdStateSummary): get_index, put_index_batch. Owned by the cold
    backend, not spilled into the hot DB.
  • KV backends (BeaconNodeBackend, MemoryStore) implement ColdStore by
    translating Slot/Hash256 keys into the underlying KeyValueStore byte
    API.
  • Migration helpers (store_cold_state*) take separate slot-keyed and
    root-index buffers; cold bulk lands first, indices after — so a crash leaves
    no dangling indices.
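
A rough sketch of the trait shape described above, using simplified stand-in types. The `u64` slot, the `[u8; 32]` root, the `MemCold` store and the `store_then_index` helper are all assumptions made for illustration; the real signatures in the PR are generic over `E` and return `Result`s.

```rust
use std::collections::{BTreeMap, HashMap};

// Simplified stand-ins for the real Slot / Hash256 types (assumption).
type Slot = u64;
type Root = [u8; 32];

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum DBColumnCold { Block, BlockRoots, StateRoots, StateSnapshot, StateDiff }

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum DBColumnColdIndex { BlockSlot, ColdStateSummary }

// Hypothetical reduction of the trait: slot-typed bulk plus root-keyed indices.
trait ColdStore {
    fn get(&self, col: DBColumnCold, slot: Slot) -> Option<Vec<u8>>;
    fn put_batch(&mut self, col: DBColumnCold, items: Vec<(Slot, Vec<u8>)>);
    fn contains(&self, col: DBColumnCold, slot: Slot) -> bool;
    fn get_index(&self, col: DBColumnColdIndex, root: &Root) -> Option<Slot>;
    fn put_index_batch(&mut self, col: DBColumnColdIndex, items: Vec<(Root, Slot)>);
}

// In-memory backend, loosely playing the role MemoryStore plays in tests.
#[derive(Default)]
struct MemCold {
    bulk: HashMap<DBColumnCold, BTreeMap<Slot, Vec<u8>>>,
    index: HashMap<DBColumnColdIndex, HashMap<Root, Slot>>,
}

impl ColdStore for MemCold {
    fn get(&self, col: DBColumnCold, slot: Slot) -> Option<Vec<u8>> {
        self.bulk.get(&col)?.get(&slot).cloned()
    }
    fn put_batch(&mut self, col: DBColumnCold, items: Vec<(Slot, Vec<u8>)>) {
        self.bulk.entry(col).or_default().extend(items);
    }
    fn contains(&self, col: DBColumnCold, slot: Slot) -> bool {
        self.bulk.get(&col).map_or(false, |c| c.contains_key(&slot))
    }
    fn get_index(&self, col: DBColumnColdIndex, root: &Root) -> Option<Slot> {
        self.index.get(&col)?.get(root).copied()
    }
    fn put_index_batch(&mut self, col: DBColumnColdIndex, items: Vec<(Root, Slot)>) {
        self.index.entry(col).or_default().extend(items);
    }
}

// Bulk first, index second: a crash between the two calls leaves unindexed
// data, never an index entry pointing at missing data.
fn store_then_index<S: ColdStore>(store: &mut S, slot: Slot, root: Root, state: Vec<u8>) {
    store.put_batch(DBColumnCold::StateSnapshot, vec![(slot, state)]);
    store.put_index_batch(DBColumnColdIndex::ColdStateSummary, vec![(root, slot)]);
}
```

Calling `store_then_index(&mut store, 64, root, bytes)` on a `MemCold` makes both `get_index(ColdStateSummary, &root)` and `get(StateSnapshot, 64)` resolve, which is the pairing the migration ordering is meant to protect.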

StaticColdStore

  • One type, all slot-keyed cold columns dispatched by a tight DBColumnCold
    enum (Block, BlockRoots, StateRoots, StateSnapshot, StateDiff).
    No DBColumn references inside static_cold.rs.
  • Per-column subdirectory (<root>/{blk,bbr,bsr,bss,bsd}/), opened eagerly
    at boot. Frozen HashMap<DBColumnCold, Column> after construction; only
    per-column writer-state mutex on the hot path.
  • Per-column settings (record_type, compression, max_value_bytes) come
    from a build-time column_config table on first creation, then are
    persisted in each column's conf and the on-disk values win on re-open. Conf
    magic bumped to LHSTBLK2. Future builds with different defaults stay
    backward-compatible with existing data.
  • BeaconStateDiff uses compression: false. HDiff is already compressed
    internally (zstd'd validator and balance chunks; xdelta3 state diff), so
    snappy on top is wasteful.
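
The column dispatch and the "on-disk values win" rule can be sketched roughly as follows. The pairing of directory names with variants is assumed from the order in which both are listed above, and the `ColumnConfig` fields, default values and `effective_config` helper are illustrative rather than the PR's actual definitions.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DBColumnCold { Block, BlockRoots, StateRoots, StateSnapshot, StateDiff }

impl DBColumnCold {
    // Subdirectory under <root>/ for this column (pairing assumed from list order).
    fn dir_name(self) -> &'static str {
        match self {
            DBColumnCold::Block => "blk",
            DBColumnCold::BlockRoots => "bbr",
            DBColumnCold::StateRoots => "bsr",
            DBColumnCold::StateSnapshot => "bss",
            DBColumnCold::StateDiff => "bsd",
        }
    }
}

// Reduced config: the real one also carries a record_type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ColumnConfig {
    compression: bool,
    max_value_bytes: u64,
}

// Hypothetical build-time defaults. Only "StateDiff is uncompressed" comes
// from the description; the other values are placeholders.
fn column_config(col: DBColumnCold) -> ColumnConfig {
    ColumnConfig {
        compression: col != DBColumnCold::StateDiff,
        max_value_bytes: 1 << 30,
    }
}

// First creation uses the build defaults; on re-open the persisted conf wins,
// so a build with different defaults still reads existing data correctly.
fn effective_config(col: DBColumnCold, persisted: Option<ColumnConfig>) -> ColumnConfig {
    persisted.unwrap_or_else(|| column_config(col))
}
```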

Removes lighthouse db prune-states / prune_historic_states

Per specs/static-cold-backend.md, the mode they produce ("cold blocks
present, cold states absent") isn't in the startup-path table and the spec
doesn't support runtime mode transitions in either direction.
full_state_pruning_enabled goes with it.

Specs

  • specs/static-cold-backend.md — pluggable cold backend design (modes,
    ownership, writers, availability rules, removed APIs, backend API surface).
  • specs/static-blocks.md — slot-keyed file format for blocks (kept as the
    baseline format; generalised by column_config for the other cold columns).
  • specs/era-storage.md — slot-keyed blob archive design (no implementation
    in this PR).

Test plan

  • cargo check --workspace --tests passes (verified locally).
  • cargo nextest run -p store — exercises MemoryStore as a ColdStore
    via the existing tests.
  • cargo nextest run -p beacon_chain --test store_tests — migration paths
    still pass without the prune_historic_states test.
  • Manual: open a fresh StaticColdStore, write blocks/state diffs/
    snapshots/roots at ascending slots in each column, re-open, read them
    back. Crash-recovery exercise: kill mid-put, verify heal_current_file
    truncates uncommitted data.

dapplion added 8 commits May 8, 2026 19:36
Add a slot-keyed durable archive (`StaticBlockStore`) for finalized blinded
blocks, integrated into `migrate_database` as a second pass that runs
alongside the existing cold-state migration. File format and manifest
persistence remain `todo!()` — this is the wiring scaffold.

- New `DBColumn::BeaconBlockSlot` reverse index (root → slot).
- `HotColdDB::get_block_with` and `block_exists` fall through to the
  archive after a hot-KV miss.
- Archival driven inside `migrate_database`: cold ops (BeaconBlockRoots +
  BeaconBlockSlot) commit atomically, hot deletes after split commit.
- Skip-slot dedup seeded from `BeaconBlockRoots[current_split.slot - 1]`,
  with `Hash256::ZERO` for the genesis case.
- Spec at `specs/static-blocks.md`.
Companion document describing the static-file backend for `BlobSidecar`
archival via `.erb` files. Initialization via genesis sync or imported
era files; checkpoint sync and P2P blob backfill rejected at startup.
Replaces the byte-keyed Cold: ItemStore<E> bound on HotColdDB with a slot-typed
ColdStore<E> trait: get/put_batch/exists/iter_from for slot-keyed columns plus
get_index/put_index_batch over a tight DBColumnColdIndex enum (BlockSlot,
ColdStateSummary). KV backends (BeaconNodeBackend, MemoryStore) implement it
by translating slot/root keys into the existing KeyValueStore byte API.

StaticBlockStore generalised to StaticColdStore: one type, columns dispatched
on each call. Per-column subdirectory; per-column settings (record_type,
compression, max_decompressed) come from a build-time column_config table on
first creation and are persisted in each column's conf so future builds with
different defaults stay compatible. Conf magic bumped to LHSTBLK2.

Removes prune_historic_states + the lighthouse db prune-states CLI: the mode
they produce ("cold blocks present, cold states absent") isn't in the
startup-path table in specs/static-cold-backend.md and the spec doesn't
support runtime mode transitions. full_state_pruning_enabled goes with it.

Other: store_cold_state* helpers take separate slot-keyed and root-index
buffers; migration writes slot-keyed cold data first, root indices after, so
a crash leaves no dangling indices.
- Move beacon_node/store/src/static_blocks.rs to static_cold.rs (the type
  is no longer block-specific).
- Add DBColumnCold (slot-keyed cold columns) alongside DBColumnColdIndex.
  StaticColdStore is keyed by DBColumnCold all the way through; no DBColumn
  conversion happens inside static_cold.rs. column_config returns a plain
  ColumnConfig (was Option) and UnsupportedColumn errors go away — the
  tighter enum makes them unrepresentable.
- Eager-open every cold column at boot, freeze the columns map. No outer
  Mutex/RwLock; the per-column writer state mutex is the only sync point.
- Rename ColumnConfig::max_decompressed -> max_value_bytes (it bounds the
  raw payload size on uncompressed reads too, defending against corrupt
  headers).
- BeaconStateDiff: compression: false. HDiff is already compressed
  internally (zstd'd validator/balance chunks) so snappy on top is wasteful.
@dapplion dapplion requested a review from michaelsproul as a code owner May 8, 2026 17:42
dapplion added 15 commits May 8, 2026 20:04
The slot-keyed methods on ColdStore (get/put_batch/contains/iter_from) now
take the tight DBColumnCold enum instead of DBColumn, mirroring the existing
DBColumnColdIndex shape on the index methods. This drops DBColumn from
static_cold.rs entirely.

KV backend impls (BeaconNodeBackend, MemoryStore) translate via
column.db_column(). FrozenForwardsIterator::new still accepts DBColumn at
the public boundary and converts at the call to cold_db.iter_from.

Also: delete static_blobs.rs (was a stub returning Unsupported on every
call, with no callers). Revert noise renames (io_batch, cold_db_block_ops,
cold_db_state_ops, ops, .map_err(|e| e.into())) to keep the diff against
unstable focused on real semantic changes.
`BeaconBlockSlot` (and the `DBColumnColdIndex::BlockSlot` variant that wrapped
it) was added for a static-archive read-fallback path that was removed earlier
in this branch. Nothing writes or reads it now, so drop the variant from the
DBColumn enum, the matching DBColumnColdIndex variant, the
`MissingFrozenBlockSlot` error, and the corresponding key_size match arm.

Rewrite TODO-static-block-storage.md to reflect the current branch state:
the static-cold generalization is in, the prune-states removal is in, and the
remaining work is cold-backend selection (flag), review of block read/write
paths now that BeaconBlockSlot is gone, an invariants review, and tests.
The two explicit impls (BeaconNodeBackend, MemoryStore) were identical
boilerplate translating slot/root keys into the underlying byte-keyed
KeyValueStore. Replace with a single blanket impl in lib.rs.

Forecloses a future ColdStore impl that isn't a KeyValueStore (e.g. wiring
StaticColdStore directly as the Cold parameter); reversible if/when that
becomes wanted.
The blanket `ColdStore` impl writes `slot.as_ssz_bytes()` for
`BeaconColdStateSummary`, where older releases wrote SSZ-encoded
`ColdStateSummary { slot }`. The two encodings are byte-identical (an SSZ
container of one fixed-size field equals the field), but the equality is
load-bearing for read compatibility with existing databases. Add a
regression test that pins it.
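
The encoding equality can be illustrated without the `ssz` crate. This is a hand-rolled sketch, not the regression test from the commit: `ssz_encode_u64` and the local `ColdStateSummary` are stand-ins that follow the SSZ rules for a `uint64` and for a container of fixed-size fields.

```rust
// SSZ encodes a uint64 as 8 little-endian bytes.
fn ssz_encode_u64(value: u64) -> Vec<u8> {
    value.to_le_bytes().to_vec()
}

// Stand-in for the older `ColdStateSummary { slot }` container.
struct ColdStateSummary {
    slot: u64,
}

impl ColdStateSummary {
    // A container whose fields are all fixed-size serializes as the plain
    // concatenation of its fields: no offsets, no length prefix.
    fn as_ssz_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8);
        out.extend_from_slice(&ssz_encode_u64(self.slot));
        out
    }
}
```

With a single `u64` field, the container bytes and the bare slot bytes are therefore the same 8 bytes, which is the property the regression test pins.
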
The slot-walk rewrite of `check_cold_state_diff_consistency` was forced by
not having an index iterator on the trait. Add `iter_index(col)` (yields
`(Hash256, Slot)`) and restore the invariant to iterating
`BeaconColdStateSummary` directly, matching unstable's structure modulo
the slot-typed API.
Replace the two-buffer (slot-keyed data + state-root index) helper signatures
with a single `&mut ColdBatch` and add `commit_cold_batch` that flushes data,
syncs, then commits the index — encoding the data-before-index ordering at
the API.

`put_state` and `reconstruct.rs` collapse to "build batch, commit batch."
The migration loop keeps a top-level summary index that accumulates across
states and is flushed at end-of-migration; per-iteration data still goes
through `commit_cold_data` (renamed from `commit_cold_items`).
Drops the `KeyValueStore -> ColdStore` blanket and replaces it with an
explicit per-backend impl. `BeaconNodeBackend` no longer impls `ColdStore`
directly — its byte-translation is inlined inside the `ColdBackend::Kv` arm
where it's actually used. `MemoryStore` keeps an explicit impl (still used
as the Cold parameter in tests via `EphemeralHarnessType`).

`ColdBackend<E>` is a new enum with `Kv(BeaconNodeBackend)` /
`Static(StaticColdStore)` variants, picked at startup from
`StoreConfig::cold_backend` (default `Kv`). Production type signatures swap
the second `BeaconNodeBackend<E>` slot to `ColdBackend<E>` (3 production
sites, 6 test sites, 3 database_manager sites).

`StaticColdBackend<E>` wrapper from the previous commit collapsed into a
direct `impl<E> ColdStore<E> for StaticColdStore`. Index methods stub
`Unsupported` for now — wiring the embedded KV is the next piece.
Genesis sync against the static cold backend was failing for two reasons:

1. `BeaconColdStateSummary` and friends are root-keyed indices; the static
   files are slot-keyed. The previous `Unsupported` stubs blocked the very
   first migration. Embed a `BeaconNodeBackend<E>` at `<root>/index/` and
   serve `get_index` / `put_index_batch` / `iter_index` from it. Forwards
   iteration over slot-keyed columns (`iter_from`) is now also implemented
   by walking the column's `.off` sidecar.

2. `BeaconChainBuilder::genesis` pre-writes the genesis block_root to cold
   `BlockRoots` at slot 0, then the first migration writes the same
   (slot, root) again. KV cold accepts the overwrite; the static backend's
   strict-ascending check rejected it. `Column::put` now treats a re-put of
   an identical value at the current highest slot as a no-op, and errors
   only on a value mismatch (a real bug).

Threads `StoreConfig` into `StaticColdStore::open` so the embedded KV picks
up the same backend (`leveldb` / `redb`) and tuning as the hot/blobs DBs.

Adds `genesis_sync_static_cold` covering ~1000 finalized blocks with the
static backend and a load of every cold state through the new index.
Drops the bespoke 1000-block static-cold test and instead has get_store
read the cold backend from COLD_BACKEND=static|kv. CI / local can now run
the existing store_tests suite against either backend without duplicating
test bodies.

Also trims ColdBackendKind to the derives actually exercised today.
Display, EnumString, VariantNames, Copy were forward-looking for the
not-yet-wired --cold-backend CLI flag - re-add when that lands.
The static cold backend is append-only in ascending slot order, so
checkpoint/weak-subjectivity sync (which backfills slots below the anchor)
is fundamentally incompatible. Refuse the combination explicitly in
BeaconChainBuilder::weak_subjectivity_state instead of failing later
with an opaque 'static cold put out of order' error.

The 6 weak_subjectivity_sync_* tests early-return under
COLD_BACKEND=static so the test suite passes against either backend.

Adds the --cold-backend CLI flag (kv|static, default kv) so operators
can opt into the static backend at startup. Re-adds EnumString and
VariantNames on ColdBackendKind for clap parsing.
Idempotent put at any committed slot makes `migrate_database` retries
safe after a mid-loop crash. The previous put accepted re-puts only at
exactly `highest_written_slot`; on retry, slot 0 < highest fired
out-of-order. Now any committed slot accepts an identical-value re-put;
mismatched values and skipped-slot fills still error.
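
A minimal in-memory sketch of the put contract described here. The `Column` struct, the error variants and the `PutOutcome` type are hypothetical; the real `Column::put` works against data and offset files.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum PutError {
    ValueMismatch,   // committed slot, different bytes: a real bug
    SkippedSlotFill, // slot below the high-water mark that was never written
}

#[derive(Debug, PartialEq)]
enum PutOutcome {
    Appended,
    AlreadyPresent,
}

#[derive(Default)]
struct Column {
    records: BTreeMap<u64, Vec<u8>>, // committed slot -> value
    highest_written_slot: Option<u64>,
}

impl Column {
    fn put(&mut self, slot: u64, value: &[u8]) -> Result<PutOutcome, PutError> {
        match self.highest_written_slot {
            // At or below the high-water mark: only an identical re-put is accepted.
            Some(highest) if slot <= highest => match self.records.get(&slot) {
                Some(existing) if existing.as_slice() == value => Ok(PutOutcome::AlreadyPresent),
                Some(_) => Err(PutError::ValueMismatch),
                None => Err(PutError::SkippedSlotFill),
            },
            // Strictly ascending append.
            _ => {
                self.records.insert(slot, value.to_vec());
                self.highest_written_slot = Some(slot);
                Ok(PutOutcome::Appended)
            }
        }
    }
}
```

Under this contract a retried migration that replays slot 0 after slot 5 was committed is a no-op, while a different value at slot 0 or a write to a skipped slot 3 still errors.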

New `COLD_BACKEND_KEY` in `BeaconMeta` pins the backend kind on first
open and refuses mismatched re-opens (Static and Kv on-disk layouts are
incompatible). `reconstruct_historic_states` refuses to run under
static cold — the slots it would write are below every column's
high-water mark.
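
The pinning rule can be sketched as below. The `pin_backend` function and the `Option` standing in for the `COLD_BACKEND_KEY` entry in `BeaconMeta` are assumptions; only the "pin on first open, refuse mismatches" behaviour comes from the commit message.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ColdBackendKind {
    Kv,
    Static,
}

// `stored` models the persisted metadata entry: None on a fresh database.
fn pin_backend(
    stored: &mut Option<ColdBackendKind>,
    requested: ColdBackendKind,
) -> Result<(), String> {
    match *stored {
        // First open: record the requested kind.
        None => {
            *stored = Some(requested);
            Ok(())
        }
        // Re-open with the same kind: fine.
        Some(kind) if kind == requested => Ok(()),
        // Re-open with a different kind: the on-disk layouts are incompatible.
        Some(kind) => Err(format!(
            "cold backend mismatch: database was created with {:?}, requested {:?}",
            kind, requested
        )),
    }
}
```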

`max_value_bytes` ratchets upward on open if the build default exceeds
disk, so a newer build can write larger records than an older one
persisted, and re-persists immediately for stable re-opens.

Per-column files renamed `static_blocks_*` -> `data_*`,
`static_blocks.conf` -> `column.conf` — the literal prefix was
misleading after the per-column generalisation.

`kv_cold_store` helper module dropped; `MemoryStore`'s `ColdStore` impl
inlined to match `ColdBackend::Kv`. Two impls, no shared helper.
`decompress_record` returns `Result<Vec<u8>>` (was `Result<Option<Vec<u8>>>`
with `Some` on every success path).

`TODO(static)` markers added for `iter_from` perf, the migrate-vs-index
transient invariant 11 window, invariants 10/11/12 re-review under
static cold, and the missing test set.

Spec cleanup: delete `specs/static-blocks.md` (stale, ~60% contradicted
the code) and `TODO-static-block-storage.md`. Rewrite the
`static_cold.rs` module header as the canonical byte-level format
reference (layout, data file, `column.conf`, put contract, recovery).
Adds a sibling job to `beacon-chain-tests` that runs
`beacon_chain::store_tests::*` with `COLD_BACKEND=static` (and `FORK_NAME=fulu`)
to exercise the static slot-keyed cold-DB backend on every CI run. Mirrors the
existing job's runner, toolchain, cache, and feature flags
(`fork_from_env,slasher/lmdb,portable`). Added to `test-suite-success` so the
merge queue blocks on it.
Adds the missing pieces so the static cold archive can serve block-by-root
reads without keeping a duplicate in hot indefinitely.

Schema (re-adds what f671da1 dropped):
- `DBColumn::BeaconBlockSlot` (tag `bbs`, 32-byte key, 8-byte SSZ Slot)
- `DBColumnColdIndex::BlockSlot` variant

Migrate (`migrate_database`):
- alongside the existing block-bulk push to `cold.Block`, push the matching
  `(block_root, slot)` to `cold_block_slot_index` and the `block_root` to
  `hot_block_delete_roots`
- end-of-loop: `put_index_batch(BlockSlot, ...)` after `ColdStateSummary`,
  before split commit
- post split commit: `hot_db.do_atomically(deletes)` reclaims hot space for
  the just-migrated blocks. Hot delete only runs after cold bytes + cold
  index are durable, so a crash here leaves cold canonical and reads fall
  through. KV mode keeps `move_blocks_to_static_cold` false → all the new
  buffers stay empty → status quo.

Read fallback (`get_block_with`, `block_exists`):
- hot first; on miss, `cold.get_index(BlockSlot, root)` then
  `cold.get(Block, slot)`. Missing bulk for an indexed slot raises
  `MissingFrozenBlock` (corruption). KV mode's empty BlockSlot index makes
  the fallback always return None on hot miss — identical to before.
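
The read fallback above can be sketched with plain maps. The `Stores` struct and `get_block` function are hypothetical stand-ins for `HotColdDB` and `get_block_with`; the lookup order and the corruption error follow the description.

```rust
use std::collections::HashMap;

type Root = [u8; 32];

#[derive(Debug, PartialEq)]
enum ReadError {
    // The index names a slot but the slot-keyed bulk has no record for it.
    MissingFrozenBlock { slot: u64 },
}

struct Stores {
    hot: HashMap<Root, Vec<u8>>,        // hot block column, keyed by root
    cold_block_slot: HashMap<Root, u64>, // cold BlockSlot index (empty in KV mode)
    cold_blocks: HashMap<u64, Vec<u8>>,  // cold Block column, keyed by slot
}

fn get_block(stores: &Stores, root: &Root) -> Result<Option<Vec<u8>>, ReadError> {
    // 1. Hot first.
    if let Some(bytes) = stores.hot.get(root) {
        return Ok(Some(bytes.clone()));
    }
    // 2. On a hot miss, resolve root -> slot through the cold index.
    let slot = match stores.cold_block_slot.get(root) {
        Some(slot) => *slot,
        None => return Ok(None), // KV mode always ends up here on a hot miss
    };
    // 3. Fetch the bulk record; an indexed slot without bulk is corruption.
    match stores.cold_blocks.get(&slot) {
        Some(bytes) => Ok(Some(bytes.clone())),
        None => Err(ReadError::MissingFrozenBlock { slot }),
    }
}
```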

Invariant 10 (`check_cold_block_root_indices`):
- now uses `self.block_exists(&block_root)` (the public read with cold
  fallback) instead of the bare `hot_db.key_exists(...)`. Required because
  hot-delete makes the bare hot check fire spuriously for every migrated
  slot under Static cold.

Init-path coverage:
- Genesis + KV: cold writes gated off, BlockSlot empty, fallback always
  None on hot miss. Status quo.
- Genesis + Static: migrate writes block + index to cold, deletes from
  hot. Reads ≥ split.slot hit hot; < split.slot hit cold via fallback.
- Era + Static: hot has only post-anchor blocks. cold has 0..S from era
  (future era-import path) + post-S from migrate. Fallback is the read
  path for slot < S.
- Ckpt + KV: BlockSlot empty as in Genesis + KV. Backfill fills hot.
- Ckpt + Static (no era): rejected by the existing WSS guard.
`make cli-local` after `e259a5157b` introduced `--cold-backend` without
touching `book/src/help_bn.md`, so `cli-check` failed on every push.
Re-added in `bbc3badfd2` (`BeaconBlockSlot`); the hardcoded snapshot in
`check_db_columns` wasn't updated, so the test asserted on a stale list.
@dapplion

Column::put_batch: one fsync per file instead of per slot — branch static-cold-batched-fsync

The current ColdStore::put_batch is a one-line for item in items: self.put(item). Each put does 4 fsyncs (data file sync_all, offset file sync_all, then write_config which is tmp.sync_all + rename + dir.sync_all). For an 8192-item batch that's ~32k fsyncs, ~150 s on /mnt/ssd NVMe.

Replaced put_batch with a real batched implementation: group items by file_id, append all records through a 1 MiB BufWriter, then one data_file.sync_all, write all offsets, one off_file.sync_all, one atomic write_config for the whole batch. Same caller-visible "batch durable on return" contract; spec doc updated.
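
The data-file half of that batching might look roughly like the sketch below. It covers a single file only; grouping by file_id, the offset sidecar and the conf commit are left out. The `append_batch` name and the 4-byte length-prefixed record framing are assumptions, not the PR's on-disk format.

```rust
use std::fs::OpenOptions;
use std::io::{self, BufWriter, Write};
use std::path::Path;

// Appends every record through one buffered writer, then issues a single
// sync_all for the whole batch. Returns the start offset of each record.
fn append_batch(path: &Path, records: &[Vec<u8>]) -> io::Result<Vec<u64>> {
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    let mut offset = file.metadata()?.len();
    let mut offsets = Vec::with_capacity(records.len());
    {
        // 1 MiB buffer, as in the description above.
        let mut writer = BufWriter::with_capacity(1 << 20, &file);
        for record in records {
            let len = u32::try_from(record.len())
                .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "record too large"))?;
            offsets.push(offset);
            writer.write_all(&len.to_le_bytes())?; // hypothetical length prefix
            writer.write_all(record)?;
            offset += 4 + u64::from(len);
        }
        writer.flush()?;
    }
    // One fsync per file instead of one per record.
    file.sync_all()?;
    Ok(offsets)
}
```

The durability contract is unchanged from the caller's side: when `append_batch` returns `Ok`, every record in the batch has been synced.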

Microbench (beacon_node/store/examples/static_cold_bench.rs, 8192 items, /mnt/ssd NVMe):

| column | old put loop | new put_batch | speedup |
|---|---|---|---|
| Block | 36.1 s | 0.23 s | 155× |
| BlockRoots | 88.3 s | 0.17 s | 519× |
| StateRoots | 123.9 s | 0.16 s | 775× |

Tests + make lint-full clean. PR-ready: https://github.com/dapplion/lighthouse/pull/new/static-cold-batched-fsync

@dapplion

End-to-end mainnet ERA import: 51.27 h → 1.22 h (42× speedup)

Combined this PR (#75) with the ERA-import lcli (sigp#9273 / #69) plus the put_batch fix plus a custom direct-byte SSZ blinder, and ran a full mainnet ERA import (1260 era files = ~10.3 M slots, eras 0..1260). Branch: experiment-era-static-cold-load.

Compared against the tuned KV-cold-backend run (/mnt/ssd/era-test-logs/era-import-timing.csv, 1258 eras in 51.27 h, same hardware, same ERA file source /mnt/ssd/era-mainnet-nimbus/):

| backend | eras | total | mean s/era |
|---|---|---|---|
| KV (tuned) | 1260 | 51.27 h | 146.5 |
| Static (this branch) | 1260 | 1.22 h | 3.49 |

The custom blinder is the second large win on top of the put_batch fix. ERA files store full SignedBeaconBlocks; the importer wants blinded SSZ for the slot-keyed Block column. Doing the typed parse + clone_as_blinded + as_ssz_bytes round-trip allocates ~hundreds of small heap objects per block (every Attestation's AggregationBits, deposits, sync committee bits, …) which then immediately get discarded. The custom blinder walks BeaconBlockBody SSZ container offsets directly and only typed-decodes Transactions + Withdrawals slices for tree_hash_root. Verified byte-identical against clone_as_blinded().as_ssz_bytes() in beacon_node/beacon_chain/examples/blinder_bench.rs:

| sample | parse + blind + encode | custom blinder | speedup |
|---|---|---|---|
| Capella block (128 KB) | 8.4 ms | 2.1 ms | 4.03× |
| Deneb block (86 KB) | 7.7 ms | 1.15 ms | 6.70× |

Pre-Bellatrix: trivial passthrough, since the FullPayload and BlindedPayload SSZ encodings are identical. Bellatrix and Electra+ fall back to the typed path for now.

Per-phase tracing breakdown across all 1260 eras:

| phase | mean / era | % of parent |
|---|---|---|
| import_era_file (parent) | 3.48 s | 100% |
| era_import_decompress_blocks + era_import_blind_blocks | ~1.49 s combined | ~43% |
| era_import_write_blocks | 0.65 s | 19% |
| era_import_write_state | 0.49 s | 14% |
| era_import_read | 0.42 s | 12% |
| era_import_decode_state | 0.17 s | 5% |
| era_import_write_state_root_index | 0.078 s | 2% |
| era_import_write_block_index | 0.057 s | 2% |

Per-era CSV in same shape as the KV reference: /mnt/ssd/lh-bench/claude-lh-era-files-static/logs/static-import-timing.csv. Disk footprint ends at ~135 GB vs KV ~681 GB (5× smaller — blinded blocks, no LevelDB compaction overhead, append-only files).

@dapplion

Architectural follow-up: phase-2 reconstruction is incompatible with monotonic-forward writes

After phase 1 of the ERA importer finishes (era-boundary states written to StateSnapshot / StateDiff columns at slots 8192, 16384, …, 1260·8192), phase 2 (reconstruct_states_parallel) tries to backfill intermediate states by replaying blocks slot-by-slot. Those writes target slot ranges behind highest_written_slot for the column, which the static archive rejects:

StaticColdStoreError(Invalid("static cold put_batch out of order vs highest_written_slot"))

Sequentialising the era loop (this branch tries that) doesn't help — the conflict is between phase-1 boundary writes and phase-2 intermediate writes, not between parallel workers.

Two viable directions, both real design changes:

  1. Allow random-slot writes within an existing file_id. The .off sidecar is already a fixed-size table indexed by slot % SLOTS_PER_FILE. Replace the strict-monotonic invariant with a per-slot "is this offset already populated?" check, and append data to the data file regardless of arrival order. highest_written_slot becomes "highest committed slot" but doesn't gate writes to lower slots within already-existing files.

  2. Reorder the import: do reconstruction interleaved with phase 1, so each era's intermediate states get written immediately after that era's boundary — fully ascending. Avoids the conflict by construction but requires restructuring import_era_file and would lose the "phase-1 finishes fast and you can keep using the chain while phase 2 grinds" property the legacy KV path has.

(1) is the smaller change and preserves the existing API. Just a flag in the .conf to record the new mode. Happy to draft if you want to take that direction.
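
A rough in-memory sketch of direction (1). The `FileSketch` type, the `SLOTS_PER_FILE` value, the `file_id = slot / SLOTS_PER_FILE` mapping and the error variants are all assumptions made for illustration; crash recovery is not modelled.

```rust
// Illustrative value only; the PR does not state it in this thread.
const SLOTS_PER_FILE: u64 = 8192;

#[derive(Debug, PartialEq)]
enum BackfillError {
    WrongFile,
    AlreadyPopulated,
}

struct FileSketch {
    file_id: u64,
    data: Vec<u8>,
    // Fixed-size table indexed by slot % SLOTS_PER_FILE; None = not populated.
    offsets: Vec<Option<u64>>,
}

impl FileSketch {
    fn new(file_id: u64) -> Self {
        FileSketch {
            file_id,
            data: Vec::new(),
            offsets: vec![None; SLOTS_PER_FILE as usize],
        }
    }

    // Appends in arrival order; the per-slot check replaces the
    // strict-monotonic invariant.
    fn put_any_order(&mut self, slot: u64, value: &[u8]) -> Result<(), BackfillError> {
        if slot / SLOTS_PER_FILE != self.file_id {
            return Err(BackfillError::WrongFile);
        }
        let idx = (slot % SLOTS_PER_FILE) as usize;
        if self.offsets[idx].is_some() {
            return Err(BackfillError::AlreadyPopulated);
        }
        self.offsets[idx] = Some(self.data.len() as u64);
        self.data.extend_from_slice(value);
        Ok(())
    }
}
```

In this model a boundary write at slot 8192 followed by a backfill at slot 8200 and then slot 8195 all succeed, because data placement no longer depends on slot order, only the offset table does.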

This blocker doesn't affect the headline 42× phase-1 speedup (block + state-boundary writes work fine), but it's the gating issue for the static backend reaching feature parity with the KV cold backend.

@dapplion

Custom transactions tree-hasher: transactions_tree_hash_root_from_ssz_bytes — branch transactions-tree-hash-from-ssz-bytes

Follow-up to the end-to-end ERA-import experiment above. Inside the custom blinder, the dominant remaining cost is hashing the transactions list. Transactions::from_ssz_bytes(bytes)?.tree_hash_root() allocates one Vec<u8> per transaction (hundreds per mainnet block) just to throw it away after the hash.

Replaced with a direct-byte hasher that walks the SSZ List<Transaction, MAX_TX> offset table, hashes each transaction's bytes in place via tree_hash::merkle_root, and list-merkleizes the per-tx roots. Output is byte-identical to the typed path.
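
The offset-table walk at the core of that hasher can be sketched as follows. This covers only the slicing step, not the merkleization, and the `split_variable_list` name and error enum are hypothetical. It relies on the SSZ rule that a list of variable-size elements is encoded as one 4-byte little-endian offset per element, followed by the element data.

```rust
#[derive(Debug, PartialEq)]
enum OffsetError {
    Truncated,
    BadFirstOffset,
    OutOfOrder,
    OutOfBounds,
}

// Reads the index-th 4-byte little-endian offset with checked arithmetic.
fn read_offset(bytes: &[u8], index: usize) -> Result<usize, OffsetError> {
    let start = index.checked_mul(4).ok_or(OffsetError::Truncated)?;
    let end = start.checked_add(4).ok_or(OffsetError::Truncated)?;
    let raw = bytes.get(start..end).ok_or(OffsetError::Truncated)?;
    Ok(u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]) as usize)
}

// Splits an SSZ-encoded list of variable-size items into borrowed slices,
// without allocating a Vec<u8> per item.
fn split_variable_list(bytes: &[u8]) -> Result<Vec<&[u8]>, OffsetError> {
    if bytes.is_empty() {
        return Ok(Vec::new());
    }
    // The first offset points just past the offset table, so it also
    // gives the element count.
    let first = read_offset(bytes, 0)?;
    if first == 0 || first % 4 != 0 || first > bytes.len() {
        return Err(OffsetError::BadFirstOffset);
    }
    let count = first / 4;
    let mut items = Vec::with_capacity(count);
    let mut start = first;
    for i in 0..count {
        // Each item ends where the next begins; the last ends at the buffer end.
        let end = if i + 1 < count { read_offset(bytes, i + 1)? } else { bytes.len() };
        if end < start {
            return Err(OffsetError::OutOfOrder);
        }
        items.push(bytes.get(start..end).ok_or(OffsetError::OutOfBounds)?);
        start = end;
    }
    Ok(items)
}
```

Each returned slice would then be hashed in place and the per-item roots merkleized as a list; that part needs a hashing crate and is omitted here.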

Microbench (beacon_chain/examples/hash_bench.rs in experiment-era-static-cold-load, real mainnet Capella + Deneb blocks):

| op | typed | custom | speedup |
|---|---|---|---|
| transactions (Capella, 92.5 KB / ~250 txs) | 1498 µs | 607 µs | 2.47× |
| transactions (Deneb, 46.4 KB / ~150 txs) | 991 µs | 519 µs | 1.91× |
| withdrawals (≤16 × 44 B = 704 B) | 8.2–9.4 µs | 7.9–8.9 µs | 1.04–1.06× |

Withdrawals were also benchmarked: the typed path is already in the noise (~0.3–0.5 µs difference per the table above), so a hand-rolled withdrawals hasher isn't worth the maintenance burden. Only the transactions one ships.

End-to-end blinder bench after integrating the new hasher into era::custom_blinder:

| sample | typed pipeline | with custom blinder + tx hasher | speedup |
|---|---|---|---|
| Capella block (128 KB → 35 KB blinded) | 4.83 ms | 1.58 ms | 3.05× |
| Deneb block (86 KB → 39 KB blinded) | 4.45 ms | 0.93 ms | 4.78× |

The PR branch transactions-tree-hash-from-ssz-bytes is based on unstable, independent of this PR — it's a general-purpose types crate helper for any code that has Transactions SSZ bytes and only needs the root. Returns a real Result<Hash256, ssz::DecodeError> (no panicking slicing or unchecked arithmetic), 6 unit tests covering empty / single / many / mixed-size / single-large / chunk-boundary edges. make lint-full clean.

PR-creation URL: https://github.com/dapplion/lighthouse/pull/new/transactions-tree-hash-from-ssz-bytes

Replace the per-slot fsync loop in `put_batch` with one fsync per file:
items are grouped by file_id, all records appended through a BufWriter,
then a single sync_all for the data file, all offsets written, single
sync_all for the offset file, and a single atomic config commit per
batch.

Same caller-visible "batch durable on return" contract. For an
8192-item batch (one ERA's worth of slot-keyed writes) this drops
fsync count from ~32k (4 per slot) to ~3, with measured speedups
between 155x and 775x per column on /mnt/ssd NVMe.

Spec updated to reflect the batched semantics.

galadd commented May 20, 2026


I've added unit tests for StaticColdStore covering the open/get/put/put_batch paths, crash recovery via heal_current_file, and the idempotent re-put invariant (PR #78).

I'm looking at the monotonic-write blocker for ERA reconstruction next. Your direction (1) (allow random-slot writes within existing file_ids, with a per-slot "is this offset already populated?" check and a conf flag for the new mode) makes sense to me as the smaller change. But I want to make sure I understand the crash window before implementing.

Currently heal_current_file truncates the data file to current_data_len from the conf, then clears offsets beyond highest_written_slot. Under backfill mode, backfilled data is appended past current_data_len for slots below highest_written_slot. If a crash happens after the data append and offset write but before the conf update, heal will truncate the data file, but the backfilled slot's offset entry is below highest_written_slot so it won't be cleared, creating a dangling pointer.

Is the intended design to track the backfill extent separately in the conf (e.g. a backfill_data_len field), or to change heal_current_file to scan offsets and determine the true data extent in backfill mode? Or is there another approach I'm not seeing?


galadd commented Jun 2, 2026

I've implemented the backfill mode discussed in the architectural follow-up (direction 1) as PR #79 against this branch.

The crash window I asked about earlier is handled by tracking the backfill file separately in the conf (backfill_file_id + backfill_data_len), plus a scan_and_zero_dangling_offsets pass on heal that catches any offset pointing past the committed data length. This covers the case where data + offset are written but the conf isn't updated before a crash.
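
One way such a heal pass could work, sketched in memory. The function name `scan_and_zero_dangling_offsets` comes from the comment above, but its body here is a guess and not the code in PR #79. The sketch assumes an offset of 0 is the "empty" sentinel (for example because data files start with a header, so no record begins at offset 0) and that "dangling" means an offset at or past the committed data length.

```rust
// Zeroes every offset that points at or past the committed data length.
// Returns how many entries were cleared.
fn scan_and_zero_dangling_offsets(offsets: &mut [u64], committed_data_len: u64) -> usize {
    let mut zeroed = 0;
    for offset in offsets.iter_mut() {
        // 0 is treated as "not populated" (assumption), so it is left alone.
        if *offset != 0 && *offset >= committed_data_len {
            *offset = 0;
            zeroed += 1;
        }
    }
    zeroed
}

// Hypothetical heal step: drop uncommitted bytes, then clear any offset that
// would now point into the truncated region, regardless of which slot it
// belongs to.
fn heal(data: &mut Vec<u8>, offsets: &mut [u64], committed_data_len: u64) -> usize {
    data.truncate(committed_data_len as usize);
    scan_and_zero_dangling_offsets(offsets, committed_data_len)
}
```

Because the scan ignores slot order, it also catches a backfilled slot below highest_written_slot whose record was appended but never committed to the conf.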

Conf format bumped from LHSTBLK2 (36 bytes) to LHSTBLK3 (52 bytes) with backward-compatible open of old-format stores. 19 tests covering backfill behavior and crash recovery.

This is storage-layer only — the reconstruct.rs integration depends on sigp#9273 being rebased onto this.
