Static block storage prototype #74

Draft
dapplion wants to merge 4 commits into static-block-storage from lh-static-storage

dapplion commented May 8, 2026

Description

Adds a working draft for static block storage:

  • spec for static block files
  • file-backed StaticBlockStore with snappy records and .off offsets
  • HotColdDB static block fallback wiring
  • minimal StaticBlobStore API stub for checking integration shape
  • TODO tracker for remaining work
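
To make the offsets idea concrete, here is a minimal sketch of an append-only record file paired with an offsets file. It is not the StaticBlockStore from this PR: the type name `AppendOnlyStore`, the 8-byte little-endian offset entries, and the file naming are assumptions made for illustration, and snappy compression is left out so the example needs only the standard library.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Hypothetical layout (an assumption, not the PR's spec):
/// `data` holds records back to back, `offsets` holds one little-endian
/// u64 per record giving the byte position where that record starts.
pub struct AppendOnlyStore {
    data: File,
    offsets: File,
}

impl AppendOnlyStore {
    pub fn open(data_path: &Path, offsets_path: &Path) -> io::Result<Self> {
        let open = |p: &Path| OpenOptions::new().read(true).write(true).create(true).open(p);
        Ok(Self { data: open(data_path)?, offsets: open(offsets_path)? })
    }

    /// Number of records, derived from the size of the offsets file.
    pub fn record_count(&self) -> io::Result<u64> {
        Ok(self.offsets.metadata()?.len() / 8)
    }

    /// Appends a record and returns its index.
    pub fn append(&mut self, record: &[u8]) -> io::Result<u64> {
        let index = self.record_count()?;
        let start = self.data.seek(SeekFrom::End(0))?;
        self.data.write_all(record)?;
        self.offsets.seek(SeekFrom::End(0))?;
        self.offsets.write_all(&start.to_le_bytes())?;
        Ok(index)
    }

    /// Reads the record at `index`, or `None` if it does not exist.
    pub fn get(&mut self, index: u64) -> io::Result<Option<Vec<u8>>> {
        let count = self.record_count()?;
        if index >= count {
            return Ok(None);
        }
        let start = self.read_offset(index)?;
        // A record ends where the next one starts, or at end of file.
        let end = if index + 1 < count {
            self.read_offset(index + 1)?
        } else {
            self.data.metadata()?.len()
        };
        let mut buf = vec![0u8; (end - start) as usize];
        self.data.seek(SeekFrom::Start(start))?;
        self.data.read_exact(&mut buf)?;
        Ok(Some(buf))
    }

    fn read_offset(&mut self, index: u64) -> io::Result<u64> {
        let mut raw = [0u8; 8];
        self.offsets.seek(SeekFrom::Start(index * 8))?;
        self.offsets.read_exact(&mut raw)?;
        Ok(u64::from_le_bytes(raw))
    }
}
```

A lookup costs two small reads from the offsets file plus one contiguous read from the data file, and nothing already written is ever rewritten.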

Additional Info

Draft PR for tracking the design/prototype work. Not ready for review.

Testing

  • cargo check -p store
  • pre-push hook: TEST_FEATURES="beacon-node-leveldb,beacon-node-redb," RUSTFLAGS="-C debug-assertions=no " make lint

dapplion commented May 8, 2026

Empirical evidence for moving blocks out of LevelDB

Hit a wall during full-mainnet ERA import (sigp#9273 testing) at era ~960/1741 with the leveldb backend. Sharing the numbers in case they're useful motivation for this PR: they illustrate why static block storage is the right direction.

The numbers

After ~23h of import (~600 GB of chain_db on disk):

  • 303,824 SST files in chain_db/
  • 5.53 TB written, 3.73 TB read by the lighthouse process (/proc/<pid>/io)
  • 399 GB cancelled writes (compaction abandoning work mid-flight)
  • ~10× write amplification on the actual on-disk size
  • Stalls grew progressively: ~30 min → ~50 min → 60+ min between successive era imports
  • During stalls: CPU pinned at ~100% of a single core and io_in_progress=1 on the SSD, the classic signature of single-threaded compaction
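
The ~10× figure can be checked from the numbers in this list. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and treats the ~600 GB on-disk size as the useful payload; the function name is made up for illustration.

```rust
/// Ratio of bytes the process wrote to bytes that ended up on disk.
pub fn write_amplification(bytes_written: f64, bytes_on_disk: f64) -> f64 {
    bytes_written / bytes_on_disk
}

/// Figures from the list above, under the decimal-unit assumption.
pub const BYTES_WRITTEN: f64 = 5.53e12; // 5.53 TB written
pub const BYTES_ON_DISK: f64 = 600e9; // ~600 GB of chain_db
```

`write_amplification(BYTES_WRITTEN, BYTES_ON_DISK)` comes out at about 9.2, which is in line with the ~10× quoted above.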

Why LevelDB is past its design point at this scale

LevelDB was designed by Google around 2011 as a small/medium key-value store (Chrome bookmarks, IndexedDB). Three concrete reasons it falls apart at our scale:

1. Single-threaded background compaction. LevelDB has exactly one background thread doing all compaction work. RocksDB (Facebook's fork) added multi-threaded compaction and, later, subcompactions to cope with write-heavy workloads that a single compaction thread could not keep up with. At 600 GB+ the queue grows faster than one thread can drain it.

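2. Level sizing assumes ≤ ~100 GB. LevelDB's level targets grow 10× per level: L1=10 MB, L2=100 MB, L3=1 GB, L4=10 GB, L5=100 GB, L6=1 TB (L0 is triggered by file count, not size). L1 through L5 together hold roughly 111 GB, so anything beyond that lands in L6, and with 7 levels (config::kNumLevels) the design implicitly caps near 1 TB. L6 compactions are massive: merging an L5 SST into L6 may touch hundreds of overlapping files, all serialised on the one bg thread.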

3. MANIFEST/VersionEdit O(N) overhead. Every SST is tracked in a single MANIFEST file. With 303k SSTs, every version edit (one after each compaction) appends to MANIFEST, and open-time recovery replays the whole edit log. The TableCache (bounded by max_open_files) thrashes: most reads miss the cache and require open() + index parse + filter block load.

Why static block storage helps directly

The bytes-by-column breakdown of chain_db/ showed ~98.65% of bytes are BeaconBlock. So pulling blocks out into the immutable, append-only static format proposed here:

  • Removes the dominant write-amp source (no SST compaction churn for blocks)
  • Drops the MANIFEST entry count by roughly the same factor (fewer SSTs to track)
  • The remaining columns fit comfortably within LevelDB's design envelope (<10 GB)
  • Era-file ingestion becomes a streaming append + offset-index write, no LSM-tree at all
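
As a rough check of the "<10 GB" claim, the sketch below applies the 98.65% block share to the ~600 GB total. The SST estimate additionally assumes that file count scales linearly with bytes, which is a simplification and not something measured here; all names are illustrative.

```rust
/// Share of chain_db bytes that are BeaconBlock, from the breakdown above.
pub const BLOCK_FRACTION: f64 = 0.9865;

/// Bytes left in the key-value store once blocks are moved out.
pub fn remaining_bytes(total_bytes: f64) -> f64 {
    total_bytes * (1.0 - BLOCK_FRACTION)
}

/// SST files left, ASSUMING file count is proportional to bytes.
pub fn remaining_sst_estimate(total_ssts: f64) -> f64 {
    total_ssts * (1.0 - BLOCK_FRACTION)
}
```

With 600 GB (decimal) and 303,824 SSTs as inputs, this gives about 8.1 GB and on the order of 4,100 files.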

The leveldb numbers above are basically the strongest argument for this PR. Happy to share the full raw logs / instrumentation data if useful.
