State cache: spec-derived byte-size estimation and budget-based eviction #71

Open
dapplion wants to merge 18 commits into unstable from dapplion/state-cache-byte-budget
dapplion (Owner) commented Apr 4, 2026

Memory-aware state cache eviction using milhouse's cow_bytes pairwise tree walk.

Each BeaconState carries an ApproxOwnedBytesList: byte counts for the tree memory allocated by each transition. States that share ancestry share the same Arc entries, so total cache memory is the sum over unique entries across all states (deduplicated by Arc pointer).

cow_bytes walks two trees in parallel, skipping shared subtrees via Arc::ptr_eq. It takes ~541 ns per slot transition at 1M validators (vs 458 ms for MemoryTracker). The finalized state's base size is measured once via total_tree_bytes (~25 ms).
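
As a rough illustration of the pairwise walk, here is a minimal Rust sketch on a toy binary tree. The `Node` type and the `build`, `set_leaf`, and `total_bytes` helpers are hypothetical stand-ins, not milhouse's actual types or API; only the `Arc::ptr_eq` short-circuit reflects the description above. The byte count also ignores `Arc`'s reference-count header for simplicity.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Toy persistent binary tree (an assumption for illustration, not milhouse's layout).
enum Node {
    Leaf(u64),
    Branch(Arc<Node>, Arc<Node>),
}

/// Build a perfect tree of the given depth with zeroed leaves.
fn build(depth: u32) -> Arc<Node> {
    if depth == 0 {
        Arc::new(Node::Leaf(0))
    } else {
        Arc::new(Node::Branch(build(depth - 1), build(depth - 1)))
    }
}

/// Copy-on-write update: only nodes on the path to `index` are reallocated.
fn set_leaf(node: &Arc<Node>, depth: u32, index: usize, value: u64) -> Arc<Node> {
    match &**node {
        Node::Leaf(_) => Arc::new(Node::Leaf(value)),
        Node::Branch(left, right) => {
            let half = 1usize << (depth - 1);
            if index < half {
                let new_left = set_leaf(left, depth - 1, index, value);
                Arc::new(Node::Branch(new_left, Arc::clone(right)))
            } else {
                let new_right = set_leaf(right, depth - 1, index - half, value);
                Arc::new(Node::Branch(Arc::clone(left), new_right))
            }
        }
    }
}

/// Bytes of every node in the subtree (no sharing considered).
fn total_bytes(node: &Arc<Node>) -> usize {
    match &**node {
        Node::Leaf(_) => size_of::<Node>(),
        Node::Branch(l, r) => size_of::<Node>() + total_bytes(l) + total_bytes(r),
    }
}

/// Approximate bytes owned by `new` that are not shared with `old`.
fn cow_bytes(old: &Arc<Node>, new: &Arc<Node>) -> usize {
    // Same allocation: the whole subtree is shared, so skip it.
    if Arc::ptr_eq(old, new) {
        return 0;
    }
    match (&**old, &**new) {
        // Both sides branch: count this node, then descend pairwise.
        (Node::Branch(old_l, old_r), Node::Branch(new_l, new_r)) => {
            size_of::<Node>() + cow_bytes(old_l, new_l) + cow_bytes(old_r, new_r)
        }
        // Shapes diverge (or both are leaves): count the whole new subtree.
        _ => total_bytes(new),
    }
}
```

Updating one leaf in a depth-3 tree rewrites a single root-to-leaf path, so `cow_bytes` reports four nodes' worth of bytes while the other eleven nodes stay shared and are never visited.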

Eviction triggers when total_approx_owned_bytes() exceeds --state-cache-max-mb. The committee, sync committee, and epoch caches are included in the measurement.

Depends on sigp/milhouse#100.

@dapplion dapplion requested a review from michaelsproul as a code owner April 4, 2026 02:17
@dapplion dapplion force-pushed the dapplion/state-cache-byte-budget branch 2 times, most recently from 314c195 to 9d2fee8 Compare April 6, 2026 06:32
dapplion and others added 15 commits April 6, 2026 17:18
…viction

Add estimated_marginal_bytes() that uses consensus spec knowledge to
approximate memory cost per cached state — epoch boundary states (~32MB)
vs mid-epoch (~1KB). Track per-state costs and a running cached_bytes sum.

New --state-cache-max-mb flag enables byte-budget eviction alongside the
existing count-based limit. Exposes estimated bytes via Prometheus metric.
…toricalSummary

Required by milhouse's MemoryTracker to measure COW bytes for these
types when stored in tree-backed List/Vector fields.
Formula fixes:
- Cap sparse estimate at full-tree cost (fixes 15x overestimate at 50% dirty)
- Account for Zero-node siblings along spine (fixes epoch boundary underestimate)
- Account for Arc<T> overhead in Leaf<T> nodes (fixes Hash256 underestimate)
- Fix internal node count: num_leaves-1 (correct for complete binary tree)
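
The capping and node-count fixes can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's `estimate_tree_bytes`: it assumes a perfect binary tree with `2^depth` leaves, counts nodes rather than bytes, and the function name `estimate_dirty_nodes` is hypothetical.

```rust
/// Upper bound on the number of tree nodes reallocated when `dirty_leaves`
/// leaves change in a perfect binary tree with `2^depth` leaves.
/// (Hypothetical helper; the real formula also handles Zero nodes and Arc overhead.)
fn estimate_dirty_nodes(dirty_leaves: usize, depth: u32) -> usize {
    let num_leaves = 1usize << depth;
    // A full binary tree has num_leaves - 1 internal nodes plus num_leaves leaves.
    let full_tree = (num_leaves - 1) + num_leaves;
    // Sparse estimate: each dirty leaf rewrites its own root-to-leaf path.
    let sparse = dirty_leaves * (depth as usize + 1);
    // Paths overlap near the root, so the sparse estimate can exceed the
    // whole tree; cap it at the full-tree cost.
    sparse.min(full_tree)
}
```

With 1024 leaves, one dirty leaf costs 11 nodes, while 512 dirty leaves would naively cost 5632 nodes and is capped at the 2047 nodes the tree actually has.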

Add missing fields to estimated_marginal_bytes:
- slashings (1 dirty per epoch boundary)
- eth1_data_votes (current list length)
- historical_summaries (1 per epoch, Capella+)

Tests (25):
- estimate_tree_bytes: sparse, scattered, adjacent, full for u64/u8/Hash256
- Per-field: balances at 1/10/50/100%, participation, roots, randao, inactivity
- Integrated: epoch boundary (1.04x), mid-epoch (3.1x), real epoch transition
- Clone chain: shared COW, pruning, same-slot independence
- All tests assert both lower bound (estimate >= actual) and max ratio
Each BeaconState carries a Vec<Arc<ApproxOwnedBytes>> recording the tree
memory allocated at each transition. States that share ancestry (via
clone) share the same Arc entries. Total cache memory is computed by
deduplicating entries across all cached states by Arc pointer identity.

- ApproxOwnedBytesList field on BeaconState (skipped from serde/ssz/tree_hash)
- TreeSnapshot stub in per_slot_processing and per_block_processing
  (captures pre-state, measures delta — returns 0 until milhouse support)
- All fork upgrades preserve approx_owned_bytes via mem::take
- rebase_on_finalized resets to finalized's list + unique cost
- StateCache::total_approx_owned_bytes() iterates and deduplicates
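
A minimal sketch of the deduplication step, assuming a simplified `Segment` type in place of the PR's `ApproxOwnedBytes` and a free function in place of `StateCache::total_approx_owned_bytes()`. Both names are hypothetical; the point is only that entries are keyed by `Arc` pointer identity, so a segment shared by several states is counted once.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for ApproxOwnedBytes: bytes allocated by one transition.
struct Segment {
    bytes: usize,
}

/// Per-state list of segments; cloned states share their ancestor's Arc entries.
type SegmentList = Vec<Arc<Segment>>;

/// Sum the bytes of all distinct segments across the given states.
fn total_unique_bytes<'a>(states: impl IntoIterator<Item = &'a SegmentList>) -> usize {
    // Track segments by allocation address so shared entries count once.
    let mut seen: HashSet<*const Segment> = HashSet::new();
    let mut total = 0;
    for list in states {
        for segment in list {
            if seen.insert(Arc::as_ptr(segment)) {
                total += segment.bytes;
            }
        }
    }
    total
}
```

For a parent with one 1000-byte segment and two children that each add their own segment (10 and 20 bytes), the total is 1030 bytes rather than the 3030 a naive per-state sum would give.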
…marks

Implement MemorySize for BeaconState (tree fields via macros + caches +
sync committees), CommitteeCache, EpochCache, SyncCommittee, and all
remaining leaf types (PendingAttestation, PendingDeposit, PendingPartialWithdrawal,
PendingConsolidation, Builder, BuilderPendingPayment, BuilderPendingWithdrawal,
Withdrawal). Add PtcWindowEntry newtype for FixedVector MemorySize support.

Add state_memory benchmark measuring MemoryTracker::track_item cost:
- Single state walk: ~316µs at 1024 validators (linear scaling)
- Pre+post delta (slot transition): ~350µs at 1024 validators
- Pre+post delta (epoch transition): ~343µs at 1024 validators

Co-authored-by: PoulavBhowmick03 <bpoulav@gmail.com>
Merge TODO-state-cache-size.md and DESIGN-cow-tracking.md into a single
plan at .claude/state-cache-memory-tracking.md. Update with current status,
the persistent MemoryTracker approach, and the three measurement cases.

Replace MinimalEthSpec benchmarks with mainnet-scale synthetic states
(1M and 2M validators). Results at 1M validators:
- Full walk: 459ms
- Pre+post slot transition: 451ms (dominated by pre-state walk)
- Pre+post epoch transition: 566ms
Replace TreeSnapshot stub with real cow_bytes implementation using
milhouse's pairwise tree walk (dapplion/milhouse cow-bytes branch).

TreeSnapshot::cow_bytes now calls cow_bytes_between which iterates all
tree-backed fields calling List/Vector::cow_bytes. Also adds
total_state_tree_bytes for measuring a full state's tree size.

Benchmarks at 1M validators (mainnet scale):
- cow_bytes slot transition:  541 ns (was 450ms with MemoryTracker)
- cow_bytes epoch transition: 12.8 ms
- total_tree_bytes:           25.1 ms (initial finalized state, once)
- MemoryTracker comparison:   458 ms (850,000x slower for slot)
Replace per-state estimated_marginal_bytes cost tracking with
total_approx_owned_bytes() which deduplicates shared ApproxOwnedBytes
segments across all cached states via Arc pointer identity.

- put_state eviction loop now uses total_approx_owned_bytes() instead
  of incrementally tracked cached_bytes
- Remove per-state cost from LRU tuple (no longer needed)
- Add store_beacon_state_cache_cow_byte_size gauge metric
- Add store_beacon_state_cache_evictions_total counter metric
- Add debug tracing for finalized base size measurement, rebase
  cow_bytes, and byte budget eviction events
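
The eviction loop can be pictured roughly like this. It is a sketch under simplifying assumptions: the `Cache` and `Entry` types are hypothetical, each entry owns an independent byte count (the PR instead deduplicates shared segments), and the total is recomputed on every iteration rather than tracked incrementally, mirroring the change described above.

```rust
use std::collections::VecDeque;

/// Hypothetical cache entry: a state identifier and the bytes it uniquely owns.
struct Entry {
    root: u64,
    owned_bytes: usize,
}

/// Hypothetical LRU cache with a byte budget.
struct Cache {
    /// Front of the queue is the least recently used entry.
    lru: VecDeque<Entry>,
    max_bytes: usize,
}

impl Cache {
    /// Recompute the total from scratch instead of keeping a running sum.
    fn total_bytes(&self) -> usize {
        self.lru.iter().map(|e| e.owned_bytes).sum()
    }

    /// Insert a state, then evict least recently used entries until the
    /// total fits the budget. Returns the roots that were evicted.
    fn put(&mut self, entry: Entry) -> Vec<u64> {
        self.lru.push_back(entry);
        let mut evicted = Vec::new();
        // Always keep the newest entry, even if it alone exceeds the budget.
        while self.total_bytes() > self.max_bytes && self.lru.len() > 1 {
            if let Some(old) = self.lru.pop_front() {
                evicted.push(old.root);
            }
        }
        evicted
    }
}
```

Recomputing the total each time costs more per insertion than a running counter, but it cannot drift out of sync with the entries actually held.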
Remove stale plans/state-cache-byte-size.md (original spec-derived
estimation design) and update .claude/state-cache-memory-tracking.md
to reflect the implemented cow_bytes + ApproxOwnedBytes design.
Delete estimated_marginal_bytes, estimate_tree_bytes, and all 25 tests
that validated the old spec-derived estimation formula. These are dead
code — eviction now uses total_approx_owned_bytes() via cow_bytes.

Replace with 9 tests covering the actual production code path:
- cow_bytes_between: clone=0, single mutation>0, epoch boundary large
- total_state_tree_bytes: nonzero, scales with validator count
- ApproxOwnedBytesList: deduplication across cloned states
- StateCache: finalized base size populated, put_state increases total,
  byte budget eviction fires and removes states
Add committee_caches and sync_committees to the COW measurement:
- cow_bytes_between: count cache heap bytes when Arc pointers differ
- total_state_tree_bytes: include cache heap bytes in the total
- Add CommitteeCache::approx_heap_bytes (shuffling + positions vecs)
- Add EpochCache::approx_heap_bytes (effective_balances + base_rewards)

Note: cow_bytes_between manually lists tree fields (must stay in sync
with rebase_on, which uses bimap macros). The bimap macros require &mut
access and a Result return type, which cow_bytes (a read-only usize fn)
can't satisfy. A future milhouse change could add an immutable bimap variant.
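
A rough sketch of how a cache's heap usage might be counted only when it is not shared. The `ToyCommitteeCache` struct, its field types, and `cache_cow_bytes` are assumptions for illustration; the real `CommitteeCache` fields and the exact accounting in `cow_bytes_between` may differ.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical cache shape; field names follow the commit message,
/// element types are guesses.
struct ToyCommitteeCache {
    shuffling: Vec<usize>,
    shuffling_positions: Vec<u32>,
}

impl ToyCommitteeCache {
    /// Approximate heap bytes held by the two vectors (capacity, not length,
    /// since the allocator reserves the full capacity).
    fn approx_heap_bytes(&self) -> usize {
        self.shuffling.capacity() * size_of::<usize>()
            + self.shuffling_positions.capacity() * size_of::<u32>()
    }
}

/// Count the new cache's heap bytes only if it is a different allocation
/// from the old one; a shared Arc costs nothing extra.
fn cache_cow_bytes(old: &Arc<ToyCommitteeCache>, new: &Arc<ToyCommitteeCache>) -> usize {
    if Arc::ptr_eq(old, new) {
        0
    } else {
        new.approx_heap_bytes()
    }
}
```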
- Add store_beacon_state_cache_segment_count histogram metric tracking
  the number of ApproxOwnedBytes segments per cached state
- Compact finalized state's segments to a single entry in
  update_finalized_state (prevents accumulation across finalizations)
- Record segment counts each time total_approx_owned_bytes is computed
@dapplion dapplion force-pushed the dapplion/state-cache-byte-budget branch from 9185ce2 to b00b4f4 Compare April 6, 2026 15:19
dapplion added 3 commits April 6, 2026 17:49
PtcWindowEntry was a newtype around FixedVector<u64, N> to satisfy
MemorySize bounds that no longer exist. Revert to upstream's plain
FixedVector.

TreeSnapshot was a struct wrapping state.clone() + cow_bytes_between.
Replace with direct clone + cow_bytes_between calls in per_slot_processing
and per_block_processing — simpler, no indirection.
Fast path (every put_state): use ApproxOwnedBytesList segments for an
approximate total. This overcounts when the same paths are mutated
repeatedly, but it errs in the safe direction: eviction triggers early,
never late.

Slow path (on finalization): recompute_exact_costs runs cow_bytes_between
for each cached state, replacing accumulated segments with a single exact
entry. Corrects overcount. ~2ms for slot-only caches, ~225ms worst case
with epoch boundary states.

The slow path runs in update_finalized_state which already does expensive
work (pruning, hdiff management). Adding 225ms there is acceptable.
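
The difference between the two paths can be illustrated with a small sketch. The `CachedState` and `Segment` types and the `compact` method are hypothetical; in the PR the exact figure comes from `cow_bytes_between`, which is passed in here as a plain number.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for ApproxOwnedBytes.
struct Segment {
    bytes: usize,
}

/// Hypothetical cached state holding one segment per recorded transition.
struct CachedState {
    segments: Vec<Arc<Segment>>,
}

impl CachedState {
    /// Fast path: sum the recorded segments. If the same tree path was
    /// rewritten several times, each rewrite is counted, so this can
    /// exceed the memory actually held.
    fn approx_bytes(&self) -> usize {
        self.segments.iter().map(|s| s.bytes).sum()
    }

    /// Slow path: replace the accumulated segments with a single entry
    /// carrying an exact measurement.
    fn compact(&mut self, exact_bytes: usize) {
        self.segments = vec![Arc::new(Segment { bytes: exact_bytes })];
    }
}
```

If one 100-byte path is rewritten three times, the fast path reports 300 bytes; after compaction with the exact figure the state reports 100 bytes from a single segment.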
@dapplion dapplion force-pushed the dapplion/state-cache-byte-budget branch from 15ee7a6 to 421b60d Compare April 7, 2026 03:07