18 commits
7b922d8
state cache: add spec-derived byte-size estimation and budget-based e…
dapplion Apr 4, 2026
39d9a3d
fix clippy: remove unnecessary u32 cast
dapplion Apr 4, 2026
7092e19
add MemorySize impls for ParticipationFlags, Validator, Eth1Data, His…
dapplion Apr 6, 2026
dac059c
fix estimate_tree_bytes formula, add missing fields, add tests
dapplion Apr 6, 2026
1a124fe
add ApproxOwnedBytes tracking on BeaconState
dapplion Apr 6, 2026
149b3b0
add MemorySize for BeaconState, caches, and all leaf types; add bench…
dapplion Apr 6, 2026
61f9a89
consolidate tracking docs, add mainnet-scale benchmarks
dapplion Apr 6, 2026
18cf524
wire cow_bytes from milhouse into BeaconState measurement
dapplion Apr 6, 2026
5dd989c
wire byte budget eviction to ApproxOwnedBytes, add metrics and tracing
dapplion Apr 6, 2026
593a5f3
consolidate plan docs into single current-state document
dapplion Apr 6, 2026
e86af3b
remove dead MemoryTracker profiling example
dapplion Apr 6, 2026
27e0e08
remove dead estimation code, replace with production code tests
dapplion Apr 6, 2026
485c5d4
remove dead MemorySize impls and tracker comparison benchmark
dapplion Apr 6, 2026
3c03779
include caches in cow_bytes_between and total_state_tree_bytes
dapplion Apr 6, 2026
b00b4f4
add segment count histogram, compact finalized state segments
dapplion Apr 6, 2026
58fbf97
remove PtcWindowEntry newtype and TreeSnapshot struct
dapplion Apr 6, 2026
eb5d5cf
add note about size_of limitation for leaf types with heap allocations
dapplion Apr 6, 2026
421b60d
two-layer memory tracking: fast approximate + slow exact recomputation
dapplion Apr 7, 2026
122 changes: 122 additions & 0 deletions .claude/state-cache-memory-tracking.md
@@ -0,0 +1,122 @@
# State Cache Memory Tracking

## Problem

The state cache needs to know how much memory cached states consume to enforce
a byte budget (`--state-cache-max-mb`) and avoid OOM. States share tree nodes
via milhouse copy-on-write (COW), so the marginal cost of a cached state depends
on which of its nodes are shared with other states.

### Prior art

- **sigp/lighthouse#7803** — Full `MemoryTracker` walk over all cached states.
Rejected: 450ms+ per measurement at mainnet scale, holds cache mutex.
- **sigp/lighthouse#7449, #7450** — Tracking issues for cache size measurement.
- **Spec-derived estimation** (`estimated_marginal_bytes`) — O(1) heuristic from
spec knowledge. Implemented in this branch as a fallback, with 25 tests. Tight
at epoch boundary (1.04x) but loose mid-epoch (3x). No milhouse dependency.

## Current design: ApproxOwnedBytes + cow_bytes

### How it works

Each `BeaconState` carries a `Vec<Arc<ApproxOwnedBytes>>` — byte counts for
chunks of tree memory it owns. States that share ancestry (via clone) share the
same `Arc` entries. Total cache memory = sum of unique entries (deduplicated by
Arc pointer) across all cached states.
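
The deduplicated sum described above can be sketched as follows. This is a minimal illustration, not the PR's code: the tuple-struct shape of `ApproxOwnedBytes` and the function name `total_unique_bytes` are assumptions made for the example.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `ApproxOwnedBytes`:
/// a byte count for one chunk of tree memory.
#[derive(Debug)]
pub struct ApproxOwnedBytes(pub u64);

/// Sum the segments of all cached states, counting each shared `Arc` once.
/// (Function name is illustrative, not taken from the PR.)
pub fn total_unique_bytes(states: &[Vec<Arc<ApproxOwnedBytes>>]) -> u64 {
    let mut seen: HashSet<*const ApproxOwnedBytes> = HashSet::new();
    let mut total = 0u64;
    for segments in states {
        for seg in segments {
            // `Arc::as_ptr` identifies the allocation, so clones of the
            // same `Arc` map to the same key and are counted only once.
            if seen.insert(Arc::as_ptr(seg)) {
                total += seg.0;
            }
        }
    }
    total
}
```

Two states cloned from a common ancestor share the ancestor's `Arc` entries, so only their own newer segments add to the total.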

Measurement uses milhouse's `cow_bytes` (PR sigp/milhouse#100): a pairwise tree
walk that compares two trees by `Arc::ptr_eq` at each node, skipping shared
subtrees. O(dirty_nodes) with zero allocations.
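
To illustrate the idea of a pairwise walk, here is a sketch on a toy persistent binary tree. The `Node` type and both functions are assumptions for the example; milhouse's real node layout and the `cow_bytes` signature in sigp/milhouse#100 may differ.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Toy persistent binary tree (not milhouse's real node type).
pub enum Node {
    Leaf(u64),
    Branch(Arc<Node>, Arc<Node>),
}

/// Bytes of every node reachable from `node`.
pub fn total_bytes(node: &Arc<Node>) -> usize {
    size_of::<Node>()
        + match &**node {
            Node::Leaf(_) => 0,
            Node::Branch(l, r) => total_bytes(l) + total_bytes(r),
        }
}

/// Bytes reachable from `new` that are not shared with `old`.
/// Shared subtrees are detected by pointer equality and skipped entirely.
pub fn cow_bytes(old: &Arc<Node>, new: &Arc<Node>) -> usize {
    if Arc::ptr_eq(old, new) {
        return 0;
    }
    match (&**old, &**new) {
        (Node::Branch(o_left, o_right), Node::Branch(n_left, n_right)) => {
            size_of::<Node>() + cow_bytes(o_left, n_left) + cow_bytes(o_right, n_right)
        }
        // Shapes differ or a leaf was replaced: the whole new subtree is unique.
        _ => total_bytes(new),
    }
}
```

Replacing one leaf in a four-leaf tree yields three unique nodes (the new leaf, its parent, and the root), which is why the cost scales with the number of dirty nodes rather than the tree size.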

### Two-layer approach

**Fast path (every `put_state`):** Sum `ApproxOwnedBytesList` segments across all
states. The sum overcounts when the same tree path is mutated repeatedly, but
overcounting is safe: it triggers eviction earlier, never too late. Cost: microseconds.

**Slow path (on finalization):** Run `cow_bytes_between(finalized, state)` for every
cached state, replacing segments with exact measurements. Corrects accumulated
overcount. Cost: ~2ms for slot-only caches, ~225ms with epoch boundary states.
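
The difference between the two layers can be shown with a toy model. Everything below is an assumption for illustration (`ToyState`, the flat list of 32-byte leaves, and the function names); it only demonstrates why summing per-transition deltas can exceed an exact recomputation against the finalized state.

```rust
use std::sync::Arc;

const LEAF_BYTES: u64 = 32;

/// Toy state: a flat list of COW leaves plus the byte segments recorded so far.
#[derive(Clone)]
pub struct ToyState {
    pub leaves: Vec<Arc<[u8; 32]>>,
    pub segments: Vec<u64>,
}

/// Exact unique bytes of `state` relative to `base`, by pointer comparison.
pub fn exact_delta(base: &ToyState, state: &ToyState) -> u64 {
    base.leaves
        .iter()
        .zip(&state.leaves)
        .filter(|&(a, b)| !Arc::ptr_eq(a, b))
        .count() as u64
        * LEAF_BYTES
}

/// One transition: rewrite leaf `i` and push the measured delta as a new segment.
pub fn mutate(state: &ToyState, i: usize, byte: u8) -> ToyState {
    let mut post = state.clone();
    post.leaves[i] = Arc::new([byte; 32]);
    let delta = exact_delta(state, &post);
    post.segments.push(delta);
    post
}

/// Fast path: sum of the accumulated segments.
pub fn fast_total(state: &ToyState) -> u64 {
    state.segments.iter().sum()
}

/// Slow path: finalized base size plus the exact delta against finalized.
pub fn slow_total(finalized: &ToyState, state: &ToyState) -> u64 {
    finalized.leaves.len() as u64 * LEAF_BYTES + exact_delta(finalized, state)
}
```

In this model, mutating leaf 0 twice in a row records two 32-byte segments, while the exact comparison against the finalized state finds only one differing leaf. The fast total is therefore an upper bound that the slow path tightens.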

### Three measurement points

1. **Initial finalized state** — `total_state_tree_bytes()` walks all tree nodes
once. ~25ms at 1M validators. Happens once per finalization (~every 6 min).

2. **State loaded from disk after rebase** — `cow_bytes_between(finalized, state)`
measures unique bytes vs finalized. O(dirty_nodes).

3. **After block/slot processing** — `TreeSnapshot` clones pre-state (cheap Arc
bumps), then `cow_bytes_between(pre, post)` after transition. Pushed as a new
`ApproxOwnedBytes` entry.
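
The snapshot-then-diff pattern in step 3 can be sketched on a toy structure. `Chunk`, `ToyTree`, and `measure_transition` are hypothetical names; the real code operates on `BeaconState` and milhouse trees.

```rust
use std::sync::Arc;

#[derive(Clone)]
pub struct Chunk(pub [u64; 4]);

/// Toy tree: a flat list of reference-counted chunks.
pub type ToyTree = Vec<Arc<Chunk>>;

/// Bytes in `post` that are not shared with `pre` (pointer comparison per chunk).
pub fn unique_bytes(pre: &ToyTree, post: &ToyTree) -> usize {
    post.iter()
        .enumerate()
        .filter(|&(i, c)| pre.get(i).map_or(true, |p| !Arc::ptr_eq(p, c)))
        .count()
        * std::mem::size_of::<Chunk>()
}

/// Run `transition` on `state` and return the bytes it newly allocated.
pub fn measure_transition(
    state: &mut ToyTree,
    transition: impl FnOnce(&mut ToyTree),
) -> usize {
    // Cloning copies the list of pointers and bumps refcounts; chunks are not copied.
    let pre = state.clone();
    transition(state);
    unique_bytes(&pre, state)
}
```

While the snapshot is alive, `Arc::make_mut` on a chunk sees a refcount above one and copies it, so the mutated chunk gets a new address and shows up in the diff. Untouched chunks keep their pointers and cost nothing.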

### Performance (benchmarked at 1M validators, MainnetEthSpec)

| Operation | Time |
|-----------|------|
| cow_bytes slot transition | **541 ns** |
| cow_bytes epoch transition | **12.8 ms** |
| total_tree_bytes (initial) | **25.1 ms** |
| MemoryTracker (for comparison) | **458 ms** |

### Eviction

`put_state` checks `total_approx_owned_bytes()` against `max_bytes`. If over
budget, culls states by priority (advanced → old boundary → mid-epoch → good
boundary) until under budget. The total is recomputed each check by iterating
all cached states and deduplicating `ApproxOwnedBytes` entries — ~6400 pointer
comparisons, trivial.
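
A budget-driven cull loop in the priority order listed above might look like the sketch below. `CullClass`, `Entry`, and `cull_to_budget` are illustrative names, and the model simplifies by giving each entry an independent byte cost, whereas the real cache deduplicates shared segments.

```rust
use std::cmp::Reverse;

/// Eviction classes in the order the text lists them; earlier variants are culled first.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum CullClass {
    Advanced,
    OldBoundary,
    MidEpoch,
    GoodBoundary,
}

#[derive(Debug)]
pub struct Entry {
    pub id: u32,
    pub class: CullClass,
    pub bytes: u64,
}

/// Remove entries in priority order until the total fits the budget.
/// Returns the ids of the evicted entries.
pub fn cull_to_budget(cache: &mut Vec<Entry>, max_bytes: u64) -> Vec<u32> {
    // Sort so the most evictable class ends up last and can be popped cheaply.
    cache.sort_by_key(|e| Reverse(e.class));
    let mut total: u64 = cache.iter().map(|e| e.bytes).sum();
    let mut evicted = Vec::new();
    while total > max_bytes {
        let Some(entry) = cache.pop() else { break };
        total -= entry.bytes;
        evicted.push(entry.id);
    }
    evicted
}
```

Because overcounting only makes `total` larger, a loop of this shape errs toward evicting one state too many rather than exceeding the budget.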

### Data flow

```
per_slot_processing / per_block_processing:
    TreeSnapshot::new(state)                 ← cheap clone (Arc bumps)
    ... process ...
    snapshot.cow_bytes(state)                ← O(dirty_nodes), ~541ns slot / ~12.8ms epoch
    state.approx_owned_bytes.push(delta)

rebase_on_finalized:
    state.rebase_on(finalized)
    cow_bytes_between(finalized, state)      ← O(dirty_nodes)
    state.approx_owned_bytes = finalized.approx_owned_bytes + unique

update_finalized_state:
    total_state_tree_bytes(state)            ← O(all_nodes), ~25ms, once
    state.approx_owned_bytes.push(base_size)

put_state:
    total = total_approx_owned_bytes()       ← deduplicate Arc pointers
    if total > max_bytes: cull(...)
```

## What's implemented

- `ApproxOwnedBytes` / `ApproxOwnedBytesList` on `BeaconState` (all variants)
- `cow_bytes_between()`, `total_state_tree_bytes()` in `consensus/types`
- `TreeSnapshot` in `per_slot_processing` and `per_block_processing`
- `rebase_on_finalized` resets segments to finalized's + unique cost
- `update_finalized_state` measures base size for new finalized states
- `total_approx_owned_bytes()` on `StateCache`
- Eviction wired to `total_approx_owned_bytes()` in `put_state`
- `--state-cache-max-mb` CLI flag (default: None = count-based only)
- Metrics: `store_beacon_state_cache_cow_byte_size` gauge,
`store_beacon_state_cache_evictions_total` counter
- Debug tracing on finalized base size, rebase cow_bytes, eviction events
- `MemorySize` for `BeaconState` and all subtypes (from #7803)
- `estimated_marginal_bytes` fallback with 25 tests (not used for eviction)
- milhouse `cow_bytes` PR: sigp/milhouse#100

## What's not tracked

- **Non-tree caches**: committee_caches (~30-60MB Arc-shared), pubkey_cache
(~100-150MB rpds), epoch_cache (~5MB Arc). Marginal cost ~0 when shared,
but the base finalized state's caches aren't measured.
- **Scalar fields**: fork, checkpoints, eth1_data. Small, fixed per state.

## References

- sigp/lighthouse#7449 — Measure state cache size
- sigp/lighthouse#7450 — Prune state cache based on size
- sigp/lighthouse#7803 — Memory Aware Caching (rejected)
- sigp/milhouse#100 — cow_bytes pairwise tree walk
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -177,7 +177,7 @@ malloc_utils = { path = "common/malloc_utils" }
maplit = "1"
merkle_proof = { path = "consensus/merkle_proof" }
metrics = { path = "common/metrics" }
-milhouse = { version = "0.9", default-features = false, features = ["context_deserialize"] }
+milhouse = { git = "https://github.com/dapplion/milhouse.git", branch = "cow-bytes", default-features = false, features = ["context_deserialize"] }
mockall = "0.13"
mockall_double = "0.3"
mockito = "1.5.0"
13 changes: 12 additions & 1 deletion beacon_node/src/cli.rs
@@ -802,11 +802,22 @@ pub fn cli_app() -> Command {
Arg::new("state-cache-size")
.long("state-cache-size")
.value_name("STATE_CACHE_SIZE")
-.help("Specifies the size of the state cache")
+.help("Specifies the maximum number of states in the state cache")
.default_value("128")
.action(ArgAction::Set)
.display_order(0)
)
.arg(
Arg::new("state-cache-max-mb")
.long("state-cache-max-mb")
.value_name("STATE_CACHE_MAX_MB")
.help("Maximum memory budget for the state cache in megabytes. When set, the \
cache evicts states to stay within this budget using estimated byte costs. \
Epoch boundary states (~32MB each on mainnet) are deprioritized for \
eviction. If unset, only count-based eviction is used.")
.action(ArgAction::Set)
.display_order(0)
)
/*
* Execution Layer Integration
*/
4 changes: 4 additions & 0 deletions beacon_node/src/config.rs
@@ -366,6 +366,10 @@ pub fn get_config<E: EthSpec>(
.map_err(|_| "state-cache-size is not a valid integer".to_string())?;
}

if let Some(max_mb) = clap_utils::parse_optional::<usize>(cli_args, "state-cache-max-mb")? {
client_config.store.state_cache_max_mb = Some(max_mb);
}

if let Some(historic_state_cache_size) =
clap_utils::parse_optional(cli_args, "historic-state-cache-size")?
{
5 changes: 5 additions & 0 deletions beacon_node/store/Cargo.toml
@@ -41,9 +41,14 @@ zstd = { workspace = true }
[dev-dependencies]
beacon_chain = { workspace = true }
criterion = { workspace = true }
genesis = { workspace = true }
rand = { workspace = true, features = ["small_rng"] }
tempfile = { workspace = true }

[[bench]]
name = "hdiff"
harness = false

[[bench]]
name = "state_memory"
harness = false
133 changes: 133 additions & 0 deletions beacon_node/store/benches/state_memory.rs
@@ -0,0 +1,133 @@
//! Benchmarks for state memory measurement using cow_bytes (pairwise tree walk).

use criterion::{Criterion, criterion_group, criterion_main};
use milhouse::{List, Vector};
use ssz_types::BitVector;
use std::hint::black_box;
use std::sync::Arc;
use types::state::*;
use types::*;

type E = MainnetEthSpec;

fn make_state(n: usize) -> BeaconState<E> {
    let validator = Validator {
        pubkey: bls::PublicKeyBytes::empty(),
        withdrawal_credentials: Hash256::ZERO,
        effective_balance: 32_000_000_000,
        slashed: false,
        activation_eligibility_epoch: Epoch::new(0),
        activation_epoch: Epoch::new(0),
        exit_epoch: Epoch::new(u64::MAX),
        withdrawable_epoch: Epoch::new(u64::MAX),
    };
    let validators = List::new(vec![validator; n]).unwrap();
    let balances = List::new(vec![32_000_000_000u64; n]).unwrap();
    let inactivity_scores = List::new(vec![0u64; n]).unwrap();
    let participation = List::new(vec![ParticipationFlags::default(); n]).unwrap();
    let default_cc = Arc::new(CommitteeCache::default());
    let sync = Arc::new(SyncCommittee::temporary());

    BeaconState::Altair(BeaconStateAltair {
        genesis_time: 0,
        genesis_validators_root: Hash256::ZERO,
        slot: Slot::new(0),
        fork: Fork::default(),
        latest_block_header: BeaconBlockHeader::empty(),
        block_roots: Vector::default(),
        state_roots: Vector::default(),
        historical_roots: List::default(),
        eth1_data: Eth1Data::default(),
        eth1_data_votes: List::default(),
        eth1_deposit_index: 0,
        validators,
        balances,
        randao_mixes: Vector::default(),
        slashings: Vector::default(),
        previous_epoch_participation: participation.clone(),
        current_epoch_participation: participation,
        justification_bits: BitVector::new(),
        previous_justified_checkpoint: Checkpoint::default(),
        current_justified_checkpoint: Checkpoint::default(),
        finalized_checkpoint: Checkpoint::default(),
        inactivity_scores,
        current_sync_committee: sync.clone(),
        next_sync_committee: sync,
        total_active_balance: None,
        progressive_balances_cache: ProgressiveBalancesCache::default(),
        committee_caches: [default_cc.clone(), default_cc.clone(), default_cc],
        pubkey_cache: PubkeyCache::default(),
        exit_cache: ExitCache::default(),
        slashings_cache: SlashingsCache::default(),
        epoch_cache: EpochCache::default(),
        approx_owned_bytes: ApproxOwnedBytesList::default(),
    })
}

fn make_slot_transition(base: &BeaconState<E>, n: usize) -> BeaconState<E> {
    let mut post = base.clone();
    // 1 proposer reward + 128 participation + roots + randao
    *post.balances_mut().get_mut(0).unwrap() += 1;
    *post.state_roots_mut().get_mut(0).unwrap() = Hash256::repeat_byte(0x01);
    *post.block_roots_mut().get_mut(0).unwrap() = Hash256::repeat_byte(0x02);
    *post.randao_mixes_mut().get_mut(0).unwrap() = Hash256::repeat_byte(0x03);
    for i in 0..128.min(n) {
        post.current_epoch_participation_mut()
            .unwrap()
            .get_mut(i)
            .unwrap()
            .add_flag(0)
            .unwrap();
    }
    post.apply_pending_mutations().unwrap();
    post
}

fn make_epoch_transition(base: &BeaconState<E>, n: usize) -> BeaconState<E> {
    let mut post = base.clone();
    // All balances + inactivity + participation replaced
    for i in 0..n {
        *post.balances_mut().get_mut(i).unwrap() += 1;
    }
    for i in 0..n {
        *post.inactivity_scores_mut().unwrap().get_mut(i).unwrap() += 1;
    }
    *post.previous_epoch_participation_mut().unwrap() =
        List::new(vec![ParticipationFlags::default(); n]).unwrap();
    *post.current_epoch_participation_mut().unwrap() =
        List::new(vec![ParticipationFlags::default(); n]).unwrap();
    post.apply_pending_mutations().unwrap();
    post
}

fn bench_cow_bytes(c: &mut Criterion) {
    let mut group = c.benchmark_group("cow_bytes");
    group.sample_size(10);

    for n in [1_000_000, 2_000_000] {
        eprintln!("Building states with {n} validators...");
        let base = make_state(n);

        // Slot transition: few dirty nodes.
        let post_slot = make_slot_transition(&base, n);
        group.bench_function(format!("slot_transition_{n}"), |b| {
            b.iter(|| black_box(cow_bytes_between(&base, &post_slot)));
        });

        // Epoch transition: many dirty nodes.
        let post_epoch = make_epoch_transition(&base, n);
        group.bench_function(format!("epoch_transition_{n}"), |b| {
            b.iter(|| black_box(cow_bytes_between(&base, &post_epoch)));
        });

        // Total tree bytes (for initial finalized state).
        group.bench_function(format!("total_tree_bytes_{n}"), |b| {
            b.iter(|| black_box(total_state_tree_bytes(&base)));
        });
    }

    group.finish();
}

criterion_group!(benches, bench_cow_bytes);
criterion_main!(benches);
8 changes: 8 additions & 0 deletions beacon_node/store/src/config.rs
@@ -29,6 +29,9 @@ pub const DEFAULT_HOT_HDIFF_BUFFER_CACHE_SIZE: NonZeroUsize = new_non_zero_usize
const EST_COMPRESSION_FACTOR: usize = 2;
pub const DEFAULT_EPOCHS_PER_BLOB_PRUNE: u64 = 1;
pub const DEFAULT_BLOB_PUNE_MARGIN_EPOCHS: u64 = 0;
/// Default maximum memory budget for the state cache in megabytes. `None` means no byte-budget
/// limit (count-based eviction only, the previous behaviour).
pub const DEFAULT_STATE_CACHE_MAX_MB: Option<usize> = None;

/// Database configuration parameters.
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
@@ -64,6 +67,10 @@ pub struct StoreConfig {
/// The margin for blob pruning in epochs. The oldest blobs are pruned up until
/// data_availability_boundary - blob_prune_margin_epochs. Default: 0.
pub blob_prune_margin_epochs: u64,
/// Maximum memory budget for the state cache in megabytes. When set, the cache will evict
/// states to stay within this budget using spec-derived byte cost estimates. `None` disables
/// byte-budget eviction (count-based only).
pub state_cache_max_mb: Option<usize>,
}

/// Variant of `StoreConfig` that gets written to disk. Contains immutable configuration params.
@@ -120,6 +127,7 @@ impl Default for StoreConfig {
prune_blobs: true,
epochs_per_blob_prune: DEFAULT_EPOCHS_PER_BLOB_PRUNE,
blob_prune_margin_epochs: DEFAULT_BLOB_PUNE_MARGIN_EPOCHS,
state_cache_max_mb: DEFAULT_STATE_CACHE_MAX_MB,
}
}
}
6 changes: 6 additions & 0 deletions beacon_node/store/src/hot_cold_store.rs
@@ -243,6 +243,7 @@ impl<E: EthSpec> HotColdDB<E, MemoryStore<E>, MemoryStore<E>> {
config.state_cache_size,
config.state_cache_headroom,
config.hot_hdiff_buffer_cache_size,
config.state_cache_max_mb.map(|mb| mb * 1_048_576),
)),
historic_state_cache: Mutex::new(HistoricStateCache::new(
config.cold_hdiff_buffer_cache_size,
@@ -297,6 +298,7 @@ impl<E: EthSpec> HotColdDB<E, BeaconNodeBackend<E>, BeaconNodeBackend<E>> {
config.state_cache_size,
config.state_cache_headroom,
config.hot_hdiff_buffer_cache_size,
config.state_cache_max_mb.map(|mb| mb * 1_048_576),
)),
historic_state_cache: Mutex::new(HistoricStateCache::new(
config.cold_hdiff_buffer_cache_size,
@@ -515,6 +517,10 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
&metrics::STORE_BEACON_STATE_CACHE_SIZE,
state_cache.len() as i64,
);
metrics::set_gauge(
&metrics::STORE_BEACON_STATE_CACHE_COW_BYTE_SIZE,
state_cache.cached_bytes() as i64,
);
metrics::set_gauge_vec(
&metrics::STORE_BEACON_HDIFF_BUFFER_CACHE_SIZE,
HOT_METRIC,