perf(l1): optimize trie building in snap sync insertion #6410
ElFantasma wants to merge 18 commits into main
Conversation
- Reuse `nodehash_buffer` across `add_current_to_parent_and_write_queue` calls instead of allocating a new 512-byte `Vec` on each of ~700M invocations
- Avoid `Nibbles` path clone by computing the prefix index directly
- Merge code hash collection into the trie-building pass, eliminating a redundant full iteration + RLP decode of all accounts (~3 min on mainnet)
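The buffer-reuse change in the first bullet can be sketched as follows — the caller owns one scratch `Vec` and the hot function clears and refills it instead of allocating. This is a minimal stand-in: the function names and the hash computation are illustrative, not the PR's actual code.

```rust
// Hot function: borrows the caller's scratch buffer instead of allocating.
fn hash_node_into(node_rlp: &[u8], scratch: &mut Vec<u8>) -> u64 {
    scratch.clear(); // keeps the capacity from previous calls
    scratch.extend_from_slice(node_rlp); // stand-in for RLP-encoding the node
    scratch.iter().map(|&b| b as u64).sum() // stand-in for hashing the buffer
}

// Outer loop: one 512-byte allocation total, not one per node.
fn hash_all_nodes(nodes: &[&[u8]]) -> Vec<u64> {
    let mut scratch = Vec::with_capacity(512);
    nodes.iter().map(|n| hash_node_into(n, &mut scratch)).collect()
}
```

At ~700M calls, eliminating one heap allocation per call is a meaningful saving even before the algorithmic changes.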
Split account insertion's trie build into 16 independent sub-tries (one per first nibble of account hash), built concurrently using a shared thread pool. Sub-trie results are assembled into the root BranchNode after all threads complete. Also partitions accounts in a single pass that collects code hashes and storage roots, eliminating the redundant second iteration.
Replace the in-memory partitioning (which loaded all 374M accounts into Vec buckets causing OOM on mainnet) with per-nibble RocksDB raw iterators that stream from disk. Each of the 16 sub-trie threads creates its own iterator, seeks to its nibble boundary, and streams accounts one at a time. The scope and thread pool are now managed directly in insert_accounts, matching the pattern used by insert_storages.
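The per-nibble streaming can be sketched as below, with a sorted in-memory slice playing the role of the RocksDB raw iterator (the real code seeks a `rocksdb` iterator to the `nibble << 4` boundary instead); names are illustrative, not the PR's API:

```rust
/// Stream only the entries whose hash starts with `nibble`, assuming `sorted`
/// is ordered by key — a stand-in for seeking a RocksDB raw iterator to the
/// nibble boundary and iterating until the next boundary.
fn accounts_for_nibble<'a>(
    sorted: &'a [([u8; 32], u64)],
    nibble: u8,
) -> impl Iterator<Item = &'a ([u8; 32], u64)> + 'a {
    let lo = nibble << 4; // first byte at the boundary, e.g. 0x30 for nibble 3
    sorted
        .iter()
        .skip_while(move |(k, _)| k[0] < lo)           // "seek" to the boundary
        .take_while(move |(k, _)| k[0] >> 4 == nibble) // stop at the next nibble
}
```

Each of the 16 threads holds its own such iterator, so no bucket of all 374M accounts ever needs to sit in memory at once.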
Force-pushed 163c612 to 37d33dc
🤖 Kimi Code Review

Here's my review of the PR:

**Critical Issues**

**1. Ignored Flush Errors (Potential Data Loss)**

File:

```rust
let _ = flush_nodes_to_write(nodes_to_write, db, buffer_sender);
```

If this fails (DB write error or channel closed), the function returns as if the flush succeeded. Propagate the error instead:

```rust
flush_nodes_to_write(nodes_to_write, db, buffer_sender)?;
```

**2. Panic on Channel Receive**

File: Multiple

**3. Iterator Panic on Decode Failure**

File:

```rust
let account_state = AccountState::decode(&value_vec).expect("We should have accounts here");
```

This panics on malformed data. During snap sync, invalid data from peers should be handled gracefully as a sync error, not a panic.

**Security & Correctness**

**4. Trie Root Consistency**

The parallel construction logic appears correct:
Recommendation: Ensure the test
**5. Atomic Ordering for File Index**

File:

**Performance & Resource Management**

**6. Unnecessary Mutex Contention**

File:
```rust
let handles: Vec<_> = (0u8..16).map(|nibble| {
    s.spawn(move || {
        // ...
        Ok((nibble, child_ref, nibble_iter.storage
```
---
*Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt*
🤖 Codex Code Review

Findings

I did not find a clear trie-root correctness bug in the parallel root assembly from static inspection. I couldn't run the Rust tests in this sandbox.

*Automated review by OpenAI Codex · gpt-5.4 · custom prompt*
🤖 Claude Code Review

Good — Code Review:
**Greptile Summary**

This PR optimizes snap sync trie building via three techniques: eliminating a redundant code-hash iteration pass, reusing a 512-byte hash buffer across ~700M calls, and parallelising the 16-nibble state trie build with streaming RocksDB iterators. The parallel correctness is well-validated (4 full mainnet runs with trie-root verification). Two minor robustness gaps remain:

**Confidence Score: 4/5**

Safe to merge — parallelism is correct and thoroughly validated; two non-blocking robustness improvements remain.

Parallel trie construction verified across 4 complete mainnet multisync runs with full state-root and storage-root validation. The P2 findings (silent I/O error discard and expect-panic instead of structured error) do not affect correctness under normal operation but could make failures harder to diagnose. Score is 4 rather than 5 because the silent code-hash loss path has real user impact if triggered.

Files: crates/networking/p2p/sync/snap_sync.rs (NibbleIterator::flush_code_hashes error handling), crates/common/trie/trie_sorted.rs (buffer channel expect calls)
| Filename | Overview |
|---|---|
| crates/common/trie/trie_sorted.rs | Adds parallel sub-trie building across 16 nibble ranges (trie_from_sorted_subtrie + assemble_root_from_subtries), reuses a 512-byte hash buffer, and eliminates a Nibbles clone per node; core algorithm is correct but buffer-channel error paths use panicking expect calls |
| crates/networking/p2p/sync/snap_sync.rs | Replaces sequential two-pass account insertion with 16-thread parallel NibbleIterator-based sub-trie build; code hash I/O errors are silently discarded in flush_code_hashes |
| crates/networking/p2p/sync/code_collector.rs | Adds next_file_index/set_file_index methods to CodeHashCollector to coordinate file-index ownership with external parallel writers |
Sequence Diagram

```mermaid
sequenceDiagram
    participant S as insert_accounts
    participant T as thread_scope
    participant N as NibbleIter[0..15]
    participant P as ThreadPool (DB writers)
    participant DB as TrieDB
    participant A as assemble_root_from_subtries
    S->>T: thread_scope { spawn 16 threads }
    par For each nibble 0..15
        T->>N: trie_from_sorted_subtrie(nibble)
        N->>DB: raw_iter.seek(nibble<<4)
        loop yield accounts
            N->>N: collect code_hashes + storage_accounts
            N->>P: execute_priority(flush_nodes_to_write)
            P->>DB: put_batch_no_alloc(nodes)
        end
        N->>N: flush_code_hashes() → dump_to_file
        N-->>T: Ok(Some(NodeRef[nibble]))
    end
    T->>T: join all 16 handles → choices[16]
    T->>T: drop pool → workers drain & exit
    T-->>S: [NodeRef; 16]
    S->>A: assemble_root_from_subtries(choices)
    A->>DB: get([nibble]) per valid choice
    A->>DB: put_batch(root node at [])
    A-->>S: state_root H256
```
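The fan-out/join shape in the diagram can be sketched with scoped threads. Here each "sub-trie" is reduced to a per-nibble count, and all names are illustrative stand-ins for the real trie-building code:

```rust
// Stand-in for building one sub-trie: count the keys in this nibble range.
fn build_subtrie(nibble: u8, keys: &[[u8; 32]]) -> usize {
    keys.iter().filter(|k| k[0] >> 4 == nibble).count()
}

// Spawn 16 scoped threads, then join them into the 16 "children" that the
// real code assembles into the root BranchNode's choices.
fn parallel_root(keys: &[[u8; 32]]) -> [usize; 16] {
    std::thread::scope(|s| {
        let handles: Vec<_> = (0u8..16)
            .map(|nibble| s.spawn(move || build_subtrie(nibble, keys)))
            .collect();
        let mut choices = [0usize; 16];
        for (i, h) in handles.into_iter().enumerate() {
            choices[i] = h.join().expect("sub-trie thread panicked");
        }
        choices
    })
}
```

Because each account hash falls into exactly one nibble range, the 16 builds share no mutable state and need no locking until the final assembly.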
Reviews (1): Last reviewed commit: "fix(l1): replace unwrap with expect for ..."
crates/networking/p2p/sync/snap_sync.rs, line 1050:

```rust
let file_idx = self.code_hash_file_index.fetch_add(1, Ordering::Relaxed);
let file_name = get_code_hashes_snapshot_file(self.code_hashes_dir, file_idx);
let hashes: Vec<H256> = std::mem::take(&mut self.code_hashes);
let _ = dump_to_file(&file_name, hashes.encode_to_vec());
```
**Silent I/O error on code-hash flush**
Errors from dump_to_file are silently discarded with let _ = .... If the disk write fails, the code hashes collected by this nibble's thread are permanently lost without any log or error surface — bytecodes for those accounts will never be downloaded.
```suggestion
if let Err(e) = dump_to_file(&file_name, hashes.encode_to_vec()) {
    tracing::error!("Failed to write code hashes file (nibble {}): {:?}", self.nibble, e);
}
```
crates/common/trie/trie_sorted.rs (outdated)
```rust
nodes_to_write = buffer_receiver
    .recv()
    .expect("This channel shouldn't close");
```
**Panic instead of error on buffer-channel closure**
If a pool worker panics before returning a buffer via sender.send(...) in flush_nodes_to_write, all senders can drop and recv().expect("This channel shouldn't close") will panic the sub-trie thread rather than propagating a structured Result. The same pattern exists in trie_from_sorted_accounts around line 208. Using recv().map_err(|_| TrieGenerationError::ThreadJoinError()) would keep error handling consistent with the rest of the function.
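The suggested replacement can be sketched with `std::sync::mpsc` and a stand-in error enum (the crate's real error is `TrieGenerationError::ThreadJoinError`; the types below are a minimal illustrative version, not the crate's actual definitions):

```rust
use std::sync::mpsc::Receiver;

// Stand-in for the crate's error type; illustrative only.
#[derive(Debug, PartialEq)]
enum TrieGenerationError {
    ThreadJoinError,
}

// Instead of `recv().expect(...)`, a disconnected channel (all senders
// dropped, e.g. after a pool worker panic) becomes a structured error
// the caller can propagate with `?`.
fn next_buffer(rx: &Receiver<Vec<u8>>) -> Result<Vec<u8>, TrieGenerationError> {
    rx.recv().map_err(|_| TrieGenerationError::ThreadJoinError)
}
```

`recv()` only errors once the channel is both empty and disconnected, so mapping the error loses no buffered messages.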
**Review feedback addressed**

Thanks for the thorough reviews. Here's what was addressed and what wasn't (with reasoning).

**Addressed**

**Not addressing (with reasoning)**
Stale — all concerns addressed in the 'Review feedback addressed' comment and subsequent commits.
Motivation
Profiling snap sync's insertion phases on mainnet revealed that trie building is 75-91% of insertion time, and it's entirely CPU-bound (0% I/O wait). Account insertion took ~20 minutes and storage insertion ~43 minutes — together 61% of total snap sync time.
Description
Three optimizations targeting the CPU-bound trie construction:
1. Eliminate redundant code hash iteration pass
The original code iterated all accounts twice — once to collect code hashes (async flush), once to build the trie. Merged both into a single pass by collecting code hashes in the trie-building iterator and flushing them incrementally to disk.
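The merge of the two passes can be sketched as one loop that feeds the trie builder and collects code hashes as a side effect; the types and the "trie" below are stand-ins, not the PR's actual structures:

```rust
// One iteration over accounts does both jobs: the old code ran this loop
// twice, once to collect code hashes and once to build the trie.
fn single_pass(accounts: &[(u64, Option<u64>)]) -> (u64, Vec<u64>) {
    let mut code_hashes = Vec::new();
    let mut trie = 0u64; // stand-in for the trie under construction
    for &(key, code_hash) in accounts {
        if let Some(h) = code_hash {
            code_hashes.push(h); // side effect folded into the same pass
        }
        trie = trie.wrapping_add(key); // stand-in for inserting into the trie
    }
    (trie, code_hashes)
}
```

The saving is not just the second loop itself but the second RLP decode of every account that the separate pass required.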
2. Reuse `nodehash_buffer` across calls

`add_current_to_parent_and_write_queue` allocated a new 512-byte `Vec` on every call (~700M calls on mainnet). Now it receives the buffer as a parameter from the outer function. Also avoids a `Nibbles` path clone per call by computing the prefix index directly.

3. Parallel state trie building across 16 nibble ranges
Split the state trie build into 16 independent sub-tries (one per first nibble of account hash), built concurrently using streaming RocksDB iterators and a shared thread pool for DB flushes. Sub-trie results are assembled into the root BranchNode via `assemble_root_from_subtries`. Each sub-trie thread also collects its own code hashes (flushed incrementally to disk) and storage accounts, merged after completion.

Code structure

- `trie_from_sorted_subtrie` — builds a single sub-trie, returns the child `NodeRef`
- `assemble_root_from_subtries` — assembles 16 sub-trie results into a root node (shared between `trie_from_sorted_parallel` and `insert_accounts`)
- `trie_from_sorted_parallel` — clean API used by tests; production uses the lower-level functions directly to collect side effects via `NibbleIterator`
- `NibbleIterator` — RocksDB iterator filtered to a single nibble range, collects code hashes + storage accounts as side effects

Benchmarks (mainnet-8, release profile, no validation)
Account Insertion
Storage Insertion
Multisync validation (ethrex-mainnet-9, release-with-debug-assertions)
Ran 4 complete multisync cycles on `ethrex-mainnet-9` (12 cores, 64GB RAM) with the `release-with-debug-assertions` profile, which enables full state trie validation (state root, storage roots, bytecodes) after snap sync. All 4 runs passed on all 3 networks.
Hoodi (total: 45m 48s including validation)
Sepolia (total: 2h 14m including validation)
Mainnet (total: 3h 38m including validation)
Consistency across runs
All runs include full trie validation (state root + storage roots + bytecodes). Validation passed on every run.
Checklist
- `assemble_root_from_subtries`
- `ThreadJoinError` instead of `unwrap`)