perf: EVM cache-build overhaul (dom-CHK + phase fusion + CSR) #514
Merged
…nedy algorithm
Replace the iterative bitset dataflow in computeDominators with Cooper-Harvey-Kennedy 2001 (CHK) producing an immediate-dominator array, augmented with Tarjan DFS pre/post times (DomInfo::Enter/Exit) so the two consumers (findBackEdgesUsingDominators, buildLoopsUsingDominance) answer dominance queries in O(1) via interval containment.
Memory drops from O(N^2) bits to 3N uint32_t. Time drops from O(N^2/64) worst-case to O(N + E) typical for the reducible CFGs that EVM bytecode produces.
evmCacheComplexityDemo speedups vs the post-DTVMStack#446 bitset path:
N=10000: 10.38 ms -> 3.38 ms (3.1x)
N=20000: 43.68 ms -> 5.90 ms (7.4x)
N=50000: - -> 14.48 ms
N=100000: 948 ms -> 38.95 ms (24.3x; user-provided pre-PR number)
Class A/B/C self-root seeding moved to init time so descendants of a class-C node can intersect against a settled root in step 4 of the fixpoint, preserving the old bitset pass's Dom[descendant] semantics (verified by ClassCDescendant_SeedsAtInit).
Gates (all pass):
- format check on PR-changed files clean
- dtvmapi build no new warnings in PR-touched files
- evmone-unittests multipass 223/223
- evmone-unittests interpreter 215/215
- evmone-statetest -k fork_Cancun multipass 2723/2723 (zero new failures)
- evmCacheTests 9/9 (4 implicit-dyn-pred + 5 new dominator)
- evmCacheComplexityDemo gate thresholds met by >=2x margin
Spec and reviews: docs/changes/2026-05-12-evm-dom-chk/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
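The CHK fixpoint described in this commit can be sketched as below. This is a minimal illustration, not the code in evm_cache.cpp: the function names `computeIDom` and `reversePostorder`, the `kUndef` sentinel, and the `vector<vector<uint32_t>>` input are assumptions made for the example, and the class A/B/C seeding is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr uint32_t kUndef = UINT32_MAX; // hypothetical "no idom yet" sentinel

// Iterative DFS from Root; returns reachable nodes in reverse postorder.
std::vector<uint32_t>
reversePostorder(const std::vector<std::vector<uint32_t>> &Succs, uint32_t Root) {
  std::vector<uint32_t> Post;
  std::vector<uint8_t> Seen(Succs.size(), 0);
  std::vector<std::pair<uint32_t, size_t>> Stack;
  Stack.push_back({Root, 0});
  Seen[Root] = 1;
  while (!Stack.empty()) {
    uint32_t Node = Stack.back().first;
    size_t &Next = Stack.back().second;
    if (Next < Succs[Node].size()) {
      uint32_t S = Succs[Node][Next++];
      if (!Seen[S]) {
        Seen[S] = 1;
        Stack.push_back({S, 0});
      }
    } else {
      Post.push_back(Node);
      Stack.pop_back();
    }
  }
  return std::vector<uint32_t>(Post.rbegin(), Post.rend());
}

// Cooper-Harvey-Kennedy: sweep in RPO until no immediate dominator changes.
std::vector<uint32_t>
computeIDom(const std::vector<std::vector<uint32_t>> &Succs, uint32_t Root) {
  const size_t N = Succs.size();
  assert(Root < N);
  std::vector<std::vector<uint32_t>> Preds(N);
  for (uint32_t U = 0; U < N; ++U)
    for (uint32_t V : Succs[U])
      Preds[V].push_back(U);

  std::vector<uint32_t> RPO = reversePostorder(Succs, Root);
  std::vector<uint32_t> Order(N, kUndef); // RPO position of each node
  for (uint32_t I = 0; I < RPO.size(); ++I)
    Order[RPO[I]] = I;

  std::vector<uint32_t> IDom(N, kUndef);
  IDom[Root] = Root;
  // "Finger walk": climb whichever finger sits later in RPO until they meet.
  auto Intersect = [&](uint32_t A, uint32_t B) {
    while (A != B) {
      while (Order[A] > Order[B]) A = IDom[A];
      while (Order[B] > Order[A]) B = IDom[B];
    }
    return A;
  };
  for (bool Changed = true; Changed;) {
    Changed = false;
    for (size_t I = 1; I < RPO.size(); ++I) {
      uint32_t B = RPO[I], New = kUndef;
      for (uint32_t P : Preds[B]) {
        if (IDom[P] == kUndef) continue; // unprocessed or unreachable pred
        New = (New == kUndef) ? P : Intersect(P, New);
      }
      if (New != IDom[B]) {
        IDom[B] = New;
        Changed = true;
      }
    }
  }
  return IDom; // unreachable nodes keep kUndef
}
```

On a reducible CFG such as `0 -> 1 -> {2, 3} -> 4 -> {1, 5}` the first sweep already produces the final array and the second sweep only confirms it.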
Replace `vector<vector<uint32_t>> Children(N)` with a CSR layout (ChildStart[N+1] + ChildIdx[]) so the dom-tree DFS no longer pays N inner-vector heap allocations at large N. Same algorithm; same output; lower constant factor. Also collapse the class-A/B/C init branches in computeDomInfo into a single `HasReachablePred` predicate, since Preds.empty() is just the zero-pred case of the same condition. Behavior unchanged. Gates: format clean, dtvmapi build clean, evmCacheTests 9/9, multipass unittests 223/223, scaling demo within ±10% noise of prior commit (still meets N=20k<15ms / N=100k<100ms gates). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
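A minimal sketch of the CSR child layout and the dom-tree DFS that assigns Enter/Exit times follows. The `ChildStart` / `ChildIdx` / `Enter` / `Exit` names come from the commit messages; the `DomTree` struct, `buildDomTree`, and the `kUndef` sentinel are hypothetical and only illustrate the counting-sort construction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kUndef = UINT32_MAX; // hypothetical "unreachable" marker

struct DomTree {
  std::vector<uint32_t> ChildStart; // N+1 offsets into ChildIdx
  std::vector<uint32_t> ChildIdx;   // children of node V: [ChildStart[V], ChildStart[V+1])
  std::vector<uint32_t> Enter, Exit; // DFS pre/post times on the dominator tree
};

DomTree buildDomTree(const std::vector<uint32_t> &IDom, uint32_t Root) {
  const uint32_t N = static_cast<uint32_t>(IDom.size());
  assert(Root < N);
  DomTree T;
  // 1. Count children per parent, shifted by one slot for the prefix sum.
  T.ChildStart.assign(N + 1, 0);
  for (uint32_t V = 0; V < N; ++V)
    if (V != Root && IDom[V] != kUndef)
      ++T.ChildStart[IDom[V] + 1];
  // 2. Prefix sum turns counts into start offsets.
  for (uint32_t I = 0; I < N; ++I)
    T.ChildStart[I + 1] += T.ChildStart[I];
  // 3. Scatter children into one flat array (two allocations, not N).
  T.ChildIdx.resize(T.ChildStart[N]);
  std::vector<uint32_t> Cursor(T.ChildStart.begin(), T.ChildStart.end() - 1);
  for (uint32_t V = 0; V < N; ++V)
    if (V != Root && IDom[V] != kUndef)
      T.ChildIdx[Cursor[IDom[V]]++] = V;
  // 4. Iterative DFS over the CSR arrays assigns Enter/Exit timestamps.
  T.Enter.assign(N, 0);
  T.Exit.assign(N, 0);
  std::vector<uint32_t> Next(T.ChildStart.begin(), T.ChildStart.end() - 1);
  std::vector<uint32_t> Stack;
  uint32_t Clock = 0;
  Stack.push_back(Root);
  T.Enter[Root] = Clock++;
  while (!Stack.empty()) {
    uint32_t Node = Stack.back();
    if (Next[Node] < T.ChildStart[Node + 1]) {
      uint32_t C = T.ChildIdx[Next[Node]++];
      T.Enter[C] = Clock++;
      Stack.push_back(C);
    } else {
      T.Exit[Node] = Clock++;
      Stack.pop_back();
    }
  }
  return T;
}
```

With these timestamps, A dominates B exactly when B's `[Enter, Exit]` interval nests inside A's, which is what makes the query O(1). Unreachable nodes keep zeroed timestamps in this sketch and would need separate handling.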
Instrument buildGasChunksSPP with 13 named phase boundaries gated by ZEN_EVM_CACHE_PROFILE compile-time flag. OFF (default) macro-elides all chrono calls so release builds carry zero overhead. ON emits one stderr CSV row per phase: EVM_CACHE_PROFILE,<phase>,<microseconds>.
Phases timed:
- buildGasBlocks, collectJumpDests, buildCFGEdges, splitCriticalEdges
- computeReachable (incl. dyn-target stitch), computeDomInfo
- findBackEdges, computeReverseTopo, computeInCycle
- buildLoopsUsingDominance, meteringInit, lemma614Schedule, writeback
Enables per-phase profiling of real-corpus contracts to drive follow-up PR ordering (which phase dominates real workloads).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
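One way such a compile-time gate could look is sketched below. Only the `ZEN_EVM_CACHE_PROFILE` flag and the `EVM_CACHE_PROFILE,<phase>,<microseconds>` row format come from the commit message; the macro names `EVM_PROFILE_BEGIN` / `EVM_PROFILE_PHASE` and the demo function are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>

#ifdef ZEN_EVM_CACHE_PROFILE
// Profiling ON: keep a running timestamp and print one CSV row per phase.
#define EVM_PROFILE_BEGIN() auto ProfLast = std::chrono::steady_clock::now()
#define EVM_PROFILE_PHASE(Name)                                               \
  do {                                                                        \
    auto ProfNow = std::chrono::steady_clock::now();                          \
    long long ProfUs =                                                        \
        std::chrono::duration_cast<std::chrono::microseconds>(ProfNow -       \
                                                              ProfLast)       \
            .count();                                                         \
    std::fprintf(stderr, "EVM_CACHE_PROFILE,%s,%lld\n", Name, ProfUs);        \
    ProfLast = ProfNow;                                                       \
  } while (0)
#else
// Profiling OFF (default): both macros expand to nothing observable,
// so no chrono call survives in the build.
#define EVM_PROFILE_BEGIN()                                                   \
  do {                                                                        \
  } while (0)
#define EVM_PROFILE_PHASE(Name)                                               \
  do {                                                                        \
  } while (0)
#endif

// Hypothetical two-phase routine standing in for the real pipeline.
uint64_t runTwoPhases(uint32_t N) {
  assert(N > 0);
  EVM_PROFILE_BEGIN();
  uint64_t Sum = 0;
  for (uint32_t I = 0; I < N; ++I)
    Sum += I;
  EVM_PROFILE_PHASE("sumPhase");
  uint64_t Doubled = Sum * 2;
  EVM_PROFILE_PHASE("doublePhase");
  return Doubled;
}
```

Building with `-DZEN_EVM_CACHE_PROFILE` would make each `EVM_PROFILE_PHASE` call emit a row on stderr; without it the function compiles to the plain computation.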
evm_cache_complexity_demo: support --bytecode <hex-or-bin-file> [--label <tag>] to time cache build on real contract bytecode. CSV row <label>,<n_jumpdests>,<build_us> on stdout. Hex auto-detects 0x prefix and whitespace, falls back to raw binary. evm_cache_tests: add 5 structural dominator cases covering self-loops, irreducible multi-entry SCCs, nested loops with shared exits, post-split critical-edge diamonds, and dynamic-target JUMPDESTs inside static loops. Each case asserts IDom well-formedness via a shared helper plus behavioural invariants on cycle members. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bench_evm_cache.sh: spawn one fresh process per repetition (avoids intra-process cache reuse); emit long-form CSV (label,n_jumpdests,run_idx,phase,phase_us). `phase=total` comes from the demo's stdout; per-phase rows are picked up from the demo's stderr when the binary is built with -DZEN_EVM_CACHE_PROFILE=ON.
analyze_evm_cache_bench.py: cluster bootstrap on contracts (per-contract unit, WITH replacement, N=1000 by default); BCa with jackknife `a` (leave-one-contract-out, Efron 1987) and median-bias `z_0`; gate inverts on the `total` phase as r_upper_CI <= 0.85 (= improvement_lo >= 15%).
Sanity: baseline=treatment gives r_median=1.0 with degenerate CI; a synthetic 50% improvement on 10 contracts gives r=0.504, CI (0.490, 0.508), gate PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fetch_topcontracts.py: curated list of ~90 high-traffic mainnet contracts
(stables, DEX routers, lending markets, NFT marketplaces, infra); pulls
runtime bytecode via public JSON-RPC (eth_getCode); dedupe by codehash;
writes hex + per-contract meta JSON. Used for the primary paired-ratio
bench corpus (production-grade workload, ~80 unique contracts).
fetch_sourcify_corpus.py: pulls verified contracts via Sourcify v2 REST
API (`/contracts/{chainId}` + `/contract/.../?fields=runtimeBytecode,
metadata`); supplies solc_version / optimizer_runs / viaIR metadata for
stratified sampling. Higher noise floor than top-RPC (most newly
verified contracts are 100-200 byte proxy stubs) but provides 7-strata
metadata when needed.
.gitignore: corpus output (raw/ + manifest_*.json) is bench artefact,
not source. Fetchers are reproducible; bench results live in spec
Results section.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
evm_cache.md: replace the iterative-bitset dataflow description with the Cooper-Harvey-Kennedy algorithm + Tarjan DFS Enter/Exit intervals (`O(N+E)` time, `O(N)` memory, `O(1)` `dominates` queries). Add a section on the optional `-DZEN_EVM_CACHE_PROFILE=ON` per-phase wall-clock CSV emission used to drive `tools/bench_evm_cache.sh`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move the spec drafted in ~/changes/2026-05-16-evm-spp-overhaul/ into the project-required location docs/changes/2026-05-16-evm-spp-overhaul/, with all Phase 0.5 motivation reviews + Phase 2 spec reviews retained. DTVM/CLAUDE.md mandates change docs live under docs/changes/ as PR artefacts, overriding the global ~/changes/ SSOT default. Spec status is now Implemented (v3) per the latest Results section in README.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address 4 of 6 Codex round-1 findings (1 NIT skipped, commit reword deferred until Opus returns):
- Production gate: relabel "borderline" -> explicit FAIL on the recalibrated improvement_lo>0 clause, note user override on stratified evidence + algorithmic gate PASS.
- Algorithmic table: refresh with 9-rep median (was 5-rep). 100k now 21.83x; add measurement-variance note acknowledging independent reviewer reruns in the 20-30x range.
- Step 5 scope: narrow the spec claim to IDom-only structural tests; enumerate the loop / SPP / fuzz invariants explicitly deferred and point at evmone-statetest 2723 + existing implicit-dyn-pred GTests for end-to-end coverage.
- analyze_evm_cache_bench.py docstring: cite Efron-Tibshirani 1993 (per Phase 2 R2 accepted nit) instead of Efron 1987.
- fetch_topcontracts.py: split raw/ and meta/ into sibling dirs so the bench harness doesn't mis-interpret .meta.json files as bytecode files; remove sanctioned TornadoCash01 from TOP_CONTRACTS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address 5 of 6 Opus round-1 findings (cosmetic NIT 6 left as-is):
- Replace IrreducibleSCC_TwoEntryLoop test with IrreducibleImproperRegion.
Old CFG was reducible under DTVM's dom-based loop detection (the two
cycle edges had no dominating back-edge target, so zero loops were
discovered and the fallback path was never exercised). New CFG is a
Hecht-Ullman improper region: 0 -> 1 -> 2 -> 3 -> {1, 4}; 4 -> {2, 5}.
Two overlapping back-edges (3->1 and 4->2) produce two loops with body
intersection {2, 3} but neither containing the other, forcing the
reducibility check to fail. The test asserts IDom correctness on the
irreducible region plus the dominator-chain-reaches-root invariant.
- Reorder DomInfo::dominates() bounds check before the A==B shortcut so
out-of-range equal arguments do not falsely report mutual dominance.
- evm_cache_for_testing.h: document that computeIDomForTesting is the
dominator pass in isolation, with no computeReachable /
splitCriticalEdges / reachability-stitch coverage.
- Spec Step 5 prose: add a downgrade note enumerating the per-fixture
behavioural claims (InCycle, UseLinearSPP, buildLoopsUsingDominance,
GasChunkCostSPP fallback, splitCriticalEdges write-back) and path-total
fuzz that were deferred to PR B / PR C.
- Spec Checklist: annotate Step 7 with "production gate FAIL, override
approved" so the failure flag is visible at scan time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
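The bounds-check reordering in `DomInfo::dominates()` can be illustrated with the sketch below. The `Enter` / `Exit` fields and the method name come from the commit messages; the body is an assumption about how interval containment might be written, not the shipped code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct DomInfo {
  std::vector<uint32_t> Enter; // DFS pre-order time on the dominator tree
  std::vector<uint32_t> Exit;  // DFS post-order time on the dominator tree

  bool dominates(uint32_t A, uint32_t B) const {
    // Bounds check first: with the A == B shortcut placed before it,
    // dominates(99, 99) on a 6-node graph would wrongly return true.
    if (A >= Enter.size() || B >= Enter.size())
      return false;
    if (A == B)
      return true;
    // A dominates B iff B's interval nests inside A's.
    return Enter[A] <= Enter[B] && Exit[B] <= Exit[A];
  }
};
```

The ordering matters only for invalid ids, so the change cannot affect results on well-formed queries.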
R2 reviewers (Codex + Opus, parallel) reviewed commits b00efa1 + 8a95175 shipped for R1. Codex R2 raised 1 issue (variance band); Opus R2 raised 1 MINOR (test naming/comment drift) + 1 NIT (PR B note). Fixes:
- Variance band (Codex R2 §5): 9-rep median rerun produced 19.26x at N=100k; spec said 20-30x. Refresh to "≈ 19-30x" with the four sampled medians explicitly listed (19.26 / 21.83 / 22.84 / 29.7); gate remains ≥10x.
- IrreducibleImproperRegion mis-naming (Opus R2 MINOR): the new CFG 0->1->2->3->{1,4}; 4->{2,5} produces natural loops {1,2,3,4} and {2,3,4} where the second is properly nested in the first (reducible nest). My R1 fix-attempt comment claimed otherwise. Rename test to OverlappingBackEdgesIDom and rewrite the comment to describe it as a reducible nested case that exercises the CHK intersect finger-walk on a non-trivial back-edge set; soften the §"Step 5 Scope Reduction" wording from "genuinely forces ... irreducible loop nest" to the truer narrative.
- Opus R2 NIT (PR B note): add a structural observation to the spec: dominator-based loop discovery only ever produces a properly-nested loop forest by construction, so exercising the SPP reducibility fallback at evm_cache.cpp:1019-1042 requires buildBytecodeCache-level plumb, not the computeIDomForTesting helper. Documented for PR B/C authors.
Code change (test rename + comment) verified: 14/14 evmCacheTests pass, no other targets touched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously buildGasBlocks ran two passes over the bytecode: pass 2a marked IsBlockStart[CodeSize] for JUMPDEST positions and after-terminator bytes, and pass 2b walked IsBlockStart to construct GasBlock entries. The auxiliary IsBlockStart vector cost CodeSize bytes of allocation + memset and one extra L1/L2-hostile traversal of the bytecode. Replace with a single walk: each iteration opens a new block at the current Pc, then advances opcode by opcode until either (a) a mid-block OP_JUMPDEST is encountered (which starts a new block), or (b) a gas-chunk terminator is processed (whose successor byte opens the next block). Semantically identical because every block start under the old scheme was either Pc=0, a JUMPDEST position, or the byte right after a terminator -- all three are produced naturally by the fused loop. Measured on evmCacheComplexityDemo N=100k synthetic (5 reps, median): phase buildGasBlocks: 10614 us -> 9250 us (-13%) total cache build: 54818 us -> 46260 us (-15%) The total wins more than the named phase because the eliminated IsBlockStart vector (300 KB for N=100k synthetic) sat in the outer buildBytecodeCache and is no longer allocated or zero-filled. Verification: - evmCacheTests: 14/14 pass - evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass - No behavioral change in callers; signature unchanged.
collectJumpDests previously re-scanned the entire bytecode after buildGasBlocks, allocated a SeenBlocks[Blocks.size()] dedup vector, and mapped each JUMPDEST byte through BlockAtPc to recover the unique set of JUMPDEST-leading blocks. Every JUMPDEST byte in valid EVM code already starts a new gas block, so the dedup is structurally unnecessary and the re-scan is pure duplication of the buildGasBlocks walk.
Emit JumpDestBlocks inline: each iteration of buildGasBlocks reads the opening opcode of the new block; if it is OP_JUMPDEST, push the block id that is about to be assigned. Output is identical to the prior pass in both set membership and block-id ascending order; downstream buildCFGEdges and reachability seeding consume the list as an unordered set so any iteration order is acceptable.
Measured on evmCacheComplexityDemo N=100k (5 reps, median):
total: 46260 us -> 42813 us (-7%, this commit)
total vs main: 54818 us -> 42813 us (-22%, cumulative w/ prior fusion)
The phase formerly reported as EVM_CACHE_PROFILE,collectJumpDests is now absent from profile output; its 0.4 ms instrumented cost plus an equivalent amount of un-instrumented bytecode-rescan + SeenBlocks zero-fill is reclaimed.
Verification:
- evmCacheTests: 14/14 pass
- evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
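The fused walk from the two commits above (single-pass block construction plus inline JumpDestBlocks emission) might look roughly like the sketch below. The `Block` struct, `buildBlocks`, and in particular the terminator set in `isTerminator` are assumptions for illustration; the real gas-chunk terminator list lives in evm_cache.cpp and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Block {
  uint32_t Start; // first byte of the block
  uint32_t End;   // one past the last byte
};

constexpr uint8_t OP_JUMPDEST = 0x5b;

// Assumed terminator set: STOP, JUMP, JUMPI, RETURN, REVERT, INVALID, SELFDESTRUCT.
inline bool isTerminator(uint8_t Op) {
  return Op == 0x00 || Op == 0x56 || Op == 0x57 || Op == 0xf3 || Op == 0xfd ||
         Op == 0xfe || Op == 0xff;
}

// PUSH1..PUSH32 (0x60..0x7f) carry 1..32 immediate bytes; all else is 1 byte.
inline uint32_t opcodeLen(uint8_t Op) {
  return (Op >= 0x60 && Op <= 0x7f) ? 2u + (Op - 0x60u) : 1u;
}

void buildBlocks(const std::vector<uint8_t> &Code, std::vector<Block> &Blocks,
                 std::vector<uint32_t> &JumpDestBlocks) {
  const uint32_t Size = static_cast<uint32_t>(Code.size());
  uint32_t Pc = 0;
  while (Pc < Size) {
    // Each outer iteration opens a block at the current Pc.
    const uint32_t Id = static_cast<uint32_t>(Blocks.size());
    if (Code[Pc] == OP_JUMPDEST)
      JumpDestBlocks.push_back(Id); // emitted inline, already in ascending id order
    const uint32_t Start = Pc;
    bool First = true;
    while (Pc < Size) {
      const uint8_t Op = Code[Pc];
      if (Op == OP_JUMPDEST && !First)
        break; // (a) mid-block JUMPDEST starts the next block
      First = false;
      Pc += opcodeLen(Op); // skips PUSH immediates, so data bytes are never decoded
      if (isTerminator(Op))
        break; // (b) the byte after a terminator opens the next block
    }
    Blocks.push_back(Block{Start, Pc < Size ? Pc : Size});
  }
  assert(Blocks.empty() || Blocks.back().End == Size);
}
```

Because the walk advances by `opcodeLen`, a `0x5b` byte inside a PUSH immediate is never mistaken for a JUMPDEST, which is the property the separate IsBlockStart pass previously had to reproduce.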
…sses
Previously every GasBlock owned a std::vector<uint32_t> Preds and Succs. With N gas blocks that materialises 2N small heap allocations, and every neighbour-iteration walks a pointer to a scattered heap chunk. The hot SPP passes (computeDomInfo CHK intersect over Preds, computeInCycle SCC DFS, findBackEdges over Succs, buildLoopsUsingDominance over both) all pay this pointer-chase tax per node.
Flatten both directions into a contiguous CSR adjacency once after splitCriticalEdges finishes mutating the graph, then route every downstream reader through the new SuccsCSR / PredsCSR. The per-block vectors stay live (we copy out, not swap) -- N std::vector dealloc()s back-to-back cost more than the readers reclaim, so we trade short-lived peak memory for time.
Reader-side measurement on evmCacheComplexityDemo N=100k (25 reps):
pre-CSR (commit 3bba649) median = 44797 us
post-CSR (this commit) median = 39475 us (-11.9%)
Per-phase breakdown shifts the cost from many "Preds/Succs reader" rows into a single buildCSR row plus much faster readers:
computeDomInfo 7233 -> 4169 us (-42%)
computeInCycle 5694 -> 3842 us (-32%)
computeReachable 1818 -> 970 us (-47%)
findBackEdges 1169 -> 342 us (-71%)
buildLoops 1309 -> 423 us (-68%)
computeReverseTopo 1651 -> 1114 us (-32%)
buildCSR 0 -> 3985 us (new, single up-front cost)
Cumulative vs perf/evm-spp-foundation baseline (PR A binary at 54818 us): 54818 us -> 39475 us (-28.0%)
Mutating helpers (buildCFGEdges, splitCriticalEdges, addEdge) still operate on the per-block vectors. CSR is built once after the mutations finish, so addEdge / erase semantics in those phases are unchanged.
The testing helper computeIDomForTesting now builds its own CSR pair in-place from the input Succs[] adjacency, matching the production flow.
Verification:
- evmCacheTests: 14/14 pass
- evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
Adds a single counter (gated on ZEN_EVM_CACHE_PROFILE) that prints how many RPO sweeps the CHK fixpoint took to settle. Used to answer "would SemiNCA help here?". Measurement on evmCacheComplexityDemo at N = 10k / 20k / 50k / 100k synthetic shows the fixpoint settles in exactly 2 rounds in every case -- one productive sweep plus one confirmation sweep -- so SemiNCA's single-pass advantage caps at roughly half of computeDomInfo's time, well under the cost of its eval/link forest bookkeeping. Zero runtime cost when ZEN_EVM_CACHE_PROFILE is off (macro elides).
computeInCycle previously ran an unconditional Tarjan SCC pass to mark every block that participates in a cycle. On a reducible CFG -- the common case for compiler-emitted EVM code -- this work is redundant: every cycle is the natural loop of some back-edge, and buildLoopsUsingDominance already enumerates those natural loops with their NodeMask bitmaps. The union of NodeMasks equals Tarjan's in-cycle set, so we can derive InCycle in one bitset OR sweep instead of running a second full DFS pair over Succs and Preds. Pipeline reorder: buildLoopsUsingDominance now runs before InCycle so its UseLinearSPP result and Loops vector are available to choose the cheap path. Reducible path (UseLinearSPP=true): OR all Loops[].NodeMask bitmaps into a CycleBits vector, then expand into the existing uint8_t InCycle vector. Empty Loops vector yields all-zero InCycle, which is correct -- an acyclic CFG has nothing in a cycle. Irreducible path (UseLinearSPP=false): keep the full Tarjan SCC. Dominator-based loop discovery can miss multi-entry cycles that have no single header, and lemma614Update relies on InCycle correctness to refuse gas shifts across cycles. The Tarjan backstop preserves soundness for these cases (rare in practice -- statetest 2723 shows no irreducible contracts trigger the fallback at scale). Measured on evmCacheComplexityDemo N=100k (50 reps, median): pre (commit 4d74033): 41247 us post (this commit): 39592 us (-4.0%) Phase delta: computeInCycle 3842 us -> 74 us; buildLoopsUsingDominance absorbs ~1.2 ms of cold-cache cost from running first instead of second. Net ~1.6 ms gain on synthetic, consistent across the IQR band. Verification: - evmCacheTests: 14/14 pass (covers IrreducibleImproperRegion fallback path indirectly through computeIDomForTesting; full Tarjan branch exercised below) - evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
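The reducible fast path (OR all loop bitmaps, then expand to bytes) can be sketched as follows. `NodeMask`, `CycleBits`, and `InCycle` are names from the commit message; the `Loop` struct layout with 64-bit words and the function `inCycleFromLoops` are assumptions for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Loop {
  std::vector<uint64_t> NodeMask; // bit B set <=> block B belongs to this natural loop
};

std::vector<uint8_t> inCycleFromLoops(const std::vector<Loop> &Loops,
                                      size_t NumBlocks) {
  const size_t Words = (NumBlocks + 63) / 64;
  std::vector<uint64_t> CycleBits(Words, 0);
  // One OR sweep per loop: the union of natural loops is the in-cycle set
  // on a reducible CFG.
  for (const Loop &L : Loops) {
    assert(L.NodeMask.size() == Words);
    for (size_t W = 0; W < Words; ++W)
      CycleBits[W] |= L.NodeMask[W];
  }
  // Expand the bitmap into the byte-per-block form downstream code reads.
  std::vector<uint8_t> InCycle(NumBlocks, 0);
  for (size_t B = 0; B < NumBlocks; ++B)
    InCycle[B] = static_cast<uint8_t>((CycleBits[B / 64] >> (B % 64)) & 1u);
  return InCycle;
}
```

An empty `Loops` vector leaves every entry zero, matching the acyclic case described above. This sketch covers only the `UseLinearSPP=true` branch; the Tarjan fallback is not shown.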
buildCFGEdges previously walked Blocks twice. The first pass called resolveConstantJumpTarget on every JUMP block solely to count the unresolved dynamic jumps and stamp ImplicitDynamicPredCount on every JUMPDEST. The second pass walked Blocks again to add fallthrough and jump-target edges, calling resolveConstantJumpTarget a second time on each JUMP block to recover the same answer. Collapse into one pass: count DynamicJumpCount inline while emitting edges, then stamp the JUMPDESTs at the end. addEdge does not depend on ImplicitDynamicPredCount being set, so deferring the stamp is safe. Measured on evmCacheComplexityDemo N=100k (50 reps): phase buildCFGEdges: 5315 us -> 4766 us (-10%) total cache build: 39592 us -> 38595 us (-2.5%) The phase win cancels half the per-call resolveConstantJumpTarget cost (the function is pure of Block + constants, so the second call returned the same answer with no side effect). Verification: - evmCacheTests: 14/14 pass - evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
computeReverseTopo previously ran its own full DFS over Succs to produce a postorder list, explicitly skipping back-edges. That DFS is semantically identical to the DFS computeDomInfo already runs: visit each reachable node once, never follow back-edges (back-edge targets are visited ancestors, so the "visited" check rejects them anyway, making the explicit BackEdges filter redundant). Both produce the same forward-DAG postorder.
Expose computeDomInfo's RPO as a DomInfo::RPO field. computeReverseTopo collapses to a reverse copy of Dom.RPO -- O(N) memory traversal instead of O(N+E) DFS. The defensive second pass in computeDomInfo (that visits unreachable components after the main reachable DFS) is preserved, so RPO covers every block id, matching computeReverseTopo's previous output set.
Measured on evmCacheComplexityDemo N=100k (50 reps):
phase computeReverseTopo: 1203 us -> 371 us (-69%)
total cache build: 38595 us -> 38534 us (-0.2%, within noise)
Total wall-clock barely moves because the freed cycles re-emerge as slight increases in adjacent phases via cache effects -- the work shifted rather than disappeared. The win is structural: less code, one fewer DFS, RPO available for future passes that could subsume RevTopo entirely.
Verification:
- evmCacheTests: 14/14 pass
- evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
Pure clang-format adjustments to function signatures and continuation line breaks introduced over the InCycle / RPO / buildCFGEdges fusion commits. No semantic changes.
…oc cost
buildGasBlocks previously default-constructed a stack-local GasBlock,
filled it across the inner opcode loop, then std::move'd it into
Blocks via push_back. Each push_back paid two costs:
1. Move construction -- 80 bytes copied from stack-local Block into
the vector slot.
2. Geometric capacity growth -- log2(N) reallocations during build,
each copying the entire prefix (~half of final size on average).
For N=100k blocks that is roughly 4 MB of memmove traffic that
contributes nothing to the result.
Replace with the two changes that drop both costs:
- Blocks.reserve(CodeSize) up front. Worst-case bound: opcodeLen >= 1
so block count is bounded by CodeSize. Real EVM averages 3-10
bytes/block so this over-reserves transiently by 3-10x, but the
saved realloc copies dominate. For EIP-170 production code
(24576 B max) the reserve costs ~1.9 MB; for the synthetic stress
demo at N=100k (CodeSize ~300 KB) it costs ~24 MB transient.
- emplace_back() the new block into Blocks directly; bind a
reference Blocks.back() (== emplace_back's return) and fill the
block in place. No stack-local intermediate, no move.
Measured on evmCacheComplexityDemo N=100k (100 reps):
phase buildGasBlocks: 10815 us -> 5108 us (-53%)
total cache build: 35170 us -> 31683 us (-9.9%)
This is the single biggest win in the PR after the initial fusion. The
reserve calculation is conservative on purpose: knowing the exact final
block count would need another bytecode pass, which would itself cost
~1 ms at this scale.
Verification:
- evmCacheTests: 14/14 pass
- evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
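The reserve + emplace_back pattern described above can be sketched on a toy block builder. The three-field `GasBlock` and the fixed-width splitting rule are hypothetical stand-ins for the real opcode walk; only the allocation pattern is the point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GasBlock { // trimmed, hypothetical version of the real struct
  uint32_t Start = 0;
  uint32_t End = 0;
  uint64_t Cost = 0;
};

// Splits [0, CodeSize) into blocks of at most Width bytes.
std::vector<GasBlock> buildFixedWidthBlocks(uint32_t CodeSize, uint32_t Width) {
  assert(Width >= 1);
  std::vector<GasBlock> Blocks;
  // Worst case is one block per byte, so this bound rules out any
  // reallocation (and prefix copy) during the loop below.
  Blocks.reserve(CodeSize);
  for (uint32_t Pc = 0; Pc < CodeSize;) {
    Blocks.emplace_back();       // construct directly in the vector slot
    GasBlock &B = Blocks.back(); // stays valid: capacity was reserved up front
    B.Start = Pc;
    const uint32_t Step = (CodeSize - Pc < Width) ? (CodeSize - Pc) : Width;
    Pc += Step;
    B.End = Pc;
    B.Cost = Step; // placeholder cost for the example
  }
  return Blocks;
}
```

Holding `B` across the loop body is only safe because of the up-front reserve; as a later commit in this PR notes, that guarantee does not extend to appends made after initial construction.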
GasBlock previously embedded two std::vector<uint32_t> (Succs, Preds) inline. Each vector occupies 24 bytes of control fields, so the block header bloated to ~80 bytes per entry. Every pass that iterated Blocks to read scalar fields (computeDomInfo class B/C init, buildLoops body scans, meteringInit Cost copy, lemma614Update opcode/cost reads, writeback Start/End/Cost emit) paid this 2x cache stride.
Replace with a parallel EdgeTables that holds two std::vector<vector<>> keyed by block id. The CFG-build phase (buildCFGEdges, addEdge, splitCriticalEdges) now operates on EdgeTables; the flatten step (buildAdjacencyCSR) reads from EdgeTables. Downstream readers were already on the CSR after the earlier CSR commit, so nothing else needed touching.
GasBlock shrinks from ~80 -> ~40 bytes (4 uint32 PCs + 2 uint8 opcodes + uint64 cost + uint32 dyn-pred count = 32 bytes payload, 40 with padding). Iterating Blocks halves the cache traffic and the default constructor stops zero-filling two 24-byte vector control structs per emplace.
Measured on evmCacheComplexityDemo N=100k (100 reps):
phase buildGasBlocks: 5108 us -> 2515 us (-51%)
phase buildCSR: 3929 us -> 2980 us (-24%)
phase splitCriticalEdges: 751 us -> 395 us (-47%)
phase writeback: 671 us -> 368 us (-45%)
total cache build: 31683 us -> 28642 us (-9.6%)
The buildGasBlocks win compounds with the prior reserve+emplace commit: now each emplaced GasBlock is half the size and has no vector ctor to invoke. The writeback win is pure stride compression on a tight loop over Blocks.
Cumulative vs perf/evm-spp-foundation HEAD (47429 us at N=100k): 47429 -> 28642 us = -39.6%.
Verification:
- evmCacheTests: 14/14 pass
- evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
After moving Succs/Preds out, GasBlock was 40 bytes -- the lone 8-byte Cost sat after a uint32 ImplicitDynamicPredCount, leaving a 4-byte trailing pad to satisfy the struct's 8-byte alignment. Reorder so all five 32-bit fields cluster first (Start, End, LastPc, PrevPc, ImplicitDynamicPredCount), followed by the two 1-byte opcodes + 2-byte tail pad to reach the 24-byte mark, then the 8-byte Cost. Total = 32 bytes exact, two blocks per cache line, no trailing pad. Locked in with a static_assert so future field additions get flagged. Measured on evmCacheComplexityDemo N=100k (100 reps): phase buildGasBlocks: 2515 us -> 2157 us (-14%, less zero-init/emplace) phase writeback: 368 us -> 331 us (-10%) phase splitCriticalEdges: 395 us -> 361 us (-9%) total cache build: 28642 us -> 28180 us (-1.6%) The buildGasBlocks win is the default-constructor doing less work per emplace_back (32 bytes of zeroed memory instead of 40). The writeback and split wins are from the tighter Block stride in their iteration loops. Verification: - evmCacheTests: 14/14 pass - evmone-statetest --vm external_vm -k fork_Cancun: 2723/2723 pass
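A sketch of the repacked layout with its size lock is shown below. `Start`, `End`, `LastPc`, `PrevPc`, `ImplicitDynamicPredCount`, and `Cost` are named in the commit message; the two opcode field names and the helper `blocksPerLine` are hypothetical, and the offsets assume a typical 64-bit ABI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct GasBlock {
  uint32_t Start = 0;                    // offset 0
  uint32_t End = 0;                      // offset 4
  uint32_t LastPc = 0;                   // offset 8
  uint32_t PrevPc = 0;                   // offset 12
  uint32_t ImplicitDynamicPredCount = 0; // offset 16
  uint8_t LastOpcode = 0;                // offset 20 (hypothetical name)
  uint8_t PrevOpcode = 0;                // offset 21 (hypothetical name)
  uint8_t Pad[2] = {0, 0};               // offset 22, explicit pad to 24
  uint64_t Cost = 0;                     // offset 24, 8-byte aligned, no trailing pad
};

// Lock the layout so a future field addition fails the build instead of
// silently widening the stride.
static_assert(sizeof(GasBlock) == 32, "GasBlock must stay 32 bytes");
static_assert(offsetof(GasBlock, Cost) == 24, "Cost must start at byte 24");

// How many blocks fit in one cache line of LineBytes bytes.
inline size_t blocksPerLine(size_t LineBytes) {
  assert(LineBytes % sizeof(GasBlock) == 0);
  return LineBytes / sizeof(GasBlock);
}
```

Placing the 8-byte field last after an explicitly padded 24-byte prefix is what removes the trailing pad that the previous field order forced.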
Full-tier spec covering all 11 commits on perf/cache-build-fusion: phase fusion (buildGasBlocks 2-pass merge, collectJumpDests fold, buildCFGEdges single sweep), CSR adjacency + conditional Tarjan, DomInfo::RPO share with computeReverseTopo, and the GasBlock compaction trio (reserve + emplace_back, Succs/Preds split into EdgeTables, 32-byte field repack). Documents the data behind dropping PR B (Stack-SSA: 92.5/98.4% JUMPs already static; <1% expected runtime win) and SemiNCA (CHK fixpoint converges in 2 rounds at every measured N). Cross-N speedup table vs perf/evm-spp-foundation baseline (100-rep median): -21% at N=10k scaling to -41% at N=100k.
Both reviewers returned REVISE. Fixes applied:
Major (Opus M-1/M-2/M-3, Codex C4/C5):
- Remove fabricated "IrreducibleImproperRegion" test reference (the test is OverlappingBackEdgesIDom and its own comment disclaims fallback coverage). State that no unit test currently drives UseLinearSPP=false; end-to-end soundness comes from statetest.
- Rewrite R2 soundness argument: InCycle=union(natural-loops) is a *performance* fast-path, not the safety mechanism. Soundness on irreducible CFGs is provided by lemma614Update's multi-pred guard via effectivePredCount, since every SCC-internal node has at least one in-cycle predecessor pushing its count >= 2. Added explicit warning not to remove the multi-pred guard on the assumption that InCycle covers it.
- Soften "independently revertable" claim. Phase-internal commits (notably Phase 2's CSR/EdgeTables pair and Phase 5's reserve -> split -> repack chain) cannot be reverted in isolation without breaking the build; the per-commit-greenness claim remains.
- Rewrite the perf tables. Replace stitched 9-rep + 25-rep data with a single same-session 50-rep per-phase + 100-rep interleaved-total measurement, both rebuilt from src/evm/evm_cache.cpp at 592fd35 vs HEAD. Document methodology so reviewers can reproduce. N=100k speedup re-derives to 1.69x / -41.0% under this methodology.
Minor (Opus N-1..N-6, Codex C1/C6/C7/C8):
- Per-phase sum vs total discrepancy now explained (chrono overhead at 13 phase boundaries).
- Diff stat fixed: +312/-188 (was +236/-171).
- Commit count clarified: 11 implementation + 1 docs = 12.
- byte-identical EVMBytecodeCache claim softened to "behaviourally identical (statetest 2723/2723)" since no memcmp diff is run.
- R1 (Blocks.reserve) scope note added: the no-realloc guarantee covers only buildGasBlocks initial construction, not the later splitCriticalEdges append.
- R4 (chkFixpointRounds=2) caveat: synthetic stress + unit tests are the easy case for CHK; real-corpus measurement deferred.
- N-6 meteringInit +110% attribution downgraded to "conjecture from access pattern, not measured."
- Format gate description acknowledges Codex's exit-123 observation on pre-existing unrelated file violations; PR diff itself is clean.
Code changes (Codex 3.3 suggestion):
- Add Edges.size() == Blocks.size() invariant assert before buildAdjacencyCSR. Catches future drift if a new Blocks.push_back forgets to grow Edges in lockstep.
- Fix GasBlock layout comment ("22 pad uint16" -> "22 pad[2]") per Opus N-4 since there is no actual uint16 field there.
Verification:
- evmCacheTests 14/14 pass
- evmone-statetest -k fork_Cancun 2723/2723 pass
- tools/format.sh check clean
Both Round 2 reviewers (Opus + Codex, independent) returned PASS verdicts after verifying the c5db655 round-1 fixes. Codex re-measured N=100k at 1.67x speedup (-40.2%), reproducing the documented 1.69x / -41.0% within +/-10% under the same interleaved methodology. Opus noted no new issues introduced by the R1 fixes. Polish item from Opus's R2 (non-blocking): the per-phase table notes were one-sided -- they explained why baseline's instrumented sum exceeds the total (chrono overhead at phase boundaries) but not why HEAD's sum is below the total (un-instrumented outer vector allocation in buildBytecodeCache; ~7 ms for synthetic N=100k due to 9.6 MB PushValueMap zero-init, ~0.2 ms for EIP-170 production code). Added a paragraph explaining the asymmetry. Review cadence: 2 rounds, target met within 1-2 cap.
- `docs/modules/evm/cache-build.md`: new module spec scoped to shipped
state. Covers pipeline phase order, GasBlock 32B layout, EdgeTables /
CSRGraph types, DomInfo (CHK + Tarjan E/E), conditional InCycle
branches, ZEN_EVM_CACHE_PROFILE counters, and the R2-verbatim
soundness invariant via lemma614Update's effectivePredCount multi-pred
guard (with explicit future-contributor warning).
- `perf-summary.md`: appends a directional B-lite Sourcify pilot
(n=10, paired wall-clock vs upstream/main `ef062ae` on mainnet
contracts pulled via `eth_getCode`, stratified by CodeSize). Overall
median 1.17x / +14.9% with 9/10 contracts faster; DAI flagged as
follow-up outlier. Adds an operationalized future-work C-rubric with
pre-committed GO/KILL/Partial thresholds covering production-size
cache-build, end-to-end evmone-bench, N-stratum spread, and
first-touch p95.
- `reviews/motivation-{1,2}-{opus,codex}.md`: dev-cycle motivation
red-team for the A -> B -> C follow-up plan. iter=1 both REFINE
(33x framing, C numeric trigger, C estimate provenance, R2 PASS
preservation, B methodology). iter=2 Opus PROCEED conditional on
three write-time fixes (C-rubric (iii) operationalize, evm_cache.md
scope, B-lite labeling); Codex REFINE on the same convergent list.
All review-cited write-time fixes are applied in the deliverables
themselves: cache-build.md scoped tight; perf-summary B-lite labeled
"directional, n=10, selection-biased"; C-rubric (iii) replaced with
"N=2000 paired >= 50% of N=100k paired"; (iv) first-touch p95 >= 5%
clause added.
The pipeline table lists phases 0-13 (14 entries) but the chrono-overhead prose said "13 phase pairs", which refers to the 13 phases inside `buildGasChunksSPP` and excludes phase 0, `buildJumpDestMap`, which runs in `buildBytecodeCache`'s outer scope. A reader cross-referencing the table could briefly think the number was wrong. Clarify in the prose without changing the table or the numeric overhead estimate.
The change doc README and reviews were already in English; only perf-summary.md was mixed Chinese + English. Translate verbatim, preserving all numeric tables, identifiers, file paths, commit SHAs, and markdown structure.
Pull request overview
This PR overhauls EVM bytecode cache-build performance by replacing/optimizing CFG and dominator-related passes, adding profiling and benchmarking tools, and documenting the new cache-build pipeline and performance methodology.
Changes:
- Adds EVM cache-build profiling, benchmark analysis, corpus-fetching, and bytecode replay tooling.
- Refactors cache-build internals around CHK dominators, CSR adjacency, phase fusion, and compact block metadata.
- Adds extensive change documentation, performance summaries, and adversarial review records.
Reviewed changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| CMakeLists.txt | Adds ZEN_EVM_CACHE_PROFILE option. |
| src/evm/CMakeLists.txt | Propagates cache profiling define to EVM object target. |
| src/evm/evm_cache.cpp | Implements cache-build pipeline changes, CSR graph, CHK dominators, and profiling hooks. |
| src/evm/evm_cache.md | Updates cache-build algorithm documentation. |
| src/evm/evm_cache_for_testing.h | Adds dominator testing helper API declaration. |
| src/tests/evm_cache_complexity_demo.cpp | Adds bytecode replay mode and microsecond CSV output. |
| tests/corpus/evm-cache/.gitignore | Ignores generated corpus/benchmark artifacts. |
| tests/corpus/evm-cache/fetch_sourcify_corpus.py | Adds Sourcify corpus acquisition and metadata extraction. |
| tools/bench_evm_cache.sh | Adds repeated fresh-process cache-build benchmark runner. |
| tools/analyze_evm_cache_bench.py | Adds paired-ratio BCa bootstrap analyzer. |
| docs/modules/evm/cache-build.md | Adds module-level cache-build specification and invariants. |
| docs/changes/2026-05-17-evm-cache-build-fusion/perf-summary.md | Adds performance summary and follow-up gating rubric. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/round-1-opus.md | Adds round-1 review record. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/round-1-codex.md | Adds round-1 review record. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/round-2-opus.md | Adds round-2 verification record. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/round-2-codex.md | Adds round-2 verification record. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/motivation-1-opus.md | Adds motivation review record. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/motivation-1-codex.md | Adds motivation review record. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/motivation-2-opus.md | Adds follow-up motivation review record. |
| docs/changes/2026-05-17-evm-cache-build-fusion/reviews/motivation-2-codex.md | Adds follow-up motivation review record. |
| docs/changes/2026-05-16-evm-spp-overhaul/problem-statement.md | Adds scoped problem statement for foundation work. |
| docs/changes/2026-05-16-evm-spp-overhaul/reviews/* | Adds foundation-layer motivation, spec, and implementation review records. |
| docs/changes/2026-05-12-evm-dom-chk/reviews/* | Adds prior dominator-change review records. |
⚡ Performance Regression Check Results
✅ Performance Check Passed (interpreter)
Performance Benchmark Results (threshold: 25%)
Summary: 194 benchmarks, 0 regressions
✅ Performance Check Passed (multipass)
Performance Benchmark Results (threshold: 25%)
Summary: 194 benchmarks, 0 regressions
Two related changes responding to PR DTVMStack#514 review: 1. `CSRGraph::operator[]`: guard against null `Data.data()` pointer arithmetic. A single-block contract with no edges has empty CSR `Data`, and `Data.data()` is permitted to return `nullptr`. Forming `nullptr + Off[Node]` is undefined per [expr.add]/4 even when the offset is zero, and UBSan flags it. Return an empty `{nullptr, nullptr}` Range early when `Data.empty()`. 2. `computeInCycle` invariant comment: the pre-existing comment claimed that natural-loop union "captures every cycle" and that Tarjan SCC was the soundness backstop on the fallback path. R2 review of this PR established the actual invariant: InCycle is a performance fast path; soundness on irreducible CFGs rests on lemma614Update's `effectivePredCount(Succ) != 1` multi-pred guard. Align the inline comment with the module spec in `docs/modules/evm/cache-build.md` §Invariants, including the future-contributor warning not to remove the multi-pred guard on the assumption that InCycle covers it.
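The early return described in item 1 can be sketched as below. This is only an illustration: the names `Off`, `Data`, `Range`, and `operator[]` come from the commit message, while the struct layout, the `size()` helper, and the `buildCSR` function are assumptions made so the sketch is self-contained, not the code in `src/evm/evm_cache.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the PR's CSRGraph. Only the names Off, Data,
// Range and operator[] come from the commit message; the rest is assumed.
struct CSRGraph {
  struct Range {
    const uint32_t *First;
    const uint32_t *Last;
    const uint32_t *begin() const { return First; }
    const uint32_t *end() const { return Last; }
    std::size_t size() const { return static_cast<std::size_t>(Last - First); }
  };

  std::vector<uint32_t> Off;  // N + 1 offsets into Data
  std::vector<uint32_t> Data; // all neighbour lists, concatenated

  Range operator[](uint32_t Node) const {
    assert(static_cast<std::size_t>(Node) + 1 < Off.size() && "node out of range");
    // An empty vector may report data() == nullptr, so return an empty range
    // before doing any pointer arithmetic on the base pointer.
    if (Data.empty())
      return {nullptr, nullptr};
    const uint32_t *Base = Data.data();
    return {Base + Off[Node], Base + Off[Node + 1]};
  }
};

// Assumed helper: flatten an adjacency list into the CSR layout.
CSRGraph buildCSR(const std::vector<std::vector<uint32_t>> &Adj) {
  CSRGraph G;
  G.Off.push_back(0);
  for (const auto &Row : Adj) {
    G.Data.insert(G.Data.end(), Row.begin(), Row.end());
    G.Off.push_back(static_cast<uint32_t>(G.Data.size()));
  }
  return G;
}
```

With this shape, a range-for over `G[Node]` on a single-block, zero-edge graph iterates zero times without ever touching `Data.data()`.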
The doc previously stated time complexity as `O((N + E) · α(N))`. CHK is not a union-find algorithm and does not provide an inverse-Ackermann bound; the near-linear behaviour is workload-dependent, with worst-case bounded by dominator-tree depth and empirical `chkFixpointRounds = 2` on every measured workload. Reword as `O((N + E) · R)` with `R` defined as the number of fixpoint sweeps and the measured / worst-case bounds spelled out.
…tats `static_jump_stats` previously marked every PUSH-then-JUMP/JUMPI pair as a static target without decoding the pushed value or checking whether it lands on a valid `JUMPDEST` PC. This diverged from the cache builder's `resolveConstantJumpTarget` semantics in `src/evm/evm_cache.cpp`, which both decodes the constant and requires the target byte to be a `JUMPDEST` outside any PUSH-data region. The divergence undercounts dynamic JUMPs whenever a PUSH constant happens to point at a non-JUMPDEST byte, biasing the `dyn_jump_ratio` used for corpus stratification toward "static". Rewrite as a two-pass scan: pass 1 collects valid JUMPDEST PCs (skipping PUSH-data regions); pass 2 decodes each PUSH value and counts the following JUMP/JUMPI as static iff the decoded value is in the JUMPDEST set. End-of-code PUSH truncation is zero-padded on the right to match EVM stack semantics.
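The two-pass scan described above can be sketched as follows. This is a C++ rendering for illustration only; `staticJumpStats` and `JumpStats` are hypothetical names, and the handling of `PUSH0` (treated here as an ordinary non-PUSH opcode) is an assumption that the commit message does not address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint8_t JUMP = 0x56, JUMPI = 0x57, JUMPDEST = 0x5b;
constexpr uint8_t PUSH1 = 0x60, PUSH32 = 0x7f;

inline bool isPush(uint8_t Op) { return Op >= PUSH1 && Op <= PUSH32; }
inline std::size_t pushLen(uint8_t Op) { return std::size_t(Op - PUSH1) + 1; }

struct JumpStats {
  std::size_t StaticJumps = 0; // PUSH-then-JUMP/JUMPI landing on a valid JUMPDEST
  std::size_t TotalJumps = 0;  // every JUMP/JUMPI opcode outside PUSH data
};

JumpStats staticJumpStats(const std::vector<uint8_t> &Code) {
  // Pass 1: mark JUMPDEST bytes that are real opcodes, skipping PUSH data.
  std::vector<bool> IsDest(Code.size(), false);
  for (std::size_t Pc = 0; Pc < Code.size(); ++Pc) {
    if (Code[Pc] == JUMPDEST)
      IsDest[Pc] = true;
    else if (isPush(Code[Pc]))
      Pc += pushLen(Code[Pc]);
  }

  // Pass 2: decode each PUSH constant and classify the jump that follows it.
  JumpStats S;
  bool PrevPushHitsDest = false;
  for (std::size_t Pc = 0; Pc < Code.size(); ++Pc) {
    const uint8_t Op = Code[Pc];
    if (isPush(Op)) {
      uint64_t Value = 0;
      bool Fits = true; // false once the constant no longer fits in 64 bits
      for (std::size_t I = 0; I < pushLen(Op); ++I) {
        // Bytes past the end of code read as zero (right zero-padding).
        const std::size_t At = Pc + 1 + I;
        const uint8_t Byte = At < Code.size() ? Code[At] : 0;
        if (Value >> 56)
          Fits = false;
        Value = (Value << 8) | Byte;
      }
      PrevPushHitsDest = Fits && Value < Code.size() && IsDest[Value];
      Pc += pushLen(Op);
      continue;
    }
    if (Op == JUMP || Op == JUMPI) {
      ++S.TotalJumps;
      if (PrevPushHitsDest)
        ++S.StaticJumps;
    }
    PrevPushHitsDest = false;
  }
  assert(S.StaticJumps <= S.TotalJumps);
  return S;
}
```

A `dyn_jump_ratio` in the sense of the commit message would then presumably be `(TotalJumps - StaticJumps) / TotalJumps`; that formula is an inference, not something the source states.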
Summary
EVM cache-build pipeline overhaul. Two layers of work bundled in one PR
because the second layer depends on instrumentation + algorithmic
foundations introduced in the first:
- Foundation layer: replace the iterative-bitset dominator (`O(N²/64)`) with Cooper-Harvey-Kennedy + Tarjan DFS Enter/Exit for `O(1)` `dominates(A, B)` queries; inline the dominator-tree children adjacency; add the opt-in `ZEN_EVM_CACHE_PROFILE` per-phase chrono instrumentation; add `evmCacheComplexityDemo` bytecode-replay mode + structural dominator GTests; add the `bench_evm_cache.sh` paired harness + `analyze_evm_cache_bench.py` paired-ratio BCa cluster-bootstrap analyzer; add the Sourcify + top-RPC corpus fetchers.
- Fusion layer: fuse phases (`buildGasBlocks` 2-pass → 1-pass, `collectJumpDests` folded in, `buildCFGEdges` single-sweep); flatten `Blocks[].Succs/Preds` into a read-only `CSRGraph` after `splitCriticalEdges` freezes the graph and route every downstream reader through it; share `DomInfo::RPO` with `computeReverseTopo`; pack `GasBlock` 80 B → 32 B by extracting `Succs/Preds` into a parallel `EdgeTables` and reordering fields (`static_assert(sizeof(GasBlock) == 32)`-locked).

Behaviour-level semantics unchanged. `evmone-statetest --vm external_vm -k fork_Cancun` 2723/2723 pass and `build/evmCacheTests` 14/14 pass; both were re-run after every implementation commit, not just at the end. The fusion-layer N=100k headline (-41% / 1.69×) was independently re-measured at 1.67× / -40.2%, reproducing within ±10%. Spec docs:
- `docs/changes/2026-05-16-evm-spp-overhaul/README.md` (foundation)
- `docs/changes/2026-05-17-evm-cache-build-fusion/README.md` (fusion)
- `docs/changes/2026-05-17-evm-cache-build-fusion/perf-summary.md` (3-tier cross-N comparison + per-phase deltas + production-scale pilot + pre-committed gating criteria for the deferred micro-opts)
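To illustrate how DFS Enter/Exit times turn `dominates(A, B)` into an `O(1)` interval-containment check, here is a minimal sketch. It assumes an immediate-dominator array is already available with `Idom[Root] == Root` and every node reachable; `numberDomTree` and the local variable names are hypothetical, while the `ChildStart`/`ChildIdx` CSR layout follows the description in the commit messages.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct DomInfo {
  std::vector<uint32_t> Idom, Enter, Exit;
};

// Number the dominator tree with DFS entry/exit clocks.
// Assumes N >= 1, Idom[Root] == Root, and every node has a valid Idom.
DomInfo numberDomTree(const std::vector<uint32_t> &Idom, uint32_t Root) {
  const uint32_t N = static_cast<uint32_t>(Idom.size());
  assert(N > 0 && Root < N && Idom[Root] == Root);

  // Children adjacency in CSR form: ChildStart[N + 1] + ChildIdx[N - 1].
  std::vector<uint32_t> ChildStart(N + 1, 0), ChildIdx(N - 1);
  for (uint32_t V = 0; V < N; ++V)
    if (V != Root)
      ++ChildStart[Idom[V] + 1];
  for (uint32_t V = 0; V < N; ++V)
    ChildStart[V + 1] += ChildStart[V];
  std::vector<uint32_t> Cursor(ChildStart.begin(), ChildStart.end() - 1);
  for (uint32_t V = 0; V < N; ++V)
    if (V != Root)
      ChildIdx[Cursor[Idom[V]]++] = V;

  // Iterative DFS; Next[V] is the next unvisited child slot of V.
  DomInfo D{Idom, std::vector<uint32_t>(N), std::vector<uint32_t>(N)};
  std::vector<uint32_t> Next(ChildStart.begin(), ChildStart.end() - 1);
  std::vector<uint32_t> Stack;
  uint32_t Clock = 0;
  Stack.push_back(Root);
  D.Enter[Root] = Clock++;
  while (!Stack.empty()) {
    const uint32_t V = Stack.back();
    if (Next[V] < ChildStart[V + 1]) {
      const uint32_t C = ChildIdx[Next[V]++];
      D.Enter[C] = Clock++;
      Stack.push_back(C);
    } else {
      D.Exit[V] = Clock++;
      Stack.pop_back();
    }
  }
  return D;
}

// A dominates B iff B's DFS interval nests inside A's (A == B included).
bool dominates(const DomInfo &D, uint32_t A, uint32_t B) {
  assert(A < D.Enter.size() && B < D.Enter.size());
  return D.Enter[A] <= D.Enter[B] && D.Exit[B] <= D.Exit[A];
}
```

The numbering costs one linear pass over the tree; after that each query is two comparisons, which is what lets back-edge and loop detection avoid per-node bitsets.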
Production-scale pilot (n=10, directional)
Paired wall-clock on 10 mainnet contracts pulled via `eth_getCode` from `https://ethereum.publicnode.com`, stratified by `CodeSize`. 15 reps per binary, point estimates only; the full paired-ratio BCa cluster-bootstrap (the foundation layer's harness applied to a wider Sourcify corpus) remains a post-merge follow-up.
The speedup column in the pilot table is the median of per-contract speedups, not the ratio of the displayed `Median baseline` / `Median HEAD` columns (each of which is itself a median across the stratum's contracts).
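The distinction between the two statistics can be made concrete with a small sketch. The helper names are hypothetical, and the numbers in the usage note are made up purely to show that the two quantities can differ; they are not measurements from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sample (takes a copy so the caller's order is kept).
double median(std::vector<double> V) {
  assert(!V.empty());
  std::sort(V.begin(), V.end());
  const std::size_t Mid = V.size() / 2;
  return V.size() % 2 ? V[Mid] : 0.5 * (V[Mid - 1] + V[Mid]);
}

// Median of per-contract speedups: pair each contract with itself first.
double medianOfSpeedups(const std::vector<double> &Base,
                        const std::vector<double> &Head) {
  assert(Base.size() == Head.size());
  std::vector<double> Speedups;
  for (std::size_t I = 0; I < Base.size(); ++I)
    Speedups.push_back(Base[I] / Head[I]);
  return median(Speedups);
}

// Ratio of the two column medians: the pairing between contracts is lost.
double ratioOfMedians(const std::vector<double> &Base,
                      const std::vector<double> &Head) {
  return median(Base) / median(Head);
}
```

For illustrative timings `Base = {10, 20, 90}` and `Head = {5, 16, 30}`, the per-contract speedups are 2.0, 1.25 and 3.0, so the median speedup is 2.0, while the ratio of the column medians is 20 / 16 = 1.25.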
9 / 10 contracts faster on HEAD. DAI (-21.5%, 7.9 KB) is the one outlier and is logged for follow-up; it is not a ship blocker, but warrants investigation under a wider corpus. Per-contract rows live in `perf-summary.md`.

Caveats: selection-biased toward high-traffic mainnet contracts (USDT/USDC/Uniswap/WETH cluster); n=10 is too thin to support any confidence-interval claim; this is a directional sanity check, not production-grade methodology.
Synthetic stress (algorithmic-DoS regime, not production scale)
EIP-170 caps real contract bytecode at 24 576 bytes, so the deployable
upper bound is N ≲ 8000 blocks (at worst-case ~3 B / block packing).
`evmCacheComplexityDemo` at N=100 000 is outside what a deployed contract can produce and ships only as an algorithmic-DoS regression guard. With that caveat front-loaded:
Of the 33× at N=100 000, the foundation layer contributes 18.6× (the
iterative-bitset dominator was the dominant cost at upstream/main) and
the fusion layer contributes the remaining 1.78× on top. Real deployed
contracts fall in the N=100-2000 band where the proportional gain
compresses substantially — see the production-scale pilot above for
an empirical anchor.
The pipeline goes from super-linear `2× N → 4× time` (`upstream/main`, matching the `O(N²/64)` bitset dataflow) to fully linear `2× N → 2.0× time` on HEAD.
What's in this PR
28 commits = 17 implementation/test/tooling + 11 docs/review. Diff:
45 files, +6347 / -244. Source footprint: 11 files / +1896 / -244
under `src/`, `tests/corpus/evm-cache/`, and `tools/`; the rest is docs.
Foundation layer (commits `48fada6..592fd35`, 11 commits):
- `48fada6` replace iterative-bitset dominator with CHK
- `1be3f39` inline dom-tree children adjacency
- `62ef503` opt-in `ZEN_EVM_CACHE_PROFILE` per-phase instrumentation
- `3c659f6` bytecode-replay demo mode + structural dominator GTests
- `9df8ee8` `bench_evm_cache.sh` + `analyze_evm_cache_bench.py` (paired-ratio BCa cluster-bootstrap; Efron-Tibshirani §14.3)
- `a75ab11` Sourcify + top-RPC corpus fetchers
- `92c6c04`, `04d0a55`, `b00efa1`, `8a95175`, `592fd35` change doc + review fixes

Fusion layer (commits `e06d291..911f8c1`, 17 commits):
- `e06d291` `buildGasBlocks` 2-pass → 1-pass; `3bba649` `collectJumpDests` fold.
- `0dd5bb9` `CSRGraph` flatten after `splitCriticalEdges`; `4d74033` `chkFixpointRounds` diagnostic counter; `6e1bc6b` conditional `InCycle` on reducible CFGs (skips Tarjan SCC when reducible).
- `de934a8` `buildCFGEdges` single-sweep; `118c993` `computeReverseTopo` reads `DomInfo::RPO`.
- `GasBlock` compaction: `55a250b` `Blocks.reserve(CodeSize)` + `emplace_back`; `689e5d5` `Succs/Preds` extracted into parallel `EdgeTables` (GasBlock 80 → 40 B); `f7630d8` field reorder packs to exact 32 B (`static_assert`-locked).
- `77e0454` `clang-format` sweep (no semantic change).
- `4f9f5be`, `c5db655`, `de507df`, `ab74da5`, `99a666c`, `911f8c1` change doc + module spec + production-scale pilot + review fixes.
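The `GasBlock` compaction steps (moving edge lists into a parallel table, then reordering fields and locking the size with `static_assert`) can be illustrated with a sketch. Every field name below is hypothetical; the PR text only gives the sizes and the `EdgeTables` / `static_assert` mechanism, so this shows the technique rather than the real layout.

```cpp
#include <cstdint>
#include <vector>

// All field names here are hypothetical; the PR states only the sizes
// (80 B -> 40 B -> 32 B) and that Succs/Preds moved to a parallel table.

// Edge lists live outside the block record, indexed by block id, so the
// per-block struct no longer carries two std::vector headers.
struct EdgeTables {
  std::vector<std::vector<uint32_t>> Succs;
  std::vector<std::vector<uint32_t>> Preds;
};

// Careless ordering: narrow fields between wide ones force padding bytes.
struct LooseBlock {
  uint8_t Flags;
  uint64_t BaseGas;
  uint32_t StartPc;
  uint64_t ExtraGas;
  uint16_t StackDelta;
  uint32_t EndPc;
  uint8_t Kind;
  uint32_t EdgeSlot;
};

// Widest-first ordering: 8 + 8 + 4 + 4 + 4 + 2 + 1 + 1 = 32 bytes, no padding.
struct PackedBlock {
  uint64_t BaseGas;
  uint64_t ExtraGas;
  uint32_t StartPc;
  uint32_t EndPc;
  uint32_t EdgeSlot; // index into EdgeTables::Succs / EdgeTables::Preds
  uint16_t StackDelta;
  uint8_t Flags;
  uint8_t Kind;
};

// The size lock: adding or reordering a field that grows the struct becomes
// a compile error instead of a silent cache-footprint regression.
static_assert(sizeof(PackedBlock) == 32, "PackedBlock must stay at 32 bytes");
static_assert(sizeof(LooseBlock) > sizeof(PackedBlock),
              "reordering is expected to remove padding");
```

At 32 bytes, two records fit in a typical 64-byte cache line, which is presumably the motivation for locking the size rather than merely reducing it.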
Safety invariant on irreducible CFGs is preserved by `lemma614Update`'s `effectivePredCount` multi-pred guard (`evm_cache.cpp:1224`), not by the conditional `InCycle` fast-path. A future-contributor warning in `docs/modules/evm/cache-build.md` §Invariants explicitly states the multi-pred guard must NOT be removed on the assumption that `InCycle` covers it. Counterexample (irreducible 2-entry cycle `A ↔ B`) included in that warning.
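Why a 2-entry cycle `A ↔ B` escapes dominance-based loop detection can be seen with a compact Cooper-Harvey-Kennedy sketch. This is an illustration under assumptions, not the PR's `computeDomInfo`: node 0 is the entry, every node is reachable, `dominates` walks the idom chain instead of using Enter/Exit, and all function names are hypothetical. The `Rounds` field loosely mirrors the idea of the `chkFixpointRounds` counter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kUndef = UINT32_MAX;
using Graph = std::vector<std::vector<uint32_t>>; // successor lists, entry = 0

struct ChkResult {
  std::vector<uint32_t> Idom;
  unsigned Rounds; // number of fixpoint sweeps, including the final no-op one
};

ChkResult computeIdomCHK(const Graph &Succs) {
  const uint32_t N = static_cast<uint32_t>(Succs.size());
  assert(N > 0);
  Graph Preds(N);
  for (uint32_t U = 0; U < N; ++U)
    for (uint32_t V : Succs[U])
      Preds[V].push_back(U);

  // Iterative DFS producing a postorder and per-node postorder numbers.
  std::vector<uint32_t> PostNum(N, kUndef), Order, Next(N, 0), Stack;
  std::vector<bool> Seen(N, false);
  Stack.push_back(0);
  Seen[0] = true;
  while (!Stack.empty()) {
    const uint32_t V = Stack.back();
    if (Next[V] < Succs[V].size()) {
      const uint32_t S = Succs[V][Next[V]++];
      if (!Seen[S]) {
        Seen[S] = true;
        Stack.push_back(S);
      }
    } else {
      PostNum[V] = static_cast<uint32_t>(Order.size());
      Order.push_back(V);
      Stack.pop_back();
    }
  }

  std::vector<uint32_t> Idom(N, kUndef);
  Idom[0] = 0;
  auto Intersect = [&](uint32_t A, uint32_t B) {
    while (A != B) {
      while (PostNum[A] < PostNum[B]) A = Idom[A];
      while (PostNum[B] < PostNum[A]) B = Idom[B];
    }
    return A;
  };

  unsigned Rounds = 0;
  for (bool Changed = true; Changed;) {
    Changed = false;
    ++Rounds;
    for (auto It = Order.rbegin(); It != Order.rend(); ++It) { // reverse postorder
      const uint32_t V = *It;
      if (V == 0)
        continue;
      uint32_t New = kUndef;
      for (uint32_t P : Preds[V])
        if (Idom[P] != kUndef)
          New = (New == kUndef) ? P : Intersect(P, New);
      if (New != Idom[V]) {
        Idom[V] = New;
        Changed = true;
      }
    }
  }
  return {Idom, Rounds};
}

// Idom-chain walk; fine for a demo, the PR uses O(1) Enter/Exit instead.
bool dominates(const std::vector<uint32_t> &Idom, uint32_t A, uint32_t B) {
  while (B != A && B != Idom[B])
    B = Idom[B];
  return B == A;
}

// An edge U -> V is a dominance back edge iff V dominates U.
unsigned countDominanceBackEdges(const Graph &Succs,
                                 const std::vector<uint32_t> &Idom) {
  unsigned Count = 0;
  for (uint32_t U = 0; U < Succs.size(); ++U)
    for (uint32_t V : Succs[U])
      Count += dominates(Idom, V, U) ? 1u : 0u;
  return Count;
}
```

On the graph `{{1, 2}, {2}, {1}}` (entry 0 branches to both A = 1 and B = 2, which jump to each other), both A and B get the entry as immediate dominator, so neither dominates the other and the back-edge count is zero even though a cycle exists. On the reducible variant `{{1}, {2}, {1}}` the edge 2 → 1 is found as a back edge. This is the gap that a natural-loop union cannot cover on its own.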
Test plan
- `tools/format.sh check` clean
- `cmake --build build --target dtvmapi -j$(nproc)` succeeds with no new warnings (use `CCACHE_DISABLE=1` if ccache mount is read-only)
- `build/evmCacheTests`: 14 / 14 pass (10 dominator + 4 implicit-dyn-pred)
- `evmone-statetest --vm external_vm -k fork_Cancun`: 2723 / 2723 pass (~80 s)
- `evmCacheComplexityDemo` at N=10k/20k/50k/100k: monotone improvement vs `upstream/main`
- Production-scale pilot: 9 / 10 contracts faster than `upstream/main`; DAI flagged for follow-up
- Fusion-layer N=100k headline re-measured at 1.67× / -40.2% (within ±10%)
Out of scope / future work
- (evmone-bench) of JUMPs already statically resolved by the existing PUSH→JUMP heuristic; expected runtime gain < 1% against 500+ LoC of SSA construction.
- 2 rounds on every measured workload (logged via `chkFixpointRounds` diagnostic); SemiNCA's second-sweep saving (~1.5 ms) is comparable to its own DSU bookkeeping cost.
- Deferred micro-opts (`computeReachable` fold / `buildCFGEdges` dedup-skip / `buildCSR` prefetch / `GasBlock` hot/cold split): gated on the production-scale validation follow-up. Pre-committed thresholds in `perf-summary.md` §Future-work: GO requires (i) production N ≲ 8000 paired median ≥ +5% AND p95 reduction ≥ 0.2 ms, (ii) end-to-end evmone-bench median ≥ +1% / p95 ≥ +3%, (iii) N=2000 paired ≥ 50% of N=100k paired, (iv) first-touch p95 reduction ≥ +5%. KILL if any clause fails → pivot to runtime / JIT / host-call hotspots.
- `UseLinearSPP=false` dedicated GTest: deferred to a follow-up PR. Current irreducible-fallback path correctness rests on the multi-pred guard argument + `evmone-statetest` end-to-end. See `docs/changes/2026-05-17-evm-cache-build-fusion/README.md`.

🤖 Generated with Claude Code