blocks & script & serialization perf#171

Draft
l0rinc wants to merge 23 commits into master from detached529
@l0rinc l0rinc commented May 13, 2026

High-Level Impact

| Affected Area | Where It Shows Up | Practical Effect |
| --- | --- | --- |
| Block / transaction deserialization | Receiving blocks/txs from peers, ReadBlockBench, DeserializeBlockTest, reindex/import paths | Most meaningful stack impact. Faster CompactSize, VarInt, prevector, byte-vector, and DataStream reads reduce CPU in the path that turns wire/disk bytes into block/tx objects. Several block-level benches show small but real wins, with focused script/prevector benches much larger. |
| Script and witness byte containers | scriptSig, scriptPubKey, witnesses, byte vectors inside serialized tx/block data | Avoids unnecessary one-iteration chunk loops and redundant size queries. This matters because block deserialization touches many small byte containers. |
| Block / transaction serialization | Sending blocks/txs, writing blocks, WriteBlockBench, serialization to DataStream | Multibyte CompactSize/VarInt writes are batched, and common small encodings are hinted/inlined. This mainly reduces per-field overhead when constructing serialized messages or block data. |
| Size-only serialization | GetSerializeSize, SizeComputer, fee/package/relay logic that only needs encoded size | Bigger focused wins: CompactSize sizing around -16.6%, VarInt sizing around -24.1% in synthetic payloads. Actual app impact depends on how often the caller computes serialized size instead of serializing bytes. |
| Generic map/set deserialization | Any serialized std::map / std::set payload | Moves freshly deserialized entries into containers instead of copying. Focused bench showed roughly -21% map and -19% set deserialize time for large payload entries. This is clean and low-risk, but its high-level impact depends on workload because blocks/txs mostly stress vectors/prevectors more than maps/sets. |
| Mempool / package validation failures | Transaction/package rejection paths, MempoolAcceptResult, PackageMempoolAcceptResult | Avoids copying validation state objects into result wrappers. This helps failure-heavy tx/package validation workloads and is mostly a cleanup win; normal accepted transaction flow will barely notice it. |
| Chain activation | ActivateBestChain, block connection/activation calls | Avoids a shared_ptr refcount bump when passing the optional already-owned block. Tiny per-call improvement in a hot validation function. |
| P2P connection management | CNode, CConnman, listen sockets, manual/outbound connection thread | Removes small ownership/callback/vector copies during peer setup and lookup. Useful for connection churn and tests, but not expected to move steady-state block/tx relay benchmarks much. |

The biggest real node-level effect is in byte deserialization: blocks and transactions are dense with compact sizes, varints, scripts, witnesses, vectors, and prevectors. That means IBD, reindex/import, block relay ingestion, and block-file reads are the places most likely to benefit.

The validation and p2p commits are intentionally smaller. They clean up avoidable copies in important areas, but they are not expected to produce large end-to-end benchmark deltas unless the workload is failure-heavy validation or connection-churn heavy.


| Commit | Optimization | Measured result |
| --- | --- | --- |
| 0a44f00 | Move deserialized map/set temporaries into hinted inserts | Map -21.30%, set -18.65% in focused assoc deserialize bench |
| 403c01b | Simplify DataStream::read/ignore bounds checks | Read -10.61%, ignore -5.80% focused stream bench |
| c6a43df | Return early for single-byte CompactSize reads | DeserializeBlockTest -0.10%; kept mainly as common-path cleanup |
| 80f52b9 | Skip prevector chunk loop for single reads | Script prevector benches -12% to -18%; production DeserializeBlockTest -1.84% |
| 15ffbdd | Skip byte std::vector chunk loop for single reads | DeserializeBlockTest -0.47%; instruction counts down |
| 47ef2b0 | Reuse decoded size for prevector read span | Focused rows up to -7.65% |
| 2003a8d | Reuse decoded size for byte vector read span | Larger focused rows -1.01% to -2.07% |
| ab03801 | Write multibyte CompactSize in one chunk | WriteBlockBench -2.88%, ReadBlockBench -0.97%, DeserializeBlockTest -0.69% |
| bbe8ee5 | Batch multibyte VarInt writes | Focused -0.61%; WriteBlockBench -4.16% |
| c4e35ca | Hint common one-byte compact/varint encodings | DeserializeBlockTest -0.57%, ReadBlockBench -0.55%, WriteBlockBench -0.97% |
| d576178 | Hint byte-container read fast paths | Timings ~flat; instruction counts down materially |
| 1df1075 | Hint non-empty byte-container writes | Prevector serialize -1.51%, vector serialize -0.89% |
| 5996b44 | Simplify GetSizeOfVarInt loop | SizeComputer VarInt payload -0.20% |
| 2d41ffc | Shortcut CompactSize sizing through SizeComputer wrappers | SizeComputer Compact payload -16.57% |
| 1cd25da | Shortcut VarInt sizing through SizeComputer wrappers | SizeComputer VarInt payload -24.09% |
| 81a904b | Hint single-byte VarInt reads | ReadVarIntMixed -7.12% |
| c30142e | Inline primitive byte reads | ReadCompactSizeMixed -6.69%; VarInt flat |
| 60eccb8 | Inline ReadVarInt | Focused ReadVarIntMixed -55.81%; later rerun still roughly 2x vs no-inline |
| 8206fb0 | Inline ReadCompactSize | Focused ReadCompactSizeMixed -53.90%; later rerun still roughly 2x vs no-inline |
| d8395be | Inline WriteCompactSize | WriteCompactSizeMixed -0.87%; code-size tradeoff |

l0rinc added 23 commits May 13, 2026 09:07
Move the freshly deserialized std::map pair and std::set key into the hinted insert calls instead of copying them. The values are temporary and unique at that point, so this is a two-line ownership cleanup.

Measured with the focused associative deserialize benchmark using 1024 entries carrying 256-byte vector payloads: map deserialize 211246 -> 166255 ns/op (-21.30%); set deserialize 213492 -> 173678 ns/op (-18.65%).

To measure, build the temporary serialize-associative benchmark from doc/compiler-optimization-investigation.md and run alternating pinned clean/patched binaries, or sanity-check the production path with build-gcc-base/bin/test_bitcoin --run_test=serialize_tests,streams_tests,transaction_tests.
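
As an illustration of the ownership cleanup described in this commit message, here is a minimal sketch of a hinted map insert that moves the freshly read pair instead of copying it. It is not the code from serialize.h: `EntrySourceSketch`, `ReadMapSketch`, and `Payload` are hypothetical names that stand in for the stream and the generic deserializer.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Payload = std::vector<std::uint8_t>;

// Hypothetical stand-in for "read one key/value pair from the stream".
// Keys come out in ascending order, as they would from a serialized std::map.
struct EntrySourceSketch {
    std::uint32_t next_key{0};

    std::pair<std::uint32_t, Payload> Next()
    {
        const std::uint32_t key{next_key++};
        return {key, Payload(256, static_cast<std::uint8_t>(key))};
    }
};

std::map<std::uint32_t, Payload> ReadMapSketch(EntrySourceSketch& src, std::size_t count)
{
    std::map<std::uint32_t, Payload> result;
    auto hint = result.begin();
    for (std::size_t i = 0; i < count; ++i) {
        std::pair<std::uint32_t, Payload> item{src.Next()};
        // The temporary is not used again after this point, so it is moved
        // into the hinted insert rather than copied (payload included).
        hint = result.insert(hint, std::move(item));
    }
    return result;
}
```

With 256-byte payloads, as in the focused benchmark, each avoided copy is a heap allocation plus a 256-byte copy, which is consistent with the measured gain being visible only for large entries.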
Use DataStream::size() as the available byte count for read() and ignore(), then advance m_read_pos after the bounds check. This removes CheckedAdd from the hot path and drops the util/overflow.h include from streams.h.

Focused stream benchmark: DataStream::read 8x32 bytes 94.152 -> 84.161 ns/op (-10.61%); DataStream::ignore 8x32 bytes 89.316 -> 84.135 ns/op (-5.80%).

To measure, build the temporary DataStream read/ignore benchmark from doc/compiler-optimization-investigation.md and compare alternating pinned clean/patched runs. Production sanity: build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench)$' -sanity-check.
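
The bounds-check shape described in this commit message can be sketched as follows. This is a simplified stand-in, not DataStream itself: `MiniStreamSketch` is a hypothetical class, and it omits details of the real stream such as buffer compaction.

```cpp
#include <cstddef>
#include <cstring>
#include <ios>
#include <utility>
#include <vector>

class MiniStreamSketch
{
    std::vector<unsigned char> m_data;
    std::size_t m_read_pos{0};

public:
    explicit MiniStreamSketch(std::vector<unsigned char> data) : m_data(std::move(data)) {}

    // Unread byte count. m_read_pos never exceeds m_data.size(), so the
    // subtraction cannot underflow.
    std::size_t size() const { return m_data.size() - m_read_pos; }

    void read(unsigned char* dst, std::size_t n)
    {
        if (n == 0) return;
        // Comparing against the remaining size needs no overflow-checked
        // addition: n is only added to m_read_pos after it is known to fit.
        if (n > size()) throw std::ios_base::failure("MiniStreamSketch::read(): end of data");
        std::memcpy(dst, m_data.data() + m_read_pos, n);
        m_read_pos += n;
    }

    void ignore(std::size_t n)
    {
        if (n > size()) throw std::ios_base::failure("MiniStreamSketch::ignore(): end of data");
        m_read_pos += n;
    }
};
```

Because `n <= size()` implies `m_read_pos + n <= m_data.size()`, the sum cannot wrap, which is the property the removed CheckedAdd call used to guard.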
Return CompactSize values below 253 immediately after the first byte. Those values are canonical and cannot exceed MAX_SIZE, so the common path no longer falls through to the multibyte range-check branch.

Alternating pinned block benchmarks against the retained serialization baseline measured DeserializeBlockTest 2672.046 -> 2669.416 us/block (-0.10%); ReadBlockBench was noise at 3912.891 -> 3929.036 us/op (+0.41%). Kept because the common path is simpler and removes an unnecessary branch.

To measure, run taskset -c 3 build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench)$' -min-time=500 for alternating clean/patched binaries and compare nanobench medians.
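
A minimal sketch of the early-return shape, assuming the usual CompactSize layout (one byte below 253, otherwise a 253/254/255 marker followed by a 2/4/8-byte little-endian payload). `ByteReaderSketch`, `ReadCompactSizeSketch`, and `MAX_SIZE_SKETCH` are hypothetical names; the limit value is an assumption mirroring Bitcoin Core's MAX_SIZE.

```cpp
#include <cstddef>
#include <cstdint>
#include <ios>
#include <vector>

constexpr std::uint64_t MAX_SIZE_SKETCH{0x02000000}; // assumed limit, see lead-in

// Hypothetical reader over an in-memory buffer.
struct ByteReaderSketch {
    const std::vector<unsigned char>& data;
    std::size_t pos{0};

    std::uint8_t Read8()
    {
        if (pos >= data.size()) throw std::ios_base::failure("end of data");
        return data[pos++];
    }

    std::uint64_t ReadLE(int bytes)
    {
        std::uint64_t value{0};
        for (int i = 0; i < bytes; ++i) value |= static_cast<std::uint64_t>(Read8()) << (8 * i);
        return value;
    }
};

std::uint64_t ReadCompactSizeSketch(ByteReaderSketch& r, bool range_check = true)
{
    const std::uint8_t first{r.Read8()};
    // Common path: a single byte below 253 is always canonical and always
    // below the size limit, so neither check is needed.
    if (first < 253) return first;

    std::uint64_t value;
    if (first == 253) {
        value = r.ReadLE(2);
        if (value < 253) throw std::ios_base::failure("non-canonical ReadCompactSize()");
    } else if (first == 254) {
        value = r.ReadLE(4);
        if (value < 0x10000) throw std::ios_base::failure("non-canonical ReadCompactSize()");
    } else {
        value = r.ReadLE(8);
        if (value < 0x100000000ULL) throw std::ios_base::failure("non-canonical ReadCompactSize()");
    }
    if (range_check && value > MAX_SIZE_SKETCH) throw std::ios_base::failure("ReadCompactSize(): size too large");
    return value;
}
```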
Deserialize byte prevectors with a direct resize-and-read path when the encoded size fits within the existing 5 MiB per-read allocation cap. Block 413567 script prevectors all fall under that cap, so the old loop usually executed one unnecessary iteration.

Focused script benchmarks measured inline scriptPubKey prevectors 15.630 -> 12.756 ns/script (-18.38%), heap scriptSig 76..252 bytes 39.861 -> 32.746 ns/script (-17.85%), heap scriptSig 253..254 bytes 42.139 -> 36.930 ns/script (-12.36%), and mutable block deserialization 832.899 -> 814.122 us/block (-2.25%).

Production check: taskset -c 3 build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench)$' -min-time=500 on alternating clean/patched binaries measured DeserializeBlockTest 2757.548 -> 2706.907 us/block (-1.84%).
Mirror the retained prevector fast path for byte std::vector deserialization: empty vectors return immediately, vectors below the allocation cap resize once and read once, and the existing chunk loop remains for oversized encoded sizes.

Alternating pinned block benchmarks measured DeserializeBlockTest 2681.178 -> 2668.535 us/block (-0.47%). ReadBlockBench was effectively noise at 3873.898 -> 3880.407 us/op (+0.17%), while median instructions still dropped slightly in both workloads.

To measure, run taskset -c 3 build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench)$' -min-time=500 on alternating clean/patched binaries and compare nanobench medians plus instruction counters where available.
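
The three cases described for the byte-container read path (empty, single read below the cap, chunk loop above it) can be sketched like this. `ByteSourceSketch` and `ReadByteVectorSketch` are hypothetical, and the `max_chunk` parameter with its 5,000,000-byte default is an assumption added so the chunked path can be exercised with small inputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <ios>
#include <vector>

// Hypothetical in-memory source that counts how many reads were issued.
struct ByteSourceSketch {
    std::vector<unsigned char> data;
    std::size_t pos{0};
    std::size_t read_calls{0};

    void read(unsigned char* dst, std::size_t n)
    {
        if (n > data.size() - pos) throw std::ios_base::failure("end of data");
        std::memcpy(dst, data.data() + pos, n);
        pos += n;
        ++read_calls;
    }
};

void ReadByteVectorSketch(ByteSourceSketch& src, std::vector<unsigned char>& v,
                          std::size_t encoded_size, std::size_t max_chunk = 5'000'000)
{
    v.clear();
    if (encoded_size == 0) return; // empty container: nothing to read

    if (encoded_size <= max_chunk) {
        // Fast path: one resize and one read, using the decoded size directly.
        v.resize(encoded_size);
        src.read(v.data(), encoded_size);
        return;
    }

    // Oversized claim: grow chunk by chunk so a bogus size cannot force one
    // huge allocation before the stream runs out of data.
    std::size_t done{0};
    while (done < encoded_size) {
        const std::size_t blk{std::min(encoded_size - done, max_chunk)};
        v.resize(done + blk);
        src.read(v.data() + done, blk);
        done += blk;
    }
}
```

Reading 10 bytes with `max_chunk` set to 4 takes three reads, while the default cap takes one; the resulting vector is the same in both cases.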
After resize_uninitialized(nSize), the byte prevector read path already knows the exact element count. Reuse nSize for the read span instead of asking the container for the size that was just assigned.

Focused prevector<36, uint8_t> benchmark with a fresh SpanReader each iteration measured representative rows: 25 bytes 10.5345 -> 10.0606 ns/op (-4.50%), 35 bytes 11.0295 -> 10.1861 ns/op (-7.65%), 107 bytes 28.6133 -> 27.7249 ns/op (-3.11%).

To measure, build the temporary prevector-size benchmark from doc/compiler-optimization-investigation.md and run alternating pinned clean/patched binaries.
After resize(nSize), the byte std::vector read path can use the decoded nSize directly for the span length. This is a one-line cleanup matching the prevector path.

Focused std::vector<uint8_t> benchmark with a fresh SpanReader each iteration had mixed tiny-size noise but improved larger rows: 107 bytes 38.6092 -> 37.8101 ns/op (-2.07%) and 254 bytes 40.9207 -> 40.5055 ns/op (-1.01%).

To measure, build the temporary vector-size benchmark from doc/compiler-optimization-investigation.md and run alternating pinned clean/patched binaries.
Stage the CompactSize marker and little-endian payload in a small stack buffer and write them with one stream call. The one-byte CompactSize path is unchanged.

Alternating pinned block benchmarks against the prevector baseline measured WriteBlockBench 1370.763 -> 1331.327 us/op (-2.88%), ReadBlockBench 3993.744 -> 3954.876 us/op (-0.97%), and DeserializeBlockTest 2694.041 -> 2675.437 us/block (-0.69%).

To measure, run taskset -c 3 build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench|WriteBlockBench|ReadRawBlockBench)$' -min-time=500 on alternating clean/patched binaries and compare nanobench medians and instruction counts.
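
As a rough sketch of the staging idea, assuming the standard CompactSize layout, the marker and payload can be assembled in a 9-byte stack buffer and handed to the stream in one call. `ByteSinkSketch` and `WriteCompactSizeSketch` are hypothetical names, not the serialize.h functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sink that records output bytes and the number of write calls.
struct ByteSinkSketch {
    std::vector<unsigned char> out;
    std::size_t write_calls{0};

    void write(const unsigned char* p, std::size_t n)
    {
        out.insert(out.end(), p, p + n);
        ++write_calls;
    }
};

void WriteCompactSizeSketch(ByteSinkSketch& sink, std::uint64_t n)
{
    if (n < 253) {
        // One-byte path, unchanged: the value is its own encoding.
        const unsigned char single{static_cast<unsigned char>(n)};
        sink.write(&single, 1);
        return;
    }

    unsigned char buf[9]; // marker + up to 8 payload bytes
    std::size_t payload_len;
    if (n <= 0xFFFF) {
        buf[0] = 253;
        payload_len = 2;
    } else if (n <= 0xFFFFFFFF) {
        buf[0] = 254;
        payload_len = 4;
    } else {
        buf[0] = 255;
        payload_len = 8;
    }
    // Little-endian payload, staged directly behind the marker.
    for (std::size_t i = 0; i < payload_len; ++i) {
        buf[1 + i] = static_cast<unsigned char>(n >> (8 * i));
    }
    sink.write(buf, 1 + payload_len); // single stream call for marker and payload
}
```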
Keep the direct one-byte VarInt path, but generate multibyte encodings in output order and write the resulting span once instead of emitting each byte separately.

Focused mixed VarInt write benchmark measured 6918.08 -> 6875.66 ns/op (-0.61%). Alternating pinned block benchmarks measured WriteBlockBench 1396.129 -> 1338.126 us/op (-4.16%); DeserializeBlockTest and linearization workloads were neutral/noisy.

To measure, build the temporary mixed VarInt write benchmark from doc/compiler-optimization-investigation.md, and run taskset -c 3 build-gcc-base/bin/bench_bitcoin -filter='^(WriteBlockBench|DeserializeBlockTest|LinearizeOptimallyTotal|LinearizeOptimallyPerCost)$' -min-time=500 on alternating clean/patched binaries.
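
One way to produce a multibyte VarInt in output order is to compute the length first and fill the buffer from the back, so the finished span can be written once. The sketch below assumes Bitcoin Core's VarInt scheme (7 bits per byte, most significant group first, continuation bit on every byte but the last, with a minus-one offset per continuation) and uses hypothetical names throughout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sink that records output bytes and the number of write calls.
struct ByteSinkSketch {
    std::vector<unsigned char> out;
    std::size_t write_calls{0};

    void write(const unsigned char* p, std::size_t n)
    {
        out.insert(out.end(), p, p + n);
        ++write_calls;
    }
};

void WriteVarIntSketch(ByteSinkSketch& sink, std::uint64_t n)
{
    if (n <= 0x7F) {
        // Direct one-byte path.
        const unsigned char single{static_cast<unsigned char>(n)};
        sink.write(&single, 1);
        return;
    }

    // Count encoded bytes first so the buffer can be filled back to front.
    std::size_t len{1};
    for (std::uint64_t rest{n}; rest > 0x7F; rest = (rest >> 7) - 1) ++len;

    unsigned char buf[10]; // (64 + 6) / 7 bytes is the maximum for 64-bit values
    std::size_t i{len};
    buf[--i] = static_cast<unsigned char>(n & 0x7F); // last byte: no continuation bit
    while (n > 0x7F) {
        n = (n >> 7) - 1;
        buf[--i] = static_cast<unsigned char>((n & 0x7F) | 0x80);
    }
    sink.write(buf, len); // one write for the whole encoding
}
```

For example, 128 encodes as `0x80 0x00` and 65535 as `0x82 0xFE 0x7F`, each emitted with a single `write` call.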
Mark the dominant one-byte CompactSize and VarInt write/read cases as likely. These are code-layout hints only; the serialization format and branch structure stay unchanged.

Alternating pinned block benchmarks measured DeserializeBlockTest 2678.273 -> 2662.998 us/block (-0.57%), ReadBlockBench 3910.368 -> 3888.997 us/op (-0.55%), and WriteBlockBench 1384.836 -> 1371.392 us/op (-0.97%). WriteBlockBench instructions dropped from 3,868,269 to 3,852,752.

To measure, run taskset -c 3 build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench|WriteBlockBench)$' -min-time=500 on alternating clean/patched binaries; use perf counters where available to confirm instruction movement.
Mark the retained byte prevector/std::vector single-read path as likely and the empty-container return as unlikely. This documents the expected shape after the explicit fast path split.

A longer alternating run measured effectively flat timings with lower instruction counts: DeserializeBlockTest 2674.190 -> 2670.351 us/block (-0.14%) and ReadBlockBench 3886.841 -> 3883.063 us/op (-0.10%). Instructions dropped from 14,193,155 to 14,129,904 for DeserializeBlockTest and from 15,663,869 to 15,404,443 for ReadBlockBench.

To measure, run taskset -c 3 build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench)$' -min-time=500 on alternating clean/patched binaries, and capture perf instruction counters when available.
Mirror the read-side byte container hints by marking non-empty byte prevector and std::vector serialization as likely. Empty containers still only write their CompactSize length.

Focused 107-byte DataStream serialization benchmark measured PrevectorSerialize107 139.386 -> 137.280 ns/op (-1.51%) and VectorSerialize107 137.110 -> 135.889 ns/op (-0.89%).

To measure, build the temporary byte-container serialization benchmark from doc/compiler-optimization-investigation.md and run alternating pinned clean/patched binaries.
Initialize the VarInt byte count to one and loop while another encoded byte is needed. This removes the while(true) plus internal break shape from GetSizeOfVarInt without changing the encoding calculation.

Focused SizeComputer VarInt payload benchmark measured 6486.74 -> 6473.86 ns/op (-0.20%). This is intentionally a tiny cleanup retained because it keeps the now-hot sizing helper simpler.

To measure, build the temporary SizeComputer VarInt payload benchmark from doc/compiler-optimization-investigation.md and run alternating pinned clean/patched binaries.
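
The two loop shapes can be compared side by side. This sketch assumes the VarInt size rule used by Bitcoin Core (one byte for values up to 0x7F, then `n = (n >> 7) - 1` per extra byte); both function names are hypothetical.

```cpp
#include <cstdint>

// Old shape: count inside an infinite loop and break out of the middle.
inline unsigned int VarIntSizeOldShapeSketch(std::uint64_t n)
{
    unsigned int size{0};
    while (true) {
        ++size;
        if (n <= 0x7F) break;
        n = (n >> 7) - 1;
    }
    return size;
}

// New shape: start at one byte and loop only while another byte is needed.
inline unsigned int VarIntSizeNewShapeSketch(std::uint64_t n)
{
    unsigned int size{1};
    while (n > 0x7F) {
        n = (n >> 7) - 1;
        ++size;
    }
    return size;
}
```

Both versions apply the same step the same number of times, so they return identical sizes; only the control flow differs.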
Let streams that wrap SizeComputer expose the underlying SizeComputer, so generic WriteCompactSize can seek by encoded size instead of staging bytes that will never be written.

Focused GetSerializeSize(TX_WITH_WITNESS(payload)) benchmark on 1024 byte vectors of 254 bytes measured SizeComputerCompactPayload 2147.71 -> 1791.77 ns/op (-16.57%).

To measure, build the temporary SizeComputer CompactSize payload benchmark from doc/compiler-optimization-investigation.md. Validate with build-gcc-base/bin/test_bitcoin --run_test=serialize_tests,transaction_tests,blockmanager_tests and build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench|WriteBlockBench)$' -sanity-check.
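
The wrapper shortcut can be sketched with a small overload set that unwraps a stream until it either finds a size-only counter or gives up. Everything here is hypothetical (`SizeCounterSketch`, `WrapperSketch`, `FindSizeCounterSketch`); the actual mechanism by which ParamsStream exposes the underlying SizeComputer is not shown in this PR text, so treat this only as one possible shape. It requires C++17.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Size-only stream: it never stores bytes, it just counts them.
struct SizeCounterSketch {
    std::size_t size{0};
    std::size_t seek_calls{0};
    std::size_t write_calls{0};
    void write(const unsigned char*, std::size_t n) { size += n; ++write_calls; }
    void seek(std::size_t n) { size += n; ++seek_calls; }
};

struct ByteSinkSketch {
    std::vector<unsigned char> out;
    void write(const unsigned char* p, std::size_t n) { out.insert(out.end(), p, p + n); }
};

// Stand-in for a parameter-carrying wrapper around another stream.
template <typename Sub>
struct WrapperSketch {
    Sub& sub;
    void write(const unsigned char* p, std::size_t n) { sub.write(p, n); }
};

// Unwrap helpers: generic fallback, direct hit, and recursion through wrappers.
template <typename S>
std::nullptr_t FindSizeCounterSketch(S&) { return nullptr; }
inline SizeCounterSketch* FindSizeCounterSketch(SizeCounterSketch& s) { return &s; }
template <typename Sub>
auto FindSizeCounterSketch(WrapperSketch<Sub>& w) { return FindSizeCounterSketch(w.sub); }

constexpr std::size_t CompactSizeLenSketch(std::uint64_t n)
{
    return n < 253 ? 1 : n <= 0xFFFF ? 3 : n <= 0xFFFFFFFF ? 5 : 9;
}

template <typename S>
void WriteCompactSizeSketch(S& s, std::uint64_t n)
{
    const std::size_t len{CompactSizeLenSketch(n)};
    if constexpr (std::is_same_v<decltype(FindSizeCounterSketch(s)), SizeCounterSketch*>) {
        // Size-only stream underneath: advance by the encoded length and
        // skip staging bytes that would never be stored.
        FindSizeCounterSketch(s)->seek(len);
    } else {
        unsigned char buf[9];
        buf[0] = static_cast<unsigned char>(len == 1 ? n : len == 3 ? 253 : len == 5 ? 254 : 255);
        for (std::size_t i = 1; i < len; ++i) buf[i] = static_cast<unsigned char>(n >> (8 * (i - 1)));
        s.write(buf, len);
    }
}
```

The choice is made at compile time, so a wrapped byte stream keeps its normal write path and only size-only streams take the `seek` branch.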
When a ParamsStream wraps SizeComputer, have generic WriteVarInt seek by GetSizeOfVarInt instead of generating temporary encoded bytes for a size-only stream.

Focused GetSerializeSize(TX_WITH_WITNESS(payload)) benchmark on 2048 VARINT(uint64_t) values measured SizeComputerVarIntPayload 8543.45 -> 6485.23 ns/op (-24.09%).

To measure, build the temporary SizeComputer VarInt payload benchmark from doc/compiler-optimization-investigation.md. Validate with build-gcc-base/bin/test_bitcoin --run_test=serialize_tests,transaction_tests,blockmanager_tests and build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench|WriteBlockBench)$' -sanity-check.
ReadVarInt returns after one byte for the common encoding. Mark the continuation branch as unlikely, matching the one-byte write-side hint without changing the loop logic.

Focused mixed VarInt read benchmark with one multibyte value every eight entries measured ReadVarIntMixed 20107.3 -> 18676.1 ns/op (-7.12%).

To measure, build the temporary ReadVarIntMixed benchmark from doc/compiler-optimization-investigation.md and run alternating pinned clean/patched binaries.
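
A minimal sketch of a VarInt decoder with the continuation branch marked unlikely. The hint is expressed through a hypothetical `SKETCH_UNLIKELY` macro built on `__builtin_expect`; which spelling the actual patch uses (for example the C++20 `[[unlikely]]` attribute) is not stated here, so the macro is an assumption. The decoding loop follows Bitcoin Core's VarInt scheme.

```cpp
#include <cstddef>
#include <cstdint>
#include <ios>
#include <limits>
#include <vector>

// Assumed hint spelling; falls back to a plain condition on other compilers.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

std::uint64_t ReadVarIntSketch(const std::vector<unsigned char>& data, std::size_t& pos)
{
    constexpr std::uint64_t max{std::numeric_limits<std::uint64_t>::max()};
    std::uint64_t n{0};
    while (true) {
        if (pos >= data.size()) throw std::ios_base::failure("end of data");
        const unsigned char byte{data[pos++]};
        if (n > (max >> 7)) throw std::ios_base::failure("ReadVarInt(): size too large");
        n = (n << 7) | (byte & 0x7F);
        if (SKETCH_UNLIKELY(byte & 0x80)) {
            // Continuation: rare for typical payloads, so laid out off the hot path.
            if (n == max) throw std::ios_base::failure("ReadVarInt(): size too large");
            ++n;
        } else {
            return n; // common case: the first byte is also the last
        }
    }
}
```

The hint changes only code layout; the decoded values and the loop logic are identical with or without it.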
Force ser_readdata8 inline so CompactSize and VarInt decoders keep their byte read in the hot caller. This is a narrow annotation on the primitive one-byte read helper.

Focused retained-tree benchmarks measured ReadCompactSizeMixed 17513.5 -> 16342.5 ns/op (-6.69%) and ReadVarIntMixed 17159.8 -> 17161.9 ns/op (flat).

To measure, build the temporary ReadCompactSizeMixed and ReadVarIntMixed benchmarks from doc/compiler-optimization-investigation.md. Validate with build-gcc-base/bin/test_bitcoin --run_test=serialize_tests,streams_tests,transaction_tests and build-gcc-base/bin/bench_bitcoin -filter='^(DeserializeBlockTest|ReadBlockBench|WriteBlockBench)$' -sanity-check.
Force ReadVarInt inline so formatter-heavy deserialization keeps the common one-byte decode and continuation check in the caller. This is a code-size tradeoff, so it sits late in the review order.

Focused mixed VarInt read benchmark with one multibyte value every eight entries measured ReadVarIntMixed 19397.5 -> 8572.54 ns/op (-55.81%). A later rerun on the retained tree, compared against a no-inline build, still showed roughly a 2x focused win, so the earlier absolute numbers are treated as benchmark-layout sensitive but directionally valid.

To measure, build the temporary ReadVarIntMixed benchmark from doc/compiler-optimization-investigation.md and compare alternating pinned binaries with and without ALWAYS_INLINE on ReadVarInt.
Force ReadCompactSize inline so the common one-byte size decode stays in its caller. This is deliberately late in the series because it buys speed with extra generated code.

Focused mixed CompactSize read benchmark with one 254-byte value every eight entries measured ReadCompactSizeMixed 17752.8 -> 8184.2 ns/op (-53.90%). A later rerun on the retained tree, compared against a no-inline build, still showed about a 2x focused win, so the exact absolute numbers are treated as layout-sensitive.

To measure, build the temporary ReadCompactSizeMixed benchmark from doc/compiler-optimization-investigation.md and compare alternating pinned binaries with and without ALWAYS_INLINE on ReadCompactSize. Track binary size with size build-gcc-base/bin/bench_bitcoin.
Force WriteCompactSize inline so the common one-byte encode stays in the caller. The measured speedup is small and the binary grows, so this is ordered after the clearer source cleanups.

Focused mixed CompactSize write benchmark with one 254-byte value every eight entries measured WriteCompactSizeMixed 9504.43 -> 9421.75 ns/op (-0.87%). The retained build showed roughly 35 KiB additional bench_bitcoin .text for this change.

To measure, build the temporary WriteCompactSizeMixed benchmark from doc/compiler-optimization-investigation.md and compare alternating pinned binaries with and without ALWAYS_INLINE on WriteCompactSize. Track binary size with size build-gcc-base/bin/bench_bitcoin.
Record the retained optimization results, rejected source experiments, benchmark caveats, and measurement environment for the reordered serialization optimization series.

The source commits carry their own focused speedups and measurement notes; this document keeps the broader investigation details, including biased/noisy benchmark cases, distribution data for block 413567, binary-size observations, and validation commands.
Move TxValidationState and PackageValidationState into mempool acceptance result objects after their final local use, and pass the optional ActivateBestChain block by const reference instead of copying its shared_ptr.

The result constructors are used throughout transaction and package validation failure paths. This removes avoidable state copies on those paths, while the ActivateBestChain signature avoids a shared_ptr reference-count update each time block activation is driven with an already-owned block.

Found with a temporary clang-tidy-20 run enabling performance-unnecessary-value-param, clang-analyzer-cplusplus.Move, cppcoreguidelines-rvalue-reference-param-not-moved, and the related performance move/copy checks over validation/serialization hot files.

To measure, compare perf counters for validation-heavy runs such as bench_bitcoin -filter='MemPoolAddTransactions|ConnectBlock' and package/tx validation tests. The expected effect is small and path-dependent: fewer TxValidationState/PackageValidationState copies in failure construction and one less shared_ptr copy per ActivateBestChain call.
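
Both cleanups in this commit follow common C++ ownership patterns, sketched below with hypothetical types. `ValidationStateSketch`, `AcceptResultSketch`, and `BlockSketch` are placeholders, not the validation.h classes, and the functions only illustrate the move and the reference-count behaviour.

```cpp
#include <memory>
#include <string>
#include <utility>

struct ValidationStateSketch {
    std::string reject_reason;
};

struct AcceptResultSketch {
    ValidationStateSketch state;

    // Takes the state by value and moves it into the result, so a caller
    // that passes an rvalue pays for no string copy.
    static AcceptResultSketch Failure(ValidationStateSketch s)
    {
        return AcceptResultSketch{std::move(s)};
    }
};

AcceptResultSketch RejectSketch(std::string reason)
{
    ValidationStateSketch state;
    state.reject_reason = std::move(reason);
    // Final local use of `state`: hand it over instead of copying it.
    return AcceptResultSketch::Failure(std::move(state));
}

struct BlockSketch {
    int height{0};
};

// By value: constructing the parameter bumps the reference count.
long UseCountByValueSketch(std::shared_ptr<const BlockSketch> block) { return block.use_count(); }

// By const reference: the caller's pointer is observed without a refcount update.
long UseCountByConstRefSketch(const std::shared_ptr<const BlockSketch>& block) { return block.use_count(); }
```

With a single owner, the by-value function observes a use count of 2 and the by-reference function observes 1, which is the atomic increment and decrement the signature change avoids.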

Verification: clang-tidy-20 -p build-clang-o3 with the expanded checks on src/validation.cpp cleared the patched validation result and ActivateBestChain diagnostics; remaining reports are cold fs::path copies in snapshot cleanup plus an analyzer warning around repeated package-result moves. Also ran cmake --build build-gcc-base --target test_bitcoin bench_bitcoin -j4 and build-gcc-base/bin/test_bitcoin --run_test=txvalidation_tests,txpackage_tests,validation_tests,validation_block_tests,validation_chainstate_tests,validation_chainstatemanager_tests --catch_system_errors=no.
Move shared pointers into ListenSocket, CConnman, and CNode members at ownership handoff points, avoid copying the specified outgoing connection list while the connection thread is running, take CNodeOptions by value because the constructor consumes its session field, and avoid copying std::function in ForNode.

These are small p2p cleanup wins rather than broad algorithmic changes: they remove shared_ptr reference-count updates, an outgoing-address vector copy, and a callback wrapper copy in connection-management paths.

Found with a temporary clang-tidy-20 run enabling performance-unnecessary-value-param, cppcoreguidelines-rvalue-reference-param-not-moved, bugprone move checks, and the related performance move/copy checks over p2p and serialization-related hot files.

To measure, compare perf stat or callgrind/instruction counts on connection-churn focused runs such as the net test suites or a temporary CConnman/CNode construction microbenchmark. Existing benchmark coverage does not directly isolate these ownership handoffs, so a stable end-to-end bench delta is expected to be tiny/noisy.

Verification: clang-tidy-20 -p build-clang-o3 with the expanded checks on src/net.cpp produced no remaining user-visible diagnostics for this patch. Also ran cmake --build build-gcc-base --target test_bitcoin bench_bitcoin -j4 and build-gcc-base/bin/test_bitcoin --run_test=net_tests,net_peer_connection_tests,net_peer_eviction_tests,netbase_tests,sock_tests --catch_system_errors=no.
l0rinc changed the title from "Detached529" to "blocks & script & serialization perf" on May 13, 2026