Optimize hot-path performance across allocator core by emeryberger · Pull Request #84 · emeryberger/Hoard

emeryberger · 2026-03-04T18:59:43Z

Summary

Seven independent optimizations targeting remaining hot-path bottlenecks, validated with A/B benchmarking:

Multiplicative inverse for normalize() modulo — replaces expensive % operator (~30 cycles) with multiply+shift (~4 cycles) for non-power-of-two object sizes via precomputed __uint128_t inverse
O(1) size class lookup for GeometricSizeClass — replaces O(log N) binary search with CLZ-based bit-position lookup + short linear scan (~5 iterations max)
Spinlock exponential backoff with PAUSE/YIELD — re-enables PAUSE on x86 and YIELD on ARM64 with exponential backoff (1→2→4→...→64 then yield) to reduce cache-line bouncing
EmptyClass bitmap for O(1) fullest-bin lookup — _binmap bitmap + __builtin_clz finds fullest non-empty class in O(1) instead of linear scan
ThresholdSegHeap selective clearing — bitmap tracks active bins; only clears bins that were used instead of iterating all entries
Cache-line aligned Statistics — pads to 64 bytes to prevent false sharing between adjacent bins; adds platform-aware DESTRUCTIVE_INTERFERENCE_SIZE (128 on Apple Silicon, 64 on x86) for cross-thread TLAB padding
Superblock header hot/cold field separation — reorders fields so read-only data (cache line 1), allocation-path mutables (line 2), and ownership/linking (line 3) don't share cache lines

Benchmark Results (Apple Silicon, 5 runs, min/avg/max)

Benchmark	Baseline avg	Optimized avg	Change
larson 4t (ops/s, higher=better)	159M	169M	+6%
threadtest 4t/256B	2.40s	2.23s	+7%
linux-scalability 4t	21.3ms	20.3ms	+5%
threadtest 4t/8B	27.8ms	28.7ms	neutral
threadtest 16t/8B	14.9ms	16.0ms	neutral
cache-scratch 4t	61μs	60μs	neutral
cache-scratch 16t	182μs	184μs	neutral

All benchmarks remain ~3-5x faster than the system allocator.

Test plan

Build succeeds on macOS (ARM64) with cmake .. && make
All benchmarks run correctly with DYLD_INSERT_LIBRARIES
No regressions on any benchmark (A/B tested, 5 runs each)
Build on Linux (x86-64)
Build on Windows (MSVC) — __uint128_t paths have MSVC fallbacks
Run under AddressSanitizer (-fsanitize=address)

🤖 Generated with Claude Code

Performance optimizations (7 independent changes): 1. Multiplicative inverse for normalize() modulo — replace expensive % operator (~30 cycles) with multiply+shift (~4 cycles) for non-power-of-two object sizes via precomputed __uint128_t inverse. 2. O(1) GeometricSizeClass lookup — replace O(log N) binary search with CLZ-based bit-position lookup + short linear scan. 3. Spinlock exponential backoff with PAUSE/YIELD — re-enable PAUSE on x86 and YIELD on ARM64, exponential backoff (1→64 then yield). 4. EmptyClass bitmap for O(1) fullest-bin lookup — _binmap + __builtin_clz finds fullest non-empty class in O(1) instead of linear scan. 5. ThresholdSegHeap selective clearing — bitmap tracks active bins; only clears bins that were used instead of iterating all entries. 6. Cache-line aligned Statistics — pad to 64 bytes (L1 cache line) to prevent false sharing; platform-aware DESTRUCTIVE_INTERFERENCE_SIZE (128 on Apple Silicon, 64 on x86) for cross-thread TLAB padding. 7. Superblock header hot/cold field separation — read-only fields on cache line 1, allocation-path mutables on line 2, ownership/linking on line 3. Benchmarked on Apple Silicon (5 runs per config): larson (server workload): +6% (159M → 169M ops/s) threadtest 256B/4t: +7% (2.40s → 2.23s) linux-scalability 4t: +5% (21.3ms → 20.3ms) threadtest/cache-scratch: neutral CI (GitHub Actions on Linux, macOS, Windows): - Benchmarks use mimalloc-bench parameters (cache-scratch/cache-thrash: $N 1000 1 2000000, larson: 5 8 1000 5000 100 4141 $N). - Linux: benchmarks linked against static libhoard (avoids gnuwrapper symbol issues with LD_PRELOAD). - macOS: static link with -force_load for interposition sections; continue-on-error due to GitHub runner malloc zone restrictions. - Windows: withdll.exe Detours injection for threadtest/test_malloc/ bench_malloc. - Adds HOARD_BUILD_BENCHMARKS and HOARD_LINK_BENCHMARKS CMake options. - Adds MSVC intrinsic wrappers for __builtin_ctzll/__builtin_clz. - Adds xxfree_sized/xxfree_aligned_sized stubs for latest Heap-Layers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'master' of https://github.com/emeryberger/Hoard

baf86cc

emeryberger force-pushed the perf/hot-path-optimizations branch 7 times, most recently from cfb1ef5 to 51a9faa Compare March 5, 2026 22:43

emeryberger force-pushed the perf/hot-path-optimizations branch from 51a9faa to decc1ce Compare March 5, 2026 22:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimize hot-path performance across allocator core#84

Optimize hot-path performance across allocator core#84
emeryberger wants to merge 2 commits intomasterfrom
perf/hot-path-optimizations

emeryberger commented Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

emeryberger commented Mar 4, 2026

Summary

Benchmark Results (Apple Silicon, 5 runs, min/avg/max)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant