Optimize hot-path performance across allocator core #84
Open
emeryberger wants to merge 2 commits into master from
Conversation
Force-pushed from cfb1ef5 to 51a9faa (Compare)
Performance optimizations (7 independent changes):

1. Multiplicative inverse for `normalize()` modulo — replace the expensive `%` operator (~30 cycles) with multiply+shift (~4 cycles) for non-power-of-two object sizes via a precomputed `__uint128_t` inverse.
2. O(1) `GeometricSizeClass` lookup — replace the O(log N) binary search with a CLZ-based bit-position lookup plus a short linear scan.
3. Spinlock exponential backoff with PAUSE/YIELD — re-enable PAUSE on x86 and YIELD on ARM64, with exponential backoff (1→64 spins, then yield).
4. EmptyClass bitmap for O(1) fullest-bin lookup — `_binmap` + `__builtin_clz` finds the fullest non-empty class in O(1) instead of a linear scan.
5. ThresholdSegHeap selective clearing — a bitmap tracks active bins; only bins that were used are cleared, instead of iterating over all entries.
6. Cache-line aligned Statistics — pad to 64 bytes (L1 cache line) to prevent false sharing; platform-aware `DESTRUCTIVE_INTERFERENCE_SIZE` (128 on Apple Silicon, 64 on x86) for cross-thread TLAB padding.
7. Superblock header hot/cold field separation — read-only fields on cache line 1, allocation-path mutables on line 2, ownership/linking on line 3.

Benchmarked on Apple Silicon (5 runs per config):

- larson (server workload): +6% (159M → 169M ops/s)
- threadtest 256B/4t: +7% (2.40s → 2.23s)
- linux-scalability 4t: +5% (21.3ms → 20.3ms)
- threadtest/cache-scratch: neutral

CI (GitHub Actions on Linux, macOS, Windows):

- Benchmarks use mimalloc-bench parameters (cache-scratch/cache-thrash: `$N 1000 1 2000000`, larson: `5 8 1000 5000 100 4141 $N`).
- Linux: benchmarks linked against static libhoard (avoids gnuwrapper symbol issues with LD_PRELOAD).
- macOS: static link with `-force_load` for interposition sections; continue-on-error due to GitHub runner malloc zone restrictions.
- Windows: withdll.exe Detours injection for threadtest/test_malloc/bench_malloc.
- Adds HOARD_BUILD_BENCHMARKS and HOARD_LINK_BENCHMARKS CMake options.
- Adds MSVC intrinsic wrappers for `__builtin_ctzll`/`__builtin_clz`.
- Adds `xxfree_sized`/`xxfree_aligned_sized` stubs for latest Heap-Layers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 51a9faa to decc1ce (Compare)
Summary
Seven independent optimizations targeting remaining hot-path bottlenecks, validated with A/B benchmarking:
- `normalize()` modulo — replaces the expensive `%` operator (~30 cycles) with multiply+shift (~4 cycles) for non-power-of-two object sizes via a precomputed `__uint128_t` inverse
- `GeometricSizeClass` — replaces the O(log N) binary search with a CLZ-based bit-position lookup + short linear scan (~5 iterations max)
- Spinlock exponential backoff — re-enables PAUSE on x86 and YIELD on ARM64, with backoff from 1 to 64 spins, then yield
- `_binmap` bitmap + `__builtin_clz` finds the fullest non-empty class in O(1) instead of a linear scan
- ThresholdSegHeap selective clearing — a bitmap tracks active bins so only bins that were used are cleared
- Cache-line aligned Statistics — platform-aware `DESTRUCTIVE_INTERFERENCE_SIZE` (128 on Apple Silicon, 64 on x86) for cross-thread TLAB padding
- Superblock header hot/cold field separation — read-only fields, allocation-path mutables, and ownership/linking on separate cache lines

Benchmark Results (Apple Silicon, 5 runs, min/avg/max)
All benchmarks remain ~3-5x faster than the system allocator.
Test plan
- Builds with `cmake .. && make`
- macOS interposition exercised via `DYLD_INSERT_LIBRARIES`
- `__uint128_t` paths have MSVC fallbacks
- Passes under AddressSanitizer (`-fsanitize=address`)

🤖 Generated with Claude Code