Skip to content

Optimize hot-path performance across allocator core#84

Open
emeryberger wants to merge 2 commits intomasterfrom
perf/hot-path-optimizations
Open

Optimize hot-path performance across allocator core#84
emeryberger wants to merge 2 commits intomasterfrom
perf/hot-path-optimizations

Conversation

@emeryberger
Copy link
Copy Markdown
Owner

Summary

Seven independent optimizations targeting remaining hot-path bottlenecks, validated with A/B benchmarking:

  • Multiplicative inverse for normalize() modulo — replaces expensive % operator (~30 cycles) with multiply+shift (~4 cycles) for non-power-of-two object sizes via precomputed __uint128_t inverse
  • O(1) size class lookup for GeometricSizeClass — replaces O(log N) binary search with CLZ-based bit-position lookup + short linear scan (~5 iterations max)
  • Spinlock exponential backoff with PAUSE/YIELD — re-enables PAUSE on x86 and YIELD on ARM64 with exponential backoff (1→2→4→...→64 then yield) to reduce cache-line bouncing
  • EmptyClass bitmap for O(1) fullest-bin lookup_binmap bitmap + __builtin_clz finds fullest non-empty class in O(1) instead of linear scan
  • ThresholdSegHeap selective clearing — bitmap tracks active bins; only clears bins that were used instead of iterating all entries
  • Cache-line aligned Statistics — pads to 64 bytes to prevent false sharing between adjacent bins; adds platform-aware DESTRUCTIVE_INTERFERENCE_SIZE (128 on Apple Silicon, 64 on x86) for cross-thread TLAB padding
  • Superblock header hot/cold field separation — reorders fields so read-only data (cache line 1), allocation-path mutables (line 2), and ownership/linking (line 3) don't share cache lines

Benchmark Results (Apple Silicon, 5 runs, min/avg/max)

Benchmark Baseline avg Optimized avg Change
larson 4t (ops/s, higher=better) 159M 169M +6%
threadtest 4t/256B 2.40s 2.23s +7%
linux-scalability 4t 21.3ms 20.3ms +5%
threadtest 4t/8B 27.8ms 28.7ms neutral
threadtest 16t/8B 14.9ms 16.0ms neutral
cache-scratch 4t 61μs 60μs neutral
cache-scratch 16t 182μs 184μs neutral

All benchmarks remain ~3-5x faster than the system allocator.

Test plan

  • Build succeeds on macOS (ARM64) with cmake .. && make
  • All benchmarks run correctly with DYLD_INSERT_LIBRARIES
  • No regressions on any benchmark (A/B tested, 5 runs each)
  • Build on Linux (x86-64)
  • Build on Windows (MSVC) — __uint128_t paths have MSVC fallbacks
  • Run under AddressSanitizer (-fsanitize=address)

🤖 Generated with Claude Code

@emeryberger emeryberger force-pushed the perf/hot-path-optimizations branch 7 times, most recently from cfb1ef5 to 51a9faa Compare March 5, 2026 22:43
Performance optimizations (7 independent changes):

1. Multiplicative inverse for normalize() modulo — replace expensive %
   operator (~30 cycles) with multiply+shift (~4 cycles) for
   non-power-of-two object sizes via precomputed __uint128_t inverse.

2. O(1) GeometricSizeClass lookup — replace O(log N) binary search with
   CLZ-based bit-position lookup + short linear scan.

3. Spinlock exponential backoff with PAUSE/YIELD — re-enable PAUSE on
   x86 and YIELD on ARM64, exponential backoff (1→64 then yield).

4. EmptyClass bitmap for O(1) fullest-bin lookup — _binmap + __builtin_clz
   finds fullest non-empty class in O(1) instead of linear scan.

5. ThresholdSegHeap selective clearing — bitmap tracks active bins; only
   clears bins that were used instead of iterating all entries.

6. Cache-line aligned Statistics — pad to 64 bytes (L1 cache line) to
   prevent false sharing; platform-aware DESTRUCTIVE_INTERFERENCE_SIZE
   (128 on Apple Silicon, 64 on x86) for cross-thread TLAB padding.

7. Superblock header hot/cold field separation — read-only fields on
   cache line 1, allocation-path mutables on line 2, ownership/linking
   on line 3.

Benchmarked on Apple Silicon (5 runs per config):
  larson (server workload):     +6%  (159M → 169M ops/s)
  threadtest 256B/4t:           +7%  (2.40s → 2.23s)
  linux-scalability 4t:         +5%  (21.3ms → 20.3ms)
  threadtest/cache-scratch:     neutral

CI (GitHub Actions on Linux, macOS, Windows):

- Benchmarks use mimalloc-bench parameters (cache-scratch/cache-thrash:
  $N 1000 1 2000000, larson: 5 8 1000 5000 100 4141 $N).
- Linux: benchmarks linked against static libhoard (avoids gnuwrapper
  symbol issues with LD_PRELOAD).
- macOS: static link with -force_load for interposition sections;
  continue-on-error due to GitHub runner malloc zone restrictions.
- Windows: withdll.exe Detours injection for threadtest/test_malloc/
  bench_malloc.
- Adds HOARD_BUILD_BENCHMARKS and HOARD_LINK_BENCHMARKS CMake options.
- Adds MSVC intrinsic wrappers for __builtin_ctzll/__builtin_clz.
- Adds xxfree_sized/xxfree_aligned_sized stubs for latest Heap-Layers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@emeryberger emeryberger force-pushed the perf/hot-path-optimizations branch from 51a9faa to decc1ce Compare March 5, 2026 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant