Commit 32cd98a
peng.li24
perf: inline f32 poly loops + AVX-512 sqrt/abs + CMake tests build
Performance improvements (N=524288, 0 ULP maintained):
exp f32: 0.085x numpy → 0.70x (about 8x faster than the old scalar-per-call approach)
log f32: 0.095x numpy → 0.87x (about 9x faster)
sin f32: 0.054x numpy → 0.74x (about 14x faster)
cos f32: 0.053x numpy → 0.72x (about 14x faster)
sqrt f32: 0.910x numpy → 1.07x (now vectorized, AVX-512 immune to throttle)
sqrt f64: parity maintained
Root causes fixed:
1. noinline helper functions (npy_expf_vec16 etc.) caused 32768 function
calls per 524288-element array; now the polynomial is inlined directly
into each template specialization with all 14-15 constants defined as
non-static locals before the loop — GCC keeps them in zmm8-zmm31.
2. -ffloat-store in Makefile caused GCC to spill every __m512 intermediate
to the stack and reload it, doubling the instruction count for every
operation. Removed (redundant on x86-64 with SSE/AVX default float ABI).
3. sqrt/abs had no AVX-512 specialization; added 16-wide float and 8-wide
double loops using _mm512_sqrt_ps/pd and _mm512_abs_ps/pd (IEEE 754
exact, 0 ULP, immune to CPU frequency throttling caused by other AVX-512
loops running in the same process).
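
The loop shape described in items 1 and 3 can be sketched roughly as follows. This is an illustration only, not the code from this commit: the function names poly3_f32 and sqrt_f32 are hypothetical, and the four coefficients are placeholder Taylor terms rather than the 14-15 constants mentioned above. A scalar fallback is included so the sketch also builds without -mavx512f.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical example: evaluates p(x) = c0 + c1*x + c2*x^2 + c3*x^3 per element.
// The coefficients are placeholders, not the constants used in the commit.
inline void poly3_f32(const float* in, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX512F__)
    // Constants are non-static locals defined before the loop, so the
    // compiler is free to keep them in zmm registers across iterations.
    const __m512 c0 = _mm512_set1_ps(1.0f);
    const __m512 c1 = _mm512_set1_ps(1.0f);
    const __m512 c2 = _mm512_set1_ps(0.5f);
    const __m512 c3 = _mm512_set1_ps(1.0f / 6.0f);
    for (; i + 16 <= n; i += 16) {
        const __m512 x = _mm512_loadu_ps(in + i);
        // Horner evaluation written directly in the loop body:
        // no per-vector call into a noinline helper.
        __m512 p = _mm512_fmadd_ps(c3, x, c2);
        p = _mm512_fmadd_ps(p, x, c1);
        p = _mm512_fmadd_ps(p, x, c0);
        _mm512_storeu_ps(out + i, p);
    }
#endif
    // Scalar tail (and full fallback when AVX-512 is not enabled).
    for (; i < n; ++i) {
        const float x = in[i];
        out[i] = std::fma(std::fma(std::fma(1.0f / 6.0f, x, 0.5f), x, 1.0f), x, 1.0f);
    }
}

// Hypothetical example: 16-wide float square root with a scalar tail.
inline void sqrt_f32(const float* in, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX512F__)
    for (; i + 16 <= n; i += 16) {
        _mm512_storeu_ps(out + i, _mm512_sqrt_ps(_mm512_loadu_ps(in + i)));
    }
#endif
    for (; i < n; ++i) {
        out[i] = std::sqrt(in[i]);
    }
}
```

With 16 floats per iteration, a 524288-element array takes 32768 iterations of the vector loop, which matches the call count quoted in item 1 for the old helper-based version.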
Build system:
- Replace tests/Makefile with tests/CMakeLists.txt
cmake -S tests -B tests/build && cmake --build tests/build
cmake --build tests/build --target test
- Update root CMakeLists.txt help messages accordingly

1 parent d81c887 commit 32cd98a
5 files changed
Lines changed: 537 additions & 34 deletions