Commit 06d30d3

author
peng.li24
committed
docs: reconcile bitexact compiler flags between README and tests/CMakeLists.txt
- Fix -mavx512f status: required → recommended (code has #ifdef guards, compiles and runs correctly without it via scalar npy_* path)
- Add missing -msse4.1 (required: _mm_insert_epi32 in linalg.h)
- Add -O2 to README flag list and table
- Separate -mfma from -mavx512f with proper dependency explanation
- Fix -mprefer-vector-width=256: only required with -mavx512f
- List all 19 -fno-builtin-* flags explicitly in README
- Update test count: 900 → 981
- Remove misleading claim in manual section
- Empirically verified: -fno-builtin-* removal = 0 new failures
1 parent d2bb2b6 commit 06d30d3

2 files changed

Lines changed: 37 additions & 14 deletions

README.md

Lines changed: 27 additions & 7 deletions
@@ -110,7 +110,7 @@ target_compile_features(mymodule PRIVATE cxx_std_17)
110110
**Manual (header-only)**
111111

112112
Add `-Ipath/to/numpycpp` to your compiler flags and include the headers directly. No build step, no copy required.
113-
- Bitexact backend: add `-ldl` at link time (no other flags needed at `-O2`; see compiler flags table below)
113+
- Bitexact backend: add `-ldl` at link time. See compiler flags table below for required flags (`-ffp-contract=off`, `-msse4.1`, etc.).
114114
- Std backend: add `-DNUMPYCPP_STD_ONLY` (no `-ldl` needed)
115115

116116
### Testing
@@ -158,22 +158,42 @@ The minimum set was determined empirically: each flag was removed in isolation
158158
and the full 981-test suite was re-run. Only flags whose removal caused at
159159
least one test failure are marked **required**.
160160

161+
The SVML bridge compiles cleanly with or without `-mavx512f` — all AVX-512 code
162+
is guarded by `#ifdef __AVX512F__`. Without `-mavx512f`, the scalar `npy_*`
163+
path is used (resolved via `dlsym` from numpy's `.so`), which is still bit-exact.
164+
We recommend using cmake's `check_cxx_source_runs` to probe AVX-512 at configure
165+
time — see [`tests/CMakeLists.txt`](tests/CMakeLists.txt) for a complete example.
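
For illustration only, the body of such a configure-time probe might look like the C++ sketch below. The helper names `avx512f_usable` and `probe_exit_code` are hypothetical and do not come from numpycpp. The sketch relies on the GCC/Clang builtin `__builtin_cpu_supports`, which only exists on x86 targets, and a real probe program would return `probe_exit_code()` from `main` so that a run-based check sees exit status 0 only on a capable machine. Whether this CPUID-style query also catches the hypervisor case described in the flags table is not established here; a probe that actually executes a ZMM instruction is the stronger check.

```cpp
// Hypothetical helper (not part of numpycpp): reports whether the running
// CPU advertises AVX-512F. On non-x86 targets or other compilers it
// conservatively answers "no".
bool avx512f_usable() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // safe to call more than once
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}

// Exit status a probe program would return from main():
// 0 means "enable -mavx512f", non-zero means "use the scalar path".
int probe_exit_code() {
    return avx512f_usable() ? 0 : 1;
}
```

A build script could then add `-mavx512f`, `-mfma` and `-mprefer-vector-width=256` only when the probe succeeds, and otherwise fall back to the scalar `npy_*` path.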
166+
161167
```cmake
162168
target_compile_options(<target> PRIVATE
169+
-O2
163170
-ffp-contract=off # REQUIRED — see below
164-
-mavx512f -mfma # REQUIRED — see below
165-
-mprefer-vector-width=256 # REQUIRED — see below
171+
-msse4.1 # REQUIRED — see below
172+
-mfma # REQUIRED with -mavx512f — see below
173+
-mavx512f # recommended — see below (use cmake probe)
174+
-mprefer-vector-width=256 # REQUIRED with -mavx512f — see below
175+
# Defensive: prevent GCC from substituting npy_* call sites with builtins.
176+
# No test currently depends on these — kept as future-proofing.
177+
-fno-builtin-exp -fno-builtin-log -fno-builtin-sin
178+
-fno-builtin-cos -fno-builtin-tan -fno-builtin-pow
179+
-fno-builtin-sqrt -fno-builtin-atan2 -fno-builtin-log2
180+
-fno-builtin-log10 -fno-builtin-asin -fno-builtin-acos
181+
-fno-builtin-atan -fno-builtin-exp2
182+
-fno-builtin-cbrt -fno-builtin-expm1 -fno-builtin-log1p
166183
)
167184
target_link_libraries(<target> PRIVATE dl) # REQUIRED — dlsym
168185
```
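
The `dl` link dependency exists because math functions are looked up by name at runtime. The C++ sketch below shows the general `dlopen`/`dlsym` pattern only. The helper `resolve_unary` is hypothetical, and it searches the libraries already loaded into the process rather than numpy's `_multiarray_umath.so`, whose path is not given here.

```cpp
#include <dlfcn.h>

// Signature of a unary double-precision math function such as cos.
using unary_fn = double (*)(double);

// Hypothetical helper (not part of numpycpp): look up `symbol` among the
// libraries already loaded into this process. Returns nullptr when the
// symbol cannot be found.
unary_fn resolve_unary(const char* symbol) {
    // A null filename yields a handle for the main program, which also
    // searches its loaded dependencies.
    void* handle = dlopen(nullptr, RTLD_NOW);
    if (handle == nullptr) {
        return nullptr;
    }
    dlerror();  // clear any stale error state before the lookup
    void* addr = dlsym(handle, symbol);
    const char* err = dlerror();
    if (err != nullptr || addr == nullptr) {
        return nullptr;
    }
    return reinterpret_cast<unary_fn>(addr);
}
```

A caller should always check for `nullptr` before invoking the result, since the lookup can fail if the library that provides the symbol is not loaded.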
169186

170187
| Flag | Status | Why required | Consequence of removal |
171188
|------|:------:|-------------|------------------------|
172-
| `-ffp-contract=off` | **required** | Prevents silent FMA fusion of `a*b+c`. einsum loops must match numpy's BLAS multiply-then-add order. | 36 einsum tests fail with ±1 ULP. |
173-
| `-mavx512f -mfma` | **required** | SVML bridge declares `exp_svml_f64` etc. inside `#ifdef __AVX512F__`. AVX-512 intrinsics are runtime-guarded — binary safe on non-AVX-512 CPUs. | Hard compile error: `'exp_svml_f64' was not declared`. |
174-
| `-mprefer-vector-width=256` | **required** | Prevents GCC from emitting ZMM instructions globally. Some cloud VMs expose `avx512f` in CPUID but trap ZMM via hypervisor XSAVE. The SVML bridge is safe (runtime guard), but unguarded auto-vectorized ZMM causes SIGILL. | SIGILL at startup on some cloud VMs (GitHub Actions azure runners). |
189+
| `-O2` | recommended | Standard optimization level. Without optimization, bit-exact results are preserved but performance degrades significantly. | Slow execution (correctness unaffected). |
190+
| `-ffp-contract=off` | **required** | Prevents silent FMA fusion of `a*b+c`. einsum loops must match numpy's BLAS multiply-then-add order. | 36 einsum tests fail with ±1 ULP (verified). |
191+
| `-msse4.1` | **required** | `linalg.h` uses `_mm_insert_epi32` (SSE4.1 instruction) unconditionally. | Hard compile error: `'__builtin_ia32_pinsrd' requires SSE4.1`. |
192+
| `-mfma` | **required** with `-mavx512f` | `avx512_loops.h` uses `_mm512_fmadd_ps/pd` inside `#ifdef __AVX512F__`. Only needed together with `-mavx512f`. | Hard compile error when `-mavx512f` is enabled without `-mfma`. |
193+
| `-mavx512f` | recommended | Enables the AVX-512 SVML vector path (`__svml_exp8`, etc.) and wide-loop specializations in `avx512_loops.h`. Without it, the scalar `npy_*` path is used — still bit-exact, but 4–8× slower on large arrays. **Safe on non-AVX-512 CPUs:** all AVX-512 code is isolated behind `__attribute__((target("avx512f")))` + runtime `cpu_has_avx512f()` guard. | Fallback to scalar `npy_*` path (still bit-exact, slower). |
194+
| `-mprefer-vector-width=256` | **required** with `-mavx512f` | Prevents GCC from emitting ZMM (512-bit) instructions in auto-vectorized code when `-mavx512f` is enabled. Some cloud VMs expose `avx512f` in CPUID but trap ZMM via hypervisor XSAVE. Explicit AVX-512 intrinsics are safe (runtime-guarded), but unguarded auto-vectorized ZMM causes SIGILL. No effect without `-mavx512f`. | SIGILL at startup on some cloud VMs (GitHub Actions Azure runners). |
175195
| `-ldl` | **required** | `dlsym`/`dlopen` locate numpy's `_multiarray_umath.so` at runtime. | Link error: `undefined reference to 'dlsym'`. |
176-
| `-fno-builtin-exp`| recommended | Prevents GCC from substituting npy_* call sites with builtins. numpycpp never calls `exp()` from `<cmath>` directly, so no current effect — kept as defensive guard. | No test failure when removed today. |
196+
| `-fno-builtin-*` (full list) | recommended | Prevents GCC from substituting `npy_*` call sites with builtin implementations. numpycpp resolves math functions via `dlsym` at runtime and never calls `exp()` from `<cmath>` directly, so these flags have no current effect. Kept as a defensive guard against future GCC versions. | No test failure when removed today (verified on GCC 9–14). |
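
To see why FMA fusion changes results at all, the C++ sketch below compares a multiply-then-add against a single fused operation. The struct `FmaDemo`, the function `fma_contraction_demo` and the input values are hypothetical and chosen only for illustration; they are not taken from the einsum tests. The exact product `a*b` here is `1 + 2^-26 + 2^-54`, which does not fit in a double, so rounding it before the add discards the `2^-54` term that the fused operation keeps.

```cpp
#include <cmath>

// Results of evaluating a*b + c in two ways.
struct FmaDemo {
    double unfused;  // round(a*b) + c: multiply, round, then add
    double fused;    // fma(a, b, c): one rounding at the very end
};

// Hypothetical demo (not part of numpycpp).
FmaDemo fma_contraction_demo() {
    const double a = 1.0 + std::ldexp(1.0, -27);  // 1 + 2^-27
    const double b = a;
    const double c = -1.0;
    // volatile forces the product to be rounded to double before the add,
    // so the compiler cannot contract the two steps into one FMA here.
    volatile double product = a * b;
    FmaDemo r;
    r.unfused = product + c;      // 2^-26
    r.fused = std::fma(a, b, c);  // 2^-26 + 2^-54
    return r;
}
```

The cancellation against `c` makes the gap unusually visible in this example. In a long accumulation the same effect would more typically surface as a small last-bit difference, which is consistent with the einsum failures listed in the table.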
177197

178198
#### Compiler flags — std backend (`NUMPYCPP_STD_ONLY=ON`)
179199

tests/CMakeLists.txt

Lines changed: 10 additions & 7 deletions
@@ -3,8 +3,8 @@
33
# Two build modes (select with -DNUMPYCPP_STD_ONLY):
44
#
55
# OFF (default) — bit-exact mode:
6-
# All 900 tests verify IEEE 754 bit-identical results vs numpy.
7-
# Requires: dlsym, numpy .so loaded, AVX-512 capable machine (or fallback).
6+
# All 981 tests verify IEEE 754 bit-identical results vs numpy.
7+
# Requires: dlsym, numpy .so loaded.
88
# cmake -S tests -B tests/build
99
#
1010
# ON — std / performance-first mode:
@@ -108,12 +108,15 @@ if(NUMPYCPP_STD_ONLY)
108108
else()
109109
# ── bit-exact build (default) ─────────────────────────────────────────────
110110
# Flags determined empirically: each flag's removal was tested against all
111-
# 900 tests. Only flags whose removal caused failures are marked REQUIRED.
111+
# 981 tests. Only flags whose removal caused failures are marked REQUIRED.
112112
target_compile_options(numpycpp PRIVATE
113113
-O2
114-
-ffp-contract=off # REQUIRED: no implicit FMA (keeps Cody-Waite exact)
115-
-msse4.1 -mfma # baseline SSE4.1 + FMA
116-
# disable builtin replacements so calls go through SVML/npy_math paths
114+
-ffp-contract=off # REQUIRED — verified: 36 einsum tests fail without
115+
-msse4.1 # REQUIRED — _mm_insert_epi32 in linalg.h
116+
-mfma # REQUIRED — _mm512_fmadd_* in AVX-512 loops
117+
# Disable GCC builtin replacements — defensive: no test currently fails
118+
# without these, but they prevent future GCC versions from silently
119+
# substituting npy_* call sites with builtins.
117120
-fno-builtin-exp -fno-builtin-log -fno-builtin-sin
118121
-fno-builtin-cos -fno-builtin-tan -fno-builtin-pow
119122
-fno-builtin-sqrt -fno-builtin-atan2 -fno-builtin-log2
@@ -132,7 +135,7 @@ else()
132135
target_link_libraries(numpycpp PRIVATE OpenMP::OpenMP_CXX)
133136
endif()
134137
target_link_libraries(numpycpp PRIVATE dl)
135-
message(STATUS "Test build: bit-exact mode (-O2 -ffp-contract=off, dlsym, SVML)")
138+
message(STATUS "Test build: bit-exact mode (-O2 -ffp-contract=off -msse4.1 -mfma, dlsym, SVML)")
136139
endif()
137140

138141
# Place .so next to the test scripts for easy import
