docs: reconcile bitexact compiler flags between README and tests/CMakeLists.txt
- Fix -mavx512f status: required → recommended (code has #ifdef guards,
compiles and runs correctly without it via scalar npy_* path)
- Add missing -msse4.1 (required: _mm_insert_epi32 in linalg.h)
- Add -O2 to README flag list and table
- Separate -mfma from -mavx512f with proper dependency explanation
- Fix -mprefer-vector-width=256: only required with -mavx512f
- List all 19 -fno-builtin-* flags explicitly in README
- Update test count: 900 → 981
- Remove misleading claim in manual section
- Empirically verified: -fno-builtin-* removal = 0 new failures
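
The `#ifdef` guard with a scalar fallback mentioned in the first bullet can be sketched roughly as follows. This is a minimal illustration, not numpycpp code: `scale_scalar`, `scale_avx512`, and `scale` are hypothetical names, and the GCC/Clang builtin `__builtin_cpu_supports` is assumed here as a stand-in for the project's own runtime CPU check.

```cpp
#include <cstddef>

#ifdef __AVX512F__
#include <immintrin.h>
#endif

// Hypothetical scalar reference path: compiled on every target.
static void scale_scalar(const double* in, double* out, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * k;
    }
}

#ifdef __AVX512F__
// Hypothetical wide path: only compiled when -mavx512f defines __AVX512F__.
static void scale_avx512(const double* in, double* out, std::size_t n, double k) {
    const __m512d vk = _mm512_set1_pd(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        _mm512_storeu_pd(out + i, _mm512_mul_pd(_mm512_loadu_pd(in + i), vk));
    }
    for (; i < n; ++i) {
        out[i] = in[i] * k;  // scalar tail for the last n % 8 elements
    }
}
#endif

// Dispatcher: takes the wide path only if it was compiled in AND the CPU
// reports AVX-512F at runtime; otherwise falls back to the scalar path.
void scale(const double* in, double* out, std::size_t n, double k) {
#ifdef __AVX512F__
    if (__builtin_cpu_supports("avx512f")) {
        scale_avx512(in, out, n, k);
        return;
    }
#endif
    scale_scalar(in, out, n, k);
}
```

Built without `-mavx512f`, only the scalar path exists and `scale` still compiles and runs. A plain multiply rounds identically in both paths, so the two variants of this sketch return the same bits.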
-|`-ffp-contract=off`|**required**| Prevents silent FMA fusion of `a*b+c`. einsum loops must match numpy's BLAS multiply-then-add order. | 36 einsum tests fail with ±1 ULP. |
-|`-mavx512f -mfma`|**required**| SVML bridge declares `exp_svml_f64` etc. inside `#ifdef __AVX512F__`. AVX-512 intrinsics are runtime-guarded — binary safe on non-AVX-512 CPUs. | Hard compile error: `'exp_svml_f64' was not declared`. |
-|`-mprefer-vector-width=256`|**required**| Prevents GCC from emitting ZMM instructions globally. Some cloud VMs expose `avx512f` in CPUID but trap ZMM via hypervisor XSAVE. The SVML bridge is safe (runtime guard), but unguarded auto-vectorized ZMM causes SIGILL. | SIGILL at startup on some cloud VMs (GitHub Actions azure runners). |
+|`-O2`| recommended | Standard optimization level. Without optimization, bit-exact results are preserved but performance degrades significantly. | Slow execution (correctness unaffected). |
+|`-ffp-contract=off`|**required**| Prevents silent FMA fusion of `a*b+c`. einsum loops must match numpy's BLAS multiply-then-add order. | 36 einsum tests fail with ±1 ULP (verified). |
+|`-mfma`|**required**| `avx512_loops.h` uses `_mm512_fmadd_ps/pd` inside `#ifdef __AVX512F__`. Only needed together with `-mavx512f`. | Hard compile error if `-mavx512f` is enabled. |
+|`-mavx512f`| recommended | Enables the AVX-512 SVML vector path (`__svml_exp8`, etc.) and wide-loop specializations in `avx512_loops.h`. Without it, the scalar `npy_*` path is used — still bit-exact, but 4–8× slower on large arrays. **Safe on non-AVX-512 CPUs:** all AVX-512 code is isolated behind `__attribute__((target("avx512f")))` + runtime `cpu_has_avx512f()` guard. | Fallback to scalar `npy_*` path (still bit-exact, slower). |
+|`-mprefer-vector-width=256`|**required**| Prevents GCC from emitting ZMM (512-bit) instructions in auto-vectorized code when `-mavx512f` is enabled. Some cloud VMs expose `avx512f` in CPUID but trap ZMM via hypervisor XSAVE. Explicit AVX-512 intrinsics are safe (runtime-guarded), but unguarded auto-vectorized ZMM causes SIGILL. No effect without `-mavx512f`. | SIGILL at startup on some cloud VMs (GitHub Actions azure runners). |
|`-ldl`|**required**|`dlsym`/`dlopen` locate numpy's `_multiarray_umath.so` at runtime. | Link error: `undefined reference to 'dlsym'`. |
-|`-fno-builtin-exp` … | recommended | Prevents GCC from substituting npy_* call sites with builtins. numpycpp never calls `exp()` from `<cmath>` directly, so no current effect — kept as defensive guard. | No test failure when removed today. |
+|`-fno-builtin-*` (full list) | recommended | Prevents GCC from substituting npy_* call sites with builtin implementations. numpycpp resolves math functions via dlsym at runtime, never calling `exp()` from `<cmath>` directly — so no current effect. Kept as defensive guard against future GCC versions. | No test failure when removed today (verified on GCC 9–14). |
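
The `-ffp-contract=off` row rests on the fact that a fused multiply-add rounds once, while a separate multiply and add round twice. The sketch below shows the difference in isolation; `mul_then_add` and `fused` are hypothetical helper names, not part of numpycpp, and the `volatile` temporary is only there so the example keeps the two-step rounding regardless of the compiler's contraction setting.

```cpp
#include <cmath>

// Two roundings: the product is rounded to double, then the sum is rounded.
// This is the evaluation order that -ffp-contract=off preserves for a*b+c.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;  // volatile forces the intermediate rounding
    return product + c;
}

// One rounding: a*b+c is computed exactly and rounded once, which is what
// the compiler may emit for a*b+c when contraction is allowed.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With `a = b = 1 + 2^-27` and `c = -1`, the exact product is `1 + 2^-26 + 2^-54`. The separate multiply rounds the `2^-54` term away, so `mul_then_add` returns `2^-26`, while `fused` keeps it and returns `2^-26 + 2^-54`. A discrepancy of this kind, accumulated inside a loop, is how contraction can produce results that no longer match a multiply-then-add reference.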