
release: QBLAS 1.5.0 compiled-library rewrite with runtime CPU dispatch #15

Merged
SwayamInSync merged 37 commits into main from release/1.5
May 14, 2026


@SwayamInSync SwayamInSync commented May 14, 2026

Summary

Full rewrite of QBLAS from the legacy header-only C++ template implementation (1.0) to a compiled shared library (1.5) with a stable CBLAS-style C ABI, runtime CPU dispatch, and OpenMP-backed parallelism.

Result: measurable speedups across all BLAS levels versus the old QBLAS,
with both versions benchmarked against the same numpy float64 /
scipy-openblas baseline on the same machine; see
perf_comparison_with_old.md.

Headline numbers (AMD EPYC 7V13, AVX2 tier):

| metric | old QBLAS 1.0 | new QBLAS 1.5 |
| --- | --- | --- |
| single-thread gemm slowdown vs OpenBLAS | ~1400× | ~800× |
| 16-thread gemm @ n=256 vs OpenBLAS | 7072× | 287× |
| kernel speedup (single-core) | (baseline) | 1.6-1.7× |
| threading at small N (≤256) | broken | working |

The 24× win at n=128/256 multi-threaded gemm is from the new dynamic
nc auto-scaling; the old fixed nc=512 meant only 1 of 16 threads
did work on small problems.
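
A minimal sketch of the effect described above, assuming the outermost (jc) loop of the blocked GEMM is the one split across threads. The helper names, the rounding granule `NR = 4`, and the cap at the old `nc=512` are illustrative assumptions, not the actual QBLAS heuristic.

```python
import math

NC_MAX = 512  # the old fixed nc mentioned above, used here as an upper cap (assumption)
NR = 4        # hypothetical micro-kernel column width used for rounding


def jc_blocks(n, nc):
    """Number of column panels the outermost (jc) loop iterates over."""
    return math.ceil(n / nc)


def auto_nc(n, nthreads, nc_max=NC_MAX, nr=NR):
    """Pick nc so the jc loop yields roughly one panel per thread."""
    per_thread = math.ceil(n / nthreads)
    nc = math.ceil(per_thread / nr) * nr  # round up to a multiple of nr
    return max(nr, min(nc, nc_max))


for n in (128, 256, 4096):
    fixed = jc_blocks(n, NC_MAX)
    scaled = jc_blocks(n, auto_nc(n, 16))
    print(f"n={n}: fixed nc -> {fixed} panel(s), scaled nc -> {scaled} panel(s)")
```

With a fixed `nc=512`, n=128 and n=256 both produce a single panel, so only one thread has work; scaling `nc` down gives each of the 16 threads a panel.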

What changed

  • C ABI: cblas_q{dot,nrm2,asum,axpy,scal,gemv,ger,gemm,syrk,trmm,trsm} etc.
    Mirrors the standard CBLAS surface for the binary128 / SLEEF_quad type.
  • Architecture: compiled .so with per-ISA OBJECT libraries
    (generic / sse2 / avx2 / avx512 / neon) built from one
    kernels_template.h parametrised over QV_WIDTH and QV_ISA_SUFFIX.
  • Dispatch: CPUID-based feature detection at library init; selects
    the highest-tier kernel the host supports. Override via
    QBLAS_DISPATCH={generic,sse2,avx2,avx512,neon} for diagnostics.
  • Threading: OpenMP, opt-in by problem size with dynamically-detected
    thresholds (measures the actual fork overhead at init time rather than
    hard-coding a magic 8192). Goto-style 5-loop blocked GEMM with nc
    auto-scaled per thread count.
  • Build: CMake primary, plus meson.build for use as a subproject by
    downstream packages (notably numpy-quaddtype, integration tested on
    branch `qblas-rewrite-integration` there).
  • CI: ubuntu gcc + clang, macos-14, macos-15. Per-tier ctest run on
    Linux (`QBLAS_DISPATCH=generic|sse2|avx2 ctest`). Numpy-vs-QBLAS
    correctness test across 85 cases at every push.
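
The dispatch rule in the list above can be sketched in Python as follows. This is an illustration only: the real library does this in C from CPUID. The `supported` set, the function name, and the handling of unknown override values are assumptions.

```python
import os

# x86 tiers from lowest to highest, as listed above; neon is handled separately.
X86_TIERS = ["generic", "sse2", "avx2", "avx512"]
ALL_TIERS = set(X86_TIERS) | {"neon"}


def select_tier(supported, env=None):
    """Return the kernel tier to use.

    supported: set of tier names the host reports (hypothetical input here;
               the library derives this from CPUID at init).
    env:       mapping to read QBLAS_DISPATCH from (defaults to os.environ).
    """
    env = os.environ if env is None else env
    override = env.get("QBLAS_DISPATCH")
    if override:
        # Assumption: reject unknown names; the real behaviour may differ.
        if override not in ALL_TIERS:
            raise ValueError(f"unknown QBLAS_DISPATCH value: {override!r}")
        return override
    if "neon" in supported:
        return "neon"
    best = "generic"
    for tier in X86_TIERS:  # ascending order, so the last match is the highest
        if tier in supported:
            best = tier
    return best
```

For example, `select_tier({"sse2", "avx2"}, env={})` picks `avx2`, while setting `QBLAS_DISPATCH=generic` forces the generic kernels regardless of what the host supports.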

closes #3, closes #4, closes #5, closes #6, closes #7, closes #9, closes #13

clang on macOS builds without OpenMP, exposing four kinds of warnings
that gcc/clang-on-Linux happen to elide:

  - ts_ns() in qblas_cpu.c is only referenced from a _OPENMP-gated
    overhead probe; guard it so it doesn't compile as dead code.
  - level1 threshold comparisons compared a signed int against a size_t
    field; widen to size_t.
  - cblas_qger / trsm_left_diag / trmm_left_diag declared nthreads
    outside their _OPENMP block; move the declaration inside.

test_against_numpy was failing on macOS because find_package(Python3)
captured the framework python at configure time and missed the venv
where numpy is installed. The CI now exports VIRTUAL_ENV alongside the
PATH update, and tests/CMakeLists.txt checks 'import numpy' at
configure time, skipping the test (with a status message) if numpy is
unreachable from the chosen interpreter.
The ctypes loader hardcoded libtlfloat.so.1, libsleef.so, etc, which
on macOS don't exist - SLEEF installs libtlfloat.1.dylib, libsleef.dylib,
and libsleefquad.dylib. Replace the hardcoded names with a per-platform
candidate list that tries the versioned soname first and falls back to
the unversioned symlink. The QBLAS_TEST_* env vars still let callers
override individual paths.
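
A rough sketch of such a candidate-list loader, not the loader in the test suite. The function names, the `version` default, and the `env_var` parameter (standing in for one of the QBLAS_TEST_* overrides) are assumptions.

```python
import ctypes
import os
import sys


def candidate_names(stem, version="1"):
    """Versioned name first, then the unversioned symlink."""
    if sys.platform == "darwin":
        return [f"lib{stem}.{version}.dylib", f"lib{stem}.dylib"]
    return [f"lib{stem}.so.{version}", f"lib{stem}.so"]


def load_first(libdir, stem, version="1", env_var=None):
    """Load the first candidate that exists; an explicit env override wins."""
    if env_var and os.environ.get(env_var):
        return ctypes.CDLL(os.environ[env_var])
    for name in candidate_names(stem, version):
        path = os.path.join(libdir, name)
        if os.path.exists(path):
            return ctypes.CDLL(path)
    raise OSError(f"no lib{stem} candidate found in {libdir}")
```

On Linux `candidate_names("tlfloat")` yields `libtlfloat.so.1` then `libtlfloat.so`; on macOS it yields `libtlfloat.1.dylib` then `libtlfloat.dylib`.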

Also extend the CI matrix to cover macos-15 alongside macos-14 so the
dylib resolution path is exercised on both runners.

Replace the per-platform extension lists with a directory scan that
picks any libfoo.{so,dylib}* file matching the stem, preferring the
shortest (canonical) name. ctypes.CDLL needs an explicit path because
the libs live under .sleef-prefix/lib/ which isn't on the system
loader's search path, but we don't have to spell out which extensions
exist - whichever the bootstrap installed will be picked up.

Set DYLD_LIBRARY_PATH / LD_LIBRARY_PATH in the test environment so the
dynamic loader can resolve libqblas, libqblas_shim, and the SLEEF deps
by basename. ctypes.CDLL then just hands the loader a name and gets
back a loaded handle - no directory scanning, no platform-specific
filename lists, no env-var indirection in Python. The only platform
sniff left is the .so/.dylib suffix, which CPython's own ctypes does
the same way.
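
A minimal sketch of what the loader reduces to under this approach. The helper names are hypothetical, and it assumes the test environment has already exported LD_LIBRARY_PATH / DYLD_LIBRARY_PATH before the Python process starts.

```python
import ctypes
import sys


def shared_lib_name(stem):
    """File name for lib<stem>; the suffix is the only platform sniff."""
    suffix = ".dylib" if sys.platform == "darwin" else ".so"
    return f"lib{stem}{suffix}"


def load_by_basename(stem):
    """Hand the dynamic loader a bare name and let it search its own path.

    Assumes the library directory is on LD_LIBRARY_PATH (Linux) or
    DYLD_LIBRARY_PATH (macOS); otherwise ctypes.CDLL raises OSError.
    """
    return ctypes.CDLL(shared_lib_name(stem))
```

Because only a basename is passed, resolution of libqblas and its SLEEF dependencies is left entirely to the dynamic loader.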

When _POSIX_C_SOURCE is defined (it is, both from the source and from
the build system's -D_POSIX_C_SOURCE=200809L), Apple's <unistd.h>
hides BSD/Darwin extensions including _SC_NPROCESSORS_ONLN. The
constant is a widely supported extension rather than part of
POSIX.1-2008, so Apple keeps it gated on _DARWIN_C_SOURCE.

Define _DARWIN_C_SOURCE inside the __APPLE__ guard so the constant
becomes visible. Also wrap the sysconf call in #ifdef so a hypothetical
platform without the constant falls back to single-core instead of
breaking the build. Linux is unaffected (the __APPLE__ guard is dead code
there).
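
The same "use the constant if present, else assume one core" fallback, sketched in Python for illustration. The actual fix is a C preprocessor guard; `online_cpus` is a hypothetical name.

```python
import os


def online_cpus():
    """Online CPU count, or 1 when SC_NPROCESSORS_ONLN is unavailable."""
    names = getattr(os, "sysconf_names", {})  # absent on non-Unix platforms
    if "SC_NPROCESSORS_ONLN" in names:
        try:
            n = os.sysconf("SC_NPROCESSORS_ONLN")
        except (ValueError, OSError):
            n = -1
        if n > 0:
            return n
    return 1  # single-core fallback instead of failing


print(online_cpus())
```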

Surfaced by the numpy-quaddtype macOS-arm64 wheel build, which built
QBLAS as a subproject under cibuildwheel.

QBLAS 1.0.0 was the legacy header-only template implementation that
shipped on the main branch; this is the compiled-library rewrite with
runtime CPU dispatch, CBLAS-style C ABI, and the perf gains documented
in perf_comparison_with_old.md.
@SwayamInSync SwayamInSync changed the title from "release: QBLAS 1.5.0 — compiled-library rewrite with runtime CPU dispatch" to "release: QBLAS 1.5.0 compiled-library rewrite with runtime CPU dispatch" May 14, 2026
@SwayamInSync SwayamInSync added the enhancement (New feature or request) label May 14, 2026
@SwayamInSync SwayamInSync merged commit 27403d2 into main May 14, 2026
15 checks passed
@SwayamInSync SwayamInSync deleted the release/1.5 branch May 14, 2026 17:56