Releases: n4hy/OptimizedKernelsForRaspberryPi5_NvidiaCUDA
v0.5.16 — x86_64 Desktop RTX 5090 Benchmarks
Documentation release adding measured benchmark results for a desktop Blackwell build.
New: Intel Core Ultra 9 285K + RTX 5090 benchmark section
Measured on a desktop NVIDIA GeForce RTX 5090 (Blackwell, SM 10.0, 32 GB; CUDA 13.0.88, driver 580.126.20, Vulkan 1.3.275), Release / C++20:
- Vulkan matrix-multiply scaling from 1.28 GFLOPS (64²) to 114 GFLOPS (1024²)
- Full Vulkan compute-kernel set — vec add/mul/dot, reduce-sum, conv1D (15.6 GFLOPS), conv2D, prefix sum, matrix transpose
- CPU reference (Eigen / scalar) — Eigen GEMM ~57 GFLOPS, scalar & complex cross-correlation, std::exp/sin
- Radar scalar paths — CAF ~26.5 GFLOPS, CFAR-1D/2D, NLMS, MTI, delay-sum beamform, steering vector
- 16/16 test suites pass in 14.2 s, including the 36-test CUDA suite on SM 10.0
Complements the existing 275HX / RTX 5070 Ti laptop section. The Vulkan microbenchmarks are end-to-end (per-iteration allocation + host↔device copy), so they are transfer-bound and not a measure of the RTX 5090's peak compute.
No code behavior changes — documentation only.
Full diff: v0.5.15...v0.5.16
v0.5.15 — Complete Source Documentation Coverage
Documentation-only release: every source file in src/ now carries a comprehensive header doc block.
Documentation
- Header doc blocks for the last two source files —
src/vulkan/vulkan_backend.cpp(12 functional blocks: shader loading, buffers, context lifecycle, pipeline cache, dispatch, vector/matrix/DSP/reduction/scan/FFT ops, Mali-G720 specialization, CPU fallback) andsrc/platform/platform.cpp(7 blocks: sysfs helpers, CPU/feature/cache detection, cached accessor, core topology, thread affinity, feature/cache accessors, GEMM blocking selection). - Maximal inline documentation of Vulkan device selection — enumeration ordering, type scoring, compute-queue eligibility, and the
VK_DRIVER_FILESoverride recipe. - README release link — added release badge and "Latest release" callout.
No code behavior changes; documentation only. All 16/16 test suites continue to pass.
Full diff: v0.5.14...v0.5.15
v0.5.14 — Discrete GPU preference & x86_64 dual-GPU benchmarks
Vulkan Backend
- Discrete-GPU-preferred device selection — The backend previously picked
devices[0](first enumerated device), which on multi-GPU machines (e.g. a laptop with an integrated GPU alongside a discrete card) could default to the slower integrated GPU. It now ranks physical devices by type (discrete > integrated > virtual > CPU) and only considers devices exposing a compute queue, so the queue-family search cannot fail on the chosen device. The selected GPU is logged as[Vulkan] Selected GPU: ....
Documentation
- x86_64 workstation benchmark section — Added results for Intel Core Ultra 9 275HX + RTX 5070 Ti, showing Intel iGPU vs NVIDIA RTX 5070 Ti Vulkan numbers side-by-side, CPU/Eigen and radar reference figures, and the 16/16 test-suite (138 tests) breakdown including the CUDA suite. ARM / Orange Pi 6 Plus / Raspberry Pi 5 content is unchanged.
- Comprehensive Vulkan backend documentation —
src/vulkan/vulkan_backend.cppnow carries the same comprehensive header-doc standard introduced in v0.5.11 for the other kernel files: a file-header block covering all 12 functional blocks (shader loading, buffers, context lifecycle, pipeline cache, dispatch, vector/matrix/DSP/reduction/scan/FFT ops, Mali-G720 specialization, CPU fallback) plus maximal inline documentation of the discrete-GPU device-selection logic (enumeration ordering, type scoring, compute-queue eligibility, and theVK_DRIVER_FILESoverride recipe).
Benchmarks (x86_64: Core Ultra 9 275HX + RTX 5070 Ti)
Built with -DENABLE_NEON=OFF -DENABLE_SVE2=OFF -DENABLE_CUDA=ON -DENABLE_VULKAN=ON, Release. Two Vulkan GPUs available: integrated Intel Graphics (Arrow Lake) and discrete NVIDIA RTX 5070 Ti (Blackwell, CUDA 13.1).
Vulkan — Matrix Multiply (Intel iGPU vs NVIDIA RTX 5070 Ti)
| Size | Intel iGPU (ARL) | NVIDIA RTX 5070 Ti |
|---|---|---|
| 64×64 | 3.53 GFLOPS | 1.08 GFLOPS |
| 128×128 | 12.8 GFLOPS | 5.56 GFLOPS |
| 256×256 | 30.9 GFLOPS | 20.3 GFLOPS |
| 512×512 | 63.1 GFLOPS | 50.0 GFLOPS |
| 1024×1024 | 124 GFLOPS | 121 GFLOPS |
These are end-to-end microbenchmarks — each iteration includes buffer allocation and host↔device copy. The discrete RTX pays PCIe transfer latency while the iGPU shares system memory, so for small/transfer-bound sizes the iGPU wins; only at 1024² does compute dominate and the two converge. These figures are not the RTX 5070 Ti's peak compute — the harness is transfer-bound by design.
Vulkan — Other compute kernels (largest size)
| Benchmark | Size | Intel iGPU | NVIDIA RTX 5070 Ti |
|---|---|---|---|
| Vec Dot | 4M | 971 MFLOPS | 734 MFLOPS |
| Reduce Sum | 4M | 860 MFLOPS | 646 MFLOPS |
| Conv1D | 262144 / 128 | 18.5 GFLOPS | 12.1 GFLOPS |
| Conv2D | 512 / 7 | 7.26 GFLOPS | 5.54 GFLOPS |
| Prefix Sum | 4096 | 2.47 GFLOPS | 2.47 GFLOPS |
CPU & Radar reference (NEON disabled on x86_64 → Eigen / scalar paths)
| Benchmark | Size | Throughput |
|---|---|---|
| Eigen GEMM | 512 | 52.7 GFLOPS |
| Eigen Dot | 4096 | 42.9 GFLOPS |
| Complex XCorr (scalar) | 1024 | 21.0 GFLOPS |
| Radar CAF | 65536 / 101 / 500 | 25.7 GFLOPS |
| Radar NLMS | 256K / 128 taps | 9.17 GFLOPS |
| Radar CFAR 2D | 512×1024 | 12.0 GFLOPS |
Test Results (x86_64, Core Ultra 9 275HX + RTX 5070 Ti — all 16/16 suites pass, 138 tests)
| Backend | Tests | Status |
|---|---|---|
| NEON (scalar fallback) | 68 | Pass |
| Vulkan | 5 | Pass |
| Radar | 19 | Pass |
| CUDA (RTX 5070 Ti) | 36 | Pass |
| Platform | 9 | Pass |
| Basic | 1 | Pass |
v0.5.13 - GEMM L3 Optimization, L2 Detection & SVE2 Prefetch (April 2026)
Bug Fixes:
- SVE2 GEMM buffer bounds clamping - Added
std::min()clamping of runtime MC/KC/NC parameters to static buffer maximums, preventing potential buffer overflow if platform detection returned values exceedingMAX_MC/MAX_KC/MAX_NC. NEON GEMM already had this safety check; SVE2 GEMM was missing it. - NEON GEMM docstring correction - Fixed inaccurate cache blocking parameter values in the file header comment (claimed MC=384-512 but code used MC=128-256).
Optimizations (CIX P1 / Orange Pi 6 Plus):
- GEMM NC doubled to 2048 for large L3 - B panel now occupies 4MB (33% of 12MB L3) vs prior 2MB (17%). Reduces main memory traffic by ~2x for large matrix multiplications on the CIX P1 CD8160.
- SVE2 GEMM microkernel prefetch - Added
svprfb(SV_PLDL1KEEP)prefetch hints to pipeline next iteration's A/B panels into L1 data cache, reducing stall cycles on Cortex-A720.
Platform Detection Enhancements:
- L2 cache size detection - New
get_l2_cache_size()API reads per-core L2 from sysfs (probes a performance core), with heuristic fallbacks: A720=512KB, A520=256KB, A76=512KB. CpuInfo::l2_cache_bytesfield - Cached L2 size available viadetect_cpu_info().- Updated test suite - Platform test validates L2 detection and new NC=2048 GEMM blocking.
Test Results (CIX P1 CD8160, aarch64, all 16/16 pass):
| Backend | Tests | Status |
|---|---|---|
| NEON | 7/7 | Pass |
| SVE2 | 1/1 | Pass |
| Vulkan | 4/4 | Pass |
| Radar | 2/2 | Pass |
| Platform | 1/1 | Pass |
| Basic | 1/1 | Pass |
v0.5.12 - CUDA 13 Full Upgrade & SM 103 Support (April 2026)
CUDA 13.0 Full Support:
- CUDA 13.0.88 verified - Complete build and test pass with CUDA 13 toolkit
- SM 103 architecture added - Support for RTX 5070/5060 (GB205/GB206 Blackwell variants)
- Updated architecture list - CUDA 13+ now builds: SM 75, 80, 86, 89, 90, 100, 103
- Improved Blackwell detection - CMake now detects both SM 100 and SM 103 for
OPTMATH_CUDA_BLACKWELL
Test Results (x86-64 with RTX 5090, CUDA 13.0.88):
| Backend | Tests | Status | Notes |
|---|---|---|---|
| CUDA | 36/36 | Pass | Full GPU acceleration with TF32 Tensor Cores |
| Vulkan | 4/4 | Pass | Full GPU acceleration |
| NEON/CPU | 11/11 | Pass | Eigen auto-vectorized for AVX2 |
Architecture Support (CUDA 13+):
| SM | Architecture | GPUs |
|---|---|---|
| 75 | Turing | RTX 2060/2070/2080 |
| 80/86 | Ampere | RTX 3060/3070/3080/3090, A100 |
| 89 | Ada Lovelace | RTX 4060/4070/4080/4090 |
| 90 | Hopper | H100 |
| 100 | Blackwell | RTX 5080/5090 (GB202/GB203) |
| 103 | Blackwell | RTX 5060/5070 (GB205/GB206) |
v0.5.11 - Comprehensive Kernel Documentation (April 2026)
Source-Level Documentation:
Every kernel source file now has a thorough header comment documenting all functional blocks, specific API/intrinsic usage, and algorithmic techniques. This makes the codebase self-documenting for contributors and users reading the source.
CUDA Kernels (3 files):
cuda_kernels.cu: 8 documented blocks — vector elementwise (float4 vectorized), transcendentals (CUDA__expf/__sinf/__cosf/__sincosf/__tanf/__powffast-math intrinsics), activation functions (sigmoid/tanh/ReLU/leaky-ReLU/GELU/softmax with shared-memory reduction), matrix ops (tiled transpose with +1 bank-conflict padding), cuBLAS wrappers (cublasSgemm/cublasSdot/cublasSnrm2/cublasSgemv), CUB parallel reductions (cub::DeviceReduce::Sum/Max/Min), Eigen host wrappers (error-checked cudaMalloc/cudaMemcpy with CPU fallback), cuSOLVER Cholesky (cusolverDnSpotrf/cusolverDnDpotrf/cusolverDnSpotrswith architecture-aware GPU/CPU thresholds)cuda_complex.cu: 7 documented blocks — split-format complex arithmetic, complex analysis kernels, warp-level complex dot product (__shfl_down_sync+atomicAdd), format conversion (interleave/deinterleave), convolution (1D naive/shared-memory template, 2D), cuFFT wrappers (cufftPlan1d/cufftPlanMany/cufftPlan2dwithCUFFT_C2C), Eigen complex wrapperscuda_radar.cu: 7 documented blocks — window function generators, GPU-resident CAF pipeline (Doppler shift via__sincosf+ FFT + conj-multiply + IFFT + magnitude extraction), CFAR detection (1D CA-CFAR + 2D with guard cells), Doppler windowing, Bartlett beamformer (ULA steering vectors + shared-memory spectrum reduction), NLMS adaptive filter (CPU, sequential weight updates), projection clutter cancellation
NEON Kernels (8 files):
neon_kernels.cpp: Vector ops with 4-accumulator FMA pipeline utilization, vectorized transcendentals (6th-order minimax polynomials with IEEE754 bit manipulation for exponent reconstruction)neon_complex.cpp: Split and interleaved (vld2q_f32/vst2q_f32) complex arithmetic, Newton-Raphson rsqrt magnitude (vrsqrteq_f32/vrsqrtsq_f32)neon_gemm_optimized.cpp: 3-level Goto-style cache-blocked GEMM, 8x8 microkernel withvmlaq_laneq_f32rank-1 updates, runtime A76/A720 cache parameter tuningneon_radar.cpp: Window functions (including Kaiser with Bessel I0), CAF, CFAR (CA/2D-SAT/OS), NLMS, projection clutter, DFT, MTI, delay-sum and phase-shift beamformingneon_conv2d.cpp: General, separable, fully-unrolled 3x3 and 5x5 convolutionneon_iir.cpp: Biquad Direct Form II Transposed, cascade, Bristow-Johnson filter designneon_linalg.cpp: TRSV/TRSM, Cholesky (A=L*L^T), LU (partial pivot), QR (Householder), solvers, matrix inverseneon_resample.cpp: Polyphase L:M rational rate conversion with streaming delay line
SVE2 Kernels (4 files):
sve2_kernels.cpp: Predicated vector ops (svwhilelt_b32loops), transcendentals (predicated Horner/Chebyshev polynomials), 8x8 GEMM microkernel (MC=256/KC=512/NC=2048 for A720 12MB L3), I8MM int8 GEMM (svmmla_s32)sve2_complex.cpp: FCMA-accelerated complex multiply (svcmla_f32_zrotations 0/90/270 for 2-instruction complex multiply), non-FCMA fallback withsvtbl_f32deinterleaving, nativesvsqrt_f32_zmagnitudesve2_radar.cpp: CAF with predicated complex MAC (svmla_f32_mmerging semantics), cross-correlation, phase-shift beamformingsve2_detect.cpp: Runtime SVE2 detection viagetauxval(AT_HWCAP2) & HWCAP2_SVE2
README Updated:
- File Structure section now includes per-file descriptions of all functional blocks and key APIs
- Each source file entry documents the specific intrinsics, algorithms, and CUDA/cuBLAS/cuFFT/cuSOLVER calls used
v0.5.10 - TF32 Tolerance Fix & Verified CUDA 13 + cuSolver Cholesky (April 2026)
CUDA 13 Verified:
- Full test pass on RTX 5090 (Blackwell) - All 36 CUDA tests pass including 7 Cholesky tests
- cuSolver Cholesky decomposition - Verified working with CUDA 13's cusolverDnSpotrf/Dpotrf
- SM 75 (Turing) fallback - Architecture-aware thresholds work correctly from SM 75-100
TF32 Precision Handling:
- MatrixGEMM test tolerance adjusted for TF32 (TensorFloat-32) precision on Ampere+ GPUs
- TF32 uses 19-bit mantissa vs FP32's 24-bit, allowing ~0.4% relative error
- Tolerance updated to 1% relative + 5e-3 absolute to account for accumulated rounding in GEMM
Test Results (x86-64 with RTX 5090, CUDA 13.0):
| Backend | Tests | Status | Notes |
|---|---|---|---|
| CUDA | 36/36 | Pass | Full GPU acceleration with TF32 Tensor Cores |
| Vulkan | 4/4 | Pass | Full GPU acceleration |
| NEON/CPU | 11/11 | Pass | Eigen auto-vectorized for AVX2 |
v0.5.9 - CUDA 13 Support & Multi-Architecture Compatibility (April 2026)
CUDA 13 Support:
- Full CUDA 13.0 compatibility - API changes for
cudaMemPrefetchAsyncandcudaDeviceProp - Architecture auto-detection - CMake automatically selects appropriate architectures based on CUDA toolkit version:
- CUDA 13+: SM 75, 80, 86, 89, 90, 100, 103 (Turing through Blackwell)
- CUDA 12.x: SM 50-89 (Maxwell through Ada)
- CUDA 11.x: SM 50-86 (Maxwell through Ampere)
- Native build option - Use
-DOPTMATH_CUDA_NATIVE=ONfor faster compilation targeting only the local GPU - Blackwell (SM 100) optimizations - Full native support with CUDA 13
Architecture Compile Definitions:
New compile definitions for architecture-specific code paths:
OPTMATH_CUDA_13_PLUS- CUDA 13+ detectedOPTMATH_CUDA_BLACKWELL- SM 100 enabledOPTMATH_CUDA_HOPPER- SM 90 enabledOPTMATH_CUDA_ADA- SM 89 enabledOPTMATH_CUDA_AMPERE- SM 80/86 enabledOPTMATH_CUDA_TURING- SM 75 enabled
cuSolver Cholesky Improvements:
- Enhanced fallback chain with architecture-aware thresholds
- Small matrix optimization (CPU path for matrices < 64-128 depending on architecture)
- Improved DeviceInfo API with architecture detection helpers
Note: CUDA 13 dropped support for SM < 75. For Maxwell/Pascal/Volta GPUs, use CUDA 12.x.
v0.5.8 - Blackwell (RTX 50xx) Support & Documentation (April 2026)
NVIDIA Blackwell Support:
- RTX 5090 Tested: Full build and test run on NVIDIA GeForce RTX 5090 (SM 10.0, Blackwell architecture)
- CUDA 12.8+ Requirement: Documented that Blackwell SM 100 requires CUDA 12.8+ for native support
- Vulkan Fallback: Vulkan 1.3 backend provides full GPU acceleration on Blackwell when CUDA toolkit is older
- Architecture Table Updated: Added RTX 5090 specs (21760 CUDA cores, 680 Tensor Cores, 32GB VRAM)
Build System:
- Multi-Architecture Support: Build instructions updated for RTX 20xx/30xx/40xx (SM 75/86/89) and RTX 50xx (SM 100)
- Native Optimization: Added
-march=native -mtune=nativeflags for x86-64 builds - Local Install: Support for
~/.localprefix installation without root access
Documentation:
- CUDA Toolkit Requirements Table: Clear mapping of GPU generations to minimum CUDA versions
- Troubleshooting Expanded: Added Blackwell-specific error messages and solutions
- GPU Architecture Detection: Added
nvidia-smicommands to identify compute capability
Test Results (x86-64 with RTX 5090, CUDA 12.0, Vulkan 1.3):
| Backend | Tests | Status | Notes |
|---|---|---|---|
| Vulkan | 4/4 | ✅ Pass | Full GPU acceleration |
| NEON/CPU | 11/11 | ✅ Pass | Eigen auto-vectorized for AVX2 |
| CUDA | 0/1 | Requires CUDA 12.8+ for Blackwell |
Known Limitation: CUDA tests fail on RTX 50xx with CUDA 12.0 due to PTX forward compatibility issues. Install CUDA 12.8+ from NVIDIA for native Blackwell support, or use the Vulkan backend.
v0.5.7 - SVE2 & Radar Pipeline Optimization for Orange Pi 6 Plus (March 2026)
SVE2 Transcendentals — Eliminate Heap Allocations (sve2_kernels.cpp):
sve2_fast_cos_f32: Inlined sin polynomial with pi/2 offset in a single SVE2 predicated pass. Previously allocated astd::vector<float>temp buffer and made 2 passes over the data.sve2_fast_sigmoid_f32: Fused single-pass with inline exp(-x) computation. Previously allocated 2std::vectorbuffers and made 3 passes (clamp+negate, exp, divide).sve2_fast_tanh_f32: Fused single-pass with inline exp(-2x) + sigmoid. Previously allocated 2std::vectorbuffers and made 3 passes.- Impact: Eliminates 1-2 heap allocations and 1-2 extra data passes per call in hot paths.
SVE2 GEMM Microkernel — Vectorize with SVE2 FMA (sve2_kernels.cpp):
micro_kernel_8x8_sve2: Replaced scalarfloat acc[8][8]loop with SVE2 column-oriented accumulators usingsvmla_n_f32_zfor rank-1 broadcast FMA, matching the NEON kernel's register-blocked design.- Uses lo/hi vector pairs for 8-row columns on 128-bit SVE2 (CIX P1
svcntw()=4). - Vectorized load-add-store for C matrix writeback.
CAF Doppler Shift — Vectorize Trig in Hot Loop (neon_radar.cpp, sve2_radar.cpp):
- Replaced per-sample
std::cos/std::sincalls (the CAF bottleneck) with batchneon_fast_cos/sin_f32andsve2_fast_cos/sin_f32. - Doppler phase rotation now uses vectorized
neon_complex_mul_f32/sve2_complex_mul_f32instead of scalar multiply. - Estimated ~10x speedup for the Doppler shift phase of CAF computation.
SVE2 Complex Exponential — Vectorize (sve2_complex.cpp):
sve2_complex_exp_f32: Replaced scalarstd::cos/std::sinloop withsve2_fast_cos/sin_f32for full SVE2 vectorization.
Architecture Safety: All changes guarded by #ifdef OPTMATH_USE_SVE2 / #ifdef OPTMATH_USE_NEON with scalar fallbacks intact. No changes to headers, APIs, or non-ARM code paths.
All 16 test suites pass (100%) on Orange Pi 6 Plus.