Releases · n4hy/OptimizedKernelsForRaspberryPi5_NvidiaCUDA

25 May 14:07

n4hy

v0.5.16

2401832

v0.5.16 — x86_64 Desktop RTX 5090 Benchmarks Latest

Latest

Documentation release adding measured benchmark results for a desktop Blackwell build.

New: Intel Core Ultra 9 285K + RTX 5090 benchmark section

Measured on a desktop NVIDIA GeForce RTX 5090 (Blackwell, SM 10.0, 32 GB; CUDA 13.0.88, driver 580.126.20, Vulkan 1.3.275), Release / C++20:

Vulkan matrix-multiply scaling from 1.28 GFLOPS (64²) to 114 GFLOPS (1024²)
Full Vulkan compute-kernel set — vec add/mul/dot, reduce-sum, conv1D (15.6 GFLOPS), conv2D, prefix sum, matrix transpose
CPU reference (Eigen / scalar) — Eigen GEMM ~57 GFLOPS, scalar & complex cross-correlation, std::exp/sin
Radar scalar paths — CAF ~26.5 GFLOPS, CFAR-1D/2D, NLMS, MTI, delay-sum beamform, steering vector
16/16 test suites pass in 14.2 s, including the 36-test CUDA suite on SM 10.0

Complements the existing 275HX / RTX 5070 Ti laptop section. The Vulkan microbenchmarks are end-to-end (per-iteration allocation + host↔device copy), so they are transfer-bound and not a measure of the RTX 5090's peak compute.

No code behavior changes — documentation only.

Full diff: v0.5.15...v0.5.16

Assets 2

25 May 14:09

n4hy

v0.5.15

12e0452

v0.5.15 — Complete Source Documentation Coverage

Documentation-only release: every source file in src/ now carries a comprehensive header doc block.

Documentation

Header doc blocks for the last two source files — src/vulkan/vulkan_backend.cpp (12 functional blocks: shader loading, buffers, context lifecycle, pipeline cache, dispatch, vector/matrix/DSP/reduction/scan/FFT ops, Mali-G720 specialization, CPU fallback) and src/platform/platform.cpp (7 blocks: sysfs helpers, CPU/feature/cache detection, cached accessor, core topology, thread affinity, feature/cache accessors, GEMM blocking selection).
Maximal inline documentation of Vulkan device selection — enumeration ordering, type scoring, compute-queue eligibility, and the VK_DRIVER_FILES override recipe.
README release link — added release badge and "Latest release" callout.

No code behavior changes; documentation only. All 16/16 test suites continue to pass.

Full diff: v0.5.14...v0.5.15

Assets 2

25 May 09:00

n4hy

v0.5.14

5017593

v0.5.14 — Discrete GPU preference & x86_64 dual-GPU benchmarks

Vulkan Backend

Discrete-GPU-preferred device selection — The backend previously picked devices[0] (first enumerated device), which on multi-GPU machines (e.g. a laptop with an integrated GPU alongside a discrete card) could default to the slower integrated GPU. It now ranks physical devices by type (discrete > integrated > virtual > CPU) and only considers devices exposing a compute queue, so the queue-family search cannot fail on the chosen device. The selected GPU is logged as [Vulkan] Selected GPU: ....

Documentation

x86_64 workstation benchmark section — Added results for Intel Core Ultra 9 275HX + RTX 5070 Ti, showing Intel iGPU vs NVIDIA RTX 5070 Ti Vulkan numbers side-by-side, CPU/Eigen and radar reference figures, and the 16/16 test-suite (138 tests) breakdown including the CUDA suite. ARM / Orange Pi 6 Plus / Raspberry Pi 5 content is unchanged.
Comprehensive Vulkan backend documentation — src/vulkan/vulkan_backend.cpp now carries the same comprehensive header-doc standard introduced in v0.5.11 for the other kernel files: a file-header block covering all 12 functional blocks (shader loading, buffers, context lifecycle, pipeline cache, dispatch, vector/matrix/DSP/reduction/scan/FFT ops, Mali-G720 specialization, CPU fallback) plus maximal inline documentation of the discrete-GPU device-selection logic (enumeration ordering, type scoring, compute-queue eligibility, and the VK_DRIVER_FILES override recipe).

Benchmarks (x86_64: Core Ultra 9 275HX + RTX 5070 Ti)

Built with -DENABLE_NEON=OFF -DENABLE_SVE2=OFF -DENABLE_CUDA=ON -DENABLE_VULKAN=ON, Release. Two Vulkan GPUs available: integrated Intel Graphics (Arrow Lake) and discrete NVIDIA RTX 5070 Ti (Blackwell, CUDA 13.1).

Vulkan — Matrix Multiply (Intel iGPU vs NVIDIA RTX 5070 Ti)

Size	Intel iGPU (ARL)	NVIDIA RTX 5070 Ti
64×64	3.53 GFLOPS	1.08 GFLOPS
128×128	12.8 GFLOPS	5.56 GFLOPS
256×256	30.9 GFLOPS	20.3 GFLOPS
512×512	63.1 GFLOPS	50.0 GFLOPS
1024×1024	124 GFLOPS	121 GFLOPS

These are end-to-end microbenchmarks — each iteration includes buffer allocation and host↔device copy. The discrete RTX pays PCIe transfer latency while the iGPU shares system memory, so for small/transfer-bound sizes the iGPU wins; only at 1024² does compute dominate and the two converge. These figures are not the RTX 5070 Ti's peak compute — the harness is transfer-bound by design.

Vulkan — Other compute kernels (largest size)

Benchmark	Size	Intel iGPU	NVIDIA RTX 5070 Ti
Vec Dot	4M	971 MFLOPS	734 MFLOPS
Reduce Sum	4M	860 MFLOPS	646 MFLOPS
Conv1D	262144 / 128	18.5 GFLOPS	12.1 GFLOPS
Conv2D	512 / 7	7.26 GFLOPS	5.54 GFLOPS
Prefix Sum	4096	2.47 GFLOPS	2.47 GFLOPS

CPU & Radar reference (NEON disabled on x86_64 → Eigen / scalar paths)

Benchmark	Size	Throughput
Eigen GEMM	512	52.7 GFLOPS
Eigen Dot	4096	42.9 GFLOPS
Complex XCorr (scalar)	1024	21.0 GFLOPS
Radar CAF	65536 / 101 / 500	25.7 GFLOPS
Radar NLMS	256K / 128 taps	9.17 GFLOPS
Radar CFAR 2D	512×1024	12.0 GFLOPS

Test Results (x86_64, Core Ultra 9 275HX + RTX 5070 Ti — all 16/16 suites pass, 138 tests)

Backend	Tests	Status
NEON (scalar fallback)	68	Pass
Vulkan	5	Pass
Radar	19	Pass
CUDA (RTX 5070 Ti)	36	Pass
Platform	9	Pass
Basic	1	Pass

Assets 2

25 May 09:02

n4hy

v0.5.13

3186815

v0.5.13 - GEMM L3 Optimization, L2 Detection & SVE2 Prefetch (April 2026)

Bug Fixes:

SVE2 GEMM buffer bounds clamping - Added std::min() clamping of runtime MC/KC/NC parameters to static buffer maximums, preventing potential buffer overflow if platform detection returned values exceeding MAX_MC/MAX_KC/MAX_NC. NEON GEMM already had this safety check; SVE2 GEMM was missing it.
NEON GEMM docstring correction - Fixed inaccurate cache blocking parameter values in the file header comment (claimed MC=384-512 but code used MC=128-256).

Optimizations (CIX P1 / Orange Pi 6 Plus):

GEMM NC doubled to 2048 for large L3 - B panel now occupies 4MB (33% of 12MB L3) vs prior 2MB (17%). Reduces main memory traffic by ~2x for large matrix multiplications on the CIX P1 CD8160.
SVE2 GEMM microkernel prefetch - Added svprfb(SV_PLDL1KEEP) prefetch hints to pipeline next iteration's A/B panels into L1 data cache, reducing stall cycles on Cortex-A720.

Platform Detection Enhancements:

L2 cache size detection - New get_l2_cache_size() API reads per-core L2 from sysfs (probes a performance core), with heuristic fallbacks: A720=512KB, A520=256KB, A76=512KB.
CpuInfo::l2_cache_bytes field - Cached L2 size available via detect_cpu_info().
Updated test suite - Platform test validates L2 detection and new NC=2048 GEMM blocking.

Test Results (CIX P1 CD8160, aarch64, all 16/16 pass):

Backend	Tests	Status
NEON	7/7	Pass
SVE2	1/1	Pass
Vulkan	4/4	Pass
Radar	2/2	Pass
Platform	1/1	Pass
Basic	1/1	Pass

Assets 2

25 May 09:02

n4hy

v0.5.12

541dbb9

v0.5.12 - CUDA 13 Full Upgrade & SM 103 Support (April 2026)

CUDA 13.0 Full Support:

CUDA 13.0.88 verified - Complete build and test pass with CUDA 13 toolkit
SM 103 architecture added - Support for RTX 5070/5060 (GB205/GB206 Blackwell variants)
Updated architecture list - CUDA 13+ now builds: SM 75, 80, 86, 89, 90, 100, 103
Improved Blackwell detection - CMake now detects both SM 100 and SM 103 for OPTMATH_CUDA_BLACKWELL

Test Results (x86-64 with RTX 5090, CUDA 13.0.88):

Backend	Tests	Status	Notes
CUDA	36/36	Pass	Full GPU acceleration with TF32 Tensor Cores
Vulkan	4/4	Pass	Full GPU acceleration
NEON/CPU	11/11	Pass	Eigen auto-vectorized for AVX2

Architecture Support (CUDA 13+):

SM	Architecture	GPUs
75	Turing	RTX 2060/2070/2080
80/86	Ampere	RTX 3060/3070/3080/3090, A100
89	Ada Lovelace	RTX 4060/4070/4080/4090
90	Hopper	H100
100	Blackwell	RTX 5080/5090 (GB202/GB203)
103	Blackwell	RTX 5060/5070 (GB205/GB206)

Assets 2

25 May 09:02

n4hy

v0.5.11

b819d91

v0.5.11 - Comprehensive Kernel Documentation (April 2026)

Source-Level Documentation:

Every kernel source file now has a thorough header comment documenting all functional blocks, specific API/intrinsic usage, and algorithmic techniques. This makes the codebase self-documenting for contributors and users reading the source.

CUDA Kernels (3 files):

cuda_kernels.cu: 8 documented blocks — vector elementwise (float4 vectorized), transcendentals (CUDA __expf/__sinf/__cosf/__sincosf/__tanf/__powf fast-math intrinsics), activation functions (sigmoid/tanh/ReLU/leaky-ReLU/GELU/softmax with shared-memory reduction), matrix ops (tiled transpose with +1 bank-conflict padding), cuBLAS wrappers (cublasSgemm/cublasSdot/cublasSnrm2/cublasSgemv), CUB parallel reductions (cub::DeviceReduce::Sum/Max/Min), Eigen host wrappers (error-checked cudaMalloc/cudaMemcpy with CPU fallback), cuSOLVER Cholesky (cusolverDnSpotrf/cusolverDnDpotrf/cusolverDnSpotrs with architecture-aware GPU/CPU thresholds)
cuda_complex.cu: 7 documented blocks — split-format complex arithmetic, complex analysis kernels, warp-level complex dot product (__shfl_down_sync + atomicAdd), format conversion (interleave/deinterleave), convolution (1D naive/shared-memory template, 2D), cuFFT wrappers (cufftPlan1d/cufftPlanMany/cufftPlan2d with CUFFT_C2C), Eigen complex wrappers
cuda_radar.cu: 7 documented blocks — window function generators, GPU-resident CAF pipeline (Doppler shift via __sincosf + FFT + conj-multiply + IFFT + magnitude extraction), CFAR detection (1D CA-CFAR + 2D with guard cells), Doppler windowing, Bartlett beamformer (ULA steering vectors + shared-memory spectrum reduction), NLMS adaptive filter (CPU, sequential weight updates), projection clutter cancellation

NEON Kernels (8 files):

neon_kernels.cpp: Vector ops with 4-accumulator FMA pipeline utilization, vectorized transcendentals (6th-order minimax polynomials with IEEE754 bit manipulation for exponent reconstruction)
neon_complex.cpp: Split and interleaved (vld2q_f32/vst2q_f32) complex arithmetic, Newton-Raphson rsqrt magnitude (vrsqrteq_f32/vrsqrtsq_f32)
neon_gemm_optimized.cpp: 3-level Goto-style cache-blocked GEMM, 8x8 microkernel with vmlaq_laneq_f32 rank-1 updates, runtime A76/A720 cache parameter tuning
neon_radar.cpp: Window functions (including Kaiser with Bessel I0), CAF, CFAR (CA/2D-SAT/OS), NLMS, projection clutter, DFT, MTI, delay-sum and phase-shift beamforming
neon_conv2d.cpp: General, separable, fully-unrolled 3x3 and 5x5 convolution
neon_iir.cpp: Biquad Direct Form II Transposed, cascade, Bristow-Johnson filter design
neon_linalg.cpp: TRSV/TRSM, Cholesky (A=L*L^T), LU (partial pivot), QR (Householder), solvers, matrix inverse
neon_resample.cpp: Polyphase L:M rational rate conversion with streaming delay line

SVE2 Kernels (4 files):

sve2_kernels.cpp: Predicated vector ops (svwhilelt_b32 loops), transcendentals (predicated Horner/Chebyshev polynomials), 8x8 GEMM microkernel (MC=256/KC=512/NC=2048 for A720 12MB L3), I8MM int8 GEMM (svmmla_s32)
sve2_complex.cpp: FCMA-accelerated complex multiply (svcmla_f32_z rotations 0/90/270 for 2-instruction complex multiply), non-FCMA fallback with svtbl_f32 deinterleaving, native svsqrt_f32_z magnitude
sve2_radar.cpp: CAF with predicated complex MAC (svmla_f32_m merging semantics), cross-correlation, phase-shift beamforming
sve2_detect.cpp: Runtime SVE2 detection via getauxval(AT_HWCAP2) & HWCAP2_SVE2

README Updated:

File Structure section now includes per-file descriptions of all functional blocks and key APIs
Each source file entry documents the specific intrinsics, algorithms, and CUDA/cuBLAS/cuFFT/cuSOLVER calls used

Assets 2

25 May 09:02

n4hy

v0.5.10

4214208

v0.5.10 - TF32 Tolerance Fix & Verified CUDA 13 + cuSolver Cholesky (April 2026)

CUDA 13 Verified:

Full test pass on RTX 5090 (Blackwell) - All 36 CUDA tests pass including 7 Cholesky tests
cuSolver Cholesky decomposition - Verified working with CUDA 13's cusolverDnSpotrf/Dpotrf
SM 75 (Turing) fallback - Architecture-aware thresholds work correctly from SM 75-100

TF32 Precision Handling:

MatrixGEMM test tolerance adjusted for TF32 (TensorFloat-32) precision on Ampere+ GPUs
TF32 uses 19-bit mantissa vs FP32's 24-bit, allowing ~0.4% relative error
Tolerance updated to 1% relative + 5e-3 absolute to account for accumulated rounding in GEMM

Test Results (x86-64 with RTX 5090, CUDA 13.0):

Backend	Tests	Status	Notes
CUDA	36/36	Pass	Full GPU acceleration with TF32 Tensor Cores
Vulkan	4/4	Pass	Full GPU acceleration
NEON/CPU	11/11	Pass	Eigen auto-vectorized for AVX2

Assets 2

25 May 09:02

n4hy

v0.5.9

cb90c28

v0.5.9 - CUDA 13 Support & Multi-Architecture Compatibility (April 2026)

CUDA 13 Support:

Full CUDA 13.0 compatibility - API changes for cudaMemPrefetchAsync and cudaDeviceProp
Architecture auto-detection - CMake automatically selects appropriate architectures based on CUDA toolkit version:
- CUDA 13+: SM 75, 80, 86, 89, 90, 100, 103 (Turing through Blackwell)
- CUDA 12.x: SM 50-89 (Maxwell through Ada)
- CUDA 11.x: SM 50-86 (Maxwell through Ampere)
Native build option - Use -DOPTMATH_CUDA_NATIVE=ON for faster compilation targeting only the local GPU
Blackwell (SM 100) optimizations - Full native support with CUDA 13

Architecture Compile Definitions:

New compile definitions for architecture-specific code paths:

OPTMATH_CUDA_13_PLUS - CUDA 13+ detected
OPTMATH_CUDA_BLACKWELL - SM 100 enabled
OPTMATH_CUDA_HOPPER - SM 90 enabled
OPTMATH_CUDA_ADA - SM 89 enabled
OPTMATH_CUDA_AMPERE - SM 80/86 enabled
OPTMATH_CUDA_TURING - SM 75 enabled

cuSolver Cholesky Improvements:

Enhanced fallback chain with architecture-aware thresholds
Small matrix optimization (CPU path for matrices < 64-128 depending on architecture)
Improved DeviceInfo API with architecture detection helpers

Note: CUDA 13 dropped support for SM < 75. For Maxwell/Pascal/Volta GPUs, use CUDA 12.x.

Assets 2

25 May 09:02

n4hy

v0.5.8

51fbe08

v0.5.8 - Blackwell (RTX 50xx) Support & Documentation (April 2026)

NVIDIA Blackwell Support:

RTX 5090 Tested: Full build and test run on NVIDIA GeForce RTX 5090 (SM 10.0, Blackwell architecture)
CUDA 12.8+ Requirement: Documented that Blackwell SM 100 requires CUDA 12.8+ for native support
Vulkan Fallback: Vulkan 1.3 backend provides full GPU acceleration on Blackwell when CUDA toolkit is older
Architecture Table Updated: Added RTX 5090 specs (21760 CUDA cores, 680 Tensor Cores, 32GB VRAM)

Build System:

Multi-Architecture Support: Build instructions updated for RTX 20xx/30xx/40xx (SM 75/86/89) and RTX 50xx (SM 100)
Native Optimization: Added -march=native -mtune=native flags for x86-64 builds
Local Install: Support for ~/.local prefix installation without root access

Documentation:

CUDA Toolkit Requirements Table: Clear mapping of GPU generations to minimum CUDA versions
Troubleshooting Expanded: Added Blackwell-specific error messages and solutions
GPU Architecture Detection: Added nvidia-smi commands to identify compute capability

Test Results (x86-64 with RTX 5090, CUDA 12.0, Vulkan 1.3):

Backend	Tests	Status	Notes
Vulkan	4/4	✅ Pass	Full GPU acceleration
NEON/CPU	11/11	✅ Pass	Eigen auto-vectorized for AVX2
CUDA	0/1	⚠️ Skip	Requires CUDA 12.8+ for Blackwell

Known Limitation: CUDA tests fail on RTX 50xx with CUDA 12.0 due to PTX forward compatibility issues. Install CUDA 12.8+ from NVIDIA for native Blackwell support, or use the Vulkan backend.

Assets 2

25 May 09:02

n4hy

v0.5.7

748415a

v0.5.7 - SVE2 & Radar Pipeline Optimization for Orange Pi 6 Plus (March 2026)

SVE2 Transcendentals — Eliminate Heap Allocations (sve2_kernels.cpp):

sve2_fast_cos_f32: Inlined sin polynomial with pi/2 offset in a single SVE2 predicated pass. Previously allocated a std::vector<float> temp buffer and made 2 passes over the data.
sve2_fast_sigmoid_f32: Fused single-pass with inline exp(-x) computation. Previously allocated 2 std::vector buffers and made 3 passes (clamp+negate, exp, divide).
sve2_fast_tanh_f32: Fused single-pass with inline exp(-2x) + sigmoid. Previously allocated 2 std::vector buffers and made 3 passes.
Impact: Eliminates 1-2 heap allocations and 1-2 extra data passes per call in hot paths.

SVE2 GEMM Microkernel — Vectorize with SVE2 FMA (sve2_kernels.cpp):

micro_kernel_8x8_sve2: Replaced scalar float acc[8][8] loop with SVE2 column-oriented accumulators using svmla_n_f32_z for rank-1 broadcast FMA, matching the NEON kernel's register-blocked design.
Uses lo/hi vector pairs for 8-row columns on 128-bit SVE2 (CIX P1 svcntw()=4).
Vectorized load-add-store for C matrix writeback.

CAF Doppler Shift — Vectorize Trig in Hot Loop (neon_radar.cpp, sve2_radar.cpp):

Replaced per-sample std::cos/std::sin calls (the CAF bottleneck) with batch neon_fast_cos/sin_f32 and sve2_fast_cos/sin_f32.
Doppler phase rotation now uses vectorized neon_complex_mul_f32 / sve2_complex_mul_f32 instead of scalar multiply.
Estimated ~10x speedup for the Doppler shift phase of CAF computation.

SVE2 Complex Exponential — Vectorize (sve2_complex.cpp):

sve2_complex_exp_f32: Replaced scalar std::cos/std::sin loop with sve2_fast_cos/sin_f32 for full SVE2 vectorization.

Architecture Safety: All changes guarded by #ifdef OPTMATH_USE_SVE2 / #ifdef OPTMATH_USE_NEON with scalar fallbacks intact. No changes to headers, APIs, or non-ARM code paths.

All 16 test suites pass (100%) on Orange Pi 6 Plus.

Assets 2

Releases: n4hy/OptimizedKernelsForRaspberryPi5_NvidiaCUDA

v0.5.16 — x86_64 Desktop RTX 5090 Benchmarks

New: Intel Core Ultra 9 285K + RTX 5090 benchmark section

Uh oh!

v0.5.15 — Complete Source Documentation Coverage

Documentation

Uh oh!

v0.5.14 — Discrete GPU preference & x86_64 dual-GPU benchmarks

Vulkan Backend

Documentation

Benchmarks (x86_64: Core Ultra 9 275HX + RTX 5070 Ti)

Vulkan — Matrix Multiply (Intel iGPU vs NVIDIA RTX 5070 Ti)

Vulkan — Other compute kernels (largest size)

CPU & Radar reference (NEON disabled on x86_64 → Eigen / scalar paths)

Test Results (x86_64, Core Ultra 9 275HX + RTX 5070 Ti — all 16/16 suites pass, 138 tests)

Uh oh!

v0.5.13 - GEMM L3 Optimization, L2 Detection & SVE2 Prefetch (April 2026)

Uh oh!

v0.5.12 - CUDA 13 Full Upgrade & SM 103 Support (April 2026)

Uh oh!

v0.5.11 - Comprehensive Kernel Documentation (April 2026)

Uh oh!

v0.5.10 - TF32 Tolerance Fix & Verified CUDA 13 + cuSolver Cholesky (April 2026)

Uh oh!

v0.5.9 - CUDA 13 Support & Multi-Architecture Compatibility (April 2026)

Uh oh!

v0.5.8 - Blackwell (RTX 50xx) Support & Documentation (April 2026)

Uh oh!

v0.5.7 - SVE2 & Radar Pipeline Optimization for Orange Pi 6 Plus (March 2026)

Uh oh!