Skip to content

Releases: n4hy/OptimizedKernelsForRaspberryPi5_NvidiaCUDA

v0.5.16 — x86_64 Desktop RTX 5090 Benchmarks

25 May 14:07

Choose a tag to compare

Documentation release adding measured benchmark results for a desktop Blackwell build.

New: Intel Core Ultra 9 285K + RTX 5090 benchmark section

Measured on a desktop NVIDIA GeForce RTX 5090 (Blackwell, SM 10.0, 32 GB; CUDA 13.0.88, driver 580.126.20, Vulkan 1.3.275), Release / C++20:

  • Vulkan matrix-multiply scaling from 1.28 GFLOPS (64²) to 114 GFLOPS (1024²)
  • Full Vulkan compute-kernel set — vec add/mul/dot, reduce-sum, conv1D (15.6 GFLOPS), conv2D, prefix sum, matrix transpose
  • CPU reference (Eigen / scalar) — Eigen GEMM ~57 GFLOPS, scalar & complex cross-correlation, std::exp/sin
  • Radar scalar paths — CAF ~26.5 GFLOPS, CFAR-1D/2D, NLMS, MTI, delay-sum beamform, steering vector
  • 16/16 test suites pass in 14.2 s, including the 36-test CUDA suite on SM 10.0

Complements the existing 275HX / RTX 5070 Ti laptop section. The Vulkan microbenchmarks are end-to-end (per-iteration allocation + host↔device copy), so they are transfer-bound and not a measure of the RTX 5090's peak compute.

No code behavior changes — documentation only.

Full diff: v0.5.15...v0.5.16

v0.5.15 — Complete Source Documentation Coverage

25 May 14:09

Choose a tag to compare

Documentation-only release: every source file in src/ now carries a comprehensive header doc block.

Documentation

  • Header doc blocks for the last two source filessrc/vulkan/vulkan_backend.cpp (12 functional blocks: shader loading, buffers, context lifecycle, pipeline cache, dispatch, vector/matrix/DSP/reduction/scan/FFT ops, Mali-G720 specialization, CPU fallback) and src/platform/platform.cpp (7 blocks: sysfs helpers, CPU/feature/cache detection, cached accessor, core topology, thread affinity, feature/cache accessors, GEMM blocking selection).
  • Maximal inline documentation of Vulkan device selection — enumeration ordering, type scoring, compute-queue eligibility, and the VK_DRIVER_FILES override recipe.
  • README release link — added release badge and "Latest release" callout.

No code behavior changes; documentation only. All 16/16 test suites continue to pass.

Full diff: v0.5.14...v0.5.15

v0.5.14 — Discrete GPU preference & x86_64 dual-GPU benchmarks

25 May 09:00

Choose a tag to compare

Vulkan Backend

  • Discrete-GPU-preferred device selection — The backend previously picked devices[0] (first enumerated device), which on multi-GPU machines (e.g. a laptop with an integrated GPU alongside a discrete card) could default to the slower integrated GPU. It now ranks physical devices by type (discrete > integrated > virtual > CPU) and only considers devices exposing a compute queue, so the queue-family search cannot fail on the chosen device. The selected GPU is logged as [Vulkan] Selected GPU: ....

Documentation

  • x86_64 workstation benchmark section — Added results for Intel Core Ultra 9 275HX + RTX 5070 Ti, showing Intel iGPU vs NVIDIA RTX 5070 Ti Vulkan numbers side-by-side, CPU/Eigen and radar reference figures, and the 16/16 test-suite (138 tests) breakdown including the CUDA suite. ARM / Orange Pi 6 Plus / Raspberry Pi 5 content is unchanged.
  • Comprehensive Vulkan backend documentationsrc/vulkan/vulkan_backend.cpp now carries the same comprehensive header-doc standard introduced in v0.5.11 for the other kernel files: a file-header block covering all 12 functional blocks (shader loading, buffers, context lifecycle, pipeline cache, dispatch, vector/matrix/DSP/reduction/scan/FFT ops, Mali-G720 specialization, CPU fallback) plus maximal inline documentation of the discrete-GPU device-selection logic (enumeration ordering, type scoring, compute-queue eligibility, and the VK_DRIVER_FILES override recipe).

Benchmarks (x86_64: Core Ultra 9 275HX + RTX 5070 Ti)

Built with -DENABLE_NEON=OFF -DENABLE_SVE2=OFF -DENABLE_CUDA=ON -DENABLE_VULKAN=ON, Release. Two Vulkan GPUs available: integrated Intel Graphics (Arrow Lake) and discrete NVIDIA RTX 5070 Ti (Blackwell, CUDA 13.1).

Vulkan — Matrix Multiply (Intel iGPU vs NVIDIA RTX 5070 Ti)

Size Intel iGPU (ARL) NVIDIA RTX 5070 Ti
64×64 3.53 GFLOPS 1.08 GFLOPS
128×128 12.8 GFLOPS 5.56 GFLOPS
256×256 30.9 GFLOPS 20.3 GFLOPS
512×512 63.1 GFLOPS 50.0 GFLOPS
1024×1024 124 GFLOPS 121 GFLOPS

These are end-to-end microbenchmarks — each iteration includes buffer allocation and host↔device copy. The discrete RTX pays PCIe transfer latency while the iGPU shares system memory, so for small/transfer-bound sizes the iGPU wins; only at 1024² does compute dominate and the two converge. These figures are not the RTX 5070 Ti's peak compute — the harness is transfer-bound by design.

Vulkan — Other compute kernels (largest size)

Benchmark Size Intel iGPU NVIDIA RTX 5070 Ti
Vec Dot 4M 971 MFLOPS 734 MFLOPS
Reduce Sum 4M 860 MFLOPS 646 MFLOPS
Conv1D 262144 / 128 18.5 GFLOPS 12.1 GFLOPS
Conv2D 512 / 7 7.26 GFLOPS 5.54 GFLOPS
Prefix Sum 4096 2.47 GFLOPS 2.47 GFLOPS

CPU & Radar reference (NEON disabled on x86_64 → Eigen / scalar paths)

Benchmark Size Throughput
Eigen GEMM 512 52.7 GFLOPS
Eigen Dot 4096 42.9 GFLOPS
Complex XCorr (scalar) 1024 21.0 GFLOPS
Radar CAF 65536 / 101 / 500 25.7 GFLOPS
Radar NLMS 256K / 128 taps 9.17 GFLOPS
Radar CFAR 2D 512×1024 12.0 GFLOPS

Test Results (x86_64, Core Ultra 9 275HX + RTX 5070 Ti — all 16/16 suites pass, 138 tests)

Backend Tests Status
NEON (scalar fallback) 68 Pass
Vulkan 5 Pass
Radar 19 Pass
CUDA (RTX 5070 Ti) 36 Pass
Platform 9 Pass
Basic 1 Pass

v0.5.13 - GEMM L3 Optimization, L2 Detection & SVE2 Prefetch (April 2026)

25 May 09:02

Choose a tag to compare

Bug Fixes:

  • SVE2 GEMM buffer bounds clamping - Added std::min() clamping of runtime MC/KC/NC parameters to static buffer maximums, preventing potential buffer overflow if platform detection returned values exceeding MAX_MC/MAX_KC/MAX_NC. NEON GEMM already had this safety check; SVE2 GEMM was missing it.
  • NEON GEMM docstring correction - Fixed inaccurate cache blocking parameter values in the file header comment (claimed MC=384-512 but code used MC=128-256).

Optimizations (CIX P1 / Orange Pi 6 Plus):

  • GEMM NC doubled to 2048 for large L3 - B panel now occupies 4MB (33% of 12MB L3) vs prior 2MB (17%). Reduces main memory traffic by ~2x for large matrix multiplications on the CIX P1 CD8160.
  • SVE2 GEMM microkernel prefetch - Added svprfb(SV_PLDL1KEEP) prefetch hints to pipeline next iteration's A/B panels into L1 data cache, reducing stall cycles on Cortex-A720.

Platform Detection Enhancements:

  • L2 cache size detection - New get_l2_cache_size() API reads per-core L2 from sysfs (probes a performance core), with heuristic fallbacks: A720=512KB, A520=256KB, A76=512KB.
  • CpuInfo::l2_cache_bytes field - Cached L2 size available via detect_cpu_info().
  • Updated test suite - Platform test validates L2 detection and new NC=2048 GEMM blocking.

Test Results (CIX P1 CD8160, aarch64, all 16/16 pass):

Backend Tests Status
NEON 7/7 Pass
SVE2 1/1 Pass
Vulkan 4/4 Pass
Radar 2/2 Pass
Platform 1/1 Pass
Basic 1/1 Pass

v0.5.12 - CUDA 13 Full Upgrade & SM 103 Support (April 2026)

25 May 09:02

Choose a tag to compare

CUDA 13.0 Full Support:

  • CUDA 13.0.88 verified - Complete build and test pass with CUDA 13 toolkit
  • SM 103 architecture added - Support for RTX 5070/5060 (GB205/GB206 Blackwell variants)
  • Updated architecture list - CUDA 13+ now builds: SM 75, 80, 86, 89, 90, 100, 103
  • Improved Blackwell detection - CMake now detects both SM 100 and SM 103 for OPTMATH_CUDA_BLACKWELL

Test Results (x86-64 with RTX 5090, CUDA 13.0.88):

Backend Tests Status Notes
CUDA 36/36 Pass Full GPU acceleration with TF32 Tensor Cores
Vulkan 4/4 Pass Full GPU acceleration
NEON/CPU 11/11 Pass Eigen auto-vectorized for AVX2

Architecture Support (CUDA 13+):

SM Architecture GPUs
75 Turing RTX 2060/2070/2080
80/86 Ampere RTX 3060/3070/3080/3090, A100
89 Ada Lovelace RTX 4060/4070/4080/4090
90 Hopper H100
100 Blackwell RTX 5080/5090 (GB202/GB203)
103 Blackwell RTX 5060/5070 (GB205/GB206)

v0.5.11 - Comprehensive Kernel Documentation (April 2026)

25 May 09:02

Choose a tag to compare

Source-Level Documentation:

Every kernel source file now has a thorough header comment documenting all functional blocks, specific API/intrinsic usage, and algorithmic techniques. This makes the codebase self-documenting for contributors and users reading the source.

CUDA Kernels (3 files):

  • cuda_kernels.cu: 8 documented blocks — vector elementwise (float4 vectorized), transcendentals (CUDA __expf/__sinf/__cosf/__sincosf/__tanf/__powf fast-math intrinsics), activation functions (sigmoid/tanh/ReLU/leaky-ReLU/GELU/softmax with shared-memory reduction), matrix ops (tiled transpose with +1 bank-conflict padding), cuBLAS wrappers (cublasSgemm/cublasSdot/cublasSnrm2/cublasSgemv), CUB parallel reductions (cub::DeviceReduce::Sum/Max/Min), Eigen host wrappers (error-checked cudaMalloc/cudaMemcpy with CPU fallback), cuSOLVER Cholesky (cusolverDnSpotrf/cusolverDnDpotrf/cusolverDnSpotrs with architecture-aware GPU/CPU thresholds)
  • cuda_complex.cu: 7 documented blocks — split-format complex arithmetic, complex analysis kernels, warp-level complex dot product (__shfl_down_sync + atomicAdd), format conversion (interleave/deinterleave), convolution (1D naive/shared-memory template, 2D), cuFFT wrappers (cufftPlan1d/cufftPlanMany/cufftPlan2d with CUFFT_C2C), Eigen complex wrappers
  • cuda_radar.cu: 7 documented blocks — window function generators, GPU-resident CAF pipeline (Doppler shift via __sincosf + FFT + conj-multiply + IFFT + magnitude extraction), CFAR detection (1D CA-CFAR + 2D with guard cells), Doppler windowing, Bartlett beamformer (ULA steering vectors + shared-memory spectrum reduction), NLMS adaptive filter (CPU, sequential weight updates), projection clutter cancellation

NEON Kernels (8 files):

  • neon_kernels.cpp: Vector ops with 4-accumulator FMA pipeline utilization, vectorized transcendentals (6th-order minimax polynomials with IEEE754 bit manipulation for exponent reconstruction)
  • neon_complex.cpp: Split and interleaved (vld2q_f32/vst2q_f32) complex arithmetic, Newton-Raphson rsqrt magnitude (vrsqrteq_f32/vrsqrtsq_f32)
  • neon_gemm_optimized.cpp: 3-level Goto-style cache-blocked GEMM, 8x8 microkernel with vmlaq_laneq_f32 rank-1 updates, runtime A76/A720 cache parameter tuning
  • neon_radar.cpp: Window functions (including Kaiser with Bessel I0), CAF, CFAR (CA/2D-SAT/OS), NLMS, projection clutter, DFT, MTI, delay-sum and phase-shift beamforming
  • neon_conv2d.cpp: General, separable, fully-unrolled 3x3 and 5x5 convolution
  • neon_iir.cpp: Biquad Direct Form II Transposed, cascade, Bristow-Johnson filter design
  • neon_linalg.cpp: TRSV/TRSM, Cholesky (A=L*L^T), LU (partial pivot), QR (Householder), solvers, matrix inverse
  • neon_resample.cpp: Polyphase L:M rational rate conversion with streaming delay line

SVE2 Kernels (4 files):

  • sve2_kernels.cpp: Predicated vector ops (svwhilelt_b32 loops), transcendentals (predicated Horner/Chebyshev polynomials), 8x8 GEMM microkernel (MC=256/KC=512/NC=2048 for A720 12MB L3), I8MM int8 GEMM (svmmla_s32)
  • sve2_complex.cpp: FCMA-accelerated complex multiply (svcmla_f32_z rotations 0/90/270 for 2-instruction complex multiply), non-FCMA fallback with svtbl_f32 deinterleaving, native svsqrt_f32_z magnitude
  • sve2_radar.cpp: CAF with predicated complex MAC (svmla_f32_m merging semantics), cross-correlation, phase-shift beamforming
  • sve2_detect.cpp: Runtime SVE2 detection via getauxval(AT_HWCAP2) & HWCAP2_SVE2

README Updated:

  • File Structure section now includes per-file descriptions of all functional blocks and key APIs
  • Each source file entry documents the specific intrinsics, algorithms, and CUDA/cuBLAS/cuFFT/cuSOLVER calls used

v0.5.10 - TF32 Tolerance Fix & Verified CUDA 13 + cuSolver Cholesky (April 2026)

25 May 09:02

Choose a tag to compare

CUDA 13 Verified:

  • Full test pass on RTX 5090 (Blackwell) - All 36 CUDA tests pass including 7 Cholesky tests
  • cuSolver Cholesky decomposition - Verified working with CUDA 13's cusolverDnSpotrf/Dpotrf
  • SM 75 (Turing) fallback - Architecture-aware thresholds work correctly from SM 75-100

TF32 Precision Handling:

  • MatrixGEMM test tolerance adjusted for TF32 (TensorFloat-32) precision on Ampere+ GPUs
  • TF32 uses 19-bit mantissa vs FP32's 24-bit, allowing ~0.4% relative error
  • Tolerance updated to 1% relative + 5e-3 absolute to account for accumulated rounding in GEMM

Test Results (x86-64 with RTX 5090, CUDA 13.0):

Backend Tests Status Notes
CUDA 36/36 Pass Full GPU acceleration with TF32 Tensor Cores
Vulkan 4/4 Pass Full GPU acceleration
NEON/CPU 11/11 Pass Eigen auto-vectorized for AVX2

v0.5.9 - CUDA 13 Support & Multi-Architecture Compatibility (April 2026)

25 May 09:02

Choose a tag to compare

CUDA 13 Support:

  • Full CUDA 13.0 compatibility - API changes for cudaMemPrefetchAsync and cudaDeviceProp
  • Architecture auto-detection - CMake automatically selects appropriate architectures based on CUDA toolkit version:
    • CUDA 13+: SM 75, 80, 86, 89, 90, 100, 103 (Turing through Blackwell)
    • CUDA 12.x: SM 50-89 (Maxwell through Ada)
    • CUDA 11.x: SM 50-86 (Maxwell through Ampere)
  • Native build option - Use -DOPTMATH_CUDA_NATIVE=ON for faster compilation targeting only the local GPU
  • Blackwell (SM 100) optimizations - Full native support with CUDA 13

Architecture Compile Definitions:

New compile definitions for architecture-specific code paths:

  • OPTMATH_CUDA_13_PLUS - CUDA 13+ detected
  • OPTMATH_CUDA_BLACKWELL - SM 100 enabled
  • OPTMATH_CUDA_HOPPER - SM 90 enabled
  • OPTMATH_CUDA_ADA - SM 89 enabled
  • OPTMATH_CUDA_AMPERE - SM 80/86 enabled
  • OPTMATH_CUDA_TURING - SM 75 enabled

cuSolver Cholesky Improvements:

  • Enhanced fallback chain with architecture-aware thresholds
  • Small matrix optimization (CPU path for matrices < 64-128 depending on architecture)
  • Improved DeviceInfo API with architecture detection helpers

Note: CUDA 13 dropped support for SM < 75. For Maxwell/Pascal/Volta GPUs, use CUDA 12.x.

v0.5.8 - Blackwell (RTX 50xx) Support & Documentation (April 2026)

25 May 09:02

Choose a tag to compare

NVIDIA Blackwell Support:

  • RTX 5090 Tested: Full build and test run on NVIDIA GeForce RTX 5090 (SM 10.0, Blackwell architecture)
  • CUDA 12.8+ Requirement: Documented that Blackwell SM 100 requires CUDA 12.8+ for native support
  • Vulkan Fallback: Vulkan 1.3 backend provides full GPU acceleration on Blackwell when CUDA toolkit is older
  • Architecture Table Updated: Added RTX 5090 specs (21760 CUDA cores, 680 Tensor Cores, 32GB VRAM)

Build System:

  • Multi-Architecture Support: Build instructions updated for RTX 20xx/30xx/40xx (SM 75/86/89) and RTX 50xx (SM 100)
  • Native Optimization: Added -march=native -mtune=native flags for x86-64 builds
  • Local Install: Support for ~/.local prefix installation without root access

Documentation:

  • CUDA Toolkit Requirements Table: Clear mapping of GPU generations to minimum CUDA versions
  • Troubleshooting Expanded: Added Blackwell-specific error messages and solutions
  • GPU Architecture Detection: Added nvidia-smi commands to identify compute capability

Test Results (x86-64 with RTX 5090, CUDA 12.0, Vulkan 1.3):

Backend Tests Status Notes
Vulkan 4/4 ✅ Pass Full GPU acceleration
NEON/CPU 11/11 ✅ Pass Eigen auto-vectorized for AVX2
CUDA 0/1 ⚠️ Skip Requires CUDA 12.8+ for Blackwell

Known Limitation: CUDA tests fail on RTX 50xx with CUDA 12.0 due to PTX forward compatibility issues. Install CUDA 12.8+ from NVIDIA for native Blackwell support, or use the Vulkan backend.

v0.5.7 - SVE2 & Radar Pipeline Optimization for Orange Pi 6 Plus (March 2026)

25 May 09:02

Choose a tag to compare

SVE2 Transcendentals — Eliminate Heap Allocations (sve2_kernels.cpp):

  • sve2_fast_cos_f32: Inlined sin polynomial with pi/2 offset in a single SVE2 predicated pass. Previously allocated a std::vector<float> temp buffer and made 2 passes over the data.
  • sve2_fast_sigmoid_f32: Fused single-pass with inline exp(-x) computation. Previously allocated 2 std::vector buffers and made 3 passes (clamp+negate, exp, divide).
  • sve2_fast_tanh_f32: Fused single-pass with inline exp(-2x) + sigmoid. Previously allocated 2 std::vector buffers and made 3 passes.
  • Impact: Eliminates 1-2 heap allocations and 1-2 extra data passes per call in hot paths.

SVE2 GEMM Microkernel — Vectorize with SVE2 FMA (sve2_kernels.cpp):

  • micro_kernel_8x8_sve2: Replaced scalar float acc[8][8] loop with SVE2 column-oriented accumulators using svmla_n_f32_z for rank-1 broadcast FMA, matching the NEON kernel's register-blocked design.
  • Uses lo/hi vector pairs for 8-row columns on 128-bit SVE2 (CIX P1 svcntw()=4).
  • Vectorized load-add-store for C matrix writeback.

CAF Doppler Shift — Vectorize Trig in Hot Loop (neon_radar.cpp, sve2_radar.cpp):

  • Replaced per-sample std::cos/std::sin calls (the CAF bottleneck) with batch neon_fast_cos/sin_f32 and sve2_fast_cos/sin_f32.
  • Doppler phase rotation now uses vectorized neon_complex_mul_f32 / sve2_complex_mul_f32 instead of scalar multiply.
  • Estimated ~10x speedup for the Doppler shift phase of CAF computation.

SVE2 Complex Exponential — Vectorize (sve2_complex.cpp):

  • sve2_complex_exp_f32: Replaced scalar std::cos/std::sin loop with sve2_fast_cos/sin_f32 for full SVE2 vectorization.

Architecture Safety: All changes guarded by #ifdef OPTMATH_USE_SVE2 / #ifdef OPTMATH_USE_NEON with scalar fallbacks intact. No changes to headers, APIs, or non-ARM code paths.

All 16 test suites pass (100%) on Orange Pi 6 Plus.