OptMathKernels API Reference
Complete API documentation for OptMathKernels - High-Performance Numerical Library for Raspberry Pi 5 and NVIDIA GPUs.
Version : 0.5.7
Total Functions : 473+
Backends : NEON (ARM), SVE2 (ARMv9), CUDA (NVIDIA), Vulkan (Cross-platform), Radar (Signal Processing), Platform (Detection)
Header : #include <optmath/neon_kernels.hpp>
Namespace : optmath::neon
Target : ARM Cortex-A76 (Raspberry Pi 5), ARMv8-A with NEON
Returns true if NEON acceleration was compiled in and is available.
Core Vector Operations (Low-Level)
Function
Return Type
Parameters
Description
neon_dot_f32
float
const float* a, const float* b, std::size_t n
Dot product of two float32 arrays
neon_dot_f64
double
const double* a, const double* b, std::size_t n
Dot product of two float64 arrays
neon_add_f32
void
float* out, const float* a, const float* b, std::size_t n
Element-wise addition: out = a + b
neon_sub_f32
void
float* out, const float* a, const float* b, std::size_t n
Element-wise subtraction: out = a - b
neon_mul_f32
void
float* out, const float* a, const float* b, std::size_t n
Element-wise multiplication: out = a * b
neon_div_f32
void
float* out, const float* a, const float* b, std::size_t n
Element-wise division: out = a / b
Function
Return Type
Parameters
Description
neon_norm_f32
float
const float* a, std::size_t n
L2 norm: sqrt(sum(a[i]^2))
neon_reduce_sum_f32
float
const float* a, std::size_t n
Sum all elements
neon_reduce_max_f32
float
const float* a, std::size_t n
Maximum element
neon_reduce_min_f32
float
const float* a, std::size_t n
Minimum element
Matrix Operations (Low-Level)
Function
Return Type
Parameters
Description
neon_gemm_4x4_f32
void
float* C, const float* A, std::size_t lda, const float* B, std::size_t ldb, std::size_t ldc
4x4 GEMM microkernel: C += A * B
neon_gemm_blocked_f32
void
float* C, const float* A, const float* B, std::size_t M, std::size_t N, std::size_t K, std::size_t lda, std::size_t ldb, std::size_t ldc
Cache-blocked GEMM (runtime-tuned MC/KC/NC)
Cache Blocking Parameters (auto-tuned per detected L3 cache):
Cortex-A76 (Pi 5, 2MB L3): MC=128, KC=256, NC=512
Cortex-A720 (Orange Pi 6+, 12MB L3): MC=256, KC=512, NC=1024
8x8 microkernel with column-oriented NEON FMA accumulators
DSP / Filter Operations (Low-Level)
Function
Return Type
Parameters
Description
neon_fir_f32
void
const float* x, std::size_t n_x, const float* h, std::size_t n_h, float* y
FIR filter: y = x * h (convolution)
Polyphase Resampler (Low-Level)
Rational sample rate conversion by L/M using polyphase decomposition with NEON-optimized FIR per phase.
Structures:
struct PolyphaseResamplerState {
std::vector<std::vector<float >> phases; // Polyphase decomposition [L][n_taps]
std::size_t L; // Interpolation factor
std::size_t M; // Decimation factor
std::size_t n_taps; // Taps per phase
std::vector<float > delay; // Delay line for streaming
std::size_t delay_pos; // Write position in circular delay line
std::size_t phase_acc; // Phase accumulator
};
Function
Return Type
Parameters
Description
neon_resample_init
void
PolyphaseResamplerState& state, const float* filter, std::size_t filter_len, std::size_t L, std::size_t M
Initialize resampler with prototype lowpass filter
neon_resample_f32
std::size_t
float* out, const float* in, std::size_t input_len, PolyphaseResamplerState& state
Streaming resampler (returns output sample count)
neon_resample_oneshot_f32
void
float* out, std::size_t* output_len, const float* in, std::size_t input_len, const float* filter, std::size_t filter_len, std::size_t L, std::size_t M
One-shot resampler (non-streaming)
Biquad IIR Filter (Low-Level)
Direct Form II Transposed biquad filter with cascade support and design helpers.
Structures:
struct BiquadCoeffs {
float b0, b1, b2; // Numerator (feedforward)
float a1, a2; // Denominator (feedback), a0 normalized to 1
};
struct BiquadState {
float s1 = 0 .0f ; // DF2T state variable 1
float s2 = 0 .0f ; // DF2T state variable 2
};
Function
Return Type
Parameters
Description
neon_biquad_f32
void
float* out, const float* in, std::size_t n, const BiquadCoeffs& coeffs, BiquadState& state
Process single biquad section (in-place OK)
neon_biquad_cascade_f32
void
float* out, const float* in, std::size_t n, const BiquadCoeffs* coeffs, BiquadState* states, std::size_t n_sections
Process cascade of biquad sections
neon_biquad_lowpass
BiquadCoeffs
float fc, float fs, float Q = 0.707
Design 2nd-order Butterworth lowpass
neon_biquad_highpass
BiquadCoeffs
float fc, float fs, float Q = 0.707
Design 2nd-order Butterworth highpass
neon_biquad_bandpass
BiquadCoeffs
float fc, float fs, float Q = 1.0
Design 2nd-order bandpass (constant 0dB peak)
neon_biquad_notch
BiquadCoeffs
float fc, float fs, float Q = 1.0
Design 2nd-order notch (band-reject)
2D Convolution (Low-Level)
NEON-vectorized 2D convolution with row-major layout. Valid mode (no padding).
Output size: (in_rows - kernel_rows + 1) x (in_cols - kernel_cols + 1).
Function
Return Type
Parameters
Description
neon_conv2d_f32
void
float* out, const float* in, std::size_t in_rows, std::size_t in_cols, const float* kernel, std::size_t kernel_rows, std::size_t kernel_cols
General NxM 2D convolution
neon_conv2d_separable_f32
void
float* out, const float* in, std::size_t in_rows, std::size_t in_cols, const float* row_kernel, std::size_t row_kernel_len, const float* col_kernel, std::size_t col_kernel_len
Separable 2D convolution (row then column pass)
neon_conv2d_3x3_f32
void
float* out, const float* in, std::size_t in_rows, std::size_t in_cols, const float kernel[9]
Optimized 3x3 convolution (fully unrolled)
neon_conv2d_5x5_f32
void
float* out, const float* in, std::size_t in_rows, std::size_t in_cols, const float kernel[25]
Optimized 5x5 convolution (unrolled)
Activation Functions (In-Place)
Function
Return Type
Parameters
Description
neon_relu_f32
void
float* data, std::size_t n
ReLU: max(0, x)
neon_sigmoid_f32
void
float* data, std::size_t n
Sigmoid: 1/(1+exp(-x)) (scalar)
neon_tanh_f32
void
float* data, std::size_t n
Hyperbolic tangent (scalar)
Vectorized Transcendentals (Fast Approximations)
These functions use NEON SIMD for 4-8x speedup, trading accuracy for speed.
Typical accuracy: exp ~12%, sin/cos ~1e-5, sigmoid ~3%, tanh ~6%.
Function
Return Type
Parameters
Description
neon_fast_exp_f32
void
float* out, const float* in, std::size_t n
Vectorized exp (6th-order polynomial, ~12% error)
neon_fast_sin_f32
void
float* out, const float* in, std::size_t n
Vectorized sin (Chebyshev polynomial, ~1e-5 error)
neon_fast_cos_f32
void
float* out, const float* in, std::size_t n
Vectorized cos (~1e-5 error)
neon_fast_sigmoid_f32
void
float* out, const float* in, std::size_t n
Fast vectorized sigmoid (~3% error)
neon_fast_tanh_f32
void
float* out, const float* in, std::size_t n
Fast vectorized tanh (~6% error)
Performance (Raspberry Pi 5):
exp: ~13 GFLOPS (45x faster than scalar)
sin/cos: ~10 GFLOPS (30x faster than scalar)
sigmoid/tanh: ~8 GFLOPS (25x faster than scalar)
Complex Number Operations (Separate Real/Imag)
For C/ctypes interop where complex data is in separate arrays.
Function
Return Type
Parameters
Description
neon_complex_mul_f32
void
float* out_re, float* out_im, const float* a_re, const float* a_im, const float* b_re, const float* b_im, std::size_t n
Complex multiply: out = a * b
neon_complex_conj_mul_f32
void
float* out_re, float* out_im, const float* a_re, const float* a_im, const float* b_re, const float* b_im, std::size_t n
Complex conjugate multiply: out = a * conj(b)
neon_complex_dot_f32
void
float* out_re, float* out_im, const float* a_re, const float* a_im, const float* b_re, const float* b_im, std::size_t n
Complex dot product: sum(a * conj(b))
neon_complex_magnitude_f32
void
float* out, const float* re, const float* im, std::size_t n
Magnitude: sqrt(re^2 + im^2)
neon_complex_magnitude_squared_f32
void
float* out, const float* re, const float* im, std::size_t n
Squared magnitude: re^2 + im^2
neon_complex_phase_f32
void
float* out, const float* re, const float* im, std::size_t n
Phase angle: atan2(im, re)
neon_complex_add_f32
void
float* out_re, float* out_im, const float* a_re, const float* a_im, const float* b_re, const float* b_im, std::size_t n
Complex addition
neon_complex_scale_f32
void
float* out_re, float* out_im, const float* in_re, const float* in_im, float scale_re, float scale_im, std::size_t n
Complex scalar multiply
neon_complex_exp_f32
void
float* out_re, float* out_im, const float* phase, std::size_t n
Complex exponential: exp(j*phase)
Complex Number Operations (Interleaved)
For IQ data format: [re0, im0, re1, im1, ...]
Function
Return Type
Parameters
Description
neon_complex_mul_interleaved_f32
void
float* out, const float* a, const float* b, std::size_t n
Complex multiply (interleaved format)
neon_complex_conj_mul_interleaved_f32
void
float* out, const float* a, const float* b, std::size_t n
Complex conjugate multiply (interleaved)
High-level C++ interface using Eigen types.
Function
Return Type
Parameters
Description
neon_dot
float
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Dot product
neon_dot
double
const Eigen::VectorXd& a, const Eigen::VectorXd& b
Double-precision dot product
neon_add
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Vector addition
neon_sub
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Vector subtraction
neon_mul
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Element-wise multiplication
neon_div
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Element-wise division
neon_norm
float
const Eigen::VectorXf& a
L2 norm
neon_reduce_sum
float
const Eigen::VectorXf& a
Sum all elements
neon_reduce_max
float
const Eigen::VectorXf& a
Maximum element
neon_reduce_min
float
const Eigen::VectorXf& a
Minimum element
neon_fir
Eigen::VectorXf
const Eigen::VectorXf& x, const Eigen::VectorXf& h
FIR filter
neon_relu
void
Eigen::VectorXf& x
In-place ReLU
neon_sigmoid
void
Eigen::VectorXf& x
In-place sigmoid
neon_tanh
void
Eigen::VectorXf& x
In-place tanh
Function
Return Type
Parameters
Description
neon_gemm
Eigen::MatrixXf
const Eigen::MatrixXf& A, const Eigen::MatrixXf& B
Matrix multiply: A * B
neon_gemm_blocked
Eigen::MatrixXf
const Eigen::MatrixXf& A, const Eigen::MatrixXf& B
Optimized blocked GEMM
neon_mat_scale
Eigen::MatrixXf
const Eigen::MatrixXf& A, float s
Scalar multiply: A * s
neon_mat_transpose
Eigen::MatrixXf
const Eigen::MatrixXf& A
Matrix transpose
neon_mat_vec_mul
Eigen::VectorXf
const Eigen::MatrixXf& A, const Eigen::VectorXf& v
Matrix-vector multiply: A * v
Function
Return Type
Parameters
Description
neon_complex_mul
Eigen::VectorXcf
const Eigen::VectorXcf& a, const Eigen::VectorXcf& b
Complex multiply
neon_complex_conj_mul
Eigen::VectorXcf
const Eigen::VectorXcf& a, const Eigen::VectorXcf& b
Complex conjugate multiply
neon_complex_dot
std::complex<float>
const Eigen::VectorXcf& a, const Eigen::VectorXcf& b
Complex dot product
neon_complex_magnitude
Eigen::VectorXf
const Eigen::VectorXcf& a
Magnitude of complex vector
neon_complex_phase
Eigen::VectorXf
const Eigen::VectorXcf& a
Phase of complex vector
Function
Return Type
Parameters
Description
neon_resample
Eigen::VectorXf
const Eigen::VectorXf& in, const Eigen::VectorXf& filter, std::size_t L, std::size_t M
Polyphase resampler (one-shot)
neon_biquad
Eigen::VectorXf
const Eigen::VectorXf& in, const BiquadCoeffs& coeffs
Biquad IIR filter (single section)
neon_conv2d
Eigen::MatrixXf
const Eigen::MatrixXf& in, const Eigen::MatrixXf& kernel
2D convolution (handles col/row-major conversion)
Dense Linear Algebra (Low-Level)
Column-major layout. All operations are in-place unless noted. NEON-vectorized AXPY/dot/scale for contiguous column data.
Function
Return Type
Parameters
Description
neon_trsv_lower_f32
void
float* b, const float* L, std::size_t n, std::size_t ldl
Forward substitution: solve L*x = b
neon_trsv_upper_f32
void
float* b, const float* U, std::size_t n, std::size_t ldu
Backward substitution: solve U*x = b
neon_trsv_lower_unit_f32
void
float* b, const float* L, std::size_t n, std::size_t ldl
Unit-diagonal forward substitution
neon_trsv_lower_trans_f32
void
float* b, const float* L, std::size_t n, std::size_t ldl
Solve L^T*x = b using lower L
neon_trsm_lower_f32
void
float* B, const float* L, std::size_t n, std::size_t nrhs, std::size_t ldl, std::size_t ldb
Multi-RHS lower triangular solve
neon_trsm_upper_f32
void
float* B, const float* U, std::size_t n, std::size_t nrhs, std::size_t ldu, std::size_t ldb
Multi-RHS upper triangular solve
Function
Return Type
Parameters
Description
neon_cholesky_f32
int
float* A, std::size_t n, std::size_t lda
Cholesky A = L*L^T (returns 0 or failing pivot)
neon_lu_f32
int
float* A, int* piv, std::size_t m, std::size_t n, std::size_t lda
LU with partial pivoting (returns 0 or failing pivot)
neon_qr_f32
void
float* A, float* tau, std::size_t m, std::size_t n, std::size_t lda
QR via Householder reflections
neon_qr_extract_q_f32
void
float* Q, const float* A, const float* tau, std::size_t m, std::size_t n, std::size_t lda, std::size_t ldq
Extract explicit Q from Householder vectors
Function
Return Type
Parameters
Description
neon_solve_f32
int
float* A, float* b, std::size_t n, std::size_t lda
General solve via LU
neon_solve_spd_f32
int
float* A, float* b, std::size_t n, std::size_t lda
SPD solve via Cholesky
neon_inverse_f32
int
float* Ainv, const float* A, std::size_t n, std::size_t lda, std::size_t ldinv
Matrix inverse via LU
Eigen Dense Linear Algebra Wrappers
Function
Return Type
Parameters
Description
neon_cholesky
Eigen::MatrixXf
const Eigen::MatrixXf& A
Cholesky: returns L (empty on failure)
neon_lu
pair<MatrixXf, VectorXi>
const Eigen::MatrixXf& A
LU: returns (LU combined, pivot vector)
neon_qr
pair<MatrixXf, MatrixXf>
const Eigen::MatrixXf& A
QR: returns (Q, R)
neon_trsv_lower
Eigen::VectorXf
const Eigen::MatrixXf& L, const Eigen::VectorXf& b
Solve L*x = b
neon_trsv_upper
Eigen::VectorXf
const Eigen::MatrixXf& U, const Eigen::VectorXf& b
Solve U*x = b
neon_solve
Eigen::VectorXf
const Eigen::MatrixXf& A, const Eigen::VectorXf& b
General solve A*x = b
neon_solve_spd
Eigen::VectorXf
const Eigen::MatrixXf& A, const Eigen::VectorXf& b
SPD solve A*x = b
neon_inverse
Eigen::MatrixXf
const Eigen::MatrixXf& A
Matrix inverse (empty on failure)
CUDA Backend (NVIDIA GPU)
Header : #include <optmath/cuda_backend.hpp>
Namespace : optmath::cuda
Target : NVIDIA GPUs with Compute Capability 7.0+ (Turing, Ampere, Ada Lovelace, Hopper, Blackwell)
bool is_available ();
int get_device_count ();
DeviceInfo get_device_info (int device_id = 0 );
void print_device_info (int device_id = 0 );
DeviceInfo Structure :
struct DeviceInfo {
int device_id;
std::string name;
int compute_capability_major; // e.g., 8 for Ampere
int compute_capability_minor; // e.g., 6 for RTX 3090
size_t total_memory; // Total GPU memory in bytes
size_t free_memory; // Available GPU memory
int multiprocessor_count; // Number of SMs
int max_threads_per_block; // Max threads per block (1024)
int warp_size; // Warp size (32)
bool tensor_cores; // Volta+ (SM 7.0+)
bool tf32_support; // Ampere+ (SM 8.0+)
bool fp16_support; // Pascal+ (SM 6.0+)
bool fp8_support; // Blackwell (SM 10.0+)
bool blackwell; // Blackwell architecture
bool unified_memory; // Unified Memory support
float memory_bandwidth_gbps; // Memory bandwidth
size_t shared_memory_per_block; // Shared memory per block
// Convenience methods
int compute_major () const ;
int compute_minor () const ;
bool has_tensor_cores () const ;
bool is_ampere_or_newer () const ;
bool is_blackwell_or_newer () const ;
};
CudaContext (Singleton):
class CudaContext {
static CudaContext& get ();
bool init (int device_id = 0 );
void cleanup ();
bool is_initialized () const ;
int device_id () const ;
size_t get_free_memory () const ;
size_t get_total_memory () const ;
void synchronize ();
enum class PrecisionMode { FP32, TF32, FP16, FP64, MIXED_FP16_FP32 };
void set_precision_mode (PrecisionMode mode);
PrecisionMode get_precision_mode () const ;
};
CudaStream :
class CudaStream {
CudaStream ();
~CudaStream ();
void synchronize ();
bool query () const ; // Returns true if stream is idle
cudaStream_t get () const ;
};
Memory Management Templates
DeviceBuffer - GPU memory:
template <typename T>
class DeviceBuffer {
DeviceBuffer ();
explicit DeviceBuffer (size_t count);
void allocate (size_t count);
void free ();
void copy_from_host (const T* host_data, size_t count);
void copy_to_host (T* host_data, size_t count) const ;
void copy_from_host_async (const T* host_data, size_t count, CudaStream& stream);
void copy_to_host_async (T* host_data, size_t count, CudaStream& stream) const ;
T* data ();
const T* data () const ;
size_t size () const ;
size_t bytes () const ;
bool empty () const ;
};
PinnedBuffer - Page-locked host memory:
template <typename T>
class PinnedBuffer {
PinnedBuffer ();
explicit PinnedBuffer (size_t count);
void allocate (size_t count);
void free ();
T* data ();
size_t size () const ;
};
UnifiedBuffer - Unified Memory (CPU/GPU accessible):
template <typename T>
class UnifiedBuffer {
UnifiedBuffer ();
explicit UnifiedBuffer (size_t count);
void allocate (size_t count);
void free ();
void prefetch_to_device (int device_id = 0 );
void prefetch_to_host ();
T* data ();
size_t size () const ;
};
Vector Operations (Low-Level)
Function
Return Type
Parameters
Description
cuda_vec_add_f32
void
float* out, const float* a, const float* b, size_t n
Vector addition
cuda_vec_mul_f32
void
float* out, const float* a, const float* b, size_t n
Element-wise multiplication
cuda_vec_scale_f32
void
float* out, const float* a, float scalar, size_t n
Scalar multiplication
cuda_vec_dot_f32
float
const float* a, const float* b, size_t n
Dot product
cuda_vec_sum_f32
float
const float* a, size_t n
Sum all elements
cuda_vec_max_f32
float
const float* a, size_t n
Maximum element
cuda_vec_min_f32
float
const float* a, size_t n
Minimum element
cuda_vec_norm_f32
float
const float* a, size_t n
L2 norm
cuda_vec_abs_f32
void
float* out, const float* a, size_t n
Absolute value
cuda_vec_sqrt_f32
void
float* out, const float* a, size_t n
Square root
Transcendental Functions (CUDA Fast Math)
Function
Return Type
Parameters
Description
cuda_exp_f32
void
float* out, const float* in, size_t n
Exponential
cuda_log_f32
void
float* out, const float* in, size_t n
Natural logarithm
cuda_sin_f32
void
float* out, const float* in, size_t n
Sine
cuda_cos_f32
void
float* out, const float* in, size_t n
Cosine
cuda_sincos_f32
void
float* sin_out, float* cos_out, const float* in, size_t n
Simultaneous sin/cos
cuda_tan_f32
void
float* out, const float* in, size_t n
Tangent
cuda_atan2_f32
void
float* out, const float* y, const float* x, size_t n
atan2(y, x)
cuda_pow_f32
void
float* out, const float* base, const float* exp, size_t n
Power function
Activation Functions (ML)
Function
Return Type
Parameters
Description
cuda_sigmoid_f32
void
float* out, const float* in, size_t n
Sigmoid: 1/(1+exp(-x))
cuda_tanh_f32
void
float* out, const float* in, size_t n
Hyperbolic tangent
cuda_relu_f32
void
float* out, const float* in, size_t n
ReLU: max(0, x)
cuda_leaky_relu_f32
void
float* out, const float* in, float alpha, size_t n
Leaky ReLU
cuda_gelu_f32
void
float* out, const float* in, size_t n
GELU activation
cuda_softmax_f32
void
float* out, const float* in, size_t n
Softmax
Matrix Operations (cuBLAS)
Function
Return Type
Parameters
Description
cuda_mat_mul_f32
void
float* C, const float* A, const float* B, int M, int N, int K, bool transA, bool transB
GEMM: C = A * B
cuda_mat_add_f32
void
float* C, const float* A, const float* B, int M, int N
Matrix addition
cuda_mat_scale_f32
void
float* out, const float* A, float scalar, int M, int N
Scalar multiply
cuda_mat_transpose_f32
void
float* out, const float* A, int M, int N
Transpose
cuda_mat_vec_mul_f32
void
float* out, const float* A, const float* x, int M, int N
Matrix-vector multiply
cuda_mat_mul_tensorcore_f32
void
float* C, const float* A, const float* B, int M, int N, int K
Tensor Core GEMM (Ampere+)
cuda_mat_mul_tensorcore_fp16
void
void* C, const void* A, const void* B, int M, int N, int K
FP16 Tensor Core GEMM
cuda_batched_mat_mul_f32
void
float** C, float** A, float** B, int M, int N, int K, int batch
Batched GEMM
Linear Algebra (cuSOLVER)
Function
Return Type
Parameters
Description
cuda_cholesky
Eigen::MatrixXf
const Eigen::MatrixXf& A
Cholesky decomposition
cuda_lu
pair<Eigen::MatrixXf, Eigen::VectorXi>
const Eigen::MatrixXf& A
LU decomposition with pivots
cuda_qr
pair<Eigen::MatrixXf, Eigen::MatrixXf>
const Eigen::MatrixXf& A
QR decomposition (Q, R)
cuda_svd
SVDResult
const Eigen::MatrixXf& A
SVD: U, S, Vt
cuda_eig
pair<Eigen::VectorXf, Eigen::MatrixXf>
const Eigen::MatrixXf& A
Eigendecomposition
cuda_solve
Eigen::VectorXf
const Eigen::MatrixXf& A, const Eigen::VectorXf& b
Linear solve: Ax = b
cuda_inverse
Eigen::MatrixXf
const Eigen::MatrixXf& A
Matrix inverse
Complex Number Operations
Function
Return Type
Parameters
Description
cuda_complex_mul_f32
void
float* out_re, float* out_im, const float* a_re, const float* a_im, const float* b_re, const float* b_im, size_t n
Complex multiply
cuda_complex_conj_mul_f32
void
float* out_re, float* out_im, const float* a_re, const float* a_im, const float* b_re, const float* b_im, size_t n
Complex conjugate multiply
cuda_complex_dot_f32
void
float* out_re, float* out_im, const float* a_re, const float* a_im, const float* b_re, const float* b_im, size_t n
Complex dot product
cuda_complex_magnitude_f32
void
float* out, const float* re, const float* im, size_t n
Magnitude
cuda_complex_phase_f32
void
float* out, const float* re, const float* im, size_t n
Phase angle
cuda_complex_exp_f32
void
float* out_re, float* out_im, const float* phase, size_t n
Complex exponential
CudaFFTPlan Class :
class CudaFFTPlan {
bool create_1d (size_t n, bool inverse = false );
bool create_1d_batch (size_t n, size_t batch, bool inverse = false );
bool create_2d (size_t nx, size_t ny, bool inverse = false );
void execute (float * inout);
void execute (const float * in, float * out);
void destroy ();
};
One-shot FFT Functions :
Function
Return Type
Parameters
Description
cuda_fft_1d_f32
void
float* inout, size_t n, bool inverse
1D FFT
cuda_fft_1d_batch_f32
void
float* inout, size_t n, size_t batch, bool inverse
Batched 1D FFT
cuda_fft_2d_f32
void
float* inout, size_t nx, size_t ny, bool inverse
2D FFT
Eigen Wrappers :
Function
Return Type
Parameters
Description
cuda_fft
Eigen::VectorXcf
const Eigen::VectorXcf& x
Forward FFT
cuda_ifft
Eigen::VectorXcf
const Eigen::VectorXcf& x
Inverse FFT
cuda_fft2
Eigen::MatrixXcf
const Eigen::MatrixXcf& x
2D FFT
cuda_ifft2
Eigen::MatrixXcf
const Eigen::MatrixXcf& x
2D inverse FFT
cuda_rfft
Eigen::VectorXcf
const Eigen::VectorXf& x
Real-to-complex FFT
cuda_irfft
Eigen::VectorXf
const Eigen::VectorXcf& x, size_t n
Complex-to-real inverse FFT
Function
Return Type
Parameters
Description
cuda_conv1d_f32
void
float* out, const float* signal, const float* kernel, size_t signal_len, size_t kernel_len
1D convolution
cuda_conv2d_f32
void
float* out, const float* image, const float* kernel, int img_h, int img_w, int kern_h, int kern_w
2D convolution
cuda_fftconv1d_f32
void
float* out, const float* signal, const float* kernel, size_t signal_len, size_t kernel_len
FFT-based 1D convolution
cuda_conv1d
Eigen::VectorXf
const Eigen::VectorXf& signal, const Eigen::VectorXf& kernel
Eigen 1D convolution
cuda_conv2d
Eigen::MatrixXf
const Eigen::MatrixXf& image, const Eigen::MatrixXf& kernel
Eigen 2D convolution
Radar Signal Processing (GPU)
Function
Return Type
Parameters
Description
cuda_caf
Eigen::MatrixXf
const Eigen::VectorXcf& ref, const Eigen::VectorXcf& surv, size_t n_doppler, float doppler_start, float doppler_step, float sample_rate, size_t n_range
Cross-Ambiguity Function
cuda_cfar_2d
Eigen::MatrixXi
const Eigen::MatrixXf& power_map, int guard_range, int guard_doppler, int ref_range, int ref_doppler, float pfa_factor
2D CFAR detector
cuda_cfar_ca
Eigen::VectorXi
const Eigen::VectorXf& power, int guard_cells, int ref_cells, float pfa_factor
1D CA-CFAR
cuda_doppler_process
Eigen::MatrixXcf
const Eigen::MatrixXcf& pulse_data, size_t fft_size, int window_type
Doppler processing
cuda_bartlett_spectrum
Eigen::VectorXf
const Eigen::VectorXcf& array_data, float d_lambda, int n_angles
Bartlett beamformer
cuda_steering_vectors_ula
Eigen::MatrixXcf
int n_elements, float d_lambda, const Eigen::VectorXf& angles
ULA steering vectors
cuda_nlms_filter
Eigen::VectorXcf
const Eigen::VectorXcf& surv, const Eigen::VectorXcf& ref, int filter_len, float mu, float eps
NLMS adaptive filter
cuda_projection_clutter
Eigen::VectorXcf
const Eigen::VectorXcf& surv, const Eigen::MatrixXcf& clutter_subspace
Projection clutter cancellation
enum class WindowType {
RECTANGULAR, HAMMING, HANNING, BLACKMAN,
BLACKMAN_HARRIS, KAISER, GAUSSIAN, TUKEY
};
Function
Return Type
Parameters
Description
cuda_generate_window
Eigen::VectorXf
size_t n, WindowType type, float param
Generate window on GPU
cuda_apply_window
void
Eigen::VectorXf& data, const Eigen::VectorXf& window
Apply window (real)
cuda_apply_window
void
Eigen::VectorXcf& data, const Eigen::VectorXf& window
Apply window (complex)
Function
Return Type
Parameters
Description
set_device
void
int device_id
Set active CUDA device
get_device
int
-
Get current device
enable_peer_access
bool
int device_from, int device_to
Enable P2P access
parallel_for_devices
void
const std::vector<int>& devices, Func&& func
Distribute workload
CudaTimer :
class CudaTimer {
void start ();
void stop ();
float elapsed_ms () const ;
};
Bandwidth Measurement :
struct BandwidthStats {
float host_to_device_gbps;
float device_to_host_gbps;
float device_to_device_gbps;
};
BandwidthStats measure_bandwidth (size_t bytes = 256 * 1024 * 1024 );
Function
Return Type
Parameters
Description
get_last_error
std::string
-
Get last CUDA error message
check_cuda_error
bool
const char* operation
Check and report errors
For API compatibility with NEON backend:
Shorthand
Full Function
cuda_add(a, b)
cuda_vec_add(a, b)
cuda_mul(a, b)
cuda_vec_mul(a, b)
cuda_scale(a, s)
cuda_vec_scale(a, s)
cuda_dot(a, b)
cuda_vec_dot(a, b)
cuda_sum(a)
cuda_reduce_sum(a)
cuda_max(a)
cuda_reduce_max(a)
cuda_min(a)
cuda_reduce_min(a)
cuda_gemm(A, B)
cuda_mat_mul(A, B)
cuda_gemv(A, x)
cuda_mat_vec_mul(A, x)
cuda_transpose(A)
cuda_mat_transpose(A)
cuda_complex_abs(a)
cuda_complex_magnitude(a)
cuda_complex_arg(a)
cuda_complex_phase(a)
Vulkan Backend (Cross-platform GPU)
Header : #include <optmath/vulkan_backend.hpp>
Namespace : optmath::vulkan
Target : Any Vulkan 1.2+ GPU (NVIDIA, AMD, Intel, Raspberry Pi 5 VideoCore VII)
class VulkanContext {
static VulkanContext& get ();
bool init ();
void cleanup ();
#ifdef OPTMATH_USE_VULKAN
VkDevice device;
VkQueue computeQueue;
VkCommandPool commandPool;
uint32_t findMemoryType (uint32_t typeFilter, VkMemoryPropertyFlags properties);
#endif
};
Function
Return Type
Parameters
Description
vulkan_vec_add
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Vector addition
vulkan_vec_sub
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Vector subtraction
vulkan_vec_mul
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Element-wise multiply
vulkan_vec_div
Eigen::VectorXf
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Element-wise divide
vulkan_vec_dot
float
const Eigen::VectorXf& a, const Eigen::VectorXf& b
Dot product
vulkan_vec_norm
float
const Eigen::VectorXf& a
L2 norm
Function
Return Type
Parameters
Description
vulkan_mat_add
Eigen::MatrixXf
const Eigen::MatrixXf& a, const Eigen::MatrixXf& b
Matrix addition
vulkan_mat_sub
Eigen::MatrixXf
const Eigen::MatrixXf& a, const Eigen::MatrixXf& b
Matrix subtraction
vulkan_mat_mul
Eigen::MatrixXf
const Eigen::MatrixXf& a, const Eigen::MatrixXf& b
Matrix multiplication (16x16 tiled)
vulkan_mat_transpose
Eigen::MatrixXf
const Eigen::MatrixXf& a
Transpose
vulkan_mat_scale
Eigen::MatrixXf
const Eigen::MatrixXf& a, float scalar
Scalar multiply
vulkan_mat_vec_mul
Eigen::VectorXf
const Eigen::MatrixXf& a, const Eigen::VectorXf& v
Matrix-vector multiply
vulkan_mat_outer_product
Eigen::MatrixXf
const Eigen::VectorXf& u, const Eigen::VectorXf& v
Outer product
vulkan_mat_elementwise_mul
Eigen::MatrixXf
const Eigen::MatrixXf& a, const Eigen::MatrixXf& b
Element-wise multiply
Function
Return Type
Parameters
Description
vulkan_convolution_1d
Eigen::VectorXf
const Eigen::VectorXf& x, const Eigen::VectorXf& k
1D convolution
vulkan_convolution_2d
Eigen::MatrixXf
const Eigen::MatrixXf& x, const Eigen::MatrixXf& k
2D convolution
vulkan_correlation_1d
Eigen::VectorXf
const Eigen::VectorXf& x, const Eigen::VectorXf& k
1D correlation
vulkan_correlation_2d
Eigen::MatrixXf
const Eigen::MatrixXf& x, const Eigen::MatrixXf& k
2D correlation
Function
Return Type
Parameters
Description
vulkan_reduce_sum
float
const Eigen::VectorXf& a
Sum all elements
vulkan_reduce_max
float
const Eigen::VectorXf& a
Maximum element
vulkan_reduce_min
float
const Eigen::VectorXf& a
Minimum element
vulkan_scan_prefix_sum
Eigen::VectorXf
const Eigen::VectorXf& a
Parallel prefix sum
Function
Return Type
Parameters
Description
vulkan_fft_radix2
void
Eigen::VectorXf& data, bool inverse
Radix-2 FFT (in-place, interleaved)
vulkan_fft_radix4
void
Eigen::VectorXf& data, bool inverse
Radix-4 FFT (in-place, interleaved)
Note : Data format is interleaved complex: [re0, im0, re1, im1, ...]. Size must be 2 * N where N is a power of 2.
37 GLSL compute shaders compiled to SPIR-V:
Shader
Purpose
vec_add.comp.glsl
Vector addition
vec_sub.comp.glsl
Vector subtraction
vec_mul.comp.glsl
Vector multiplication
vec_div.comp.glsl
Vector division
vec_dot.comp.glsl
Dot product
vec_norm.comp.glsl
L2 norm
mat_add.comp.glsl
Matrix addition
mat_sub.comp.glsl
Matrix subtraction
mat_mul.comp.glsl
Matrix multiplication
mat_mul_tiled.comp.glsl
Tiled GEMM (16x16 shared memory)
mat_transpose.comp.glsl
Matrix transpose
mat_scale.comp.glsl
Scalar multiply
mat_vec_mul.comp.glsl
Matrix-vector multiply
mat_outer_product.comp.glsl
Outer product
mat_elementwise_mul.comp.glsl
Element-wise multiply
reduce_sum.comp.glsl
Sum reduction
reduce_max.comp.glsl
Max reduction
reduce_min.comp.glsl
Min reduction
reduce_complete.comp.glsl
Complete reduction
scan_local.comp.glsl
Local prefix scan
scan_block_sums.comp.glsl
Block sum scan
scan_add_offsets.comp.glsl
Add scan offsets
scan_prefix_sum.comp.glsl
Full prefix sum
convolution_1d.comp.glsl
1D convolution
convolution_1d_optimized.comp.glsl
Optimized 1D convolution
convolution_2d.comp.glsl
2D convolution
convolution_2d_optimized.comp.glsl
Optimized 2D convolution
correlation_1d.comp.glsl
1D correlation
correlation_2d.comp.glsl
2D correlation
fft_radix2.comp.glsl
Radix-2 FFT
fft_radix2_optimized.comp.glsl
Optimized radix-2 FFT
fft_radix4.comp.glsl
Radix-4 FFT
ifft_radix2.comp.glsl
Inverse radix-2 FFT
ifft_radix4.comp.glsl
Inverse radix-4 FFT
caf_doppler_shift.comp.glsl
CAF Doppler shift
caf_xcorr.comp.glsl
CAF cross-correlation
cfar_2d.comp.glsl
2D CFAR detection
Radar Kernels (Signal Processing)
Header : #include <optmath/radar_kernels.hpp>
Namespace : optmath::radar
Target : Passive radar, SDR signal processing
enum class WindowType {
RECTANGULAR, // No window (boxcar)
HAMMING, // Hamming window
HANNING, // Hann window
BLACKMAN, // Blackman window
BLACKMAN_HARRIS, // Blackman-Harris window
KAISER // Kaiser window (param = beta)
};
Function
Return Type
Parameters
Description
generate_window_f32
void
float* window, std::size_t n, WindowType type, float beta
Generate window coefficients
apply_window_f32
void
float* data, const float* window, std::size_t n
Apply window (real, in-place)
apply_window_complex_f32
void
float* data_re, float* data_im, const float* window, std::size_t n
Apply window (complex)
generate_window
Eigen::VectorXf
std::size_t n, WindowType type, float beta
Eigen window generator
apply_window
void
Eigen::VectorXf& data, const Eigen::VectorXf& window
Eigen window (real)
apply_window
void
Eigen::VectorXcf& data, const Eigen::VectorXf& window
Eigen window (complex)
Function
Return Type
Parameters
Description
xcorr_f32
void
float* out, const float* x, std::size_t nx, const float* y, std::size_t ny
Real cross-correlation (size: nx+ny-1)
xcorr_complex_f32
void
float* out_re, float* out_im, const float* x_re, const float* x_im, std::size_t nx, const float* y_re, const float* y_im, std::size_t ny
Complex cross-correlation
xcorr
Eigen::VectorXf
const Eigen::VectorXf& x, const Eigen::VectorXf& y
Real cross-correlation
xcorr
Eigen::VectorXcf
const Eigen::VectorXcf& x, const Eigen::VectorXcf& y
Complex cross-correlation
Cross-Ambiguity Function (CAF)
The CAF is the core passive radar processing operation, measuring correlation between reference and surveillance signals across multiple Doppler shifts and range delays.
Function
Return Type
Parameters
Description
caf_f32
void
float* out_mag, const float* ref_re, const float* ref_im, const float* surv_re, const float* surv_im, std::size_t n_samples, std::size_t n_doppler_bins, float doppler_start, float doppler_step, float sample_rate, std::size_t n_range_bins
Direct CAF computation
caf_fft_f32
void
float* out_mag, const float* ref_re, const float* ref_im, const float* surv_re, const float* surv_im, std::size_t n_samples, std::size_t n_doppler_bins, float doppler_start, float doppler_step, float sample_rate, std::size_t n_range_bins
FFT-based CAF (faster for large arrays)
caf
Eigen::MatrixXf
const Eigen::VectorXcf& ref, const Eigen::VectorXcf& surv, std::size_t n_doppler_bins, float doppler_start, float doppler_step, float sample_rate, std::size_t n_range_bins
Eigen CAF wrapper
Parameters :
ref: Reference signal from transmitter (FM broadcast, DVB-T, etc.)
surv: Surveillance signal from receiver
n_doppler_bins: Number of Doppler frequency bins to compute
doppler_start: Starting Doppler frequency (Hz)
doppler_step: Doppler bin spacing (Hz)
sample_rate: Sample rate of signals (Hz)
n_range_bins: Number of range (delay) bins
Output : Range-Doppler magnitude matrix [n_doppler_bins x n_range_bins]
Constant False Alarm Rate detection maintains constant Pfa in varying clutter environments.
Function
Return Type
Parameters
Description
cfar_ca_f32
void
std::uint8_t* detections, float* threshold, const float* input, std::size_t n, std::size_t guard_cells, std::size_t reference_cells, float pfa_factor
1D Cell-Averaging CFAR
cfar_2d_f32
void
std::uint8_t* detections, const float* input, std::size_t n_doppler, std::size_t n_range, std::size_t guard_range, std::size_t guard_doppler, std::size_t ref_range, std::size_t ref_doppler, float pfa_factor
2D CFAR for range-Doppler
cfar_os_f32
void
std::uint8_t* detections, float* threshold, const float* input, std::size_t n, std::size_t guard_cells, std::size_t reference_cells, std::size_t k_select, float pfa_factor
Ordered-Statistic CFAR (robust to clutter edges)
cfar_ca
Eigen::Matrix<uint8_t,Dynamic,1>
const Eigen::VectorXf& input, std::size_t guard_cells, std::size_t reference_cells, float pfa_factor
Eigen 1D CFAR
cfar_2d
Eigen::Matrix<uint8_t,Dynamic,Dynamic>
const Eigen::MatrixXf& input, std::size_t guard_range, std::size_t guard_doppler, std::size_t ref_range, std::size_t ref_doppler, float pfa_factor
Eigen 2D CFAR
CFAR Cell Structure :
[ref cells] [guard cells] [CUT] [guard cells] [ref cells]
Function
Return Type
Parameters
Description
nlms_filter_f32
void
float* output, float* weights, const float* input, const float* reference, std::size_t n, std::size_t filter_length, float mu, float eps
Normalized LMS adaptive filter
projection_clutter_f32
void
float* output, const float* input, const float* clutter_subspace, std::size_t n, std::size_t subspace_dim
Projection clutter cancellation
nlms_filter
Eigen::VectorXf
const Eigen::VectorXf& input, const Eigen::VectorXf& reference, std::size_t filter_length, float mu, float eps
Eigen NLMS filter
projection_clutter
Eigen::VectorXf
const Eigen::VectorXf& input, const Eigen::MatrixXf& clutter_subspace
Eigen projection cancellation
NLMS Parameters :
filter_length: Number of taps (typically 32-128)
mu: Adaptation step size (0 < mu < 2, typically 0.1)
eps: Regularization constant (1e-6)
Function
Return Type
Parameters
Description
doppler_fft_f32
void
float* output_re, float* output_im, const float* input_re, const float* input_im, std::size_t n_pulses, std::size_t n_range, std::size_t fft_size
Doppler FFT across pulses
mti_filter_f32
void
float* output, const float* input, std::size_t n_pulses, std::size_t n_range, const float* coeffs, std::size_t n_coeffs
Moving Target Indicator filter
doppler_fft
Eigen::MatrixXcf
const Eigen::MatrixXcf& input, std::size_t fft_size
Eigen Doppler FFT
mti_filter
Eigen::MatrixXf
const Eigen::MatrixXf& input, const Eigen::VectorXf& coeffs
Eigen MTI filter
Common MTI Coefficients :
2-pulse canceller: [1, -1]
3-pulse canceller: [1, -2, 1]
Function
Return Type
Parameters
Description
beamform_delay_sum_f32
void
float* output, const float* inputs, const int* delays, const float* weights, std::size_t n_channels, std::size_t n_samples
Delay-and-sum beamformer
beamform_phase_f32
void
float* output_re, float* output_im, const float* inputs_re, const float* inputs_im, const float* phases, const float* weights, std::size_t n_channels, std::size_t n_samples
Phase-shift beamformer
steering_vector_ula_f32
void
float* steering_re, float* steering_im, std::size_t n_elements, float d_lambda, float theta_rad
ULA steering vector
beamform_delay_sum
Eigen::VectorXf
const Eigen::MatrixXf& inputs, const Eigen::VectorXi& delays, const Eigen::VectorXf& weights
Eigen delay-sum beamformer
beamform_phase
Eigen::VectorXcf
const Eigen::MatrixXcf& inputs, const Eigen::VectorXf& phases, const Eigen::VectorXf& weights
Eigen phase beamformer
steering_vector_ula
Eigen::VectorXcf
std::size_t n_elements, float d_lambda, float theta_rad
Eigen ULA steering vector
Steering Vector Formula (ULA):
a(θ)[n] = exp(j * 2π * d/λ * n * sin(θ))
Function Count by Backend
Backend
Low-Level
Eigen Wrappers
Total
NEON
48
56
104
CUDA
98
144
242
Vulkan
0
23
23
Radar
24
24
48
Total
170
247
417
Operation
Size
NEON (Pi5)
CUDA (RTX 4090)
Vulkan (Pi5)
Dot Product
4096
0.8 μs
0.02 μs
0.5 μs
GEMM
256x256
1.2 ms
0.008 ms
2.5 ms
GEMM
1024x1024
45 ms
0.08 ms
N/A
FFT
4096
0.4 ms
0.01 ms
0.3 ms
FFT
65536
8 ms
0.12 ms
5 ms
CAF
4096x64
5.2 ms
0.3 ms
2.1 ms
Exp
1M elements
1.2 ms
0.03 ms
N/A
Architecture
Backend
Compute Version
Key Features
ARM Cortex-A76
NEON
ARMv8-A
128-bit SIMD, 2.4 GHz
VideoCore VII
Vulkan
1.2
12 QPU cores, 1 GFLOPS
NVIDIA Turing
CUDA
7.5
Tensor Cores Gen 1
NVIDIA Ampere
CUDA
8.0/8.6
Tensor Cores Gen 3, TF32
NVIDIA Ada
CUDA
8.9
Tensor Cores Gen 4, FP8
NVIDIA Hopper
CUDA
9.0
Transformer Engine
NVIDIA Blackwell
CUDA
10.0
Tensor Cores Gen 5, FP8
AMD RDNA2/3
Vulkan
1.3
Compute units
Intel Arc
Vulkan
1.3
Xe-HPG cores
Platform Detection (optmath::platform)
Function
Signature
Description
detect_cpu_info
const CpuInfo& detect_cpu_info()
Detect CPU topology, cache sizes, features (cached)
get_performance_cores
std::vector<int> get_performance_cores()
Get CPU IDs of big cores (A720)
get_efficiency_cores
std::vector<int> get_efficiency_cores()
Get CPU IDs of LITTLE cores (A520)
pin_thread_to_performance_cores
int pin_thread_to_performance_cores()
Pin calling thread to big cores
pin_thread_to_core
int pin_thread_to_core(int cpu_id)
Pin calling thread to specific core
get_sve_vector_length
int get_sve_vector_length()
SVE vector length in bytes (0 if unavailable)
get_l2_cache_size
std::size_t get_l2_cache_size()
Per-core L2 cache size (performance core)
get_l3_cache_size
std::size_t get_l3_cache_size()
Shared L3 cache size
Function
Signature
Description
get_gemm_mc
std::size_t get_gemm_mc()
MC parameter (256 for L3>=8MB, 128 otherwise)
get_gemm_kc
std::size_t get_gemm_kc()
KC parameter (512 for L3>=8MB, 256 otherwise)
get_gemm_nc
std::size_t get_gemm_nc()
NC parameter (2048 for L3>=8MB, 512 otherwise)
MIT License - Copyright (c) 2026 Dr Robert W McGwier, PhD