feat: optional GPU acceleration for SFAP computation via CuPy by veylantis · Pull Request #11 · NsquaredLab/MyoGen

veylantis · 2026-05-19T12:55:27Z

GPU Acceleration PR — Benchmark & Parity Data

PR Summary

Closes #8. Optional GPU acceleration for SFAP computation via CuPy. Adds xp=np (array backend) parameter to core bioelectric functions for direct CuPy dispatch, eliminating numpy __array_function__ protocol overhead. GPU path in calc_sfaps() uses CUDA streams for concurrent kernel execution with graceful CPU fallback.

Changes (2 files, +15 / +64 net lines)

`bioelectric.py` (+15 net)

Added xp=np parameter to get_tm_current_dz, get_current_density, get_elementary_current_response
Replaced np.xxx → xp.xxx inside function bodies (mechanical substitution)
Inlined shift_padding for xp-compatibility when xp != np
Default xp=np preserves full backward compatibility — zero API breakage

`motor_unit_sim.py` (+64 net)

Added CuPy detection (HAS_CUPY flag) with Windows CUDA DLL discovery
GPU branch in calc_sfaps(): CUDA streams + xp=cp dispatch
CPU branch: unchanged original code

Benchmark: CPU vs GPU (dz=0.1, default)

Config	Time	Speedup
CPU (NumPy)	10.43s	baseline
GPU (CuPy)	1.27s	8.2×

200 fibers × 25 electrodes × 2 rounds, RTX 3090, after JIT warmup

Benchmark: GPU Speedup Scaling with dz

dz	Spatial pts/fiber	CPU	GPU	Speedup
0.500	122	0.54s	0.27s	2.0×
0.100	614	2.71s	0.31s	8.7×
0.050	1,228	5.39s	0.40s	13.6×
0.020	3,070	13.64s	0.65s	20.8×
0.010	6,140	27.13s	0.99s	27.3×

50 fibers × 25 electrodes, RTX 3090

3-Way Numerical Parity

Three-way regression test comparing:

original_cpu — unmodified main branch, NumPy
modified_cpu — this PR, xp=np (NumPy path)
modified_gpu — this PR, xp=cp (CuPy path)

Comparison	xcorr	RMSE	max\|diff\|	Status
original_cpu vs modified_cpu	1.000000000000000	1.28e-19	2.17e-18	✅
original_cpu vs modified_gpu	1.000000000000000	1.92e-19	3.47e-18	✅
modified_cpu vs modified_gpu	1.000000000000000	2.01e-19	3.47e-18	✅

OVERALL: ALL PASSED — differences at floating-point rounding level (~1e-19)

Test configuration

Seed: 42 (deterministic)
20 fibers × 4 electrodes, dz=0.1
All three norms identical: 6.4294464436e-02

Design Decisions

Why `xp=np` parameter instead of numpy `__array_function__`?

NumPy's __array_function__ protocol dispatches CuPy calls through a Python-level indirection layer, adding ~0.1-0.5ms overhead per call. With ~500 dispatched calls per fiber (meshgrid, diff, exp, roll × electrodes), this overhead dominated GPU execution time:

Approach	GPU time (200 fibers)	Speedup
numpy dispatch	7.93s	1.3×
direct xp=cp	1.27s	8.2×

Why CUDA streams?

Streams enable the GPU to overlap memory transfers with kernel execution. With 4 non-blocking streams, fibers are round-robin assigned, allowing pipeline parallelism without Python threading overhead.

Model-agnostic design

The xp parameter does not constrain the physics model:

Caller (calc_sfaps) treats bioelectric functions as black boxes
Any upstream model revamp only needs to maintain xp.xxx convention
CPU path is completely untouched

Environment

Python 3.12
CuPy 13.x (cupy-cuda12x)
NVIDIA GeForce RTX 3090
Windows 11 / CUDA 12.x

Add xp (array backend) parameter to core bioelectric functions for direct CuPy dispatch, eliminating numpy __array_function__ overhead. GPU path in calc_sfaps() uses CUDA streams for concurrent execution. Changes: - bioelectric.py: add xp=np param to get_tm_current_dz, get_current_density, get_elementary_current_response - motor_unit_sim.py: GPU branch with CuPy detection, CUDA streams, graceful CPU fallback Performance (RTX 3090, 200 fibers x 25 electrodes): - dz=0.1: 8.2x speedup (10.4s CPU -> 1.3s GPU) - dz=0.01: 27.3x speedup (27.1s CPU -> 1.0s GPU) 3-way parity: RMSE < 2e-19, xcorr = 1.000000000000000 Fully backward-compatible: xp=np default, no API breakage.

…tate, hardened tests

veylantis · 2026-05-19T16:26:48Z

Pushed a follow-up commit (820ca5e) with quality improvements:

shift_padding → xp-aware: Restored as a proper function with xp=np parameter instead of inlining — single source of truth for both backends
Eliminated device-to-host sync: int(xp.round(…)) → round(float(…))

use_gpu tri-state API:

def calc_sfaps(..., use_gpu: bool | None = None):
    # None  → auto (GPU if available AND MYOGEN_DISABLE_GPU not set)
    # True  → require GPU, raises RuntimeError if unavailable
    # False → force CPU

Hardened tests: pytest.importorskip("cupy") + cupy.cuda.is_available() for graceful skip on CPU-only CI. Added TestShiftPadding contract tests (positive/negative/zero shift + GPU parity)
PEP 8: Split multi-statement ; line in __init__.py
Inline comments for .astype(xp.float64) and xp.ravel() rationale

No performance regression — benchmark confirmed 8.1× @ dz=0.1, 27.7× @ dz=0.01.

…mpy/cupy mixing When callers pass numpy arrays or numpy scalars (e.g. from pint .magnitude indexing) to get_current_density() or get_elementary_current_response() with xp=cupy, CuPy raises TypeError because it cannot mix numpy.ndarray with cupy.ndarray. Add xp.asarray() for array inputs and float() casts for scalar parameters at function entry. Each xp-aware function now owns its backend contract — callers need not pre-convert.

DRohlf

Well done, GPU/CPU parity is clean and the use_gpu and xp-aware kernels are a good pattern.

Two non-blocking notes, don't block the PR for these:

Per-fiber loop does cp.asarray(z) / cp.asarray(radial_distance) and builds z via np.arange/np.concatenate on the cpu. Fine at smaller scales, but at larger production motor unit sizes the transfer time will start to dominate.
Consider a cp.get_default_memory_pool().free_all_blocks() after cp.asnumpy(sfaps_dev) to release pool memory. This is important for memory allocation on smaller GPUs or with large simulated muscles.

veylantis force-pushed the feat/gpu-cupy-backend branch from d3c8cbb to 9122b07 Compare May 19, 2026 15:31

veylantis force-pushed the feat/gpu-cupy-backend branch from 9122b07 to c1f5adf Compare May 19, 2026 15:51

veylantis mentioned this pull request May 19, 2026

[Bug] min_radial_dist unit mismatch in MotorUnitSim.calc_sfaps() — µm value used in mm context #9

Closed

refactor: address code review - xp-aware shift_padding, use_gpu tri-s…

820ca5e

…tate, hardened tests

RaulSimpetru changed the title ~~feat: optional GPU acceleration for SFAP computation via CuPy~~ feat: optional GPU acceleration for SFAP computation via CuPy closes #8 May 19, 2026

RaulSimpetru changed the title ~~feat: optional GPU acceleration for SFAP computation via CuPy closes #8~~ feat: optional GPU acceleration for SFAP computation via CuPy May 19, 2026

RaulSimpetru requested a review from DRohlf May 19, 2026 19:17

DRohlf approved these changes May 24, 2026

View reviewed changes

DRohlf merged commit c061601 into NsquaredLab:main May 24, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: optional GPU acceleration for SFAP computation via CuPy#11

feat: optional GPU acceleration for SFAP computation via CuPy#11
DRohlf merged 3 commits into
NsquaredLab:mainfrom
veylantis:feat/gpu-cupy-backend

veylantis commented May 19, 2026 •

edited by RaulSimpetru

Loading

Uh oh!

veylantis commented May 19, 2026

Uh oh!

DRohlf left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

veylantis commented May 19, 2026 • edited by RaulSimpetru Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

GPU Acceleration PR — Benchmark & Parity Data

PR Summary

Changes (2 files, +15 / +64 net lines)

bioelectric.py (+15 net)

motor_unit_sim.py (+64 net)

Benchmark: CPU vs GPU (dz=0.1, default)

Benchmark: GPU Speedup Scaling with dz

3-Way Numerical Parity

Test configuration

Design Decisions

Why xp=np parameter instead of numpy __array_function__?

Why CUDA streams?

Model-agnostic design

Environment

Uh oh!

veylantis commented May 19, 2026

Uh oh!

DRohlf left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

veylantis commented May 19, 2026 •

edited by RaulSimpetru

Loading

`bioelectric.py` (+15 net)

`motor_unit_sim.py` (+64 net)

Why `xp=np` parameter instead of numpy `__array_function__`?