Skip to content

feat: optional GPU acceleration for SFAP computation via CuPy#11

Merged
DRohlf merged 3 commits into
NsquaredLab:mainfrom
veylantis:feat/gpu-cupy-backend
May 24, 2026
Merged

feat: optional GPU acceleration for SFAP computation via CuPy#11
DRohlf merged 3 commits into
NsquaredLab:mainfrom
veylantis:feat/gpu-cupy-backend

Conversation

@veylantis
Copy link
Copy Markdown
Contributor

@veylantis veylantis commented May 19, 2026

GPU Acceleration PR — Benchmark & Parity Data

PR Summary

Closes #8. Optional GPU acceleration for SFAP computation via CuPy. Adds xp=np (array backend) parameter to core bioelectric functions for direct CuPy dispatch, eliminating numpy __array_function__ protocol overhead. GPU path in calc_sfaps() uses CUDA streams for concurrent kernel execution with graceful CPU fallback.


Changes (2 files, +15 / +64 net lines)

bioelectric.py (+15 net)

  • Added xp=np parameter to get_tm_current_dz, get_current_density, get_elementary_current_response
  • Replaced np.xxxxp.xxx inside function bodies (mechanical substitution)
  • Inlined shift_padding for xp-compatibility when xp != np
  • Default xp=np preserves full backward compatibility — zero API breakage

motor_unit_sim.py (+64 net)

  • Added CuPy detection (HAS_CUPY flag) with Windows CUDA DLL discovery
  • GPU branch in calc_sfaps(): CUDA streams + xp=cp dispatch
  • CPU branch: unchanged original code

Benchmark: CPU vs GPU (dz=0.1, default)

Config Time Speedup
CPU (NumPy) 10.43s baseline
GPU (CuPy) 1.27s 8.2×

200 fibers × 25 electrodes × 2 rounds, RTX 3090, after JIT warmup

benchmark_cpu_vs_gpu

Benchmark: GPU Speedup Scaling with dz

dz Spatial pts/fiber CPU GPU Speedup
0.500 122 0.54s 0.27s 2.0×
0.100 614 2.71s 0.31s 8.7×
0.050 1,228 5.39s 0.40s 13.6×
0.020 3,070 13.64s 0.65s 20.8×
0.010 6,140 27.13s 0.99s 27.3×

50 fibers × 25 electrodes, RTX 3090

benchmark_dz_scaling

3-Way Numerical Parity

Three-way regression test comparing:

  1. original_cpu — unmodified main branch, NumPy
  2. modified_cpu — this PR, xp=np (NumPy path)
  3. modified_gpu — this PR, xp=cp (CuPy path)
Comparison xcorr RMSE max|diff| Status
original_cpu vs modified_cpu 1.000000000000000 1.28e-19 2.17e-18
original_cpu vs modified_gpu 1.000000000000000 1.92e-19 3.47e-18
modified_cpu vs modified_gpu 1.000000000000000 2.01e-19 3.47e-18

OVERALL: ALL PASSED — differences at floating-point rounding level (~1e-19)

Test configuration

  • Seed: 42 (deterministic)
  • 20 fibers × 4 electrodes, dz=0.1
  • All three norms identical: 6.4294464436e-02

Design Decisions

Why xp=np parameter instead of numpy __array_function__?

NumPy's __array_function__ protocol dispatches CuPy calls through a Python-level indirection layer, adding ~0.1-0.5ms overhead per call. With ~500 dispatched calls per fiber (meshgrid, diff, exp, roll × electrodes), this overhead dominated GPU execution time:

Approach GPU time (200 fibers) Speedup
numpy dispatch 7.93s 1.3×
direct xp=cp 1.27s 8.2×

Why CUDA streams?

Streams enable the GPU to overlap memory transfers with kernel execution. With 4 non-blocking streams, fibers are round-robin assigned, allowing pipeline parallelism without Python threading overhead.

Model-agnostic design

The xp parameter does not constrain the physics model:

  • Caller (calc_sfaps) treats bioelectric functions as black boxes
  • Any upstream model revamp only needs to maintain xp.xxx convention
  • CPU path is completely untouched

Environment

  • Python 3.12
  • CuPy 13.x (cupy-cuda12x)
  • NVIDIA GeForce RTX 3090
  • Windows 11 / CUDA 12.x

@veylantis veylantis force-pushed the feat/gpu-cupy-backend branch from d3c8cbb to 9122b07 Compare May 19, 2026 15:31
Add xp (array backend) parameter to core bioelectric functions for
direct CuPy dispatch, eliminating numpy __array_function__ overhead.
GPU path in calc_sfaps() uses CUDA streams for concurrent execution.

Changes:
- bioelectric.py: add xp=np param to get_tm_current_dz,
  get_current_density, get_elementary_current_response
- motor_unit_sim.py: GPU branch with CuPy detection, CUDA streams,
  graceful CPU fallback

Performance (RTX 3090, 200 fibers x 25 electrodes):
- dz=0.1:  8.2x speedup (10.4s CPU -> 1.3s GPU)
- dz=0.01: 27.3x speedup (27.1s CPU -> 1.0s GPU)

3-way parity: RMSE < 2e-19, xcorr = 1.000000000000000
Fully backward-compatible: xp=np default, no API breakage.
@veylantis
Copy link
Copy Markdown
Contributor Author

Pushed a follow-up commit (820ca5e) with quality improvements:

  1. shift_padding → xp-aware: Restored as a proper function with xp=np parameter instead of inlining — single source of truth for both backends
  2. Eliminated device-to-host sync: int(xp.round(…))round(float(…))
  3. use_gpu tri-state API:
    def calc_sfaps(..., use_gpu: bool | None = None):
        # None  → auto (GPU if available AND MYOGEN_DISABLE_GPU not set)
        # True  → require GPU, raises RuntimeError if unavailable
        # False → force CPU
  4. Hardened tests: pytest.importorskip("cupy") + cupy.cuda.is_available() for graceful skip on CPU-only CI. Added TestShiftPadding contract tests (positive/negative/zero shift + GPU parity)
  5. PEP 8: Split multi-statement ; line in __init__.py
  6. Inline comments for .astype(xp.float64) and xp.ravel() rationale

No performance regression — benchmark confirmed 8.1× @ dz=0.1, 27.7× @ dz=0.01.

…mpy/cupy mixing

When callers pass numpy arrays or numpy scalars (e.g. from pint
.magnitude indexing) to get_current_density() or
get_elementary_current_response() with xp=cupy, CuPy raises
TypeError because it cannot mix numpy.ndarray with cupy.ndarray.

Add xp.asarray() for array inputs and float() casts for scalar
parameters at function entry.  Each xp-aware function now owns
its backend contract — callers need not pre-convert.
@RaulSimpetru RaulSimpetru changed the title feat: optional GPU acceleration for SFAP computation via CuPy feat: optional GPU acceleration for SFAP computation via CuPy closes #8 May 19, 2026
@RaulSimpetru RaulSimpetru changed the title feat: optional GPU acceleration for SFAP computation via CuPy closes #8 feat: optional GPU acceleration for SFAP computation via CuPy May 19, 2026
@RaulSimpetru RaulSimpetru requested a review from DRohlf May 19, 2026 19:17
Copy link
Copy Markdown
Collaborator

@DRohlf DRohlf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, GPU/CPU parity is clean and the use_gpu and xp-aware kernels are a good pattern.

Two non-blocking notes, don't block the PR for these:

  1. Per-fiber loop does cp.asarray(z) / cp.asarray(radial_distance) and builds z via np.arange/np.concatenate on the cpu. Fine at smaller scales, but at larger production motor unit sizes the transfer time will start to dominate.
  2. Consider a cp.get_default_memory_pool().free_all_blocks() after cp.asnumpy(sfaps_dev) to release pool memory. This is important for memory allocation on smaller GPUs or with large simulated muscles.

@DRohlf DRohlf merged commit c061601 into NsquaredLab:main May 24, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] GPU-accelerated SFAP computation for simulate_muaps() via CuPy backend

2 participants