feat: optional GPU acceleration for SFAP computation via CuPy#11
Merged
Conversation
d3c8cbb to
9122b07
Compare
Add xp (array backend) parameter to core bioelectric functions for direct CuPy dispatch, eliminating numpy __array_function__ overhead. GPU path in calc_sfaps() uses CUDA streams for concurrent execution. Changes: - bioelectric.py: add xp=np param to get_tm_current_dz, get_current_density, get_elementary_current_response - motor_unit_sim.py: GPU branch with CuPy detection, CUDA streams, graceful CPU fallback Performance (RTX 3090, 200 fibers x 25 electrodes): - dz=0.1: 8.2x speedup (10.4s CPU -> 1.3s GPU) - dz=0.01: 27.3x speedup (27.1s CPU -> 1.0s GPU) 3-way parity: RMSE < 2e-19, xcorr = 1.000000000000000 Fully backward-compatible: xp=np default, no API breakage.
9122b07 to
c1f5adf
Compare
…tate, hardened tests
Contributor
Author
|
Pushed a follow-up commit (820ca5e) with quality improvements:
No performance regression — benchmark confirmed 8.1× @ dz=0.1, 27.7× @ dz=0.01. |
…mpy/cupy mixing When callers pass numpy arrays or numpy scalars (e.g. from pint .magnitude indexing) to get_current_density() or get_elementary_current_response() with xp=cupy, CuPy raises TypeError because it cannot mix numpy.ndarray with cupy.ndarray. Add xp.asarray() for array inputs and float() casts for scalar parameters at function entry. Each xp-aware function now owns its backend contract — callers need not pre-convert.
DRohlf
approved these changes
May 24, 2026
Collaborator
DRohlf
left a comment
There was a problem hiding this comment.
Well done, GPU/CPU parity is clean and the use_gpu and xp-aware kernels are a good pattern.
Two non-blocking notes, don't block the PR for these:
- Per-fiber loop does cp.asarray(z) / cp.asarray(radial_distance) and builds z via np.arange/np.concatenate on the cpu. Fine at smaller scales, but at larger production motor unit sizes the transfer time will start to dominate.
- Consider a cp.get_default_memory_pool().free_all_blocks() after cp.asnumpy(sfaps_dev) to release pool memory. This is important for memory allocation on smaller GPUs or with large simulated muscles.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
GPU Acceleration PR — Benchmark & Parity Data
PR Summary
Closes #8. Optional GPU acceleration for SFAP computation via CuPy. Adds
xp=np(array backend) parameter to core bioelectric functions for direct CuPy dispatch, eliminating numpy__array_function__protocol overhead. GPU path incalc_sfaps()uses CUDA streams for concurrent kernel execution with graceful CPU fallback.Changes (2 files, +15 / +64 net lines)
bioelectric.py(+15 net)xp=npparameter toget_tm_current_dz,get_current_density,get_elementary_current_responsenp.xxx→xp.xxxinside function bodies (mechanical substitution)shift_paddingfor xp-compatibility whenxp != npxp=nppreserves full backward compatibility — zero API breakagemotor_unit_sim.py(+64 net)HAS_CUPYflag) with Windows CUDA DLL discoverycalc_sfaps(): CUDA streams +xp=cpdispatchBenchmark: CPU vs GPU (dz=0.1, default)
Benchmark: GPU Speedup Scaling with dz
3-Way Numerical Parity
Three-way regression test comparing:
mainbranch, NumPyxp=np(NumPy path)xp=cp(CuPy path)Test configuration
6.4294464436e-02Design Decisions
Why
xp=npparameter instead of numpy__array_function__?NumPy's
__array_function__protocol dispatches CuPy calls through a Python-level indirection layer, adding ~0.1-0.5ms overhead per call. With ~500 dispatched calls per fiber (meshgrid, diff, exp, roll × electrodes), this overhead dominated GPU execution time:Why CUDA streams?
Streams enable the GPU to overlap memory transfers with kernel execution. With 4 non-blocking streams, fibers are round-robin assigned, allowing pipeline parallelism without Python threading overhead.
Model-agnostic design
The
xpparameter does not constrain the physics model:calc_sfaps) treats bioelectric functions as black boxesxp.xxxconventionEnvironment