
feat: GPU keep-on-device + kvikio (GDS) reader + pipeline GPU wiring #112

Open
FIrgolitsch wants to merge 13 commits into sphinx-config from pr-m-gpu-kvikio

Conversation

Contributor

@FIrgolitsch FIrgolitsch commented Apr 30, 2026

Stacked PR 15/16. Review order: #115, #97, #98, #99, #100, #101, #108, #106, #107, #87, #116, #110, #111, #40, #112, #113

Base: sphinx-config (#40). Will be retargeted to main as upstream PRs merge.


PR — GPU keep-on-device + kvikio (GDS) reader + pipeline GPU wiring

Extends the GPU stack with end-to-end on-device data flow for the OCT reconstruction pipeline and adds a GPUDirect Storage (GDS) reader as a fast path for reading uncompressed zarr arrays straight into device memory.

GPU keep-on-device

  • New linumpy.gpu.zarr_io with gpu_zarr_context() (uses zarr.config.enable_gpu()) and read_zarr_to_gpu(...) with auto backend selection (kvikio when available, zarr-gpu otherwise).
  • linumpy.gpu.interpolation: device-preserving resize, affine_transform, map_coordinates.
  • New linumpy.gpu.interface with a GPU implementation of find_tissue_interface (no-mask path) using cupyx filters.
  • linumpy.geometry.interface.find_tissue_interface(..., use_gpu=...) and linumpy.mosaic.stacking.find_z_overlap(..., use_gpu=...) now route to GPU when requested.
  • linum_aip.py and linum_resample_mosaic_grid.py use gpu_zarr_context to keep tiles on-device through the slab loop and writer.
  • linum_detect_focal_curvature.py: vectorized roll via take_along_axis (xp dispatch) and --use_gpu/--no-use_gpu.
  • linum_stack_slices_motor.py: --use_gpu/--no-use_gpu plumbed to find_z_overlap.

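The vectorized roll mentioned above can be sketched with NumPy; the real linum_detect_focal_curvature.py code dispatches `xp` to numpy or cupy, and the function and argument names here are illustrative, not linumpy's actual API:

```python
import numpy as np  # xp stands in for the numpy/cupy dispatch module


def roll_columns(img, shifts, xp=np):
    """Roll each column of a 2D array by its own shift, without a Python loop.

    Illustrative re-implementation of the take_along_axis trick: build a
    per-column source-row index and gather along axis 0 in one call.
    """
    n_rows = img.shape[0]
    rows = xp.arange(n_rows)[:, None]               # shape (n_rows, 1)
    # result[i, j] = img[(i - shifts[j]) % n_rows, j], i.e. np.roll per column
    idx = (rows - xp.asarray(shifts)[None, :]) % n_rows
    return xp.take_along_axis(img, idx, axis=0)


a = np.arange(12).reshape(4, 3)
rolled = roll_columns(a, [0, 1, 2])
```

With `xp=cupy` the same code runs entirely on-device, which is what keeps the slab loop from bouncing through host memory.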
kvikio (GDS) reader (prototype)

  • linumpy/gpu/kvikio_zarr.py: GDS reader for raw uncompressed zarr v2 + v3.
    • Refuses incompatible arrays (compressed, filtered, non-C order, mismatched endian) with NotImplementedError.
  • Uses a contiguous scratch buffer as the destination for CuFile.pread.
  • scripts/linum_benchmark_kvikio_zarr.py: benchmark with kvikio and zarr.config.enable_gpu() paths for comparison.
  • read_zarr_to_gpu falls back to zarr-gpu when kvikio is in compat mode, when arrays aren't GDS-compatible, or on any runtime failure.
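The compatibility gate and fallback described above can be sketched in pure Python; `GDSIncompatible`, the `meta` dict keys, and the reader callables are illustrative stand-ins, not linumpy's actual schema:

```python
import numpy as np


class GDSIncompatible(NotImplementedError):
    """Raised when an array cannot go through the raw GDS fast path."""


def check_gds_compatible(meta):
    # Mirror the refusal rules above; `meta` keys are illustrative.
    if meta.get("compressor") is not None:
        raise GDSIncompatible("compressed chunks need nvCOMP")
    if meta.get("filters"):
        raise GDSIncompatible("filters are not supported")
    if meta.get("order", "C") != "C":
        raise GDSIncompatible("non-C order")
    if np.dtype(meta["dtype"]).byteorder not in ("=", "|"):
        raise GDSIncompatible("non-native byte order")


def read_to_gpu(meta, kvikio_read, fallback_read):
    """Try the GDS fast path; fall back to the zarr-gpu path on any failure."""
    try:
        check_gds_compatible(meta)
        return kvikio_read(meta)
    except (GDSIncompatible, RuntimeError):
        return fallback_read(meta)
```

Subclassing NotImplementedError matches the refusal behaviour described for kvikio_zarr.py, and catching RuntimeError as well gives the "any runtime failure" fallback.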

Server / build

  • shell_scripts/server_setup/nvfs_kernel7_patch.sh: nvidia-fs 2.28.4 patch for kernel 7.0; symvers helper now also handles .ko.zst.
  • pyproject.toml: bump ome-zarr to >=0.16.0 (NGFF 0.5).

Nextflow pipeline GPU wiring

  • fix_focal_curvature and stack processes pass --use_gpu/--no-use_gpu from params.use_gpu.
  • nextflow.config: withName: "fix_focal_curvature" gets maxForks = params.use_gpu ? 4 : null.
  • withName: "resample_mosaic_grid": maxForks = params.use_gpu ? 6 : null (measured ~1 GB GPU mem per fork; IO-gated).
  • _run_pipelined: prefetch + GPU compute pipeline; periodic free of cupy memory pool.
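A minimal sketch of the prefetch-plus-compute overlap that _run_pipelined implements, using a one-worker thread pool for I/O. The hook names and signature are illustrative, and the periodic cupy pool free is shown only as a comment:

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipelined(tile_ids, load, compute, free_pool_every=8):
    """Overlap loading of tile i+1 with computation on tile i.

    `load` and `compute` are illustrative hooks, not linumpy's actual API.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(load, tile_ids[0])              # prefetch the first tile
        for i, tid in enumerate(tile_ids):
            tile = nxt.result()                         # wait for prefetched data
            if i + 1 < len(tile_ids):
                nxt = io.submit(load, tile_ids[i + 1])  # prefetch the next tile
            results.append(compute(tile))               # compute while I/O runs
            if (i + 1) % free_pool_every == 0:
                # real code: cupy.get_default_memory_pool().free_all_blocks()
                pass
    return results
```

The single I/O worker keeps reads ordered while still hiding their latency behind the GPU compute of the previous tile.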

FIrgolitsch and others added 13 commits April 29, 2026 22:46
linumpy/gpu/kvikio_zarr.py: read_zarr_v2_to_gpu() loads an uncompressed
zarr v2 array directly into a CuPy array using kvikio.CuFile.pread,
bypassing the host bounce buffer when GDS is active.

scripts/linum_benchmark_kvikio_zarr.py: benchmarks the GDS path against
the conventional zarr.open + numpy + cupy.asarray path; supports
synthetic dataset generation.

Prototype scope: zarr v2, compressor=None, order=C. Compressed chunks
(blosc/lz4) need nvCOMP for on-device decompression and are out of
scope here.
CuFile.pread requires a contiguous device buffer, and out[slices] is
generally a non-contiguous view. Read each chunk into a single reused
chunk-shaped scratch buffer, then copy it into the output.
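A host-side stand-in for that scratch-buffer pattern, assuming for simplicity that all chunks sit back to back in one raw file and that the shape divides evenly into chunks (the real reader opens one zarr chunk file per chunk and uses CuFile.pread into device memory):

```python
from itertools import product

import numpy as np


def read_chunked_raw(f, shape, chunks, dtype=np.float32):
    """Assemble an array from consecutively stored raw C-order chunks.

    The read call needs a contiguous destination, while out[sl] is generally
    a non-contiguous view, so each chunk lands in the reused `scratch`
    buffer first and is then copied into place. Assumes shape is an exact
    multiple of chunks.
    """
    out = np.empty(shape, dtype=dtype)
    scratch = np.empty(chunks, dtype=dtype)          # reused contiguous buffer
    grid = [range(0, s, c) for s, c in zip(shape, chunks)]
    for origin in product(*grid):                    # chunks in row-major order
        f.readinto(scratch)                          # stand-in for CuFile.pread
        sl = tuple(slice(o, o + c) for o, c in zip(origin, chunks))
        out[sl] = scratch                            # copy into the strided view
    return out
```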
…c_grid process

Co-authored-by: Copilot <copilot@github.com>