CI: Self-hosted GPU runners needed to fully test VkFFT backends #75

@hjmjohnson

Summary

The C++ build/compile legs of this module are validated on GitHub-hosted runners (ubuntu-24.04, macos-15, windows-2022), but FFT correctness cannot be exercised there. The hosted runners have no real GPU; the only OpenCL ICD available is pocl (CPU), whose VkFFT kernel results diverge from real-GPU output and fail baseline image comparison. As a result, hosted CI runs only the two lightweight checks (VkFFTBackendKWStyleTest, VkFFTBackendInDoxygenGroup) and skips every functional FFT test.

Full validation requires self-hosted GPU-enabled runners.
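As a rough illustration of the pocl problem described above, a test harness could refuse to run functional FFT tests when pocl is the only OpenCL platform. The sketch below is an assumption, not code from this repository: the helper name and the marker strings are hypothetical, and it only filters platforms by name (for example as listed by clinfo). It does not prove that the remaining platform is backed by a GPU.

```python
# Hypothetical helper, not part of this repository.
# Assumption: pocl platforms can be recognized by these substrings in the
# platform name; adjust the markers to whatever your ICD actually reports.
POCL_MARKERS = ("portable computing language", "pocl")


def has_non_pocl_platform(platform_names):
    """Return True if at least one platform name does not look like pocl."""
    for name in platform_names:
        lowered = name.lower()
        if not any(marker in lowered for marker in POCL_MARKERS):
            return True
    return False
```

On a hosted runner where pocl is the only ICD, `has_non_pocl_platform(["Portable Computing Language"])` returns `False`, which matches the decision to skip the functional tests there.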

Current state

Two workflows already target self-hosted GPU runners, but no matching runner is online to claim their jobs. Until they were gated, those jobs sat queued indefinitely (12h+) on every push and PR:

| Workflow | runs-on labels | Needs |
|---|---|---|
| test-gpu.yml | [self-hosted, gpu] | GPU + OpenCL/CUDA/HIP/Level-Zero/Metal driver |
| test-notebooks.yml | [self-hosted, notebook-gpu] | GPU + Jupyter/nbmake + clinfo device |

These are now gated behind workflow_dispatch or a gpu-ci PR label so they no longer accumulate stuck queued jobs. They will only do useful work once a matching runner is online.
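The gate described above amounts to a small predicate over the triggering event and the PR labels. The following is a minimal sketch of that logic only; the actual workflow conditions are not shown in this issue, so the function name and its handling of other event types are assumptions.

```python
# Sketch of the gating rule only; the real condition lives in the workflow
# files and may differ in detail.
GPU_CI_LABEL = "gpu-ci"


def should_run_gpu_jobs(event_name, pr_labels=()):
    """Return True when the GPU workflows should be allowed to queue jobs."""
    if event_name == "workflow_dispatch":
        # Manual runs from the Actions "Run workflow" button always pass.
        return True
    if event_name == "pull_request":
        # PRs opt in by carrying the gpu-ci label.
        return GPU_CI_LABEL in pr_labels
    # Assumption: plain pushes and all other events stay gated.
    return False
```

Under this rule a plain push never queues a GPU job, so nothing accumulates while no runner is online.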

What's needed

  1. Register self-hosted runner(s) for InsightSoftwareConsortium/ITKVkFFTBackend (or an org runner group this repo can use) with the labels gpu and notebook-gpu.

  2. The runner host should expose at least one real GPU backend so VkFFT can be exercised end-to-end. Coverage goal across the VKFFT_BACKEND modes:

    | VKFFT_BACKEND | Backend | Requires |
    |---|---|---|
    | 1 | CUDA | NVIDIA GPU + CUDA toolkit |
    | 2 | HIP | AMD GPU + ROCm |
    | 3 | OpenCL | any GPU + real (non-pocl) OpenCL ICD |
    | 4 | Level Zero | Intel GPU + oneAPI Level Zero |
    | 5 | Metal | Apple Silicon + macOS |

    A single NVIDIA host covers backends 1 and 3; full matrix coverage needs additional hosts.

  3. Once a runner is live, functional FFT tests (currently skipped on hosted CI) and the notebook tests will run via the existing gated workflows — add the gpu-ci label to a PR or use the Actions "Run workflow" button.
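To plan which hosts are still needed, the coverage table above can be expressed as a small lookup. This is a hypothetical planning sketch: `RunnerHost`, the vendor strings, and both functions are illustrative names, and it assumes each host has the matching toolkit installed and, by default, ships a real OpenCL ICD.

```python
from dataclasses import dataclass

# VKFFT_BACKEND values from the table above, keyed by GPU vendor.
# The vendor strings are illustrative, not identifiers used by the repo.
VENDOR_BACKEND = {"nvidia": 1, "amd": 2, "intel": 4, "apple": 5}
OPENCL_BACKEND = 3
ALL_BACKENDS = {1, 2, 3, 4, 5}


@dataclass(frozen=True)
class RunnerHost:
    vendor: str
    real_opencl_icd: bool = True  # assumption: vendor driver provides one


def covered_backends(hosts):
    """Return the set of VKFFT_BACKEND values the given hosts can exercise."""
    covered = set()
    for host in hosts:
        native = VENDOR_BACKEND.get(host.vendor)
        if native is not None:
            covered.add(native)
        if host.real_opencl_icd:
            covered.add(OPENCL_BACKEND)
    return covered


def missing_backends(hosts):
    """Return the backends that still need an additional host."""
    return ALL_BACKENDS - covered_backends(hosts)
```

For a single NVIDIA host, `covered_backends([RunnerHost("nvidia")])` gives `{1, 3}` and `missing_backends` gives `{2, 4, 5}`, which is the gap the full matrix would need extra hosts to close.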

Why this matters

Without GPU CI, every backend change (e.g. the Level Zero backend added in #73, the OpenCL multi-ICD fix, CUDA 13 API updates) is only smoke-tested for compilation. Regressions in actual FFT output can land undetected because no automated job computes a transform on real hardware and compares against the baseline images.
