Add GPU-capable demo cluster image#7243

Merged
pingsutw merged 7 commits into v2 from feat/demo-bundled-gpu on Apr 21, 2026
Conversation


@pingsutw pingsutw commented Apr 20, 2026

Summary

Adds a GPU variant of the demo-bundled image so users on an NVIDIA-enabled host can run:

```bash
flyte start demo --image ghcr.io/flyteorg/flyte-demo:gpu-latest
```

and submit Flyte tasks with Resources(gpu=1) — no PodTemplate, no runtimeClassName needed — and get a GPU.

What's in the image

  • Dockerfile.gpu — stages NVIDIA Container Toolkit v1.19.x (nvidia-ctk, nvidia-container-runtime, libnvidia-container) into the rancher/k3s final image. Two subtle prereqs the OCI hook needs:
    • Libs under /usr/lib/<arch-triple>/ — the nvidia-ctk OCI hook runs without inheriting LD_LIBRARY_PATH.
    • A statically-linked /sbin/ldconfig from debian:bookworm-slim — rancher/k3s ships none, and the toolkit's update-ldcache hook bind-mounts it into workload pods.
  • containerd-config.toml.tmpl — sets default_runtime_name = "nvidia". GPU pods get GPUs automatically; non-GPU pods are unaffected (nvidia-container-runtime is a passthrough).
  • nvidia-device-plugin.yaml — RuntimeClass nvidia + DaemonSet (nvcr.io/nvidia/k8s-device-plugin:v0.17.0), auto-applied by k3s on startup.
  • Makefile — new build-gpu target.
  • CI — new build-and-push step publishing gpu-latest, gpu-nightly, and gpu-<sha> tags to both flyte-demo and flyte-sandbox-v2.
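For readers unfamiliar with the containerd side of this, a minimal sketch of what the default-runtime stanza in a k3s containerd config template can look like is below. This is illustrative only — the plugin section paths and `BinaryName` follow the common containerd CRI config layout, not the actual `containerd-config.toml.tmpl` in this PR:

```toml
# Sketch of a containerd config making nvidia the default runtime.
# Section paths and BinaryName are assumptions, not the PR's template.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"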

Companion change

A matching --gpu flag on flyte start demo (which adds --gpus all to the underlying docker run) will land in flyteorg/flyte-sdk. The image is still usable without it via a manual docker run --gpus all ….

Test plan

(Screenshot: 2026-04-20 at 5:19:34 PM)
```python
import flyte

torch_image = flyte.Image.from_debian_base(
    python_version=(3, 12),
    registry="localhost:30000",
).with_pip_packages("torch")

gpu_env = flyte.TaskEnvironment(
    name="torch_gpu",
    image=torch_image,
    resources=flyte.Resources(cpu=1, memory="2Gi", gpu=1),
)


@gpu_env.task
def check_torch() -> str:
    import torch

    info = (
        f"torch={torch.__version__}\n"
        f"cuda_available={torch.cuda.is_available()}\n"
        f"device_count={torch.cuda.device_count()}\n"
        f"device_name={torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}"
    )
    print(info)
    return info


if __name__ == "__main__":
    flyte.init_from_config()
    r = flyte.run(check_torch)
    print("name:", r.name)
    print("url:", r.url)
    r.wait()
```
  • Layered test image (NVIDIA toolkit + device plugin on top of flyte-demo:nightly) verified on an A10G: k3s auto-registers nvidia runtime, node advertises nvidia.com/gpu: 1, Flyte task with Resources(gpu=1) ran torch 2.11.0+cu130 reporting cuda_available=True, device_name="NVIDIA A10G".
  • Full multi-stage Dockerfile.gpu has NOT been built locally — this CI run is the first build. Expect potential fixup iterations.

Follow-ups

  • GPU type filters (Resources(gpu="A10G:1")) require GPU Feature Discovery to label the node.
  • MIG support (A100/H100 partitioning) needs a device-plugin ConfigMap with migStrategy.
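For the MIG follow-up, a hedged sketch of what such a device-plugin ConfigMap could look like is below. The key names follow the NVIDIA k8s-device-plugin config file format; the ConfigMap name, namespace, and strategy value are illustrative assumptions:

```yaml
# Sketch only: device-plugin config enabling a MIG strategy.
# Name/namespace/strategy are placeholders, not from this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: "single"
```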

Adds a GPU variant of the demo-bundled image so users with NVIDIA GPUs
can run `flyte start demo --image ghcr.io/flyteorg/flyte-demo:gpu-latest`
and submit tasks with `Resources(gpu=1)`.

- Dockerfile.gpu stages NVIDIA Container Toolkit v1.19.x binaries and
  their shared libs into the rancher/k3s final image. Libs are copied
  into /usr/lib/<triple>/ because the nvidia-ctk OCI hook runs without
  inheriting LD_LIBRARY_PATH. A statically-linked /sbin/ldconfig is
  also staged (rancher/k3s ships none) because the toolkit's
  update-ldcache hook bind-mounts it into workload pods.
- containerd-config.toml.tmpl sets nvidia as the default containerd
  runtime. Pods requesting nvidia.com/gpu get GPUs without needing
  runtimeClassName in their spec; non-GPU pods are unaffected
  (nvidia-container-runtime is a passthrough when no GPU is requested).
- nvidia-device-plugin.yaml installs a RuntimeClass and the NVIDIA
  k8s-device-plugin DaemonSet so nvidia.com/gpu is advertised on the
  node. Auto-applied by k3s at startup.
- Makefile gains a build-gpu target producing flyte-demo:gpu-latest.
- CI gains a build-and-push step publishing gpu-latest, gpu-nightly,
  and gpu-<sha> tags to both flyte-demo and flyte-sandbox-v2.

The GPU plumbing was verified end-to-end with a layered test image on
an A10G (torch 2.11.0+cu130 reported cuda_available=True). The full
multi-stage Dockerfile.gpu has not been built locally; the CI run here
is the first end-to-end test of the production Dockerfile and may
need fixup iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot mentioned this pull request Apr 20, 2026
Removes the duplicated builder/bootstrap/pg-cache stages and final-stage
setup by making Dockerfile.gpu a thin layer on top of flyte-demo:latest
(parameterized via ARG BASE_IMAGE). CI now builds the CPU image first
and passes its sha-tag in as BASE_IMAGE to the GPU build.

- Dockerfile.gpu shrinks from ~165 to ~75 lines; inherits flyte-binary,
  embedded postgres, staging manifests, and k3d entrypoint from the
  base image unchanged.
- Makefile build-gpu target now depends on build (not the full prereq
  chain) and passes BASE_IMAGE=flyte-demo:latest.
- CI gates the GPU build on push/workflow_dispatch since PR builds
  don't push the CPU image to ghcr.io (nothing to pull for BASE_IMAGE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
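The ARG-parameterized thin-layer pattern this commit describes can be sketched as follows; the comments stand in for the actual GPU-only stages, which are not reproduced here:

```dockerfile
# Sketch of the thin-layer pattern: the GPU image builds FROM the CPU
# image, overridable at build time. The real Dockerfile.gpu's toolkit
# staging steps are elided.
ARG BASE_IMAGE=flyte-demo:latest
FROM ${BASE_IMAGE}
# GPU-only additions (NVIDIA toolkit binaries, static ldconfig,
# containerd config template, device-plugin manifest) layer on top of
# the unchanged CPU image here.
```

CI can then inject the freshly built CPU image with `--build-arg BASE_IMAGE=<cpu-sha-tag>`.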
pingsutw and others added 4 commits April 21, 2026 00:30
Drops the `if:` gate and conditions `push:` on the same expression the
CPU build uses, so both steps always build and only push on v2-branch
pushes or workflow_dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On pull_request events the CPU build step runs with push=false, so the
GPU build's FROM ghcr.io/.../flyte-demo:sha-<sha> fails to resolve
(image not found in the registry). Fix by producing an OCI archive of
the CPU image locally and passing it to the GPU build as a named build
context (build-contexts: base=oci-layout://...) with BASE_IMAGE=base.

Registry push happens in a separate step that only runs on push /
workflow_dispatch, so PR builds no longer need ghcr credentials for
the GPU step.
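A hedged sketch of the named-build-context wiring described above, using docker/build-push-action's `build-contexts` input; step name, paths, and action version are illustrative, not copied from the workflow:

```yaml
# Sketch: feed the locally archived CPU image to the GPU build as a
# named context, so PR builds need no registry pull. Paths are
# placeholders.
- name: Build GPU image from local OCI archive
  uses: docker/build-push-action@v6
  with:
    file: Dockerfile.gpu
    build-contexts: |
      base=oci-layout:///tmp/cpu-image
    build-args: |
      BASE_IMAGE=base
    push: false
```

Note that, per the later fix in this PR, the Dockerfile must opt into frontend 1.5+ (e.g. a `# syntax=docker/dockerfile:1.5` directive) for `oci-layout://` context sources to resolve.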
Add GHA cache (type=gha) to the three docker/build-push-action steps in
build-and-push-demo-bundled-image. CPU archive and CPU push share the
demo-cpu scope so the push reuses layers from the archive build; GPU
gets its own demo-gpu scope.
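The scoped cache settings might look like the fragment below on each build step (with `demo-cpu` on the two CPU steps); exact placement in the workflow is an assumption:

```yaml
# Sketch: per-scope GitHub Actions cache for a build-push-action step.
with:
  cache-from: type=gha,scope=demo-gpu
  cache-to: type=gha,mode=max,scope=demo-gpu
```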
The oci-layout:// build-context source requires Dockerfile frontend 1.5+.
CI was failing with 'unsupported context source oci-layout for base'.

Signed-off-by: Kevin Su <pingsutw@apache.org>
cosmicBboy pushed a commit to flyteorg/flyte-sdk that referenced this pull request Apr 21, 2026
## Summary
Adds `--gpu` to `flyte start demo`. When set, the underlying `docker
run` is invoked with `--gpus all`, giving the demo container access to
host NVIDIA GPUs.

```bash
flyte start demo --gpu --image ghcr.io/flyteorg/flyte-demo:gpu-latest
```

Default off — existing non-GPU users are unaffected.

## Why

The GPU-capable demo image being added in
[flyteorg/flyte#7243](flyteorg/flyte#7243)
configures k3s, containerd, and the NVIDIA device plugin inside the
image, but none of that matters if the host GPUs aren't passed through
to the container. `--gpus all` on the `docker run` is the piece the CLI
owns.

## Change
- `_start.py` — new `--gpu` click option on the `demo` command.
- `_demo.py` — threads `gpu: bool` through `launch_demo →
_launch_demo_rich/plain → _run_step → _run_container`, which appends
`--gpus all` when true.
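The flag-threading in the change list can be sketched with a small self-contained helper. The function name, signature, and base arguments below are illustrative stand-ins, not the actual flyte-sdk internals:

```python
# Sketch only: how a gpu flag can be threaded into the docker argv.
# build_docker_run_args is a hypothetical helper, not flyte-sdk code.
from typing import List


def build_docker_run_args(image: str, gpu: bool = False) -> List[str]:
    """Build a docker run argv, appending --gpus all only when gpu=True."""
    args = ["docker", "run", "--rm"]
    if gpu:
        # Pass all host NVIDIA GPUs through to the demo container.
        args += ["--gpus", "all"]
    args.append(image)
    return args


if __name__ == "__main__":
    print(build_docker_run_args("ghcr.io/flyteorg/flyte-demo:gpu-latest", gpu=True))
```

With the flag off, the argv is unchanged, which matches the "default off" behavior above.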

## Test plan
<img width="1879" height="769" alt="Screenshot 2026-04-21 at 12 38 07 AM" src="https://github.com/user-attachments/assets/44a92867-f1dd-4c89-b96d-3c5af63389f4" />

- [x] Installed locally with `uv tool install --from . flyte`; `flyte start demo --help` lists the new flag.
- [x] `flyte start demo --gpu --image flyte-demo:gpu-local` on an A10G host launched the container with `--gpus all`; a Flyte task with `Resources(gpu=1)` reported `torch.cuda.is_available() == True`.
- [x] No regression expected for existing users (flag defaults to off).

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot mentioned this pull request Apr 21, 2026
@pingsutw pingsutw merged commit 762b0cd into v2 Apr 21, 2026
20 checks passed
@pingsutw pingsutw deleted the feat/demo-bundled-gpu branch April 21, 2026 17:19
@pingsutw pingsutw self-assigned this Apr 21, 2026
@pingsutw pingsutw added this to the V2 GA milestone Apr 21, 2026