Add GPU-capable demo cluster image#7243

Merged
pingsutw merged 7 commits into v2 from feat/demo-bundled-gpu on Apr 21, 2026
Conversation


@pingsutw pingsutw commented Apr 20, 2026

Summary

Adds a GPU variant of the demo-bundled image so users on an NVIDIA-enabled host can run:

```bash
flyte start demo --image ghcr.io/flyteorg/flyte-demo:gpu-latest
```

and submit Flyte tasks with Resources(gpu=1) — no PodTemplate, no runtimeClassName needed — and get a GPU.

What's in the image

  • Dockerfile.gpu — stages NVIDIA Container Toolkit v1.19.x (nvidia-ctk, nvidia-container-runtime, libnvidia-container) into the rancher/k3s final image. Two subtle prereqs the OCI hook needs:
    • Libs under /usr/lib/<arch-triple>/ — the nvidia-ctk OCI hook runs without inheriting LD_LIBRARY_PATH.
    • A statically-linked /sbin/ldconfig from debian:bookworm-slim — rancher/k3s ships none, and the toolkit's update-ldcache hook bind-mounts it into workload pods.
  • containerd-config.toml.tmpl — sets default_runtime_name = "nvidia". GPU pods get GPUs automatically; non-GPU pods are unaffected (nvidia-container-runtime is a passthrough).
  • nvidia-device-plugin.yaml — RuntimeClass nvidia + DaemonSet (nvcr.io/nvidia/k8s-device-plugin:v0.17.0), auto-applied by k3s on startup.
  • Makefile — new build-gpu target.
  • CI — new build-and-push step publishing gpu-latest, gpu-nightly, and gpu-<sha> tags to both flyte-demo and flyte-sandbox-v2.
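For readers unfamiliar with the containerd side of this, a minimal sketch of what the default-runtime stanza in a k3s containerd config template can look like is below. This is illustrative only — the plugin section paths and `BinaryName` follow the common containerd CRI config layout, not the actual `containerd-config.toml.tmpl` in this PR:

```toml
# Sketch of a containerd config making nvidia the default runtime.
# Section paths and BinaryName are assumptions, not the PR's template.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"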

Companion change

A matching --gpu flag on flyte start demo (which adds --gpus all to the underlying docker run) will land in flyteorg/flyte-sdk. The image is still usable without it via a manual docker run --gpus all ….

Test plan

(Screenshot: 2026-04-20 at 5:19:34 PM)
```python
import flyte

torch_image = flyte.Image.from_debian_base(
    python_version=(3, 12),
    registry="localhost:30000",
).with_pip_packages("torch")

gpu_env = flyte.TaskEnvironment(
    name="torch_gpu",
    image=torch_image,
    resources=flyte.Resources(cpu=1, memory="2Gi", gpu=1),
)


@gpu_env.task
def check_torch() -> str:
    import torch

    info = (
        f"torch={torch.__version__}\n"
        f"cuda_available={torch.cuda.is_available()}\n"
        f"device_count={torch.cuda.device_count()}\n"
        f"device_name={torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}"
    )
    print(info)
    return info


if __name__ == "__main__":
    flyte.init_from_config()
    r = flyte.run(check_torch)
    print("name:", r.name)
    print("url:", r.url)
    r.wait()
```
  • Layered test image (NVIDIA toolkit + device plugin on top of flyte-demo:nightly) verified on an A10G: k3s auto-registers nvidia runtime, node advertises nvidia.com/gpu: 1, Flyte task with Resources(gpu=1) ran torch 2.11.0+cu130 reporting cuda_available=True, device_name="NVIDIA A10G".
  • Full multi-stage Dockerfile.gpu has NOT been built locally — this CI run is the first build. Expect potential fixup iterations.

Follow-ups

  • GPU type filters (Resources(gpu="A10G:1")) require GPU Feature Discovery to label the node.
  • MIG support (A100/H100 partitioning) needs a device-plugin ConfigMap with migStrategy.
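For the MIG follow-up, a hedged sketch of what such a device-plugin ConfigMap could look like is below. The key names follow the NVIDIA k8s-device-plugin config file format; the ConfigMap name, namespace, and strategy value are illustrative assumptions:

```yaml
# Sketch only: device-plugin config enabling a MIG strategy.
# Name/namespace/strategy are placeholders, not from this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: "single"
```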

Adds a GPU variant of the demo-bundled image so users with NVIDIA GPUs
can run `flyte start demo --image ghcr.io/flyteorg/flyte-demo:gpu-latest`
and submit tasks with `Resources(gpu=1)`.

- Dockerfile.gpu stages NVIDIA Container Toolkit v1.19.x binaries and
  their shared libs into the rancher/k3s final image. Libs are copied
  into /usr/lib/<triple>/ because the nvidia-ctk OCI hook runs without
  inheriting LD_LIBRARY_PATH. A statically-linked /sbin/ldconfig is
  also staged (rancher/k3s ships none) because the toolkit's
  update-ldcache hook bind-mounts it into workload pods.
- containerd-config.toml.tmpl sets nvidia as the default containerd
  runtime. Pods requesting nvidia.com/gpu get GPUs without needing
  runtimeClassName in their spec; non-GPU pods are unaffected
  (nvidia-container-runtime is a passthrough when no GPU is requested).
- nvidia-device-plugin.yaml installs a RuntimeClass and the NVIDIA
  k8s-device-plugin DaemonSet so nvidia.com/gpu is advertised on the
  node. Auto-applied by k3s at startup.
- Makefile gains a build-gpu target producing flyte-demo:gpu-latest.
- CI gains a build-and-push step publishing gpu-latest, gpu-nightly,
  and gpu-<sha> tags to both flyte-demo and flyte-sandbox-v2.

The GPU plumbing was verified end-to-end with a layered test image on
an A10G (torch 2.11.0+cu130 reported cuda_available=True). The full
multi-stage Dockerfile.gpu has not been built locally; the CI run here
is the first end-to-end test of the production Dockerfile and may
need fixup iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot mentioned this pull request Apr 20, 2026
Removes the duplicated builder/bootstrap/pg-cache stages and final-stage
setup by making Dockerfile.gpu a thin layer on top of flyte-demo:latest
(parameterized via ARG BASE_IMAGE). CI now builds the CPU image first
and passes its sha-tag in as BASE_IMAGE to the GPU build.

- Dockerfile.gpu shrinks from ~165 to ~75 lines; inherits flyte-binary,
  embedded postgres, staging manifests, and k3d entrypoint from the
  base image unchanged.
- Makefile build-gpu target now depends on build (not the full prereq
  chain) and passes BASE_IMAGE=flyte-demo:latest.
- CI gates the GPU build on push/workflow_dispatch since PR builds
  don't push the CPU image to ghcr.io (nothing to pull for BASE_IMAGE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
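The ARG-parameterized thin-layer pattern this commit describes can be sketched as follows; the comments stand in for the actual GPU-only stages, which are not reproduced here:

```dockerfile
# Sketch of the thin-layer pattern: the GPU image builds FROM the CPU
# image, overridable at build time. The real Dockerfile.gpu's toolkit
# staging steps are elided.
ARG BASE_IMAGE=flyte-demo:latest
FROM ${BASE_IMAGE}
# GPU-only additions (NVIDIA toolkit binaries, static ldconfig,
# containerd config template, device-plugin manifest) layer on top of
# the unchanged CPU image here.
```

CI can then inject the freshly built CPU image with `--build-arg BASE_IMAGE=<cpu-sha-tag>`.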
pingsutw and others added 4 commits April 21, 2026 00:30
Drops the `if:` gate and conditions `push:` on the same expression the
CPU build uses, so both steps always build and only push on v2-branch
pushes or workflow_dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On pull_request events the CPU build step runs with push=false, so the
GPU build's FROM ghcr.io/.../flyte-demo:sha-<sha> fails to resolve
(image not found in the registry). Fix by producing an OCI archive of
the CPU image locally and passing it to the GPU build as a named build
context (build-contexts: base=oci-layout://...) with BASE_IMAGE=base.

Registry push happens in a separate step that only runs on push /
workflow_dispatch, so PR builds no longer need ghcr credentials for
the GPU step.
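A hedged sketch of the named-build-context wiring described above, using docker/build-push-action's `build-contexts` input; step name, paths, and action version are illustrative, not copied from the workflow:

```yaml
# Sketch: feed the locally archived CPU image to the GPU build as a
# named context, so PR builds need no registry pull. Paths are
# placeholders.
- name: Build GPU image from local OCI archive
  uses: docker/build-push-action@v6
  with:
    file: Dockerfile.gpu
    build-contexts: |
      base=oci-layout:///tmp/cpu-image
    build-args: |
      BASE_IMAGE=base
    push: false
```

Note that, per the later fix in this PR, the Dockerfile must opt into frontend 1.5+ (e.g. a `# syntax=docker/dockerfile:1.5` directive) for `oci-layout://` context sources to resolve.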
Add GHA cache (type=gha) to the three docker/build-push-action steps in
build-and-push-demo-bundled-image. CPU archive and CPU push share the
demo-cpu scope so the push reuses layers from the archive build; GPU
gets its own demo-gpu scope.
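The scoped cache settings might look like the fragment below on each build step (with `demo-cpu` on the two CPU steps); exact placement in the workflow is an assumption:

```yaml
# Sketch: per-scope GitHub Actions cache for a build-push-action step.
with:
  cache-from: type=gha,scope=demo-gpu
  cache-to: type=gha,mode=max,scope=demo-gpu
```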
The oci-layout:// build-context source requires Dockerfile frontend 1.5+.
CI was failing with 'unsupported context source oci-layout for base'.

Signed-off-by: Kevin Su <pingsutw@apache.org>
cosmicBboy pushed a commit to flyteorg/flyte-sdk that referenced this pull request Apr 21, 2026
## Summary
Adds `--gpu` to `flyte start demo`. When set, the underlying `docker
run` is invoked with `--gpus all`, giving the demo container access to
host NVIDIA GPUs.

```bash
flyte start demo --gpu --image ghcr.io/flyteorg/flyte-demo:gpu-latest
```

Default off — existing non-GPU users are unaffected.

## Why

The GPU-capable demo image being added in
[flyteorg/flyte#7243](flyteorg/flyte#7243)
configures k3s, containerd, and the NVIDIA device plugin inside the
image, but none of that matters if the host GPUs aren't passed through
to the container. `--gpus all` on the `docker run` is the piece the CLI
owns.

## Change
- `_start.py` — new `--gpu` click option on the `demo` command.
- `_demo.py` — threads `gpu: bool` through `launch_demo →
_launch_demo_rich/plain → _run_step → _run_container`, which appends
`--gpus all` when true.
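The flag-threading in the change list can be sketched with a small self-contained helper. The function name, signature, and base arguments below are illustrative stand-ins, not the actual flyte-sdk internals:

```python
# Sketch only: how a gpu flag can be threaded into the docker argv.
# build_docker_run_args is a hypothetical helper, not flyte-sdk code.
from typing import List


def build_docker_run_args(image: str, gpu: bool = False) -> List[str]:
    """Build a docker run argv, appending --gpus all only when gpu=True."""
    args = ["docker", "run", "--rm"]
    if gpu:
        # Pass all host NVIDIA GPUs through to the demo container.
        args += ["--gpus", "all"]
    args.append(image)
    return args


if __name__ == "__main__":
    print(build_docker_run_args("ghcr.io/flyteorg/flyte-demo:gpu-latest", gpu=True))
```

With the flag off, the argv is unchanged, which matches the "default off" behavior above.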

## Test plan
<img width="1879" height="769" alt="Screenshot 2026-04-21 at 12 38 07 AM" src="https://github.com/user-attachments/assets/44a92867-f1dd-4c89-b96d-3c5af63389f4" />

- [x] Installed locally with `uv tool install --from . flyte`; `flyte start demo --help` lists the new flag.
- [x] `flyte start demo --gpu --image flyte-demo:gpu-local` on an A10G host launched the container with `--gpus all`; a Flyte task with `Resources(gpu=1)` reported `torch.cuda.is_available() == True`.
- [x] No regression expected for existing users (flag defaults to off).

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot mentioned this pull request Apr 21, 2026
@pingsutw pingsutw merged commit 762b0cd into v2 Apr 21, 2026
20 checks passed
@pingsutw pingsutw deleted the feat/demo-bundled-gpu branch April 21, 2026 17:19
@pingsutw pingsutw self-assigned this Apr 21, 2026
@pingsutw pingsutw added this to the V2 GA milestone Apr 21, 2026