Feature Description
Make AMD a first-class accelerator tier in LLMKube, alongside NVIDIA CUDA (in-cluster pods) and Apple Metal (off-cluster agent). The immediate driver is a homelab AMD Strix Halo node (Ryzen AI Max+ 395, RDNA 3.5 iGPU "Radeon 8060S" / gfx1151, 128 GB LPDDR5X unified memory), but the goal is a general, supported path for AMD GPUs/APUs.
This is an umbrella epic. The work decomposes into a runtime image, node onboarding, a validated example, observability, and a follow-on ROCm tier.
Problem Statement
As a fleet operator running heterogeneous on-prem hardware, I want to add an AMD node and have LLMKube schedule, serve, observe, and route to it with the same CRDs I use for NVIDIA and Metal, so that AMD is a real tier and not a manual bare-metal workaround.
Today the operator hardcodes NVIDIA assumptions on the pod path and ships only CUDA + Metal runtimes. A user can run llama.cpp on an AMD APU on bare metal but has no first-class LLMKube path. AMD APUs are a compelling sovereign-inference tier: a single Strix Halo box exposes ~90GB of unified memory to the iGPU, enough for large models, at low power.
Proposed Solution
Ship AMD support as a Vulkan-first tier (Vulkan/RADV is faster and far more stable than ROCm on gfx1151 today), with ROCm as a follow-on per-model opt-in. All of it runs in-cluster through the same Model / InferenceService path as CUDA; the only new node-level primitive is a render-device plugin.
Alternatives Considered
ROCm-first: rejected for v1. On gfx1151 today ROCm is Preview-tier and crash-prone (dense-model hipGraphInstantiate crashes, KV-cache-spills-to-host bug), needs a pinned kernel, and produces a 6-12GB image. Kept as a follow-on tier.
Off-cluster agent (Metal pattern): unnecessary for Linux+Vulkan, which containerizes cleanly. In-cluster reuses the existing CUDA machinery.
Additional Context
Backends compared on gfx1151: Vulkan/RADV vs ROCm/HIP. Vulkan wins on generation throughput (by 25-32% in a 128-run benchmark) and on stability across dense + MoE models; both share the same /dev/dri/renderD128 Kubernetes exposure.
Similar features: the existing Apple Metal tier is the precedent for "a non-NVIDIA accelerator as a first-class citizen."
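Since both backends rely on the same /dev/dri/renderD128 exposure, the first thing any render-device plugin has to do is find the DRM render nodes on the host. The following is a minimal sketch of that discovery step, not the plugin's actual code: the function name `list_render_nodes` and its `dri_root` parameter are hypothetical, and it assumes only the standard Linux convention that render nodes appear as `renderD<minor>` entries under `/dev/dri`.

```python
import re
from pathlib import Path

# Linux DRM render nodes are named renderD<minor>, e.g. renderD128.
RENDER_NODE = re.compile(r"^renderD(\d+)$")


def list_render_nodes(dri_root="/dev/dri"):
    """Return render-node paths under dri_root, sorted by minor number.

    Hypothetical helper for illustration; returns an empty list when the
    directory does not exist (for example on a node without a GPU driver).
    """
    root = Path(dri_root)
    if not root.is_dir():
        return []
    found = []
    for entry in root.iterdir():
        match = RENDER_NODE.match(entry.name)
        if match:
            found.append((int(match.group(1)), str(entry)))
    # Sort numerically so renderD128 comes before renderD129.
    return [path for _, path in sorted(found)]


if __name__ == "__main__":
    for path in list_render_nodes():
        print(path)
```

Entries such as `card0` are skipped on purpose: only the render nodes are needed for compute, and they do not require display-master privileges.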
Decomposition (each its own issue):
/dev/dri generic-device-plugin escape hatch. DONE via feat(crd): make GPU resource name configurable to support AMD/Vulkan/Intel scheduling #709 (community contribution by @joryirving), including the model_amd_vulkan_igpu.yaml sample.
checkAcceleratorAvailability ignores the new gpu.resourceName override, so AcceleratorReady status is inaccurate for escape-hatch users (status-only, not a scheduling gate).
llama-server image in the build matrix (the biggest gap).
/metrics, Grafana panel (the DCGM-for-AMD analog).
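The checkAcceleratorAvailability item in the decomposition comes down to one rule: the status check should look for whatever resource name the user configured, not a hardcoded one. Below is a rough sketch of that rule in Python, as an illustration rather than the operator's implementation. The function names, the list-of-dicts shape for node allocatable resources, and the `example.com/dri-render` resource name are all hypothetical; `nvidia.com/gpu` is the resource name advertised by the NVIDIA device plugin, and treating it as the fallback default is an assumption.

```python
# Assumed fallback when no gpu.resourceName override is set.
DEFAULT_GPU_RESOURCE = "nvidia.com/gpu"


def effective_resource_name(override=None):
    """Return the override if one is configured, else the default name."""
    return override or DEFAULT_GPU_RESOURCE


def accelerator_ready(nodes_allocatable, override=None):
    """Report whether any node advertises the effective resource.

    nodes_allocatable is a hypothetical stand-in for node status: a list
    of dicts mapping resource name to allocatable count, one per node.
    """
    name = effective_resource_name(override)
    return any(int(alloc.get(name, 0)) > 0 for alloc in nodes_allocatable)


if __name__ == "__main__":
    nodes = [{"cpu": 32, "example.com/dri-render": 1}]
    # Without the override the check looks for the default and reports False.
    print(accelerator_ready(nodes))
    # With the override it finds the render-device resource.
    print(accelerator_ready(nodes, "example.com/dri-render"))
```

In this sketch, a node that only advertises a render-device resource reads as not ready until the override is passed through, which matches the inaccurate-status symptom described for escape-hatch users.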
Priority
Willingness to Contribute