You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A validated, end-to-end example of serving a model on the AMD node (Vulkan tier), plus a benchmark entry so the AMD tier has the same "here is a working manifest and here are the numbers" treatment as CUDA and Metal.
Problem Statement
As someone evaluating LLMKube on AMD, I want a known-good example manifest and real tokens/sec, so that I can trust the tier works and size my hardware.
The other tiers have validated examples and benchmark numbers. The AMD tier needs the same proof, and it doubles as the acceptance test that #696 actually landed.
Proposed Solution
A Model + InferenceService example manifest targeting the AMD node (vendor amd, Vulkan runtime image, GPU layer offload set for the unified-memory budget).
A documented end-to-end run: deploy, hit the OpenAI-compatible endpoint, confirm GPU offload, record decode/prefill tokens/sec at a couple of context lengths.
A benchmark entry alongside the existing CUDA/Metal numbers (a MoE model such as a Qwen3 30B-class is a good showcase for the 90GB unified pool).
Feature Description
A validated, end-to-end example of serving a model on the AMD node (Vulkan tier), plus a benchmark entry so the AMD tier has the same "here is a working manifest and here are the numbers" treatment as CUDA and Metal.
Problem Statement
The other tiers have validated examples and benchmark numbers. The AMD tier needs the same proof, and it doubles as the acceptance test that #696 actually landed.
Proposed Solution
Model+InferenceServiceexample manifest targeting the AMD node (vendoramd, Vulkan runtime image, GPU layer offload set for the unified-memory budget).Alternatives Considered
Additional Context
Priority
Willingness to Contribute