Skip to content

[FEATURE] Validated AMD (Vulkan) example InferenceService + benchmark #699

@Defilan

Description

@Defilan

Feature Description

A validated, end-to-end example of serving a model on the AMD node (Vulkan tier), plus a benchmark entry so the AMD tier has the same "here is a working manifest and here are the numbers" treatment as CUDA and Metal.

Problem Statement

As someone evaluating LLMKube on AMD, I want a known-good example manifest and real tokens/sec, so that I can trust the tier works and size my hardware.

The other tiers have validated examples and benchmark numbers. The AMD tier needs the same proof, and it doubles as the acceptance test that #696 actually landed.

Proposed Solution

  • A Model + InferenceService example manifest targeting the AMD node (vendor amd, Vulkan runtime image, GPU layer offload set for the unified-memory budget).
  • A documented end-to-end run: deploy, hit the OpenAI-compatible endpoint, confirm GPU offload, record decode/prefill tokens/sec at a couple of context lengths.
  • A benchmark entry alongside the existing CUDA/Metal numbers (a MoE model such as a Qwen3 30B-class is a good showcase for the 90GB unified pool).
  • Wire it into the heterogeneous-fleet story: this node becomes a real backend tier the gateway ([FEATURE] First-class Envoy AI Gateway integration: make the gateway fleet-aware #661) and router can target.

Alternatives Considered

  • Folding this into the runtime-image issue: kept separate so the image PR stays focused and the example carries the reproducible numbers.

Additional Context

Priority

  • High - Would significantly improve my workflow

Willingness to Contribute

  • Yes, I can submit a PR

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions