
[FEATURE] ROCm 7.2 HIP runtime tier for AMD nodes (per-model opt-in, follow-on) #701

@Defilan


Feature Description

A ROCm 7.2 (HIP) runtime tier for AMD nodes, as a per-model opt-in alongside the Vulkan default. This is the "ROCm proper" follow-on under the AMD epic (#696): for the specific models where ROCm beats Vulkan, allow an InferenceService to select the ROCm runtime image.

Problem Statement

As a fleet operator, I want to run the models that perform better under ROCm/HIP on AMD via the same CRDs, so that I am not locked to Vulkan when ROCm wins.

Vulkan is the right v1 default on gfx1151 (faster generation, far more stable), but ROCm 7.x can edge it out on some small dense and MoE workloads when the HIP stack is fully built (rocWMMA + hipBLASLt). Once the Vulkan tier is solid, ROCm is worth offering for those cases.

Proposed Solution

  • A ROCm llama-server image built from rocm/dev-ubuntu-24.04:7.2-complete with -DGPU_TARGETS=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_NO_VMM=ON -DGGML_HIP_MMQ_MFMA=ON (or COPY prebuilt lemonade-sdk/llamacpp-rocm gfx1151 binaries to avoid long HIP builds).
  • Pod mounts /dev/dri/render* + /dev/kfd via the device plugin (see #395, "feat(crd): make GPU resource name configurable to support AMD/Vulkan/Intel scheduling"); the ROCm node prerequisites extend the node runbook.
  • Per-model/per-InferenceService runtime selection so an operator opts a specific model into ROCm.
  • Re-benchmark ROCm vs Vulkan per model and document which wins where; track ROCm release notes for when gfx1151 leaves Preview and the hipGraph/KV-cache bugs are fixed.
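
The "document which wins where" step could be reduced to a small comparison helper. The sketch below is only an illustration, not part of the proposed implementation: the `BenchResult` type, the `pick_winners` function, and the 5% `min_gain` threshold are all assumptions introduced here, and the throughput numbers would come from real per-model benchmark runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchResult:
    """One benchmark measurement (hypothetical shape, not an existing type)."""
    model: str        # model identifier, however the fleet names it
    runtime: str      # "vulkan" or "rocm"
    gen_tok_s: float  # measured generation throughput in tokens/second


def pick_winners(results, min_gain=0.05):
    """Map each model to the runtime it should use.

    ROCm is selected only when it beats Vulkan by more than `min_gain`
    (relative). Ties, small gains, and missing measurements all fall
    back to the Vulkan default. The 5% threshold is an assumption.
    """
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, {})[r.runtime] = r.gen_tok_s

    winners = {}
    for model, runs in by_model.items():
        vulkan = runs.get("vulkan")
        rocm = runs.get("rocm")
        if vulkan is None or rocm is None:
            # Without both measurements there is no comparison to make.
            winners[model] = "vulkan"
        elif rocm > vulkan * (1.0 + min_gain):
            winners[model] = "rocm"
        else:
            winners[model] = "vulkan"
    return winners
```

Requiring a margin rather than a bare "greater than" keeps a model on the more stable Vulkan tier unless ROCm shows a clear gain; the resulting mapping is what the per-model documentation would record.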

Alternatives Considered

  • Make ROCm the default: rejected for now given gfx1151 instability (dense-model crashes, KV-cache-to-host). Vulkan stays the default; ROCm is opt-in.
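
The resulting policy (Vulkan unless a service explicitly opts in to ROCm) can be sketched as a small resolution function. This is a minimal illustration under stated assumptions: the `runtime` field name, the `InferenceServiceSpec` shape, and the image references are hypothetical placeholders, since the issue does not define the CRD field or the published image tags.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical image references; the real registry paths and tags are
# not specified in this issue.
RUNTIME_IMAGES = {
    "vulkan": "example.invalid/llama-server:vulkan",
    "rocm": "example.invalid/llama-server:rocm-7.2-gfx1151",
}
DEFAULT_RUNTIME = "vulkan"


@dataclass(frozen=True)
class InferenceServiceSpec:
    """Simplified stand-in for the CRD spec (field names are assumptions)."""
    model: str
    runtime: Optional[str] = None  # None means "no opt-in"


def resolve_runtime(spec):
    """Return (runtime, image) for a service, defaulting to Vulkan."""
    runtime = spec.runtime or DEFAULT_RUNTIME
    if runtime not in RUNTIME_IMAGES:
        raise ValueError(f"unsupported runtime {runtime!r} for model {spec.model!r}")
    return runtime, RUNTIME_IMAGES[runtime]
```

A service that leaves `runtime` unset resolves to the Vulkan image, so existing manifests keep their current behavior; only a service that sets `runtime="rocm"` gets the ROCm image.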

Additional Context

Priority

  • Medium - Nice to have

Willingness to Contribute

  • Yes, I can submit a PR

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
