EAI-6030 Add GPU_STACK_FAMILY for ROCm/GPU Operator defaults #259

Open: pre wants to merge 3 commits into EAI-6030-aim-gpu-family from EAI-6030-gpu-stack-family

pre (Contributor) commented Jun 11, 2026

Related:

Summary

Second half of EAI-6030: ROCm + GPU Operator install defaults driven by GPU family. Stacked on #256 (the AIM catalog work); base is EAI-6030-aim-gpu-family.

Adds a new single-select GPU_STACK_FAMILY flag (radeon | instinct), distinct from AIM_HARDWARE_FAMILY. Single-select because host ROCm is one version per node, so a heterogeneous Radeon+Instinct GPU stack can't be expressed (the AIM catalog can still be heterogeneous).

pkg/config/gpu_stack_matrix.go resolves the whole stack as one qualified matrix row. Empty resolves to instinct, reproducing today's exact pins; radeon selects the ROCm 7.13 tech-preview stack:

| GPU_STACK_FAMILY | host ROCm | amd-gpu-operator | amd-gpu-operator-config | DeviceConfig ROCm driver |
| --- | --- | --- | --- | --- |
| empty / instinct | 7.1.1 (70101-1) | v1.4.1 | v1.4.1 | 7.0 |
| radeon | 7.13.0 (71300-1) | v1.5.1-beta.0 | v1.5.1-beta.0 | 7.13 |
  • Default unchanged: empty resolves to instinct, reproducing today's exact pins. No regression for existing installs.
  • Radeon: selects the ROCm 7.13 tech-preview stack, including GPU Operator chart amd-gpu-operator/v1.5.1-beta.0 and config chart amd-gpu-operator-config/v1.5.1-beta.0, and prints a tech-preview notice at install time. Both beta charts are vendored in cluster-forge #744; the defaults stay v1.4.1 and the betas are selected only for GPU_STACK_FAMILY=radeon.
  • Compatibility matrix + fail-fast: unsupported combinations (e.g. a radeon stack resolving to ROCm 7.2) fail config.Validate before install with an error naming the incompatible component.
  • Injection to cluster-forge: resolved selections flow as ansible vars. Host ROCm overrides the play vars via ConfigToAnsibleVars. Small clusters get --set apps.amd-gpu-operator.path=..., apps.amd-gpu-operator-config.path=..., and apps.amd-gpu-operator-config.valuesObject.* on the helm render; medium/large go through the gitea-init-job values.
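
For reviewers, here is a minimal sketch of how the matrix lookup and the fail-fast check in the list above could fit together. It is not the code in pkg/config/gpu_stack_matrix.go. The names stackRow, stackMatrix, resolveStack and validateHostROCm, the MinROCm field, and the instinct operator path amd-gpu-operator/v1.4.1 (inferred from the v1.4.1 pin) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stackRow is a hypothetical shape for one qualified row of the matrix.
type stackRow struct {
	Family             string
	HostROCm           string
	OperatorPath       string
	OperatorConfigPath string
	DriverVersion      string
	MinROCm            [2]int // assumed: lowest accepted host ROCm (major, minor)
}

// stackMatrix mirrors the table in this PR; the MinROCm values are assumptions.
var stackMatrix = map[string]stackRow{
	"instinct": {
		Family: "instinct", HostROCm: "7.1.1",
		OperatorPath:       "amd-gpu-operator/v1.4.1",
		OperatorConfigPath: "amd-gpu-operator-config/v1.4.1",
		DriverVersion:      "7.0", MinROCm: [2]int{0, 0},
	},
	"radeon": {
		Family: "radeon", HostROCm: "7.13.0",
		OperatorPath:       "amd-gpu-operator/v1.5.1-beta.0",
		OperatorConfigPath: "amd-gpu-operator-config/v1.5.1-beta.0",
		DriverVersion:      "7.13", MinROCm: [2]int{7, 13},
	},
}

// resolveStack returns the whole row for a family; empty resolves to instinct.
func resolveStack(family string) (stackRow, error) {
	if family == "" {
		family = "instinct"
	}
	row, ok := stackMatrix[family]
	if !ok {
		return stackRow{}, fmt.Errorf("GPU_STACK_FAMILY %q is not one of radeon, instinct", family)
	}
	return row, nil
}

// majorMinor parses the first two numeric components of a version string.
func majorMinor(v string) (int, int, error) {
	parts := strings.Split(v, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q needs at least major.minor", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("version %q: bad major component", v)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("version %q: bad minor component", v)
	}
	return major, minor, nil
}

// validateHostROCm rejects a host ROCm older than the family's minimum and
// names the incompatible component in the error.
func validateHostROCm(family, hostROCm string) error {
	row, err := resolveStack(family)
	if err != nil {
		return err
	}
	major, minor, err := majorMinor(hostROCm)
	if err != nil {
		return err
	}
	if major < row.MinROCm[0] || (major == row.MinROCm[0] && minor < row.MinROCm[1]) {
		return fmt.Errorf("host ROCm %s is incompatible with GPU_STACK_FAMILY=%s (needs >= %d.%d)",
			hostROCm, row.Family, row.MinROCm[0], row.MinROCm[1])
	}
	return nil
}
```

With these assumptions, validateHostROCm("radeon", "7.2") returns an error and validateHostROCm("radeon", "7.13.0") returns nil. The components are compared as integers because a plain string comparison would order "7.2" after "7.13".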

The GPU Operator and config chart paths are real pins. The radeon host ROCm and DeviceConfig driver versions mirror values owned outside cluster-bloom / cluster-forge (ROCm PMO / EAI-5906) and are not authoritative pins held here.
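
The small-cluster injection described in the list above could be assembled roughly as follows. This is a sketch only: helmSetArgs and its parameters are hypothetical names, and the valuesObject keys are supplied by the caller because this description does not enumerate them.

```go
package main

import (
	"fmt"
	"sort"
)

// helmSetArgs builds --set flags for the small-cluster helm render from the
// resolved chart paths. configValues holds the entries placed under
// apps.amd-gpu-operator-config.valuesObject; its keys are caller-supplied.
func helmSetArgs(operatorPath, configPath string, configValues map[string]string) []string {
	args := []string{
		"--set", "apps.amd-gpu-operator.path=" + operatorPath,
		"--set", "apps.amd-gpu-operator-config.path=" + configPath,
	}

	// Sort the keys so the rendered command line is deterministic.
	keys := make([]string, 0, len(configValues))
	for k := range configValues {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		args = append(args, "--set",
			fmt.Sprintf("apps.amd-gpu-operator-config.valuesObject.%s=%s", k, configValues[k]))
	}
	return args
}
```

Sorting the keys keeps the argument list stable between runs, which makes the rendered command easier to diff and to assert on in tests.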

Test plan

  • go build ./...
  • go vet ./pkg/config/... ./pkg/ansible/...
  • go test ./pkg/config/... (family resolution, instinct == existing defaults, radeon selects operator + config v1.5.1-beta.0, radeon+7.2 rejected, radeon+7.13 accepted, ApplyGPUStackVars sets both paths)
  • Schema field-count assertion bumped 38 → 39
  • YAML lint on edited ansible tasks + schema
  • Reviewer: end-to-end bloom run on a cluster (ansible/helm not runnable in this env). Note: the standard test VM has no GPU, so it exercises the default/instinct path + cluster-forge rendering, not a real Radeon ROCm install.

Selects host ROCm + GPU Operator install defaults by GPU family
(radeon | instinct). Empty resolves to instinct, reproducing the
current qualified defaults so existing installs are unchanged. Radeon
selects the ROCm 7.13 tech-preview stack and prints a tech-preview
notice.

A per-family compatibility matrix in pkg/config/gpu_stack_matrix.go
resolves host ROCm, GPU Operator chart path, and DeviceConfig ROCm
driver version as one row, and fails install validation for
unsupported combinations (e.g. radeon on too-old ROCm) with an error
naming the incompatible component. Resolved selections are injected as
ansible vars and passed through to cluster-forge (small via --set on
the helm render, medium/large via the gitea-init-job values).

Radeon version strings are placeholders pending EAI-5906. Part of
EAI-6030.
pre added 2 commits June 12, 2026 15:01
Point the radeon stack row at the amd-gpu-operator v1.5.1-beta.0
tech-preview chart, while instinct (and empty/default) stays on the
qualified v1.4.1 chart. The operator path pins are now real, not
placeholders; radeon host ROCm and DeviceConfig driver versions are
sourced outside cluster-bloom / cluster-forge. Part of EAI-6030.

Resolve a per-family GPU Operator config chart path alongside the
operator chart path: instinct uses amd-gpu-operator-config/v1.4.1 and
radeon uses amd-gpu-operator-config/v1.5.1-beta.0, so the config chart
and its DeviceConfig schema match the selected operator version.

The resolved path flows to cluster-forge as gpu_operator_config_path:
small clusters via --set apps.amd-gpu-operator-config.path, medium/large
via the gitea-init-job values. Part of EAI-6030.
pre requested a review from blankdots June 12, 2026 15:25