EAI-6030 Add GPU_STACK_FAMILY for ROCm/GPU Operator defaults #259
Open
pre wants to merge 3 commits into
Selects host ROCm + GPU Operator install defaults by GPU family (radeon | instinct). Empty resolves to instinct, reproducing the current qualified defaults so existing installs are unchanged. Radeon selects the ROCm 7.13 tech-preview stack and prints a tech-preview notice. A per-family compatibility matrix in pkg/config/gpu_stack_matrix.go resolves host ROCm, GPU Operator chart path, and DeviceConfig ROCm driver version as one row, and fails install validation for unsupported combinations (e.g. radeon on too-old ROCm) with an error naming the incompatible component. Resolved selections are injected as ansible vars and passed through to cluster-forge (small via --set on the helm render, medium/large via the gitea-init-job values). Radeon version strings are placeholders pending EAI-5906. Part of EAI-6030.
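A minimal sketch of how a per-family matrix lookup like the one described above could work. The `StackRow` type, the `ResolveRow` function, and every version or path string below are hypothetical placeholders for illustration, not the contents of pkg/config/gpu_stack_matrix.go; only the family names, the empty-means-instinct rule, and the tech-preview notice come from the description.

```go
package main

import "fmt"

// StackRow is a hypothetical shape for one row of the compatibility matrix:
// the three components are resolved together, never independently.
type StackRow struct {
	HostROCm         string // host ROCm version
	OperatorPath     string // GPU Operator chart path
	DeviceConfigROCm string // DeviceConfig ROCm driver version
	TechPreview      bool   // true when the row is not a qualified stack
}

// defaultFamily is what an empty GPU_STACK_FAMILY resolves to.
const defaultFamily = "instinct"

// matrix holds placeholder values only; the real pins are not reproduced here.
var matrix = map[string]StackRow{
	"instinct": {
		HostROCm:         "<qualified-rocm>",
		OperatorPath:     "<qualified-operator-chart>",
		DeviceConfigROCm: "<qualified-driver>",
	},
	"radeon": {
		HostROCm:         "<tech-preview-rocm>",
		OperatorPath:     "<tech-preview-operator-chart>",
		DeviceConfigROCm: "<tech-preview-driver>",
		TechPreview:      true,
	},
}

// ResolveRow maps a family name to its matrix row. An empty family falls back
// to the default so existing installs keep their current behaviour.
func ResolveRow(family string) (StackRow, error) {
	if family == "" {
		family = defaultFamily
	}
	row, ok := matrix[family]
	if !ok {
		return StackRow{}, fmt.Errorf("unsupported GPU_STACK_FAMILY %q (want radeon or instinct)", family)
	}
	return row, nil
}

func main() {
	for _, family := range []string{"", "instinct", "radeon", "geforce"} {
		row, err := ResolveRow(family)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if row.TechPreview {
			fmt.Println("NOTICE: tech-preview stack selected")
		}
		fmt.Printf("%q -> %+v\n", family, row)
	}
}
```

Keeping the whole stack in one row means a caller cannot mix, say, a radeon operator chart with an instinct host ROCm by overriding a single field.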
This was referenced Jun 11, 2026
Point the radeon stack row at the amd-gpu-operator v1.5.1-beta.0 tech-preview chart, while instinct (and empty/default) stays on the qualified v1.4.1 chart. The operator path pins are now real, not placeholders; radeon host ROCm and DeviceConfig driver versions are sourced outside cluster-bloom / cluster-forge. Part of EAI-6030.
Resolve a per-family GPU Operator config chart path alongside the operator chart path: instinct uses amd-gpu-operator-config/v1.4.1 and radeon uses amd-gpu-operator-config/v1.5.1-beta.0, so the config chart and its DeviceConfig schema match the selected operator version. The resolved path flows to cluster-forge as gpu_operator_config_path: small clusters via --set apps.amd-gpu-operator-config.path, medium/large via the gitea-init-job values. Part of EAI-6030.
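As a rough illustration of the small-cluster path, the resolved chart paths could be turned into `--set` arguments for the helm render along these lines. The `helmSetArgs` helper is hypothetical, and skipping empty values is an assumption; only the two value keys and the radeon chart paths are taken from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// helmSetArgs is a hypothetical helper that builds the --set flags carrying
// the resolved operator and config chart paths. Empty values are skipped so
// the chart's own defaults apply (an assumption, not confirmed by the PR).
func helmSetArgs(operatorPath, configPath string) []string {
	var args []string
	if operatorPath != "" {
		args = append(args, "--set", "apps.amd-gpu-operator.path="+operatorPath)
	}
	if configPath != "" {
		args = append(args, "--set", "apps.amd-gpu-operator-config.path="+configPath)
	}
	return args
}

func main() {
	// Paths for the radeon row as named in this PR.
	args := helmSetArgs(
		"amd-gpu-operator/v1.5.1-beta.0",
		"amd-gpu-operator-config/v1.5.1-beta.0",
	)
	fmt.Println(strings.Join(args, " "))
}
```

Passing the operator and config paths together keeps the config chart's DeviceConfig schema in step with the operator version that consumes it.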
Related:
Summary
Second half of EAI-6030: ROCm + GPU Operator install defaults driven by GPU family. Stacked on #256 (the AIM catalog work); base is `EAI-6030-aim-gpu-family`.

Adds a new single-select `GPU_STACK_FAMILY` flag (`radeon` | `instinct`), distinct from `AIM_HARDWARE_FAMILY`. It is single-select because host ROCm is one version per node, so a heterogeneous Radeon+Instinct GPU stack can't be expressed (the AIM catalog can still be heterogeneous).

`pkg/config/gpu_stack_matrix.go` resolves the whole stack as one qualified matrix row:

- Empty resolves to `instinct`, reproducing today's exact pins. No regression for existing installs.
- `radeon` selects the ROCm 7.13 tech-preview stack: operator chart `amd-gpu-operator/v1.5.1-beta.0` and config chart `amd-gpu-operator-config/v1.5.1-beta.0`, and prints a tech-preview notice at install time. Both beta charts are vendored in cluster-forge #744; the defaults stay `v1.4.1` and the betas are selected only for `GPU_STACK_FAMILY=radeon`.
- Unsupported combinations (e.g. radeon on too-old ROCm) fail `config.Validate` before install with an error naming the incompatible component.
- Resolved selections are injected as ansible vars via `ConfigToAnsibleVars`. Small clusters get `--set apps.amd-gpu-operator.path=...`, `apps.amd-gpu-operator-config.path=...`, and `apps.amd-gpu-operator-config.valuesObject.*` on the helm render; medium/large go through the gitea-init-job values.

The GPU Operator and config chart paths are real pins. The radeon host ROCm and DeviceConfig driver versions mirror values owned outside cluster-bloom / cluster-forge (ROCm PMO / EAI-5906) and are not authoritative pins held here.
Test plan
- `go build ./...`
- `go vet ./pkg/config/... ./pkg/ansible/...`
- `go test ./pkg/config/...` (family resolution, instinct == existing defaults, radeon selects operator + config `v1.5.1-beta.0`, radeon+7.2 rejected, radeon+7.13 accepted, `ApplyGPUStackVars` sets both paths)