EAI-6030 Select GPU Operator and DeviceConfig by GPU family #744

Open: pre wants to merge 5 commits into EAI-6030-aim-gpu-family from EAI-6030-gpu-stack-family

@pre pre commented Jun 11, 2026

Contributor

Related:

Summary

Second half of EAI-6030: the cluster-forge side of GPU-family-driven ROCm / GPU Operator defaults. Stacked on #741 (the AIM catalog work); base is EAI-6030-aim-gpu-family. Pairs with cluster-bloom #259.

The GPU stack is selected as one matrix row by GPU_STACK_FAMILY (injected by cluster-bloom). Empty / instinct keeps today's qualified defaults; radeon selects the ROCm 7.13 tech-preview stack:

| GPU_STACK_FAMILY | amd-gpu-operator | amd-gpu-operator-config | DeviceConfig ROCm driver |
| --- | --- | --- | --- |
| empty / instinct | v1.4.1 | v1.4.1 | 7.0 |
| radeon | v1.5.1-beta.0 | v1.5.1-beta.0 | 7.13 |
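
For reviewers, the matrix can be read as a lookup from family to the values that get injected. The following is a minimal Python sketch of that reading, not the cluster-bloom implementation: the function names `gpu_stack_overrides` and `as_set_flags` are hypothetical, and it assumes both app paths are always passed explicitly, whereas the real default case may simply inject nothing and rely on root/values.yaml.

```python
from typing import Dict, List


def gpu_stack_overrides(gpu_stack_family: str = "") -> Dict[str, str]:
    """Return the value overrides for one matrix row (illustrative only)."""
    # Empty is treated as instinct, matching the table in this PR.
    family = gpu_stack_family or "instinct"
    version = "v1.5.1-beta.0" if family == "radeon" else "v1.4.1"
    return {
        "apps.amd-gpu-operator.path": f"amd-gpu-operator/{version}",
        "apps.amd-gpu-operator-config.path": f"amd-gpu-operator-config/{version}",
        # The family is passed through unchanged; the chart helper resolves it.
        "apps.amd-gpu-operator-config.valuesObject.gpuStackFamily": gpu_stack_family,
    }


def as_set_flags(overrides: Dict[str, str]) -> List[str]:
    """Flatten overrides into helm-style --set arguments (small clusters)."""
    flags: List[str] = []
    for key, value in sorted(overrides.items()):
        flags += ["--set", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(as_set_flags(gpu_stack_overrides("radeon")))
```

Keeping the operator chart and the config chart on the same version in a single row is the point of the matrix: a mismatched pair (for example the beta operator with the v1.4.1 DeviceConfig schema) is never produced by this lookup.
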
  • GPU Operator chart vendored, default stays v1.4.1: sources/amd-gpu-operator/v1.5.1-beta.0 is added so radeon can resolve to it, but root/values.yaml + sources/amd-gpu-operator/source.yaml keep v1.4.1. The beta chart is selected only when cluster-bloom injects apps.amd-gpu-operator.path=amd-gpu-operator/v1.5.1-beta.0 for GPU_STACK_FAMILY=radeon.
  • Config chart versioned per family: the previously unversioned sources/amd-gpu-operator-config is split into v1.4.1 (the existing DeviceConfig example, default) and v1.5.1-beta.0 (the new-schema DeviceConfig for the radeon tech-preview stack). root/values.yaml defaults the config app path to amd-gpu-operator-config/v1.4.1; the beta config chart is selected only when bloom injects apps.amd-gpu-operator-config.path for radeon.
  • DeviceConfig driver version parameterized: both config chart versions resolve spec.driver.version through the gpuStack.driverVersion helper (precedence: explicit driverVersion > profiles[family].driverVersion > instinct). Empty resolves to instinct → 7.0; radeon → 7.13.
  • gitea-init-job (medium/large): now emits apps.amd-gpu-operator.path, apps.amd-gpu-operator-config.path, and the amd-gpu-operator-config valuesObject (gpuStackFamily / driverVersion) into cluster-values.
  • root/values.yaml: adds apps.amd-gpu-operator-config.valuesObject.gpuStackFamily/driverVersion (empty defaults) so bloom can inject and a GitOps-only operator can set the same keys (ADR-0002 parity).
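
The precedence used by the gpuStack.driverVersion helper can be modeled outside Helm for quick reasoning. This is a minimal Python sketch of the stated rule, not the helper itself: the `PROFILES` table and the function name `resolve_driver_version` are hypothetical, and falling back to instinct for an unrecognized family is an assumption drawn from the "> instinct" tail of the precedence.

```python
from typing import Mapping, Optional

# Hypothetical profile table mirroring the matrix in this PR; the chart's
# actual values layout may differ.
PROFILES: Mapping[str, Mapping[str, str]] = {
    "instinct": {"driverVersion": "7.0"},
    "radeon": {"driverVersion": "7.13"},
}


def resolve_driver_version(
    gpu_stack_family: Optional[str] = "",
    driver_version: Optional[str] = "",
    profiles: Mapping[str, Mapping[str, str]] = PROFILES,
) -> str:
    """explicit driverVersion > profiles[family].driverVersion > instinct."""
    # 1. An explicit driverVersion always wins.
    if driver_version:
        return driver_version
    # 2. Otherwise use the profile for the requested family, if it has one.
    profile = profiles.get(gpu_stack_family or "", {})
    if profile.get("driverVersion"):
        return profile["driverVersion"]
    # 3. Empty (or, by assumption, unknown) family falls back to instinct.
    return profiles["instinct"]["driverVersion"]


if __name__ == "__main__":
    for family in ("", "instinct", "radeon"):
        print(repr(family), resolve_driver_version(family))
```

Because the explicit value is checked first, a GitOps-only operator can pin a driver version without touching gpuStackFamily at all.
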

The radeon DeviceConfig driver version (7.13) mirrors a value owned outside cluster-bloom / cluster-forge (ROCm PMO / EAI-5906); it is not an authoritative pin held here.

Test plan

  • helm lint on amd-gpu-operator-config/v1.4.1, amd-gpu-operator-config/v1.5.1-beta.0, and gitea-init-job/0.1.0
  • DeviceConfig version resolution: default/empty → 7.0, instinct → 7.0, radeon → 7.13, explicit driverVersion override wins (both chart versions)
  • App-of-apps render (small): default → amd-gpu-operator/v1.4.1 + amd-gpu-operator-config/v1.4.1; radeon --set → both v1.5.1-beta.0
  • gitea-init-job (medium/large) emits apps.amd-gpu-operator.path, apps.amd-gpu-operator-config.path, and config valuesObject
  • Default render (no flags) unchanged across small/medium/large (empty values → instinct → v1.4.1)
  • SBOM validate-sync.sh passes
  • Reviewer: confirm ArgoCD sync against a live cluster (not testable locally)

Parameterizes the amd-gpu-operator-config DeviceConfig out-of-tree ROCm
driver version by GPU family (radeon | instinct) via a gpuStackFamily /
driverVersion value and a gpuStack.driverVersion helper. Empty resolves
to instinct (7.0), so existing installs are unchanged; radeon resolves
to the ROCm 7.13 tech-preview pin.

The GPU Operator chart version is selected separately through the
app-level apps.amd-gpu-operator.path field. cluster-bloom injects both
the path and the config valuesObject: small clusters via --set on the
helm render, medium/large via the gitea-init-job, which now emits the
amd-gpu-operator path and amd-gpu-operator-config valuesObject into
cluster-values.

Radeon driver version is a placeholder pending EAI-5906. Part of
EAI-6030.

pre added 4 commits June 12, 2026 14:59

Split the previously unversioned amd-gpu-operator-config chart into
v1.4.1 (the existing DeviceConfig example, default) and v1.5.1-beta.0
(the new-schema DeviceConfig for the radeon tech-preview stack). The
beta DeviceConfig resolves its ROCm driver version through the existing
gpuStack.driverVersion helper, so radeon gets 7.13 and the default
stays 7.0.

root/values.yaml defaults the config app path to
amd-gpu-operator-config/v1.4.1; the beta chart is selected only when
cluster-bloom injects apps.amd-gpu-operator-config.path for
GPU_STACK_FAMILY=radeon. The gitea-init-job now emits that path for
medium/large clusters. Part of EAI-6030.

Read the imageRegistrySecret name from values.yaml instead of
hardcoding dockerhub-amdpsdo-regcred across the four DeviceConfig
component sections. Default keeps the existing secret name. Part of
EAI-6030.

Guard each imageRegistrySecret block with the resolved name so an empty
or absent imageRegistrySecret.name renders no key at all, instead of
emitting an empty name that produces invalid DeviceConfig YAML. Part of
EAI-6030.
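
The guard described in the last commit amounts to omitting the key entirely when no secret name resolves. A minimal Python sketch of that behavior follows; it is a model of the rendering rule, not the Helm template, and the function name `component_spec` and the `image` field are hypothetical placeholders for a DeviceConfig component section.

```python
from typing import Any, Dict, Optional


def component_spec(image: str, registry_secret_name: Optional[str]) -> Dict[str, Any]:
    """Build one component section, adding imageRegistrySecret only if named."""
    spec: Dict[str, Any] = {"image": image}
    # Treat absent, empty, and whitespace-only names the same way.
    name = (registry_secret_name or "").strip()
    if name:
        # Only emit the block when there is a real name to put in it.
        spec["imageRegistrySecret"] = {"name": name}
    return spec


if __name__ == "__main__":
    # Default secret name taken from this PR's description.
    print(component_spec("example/image:tag", "dockerhub-amdpsdo-regcred"))
    print(component_spec("example/image:tag", ""))
```

Omitting the key, rather than emitting `name: ""`, keeps the rendered DeviceConfig valid for clusters that pull from a registry without credentials.
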
@pre pre force-pushed the EAI-6030-gpu-stack-family branch from 7d536ee to 3e4b701 on June 12, 2026 14:03