EAI-6030 Select GPU Operator and DeviceConfig by GPU family #744
Open
pre wants to merge 5 commits into
Parameterizes the amd-gpu-operator-config DeviceConfig out-of-tree ROCm driver version by GPU family (radeon | instinct) via a gpuStackFamily / driverVersion value and a gpuStack.driverVersion helper. Empty resolves to instinct (7.0), so existing installs are unchanged; radeon resolves to the ROCm 7.13 tech-preview pin. The GPU Operator chart version is selected separately through the app-level apps.amd-gpu-operator.path field. cluster-bloom injects both the path and the config valuesObject: small clusters via --set on the helm render, medium/large via the gitea-init-job, which now emits the amd-gpu-operator path and amd-gpu-operator-config valuesObject into cluster-values. Radeon driver version is a placeholder pending EAI-5906. Part of EAI-6030.
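The resolution order can be sketched outside Helm as a small Python function. This is an illustration only, not the chart's template: the function name `resolve_driver_version` and the `PROFILES` table are hypothetical, the 7.0 and 7.13 values are the ones quoted in this PR, and the precedence (explicit driverVersion, then the family profile, then instinct) follows the PR summary. Falling back to instinct for an unrecognized family is an assumption.

```python
from typing import Optional

# Hypothetical profile table; the versions are the ones quoted in this PR.
PROFILES = {
    "instinct": {"driverVersion": "7.0"},
    "radeon": {"driverVersion": "7.13"},  # placeholder pending EAI-5906
}


def resolve_driver_version(family: Optional[str] = "",
                           driver_version: Optional[str] = "") -> str:
    """Resolve the ROCm driver version.

    Precedence: explicit driver_version, then the family's profile,
    then the instinct profile.
    """
    if driver_version:
        return driver_version
    # An empty family is treated as instinct; an unknown family has no
    # profile and therefore also falls through to instinct (assumption).
    profile = PROFILES.get(family or "instinct", {})
    return profile.get("driverVersion") or PROFILES["instinct"]["driverVersion"]
```

Calling `resolve_driver_version()` with no arguments models an existing install that sets neither value, which is the case the commit message says must stay unchanged.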
53a9ef5 to c22222e
Split the previously unversioned amd-gpu-operator-config chart into v1.4.1 (the existing DeviceConfig example, default) and v1.5.1-beta.0 (the new-schema DeviceConfig for the radeon tech-preview stack). The beta DeviceConfig resolves its ROCm driver version through the existing gpuStack.driverVersion helper, so radeon gets 7.13 and the default stays 7.0. root/values.yaml defaults the config app path to amd-gpu-operator-config/v1.4.1; the beta chart is selected only when cluster-bloom injects apps.amd-gpu-operator-config.path for GPU_STACK_FAMILY=radeon. The gitea-init-job now emits that path for medium/large clusters. Part of EAI-6030.
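As a rough model of the path selection, the sketch below computes which app paths would be overridden for a given family and turns them into `helm --set` arguments, as described for small clusters. The function names and the `DEFAULT_PATHS` / `RADEON_PATHS` tables are hypothetical; the path values are the ones named in this PR. It assumes nothing is injected for an empty or instinct family, so the root/values.yaml defaults apply; whether cluster-bloom behaves exactly this way is not confirmed here.

```python
from typing import Dict, List

# Defaults as described for root/values.yaml in this PR.
DEFAULT_PATHS = {
    "apps.amd-gpu-operator.path": "amd-gpu-operator/v1.4.1",
    "apps.amd-gpu-operator-config.path": "amd-gpu-operator-config/v1.4.1",
}

# Paths injected for GPU_STACK_FAMILY=radeon (tech-preview stack).
RADEON_PATHS = {
    "apps.amd-gpu-operator.path": "amd-gpu-operator/v1.5.1-beta.0",
    "apps.amd-gpu-operator-config.path": "amd-gpu-operator-config/v1.5.1-beta.0",
}


def path_overrides(family: str = "") -> Dict[str, str]:
    """Return the path overrides to inject; empty unless family is radeon."""
    return dict(RADEON_PATHS) if family == "radeon" else {}


def effective_paths(family: str = "") -> Dict[str, str]:
    """Merge the overrides on top of the defaults."""
    return {**DEFAULT_PATHS, **path_overrides(family)}


def helm_set_args(overrides: Dict[str, str]) -> List[str]:
    """Render overrides as a flat list of `--set key=value` arguments."""
    args: List[str] = []
    for key, value in sorted(overrides.items()):
        args += ["--set", f"{key}={value}"]
    return args
```

Keeping the overrides separate from the defaults mirrors the intent that the beta charts are reached only through an explicit injection, never through a changed default.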
Read the imageRegistrySecret name from values.yaml instead of hardcoding dockerhub-amdpsdo-regcred across the four DeviceConfig component sections. Default keeps the existing secret name. Part of EAI-6030.
Guard each imageRegistrySecret block with the resolved name so an empty or absent imageRegistrySecret.name renders no key at all, instead of emitting an empty name that produces invalid DeviceConfig YAML. Part of EAI-6030.
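The guard can be modelled in Python as building a component section that only carries the imageRegistrySecret key when a name resolves. This is a sketch of the behaviour, not the template itself: `component_section` and the `image` field are hypothetical stand-ins, and only the secret-name default comes from this PR.

```python
from typing import Any, Dict, Optional

# Default secret name kept by the values.yaml change in this PR.
DEFAULT_SECRET_NAME = "dockerhub-amdpsdo-regcred"


def component_section(image: str,
                      secret_name: Optional[str] = DEFAULT_SECRET_NAME) -> Dict[str, Any]:
    """Build one component section of a DeviceConfig-like structure.

    The imageRegistrySecret key is omitted entirely when the name is
    empty or absent, rather than emitted with an empty name.
    """
    section: Dict[str, Any] = {"image": image}  # hypothetical field
    if secret_name:
        section["imageRegistrySecret"] = {"name": secret_name}
    return section
```

With the default, the section carries the existing secret; with `secret_name=""` or `None`, the key is missing from the result, which corresponds to rendering no key at all.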
7d536ee to 3e4b701
Related:
Summary
Second half of EAI-6030: the cluster-forge side of GPU-family-driven ROCm / GPU Operator defaults. Stacked on #741 (the AIM catalog work); base is EAI-6030-aim-gpu-family. Pairs with cluster-bloom #259.

The GPU stack is selected as one matrix row by GPU_STACK_FAMILY (injected by cluster-bloom). Empty / instinct keeps today's qualified defaults; radeon selects the ROCm 7.13 tech-preview stack:

- sources/amd-gpu-operator/v1.5.1-beta.0 is added so radeon can resolve to it, but root/values.yaml + sources/amd-gpu-operator/source.yaml keep v1.4.1. The beta chart is selected only when cluster-bloom injects apps.amd-gpu-operator.path=amd-gpu-operator/v1.5.1-beta.0 for GPU_STACK_FAMILY=radeon.
- sources/amd-gpu-operator-config is split into v1.4.1 (the existing DeviceConfig example, default) and v1.5.1-beta.0 (the new-schema DeviceConfig for the radeon tech-preview stack). root/values.yaml defaults the config app path to amd-gpu-operator-config/v1.4.1; the beta config chart is selected only when bloom injects apps.amd-gpu-operator-config.path for radeon.
- The DeviceConfig resolves spec.driver.version through the gpuStack.driverVersion helper (precedence: explicit driverVersion > profiles[family].driverVersion > instinct). Empty resolves to instinct → 7.0; radeon → 7.13.
- The gitea-init-job emits apps.amd-gpu-operator.path, apps.amd-gpu-operator-config.path, and the amd-gpu-operator-config valuesObject (gpuStackFamily / driverVersion) into cluster-values.
- apps.amd-gpu-operator-config.valuesObject.gpuStackFamily / driverVersion are declared with empty defaults so bloom can inject them and a GitOps-only operator can set the same keys (ADR-0002 parity).

The radeon DeviceConfig driver version (7.13) mirrors a value owned outside cluster-bloom / cluster-forge (ROCm PMO / EAI-5906); it is not an authoritative pin held here.

Test plan

- helm lint on amd-gpu-operator-config/v1.4.1, amd-gpu-operator-config/v1.5.1-beta.0, and gitea-init-job/0.1.0
- Driver version resolution: empty → 7.0, instinct → 7.0, radeon → 7.13; an explicit driverVersion override wins (both chart versions)
- Default → amd-gpu-operator/v1.4.1 + amd-gpu-operator-config/v1.4.1; radeon --set → both v1.5.1-beta.0
- gitea-init-job output contains apps.amd-gpu-operator.path, apps.amd-gpu-operator-config.path, and the config valuesObject
- validate-sync.sh passes