10 changes: 10 additions & 0 deletions docs/values_inheritance_pattern.md
@@ -153,6 +153,16 @@ When ArgoCD renders applications with multi-source:
not a string, so no comma parsing is involved. The base `root/values.yaml`
default is an empty list, which selects the legacy (install-all) branch.

The GPU stack family (ROCm + GPU Operator) is injected the same way, driven by
cluster-bloom's `GPU_STACK_FAMILY`. Keys are set on two child apps:
`apps.amd-gpu-operator.path` selects the GPU Operator chart version, and
`apps.amd-gpu-operator-config.valuesObject.gpuStackFamily` and
`apps.amd-gpu-operator-config.valuesObject.driverVersion` set the ROCm version
of the DeviceConfig out-of-tree driver (see
`sources/amd-gpu-operator-config`). Empty values resolve to the instinct
defaults, so existing installs are unchanged. Note that the GPU Operator
version is the app-level `path` field (a sibling of `valuesObject`), not a Helm
value, so it is set directly rather than inside `valuesObject`.
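
As a rough illustration of the resulting overlay, the keys on the two child
apps might look like the fragment below. This is a sketch and not taken from
the repository: the `amd-gpu-operator` path value is a placeholder, and the
`radeon` / `"7.13"` pair is borrowed from the per-family defaults in the
`amd-gpu-operator-config` chart's `values.yaml`.

```yaml
# Hypothetical overlay for GPU_STACK_FAMILY=radeon (illustration only).
apps:
  amd-gpu-operator:
    # App-level field, a sibling of valuesObject. Placeholder value: the real
    # chart directory and version are chosen by cluster-bloom.
    path: "amd-gpu-operator/<chart-version>"
  amd-gpu-operator-config:
    valuesObject:
      gpuStackFamily: "radeon"
      # Family-resolved ROCm driver version; "7.13" is the radeon profile
      # default in the chart's values.yaml.
      driverVersion: "7.13"
```

Leaving both `valuesObject` keys empty keeps the instinct defaults.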

## Developer Workflow

### Local Configuration Management (Local Mode)
8 changes: 7 additions & 1 deletion root/values.yaml
@@ -123,8 +123,14 @@ apps:
     install: false
   amd-gpu-operator-config:
     namespace: kube-amd-gpu
-    path: amd-gpu-operator-config
+    path: amd-gpu-operator-config/v1.4.1
     syncWave: 0
+    valuesObject:
+      # GPU stack family driving the DeviceConfig ROCm pins (radeon | instinct).
+      # Empty = instinct (current default). cluster-bloom injects the selected
+      # family + resolved driver version at deploy time.
+      gpuStackFamily: ""
+      driverVersion: ""
   appwrapper:
     namespace: appwrapper-system
     path: appwrapper/v1.1.2
39 changes: 39 additions & 0 deletions sources/amd-gpu-operator-config/v1.4.1/README.md
@@ -0,0 +1,39 @@
# amd-gpu-operator-config

Cluster-side configuration for the AMD GPU Operator: the `DeviceConfig` custom
resource plus supporting RBAC and the metrics `gpu-config` ConfigMap.

## GPU stack family selection

The ROCm version of the DeviceConfig out-of-tree driver is selected per GPU
family so that it matches both the host ROCm installed by cluster-bloom and the
GPU Operator chart version. The selection is driven by cluster-bloom's
`GPU_STACK_FAMILY` flag.

Values:

| Key | Meaning |
|-------|---------|
| `gpuStackFamily` | `radeon` \| `instinct`. Empty resolves to `instinct` (the current default). |
| `driverVersion` | Explicit DeviceConfig `spec.driver.version` override. When set, wins over the per-family default. |
| `profiles.<family>.driverVersion` | Per-family default ROCm driver version used when `driverVersion` is empty. |

Resolution precedence (see `templates/_helpers.tpl`, `gpuStack.driverVersion`):

1. `driverVersion` if set (cluster-bloom injects the family-resolved value here),
2. else `profiles[gpuStackFamily].driverVersion`,
3. else the `instinct` profile.

Empty input resolves to `instinct` → `7.0`, so existing installs are unchanged.
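
The precedence above can be sketched as three alternative value overrides,
with the version each one should resolve to noted in comments. This is an
illustration only: the outcomes assume the chart's default `profiles`
(`instinct` = `7.0`, `radeon` = `7.13`), and the explicit `"7.1"` override is a
made-up value.

```yaml
# Case 1: nothing set. Falls through to the instinct profile.
gpuStackFamily: ""
driverVersion: ""
# resolved spec.driver.version: "7.0"
---
# Case 2: family set, no explicit version. Uses profiles.radeon.driverVersion.
gpuStackFamily: "radeon"
driverVersion: ""
# resolved spec.driver.version: "7.13"
---
# Case 3: explicit version set. It wins regardless of the family.
gpuStackFamily: "radeon"
driverVersion: "7.1"   # hypothetical override value
# resolved spec.driver.version: "7.1"
```

A family name with no matching entry under `profiles` also falls back to the
`instinct` profile (step 3).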

### How the value is injected

- **Small clusters:** cluster-bloom renders the parent chart with
`--set apps.amd-gpu-operator-config.valuesObject.gpuStackFamily=<family>` and
`--set apps.amd-gpu-operator-config.valuesObject.driverVersion=<version>`.
- **Medium / large clusters:** the `gitea-init-job` writes the same keys into the
cluster-values repo.

The GPU Operator chart version itself is selected separately via the app-level
`apps.amd-gpu-operator.path` field (a sibling of `valuesObject`), not in this
chart.
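
For orientation, with the `radeon` family injected and the default profiles in
this chart, the driver stanza of the rendered DeviceConfig would be expected to
look roughly like this. It is a sketch of the rendered output, not a captured
one, and all other `spec.driver` fields are omitted.

```yaml
# Expected shape of the rendered DeviceConfig (illustration only).
spec:
  driver:
    # Produced by the gpuStack.driverVersion helper; "7.13" is the radeon
    # profile default, which the chart's values.yaml marks as a placeholder.
    version: "7.13"
```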

17 changes: 17 additions & 0 deletions sources/amd-gpu-operator-config/v1.4.1/templates/_helpers.tpl
@@ -0,0 +1,17 @@
{{/*
Resolve the DeviceConfig out-of-tree driver ROCm version.
Precedence:
1. .Values.driverVersion (explicit override injected by cluster-bloom)
2. .Values.profiles[<family>].driverVersion for the selected gpuStackFamily
3. the instinct profile (the current default)
Empty gpuStackFamily resolves to instinct, so existing installs are unchanged.
*/}}
{{- define "gpuStack.driverVersion" -}}
{{- if .Values.driverVersion -}}
{{- .Values.driverVersion -}}
{{- else -}}
{{- $family := .Values.gpuStackFamily | default "instinct" -}}
{{- $profile := index .Values.profiles $family | default (index .Values.profiles "instinct") -}}
{{- $profile.driverVersion -}}
{{- end -}}
{{- end -}}
@@ -19,7 +19,8 @@ spec:
     # Specify the out-of-tree driver version
     # NOTE: Starting from ROCm 7.1 the amdgpu version is using new versioning schema
     # please refer to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html
-    version: "7.0"
+    # Resolved per GPU stack family (gpuStackFamily); empty family = instinct = "7.0".
+    version: "{{ include "gpuStack.driverVersion" . }}"

     # Specify driver image here
     # DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you
22 changes: 22 additions & 0 deletions sources/amd-gpu-operator-config/v1.4.1/values.yaml
@@ -0,0 +1,22 @@
# Copyright © Advanced Micro Devices, Inc., or its affiliates.
#
# SPDX-License-Identifier: MIT

# GPU stack family driving the DeviceConfig ROCm pins (radeon | instinct).
# Empty resolves to instinct (the current default). cluster-bloom injects this
# at deploy time from GPU_STACK_FAMILY.
gpuStackFamily: ""

# DeviceConfig out-of-tree driver ROCm version. When set, it overrides the
# per-family default below. cluster-bloom passes the family-resolved version
# here so the GPU Operator DeviceConfig matches the host ROCm train.
driverVersion: ""

# Per-family DeviceConfig defaults, used when driverVersion is not set.
# TODO(EAI-5906): replace the radeon ROCm driver version with the real
# ROCm 7.13 tech-preview value once published.
profiles:
  instinct:
    driverVersion: "7.0"
  radeon:
    driverVersion: "7.13"
4 changes: 4 additions & 0 deletions sources/amd-gpu-operator-config/v1.5.1-beta.0/Chart.yaml
@@ -0,0 +1,4 @@
apiVersion: v2
name: amd-gpu-operator-config
description: A Helm chart with config for the AMD GPU Operator
version: 0.1.0
39 changes: 39 additions & 0 deletions sources/amd-gpu-operator-config/v1.5.1-beta.0/README.md
@@ -0,0 +1,39 @@
# amd-gpu-operator-config

Cluster-side configuration for the AMD GPU Operator: the `DeviceConfig` custom
resource plus supporting RBAC and the metrics `gpu-config` ConfigMap.

## GPU stack family selection

The ROCm version of the DeviceConfig out-of-tree driver is selected per GPU
family so that it matches both the host ROCm installed by cluster-bloom and the
GPU Operator chart version. The selection is driven by cluster-bloom's
`GPU_STACK_FAMILY` flag.

Values:

| Key | Meaning |
|-------|---------|
| `gpuStackFamily` | `radeon` \| `instinct`. Empty resolves to `instinct` (the current default). |
| `driverVersion` | Explicit DeviceConfig `spec.driver.version` override. When set, wins over the per-family default. |
| `profiles.<family>.driverVersion` | Per-family default ROCm driver version used when `driverVersion` is empty. |

Resolution precedence (see `templates/_helpers.tpl`, `gpuStack.driverVersion`):

1. `driverVersion` if set (cluster-bloom injects the family-resolved value here),
2. else `profiles[gpuStackFamily].driverVersion`,
3. else the `instinct` profile.

Empty input resolves to `instinct` → `7.0`, so existing installs are unchanged.

### How the value is injected

- **Small clusters:** cluster-bloom renders the parent chart with
`--set apps.amd-gpu-operator-config.valuesObject.gpuStackFamily=<family>` and
`--set apps.amd-gpu-operator-config.valuesObject.driverVersion=<version>`.
- **Medium / large clusters:** the `gitea-init-job` writes the same keys into the
cluster-values repo.

The GPU Operator chart version itself is selected separately via the app-level
`apps.amd-gpu-operator.path` field (a sibling of `valuesObject`), not in this
chart.

@@ -0,0 +1,17 @@
{{/*
Resolve the DeviceConfig out-of-tree driver ROCm version.
Precedence:
1. .Values.driverVersion (explicit override injected by cluster-bloom)
2. .Values.profiles[<family>].driverVersion for the selected gpuStackFamily
3. the instinct profile (the current default)
Empty gpuStackFamily resolves to instinct, so existing installs are unchanged.
*/}}
{{- define "gpuStack.driverVersion" -}}
{{- if .Values.driverVersion -}}
{{- .Values.driverVersion -}}
{{- else -}}
{{- $family := .Values.gpuStackFamily | default "instinct" -}}
{{- $profile := index .Values.profiles $family | default (index .Values.profiles "instinct") -}}
{{- $profile.driverVersion -}}
{{- end -}}
{{- end -}}
@@ -0,0 +1,124 @@
apiVersion: v1
data:
  config.json: |
    {
      "GPUConfig": {
        "Fields": [
          "GPU_NODES_TOTAL",
          "GPU_PACKAGE_POWER",
          "GPU_AVERAGE_PACKAGE_POWER",
          "GPU_EDGE_TEMPERATURE",
          "GPU_JUNCTION_TEMPERATURE",
          "GPU_MEMORY_TEMPERATURE",
          "GPU_HBM_TEMPERATURE",
          "GPU_GFX_ACTIVITY",
          "GPU_UMC_ACTIVITY",
          "GPU_MMA_ACTIVITY",
          "GPU_VCN_ACTIVITY",
          "GPU_JPEG_ACTIVITY",
          "GPU_VOLTAGE",
          "GPU_GFX_VOLTAGE",
          "GPU_MEMORY_VOLTAGE",
          "PCIE_SPEED",
          "PCIE_MAX_SPEED",
          "PCIE_BANDWIDTH",
          "GPU_ENERGY_CONSUMED",
          "PCIE_REPLAY_COUNT",
          "PCIE_RECOVERY_COUNT",
          "PCIE_REPLAY_ROLLOVER_COUNT",
          "PCIE_NACK_SENT_COUNT",
          "PCIE_NAC_RECEIVED_COUNT",
          "GPU_CLOCK",
          "GPU_POWER_USAGE",
          "GPU_TOTAL_VRAM",
          "GPU_ECC_CORRECT_TOTAL",
          "GPU_ECC_UNCORRECT_TOTAL",
          "GPU_ECC_CORRECT_SDMA",
          "GPU_ECC_UNCORRECT_SDMA",
          "GPU_ECC_CORRECT_GFX",
          "GPU_ECC_UNCORRECT_GFX",
          "GPU_ECC_CORRECT_MMHUB",
          "GPU_ECC_UNCORRECT_MMHUB",
          "GPU_ECC_CORRECT_ATHUB",
          "GPU_ECC_UNCORRECT_ATHUB",
          "GPU_ECC_CORRECT_BIF",
          "GPU_ECC_UNCORRECT_BIF",
          "GPU_ECC_CORRECT_HDP",
          "GPU_ECC_UNCORRECT_HDP",
          "GPU_ECC_CORRECT_XGMI_WAFL",
          "GPU_ECC_UNCORRECT_XGMI_WAFL",
          "GPU_ECC_CORRECT_DF",
          "GPU_ECC_UNCORRECT_DF",
          "GPU_ECC_CORRECT_SMN",
          "GPU_ECC_UNCORRECT_SMN",
          "GPU_ECC_CORRECT_SEM",
          "GPU_ECC_UNCORRECT_SEM",
          "GPU_ECC_CORRECT_MP0",
          "GPU_ECC_UNCORRECT_MP0",
          "GPU_ECC_CORRECT_MP1",
          "GPU_ECC_UNCORRECT_MP1",
          "GPU_ECC_CORRECT_FUSE",
          "GPU_ECC_UNCORRECT_FUSE",
          "GPU_ECC_CORRECT_UMC",
          "GPU_ECC_UNCORRECT_UMC",
          "GPU_XGMI_NBR_0_NOP_TX",
          "GPU_XGMI_NBR_0_REQ_TX",
          "GPU_XGMI_NBR_0_RESP_TX",
          "GPU_XGMI_NBR_0_BEATS_TX",
          "GPU_XGMI_NBR_1_NOP_TX",
          "GPU_XGMI_NBR_1_REQ_TX",
          "GPU_XGMI_NBR_1_RESP_TX",
          "GPU_XGMI_NBR_1_BEATS_TX",
          "GPU_XGMI_NBR_0_TX_THRPUT",
          "GPU_XGMI_NBR_1_TX_THRPUT",
          "GPU_XGMI_NBR_2_TX_THRPUT",
          "GPU_XGMI_NBR_3_TX_THRPUT",
          "GPU_XGMI_NBR_4_TX_THRPUT",
          "GPU_XGMI_NBR_5_TX_THRPUT",
          "GPU_USED_VRAM",
          "GPU_FREE_VRAM",
          "GPU_TOTAL_VISIBLE_VRAM",
          "GPU_USED_VISIBLE_VRAM",
          "GPU_FREE_VISIBLE_VRAM",
          "GPU_TOTAL_GTT",
          "GPU_USED_GTT",
          "GPU_FREE_GTT",
          "GPU_ECC_CORRECT_MCA",
          "GPU_ECC_UNCORRECT_MCA",
          "GPU_ECC_CORRECT_VCN",
          "GPU_ECC_UNCORRECT_VCN",
          "GPU_ECC_CORRECT_JPEG",
          "GPU_ECC_UNCORRECT_JPEG",
          "GPU_ECC_CORRECT_IH",
          "GPU_ECC_UNCORRECT_IH",
          "GPU_ECC_CORRECT_MPIO",
          "GPU_ECC_UNCORRECT_MPIO"
        ],
        "Labels": [
          "GPU_UUID",
          "SERIAL_NUMBER",
          "GPU_ID",
          "POD",
          "NAMESPACE",
          "CONTAINER",
          "CLUSTER_NAME",
          "CARD_SERIES",
          "CARD_MODEL",
          "CARD_VENDOR",
          "DRIVER_VERSION",
          "VBIOS_VERSION",
          "HOSTNAME"
        ],
        "ExtraPodLabels" : {
          "WORKLOAD_ID" : "airm.silogen.ai/workload-id",
          "PROJECT_ID" : "airm.silogen.ai/project-id"
        },
        "CustomLabels" : {
          "KUBE_CLUSTER_NAME" : "demo-cluster"
        }
      }
    }
kind: ConfigMap
metadata:
  name: gpu-config
  namespace: {{ .Release.Namespace }}
@@ -0,0 +1,68 @@
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: {{ .Release.Namespace }}
spec:
  configManager:
    enable: false
    image: rocm/device-config-manager:v1.5.1-beta.0
    imagePullPolicy: IfNotPresent
    {{- with (.Values.imageRegistrySecret).name }}
    imageRegistrySecret:
      name: {{ . }}
    {{- end }}
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    enableDevicePlugin: true
    enableNodeLabeller: true
    {{- with (.Values.imageRegistrySecret).name }}
    imageRegistrySecret:
      name: {{ . }}
    {{- end }}
    kubeletSocketPath: /var/lib/kubelet/device-plugins
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    nodeLabellerImagePullPolicy: Always
  driver:
    blacklist: false
    driverType: container
    enable: false
    image: imageregistry.io/username/repo
    # Resolved per GPU stack family (gpuStackFamily); empty family = instinct = "7.0".
    version: "{{ include "gpuStack.driverVersion" . }}"
  metricsExporter:
    config:
      name: gpu-config
    enable: true
    image: docker.io/rocm/device-metrics-exporter:v1.5.1-beta.0
    imagePullPolicy: IfNotPresent
    {{- with (.Values.imageRegistrySecret).name }}
    imageRegistrySecret:
      name: {{ . }}
    {{- end }}
    nodePort: 32500
    podAnnotations: {}
    podResourceAPISocketPath: /var/lib/kubelet/pod-resources
    port: 5000
    resource:
      limits:
        cpu: "2"
        memory: 4G
      requests:
        cpu: 500m
        memory: 512M
    serviceAnnotations: {}
    serviceType: NodePort
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
  testRunner:
    enable: false
    image: docker.io/rocm/test-runner:v1.5.1-beta.0
    imagePullPolicy: IfNotPresent
    {{- with (.Values.imageRegistrySecret).name }}
    imageRegistrySecret:
      name: {{ . }}
    {{- end }}
    logsLocation:
      hostPath: /var/log/amd-test-runner
      mountPath: /var/log/amd-test-runner