EAI-6030: Add per-hardware-family AIM model source selection #741

Open
pre wants to merge 2 commits into main from EAI-6030-aim-gpu-family

pre commented Jun 10, 2026
Related

Summary

Converts sources/aim-cluster-model-source from an ArgoCD directory app into a Helm chart that installs either the legacy generic model sources (default) or per-hardware-family AIMClusterModelSource resources, selected by hardwareFamilies.

  • Legacy (default): when hardwareFamilies is empty, renders the existing amd-aim-release-* sources (0.8.5, 0.9.0, 0.10.0, 0.11.0) unchanged, so ArgoCD does not prune/recreate existing installs.
  • Per-family profiles: when set, installs only the listed families (cpu, epyc, instinct, radeon).
  • The value is supplied as a structured YAML list via valuesObject (cluster-bloom injects it at deploy time; see the companion cluster-bloom PR). Because the list is never flattened into a comma-separated string, no hop has to split on commas; cluster-apps.yaml is untouched.
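
For illustration, the two modes might look like this in the chart's values. This is a sketch, not the committed file: only the hardwareFamilies key and the family names come from this PR, and the file layout is an assumption.

```yaml
# values.yaml (chart default): an empty list selects the legacy branch,
# which renders the existing amd-aim-release-* sources.
hardwareFamilies: []

# Example override (assumed to arrive via valuesObject): a real YAML list,
# so each family is its own element and nothing has to be split on commas.
# hardwareFamilies:
#   - epyc
#   - instinct

# What the structured list avoids: a single string that some later step
# would have to split, e.g.
# hardwareFamilies: "epyc,instinct"
```

Commas are also assignment separators in Helm's `--set` syntax, which is one reason a comma-joined string is fragile when passed through parameter-style overrides.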

Pairs with cluster-bloom PR for the AIM_HARDWARE_FAMILY install flag. Part of EAI-6030. This covers the AIM-catalog portion only; the ROCm 7.13 / GPU Operator profile defaults are separate work.

Design notes

  • instinct/radeon are GPU families; cpu/epyc are CPU inference targets, hence "hardware family" rather than "GPU family".
  • cpu and radeon are pinned as placeholders on ghcr.io, which requires ghcr-regcred (not provisioned). This is an accepted limitation: their image pulls fail until a docker.io release exists.

Test plan

  • helm lint sources/aim-cluster-model-source/
  • Empty value → 4 legacy sources, 0 profiles
  • hardwareFamilies=[instinct] → instinct only
  • hardwareFamilies=[epyc,instinct] → both, nothing else
  • Parent render bakes the list into the child Application values: (default [] = legacy)
  • Reviewer: confirm ArgoCD sync against a live cluster (not testable locally)

Convert sources/aim-cluster-model-source from an ArgoCD directory app into
a Helm chart that renders either the legacy generic model sources (default,
when hardwareFamilies is empty) or per-hardware-family AIMClusterModelSource
resources (cpu, epyc, instinct, radeon). The legacy branch reproduces the
existing amd-aim-release-* resources unchanged so ArgoCD does not prune or
recreate existing installs.
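
A minimal sketch of how such a branch could be written in a Helm template, for reviewers who want the shape without opening the diff. The file name, the resource naming scheme, and the apiVersion are placeholders, not taken from this PR; the actual chart may structure this differently.

```yaml
{{- /* templates/model-sources.yaml (hypothetical file name) */ -}}
{{- if .Values.hardwareFamilies }}
{{- /* Per-family branch: one resource per listed family, nothing else. */}}
{{- range $family := .Values.hardwareFamilies }}
---
apiVersion: example.invalid/v1   # placeholder: use the real AIMClusterModelSource API group/version
kind: AIMClusterModelSource
metadata:
  name: aim-{{ $family }}        # hypothetical naming scheme
# spec omitted: registry and filter fields depend on the CRD
{{- end }}
{{- else }}
{{- /* Legacy branch: an empty list is falsy in Go templates, so this is the default. */}}
# The existing amd-aim-release-* manifests would go here verbatim,
# so that names and specs stay identical to what ArgoCD already tracks.
{{- end }}
```

Keeping the legacy manifests byte-for-byte identical in the else branch is what lets ArgoCD treat the chart conversion as a no-op for existing installs.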

The app's hardwareFamilies value is supplied as a structured YAML list via
valuesObject (cluster-bloom injects the selected families at deploy time),
so no comma parsing is involved on any hop. The base default is an empty
list, preserving legacy behavior.

Part of EAI-6030.

For medium/large clusters ArgoCD reads cluster-values/values.yaml, which the
gitea-init-job rebuilds from a template rather than copying the seeded
complete_values.yaml wholesale. Add an aimHardwareFamily value and emit the
apps.aim-cluster-model-source.valuesObject.hardwareFamilies block into
cluster-values when set, mirroring the existing airmImageRepository handling.
Without it the chart fell back to the legacy install-all branch on
medium/large.
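
As a sketch, the block emitted into cluster-values/values.yaml when aimHardwareFamily is set would have roughly this shape. The key path follows the description above; the selected families are an example, and the surrounding file contents are omitted.

```yaml
# cluster-values/values.yaml (fragment, rebuilt by gitea-init-job)
apps:
  aim-cluster-model-source:
    valuesObject:
      hardwareFamilies:   # example selection; omitted entirely when unset
        - epyc
        - instinct
```

When the block is absent, the chart sees its default empty list and renders the legacy sources.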

Part of EAI-6030.