feat(recipe): add NFD as standalone shared component #518

ArangoGutierrez wants to merge 5 commits into NVIDIA:main
Conversation
Extract Node Feature Discovery from gpu-operator's sub-chart into a standalone shared component. Both gpu-operator and network-operator depend on this shared NFD instance.

Changes:
- Add `nfd` component to `registry.yaml` with helm config and nodeScheduling
- Create `recipes/components/nfd/values.yaml` (Master + Worker + GC)
- Add `nfd` to `base.yaml` componentRefs (before cert-manager)
- Add `nfd` to gpu-operator dependencyRefs in all overlays
- Disable gpu-operator NFD sub-chart (`nfd.enabled: false`)
- Enable network-operator NodeFeatureRule deployment
- Add `nfd` to network-operator dependencyRefs (`kind.yaml`, `aks.yaml`)
- Remove NFD nodeScheduling paths from gpu-operator registry entry
- Remove `nfd.enabled: true` from `kind.yaml` gpu-operator overrides
- Create Chainsaw health check for NFD (Master + Worker + pod phases)

NFD deploys to the `node-feature-discovery` namespace. The Worker DaemonSet runs on all nodes (no accelerated nodeSelector) to discover hardware features everywhere. Master and GC run on system nodes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Update Chainsaw test assertions to reflect NFD as a standalone component:
- `cuj1-training/assert-recipe.yaml`: add nfd to componentRefs and deploymentOrder
- `cuj1-training/assert-bundle-scheduling.yaml`: replace node-feature-discovery scheduling paths with `nfd.enabled: false`
- `ai-conformance/offline/assert-recipe.yaml`: add nfd to componentRefs and deploymentOrder
- `ai-conformance/cluster/assert-gpu-operator.yaml`: remove NFD master/worker/gc assertions (now in node-feature-discovery namespace, covered by standalone nfd health check)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
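For context, a standalone NFD health check of the kind this commit describes could be shaped roughly like the following Chainsaw test. This is a sketch, not the PR's actual file: the test and step names are assumptions, while the resource names follow the upstream node-feature-discovery chart defaults.

```yaml
# Illustrative Chainsaw health check for standalone NFD (names assumed)
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: nfd-health
spec:
  steps:
    - name: validate-master-deployment
      try:
        - assert:
            resource:
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: node-feature-discovery-master
                namespace: node-feature-discovery
              status:
                (conditions[?type == 'Available']):
                  - status: "True"
    - name: validate-worker-daemonset
      try:
        - assert:
            resource:
              apiVersion: apps/v1
              kind: DaemonSet
              metadata:
                name: node-feature-discovery-worker
                namespace: node-feature-discovery
              status:
                (numberReady > `0`): true
```

The parenthesized keys are Chainsaw JMESPath assertions, which is how the diff hunks quoted later in this review express readiness checks.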
mchmarny
left a comment
Clean, declarative PR that follows the cert-manager shared-component pattern. No Go code changes needed — the registry + overlay system handles everything. Good separation of concerns.
Issues to address:
- **Missing overlays** — `rtx-pro-6000-lke-inference.yaml` and `rtx-pro-6000-lke-training.yaml` re-declare gpu-operator's `dependencyRefs` without `nfd`. If overlays replace rather than merge, gpu-operator has no ordering dependency on nfd in those recipes.
- **Health check missing GC** — The values enable `gc.enable: true` and the registry configures GC scheduling paths, but the health check only validates Master and Worker. The old gpu-operator assertions checked GC — this is a coverage regression.
Minor:
- PR description says "nfd → cert-manager" install order, but the actual computed deployment order (visible in test assertions) places nfd after cert-manager. Not a bug (they're independent), but the description is misleading.
Looks good:
- Worker excluded from accelerated `nodeSelectorPaths` so it runs everywhere — correct and well-commented
- `nfd.enabled: false` in gpu-operator + `deployNodeFeatureRules: true` in network-operator is the right combination
- kind overlay cleanup (removing the `nfd.enabled: true` override) is correct
- Old NFD assertions removed from `assert-gpu-operator.yaml` since they'd check the wrong namespace now
- Registry entry is well-structured with proper nodeScheduling paths
```yaml
manifestFiles:
  - components/gpu-operator/manifests/dcgm-exporter.yaml
dependencyRefs:
  - nfd
```
Two overlays that re-declare gpu-operator's dependencyRefs were not updated to include nfd:

- `recipes/overlays/rtx-pro-6000-lke-inference.yaml` — gpu-operator dependencyRefs: `[cert-manager, kube-prometheus-stack]`
- `recipes/overlays/rtx-pro-6000-lke-training.yaml` — gpu-operator dependencyRefs: `[cert-manager, kube-prometheus-stack]`
If leaf overlays replace (not merge) the base dependencyRefs, gpu-operator would not depend on nfd in RTX PRO 6000 LKE recipes. NFD still deploys as a standalone component, but without the dependency edge there's no ordering guarantee — gpu-operator could start before nfd is ready.
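If the replace semantics hold, the fix would be to re-add `nfd` in both leaf overlays. A sketch, assuming the overlay structure mirrors the hunks quoted in this review:

```yaml
# recipes/overlays/rtx-pro-6000-lke-inference.yaml (and -training.yaml) —
# hypothetical corrected dependencyRefs for gpu-operator:
dependencyRefs:
  - nfd
  - cert-manager
  - kube-prometheus-stack
```

If overlays deep-merge instead, no change is needed — which is exactly the behavior worth confirming before merge.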
```yaml
name: node-feature-discovery-worker
namespace: node-feature-discovery
status:
  (numberReady > `0`): true
```
The values.yaml enables the GC component (gc.enable: true) and the registry defines gc.nodeSelector/gc.tolerations paths, but this health check doesn't validate the GC Deployment. The old gpu-operator assertions (assert-gpu-operator.yaml) did check node-feature-discovery-gc.
Consider adding a step:
```yaml
- name: validate-gc-deployment
  try:
    - assert:
        resource:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: node-feature-discovery-gc
            namespace: node-feature-discovery
          status:
            (conditions[?type == 'Available']):
              - status: "True"
```

```yaml
tolerationPaths:
  - master.tolerations
  - gc.tolerations
accelerated:
```
Good call excluding worker from accelerated.nodeSelectorPaths — workers need to run on every node for hardware discovery. The comment is clear and sufficient.
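A sketch of the resulting registry nodeScheduling shape, for reference. The `accelerated` key and the master/gc paths come from the hunks quoted in this review; the `system` grouping is an assumption about the registry schema:

```yaml
nodeScheduling:
  system:
    nodeSelectorPaths:
      - master.nodeSelector
      - gc.nodeSelector
    tolerationPaths:
      - master.tolerations
      - gc.tolerations
  accelerated:
    nodeSelectorPaths: []   # worker excluded — the DaemonSet must run on every node
```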
```yaml
# NFD deployed as standalone shared component — disable sub-chart
nfd:
  enabled: false
```
The old values had node-feature-discovery.fullnameOverride: node-feature-discovery under the sub-chart key. The new values use nfd.enabled: false — which is the correct top-level key for the gpu-operator chart's NFD toggle. Clean.
One thing to verify: does the gpu-operator chart version used in current recipes support nfd.enabled: false as the disable mechanism? Older gpu-operator chart versions used the node-feature-discovery sub-chart key directly. If any overlay pins an older gpu-operator chart version, the sub-chart might still deploy.
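To make the verification concrete, these are the two disable mechanisms in question (which chart versions honor which key is exactly the thing to check):

```yaml
# Newer gpu-operator chart versions — top-level toggle used in this PR:
nfd:
  enabled: false

# Older chart versions — sub-chart key addressed directly:
node-feature-discovery:
  enabled: false
```

If any overlay pins an older chart version, only the second form would take effect there.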
```yaml
# limitations under the License.

# Node Feature Discovery standalone deployment
# NFD v0.18.3 — shared by gpu-operator and network-operator
```
Version is documented as v0.18.3 in the comment and pinned in base.yaml. The values here are minimal and sensible defaults. Topology updater disabled by default with a note about overlay overrides is a good approach.
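A minimal sketch of what such a values file could contain, assuming the upstream node-feature-discovery chart's key names (the PR's actual file is not fully quoted here):

```yaml
image:
  tag: v0.18.3          # pinned to match the version documented in base.yaml
master:
  enable: true          # scheduled to system nodes via registry nodeScheduling
worker:
  enable: true          # DaemonSet on all nodes — no accelerated selector
gc:
  enable: true          # garbage-collects stale NodeFeature objects
topologyUpdater:
  enable: false         # off by default; overlays may enable it
```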
```yaml
nfd:
  enabled: false
  deployNodeFeatureRules: true
```
This enables deployNodeFeatureRules while keeping nfd.enabled: false. That's the correct combination — network-operator uses the external standalone NFD to create NodeFeatureRule CRs for NIC labeling without deploying its own NFD instance.
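For reference, a NodeFeatureRule of the kind network-operator would create against the shared NFD instance. The rule content here is hypothetical (name, label); the `nfd.k8s.io/v1alpha1` schema is NFD's own:

```yaml
apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: example-nic-rule            # hypothetical name
spec:
  rules:
    - name: "ConnectX NIC present"
      labels:
        "example.com/nic": "true"   # hypothetical label key
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["15b3"]}   # NVIDIA/Mellanox PCI vendor ID
```

The standalone NFD master reconciles rules like this into node labels, which is why no second NFD instance is needed.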
```yaml
- matchExpressions:
    - key: node-role.kubernetes.io/control-plane
```
The kind overlay removed nfd.enabled: true from gpu-operator overrides (no longer needed since NFD is standalone). But it also needs nfd in network-operator's dependencyRefs at line 194 — which is included in this PR. Looks correct.
```yaml
- kgateway
- kube-prometheus-stack
- k8s-ephemeral-storage-metrics
- nfd
```
The PR description states the install order is "nfd → cert-manager → gpu-operator", but the deployment order here shows nfd after cert-manager and kube-prometheus-stack. This isn't a bug — nfd and cert-manager are independent in the dependency graph, so their relative order is determined by the topological sort tiebreaker. But updating the PR description to match the actual computed order would avoid confusion.
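To illustrate why independent components still get a stable relative order, here is a sketch of Kahn's topological sort with an alphabetical tiebreaker. The actual algorithm used by the recipe tooling is an assumption; this only shows the mechanism being described.

```python
# Kahn's algorithm; among components whose dependencies are satisfied,
# a min-heap emits them in alphabetical order (the tiebreaker).
import heapq

def deploy_order(deps: dict[str, list[str]]) -> list[str]:
    """Return a deployment order; deps[c] lists the components c depends on."""
    indegree = {c: len(ds) for c, ds in deps.items()}
    dependents: dict[str, list[str]] = {c: [] for c in deps}
    for c, ds in deps.items():
        for d in ds:
            dependents[d].append(c)      # edge d -> c: d must deploy before c
    ready = [c for c, n in indegree.items() if n == 0]
    heapq.heapify(ready)                 # min-heap => alphabetical tiebreak
    order = []
    while ready:
        c = heapq.heappop(ready)
        order.append(c)
        for nxt in dependents[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)
    return order

graph = {
    "cert-manager": [],
    "nfd": [],
    "gpu-operator": ["nfd", "cert-manager"],
    "network-operator": ["nfd"],
}
print(deploy_order(graph))
# ['cert-manager', 'nfd', 'gpu-operator', 'network-operator']
```

With this tiebreaker, `cert-manager` sorts before `nfd` even though the two are independent in the graph — matching the computed order seen in the test assertions.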
Summary

- gpu-operator and network-operator depend on the standalone NFD via `dependencyRefs`
- network-operator deploys `NodeFeatureRule` CRs (`deployNodeFeatureRules: true`)

Changes
- `recipes/registry.yaml` — add `nfd` component with helm config, healthCheck, nodeScheduling; remove NFD paths from gpu-operator
- `recipes/components/nfd/values.yaml` — new
- `recipes/overlays/base.yaml` — add `nfd` componentRef before cert-manager
- `recipes/overlays/*.yaml` (12 files) — add `nfd` to gpu-operator/network-operator `dependencyRefs`
- `recipes/components/gpu-operator/values.yaml` — `nfd.enabled: false` (disable sub-chart)
- `recipes/components/network-operator/values.yaml` — `nfd.deployNodeFeatureRules: true`
- `recipes/overlays/kind.yaml` — remove `nfd.enabled: true` override (no longer needed)
- `recipes/checks/nfd/health-check.yaml` — new

Design
Follows the existing cert-manager pattern — purely declarative, no Go code changes. NFD deploys to the `node-feature-discovery` namespace. The Worker DaemonSet runs on all nodes (excluded from the accelerated nodeSelector) so it discovers hardware features everywhere.

Install order: nfd → cert-manager → gpu-operator / network-operator → nvsentinel, etc.
Verification

- `make test` — all packages pass with race detector (72.5% coverage)
- `make lint-go` — 0 issues
- `make build` — builds successfully
- `aicr query` confirms `gpu-operator.values.nfd.enabled: false`
- `aicr query` confirms `network-operator.values.nfd.deployNodeFeatureRules: true`
- `nfd` in dependencyRefs

Related
Track B of NFD Adoption (Track A — NFD snapshot enrichment — completed 2026-04-08)