feat(recipe): add NFD as standalone shared component by ArangoGutierrez · Pull Request #518 · NVIDIA/aicr

ArangoGutierrez · 2026-04-09T17:36:17Z

Summary

Extract Node Feature Discovery from gpu-operator's sub-chart into a standalone shared component
Both gpu-operator and network-operator now depend on this shared NFD instance via dependencyRefs
Network-operator gains NIC labeling via NodeFeatureRule CR (deployNodeFeatureRules: true)
NFD becomes a base-level component — every recipe inherits it automatically

Changes

Area	Change
`recipes/registry.yaml`	Add `nfd` component with helm config, healthCheck, nodeScheduling; remove NFD paths from gpu-operator
`recipes/components/nfd/values.yaml`	New file — Master + Worker + GC enabled, TopologyUpdater off
`recipes/overlays/base.yaml`	Add `nfd` componentRef before cert-manager
`recipes/overlays/*.yaml` (12 files)	Add `nfd` to gpu-operator/network-operator `dependencyRefs`
`recipes/components/gpu-operator/values.yaml`	`nfd.enabled: false` (disable sub-chart)
`recipes/components/network-operator/values.yaml`	`nfd.deployNodeFeatureRules: true`
`recipes/overlays/kind.yaml`	Remove `nfd.enabled: true` override (no longer needed)
`recipes/checks/nfd/health-check.yaml`	Chainsaw health check: Master Deployment + Worker DaemonSet + pod phases

Design

Follows the existing cert-manager pattern — purely declarative, no Go code changes. NFD deploys to node-feature-discovery namespace. Worker DaemonSet runs on all nodes (excluded from accelerated nodeSelector) so it discovers hardware features everywhere.

Install order: nfd → cert-manager → gpu-operator / network-operator → nvsentinel, etc.

Verification

make test — all packages pass with race detector (72.5% coverage)
make lint-go — 0 issues
make build — builds successfully
Recipe resolution: NFD appears in all resolved recipes
aicr query confirms gpu-operator.values.nfd.enabled: false
aicr query confirms network-operator.values.nfd.deployNodeFeatureRules: true
gpu-operator and network-operator both have nfd in dependencyRefs

Extract Node Feature Discovery from gpu-operator's sub-chart into a standalone shared component. Both gpu-operator and network-operator depend on this shared NFD instance. Changes: - Add nfd component to registry.yaml with helm config and nodeScheduling - Create recipes/components/nfd/values.yaml (Master + Worker + GC) - Add nfd to base.yaml componentRefs (before cert-manager) - Add nfd to gpu-operator dependencyRefs in all overlays - Disable gpu-operator NFD sub-chart (nfd.enabled: false) - Enable network-operator NodeFeatureRule deployment - Add nfd to network-operator dependencyRefs (kind.yaml, aks.yaml) - Remove NFD nodeScheduling paths from gpu-operator registry entry - Remove nfd.enabled: true from kind.yaml gpu-operator overrides - Create Chainsaw health check for NFD (Master + Worker + pod phases) NFD deploys to node-feature-discovery namespace. Worker DaemonSet runs on all nodes (no accelerated nodeSelector) to discover hardware features everywhere. Master and GC run on system nodes. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Update Chainsaw test assertions to reflect NFD as a standalone component: - cuj1-training/assert-recipe.yaml: add nfd to componentRefs and deploymentOrder - cuj1-training/assert-bundle-scheduling.yaml: replace node-feature-discovery scheduling paths with nfd.enabled: false - ai-conformance/offline/assert-recipe.yaml: add nfd to componentRefs and deploymentOrder - ai-conformance/cluster/assert-gpu-operator.yaml: remove NFD master/worker/gc assertions (now in node-feature-discovery namespace, covered by standalone nfd health check) Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

mchmarny

Clean, declarative PR that follows the cert-manager shared-component pattern. No Go code changes needed — the registry + overlay system handles everything. Good separation of concerns.

Issues to address:

Missing overlays — rtx-pro-6000-lke-inference.yaml and rtx-pro-6000-lke-training.yaml re-declare gpu-operator's dependencyRefs without nfd. If overlays replace rather than merge, gpu-operator has no ordering dependency on nfd in those recipes.
Health check missing GC — The values enable gc.enable: true and the registry configures GC scheduling paths, but the health check only validates Master and Worker. The old gpu-operator assertions checked GC — this is a coverage regression.

Minor:

PR description says "nfd → cert-manager" install order, but the actual computed deployment order (visible in test assertions) places nfd after cert-manager. Not a bug (they're independent), but the description is misleading.

Looks good:

Worker excluded from accelerated nodeSelectorPaths so it runs everywhere — correct and well-commented
nfd.enabled: false in gpu-operator + deployNodeFeatureRules: true in network-operator is the right combination
kind overlay cleanup (removing nfd.enabled: true override) is correct
Old NFD assertions removed from assert-gpu-operator.yaml since they'd check the wrong namespace now
Registry entry is well-structured with proper nodeScheduling paths

mchmarny · 2026-04-14T15:33:22Z

      manifestFiles:
        - components/gpu-operator/manifests/dcgm-exporter.yaml
      dependencyRefs:
+        - nfd


Two overlays that re-declare gpu-operator's dependencyRefs were not updated to include nfd:

recipes/overlays/rtx-pro-6000-lke-inference.yaml — gpu-operator dependencyRefs: [cert-manager, kube-prometheus-stack]

recipes/overlays/rtx-pro-6000-lke-training.yaml — gpu-operator dependencyRefs: [cert-manager, kube-prometheus-stack]

If leaf overlays replace (not merge) the base dependencyRefs, gpu-operator would not depend on nfd in RTX PRO 6000 LKE recipes. NFD still deploys as a standalone component, but without the dependency edge there's no ordering guarantee — gpu-operator could start before nfd is ready.

mchmarny · 2026-04-14T15:33:23Z

+                name: node-feature-discovery-worker
+                namespace: node-feature-discovery
+              status:
+                (numberReady > `0`): true


The values.yaml enables the GC component (gc.enable: true) and the registry defines gc.nodeSelector/gc.tolerations paths, but this health check doesn't validate the GC Deployment. The old gpu-operator assertions (assert-gpu-operator.yaml) did check node-feature-discovery-gc.

Consider adding a step:

- name: validate-gc-deployment try: - assert: resource: apiVersion: apps/v1 kind: Deployment metadata: name: node-feature-discovery-gc namespace: node-feature-discovery status: (conditions[?type == 'Available']): - status: "True"

mchmarny · 2026-04-14T15:33:25Z

+        tolerationPaths:
+          - master.tolerations
+          - gc.tolerations
+      accelerated:


Good call excluding worker from accelerated.nodeSelectorPaths — workers need to run on every node for hardware discovery. The comment is clear and sufficient.

mchmarny · 2026-04-14T15:33:27Z

+# NFD deployed as standalone shared component — disable sub-chart
+nfd:
+  enabled: false


The old values had node-feature-discovery.fullnameOverride: node-feature-discovery under the sub-chart key. The new values use nfd.enabled: false — which is the correct top-level key for the gpu-operator chart's NFD toggle. Clean.

One thing to verify: does the gpu-operator chart version used in current recipes support nfd.enabled: false as the disable mechanism? Older gpu-operator chart versions used the node-feature-discovery sub-chart key directly. If any overlay pins an older gpu-operator chart version, the sub-chart might still deploy.

mchmarny · 2026-04-14T15:33:29Z

+# limitations under the License.
+
+# Node Feature Discovery standalone deployment
+# NFD v0.18.3 — shared by gpu-operator and network-operator


Version is documented as v0.18.3 in the comment and pinned in base.yaml. The values here are minimal and sensible defaults. Topology updater disabled by default with a note about overlay overrides is a good approach.

mchmarny · 2026-04-14T15:33:31Z


 nfd:
  enabled: false
+  deployNodeFeatureRules: true


This enables deployNodeFeatureRules while keeping nfd.enabled: false. That's the correct combination — network-operator uses the external standalone NFD to create NodeFeatureRule CRs for NIC labeling without deploying its own NFD instance.

mchmarny · 2026-04-14T15:33:32Z

                  - matchExpressions:
                      - key: node-role.kubernetes.io/control-plane


The kind overlay removed nfd.enabled: true from gpu-operator overrides (no longer needed since NFD is standalone). But it also needs nfd in network-operator's dependencyRefs at line 194 — which is included in this PR. Looks correct.

mchmarny · 2026-04-14T15:33:34Z

  - kgateway
  - kube-prometheus-stack
  - k8s-ephemeral-storage-metrics
+  - nfd


The PR description states the install order is "nfd → cert-manager → gpu-operator", but the deployment order here shows nfd after cert-manager and kube-prometheus-stack. This isn't a bug — nfd and cert-manager are independent in the dependency graph, so their relative order is determined by the topological sort tiebreaker. But updating the PR description to match the actual computed order would avoid confusion.

ArangoGutierrez requested a review from a team as a code owner April 9, 2026 17:36

github-actions bot added area/recipes size/M labels Apr 9, 2026

ArangoGutierrez requested a review from a team as a code owner April 9, 2026 17:48

github-actions bot added area/tests size/L and removed size/M labels Apr 9, 2026

Merge branch 'main' into feat/nfd-standalone-component

e3bd1a6

mchmarny assigned ArangoGutierrez Apr 10, 2026

Merge branch 'main' into feat/nfd-standalone-component

03e768f

mchmarny added this to the v0.12 milestone Apr 14, 2026

mchmarny requested changes Apr 14, 2026

View reviewed changes

Merge branch 'main' into feat/nfd-standalone-component

55eb6c4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(recipe): add NFD as standalone shared component#518

feat(recipe): add NFD as standalone shared component#518
ArangoGutierrez wants to merge 5 commits intoNVIDIA:mainfrom
ArangoGutierrez:feat/nfd-standalone-component

ArangoGutierrez commented Apr 9, 2026

Uh oh!

mchmarny left a comment

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

mchmarny Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		- matchExpressions:
		- key: node-role.kubernetes.io/control-plane

Conversation

ArangoGutierrez commented Apr 9, 2026

Summary

Changes

Design

Verification

Related

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants