Skip to content

feat(recipe): add NFD as standalone shared component#518

Open
ArangoGutierrez wants to merge 5 commits intoNVIDIA:mainfrom
ArangoGutierrez:feat/nfd-standalone-component
Open

feat(recipe): add NFD as standalone shared component#518
ArangoGutierrez wants to merge 5 commits intoNVIDIA:mainfrom
ArangoGutierrez:feat/nfd-standalone-component

Conversation

@ArangoGutierrez
Copy link
Copy Markdown
Contributor

Summary

  • Extract Node Feature Discovery from gpu-operator's sub-chart into a standalone shared component
  • Both gpu-operator and network-operator now depend on this shared NFD instance via dependencyRefs
  • Network-operator gains NIC labeling via NodeFeatureRule CR (deployNodeFeatureRules: true)
  • NFD becomes a base-level component — every recipe inherits it automatically

Changes

Area Change
recipes/registry.yaml Add nfd component with helm config, healthCheck, nodeScheduling; remove NFD paths from gpu-operator
recipes/components/nfd/values.yaml New file — Master + Worker + GC enabled, TopologyUpdater off
recipes/overlays/base.yaml Add nfd componentRef before cert-manager
recipes/overlays/*.yaml (12 files) Add nfd to gpu-operator/network-operator dependencyRefs
recipes/components/gpu-operator/values.yaml nfd.enabled: false (disable sub-chart)
recipes/components/network-operator/values.yaml nfd.deployNodeFeatureRules: true
recipes/overlays/kind.yaml Remove nfd.enabled: true override (no longer needed)
recipes/checks/nfd/health-check.yaml Chainsaw health check: Master Deployment + Worker DaemonSet + pod phases

Design

Follows the existing cert-manager pattern — purely declarative, no Go code changes. NFD deploys to node-feature-discovery namespace. Worker DaemonSet runs on all nodes (excluded from accelerated nodeSelector) so it discovers hardware features everywhere.

Install order: nfd → cert-manager → gpu-operator / network-operator → nvsentinel, etc.

Verification

  • make test — all packages pass with race detector (72.5% coverage)
  • make lint-go — 0 issues
  • make build — builds successfully
  • Recipe resolution: NFD appears in all resolved recipes
  • aicr query confirms gpu-operator.values.nfd.enabled: false
  • aicr query confirms network-operator.values.nfd.deployNodeFeatureRules: true
  • gpu-operator and network-operator both have nfd in dependencyRefs

Related

Track B of NFD Adoption (Track A — NFD snapshot enrichment — completed 2026-04-08)

Extract Node Feature Discovery from gpu-operator's sub-chart into a
standalone shared component. Both gpu-operator and network-operator
depend on this shared NFD instance.

Changes:
- Add nfd component to registry.yaml with helm config and nodeScheduling
- Create recipes/components/nfd/values.yaml (Master + Worker + GC)
- Add nfd to base.yaml componentRefs (before cert-manager)
- Add nfd to gpu-operator dependencyRefs in all overlays
- Disable gpu-operator NFD sub-chart (nfd.enabled: false)
- Enable network-operator NodeFeatureRule deployment
- Add nfd to network-operator dependencyRefs (kind.yaml, aks.yaml)
- Remove NFD nodeScheduling paths from gpu-operator registry entry
- Remove nfd.enabled: true from kind.yaml gpu-operator overrides
- Create Chainsaw health check for NFD (Master + Worker + pod phases)

NFD deploys to node-feature-discovery namespace. Worker DaemonSet runs
on all nodes (no accelerated nodeSelector) to discover hardware features
everywhere. Master and GC run on system nodes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Update Chainsaw test assertions to reflect NFD as a standalone
component:

- cuj1-training/assert-recipe.yaml: add nfd to componentRefs and
  deploymentOrder
- cuj1-training/assert-bundle-scheduling.yaml: replace
  node-feature-discovery scheduling paths with nfd.enabled: false
- ai-conformance/offline/assert-recipe.yaml: add nfd to componentRefs
  and deploymentOrder
- ai-conformance/cluster/assert-gpu-operator.yaml: remove NFD
  master/worker/gc assertions (now in node-feature-discovery namespace,
  covered by standalone nfd health check)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez requested a review from a team as a code owner April 9, 2026 17:48
@mchmarny mchmarny added this to the v0.12 milestone Apr 14, 2026
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, declarative PR that follows the cert-manager shared-component pattern. No Go code changes needed — the registry + overlay system handles everything. Good separation of concerns.

Issues to address:

  1. Missing overlaysrtx-pro-6000-lke-inference.yaml and rtx-pro-6000-lke-training.yaml re-declare gpu-operator's dependencyRefs without nfd. If overlays replace rather than merge, gpu-operator has no ordering dependency on nfd in those recipes.

  2. Health check missing GC — The values enable gc.enable: true and the registry configures GC scheduling paths, but the health check only validates Master and Worker. The old gpu-operator assertions checked GC — this is a coverage regression.

Minor:

  • PR description says "nfd → cert-manager" install order, but the actual computed deployment order (visible in test assertions) places nfd after cert-manager. Not a bug (they're independent), but the description is misleading.

Looks good:

  • Worker excluded from accelerated nodeSelectorPaths so it runs everywhere — correct and well-commented
  • nfd.enabled: false in gpu-operator + deployNodeFeatureRules: true in network-operator is the right combination
  • kind overlay cleanup (removing nfd.enabled: true override) is correct
  • Old NFD assertions removed from assert-gpu-operator.yaml since they'd check the wrong namespace now
  • Registry entry is well-structured with proper nodeScheduling paths

manifestFiles:
- components/gpu-operator/manifests/dcgm-exporter.yaml
dependencyRefs:
- nfd
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two overlays that re-declare gpu-operator's dependencyRefs were not updated to include nfd:

  • recipes/overlays/rtx-pro-6000-lke-inference.yaml — gpu-operator dependencyRefs: [cert-manager, kube-prometheus-stack]
  • recipes/overlays/rtx-pro-6000-lke-training.yaml — gpu-operator dependencyRefs: [cert-manager, kube-prometheus-stack]

If leaf overlays replace (not merge) the base dependencyRefs, gpu-operator would not depend on nfd in RTX PRO 6000 LKE recipes. NFD still deploys as a standalone component, but without the dependency edge there's no ordering guarantee — gpu-operator could start before nfd is ready.

name: node-feature-discovery-worker
namespace: node-feature-discovery
status:
(numberReady > `0`): true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values.yaml enables the GC component (gc.enable: true) and the registry defines gc.nodeSelector/gc.tolerations paths, but this health check doesn't validate the GC Deployment. The old gpu-operator assertions (assert-gpu-operator.yaml) did check node-feature-discovery-gc.

Consider adding a step:

- name: validate-gc-deployment
  try:
    - assert:
        resource:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: node-feature-discovery-gc
            namespace: node-feature-discovery
          status:
            (conditions[?type == 'Available']):
              - status: "True"

Comment thread recipes/registry.yaml
Comment on lines +57 to +60
tolerationPaths:
- master.tolerations
- gc.tolerations
accelerated:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call excluding worker from accelerated.nodeSelectorPaths — workers need to run on every node for hardware discovery. The comment is clear and sufficient.

Comment on lines +85 to +87
# NFD deployed as standalone shared component — disable sub-chart
nfd:
enabled: false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old values had node-feature-discovery.fullnameOverride: node-feature-discovery under the sub-chart key. The new values use nfd.enabled: false — which is the correct top-level key for the gpu-operator chart's NFD toggle. Clean.

One thing to verify: does the gpu-operator chart version used in current recipes support nfd.enabled: false as the disable mechanism? Older gpu-operator chart versions used the node-feature-discovery sub-chart key directly. If any overlay pins an older gpu-operator chart version, the sub-chart might still deploy.

# limitations under the License.

# Node Feature Discovery standalone deployment
# NFD v0.18.3 — shared by gpu-operator and network-operator
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version is documented as v0.18.3 in the comment and pinned in base.yaml. The values here are minimal and sensible defaults. Topology updater disabled by default with a note about overlay overrides is a good approach.


nfd:
enabled: false
deployNodeFeatureRules: true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enables deployNodeFeatureRules while keeping nfd.enabled: false. That's the correct combination — network-operator uses the external standalone NFD to create NodeFeatureRule CRs for NIC labeling without deploying its own NFD instance.

Comment on lines 72 to 73
- matchExpressions:
- key: node-role.kubernetes.io/control-plane
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kind overlay removed nfd.enabled: true from gpu-operator overrides (no longer needed since NFD is standalone). But it also needs nfd in network-operator's dependencyRefs at line 194 — which is included in this PR. Looks correct.

- kgateway
- kube-prometheus-stack
- k8s-ephemeral-storage-metrics
- nfd
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states the install order is "nfd → cert-manager → gpu-operator", but the deployment order here shows nfd after cert-manager and kube-prometheus-stack. This isn't a bug — nfd and cert-manager are independent in the dependency graph, so their relative order is determined by the topological sort tiebreaker. But updating the PR description to match the actual computed order would avoid confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants