
feat: exclude HAMI virtualized GPUs from NVIDIA device plugin via Node annotation watch#1674

Open
Zhangxiurui520 wants to merge 1 commit into NVIDIA:main from Zhangxiurui520:zhangxr_runwithhami

Conversation

@Zhangxiurui520

Background

In clusters where HAMi virtualizes selected GPUs, the Node annotation hami.io/node-nvidia-register contains the GPU UUIDs managed by HAMi (virtualized GPUs).
The NVIDIA device plugin should avoid reporting these GPUs to kubelet, to prevent resource overlap and scheduling conflicts.

What this PR changes

  1. Adds a Node annotation watcher in [main.go]:
  • Uses the Kubernetes Watch API (not polling) to watch only the current Node (metadata.name=<NODE_NAME>).

  • Watches the annotation hami.io/node-nvidia-register.

  • Parses GPU UUIDs (supports both the GPU-... and hami-core:GPU-... token formats).

  • Triggers a plugin update when the annotation changes.

  2. Extends the plugin interface for dynamic filtering:
  • Adds [HandleAllowedDeviceIDs([]string)] in [api.go].

  • The device plugin implementation now handles runtime device-set updates.

  3. Enables dynamic re-reporting to kubelet:
  • In [server.go], adds an update signal channel.

  • ListAndWatch now re-sends the device list on filter updates, so kubelet sees changes without restarting the plugin.

  4. Adds filtering capability in the resource manager:
  • In [rm.go], stores both the full discovered device set ([allDevices]) and the currently exposed device set ([devices]).

  • HandleAllowedDeviceIDs excludes GPUs present in the HAMi annotation UUID list.

  • An empty UUID list restores the full device set.

  • Adds RW mutex protection for concurrent read/update access.

  5. Keeps resource manager constructors aligned:

The NVML and Tegra resource managers initialize both [allDevices] and [devices], so runtime filtering is safe and reversible.
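The annotation parsing described in step 1 can be sketched as a small helper. This is illustrative only: the function name (parseHamiUUIDs) and the assumption that tokens are comma-separated are mine, not the PR's; the real HAMi annotation encodes additional per-device fields. The sketch only shows acceptance of both the GPU-... and hami-core:GPU-... token forms mentioned above.

```go
package main

import "strings"

// parseHamiUUIDs extracts GPU UUIDs from the hami.io/node-nvidia-register
// annotation value. Hypothetical sketch: assumes comma-separated tokens and
// accepts both "GPU-<uuid>" and "hami-core:GPU-<uuid>" forms.
func parseHamiUUIDs(value string) []string {
	var uuids []string
	for _, tok := range strings.Split(value, ",") {
		tok = strings.TrimSpace(tok)
		// Strip an optional "hami-core:" prefix before matching.
		tok = strings.TrimPrefix(tok, "hami-core:")
		if strings.HasPrefix(tok, "GPU-") {
			uuids = append(uuids, tok)
		}
	}
	return uuids
}
```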

Behavior

On startup: the watcher performs an initial Node GET to sync state.
On Node annotation update: the plugin recalculates the exposed GPUs and updates kubelet via ListAndWatch.
On annotation cleanup/removal: the plugin restores the full discovered GPU set.
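The exclude/restore behavior above (steps 4–5) can be sketched as a small filtering state with RW-mutex protection. The type and method names here (deviceFilter, Exposed) are illustrative, not the PR's exact identifiers; only HandleAllowedDeviceIDs and the allDevices/devices split come from the PR description.

```go
package main

import "sync"

// deviceFilter sketches the filtering state the PR adds to the resource
// manager in [rm.go]: the full discovered set plus the currently exposed set.
type deviceFilter struct {
	mu         sync.RWMutex
	allDevices map[string]bool // full discovered set, keyed by GPU UUID
	devices    map[string]bool // currently exposed set
}

// HandleAllowedDeviceIDs hides the UUIDs claimed by HAMi. An empty
// exclude list restores the full discovered device set.
func (f *deviceFilter) HandleAllowedDeviceIDs(excluded []string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	hidden := make(map[string]bool, len(excluded))
	for _, id := range excluded {
		hidden[id] = true
	}
	f.devices = make(map[string]bool, len(f.allDevices))
	for id := range f.allDevices {
		if !hidden[id] {
			f.devices[id] = true
		}
	}
}

// Exposed returns a snapshot of the currently exposed UUIDs under a read lock.
func (f *deviceFilter) Exposed() []string {
	f.mu.RLock()
	defer f.mu.RUnlock()
	out := make([]string, 0, len(f.devices))
	for id := range f.devices {
		out = append(out, id)
	}
	return out
}
```

Because filtering is always recomputed from allDevices rather than by mutating devices in place, clearing the annotation trivially restores the full set.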

Why Watch instead of polling

Faster reaction to annotation changes.
Lower API overhead compared with periodic GET loops.
Cleaner event-driven behavior for production clusters.
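The event-driven update path (the signal channel from step 3) is commonly built as a buffered channel of capacity 1 with a non-blocking send, so that a burst of annotation events coalesces into one re-send instead of blocking the watcher goroutine. This is a sketch of that pattern under my assumptions, not the PR's exact code; notifyUpdate is a hypothetical name.

```go
package main

// notifyUpdate signals the ListAndWatch loop that the exposed device set
// changed. With a capacity-1 channel, a pending signal absorbs later ones,
// so the watcher goroutine never blocks on a slow consumer.
func notifyUpdate(updates chan struct{}) {
	select {
	case updates <- struct{}{}: // signal queued
	default: // a signal is already pending; coalesce
	}
}
```

On the receiving side, ListAndWatch would select on this channel alongside its stop channel and re-send the device list whenever a signal arrives.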

Compatibility / Notes

  • Backward compatible when HAMi annotation is absent: plugin behavior remains unchanged.

  • Requires running in-cluster (InClusterConfig) and RBAC permission to get/watch Nodes.

  • Recommended to inject NODE_NAME via the Downward API ([spec.nodeName]) for accurate node targeting.


…ect, supporting dynamic adjustment of GPU devices

Signed-off-by: zhangxr <944702164@qq.com>

copy-pr-bot Bot commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

