feat: exclude HAMI virtualized GPUs from NVIDIA device plugin via Node annotation watch#1674
Open
Zhangxiurui520 wants to merge 1 commit into NVIDIA:main from
…ect, supporting dynamic adjustment of GPU devices Signed-off-by: zhangxr <944702164@qq.com>
Background
In clusters where HAMi virtualizes selected GPUs, the Node annotation `hami.io/node-nvidia-register` contains the GPU UUIDs managed by HAMi (virtualized GPUs).
The NVIDIA device plugin should avoid reporting these GPUs to kubelet to prevent resource overlap and scheduling conflicts.
What this PR changes
Uses the Kubernetes Watch API (not polling) to watch only the current Node (field selector `metadata.name=<NODE_NAME>`).
Watches the annotation `hami.io/node-nvidia-register`.
Parses GPU UUIDs (supports both the `GPU-...` and `hami-core:GPU-...` token formats).
Triggers a plugin update when the annotation changes.
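The UUID-parsing step can be sketched as a small helper. The exact annotation encoding is HAMi-version-dependent, so the separator set below is an assumption; the function simply collects every `GPU-...` token, which also covers the `hami-core:GPU-...` form:

```go
package main

import (
	"fmt"
	"strings"
)

// parseHAMiUUIDs extracts GPU UUIDs from the hami.io/node-nvidia-register
// annotation value. Assumption: records and fields are delimited by
// commas, colons, semicolons, or spaces; any token starting with "GPU-"
// is a UUID. Splitting on ':' also handles "hami-core:GPU-..." tokens.
func parseHAMiUUIDs(value string) []string {
	var uuids []string
	fields := strings.FieldsFunc(value, func(r rune) bool {
		return r == ',' || r == ':' || r == ';' || r == ' '
	})
	for _, f := range fields {
		if strings.HasPrefix(f, "GPU-") {
			uuids = append(uuids, f)
		}
	}
	return uuids
}

func main() {
	fmt.Println(parseHAMiUUIDs("GPU-1234,10,true,hami-core:GPU-5678"))
	// [GPU-1234 GPU-5678]
}
```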
Adds `HandleAllowedDeviceIDs([]string)` in `api.go`.
Device plugin implementation now handles runtime device-set updates.
In `server.go`, adds an update-signal channel.
ListAndWatch now re-sends device list on filter updates, so kubelet sees changes without restarting plugin.
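The update-signal pattern in `server.go` can be sketched with plain channels; `plugin`, `updates`, and `listAndWatch` are illustrative names, not the plugin's real gRPC types:

```go
package main

import "fmt"

// plugin models the server-side state: the HAMi watcher pushes a fresh
// device list into updates, and listAndWatch re-sends it to kubelet.
type plugin struct {
	updates chan []string // new device lists from the annotation watcher
	stop    chan struct{} // closed on shutdown
}

// listAndWatch mimics the kubelet stream: send the initial list, then
// re-send on every filter update, so kubelet sees changes without a
// plugin restart.
func (p *plugin) listAndWatch(initial []string, send func([]string)) {
	send(initial)
	for {
		select {
		case devs := <-p.updates:
			send(devs)
		case <-p.stop:
			// Drain any update that raced with shutdown so no list is lost.
			for {
				select {
				case devs := <-p.updates:
					send(devs)
				default:
					return
				}
			}
		}
	}
}

func main() {
	p := &plugin{updates: make(chan []string, 1), stop: make(chan struct{})}
	p.updates <- []string{"GPU-a"} // simulate: watcher filtered out GPU-b
	close(p.stop)
	var got [][]string
	p.listAndWatch([]string{"GPU-a", "GPU-b"}, func(d []string) { got = append(got, d) })
	fmt.Println(got) // [[GPU-a GPU-b] [GPU-a]]
}
```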
In `rm.go`, stores both:
the full discovered device set (`allDevices`)
the currently exposed device set (`devices`)
HandleAllowedDeviceIDs excludes GPUs present in HAMi annotation UUID list.
Empty UUID list restores full device set.
Adds RW mutex protection for concurrent read/update access.
NVML/Tegra resource managers initialize both `allDevices` and `devices` so runtime filtering is safe and reversible.
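The `rm.go` changes can be sketched as follows; the real device types are simplified to strings, and the argument to `HandleAllowedDeviceIDs` is treated as the HAMi-managed (excluded) UUID list, per the description above:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// resourceManager keeps the full discovered set (allDevices) and the
// currently exposed set (devices), guarded by a RWMutex for concurrent
// read/update access.
type resourceManager struct {
	mu         sync.RWMutex
	allDevices map[string]bool // every GPU found at startup, never mutated
	devices    map[string]bool // what ListAndWatch currently advertises
}

// HandleAllowedDeviceIDs removes the HAMi-managed UUIDs from the exposed
// set. An empty list restores the full discovered set, making the filter
// reversible.
func (r *resourceManager) HandleAllowedDeviceIDs(excluded []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	skip := make(map[string]bool, len(excluded))
	for _, id := range excluded {
		skip[id] = true
	}
	r.devices = make(map[string]bool, len(r.allDevices))
	for id := range r.allDevices {
		if !skip[id] {
			r.devices[id] = true
		}
	}
}

// Devices returns the currently exposed set, sorted, under a read lock.
func (r *resourceManager) Devices() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.devices))
	for id := range r.devices {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

func main() {
	rm := &resourceManager{allDevices: map[string]bool{"GPU-a": true, "GPU-b": true}}
	rm.HandleAllowedDeviceIDs([]string{"GPU-b"}) // GPU-b is HAMi-managed
	fmt.Println(rm.Devices())                    // [GPU-a]
	rm.HandleAllowedDeviceIDs(nil)               // annotation removed
	fmt.Println(rm.Devices())                    // [GPU-a GPU-b]
}
```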
Behavior
On startup: the watcher performs an initial Node GET to sync state.
On Node annotation update: plugin recalculates exposed GPUs and updates kubelet via ListAndWatch.
On annotation cleanup/removal: plugin restores full discovered GPU set.
Why Watch instead of polling
Faster reaction to annotation changes.
Lower API overhead compared with periodic GET loops.
Cleaner event-driven behavior for production clusters.
Compatibility / Notes
Backward compatible when HAMi annotation is absent: plugin behavior remains unchanged.
Requires running in-cluster (`InClusterConfig`) and RBAC permission to read/watch Nodes.
Recommended to inject `NODE_NAME` via the Downward API (`spec.nodeName`) for accurate node targeting.
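The RBAC and `NODE_NAME` wiring might look like the fragment below; the ClusterRole name is illustrative, and the env snippet belongs in the plugin DaemonSet's container spec:

```yaml
# Container env: inject the node name via the Downward API.
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
---
# ClusterRole granting read/watch on Nodes (name is illustrative).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nvidia-device-plugin-node-watcher
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "watch"]
```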