feat(gpu): implement topology hints for associated devices #15

Open
luomingmeng wants to merge 35 commits into JustinChengLZ:dev/device-affinity-strong-constraint from luomingmeng:dev/gpu-plugin-support-get-associated-device-topology-hints

Conversation

@luomingmeng

What type of PR is this?

Enhancements

What this PR does / why we need it:

Add logic to generate topology hints for GPU devices based on existing allocations, pre-allocated resources, and device topology. Integrate the strategy framework for NVLink and other affinity requirements. Add comprehensive tests for various hint generation scenarios.

Which issue(s) this PR fixes:

Special notes for your reviewer:

xdc0527 and others added 19 commits March 27, 2026 15:51
Initialize inner plugins with cached responses from checkpoint to prevent data omission after restart. Modify checkpoint logic to persist both remote and inner plugin states. Add tests for cache initialization and checkpoint restoration.
feat(*): add switch to support disabling some reporters
feat(*): add switch to support disabling cnr lifecycle
Breaching the memory.high limit does not trigger the OOM killer; it
throttles the offending cgroup instead. A task does not hold a lock while
trying to reclaim memory, so this does not lead to priority inversion or
hung tasks.

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
feat(qrm): support set memory.high to throttle reclaimed_cores
@luomingmeng luomingmeng force-pushed the dev/device-affinity-strong-constraint branch 2 times, most recently from ccf7f48 to 4bbe7cb Compare April 17, 2026 15:26
…ity-strong-constraint

feat(gpu): strict device affinity requirement and refactor
@luomingmeng luomingmeng force-pushed the dev/gpu-plugin-support-get-associated-device-topology-hints branch from d86c498 to cffdbba Compare April 20, 2026 11:48
luomingmeng and others added 7 commits April 21, 2026 12:03
Add missing call to run device topology registry in base plugin's Run method and ensure it's started in StaticPolicy's Start method
enhance(qrm): mem bandwidth throttler qrm plugin supports multiple resctrl groups sharing same priority
Add check for active pods before counting failed updates to avoid including terminated pods in the statistics. Update error message to show active pod count separately from total count.
…lugin-manager

feat(fetcher): add checkpoint support for inner reporter plugins
Add new condition constants and logic to track when no pods match a VPA's target workload. The condition is set when:
- TargetRef kind is unsupported
- Workload object cannot be found
- No pods match the workload selector
Also includes corresponding test cases to verify the new condition behavior.
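
The three cases listed in this commit message can be sketched as a single decision function. Everything named here is an assumption made for illustration: the reason strings, the `supportedKinds` set, and the function `noPodsMatchedReason` are hypothetical and do not reflect the constants actually added by the commit.

```go
package main

// Hypothetical reason strings; the constants in the commit may differ.
const (
	reasonUnsupportedKind  = "UnsupportedTargetRefKind"
	reasonWorkloadNotFound = "WorkloadNotFound"
	reasonNoMatchingPods   = "NoPodsMatchSelector"
)

// supportedKinds is an assumed set of TargetRef kinds for this sketch.
var supportedKinds = map[string]bool{
	"Deployment":  true,
	"StatefulSet": true,
}

// noPodsMatchedReason reports whether the "no pods matched" condition
// should be set and, if so, which of the three cases triggered it.
// The checks run in the order the commit message lists them.
func noPodsMatchedReason(kind string, workloadFound bool, matchedPods int) (string, bool) {
	switch {
	case !supportedKinds[kind]:
		return reasonUnsupportedKind, true
	case !workloadFound:
		return reasonWorkloadNotFound, true
	case matchedPods == 0:
		return reasonNoMatchingPods, true
	default:
		return "", false
	}
}
```

Returning the reason alongside the boolean keeps the condition message specific, so a reader of the VPA status can tell an unsupported kind apart from an empty selector match.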
feat(sriov): skip dynamic VF allocation by pod annotations
xu282934741 and others added 6 commits April 23, 2026 11:44
…lugin-run

feat(gpu): start device topology registry in base plugin
…pply-condition-message

fix(vpa-status): skip inactive pods when checking recommendation applied
…-sidecar

feat: support setting of cpu burst only for main container
Revise memory headroom calculation to reduce OOM
@luomingmeng luomingmeng force-pushed the dev/gpu-plugin-support-get-associated-device-topology-hints branch from cffdbba to e45cf23 Compare April 23, 2026 11:52
Add logic to generate topology hints for GPU devices based on existing allocations, pre-allocated resources, and device topology.
Include strategy framework integration for NVLink and other affinity requirements.
Add comprehensive tests for various hint generation scenarios.
Only consider healthy GPUs when generating NUMA node topology hints to ensure proper device allocation. Update tests to include health status checks.
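
To illustrate the idea of counting only healthy GPUs when producing NUMA hints, here is a minimal sketch. It is not the code in this PR: the `Device` and `Hint` types and the `generateHints` function are hypothetical, and the enumeration of NUMA node subsets with the smallest satisfying subset marked as preferred is an assumption borrowed from how topology hints are commonly generated, not something the commit messages confirm.

```go
package main

import "sort"

// Device is a hypothetical, simplified view of a GPU.
type Device struct {
	ID        string
	NUMANode  int
	Healthy   bool
	Allocated bool // already held by an existing allocation
}

// Hint names a set of NUMA nodes that can satisfy a request.
type Hint struct {
	Nodes     []int
	Preferred bool
}

// generateHints returns every NUMA node subset whose healthy,
// unallocated GPUs cover the request. Subsets with the fewest
// nodes are marked Preferred and sorted first.
func generateHints(devices []Device, numaCount, request int) []Hint {
	// Count usable devices per NUMA node, skipping unhealthy ones.
	usable := make([]int, numaCount)
	for _, d := range devices {
		if d.Healthy && !d.Allocated && d.NUMANode >= 0 && d.NUMANode < numaCount {
			usable[d.NUMANode]++
		}
	}

	var hints []Hint
	minNodes := numaCount + 1
	// Each bit of mask selects one NUMA node.
	for mask := uint(1); mask < uint(1)<<uint(numaCount); mask++ {
		total := 0
		var nodes []int
		for n := 0; n < numaCount; n++ {
			if mask&(uint(1)<<uint(n)) != 0 {
				total += usable[n]
				nodes = append(nodes, n)
			}
		}
		if total < request {
			continue
		}
		hints = append(hints, Hint{Nodes: nodes})
		if len(nodes) < minNodes {
			minNodes = len(nodes)
		}
	}

	for i := range hints {
		hints[i].Preferred = len(hints[i].Nodes) == minNodes
	}
	sort.SliceStable(hints, func(i, j int) bool {
		return len(hints[i].Nodes) < len(hints[j].Nodes)
	})
	return hints
}
```

With two GPUs on NUMA node 0 (one of them unhealthy) and two healthy GPUs on node 1, a request for two GPUs yields node 1 alone as the preferred hint; node 0 alone is excluded because only one of its GPUs counts.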
@luomingmeng luomingmeng force-pushed the dev/gpu-plugin-support-get-associated-device-topology-hints branch from e45cf23 to 2d2b3c0 Compare April 23, 2026 11:53

9 participants