feat(gpu): implement topology hints for associated devices #15
Open
luomingmeng wants to merge 35 commits into JustinChengLZ:dev/device-affinity-strong-constraint from
…que=false; simplify test cases
Initialize inner plugins with cached responses from checkpoint to prevent data omission after restart. Modify checkpoint logic to persist both remote and inner plugin states. Add tests for cache initialization and checkpoint restoration.
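The restore path described in this commit can be sketched as below. This is a minimal illustration, not the code from the change: the `checkpoint` and `innerPlugin` types, the JSON layout, and both function names are assumptions made for the example.

```go
package main

import "encoding/json"

// checkpoint is a hypothetical on-disk layout that persists the last
// responses of both remote and inner plugins, keyed by plugin name.
type checkpoint struct {
	Remote map[string]string `json:"remote"`
	Inner  map[string]string `json:"inner"`
}

// innerPlugin is a hypothetical stand-in for an in-process reporter plugin
// that keeps its most recent response in memory.
type innerPlugin struct {
	name   string
	cached string
}

// saveCheckpoint serializes both plugin maps so neither is lost on restart.
func saveCheckpoint(cp checkpoint) ([]byte, error) {
	return json.Marshal(cp)
}

// restoreInnerPlugins builds inner plugins and seeds each one with the
// response found in the checkpoint. A plugin without a checkpoint entry
// starts with an empty cache.
func restoreInnerPlugins(data []byte, names []string) ([]*innerPlugin, error) {
	var cp checkpoint
	if err := json.Unmarshal(data, &cp); err != nil {
		return nil, err
	}
	plugins := make([]*innerPlugin, 0, len(names))
	for _, n := range names {
		// Reading from a nil map is safe in Go and yields the zero value.
		plugins = append(plugins, &innerPlugin{name: n, cached: cp.Inner[n]})
	}
	return plugins, nil
}
```

Seeding the cache at construction time means a plugin can serve its previous response immediately after a restart instead of reporting nothing until its first fresh collection.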
feat(*): add switch to support disabling some reporters
feat(*): add switch to support disabling cnr lifecycle
Breaching the memory.high limit does not trigger the OOM killer; it throttles the offending cgroup instead. A task does not hold a lock while trying to reclaim memory, so throttling does not lead to priority inversion or hung tasks. Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
feat(qrm): support set memory.high to throttle reclaimed_cores
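As a rough illustration of what setting `memory.high` involves on cgroup v2, the sketch below writes the limit into a cgroup directory. The function names and the choice to map non-positive values to `max` are assumptions for the example; the actual plugin code may resolve paths and values differently.

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
)

// formatMemoryHigh renders the value written to memory.high.
// Assumption for this sketch: a non-positive limit means "no throttling",
// which cgroup v2 expresses as the literal string "max".
func formatMemoryHigh(limitBytes int64) string {
	if limitBytes <= 0 {
		return "max"
	}
	return strconv.FormatInt(limitBytes, 10)
}

// setMemoryHigh writes the limit to <cgroupDir>/memory.high.
// cgroupDir is expected to be an existing cgroup v2 directory, for example
// the cgroup that holds reclaimed_cores workloads.
func setMemoryHigh(cgroupDir string, limitBytes int64) error {
	path := filepath.Join(cgroupDir, "memory.high")
	return os.WriteFile(path, []byte(formatMemoryHigh(limitBytes)), 0o644)
}
```

Usage would look like `setMemoryHigh(dir, 8<<30)`, where `dir` is the cgroup directory to throttle at 8 GiB.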
ccf7f48 to 4bbe7cb
…ity-strong-constraint feat(gpu): strict device affinity requirement and refactor
d86c498 to cffdbba
Add missing call to run device topology registry in base plugin's Run method and ensure it's started in StaticPolicy's Start method
enhance(qrm): mem bandwidth throttler qrm plugin supports multiple resctrl groups sharing same priority
Add check for active pods before counting failed updates to avoid including terminated pods in the statistics. Update error message to show active pod count separately from total count.
…lugin-manager feat(fetcher): add checkpoint support for inner reporter plugins
Add new condition constants and logic to track when no pods match a VPA's target workload. The condition is set when:
- TargetRef kind is unsupported
- Workload object cannot be found
- No pods match the workload selector

Also includes corresponding test cases to verify the new condition behavior.
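The three cases listed in this commit can be read as an ordered check. The sketch below is one hedged way to express it; the reason strings and the function name are hypothetical and are not the constants introduced by the commit.

```go
package main

// Hypothetical reason strings, used only in this sketch.
const (
	reasonUnsupportedKind  = "UnsupportedTargetKind"
	reasonWorkloadNotFound = "WorkloadNotFound"
	reasonNoPodsMatched    = "NoPodsMatched"
)

// noPodsMatchedReason reports whether the "no pods matched" condition
// should be set and, if so, which of the three cases applies. The checks
// run in order because a later case is only meaningful when the earlier
// ones pass: pods cannot be matched for a workload that was not found.
func noPodsMatchedReason(kindSupported, workloadFound bool, matchedPods int) (string, bool) {
	switch {
	case !kindSupported:
		return reasonUnsupportedKind, true
	case !workloadFound:
		return reasonWorkloadNotFound, true
	case matchedPods == 0:
		return reasonNoPodsMatched, true
	default:
		return "", false
	}
}
```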
feat(sriov): skip dynamic VF allocation by pod annotations
…lugin-run feat(gpu): start device topology registry in base plugin
…pply-condition-message fix(vpa-status): skip inactive pods when checking recommendation applied
…-sidecar feat: support setting of cpu burst only for main container
Revise memory headroom calculation to reduce OOM
cffdbba to e45cf23
Add logic to generate topology hints for GPU devices based on existing allocations, pre-allocated resources, and device topology. Include strategy framework integration for NVLink and other affinity requirements. Add comprehensive tests for various hint generation scenarios.
Only consider healthy GPUs when generating NUMA node topology hints to ensure proper device allocation. Update tests to include health status checks.
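To make the healthy-only rule concrete, here is a minimal sketch of NUMA hint generation that ignores unhealthy GPUs. It is modeled loosely on how topology hints are usually enumerated (every NUMA node combination that can satisfy the request, with the narrowest ones preferred). The `GPU` and `Hint` types and `generateHints` are assumptions for this example and do not reflect the PR's actual types or its strategy framework.

```go
package main

import "math/bits"

// GPU is a hypothetical device record for this sketch.
type GPU struct {
	ID      string
	NUMA    int
	Healthy bool
}

// Hint is a hypothetical topology hint: a bitmask of NUMA nodes plus a
// flag marking the narrowest masks that can satisfy the request.
type Hint struct {
	NUMAMask  uint64
	Preferred bool
}

// generateHints returns one hint per NUMA node combination whose healthy
// GPUs can cover the request. Unhealthy GPUs never count as capacity.
func generateHints(gpus []GPU, numaCount, request int) []Hint {
	if request <= 0 || numaCount <= 0 || numaCount > 63 {
		return nil
	}
	healthyPerNode := make([]int, numaCount)
	for _, g := range gpus {
		if g.Healthy && g.NUMA >= 0 && g.NUMA < numaCount {
			healthyPerNode[g.NUMA]++
		}
	}
	var hints []Hint
	minWidth := numaCount + 1
	for mask := uint64(1); mask < uint64(1)<<uint(numaCount); mask++ {
		available := 0
		for node := 0; node < numaCount; node++ {
			if mask&(uint64(1)<<uint(node)) != 0 {
				available += healthyPerNode[node]
			}
		}
		if available < request {
			continue
		}
		if w := bits.OnesCount64(mask); w < minWidth {
			minWidth = w
		}
		hints = append(hints, Hint{NUMAMask: mask})
	}
	// Only the masks spanning the fewest NUMA nodes are preferred.
	for i := range hints {
		hints[i].Preferred = bits.OnesCount64(hints[i].NUMAMask) == minWidth
	}
	return hints
}
```

With one healthy and one unhealthy GPU on node 0 and two healthy GPUs on node 1, a request for two GPUs yields no hint for node 0 alone, because the unhealthy device is not counted.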
e45cf23 to 2d2b3c0
What type of PR is this?
Enhancements
What this PR does / why we need it:
Add logic to generate topology hints for GPU devices based on existing allocations, pre-allocated resources, and device topology. Include strategy framework integration for NVLink and other affinity requirements. Add comprehensive tests for various hint generation scenarios.
Which issue(s) this PR fixes:
Special notes for your reviewer: