feat(gpu): implement topology hints for associated devices by luomingmeng · Pull Request #15 · JustinChengLZ/katalyst-core

luomingmeng · 2026-04-17T08:52:43Z

What type of PR is this?

Enhancements

What this PR does / why we need it:

Add logic to generate topology hints for GPU devices based on existing allocations, pre-allocated resources, and device topology. Integrate the strategy framework to handle NVLink and other affinity requirements. Add comprehensive tests for various hint generation scenarios.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Initialize inner plugins with cached responses from checkpoint to prevent data omission after restart. Modify checkpoint logic to persist both remote and inner plugin states. Add tests for cache initialization and checkpoint restoration.

Breaching the memory.high limit does not trigger the OOM killer; it throttles the offending cgroup instead. A task does not hold a lock while trying to reclaim memory, so throttling does not lead to priority inversion or hung tasks.
Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
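To make the mechanism concrete, here is a minimal sketch of setting `memory.high` on a cgroup v2 directory. The helper names `formatMemoryHigh` and `setMemoryHigh`, and the convention that a non-positive limit means "no throttling", are assumptions for illustration and are not taken from the katalyst-core code; the only fact relied on is that cgroup v2 exposes a `memory.high` file that accepts a byte count or the literal `max`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// formatMemoryHigh renders a limit for cgroup v2's memory.high file.
// Assumption: a non-positive limit means "no throttling" and maps to "max".
func formatMemoryHigh(limitBytes int64) string {
	if limitBytes <= 0 {
		return "max"
	}
	return strconv.FormatInt(limitBytes, 10)
}

// setMemoryHigh (hypothetical helper) writes the limit to <cgroupDir>/memory.high.
func setMemoryHigh(cgroupDir string, limitBytes int64) error {
	path := filepath.Join(cgroupDir, "memory.high")
	return os.WriteFile(path, []byte(formatMemoryHigh(limitBytes)), 0o644)
}

func main() {
	// A temporary directory stands in for a real cgroup so the sketch runs anywhere.
	dir, err := os.MkdirTemp("", "fake-cgroup")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	if err := setMemoryHigh(dir, 8<<30); err != nil {
		panic(err)
	}
	data, err := os.ReadFile(filepath.Join(dir, "memory.high"))
	if err != nil {
		panic(err)
	}
	fmt.Println("memory.high =", string(data))
}
```

On a real node the directory would be the cgroup of the throttled workload (here, presumably the one backing reclaimed_cores), and the write would require appropriate privileges.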

Add new condition constants and logic to track when no pods match a VPA's target workload. The condition is set when:
- TargetRef kind is unsupported
- Workload object cannot be found
- No pods match the workload selector

Also includes corresponding test cases to verify the new condition behavior.
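The three triggers listed above can be sketched as a single decision function. All identifiers here (`targetState`, `noPodsMatchedReason`, and the reason strings) are hypothetical and chosen for illustration; the actual constant names and types in the commit are not shown on this page.

```go
package main

import "fmt"

// Hypothetical reason strings; the commit's real constants may differ.
const (
	reasonUnsupportedKind  = "UnsupportedTargetKind"
	reasonWorkloadNotFound = "WorkloadNotFound"
	reasonNoMatchingPods   = "NoMatchingPods"
)

// targetState summarizes what was learned while resolving a VPA's TargetRef.
type targetState struct {
	kindSupported bool
	workloadFound bool
	matchingPods  int
}

// noPodsMatchedReason reports whether the "no pods matched" condition
// should be set, and if so, why. Checks run in resolution order.
func noPodsMatchedReason(s targetState) (string, bool) {
	switch {
	case !s.kindSupported:
		return reasonUnsupportedKind, true
	case !s.workloadFound:
		return reasonWorkloadNotFound, true
	case s.matchingPods == 0:
		return reasonNoMatchingPods, true
	default:
		return "", false
	}
}

func main() {
	reason, set := noPodsMatchedReason(targetState{kindSupported: true, workloadFound: true})
	fmt.Println(set, reason)
}
```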

xdc0527 and others added 19 commits March 27, 2026 15:51

add new test cases

62e51d6

support same priority

bd5327b

refactor

016317a

refactor advisor

9d140af

add more tests

ee97bd0

refactor code

072b2c8

add missing file

658818b

args --mb-extra-group-priorities=machine=9000 --mb-group-priority-uni…

5b51e32

…que=false; simplify test cases

feat(*): add switch to support disabling some reporters

37b712e

feat(*): add switch to support disabling cnr lifecycle

097c148

minor refactor: functions naming combined groups

1267850

fix(*): do not return error when reporter of gvk not found

bbec2f6

feat: support setting of cpu burst only for main container

eae0f32

Merge pull request kubewharf#1114 from fantastic-hf/dev/reporters

7320721

feat(*): add switch to support disabling some reporters

Merge pull request kubewharf#1115 from fantastic-hf/dev/cnrlifecycle

f4d9f38

feat(*): add switch to support disabling cnr lifecycle

refactor: policy uses the advisor able to process groups of identical…

b31c94e

… priority

Merge pull request kubewharf#1125 from cheney-lin/dev/mem_guard

6546378

feat(qrm): support set memory.high to throttle reclaimed_cores

luomingmeng force-pushed the dev/device-affinity-strong-constraint branch 2 times, most recently from ccf7f48 to 4bbe7cb, April 17, 2026 15:26

Merge pull request kubewharf#1121 from JustinChengLZ/dev/device-affin…

927cc35

…ity-strong-constraint feat(gpu): strict device affinity requirement and refactor

luomingmeng force-pushed the dev/gpu-plugin-support-get-associated-device-topology-hints branch from d86c498 to cffdbba, April 20, 2026 11:48

luomingmeng and others added 7 commits April 21, 2026 12:03

feat(gpu): start device topology registry in base plugin

abcb4f9

Add missing call to run device topology registry in base plugin's Run method and ensure it's started in StaticPolicy's Start method

Merge pull request kubewharf#1111 from h-w-chen/dev/mbm-polify-db-uce

c4e3a35

enhance(qrm): mem bandwidth throttler qrm plugin supports multiple resctrl groups sharing same priority

feat(sriov): skip dynamic VF allocation by pod annotations

1f061d6

fix(vpa-status): skip inactive pods when checking recommendation applied

7df6384

Add check for active pods before counting failed updates to avoid including terminated pods in the statistics. Update error message to show active pod count separately from total count.
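A rough sketch of the counting change described in this commit follows. The `podInfo` type, the definition of "active" (not terminating and not in a terminal phase), and the message wording are assumptions made for illustration, not the code from the commit.

```go
package main

import "fmt"

// podInfo is a hypothetical, trimmed-down view of a pod.
type podInfo struct {
	name        string
	phase       string // "Pending", "Running", "Succeeded" or "Failed"
	terminating bool   // true once a deletion timestamp is set
	updated     bool   // true if the recommendation was applied
}

// isActive treats terminating pods and pods in a terminal phase as inactive.
func isActive(p podInfo) bool {
	return !p.terminating && p.phase != "Succeeded" && p.phase != "Failed"
}

// summarizeUpdates counts active pods and, among them, failed updates.
func summarizeUpdates(pods []podInfo) (active, failed int) {
	for _, p := range pods {
		if !isActive(p) {
			continue // inactive pods never count as failed updates
		}
		active++
		if !p.updated {
			failed++
		}
	}
	return active, failed
}

// failureMessage reports the active count separately from the total.
func failureMessage(pods []podInfo) string {
	active, failed := summarizeUpdates(pods)
	if failed == 0 {
		return ""
	}
	return fmt.Sprintf("failed to update %d of %d active pods (%d pods in total)",
		failed, active, len(pods))
}

func main() {
	pods := []podInfo{
		{name: "a", phase: "Running", updated: true},
		{name: "b", phase: "Running"},
		{name: "c", phase: "Succeeded"},
	}
	fmt.Println(failureMessage(pods))
}
```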

Merge pull request kubewharf#1116 from luomingmeng/dev/fix-reporter-p…

710c9be

…lugin-manager feat(fetcher): add checkpoint support for inner reporter plugins

Merge pull request kubewharf#1112 from junyu-peng/dev/sriov

7e636b0

feat(sriov): skip dynamic VF allocation by pod annotations

xu282934741 and others added 6 commits April 23, 2026 11:44

Merge pull request kubewharf#1108 from luomingmeng/dev/fix-gpu-base-p…

eef39fd

…lugin-run feat(gpu): start device topology registry in base plugin

Merge pull request kubewharf#1129 from luomingmeng/dev/refactor-vpa-a…

8885128

…pply-condition-message fix(vpa-status): skip inactive pods when checking recommendation applied

Merge pull request kubewharf#1123 from JustinChengLZ/dev/cpu-burst-no…

217669d

…-sidecar feat: support setting of cpu burst only for main container

feat(qosaware): revise memory headroom policy to consider actual memo…

78a50ee

…ry usage

feat(qosaware): add request based ratio in crd

5593598

Merge pull request kubewharf#1128 from jinxin32/memory_headroom

80d9615

Revise memory headroom calculation to reduce OOM

luomingmeng force-pushed the dev/gpu-plugin-support-get-associated-device-topology-hints branch from cffdbba to e45cf23, April 23, 2026 11:52

luomingmeng added 2 commits April 23, 2026 19:53

fix(gpu): filter out unhealthy GPUs when generating topology hints

2d2b3c0

Only consider healthy GPUs when generating NUMA node topology hints to ensure proper device allocation. Update tests to include health status checks.
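As an illustration of the filtering idea, the sketch below keeps only healthy, unallocated GPUs when deciding which NUMA nodes can satisfy a request. The `gpuDevice` type and `candidateNUMANodes` function are hypothetical and cover single-NUMA hints only; the plugin's real hint generation (multi-NUMA masks, NVLink affinity strategies) is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuDevice is a hypothetical, simplified device record.
type gpuDevice struct {
	id        string
	numaNode  int
	healthy   bool
	allocated bool
}

// candidateNUMANodes returns, in ascending order, the NUMA nodes that hold
// at least `request` healthy and unallocated GPUs.
func candidateNUMANodes(devices []gpuDevice, request int) []int {
	free := map[int]int{}
	for _, d := range devices {
		if !d.healthy || d.allocated {
			continue // unhealthy or already-allocated GPUs never back a hint
		}
		free[d.numaNode]++
	}
	nodes := make([]int, 0, len(free))
	for node, n := range free {
		if n >= request {
			nodes = append(nodes, node)
		}
	}
	sort.Ints(nodes) // map iteration order is random, so sort for stable output
	return nodes
}

func main() {
	devices := []gpuDevice{
		{id: "gpu-0", numaNode: 0, healthy: true},
		{id: "gpu-1", numaNode: 0, healthy: false},
		{id: "gpu-2", numaNode: 1, healthy: true},
		{id: "gpu-3", numaNode: 1, healthy: true},
	}
	fmt.Println(candidateNUMANodes(devices, 2))
}
```

With the sample devices, NUMA node 0 has only one usable GPU because `gpu-1` is unhealthy, so a two-GPU request can only be hinted to node 1.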

luomingmeng force-pushed the dev/gpu-plugin-support-get-associated-device-topology-hints branch from e45cf23 to 2d2b3c0, April 23, 2026 11:53

feat(gpu): implement topology hints for associated devices #15
luomingmeng wants to merge 35 commits into JustinChengLZ:dev/device-affinity-strong-constraint from luomingmeng:dev/gpu-plugin-support-get-associated-device-topology-hints

9 participants
