Bug Description
Follow-up to #750 (vendor-neutral DRA) and #753 (CEL guard fix), surfaced during review of #753.
Two related DRA correctness gaps remain, both pre-existing and out of scope for #753:
1. checkAcceleratorAvailability is not DRA-aware (functional).
A Model can now request a GPU via hardware.gpu.resourceClaims (DRA) instead of hardware.gpu.resourceName (device plugin). But checkAcceleratorAvailability (internal/controller/model_controller.go:757) only resolves availability from the resourceName override (lines 772-773) or the vendor/accelerator default extended-resource name (lines 777-785). There is no resourceClaims branch. A DRA-only Model (no resourceName) therefore falls through to the vendor-default extended-resource check (for example amd -> amd.com/gpu), which is not the signal a DRA-scheduled GPU advertises. Result: Status.AcceleratorReady is inaccurate for DRA-only Models.
2. resourceName field doc lists DRA as a use case (doc cleanup).
The ResourceName field doc comment in api/v1alpha1/model_types.go still lists "DRA-backed resources" as a use case for resourceName. Now that resourceName and resourceClaims are mutually exclusive (enforced by the CEL rule fixed in #753), that mention is misleading. DRA workflows should use resourceClaims.
Steps to Reproduce
- Deploy a Model with hardware.gpu.resourceClaims set and no hardware.gpu.resourceName.
- Observe Status.AcceleratorReady.
- It reflects vendor-default extended-resource availability rather than DRA claim/device-class readiness.
Expected Behavior
checkAcceleratorAvailability accounts for the resourceClaims (DRA) path when determining AcceleratorReady, and the resourceName field doc no longer presents DRA as a resourceName use case.
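A minimal sketch of the branch order this implies, written as a standalone Go program rather than against the real controller. Everything here is hypothetical and for illustration only: GPUSpec, availabilitySource, vendorDefaults, and resolveAvailabilitySource are not names from the codebase, and the shape of ResourceClaims (a list of claim names) is assumed. Only the amd -> amd.com/gpu default comes from this issue. The sketch decides which signal to consult; how DRA claim or device-class readiness is actually checked is left open.

```go
package main

import "fmt"

// GPUSpec is a hypothetical, trimmed-down stand-in for the Model's
// hardware.gpu block. The real type lives in api/v1alpha1 and may differ.
type GPUSpec struct {
	Vendor         string
	ResourceName   string   // device-plugin extended-resource override
	ResourceClaims []string // DRA claims (shape assumed for illustration)
}

// availabilitySource names the signal an availability check would consult.
type availabilitySource int

const (
	sourceNone availabilitySource = iota
	sourceResourceClaims
	sourceResourceName
	sourceVendorDefault
)

func (s availabilitySource) String() string {
	switch s {
	case sourceResourceClaims:
		return "resourceClaims (DRA)"
	case sourceResourceName:
		return "resourceName override"
	case sourceVendorDefault:
		return "vendor default"
	default:
		return "none"
	}
}

// vendorDefaults is illustrative. Only the amd entry is taken from the issue.
var vendorDefaults = map[string]string{
	"amd": "amd.com/gpu",
}

// resolveAvailabilitySource picks the signal to check. The resourceClaims
// case comes first so a DRA-only Model never falls through to the
// vendor-default extended-resource name. The returned string is the
// extended-resource name to look up, or "" when none applies.
func resolveAvailabilitySource(gpu GPUSpec) (availabilitySource, string) {
	switch {
	case len(gpu.ResourceClaims) > 0:
		return sourceResourceClaims, ""
	case gpu.ResourceName != "":
		return sourceResourceName, gpu.ResourceName
	}
	if name, ok := vendorDefaults[gpu.Vendor]; ok {
		return sourceVendorDefault, name
	}
	return sourceNone, ""
}

func main() {
	draOnly := GPUSpec{Vendor: "amd", ResourceClaims: []string{"gpu-claim"}}
	src, name := resolveAvailabilitySource(draOnly)
	fmt.Printf("source=%v extendedResource=%q\n", src, name)
}
```

Because resourceName and resourceClaims are mutually exclusive, the order of the first two cases does not matter for valid Models; what matters is that both are tested before the vendor default.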
Actual Behavior
checkAcceleratorAvailability resolves availability only via resourceName override or vendor/accelerator default; the resourceClaims-only path is not considered. The resourceName doc still references DRA.
Environment
LLMKube Version: main (post #750; v0.8.9 line)
Code-logic gap, not environment-specific.
Logs
N/A