[BUG] checkAcceleratorAvailability ignores GPU.resourceClaims; AcceleratorReady inaccurate for DRA-only Models #754

@Defilan

Bug Description

Follow-up to #750 (vendor-neutral DRA) and #753 (CEL guard fix), surfaced during review of #753.

Two related DRA correctness gaps remain, both pre-existing and out of scope for #753:

1. checkAcceleratorAvailability is not DRA-aware (functional).
A Model can now request a GPU via hardware.gpu.resourceClaims (DRA) instead of hardware.gpu.resourceName (device plugin). However, checkAcceleratorAvailability (internal/controller/model_controller.go:757) resolves availability only from the resourceName override (lines 772-773) or from the vendor/accelerator default extended-resource name (lines 777-785); it has no resourceClaims branch. A DRA-only Model (no resourceName) therefore falls through to the vendor-default extended-resource check (for example, amd -> amd.com/gpu), which is not the signal a DRA-scheduled GPU advertises. Result: Status.AcceleratorReady is inaccurate for DRA-only Models.

2. resourceName field doc lists DRA as a use case (doc cleanup).
The ResourceName field doc comment in api/v1alpha1/model_types.go still lists "DRA-backed resources" as a use case for resourceName. Now that resourceName and resourceClaims are mutually exclusive (enforced by the CEL rule fixed in #753), that mention is misleading. DRA workflows should use resourceClaims.
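For the doc cleanup in item 2, one possible wording is sketched below. This is only an illustration: GPUSpec here is a simplified, hypothetical stand-in for the real type in api/v1alpha1/model_types.go, the []string element type for ResourceClaims is an assumption, and validate merely restates in Go the mutual-exclusion rule that the project enforces with CEL.

```go
package main

import (
	"errors"
	"fmt"
)

// GPUSpec is a simplified, hypothetical stand-in for the real API type.
type GPUSpec struct {
	// ResourceName overrides the extended-resource name requested from a
	// device plugin. Mutually exclusive with ResourceClaims; DRA workflows
	// should use ResourceClaims instead.
	ResourceName string

	// ResourceClaims lists DRA claims (element type assumed for illustration).
	ResourceClaims []string
}

// errBothSet is a hypothetical error used only by this sketch.
var errBothSet = errors.New("resourceName and resourceClaims are mutually exclusive")

// validate restates the mutual-exclusion rule in plain Go for illustration.
func validate(g GPUSpec) error {
	if g.ResourceName != "" && len(g.ResourceClaims) > 0 {
		return errBothSet
	}
	return nil
}

func main() {
	both := GPUSpec{ResourceName: "amd.com/gpu", ResourceClaims: []string{"gpu-claim"}}
	fmt.Println(validate(both)) // rejected: both fields set

	draOnly := GPUSpec{ResourceClaims: []string{"gpu-claim"}}
	fmt.Println(validate(draOnly)) // <nil>
}
```

With the two fields mutually exclusive, the resourceName comment no longer needs to mention DRA at all; pointing readers at resourceClaims is enough.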

Steps to Reproduce

  1. Deploy a Model with hardware.gpu.resourceClaims set and no hardware.gpu.resourceName.
  2. Observe Status.AcceleratorReady.
  3. Status.AcceleratorReady reflects vendor-default extended-resource availability rather than DRA claim/device-class readiness.

Expected Behavior

checkAcceleratorAvailability accounts for the resourceClaims (DRA) path when determining AcceleratorReady, and the resourceName field doc no longer presents DRA as a resourceName use case.
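A minimal sketch of what a DRA-aware resolution order could look like, under stated assumptions: GPUSpec and Probes are hypothetical stand-ins rather than the controller's real types, ClaimReady is a placeholder for whatever claim/device-class readiness signal the fix ends up using, and only the amd default comes from this issue.

```go
package main

import "fmt"

// GPUSpec is a simplified, hypothetical stand-in for the Model's hardware.gpu block.
type GPUSpec struct {
	Vendor         string
	ResourceName   string
	ResourceClaims []string // element type assumed for illustration
}

// Probes holds the two readiness lookups as injected functions.
// Both are hypothetical; the real controller would query the cluster.
type Probes struct {
	ClaimReady        func(claim string) bool
	ExtendedAvailable func(resource string) bool
}

// vendorDefaults: only the amd entry is taken from the issue text.
var vendorDefaults = map[string]string{"amd": "amd.com/gpu"}

// acceleratorReady checks the DRA path first, then falls back to the
// resourceName override, then to the vendor default.
func acceleratorReady(g GPUSpec, p Probes) bool {
	if len(g.ResourceClaims) > 0 {
		for _, c := range g.ResourceClaims {
			if !p.ClaimReady(c) {
				return false
			}
		}
		return true
	}
	name := g.ResourceName
	if name == "" {
		name = vendorDefaults[g.Vendor]
	}
	return name != "" && p.ExtendedAvailable(name)
}

func main() {
	p := Probes{
		ClaimReady:        func(c string) bool { return c == "gpu-claim" },
		ExtendedAvailable: func(string) bool { return false }, // no device-plugin GPUs
	}
	draOnly := GPUSpec{Vendor: "amd", ResourceClaims: []string{"gpu-claim"}}
	fmt.Println(acceleratorReady(draOnly, p)) // true
}
```

Checking resourceClaims before the extended-resource branches means a DRA-only Model is never judged by a vendor default it does not use.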

Actual Behavior

checkAcceleratorAvailability resolves availability only via resourceName override or vendor/accelerator default; the resourceClaims-only path is not considered. The resourceName doc still references DRA.
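The fall-through can be illustrated with a small sketch that mirrors the two branches described in this issue. It is not the controller's code: GPUSpec and resolveExtendedResource are hypothetical names, the []string shape of ResourceClaims is an assumption, and only the amd default is taken from the issue text.

```go
package main

import "fmt"

// GPUSpec is a simplified, hypothetical stand-in for the Model's hardware.gpu block.
type GPUSpec struct {
	Vendor         string
	ResourceName   string   // device-plugin extended-resource override
	ResourceClaims []string // DRA claims (shape assumed for illustration)
}

// vendorDefaults: only the amd entry is taken from the issue text.
var vendorDefaults = map[string]string{"amd": "amd.com/gpu"}

// resolveExtendedResource mirrors the two existing branches:
// the resourceName override, then the vendor default.
// ResourceClaims is never consulted, which is the gap reported here.
func resolveExtendedResource(g GPUSpec) string {
	if g.ResourceName != "" {
		return g.ResourceName
	}
	return vendorDefaults[g.Vendor]
}

func main() {
	draOnly := GPUSpec{Vendor: "amd", ResourceClaims: []string{"gpu-claim"}}
	// The DRA-only spec is judged by the vendor default extended resource.
	fmt.Println(resolveExtendedResource(draOnly)) // amd.com/gpu
}
```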

Environment

LLMKube Version: main (post #750; v0.8.9 line)

Code-logic gap, not environment-specific.

Logs

N/A
