fix(controller): make checkAcceleratorAvailability DRA-aware (#754) #776

Open
Defilan wants to merge 2 commits into defilantech:main from Defilan:foreman/issue-754-accel-dra-2

Defilan (Member) commented Jun 21, 2026

What

Make checkAcceleratorAvailability DRA-aware: a Model that requests a GPU via
hardware.gpu.resourceClaims (no resourceName) now resolves
AcceleratorReady from the referenced ResourceClaim/ResourceClaimTemplate
instead of falling through to a vendor-default extended-resource check. Also
removes the misleading "DRA-backed resources" mention from the resourceName
field doc (now that the two are mutually exclusive). Fixes #754.
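
The selection order described above can be sketched as follows. This is only an illustration, not the controller's code: `GPUSpec`, its field shapes, `availabilityPath`, and the returned labels are hypothetical names introduced here, and the real `resourceClaims` entries are likely richer than plain strings.

```go
package main

import "fmt"

// GPUSpec is a hypothetical stand-in for the Model's hardware.gpu fields.
type GPUSpec struct {
	ResourceName   string   // extended-resource override
	ResourceClaims []string // names of referenced claims or claim templates (shape assumed)
}

// availabilityPath names the check that decides AcceleratorReady.
// resourceName and resourceClaims are mutually exclusive, so at most one of
// the first two cases can match.
func availabilityPath(gpu GPUSpec) string {
	switch {
	case len(gpu.ResourceClaims) > 0:
		return "dra" // resolve readiness from the referenced claims
	case gpu.ResourceName != "":
		return "resource-name-override"
	default:
		return "vendor-default" // extended-resource check
	}
}

func main() {
	fmt.Println(availabilityPath(GPUSpec{ResourceClaims: []string{"gpu-claim"}}))
	fmt.Println(availabilityPath(GPUSpec{ResourceName: "example.com/gpu"}))
	fmt.Println(availabilityPath(GPUSpec{}))
}
```

Before this change, the first case did not exist, so a DRA-only Model landed in the default branch.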

How

  • internal/controller/model_controller.go: adds a DRA branch to
    checkAcceleratorAvailability that delegates to hasDRAAvailability. The
    helper returns false when the referenced ResourceClaim/ResourceClaimTemplate
    is NotFound, so AcceleratorReady reflects reality, and fails open (returns
    true) only on transient or RBAC errors. Claims are resolved in the Model's
    own namespace.
  • api/v1alpha1/model_types.go: resourceName doc cleanup.
  • regenerated CRDs + controller tests (including a non-default-namespace case).

Provenance / review

Produced by the Foreman coder (Strix Qwopus-27B) over two cycles, followed by
human review, and verified by the in-cluster verify gate (full make test,
GATE-PASS):

  • cycle 1 produced a hollow check (hasDRAAvailability did lookups but always
    returned true) and stale CRDs;
  • the cluster gate caught the codegen drift, review caught the hollow check;
  • cycle 2 fixed the logic + CRDs but hardcoded Namespace: "default";
  • review caught the namespace regression; fixed by hand with a
    non-default-namespace test that fails against the hardcoded version.
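
Why a non-default-namespace test separates the two versions can be shown with a small sketch. The in-memory `claimStore` and both lookup functions are hypothetical stand-ins written for this example, not the controller's code or its test.

```go
package main

import "fmt"

// claimStore is a hypothetical in-memory stand-in for the API server,
// keyed by "namespace/name".
type claimStore map[string]bool

func (s claimStore) exists(namespace, name string) bool {
	return s[namespace+"/"+name]
}

// readyHardcoded reproduces the cycle-2 bug: the namespace is always "default".
func readyHardcoded(s claimStore, modelNamespace, claim string) bool {
	_ = modelNamespace // ignored, which is the bug
	return s.exists("default", claim)
}

// readyThreaded looks the claim up in the Model's own namespace.
func readyThreaded(s claimStore, modelNamespace, claim string) bool {
	return s.exists(modelNamespace, claim)
}

func main() {
	// The claim lives next to the Model, outside "default".
	store := claimStore{"team-a/gpu-claim": true}

	fmt.Println(readyHardcoded(store, "team-a", "gpu-claim")) // false
	fmt.Println(readyThreaded(store, "team-a", "gpu-claim"))  // true
}
```

A test that only creates claims in "default" passes against both versions, which is why the earlier test suite did not catch the regression.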

Checklist

  • Tests added/updated (including non-default namespace)
  • make test passes (verify gate Job: full make test, GATE-PASS)
  • make lint passes (GOOS=linux golangci-lint run, 0 issues)
  • CRDs regenerated (make manifests / make chart-crds)
  • Commit messages follow conventional commits
  • All commits are signed off (DCO)

Defilan added 2 commits June 21, 2026 01:54
checkAcceleratorAvailability only checked resourceName overrides and
vendor-default extended resources, ignoring the resourceClaims (DRA) path
introduced in defilantech#750. A DRA-only Model (resourceClaims set, no resourceName)
would fall through to a vendor-default extended-resource check, producing
an inaccurate AcceleratorReady status.

Add a hasDRAAvailability helper that checks whether each referenced
ResourceClaim or ResourceClaimTemplate exists. Return false on NotFound
so AcceleratorReady reflects reality; fail-open (true) for transient or
RBAC errors. Also remove the misleading "DRA-backed resources" mention
from the resourceName field doc comment, since resourceName and
resourceClaims are now mutually exclusive.

Fixes defilantech#754

Signed-off-by: Foreman Bot <chris@mahercode.io>

hasDRAAvailability hardcoded Namespace: "default" when looking up the
referenced ResourceClaim/ResourceClaimTemplate, so a Model in any other
namespace would always resolve NotFound and be marked AcceleratorReady=false.
Thread the model's namespace through. Add a controller test that creates the
claim in a non-default namespace to cover the path.

Signed-off-by: Christopher Maher <chris@mahercode.io>

codecov Bot commented Jun 21, 2026


Codecov Report

❌ Patch coverage is 70.37037% with 8 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
internal/controller/model_controller.go | 70.37% | 8 Missing ⚠️


Development

Successfully merging this pull request may close these issues.

[BUG] checkAcceleratorAvailability ignores GPU.resourceClaims; AcceleratorReady inaccurate for DRA-only Models
