fix(controller): make checkAcceleratorAvailability DRA-aware (#754)#776
Open
Defilan wants to merge 2 commits into
Open
fix(controller): make checkAcceleratorAvailability DRA-aware (#754)#776Defilan wants to merge 2 commits into
Defilan wants to merge 2 commits into
Conversation
checkAcceleratorAvailability only checked resourceName overrides and vendor-default extended resources, ignoring the resourceClaims (DRA) path introduced in defilantech#750. A DRA-only Model (resourceClaims set, no resourceName) would fall through to a vendor-default extended-resource check, producing an inaccurate AcceleratorReady status. Add a hasDRAAvailability helper that checks whether each referenced ResourceClaim or ResourceClaimTemplate exists. Return false on NotFound so AcceleratorReady reflects reality; fail-open (true) for transient or RBAC errors. Also remove the misleading "DRA-backed resources" mention from the resourceName field doc comment, since resourceName and resourceClaims are now mutually exclusive. Fixes defilantech#754 Signed-off-by: Foreman Bot <chris@mahercode.io>
hasDRAAvailability hardcoded Namespace: "default" when looking up the referenced ResourceClaim/ResourceClaimTemplate, so a Model in any other namespace would always resolve NotFound and be marked AcceleratorReady=false. Thread the model's namespace through. Add a controller test that creates the claim in a non-default namespace to cover the path. Signed-off-by: Christopher Maher <chris@mahercode.io>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Make
checkAcceleratorAvailabilityDRA-aware: a Model that requests a GPU viahardware.gpu.resourceClaims(noresourceName) now resolvesAcceleratorReadyfrom the referenced ResourceClaim/ResourceClaimTemplateinstead of falling through to a vendor-default extended-resource check. Also
removes the misleading "DRA-backed resources" mention from the
resourceNamefield doc (now that the two are mutually exclusive). Fixes #754.
How
internal/controller/model_controller.go: DRA branch incheckAcceleratorAvailability->hasDRAAvailability, which returnsfalse when the referenced ResourceClaim/ResourceClaimTemplate is
NotFound (accurate readiness) and fails open (true) only on transient/RBAC
errors. Claims are resolved in the model's own namespace.
api/v1alpha1/model_types.go:resourceNamedoc cleanup.Provenance / review
Foreman coder (Strix Qwopus-27B) over two cycles plus human review, gate-verified
by the in-cluster verify gate (full
make test, GATE-PASS):hasDRAAvailabilitydid lookups but alwaysreturned true) and stale CRDs;
Namespace: "default";non-default-namespace test that fails against the hardcoded version.
Checklist
make testpasses (verify gate Job: fullmake test, GATE-PASS)make lintpasses (GOOS=linux golangci-lint run, 0 issues)make manifests/make chart-crds)