fix(metal-agent): clear SchedulingStatus on memory-check-pass (#777) #779
Merged
Defilan merged 1 commit on Jun 21, 2026
The metal-agent sets status.schedulingStatus to "InsufficientMemory" or "MemoryCheckFailed" on a failed memory admission, but the memory-check-pass path returned nil without clearing those fields. With PR defilantech#774 the controller no longer clears them (it correctly preserves agent-owned scheduling status). Net result: once set, SchedulingStatus persisted after the condition resolved, so a service that recovered kept showing a stale InsufficientMemory.

On a successful memory check, clear SchedulingStatus and SchedulingMessage if they are set, and update the InferenceService status. Add a regression test that verifies stale status is cleared.

Fixes defilantech#777

Signed-off-by: Foreman Bot <chris@mahercode.io>
What
Have the metal-agent clear status.schedulingStatus/status.schedulingMessage when a memory admission check passes, so a service that recovers (memory freed, then starts) no longer shows a stale InsufficientMemory/MemoryCheckFailed.

Fixes #777.

Foreman-authored (Strix Qwopus-27B), gate-verified by the in-cluster verify gate (full make test, GATE-PASS).

Why
The agent only set these fields on a failed check; nothing cleared them on success. With #643 / PR #774 the controller correctly stopped clobbering them, which exposed that the agent owns them but never cleared them. Net effect before this fix: the rejection status persisted indefinitely after recovery.
How
- pkg/agent/agent.go: on the memory-check-pass path (after the "memory check passed" log, before return nil), clear both fields and push a status update, but only when one is non-empty, to avoid a spurious write on every pass. Best-effort with a warn log, matching the existing status-write pattern. Leaves the controller-owned WaitingFor untouched.
- pkg/agent/memory_admission_test.go: assert a pre-set scheduling status is cleared after a passing check.
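
The clear-on-pass behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in pkg/agent/agent.go: the Status and InferenceService structs, the StatusWriter interface, the clearStaleScheduling helper, and the countingWriter fake are all hypothetical stand-ins, and the real agent presumably writes status through its Kubernetes client instead.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Status is a hypothetical stand-in for the status fields named in this PR.
type Status struct {
	SchedulingStatus  string // agent-owned, e.g. "InsufficientMemory"
	SchedulingMessage string // agent-owned
	WaitingFor        string // controller-owned; left untouched here
}

// InferenceService is a hypothetical, trimmed-down resource type.
type InferenceService struct {
	Name   string
	Status Status
}

// StatusWriter is an assumed abstraction over the status update call.
type StatusWriter interface {
	UpdateStatus(svc *InferenceService) error
}

// clearStaleScheduling clears the agent-owned scheduling fields after a
// passing memory check. It writes only when at least one field is set,
// and treats a failed write as best-effort (warn log, no error returned).
// It reports whether a write was attempted.
func clearStaleScheduling(svc *InferenceService, w StatusWriter) bool {
	if svc.Status.SchedulingStatus == "" && svc.Status.SchedulingMessage == "" {
		return false // nothing stale: avoid a spurious write on every pass
	}
	svc.Status.SchedulingStatus = ""
	svc.Status.SchedulingMessage = ""
	if err := w.UpdateStatus(svc); err != nil {
		log.Printf("warn: failed to clear scheduling status for %s: %v", svc.Name, err)
	}
	return true
}

// countingWriter is a fake writer that counts calls and can simulate failure.
type countingWriter struct {
	calls int
	fail  bool
}

func (c *countingWriter) UpdateStatus(*InferenceService) error {
	c.calls++
	if c.fail {
		return errors.New("simulated status update failure")
	}
	return nil
}

func main() {
	svc := &InferenceService{
		Name: "demo",
		Status: Status{
			SchedulingStatus:  "InsufficientMemory",
			SchedulingMessage: "not enough memory",
			WaitingFor:        "model-download",
		},
	}
	w := &countingWriter{}

	// First passing check clears the stale fields and writes once.
	fmt.Println(clearStaleScheduling(svc, w), w.calls)
	// A later passing check finds nothing to clear and does not write.
	fmt.Println(clearStaleScheduling(svc, w), w.calls)
	fmt.Printf("WaitingFor preserved: %q\n", svc.Status.WaitingFor)
}
```

Guarding on non-empty fields keeps the pass path free of status writes in the common case, and the fake writer gives a regression test a simple way to assert both that the stale status was cleared and that no extra write happened.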
Checklist
- make test passes (verify gate Job, GATE-PASS)
- make lint passes