fix(controller): preserve agent-written schedulingStatus on reconcile (#643) #774
Merged
Defilan merged 1 commit on Jun 21, 2026
…us update
The metal-agent writes status.schedulingStatus and status.schedulingMessage (e.g. "MemoryCheckFailed", "InsufficientMemory") when admission rejects an InferenceService. The controller's reconcile loop subsequently updated the status without preserving those fields, so they read back empty almost immediately.
The fix removes the else branch in updateStatusWithSchedulingInfo that unconditionally cleared SchedulingStatus, SchedulingMessage, and WaitingFor when schedulingInfo is nil. When schedulingInfo is nil (the common case for non-GPU-scheduling scenarios), the existing agent-written values are now preserved.
A regression test verifies that agent-written scheduling fields survive a controller reconcile.
Fixes defilantech#643
Signed-off-by: Foreman Bot <chris@mahercode.io>
Codecov Report
✅ All modified and coverable lines are covered by tests.
This was referenced Jun 21, 2026
Closed
What
Stop the InferenceService controller from clobbering the metal-agent's status.schedulingStatus / status.schedulingMessage. updateStatusWithSchedulingInfo zeroed those fields whenever schedulingInfo == nil, wiping the agent's InsufficientMemory / MemoryCheckFailed writes almost immediately.
Foreman-authored (Strix Qwopus-27B), gate-verified: the in-cluster verify gate ran the full make test (controller envtest) to GATE-PASS.
Why
The agent writes the scheduling rejection so kubectl describe/get shows why a service won't start, but the controller's next reconcile erased it.
Fixes #643.
How
internal/controller/status_builder.go: drop the else branch that set the scheduling fields to ""; when schedulingInfo == nil, leave the agent-owned fields intact. (The preserve approach from the issue's two suggestions; the SSA-field-manager alternative was not taken.)
internal/controller/inferenceservice_reconcile_test.go: envtest spec asserting an agent-written scheduling status survives a controller status update.
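A minimal sketch of the preserve behavior described above, not the actual diff. Only the function name updateStatusWithSchedulingInfo and the field names SchedulingStatus, SchedulingMessage, and WaitingFor come from this PR; the function signature, the SchedulingInfo struct, and the sample values ("WaitingForGPU", the message text) are hypothetical stand-ins.

```go
package main

import "fmt"

// InferenceServiceStatus is a hypothetical, trimmed-down stand-in for the
// real status type; only the three scheduling fields are modeled.
type InferenceServiceStatus struct {
	SchedulingStatus  string
	SchedulingMessage string
	WaitingFor        string
}

// SchedulingInfo is an assumed shape for what the controller computes
// during reconcile. The real type may differ.
type SchedulingInfo struct {
	Status     string
	Message    string
	WaitingFor string
}

// updateStatusWithSchedulingInfo copies controller-computed scheduling data
// into the status. The signature is assumed for illustration.
func updateStatusWithSchedulingInfo(status *InferenceServiceStatus, info *SchedulingInfo) {
	if info != nil {
		status.SchedulingStatus = info.Status
		status.SchedulingMessage = info.Message
		status.WaitingFor = info.WaitingFor
	}
	// Deliberately no else branch: when info is nil the controller has
	// nothing to say, so whatever the agent wrote is left untouched.
}

func main() {
	// Simulate a status the agent has already written a rejection into.
	status := InferenceServiceStatus{
		SchedulingStatus:  "InsufficientMemory",
		SchedulingMessage: "illustrative message from the agent",
	}

	// Controller reconcile with no scheduling info of its own.
	updateStatusWithSchedulingInfo(&status, nil)
	fmt.Printf("after nil info: %q / %q\n", status.SchedulingStatus, status.SchedulingMessage)

	// Controller reconcile that does have scheduling info overwrites as before.
	updateStatusWithSchedulingInfo(&status, &SchedulingInfo{
		Status:     "WaitingForGPU", // hypothetical value
		Message:    "illustrative controller message",
		WaitingFor: "gpu",
	})
	fmt.Printf("after non-nil info: %q / %q\n", status.SchedulingStatus, status.SchedulingMessage)
}
```

The trade-off of this approach is that the controller never clears these fields on its own, so whoever writes them is also responsible for clearing them.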
Reviewer note (follow-up, not blocking this fix)
With the controller no longer clearing on schedulingInfo == nil, and the metal-agent only setting SchedulingStatus (pkg/agent/agent.go:1388, :1426) without clearing it on a later successful memory check (:1396 returns nil without clearing), the field will now persist after the condition resolves: a recovered service can show a stale InsufficientMemory. This fix is still a net improvement (the bug was that the rejection was never visible), but a follow-up should have the agent clear SchedulingStatus / SchedulingMessage on memory-check-pass.