fix(controller): preserve agent-written schedulingStatus on reconcile (#643)#774

Merged
Defilan merged 1 commit into
defilantech:mainfrom
Defilan:foreman/issue-643-status-clobber
Jun 21, 2026

@Defilan Defilan commented Jun 21, 2026

Member

What

Stop the InferenceService controller from clobbering the metal-agent's
status.schedulingStatus / status.schedulingMessage. updateStatusWithSchedulingInfo
zeroed those fields whenever schedulingInfo == nil, wiping the agent's
InsufficientMemory / MemoryCheckFailed writes almost immediately.

Foreman-authored (Strix Qwopus-27B), gate-verified: the in-cluster verify
gate ran the full make test (controller envtest) to GATE-PASS.

Why

The agent writes the scheduling rejection so kubectl describe/get shows
why a service won't start, but the controller's next reconcile erased it.
Fixes #643.

How

  • internal/controller/status_builder.go: drop the else branch that set
    the scheduling fields to ""; when schedulingInfo == nil, leave the
    agent-owned fields intact. (The preserve approach from the issue's two
    suggestions; the SSA-field-manager alternative was not taken.)
  • internal/controller/inferenceservice_reconcile_test.go: envtest spec
    asserting an agent-written scheduling status survives a controller status
    update.
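
The preserve approach can be sketched as follows. This is a minimal illustration, not the code in status_builder.go: the Status and SchedulingInfo types, their field types (WaitingFor is modeled as a string here), and the applySchedulingInfo name are hypothetical stand-ins. The only behavior it shows is that a nil schedulingInfo leaves the agent-written fields untouched.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the InferenceService status.
// Only the fields named in this PR are modeled; real types may differ.
type Status struct {
	SchedulingStatus  string
	SchedulingMessage string
	WaitingFor        string
}

// SchedulingInfo is a hypothetical stand-in for the controller's own
// scheduling result.
type SchedulingInfo struct {
	Status     string
	Message    string
	WaitingFor string
}

// applySchedulingInfo copies the controller's scheduling result into the
// status when there is one. When info is nil it returns without writing,
// so values previously written by the agent are preserved.
func applySchedulingInfo(st *Status, info *SchedulingInfo) {
	if info == nil {
		return // no else branch: agent-owned fields stay intact
	}
	st.SchedulingStatus = info.Status
	st.SchedulingMessage = info.Message
	st.WaitingFor = info.WaitingFor
}

func main() {
	// Simulate a status the agent has already written to.
	st := Status{
		SchedulingStatus:  "InsufficientMemory",
		SchedulingMessage: "written by the metal-agent",
	}
	applySchedulingInfo(&st, nil)
	fmt.Println(st.SchedulingStatus) // InsufficientMemory
}
```

The early return keeps ownership simple: the controller writes the scheduling fields only when it has something to say, at the cost that nothing in the controller ever clears them.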

Reviewer note (follow-up, not blocking this fix)

With the controller no longer clearing on schedulingInfo == nil, the field
can now persist after the condition resolves. The metal-agent only sets
SchedulingStatus (pkg/agent/agent.go:1388, :1426) and does not clear it on a
later successful memory check (:1396 returns nil without clearing), so a
recovered service can show a stale InsufficientMemory. This fix is still a
net improvement (the bug was that the rejection was never visible), but a
follow-up should have the agent clear SchedulingStatus/SchedulingMessage on
memory-check-pass.

…us update

The metal-agent writes status.schedulingStatus and status.schedulingMessage
(e.g. "MemoryCheckFailed", "InsufficientMemory") when admission rejects an
InferenceService. The controller's reconcile loop subsequently updated the
status without preserving those fields, so they read back empty almost
immediately.

The fix removes the else branch in updateStatusWithSchedulingInfo that
unconditionally cleared SchedulingStatus, SchedulingMessage, and WaitingFor
when schedulingInfo is nil. When schedulingInfo is nil (the common case for
non-GPU-scheduling scenarios), the existing agent-written values are now
preserved.

A regression test verifies that agent-written scheduling fields survive a
controller reconcile.

Fixes defilantech#643

Signed-off-by: Foreman Bot <chris@mahercode.io>

codecov Bot commented Jun 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@Defilan Defilan merged commit 4321028 into defilantech:main Jun 21, 2026
23 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] Controller reconcile clobbers agent-written schedulingStatus on InferenceService
