fix(metal-agent): clear SchedulingStatus on memory-check-pass (#777) #779

Merged
Defilan merged 1 commit into defilantech:main from Defilan:foreman/issue-777-clear-schedstatus
Jun 21, 2026

Conversation

@Defilan Defilan commented Jun 21, 2026

Member

What

Have the metal-agent clear status.schedulingStatus / status.schedulingMessage
when a memory admission check passes, so a service that recovers (memory freed,
then starts) no longer shows a stale InsufficientMemory / MemoryCheckFailed.
Fixes #777.

Foreman-authored (Strix Qwopus-27B), gate-verified by the in-cluster verify
gate (full make test, GATE-PASS).

Why

The agent set these fields only on a failed check; nothing cleared them on
success. With #643 / PR #774 the controller correctly stopped clobbering them,
which exposed that the agent owns them but never cleared them. Net effect
before this fix: the rejection status persisted indefinitely after recovery.

How

  • pkg/agent/agent.go: on the memory-check-pass path (after the "memory check
    passed" log, before return nil), clear both fields and push a status update
    — only when one is non-empty, to avoid a spurious write on every pass.
    Best-effort with a warn log, matching the existing status-write pattern.
    Leaves the controller-owned WaitingFor untouched.
  • pkg/agent/memory_admission_test.go: assert a pre-set scheduling status is
    cleared after a passing check.

Checklist

  • Tests added/updated
  • make test passes (verify gate Job, GATE-PASS)
  • make lint passes
  • Commits signed off (DCO)

The metal-agent sets status.schedulingStatus to "InsufficientMemory" or
"MemoryCheckFailed" on a failed memory admission, but the
memory-check-pass path returned nil without clearing those fields. With
PR defilantech#774 the controller no longer clears them (it correctly preserves
agent-owned scheduling status). Net result: once set, SchedulingStatus
persisted after the condition resolved, so a service that recovered kept
showing a stale InsufficientMemory.

On a successful memory check, clear SchedulingStatus and
SchedulingMessage if they are set, and update the InferenceService
status. Add a regression test that verifies stale status is cleared.

Fixes defilantech#777

Signed-off-by: Foreman Bot <chris@mahercode.io>

codecov Bot commented Jun 21, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/agent/agent.go | 60.00% | 1 Missing and 1 partial ⚠️ |

@Defilan Defilan merged commit ed1c4eb into defilantech:main Jun 21, 2026
23 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] metal-agent never clears SchedulingStatus on memory-check-pass (stale after recovery)