fix(miles): keep health checks in admission mode #7

Open
TianyeGGBond wants to merge 1 commit into rlops:zhenyu/m11-mvp-test from TianyeGGBond:zhenyu/fix-router-health-admission


TianyeGGBond commented May 30, 2026

Context

This is a narrow follow-up to the M11 router admission path.

In admission mode, an empty enabled_workers set is a valid runtime state: the scheduler may have disabled/offloaded every registered worker, and generation requests should suspend until a worker is re-admitted.

Before this change, _health_check_loop tested the truthiness of self.enabled_workers (if self.enabled_workers) to decide whether admission mode was active. Because an empty set is falsy, an empty enabled set is ambiguous between two states:

  • legacy/pre-admission mode: no enabled set has been declared yet, so probing the full registered worker set is correct;
  • admission/zero-active mode: all workers are intentionally disabled, so falling back to the full registered set incorrectly probes disabled workers.

That fallback can interfere with the sleep/offload lifecycle: disabled workers are parked and should not be health-probed until they are re-admitted.
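
The ambiguity can be reproduced in a few lines. This is a minimal sketch of the old selection rule, not the router's actual code: the helper name legacy_probe_targets, its parameters, and the worker URLs are hypothetical, and only enabled_workers is taken from the description above.

```python
def legacy_probe_targets(registered, enabled_workers, dead):
    """Hypothetical stand-in for the old selection: a truthiness test."""
    if enabled_workers:
        return set(enabled_workers) - set(dead)
    # An empty set is falsy, so this fallback is also taken when every
    # worker has been intentionally disabled.
    return set(registered) - set(dead)


registered = {"http://w0", "http://w1"}

# Pre-admission: no enabled set has been declared, probing everything is intended.
pre_admission = legacy_probe_targets(registered, set(), set())

# Zero-active admission: all workers are disabled, but the inputs are
# indistinguishable from the case above, so the disabled workers get probed.
zero_active = legacy_probe_targets(registered, set(), set())
```

Both calls return the full registered set, which is the behavior the zero-active case needs to avoid.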

Change

  • Use _admission_declared as the admission-mode switch in _health_check_loop.
  • In admission mode, probe only enabled, non-dead workers, even when the enabled set is empty.
  • Preserve legacy behavior before admission has ever been declared: probe registered workers minus dead workers.
  • Add a focused router admission test for the zero-active case, verifying a disabled worker is not health-probed.
  • Keep the production code intentionally small: no new helper and no extra feature-specific comments.
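
The selection rule described in the list above could look roughly like the following. It is a sketch under stated assumptions: select_probe_targets and its parameters are hypothetical names for illustration (the PR explicitly adds no new helper), and the flag argument stands in for _admission_declared.

```python
def select_probe_targets(registered, enabled_workers, dead, admission_declared):
    """Hypothetical sketch of which workers a health-check pass would probe."""
    if admission_declared:
        # Admission mode: only enabled, non-dead workers, even if that is nobody.
        return set(enabled_workers) - set(dead)
    # Legacy mode: admission was never declared, so probe all registered
    # workers except the dead ones.
    return set(registered) - set(dead)


registered = {"http://w0", "http://w1"}

# Zero-active admission: nothing is probed, so parked workers are left alone.
zero_active = select_probe_targets(registered, set(), set(), admission_declared=True)

# Legacy mode with the same empty enabled set: every live worker is probed.
legacy = select_probe_targets(registered, set(), {"http://w1"}, admission_declared=False)
```

Keying the branch on an explicit flag rather than on the set's truthiness is what separates "nothing declared yet" from "everything disabled".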

Validation

Passed:

python -m pytest tests/test_partial_sleep_wake.py::TestRouterAdmissionLifecycle -q
# 4 passed, 1 warning

Passed:

git diff --check origin/zhenyu/m11-mvp-test..HEAD

Also tried the whole file:

python -m pytest tests/test_partial_sleep_wake.py -q

The local result is limited by heavy dependencies, unrelated to this change, that are missing in this Windows environment:

  • ModuleNotFoundError: No module named 'ray' for TestEngineInfoStateMachine
  • ModuleNotFoundError: No module named 'torch' for TestSchedulerPreemptClassification

TianyeGGBond force-pushed the zhenyu/fix-router-health-admission branch 2 times, most recently from 7fc79f1 to 56bf34a (May 30, 2026 03:44)
TianyeGGBond force-pushed the zhenyu/fix-router-health-admission branch from 56bf34a to 4059f2a (May 30, 2026 03:51)