Fail fast when overlap wait times out #20

Open
TianyeGGBond wants to merge 1 commit into rlops:zhenyu/miles-mvp-e2e from TianyeGGBond:tianye/f2-fail-fast-overlap-wait

@TianyeGGBond
Collaborator

Context

_wait_for_overlap_engines_offloaded is the safety gate that runs before training resumes on GPUs that inference engines may have just been using. If the engine state probe fails or the wait times out, continuing anyway only defers the failure: it tends to surface later as an OOM when training wakes up, which is much harder to trace back to the engines that never offloaded.

Change

  • raise when engine state polling fails
  • raise when engines do not reach a safe state before the timeout
  • raise when the free-memory gate times out
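
The three conditions above can be sketched roughly as follows. This is a minimal illustration, not the code in rlix/pipeline/miles_pipeline.py: the function name, the `OverlapWaitError` type, the `poll_states` and `free_memory_bytes` callables, the `"offloaded"` state name, and the choice of one deadline shared by both gates are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable


class OverlapWaitError(RuntimeError):
    """Hypothetical error type; the PR does not name the exception it raises."""


def wait_for_engines_offloaded(
    poll_states: Callable[[], Iterable[str]],   # assumed probe: returns one state per engine
    free_memory_bytes: Callable[[], int],       # assumed probe: returns free GPU memory
    min_free_bytes: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
    safe_states: frozenset = frozenset({"offloaded"}),  # assumed state name
) -> None:
    """Return only when both gates pass; raise instead of continuing otherwise."""
    # Assumption: both gates share a single deadline.
    deadline = time.monotonic() + timeout_s

    # Gate 1: every engine must report a safe state.
    while True:
        try:
            states = list(poll_states())
        except Exception as exc:
            # A failed probe is treated as fatal rather than as "probably fine".
            raise OverlapWaitError("engine state polling failed") from exc
        if all(state in safe_states for state in states):
            break
        if time.monotonic() >= deadline:
            raise OverlapWaitError(
                f"engines not in a safe state after {timeout_s}s: {states}"
            )
        time.sleep(poll_interval_s)

    # Gate 2: enough memory must actually be free before training resumes.
    while True:
        free = free_memory_bytes()
        if free >= min_free_bytes:
            return
        if time.monotonic() >= deadline:
            raise OverlapWaitError(
                f"free-memory gate timed out: {free} < {min_free_bytes} bytes"
            )
        time.sleep(poll_interval_s)
```

Raising at the gate keeps the error next to its cause: the message names the engine states or the memory shortfall, instead of leaving the caller to infer them from a later OOM.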

Validation

  • python -m py_compile rlix/pipeline/miles_pipeline.py
