fix(trainer): return TRAINJOB_COMPLETE when all steps are done #340
…w#338) Signed-off-by: priyank <priyank8445@gmail.com>
Pull request overview
This PR fixes a one-line bug in LocalProcessBackend.__get_job_status() where the else branch (reached when all steps are in TRAINJOB_COMPLETE state) incorrectly returned TRAINJOB_CREATED instead of TRAINJOB_COMPLETE. This caused wait_for_job_status() to always time out (after 600 seconds) on the local backend, even for successfully completed jobs.
Changes:
- Fix the else branch of __get_job_status to return TRAINJOB_COMPLETE instead of TRAINJOB_CREATED when all steps have finished successfully.
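For readers following along, here is a minimal sketch of the kind of status aggregation described above. It is not the SDK's actual code: the function name `get_job_status`, the string values of the constants, the `TRAINJOB_RUNNING` and `TRAINJOB_FAILED` names, and the order of the earlier branches are all assumptions made for illustration. Only the final else branch reflects what this PR changes.

```python
# Hypothetical status constants; the real values are defined in the Kubeflow SDK.
TRAINJOB_CREATED = "Created"
TRAINJOB_RUNNING = "Running"    # assumed name, for illustration
TRAINJOB_COMPLETE = "Complete"
TRAINJOB_FAILED = "Failed"      # assumed name, for illustration


def get_job_status(step_statuses: list[str]) -> str:
    """Collapse per-step statuses into a single job status (illustrative only)."""
    if any(s == TRAINJOB_FAILED for s in step_statuses):
        return TRAINJOB_FAILED
    elif any(s == TRAINJOB_RUNNING for s in step_statuses):
        return TRAINJOB_RUNNING
    elif any(s == TRAINJOB_CREATED for s in step_statuses):
        return TRAINJOB_CREATED
    else:
        # Reached only when no step is failed, running, or still created.
        # The bug was returning TRAINJOB_CREATED here.
        return TRAINJOB_COMPLETE
```

With the old return value in the else branch, a job whose steps had all finished would be indistinguishable from one that had never started.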
/ok-to-test
Signed-off-by: priyank <priyank8445@gmail.com>
@Fiona-Waters added a test case for TRAINJOB
Fiona-Waters left a comment
Thanks @priyank766 for adding the test. Just one other comment
Signed-off-by: priyank <priyank8445@gmail.com>
Fiona-Waters left a comment
Thanks @priyank766
/lgtm
Welcome @Fiona-Waters
Signed-off-by: priyank <priyank8445@gmail.com>
andreyvelich left a comment
Thanks @priyank766!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
What this PR does / why we need it:
LocalProcessBackend.__get_job_status() returns TRAINJOB_CREATED when all steps have finished, instead of TRAINJOB_COMPLETE. This causes wait_for_job_status() to always time out (600s) on the local backend even when jobs complete successfully. This is a one-line fix in the else branch to return the correct status.
Which issue(s) this PR fixes:
Fixes #338, #315
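To illustrate why a wrong status leads to the 600s timeout, here is a rough sketch of a generic polling loop. It is not the SDK's wait_for_job_status(): the function name `wait_for_status`, its parameters, and the injectable `sleep` and `clock` arguments are hypothetical and only show the general pattern.

```python
import time
from typing import Callable, Collection


def wait_for_status(
    get_status: Callable[[], str],
    wanted: Collection[str],
    timeout: float = 600.0,
    polling_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a wanted status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in wanted:
            return status
        if clock() >= deadline:
            # A getter that never reports a wanted status always ends up here.
            raise TimeoutError(f"status still {status!r} after {timeout}s")
        sleep(polling_interval)
```

If the status getter reports "created" for a finished job, a caller waiting for "complete" can only exit through the timeout branch, which matches the behavior described in this PR.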
Checklist: