
fix(trainer): return TRAINJOB_COMPLETE when all steps are done#340

Merged
google-oss-prow[bot] merged 4 commits into kubeflow:main from priyank766:fix/local-job-status-338
Mar 4, 2026

Conversation

@priyank766
Contributor

What this PR does / why we need it:
LocalProcessBackend.__get_job_status() returns TRAINJOB_CREATED when all steps have finished, instead of TRAINJOB_COMPLETE. This causes wait_for_job_status() to always time out (after 600s) on the local backend, even when jobs complete successfully. This is a one-line fix in the else branch to return the correct status.
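
A minimal sketch of the status aggregation described above, not the SDK's actual code. The constant values, the `get_job_status` helper, and the order of the checks are assumptions for illustration; only the final fallback mirrors the one-line change in this PR.

```python
# Hypothetical status constants; the real SDK defines its own values.
TRAINJOB_CREATED = "Created"
TRAINJOB_RUNNING = "Running"
TRAINJOB_COMPLETE = "Complete"
TRAINJOB_FAILED = "Failed"


def get_job_status(step_statuses: list[str]) -> str:
    """Derive a job-level status from per-step statuses (illustrative only)."""
    if not step_statuses:
        # No steps recorded yet, so the job has only been created.
        return TRAINJOB_CREATED
    if any(s == TRAINJOB_FAILED for s in step_statuses):
        return TRAINJOB_FAILED
    if any(s == TRAINJOB_RUNNING for s in step_statuses):
        return TRAINJOB_RUNNING
    if any(s == TRAINJOB_CREATED for s in step_statuses):
        return TRAINJOB_CREATED
    # Every step has finished. Before the fix, the equivalent branch
    # returned TRAINJOB_CREATED, so callers never saw a completed job.
    return TRAINJOB_COMPLETE
```

With this shape, a job whose steps are all complete reports TRAINJOB_COMPLETE, which is the status a waiting caller is looking for.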

Which issue(s) this PR fixes:

Fixes #338, #315

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings February 28, 2026 06:00
@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@priyank766 priyank766 changed the title fix(local): return TRAINJOB_COMPLETE when all steps are done (#338) fix(local): return TRAINJOB_COMPLETE when all steps are done Feb 28, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a one-line bug in LocalProcessBackend.__get_job_status() where the else branch (reached when all steps are in TRAINJOB_COMPLETE state) incorrectly returned TRAINJOB_CREATED instead of TRAINJOB_COMPLETE. This caused wait_for_job_status() to always time out (after 600 seconds) on the local backend, even for successfully completed jobs.

Changes:

  • Fix the else branch of __get_job_status to return TRAINJOB_COMPLETE instead of TRAINJOB_CREATED when all steps have finished successfully.
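
To see why the wrong return value surfaces as a timeout, here is a simplified polling loop. It is a sketch under assumptions: `wait_for_status`, its parameters, and the status strings are hypothetical stand-ins, not the SDK's `wait_for_job_status()` signature. Only the 600-second default is taken from the description above.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    expected: set[str],
    timeout: float = 600.0,
    polling_interval: float = 2.0,
) -> str:
    """Poll get_status() until it returns an expected status or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in expected:
            return status
        time.sleep(polling_interval)
    # If the backend never reports an expected status, the loop can only
    # end here, no matter how quickly the job actually finished.
    raise TimeoutError(f"status not in {sorted(expected)} after {timeout}s")
```

A backend that keeps reporting "Created" after every step has finished will never satisfy `expected={"Complete"}`, so a caller like this waits out the full timeout and raises.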

@priyank766 priyank766 changed the title fix(local): return TRAINJOB_COMPLETE when all steps are done fix(local): return TRAINJOB_COMPLETE when all steps are done Feb 28, 2026
@priyank766 priyank766 changed the title fix(local): return TRAINJOB_COMPLETE when all steps are done fix(trainer): return TRAINJOB_COMPLETE when all steps are done Feb 28, 2026
@priyank766
Contributor Author

priyank766 commented Feb 28, 2026

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@Fiona-Waters
Contributor

/ok-to-test

Signed-off-by: priyank <priyank8445@gmail.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/XS labels Mar 2, 2026
@priyank766
Contributor Author

priyank766 commented Mar 2, 2026

@Fiona-Waters I added a test case for TRAINJOB, and the E2E test failures are resolved.

@priyank766 priyank766 requested a review from Fiona-Waters March 2, 2026 17:56
Contributor

@Fiona-Waters Fiona-Waters left a comment

Thanks @priyank766 for adding the test. Just one other comment.

Signed-off-by: priyank <priyank8445@gmail.com>
@priyank766 priyank766 requested a review from Fiona-Waters March 3, 2026 17:43
Contributor

@Fiona-Waters Fiona-Waters left a comment

Thanks @priyank766
/lgtm

@priyank766
Contributor Author

You're welcome, @Fiona-Waters.
I also opened #313. It received /ok-to-test and all E2E tests ran successfully, but it is still waiting on a maintainer review.
Could you take a look if you have time?

@google-oss-prow google-oss-prow bot removed the lgtm label Mar 4, 2026
Signed-off-by: priyank <priyank8445@gmail.com>
Member

@andreyvelich andreyvelich left a comment

Thanks @priyank766!
/lgtm
/approve

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit ebbedf1 into kubeflow:main Mar 4, 2026
17 of 18 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.4 milestone Mar 4, 2026
@priyank766 priyank766 deleted the fix/local-job-status-338 branch March 4, 2026 17:41
Development

Successfully merging this pull request may close these issues.

LocalProcessBackend.__get_job_status never returns TRAINJOB_COMPLETE

4 participants