Skip to content

Fix type error for suspended kubeflow jobs#7241

Open
strigazi wants to merge 1 commit intoflyteorg:masterfrom
strigazi:fix/suspended-kfjobs
Open

Fix type error for suspended kubeflow jobs#7241
strigazi wants to merge 1 commit intoflyteorg:masterfrom
strigazi:fix/suspended-kfjobs

Conversation

@strigazi
Copy link
Copy Markdown

When kubeflow jobs are suspended by an external system (eg kueue), the controller error's with the following [0].

https://github.com/flyteorg/flyte/pull/6295/changes relies only on app.Spec.RunPolicy.Suspend and does not take into account the phase.

[0] failed Execute for node. Error: failed at Node[dn1]. RuntimeExecutionError: failed during plugin execution, caused by: Invalid transition for plugin [pytorch]: transition doesn't have task info nor an execution error filled [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>]

related: #6294 #6295

Observed in versions: v1.16.3, v1.16.4, v1.16.5

@strigazi
Copy link
Copy Markdown
Author

Hello @fg91, following your PR #6295 to make flyte work with suspended kubeflow jobs, can you have a quick look in this PR? Does the direction of this patch look good to you?

pingsutw
pingsutw previously approved these changes Apr 20, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.95%. Comparing base (e938b03) to head (e914edb).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7241      +/-   ##
==========================================
- Coverage   56.96%   56.95%   -0.01%     
==========================================
  Files         931      931              
  Lines       58246    58250       +4     
==========================================
+ Hits        33178    33179       +1     
- Misses      22014    22017       +3     
  Partials     3054     3054              
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.14% <ø> (ø)
unittests-flytecopilot 43.06% <ø> (ø)
unittests-flytectl 64.09% <ø> (-0.05%) ⬇️
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.19% <100.00%> (+0.01%) ⬆️
unittests-flytepropeller 53.71% <ø> (ø)
unittests-flytestdlib 62.61% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@fg91 fg91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it doesn't make sense that a kubeflow job that is .JobSuspended (meaning not yet running) is treated as pluginsCore.PhaseInfoRunning. The ray plugin treats it as "phase info queued", see here which makes more sense to me. Let's return pluginsCore.PhaseInfoQueued(... instead.

@fg91 fg91 dismissed pingsutw’s stale review April 20, 2026 19:37

@pingsutw I think this is currently not correct :)

@pingsutw
Copy link
Copy Markdown
Member

but kubeflow may also suspend a running job too, (running -> suspended)

@fg91
Copy link
Copy Markdown
Member

fg91 commented Apr 20, 2026

but kubeflow may also suspend a running job too, (running -> suspended)

Do you know in which situation kubeflow does this? Haven't seen this yet, only by external systems like kueue via an admission webhook right after creation of the job object.

@strigazi
Copy link
Copy Markdown
Author

I think that pluginsCore.PhaseInfoQueued makes more sense as well. Kueue suspends workloads to queue them for later. Either right after creation or after already running.

When kubeflow jobs are suspended by an external system (eg kueue),
the controller error's with the following [0].

https://github.com/flyteorg/flyte/pull/6295/changes relies only on
app.Spec.RunPolicy.Suspend and does not take into account the phase.

[0] RuntimeExecutionError: failed during plugin execution, caused by:
    Invalid transition for plugin [pytorch]: transition doesn't have task
    info nor an execution error filled
    [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>]

Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com>
@strigazi strigazi force-pushed the fix/suspended-kfjobs branch from e914edb to fe3c9ad Compare April 21, 2026 08:20
@strigazi strigazi requested a review from fg91 April 21, 2026 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants