Fix type error for suspended kubeflow jobs#7241
Fix type error for suspended kubeflow jobs#7241strigazi wants to merge 1 commit intoflyteorg:masterfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7241 +/- ##
==========================================
- Coverage 56.96% 56.95% -0.01%
==========================================
Files 931 931
Lines 58246 58250 +4
==========================================
+ Hits 33178 33179 +1
- Misses 22014 22017 +3
Partials 3054 3054
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
To me it doesn't make sense that a kubeflow job that is .JobSuspended (meaning not yet running) is treated as pluginsCore.PhaseInfoRunning. The ray plugin treats it as "phase info queued", see here which makes more sense to me. Let's return pluginsCore.PhaseInfoQueued(... instead.
@pingsutw I think this is currently not correct :)
|
but kubeflow may also suspend a running job too, (running -> suspended) |
Do you know in which situation kubeflow does this? Haven't seen this yet, only by external systems like kueue via an admission webhook right after creation of the job object. |
|
I think that pluginsCore.PhaseInfoQueued makes more sense as well. Kueue suspends workloads to queue them for later. Either right after creation or after already running. |
When kubeflow jobs are suspended by an external system (eg kueue), the controller error's with the following [0]. https://github.com/flyteorg/flyte/pull/6295/changes relies only on app.Spec.RunPolicy.Suspend and does not take into account the phase. [0] RuntimeExecutionError: failed during plugin execution, caused by: Invalid transition for plugin [pytorch]: transition doesn't have task info nor an execution error filled [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>] Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com>
e914edb to
fe3c9ad
Compare
When kubeflow jobs are suspended by an external system (eg kueue), the controller error's with the following [0].
https://github.com/flyteorg/flyte/pull/6295/changes relies only on app.Spec.RunPolicy.Suspend and does not take into account the phase.
[0]
failed Execute for node. Error: failed at Node[dn1]. RuntimeExecutionError: failed during plugin execution, caused by: Invalid transition for plugin [pytorch]: transition doesn't have task info nor an execution error filled [TransitionTypeEphemeral,Phase<PhaseUndefined:0 <nil> Reason:>]related: #6294 #6295
Observed in versions: v1.16.3, v1.16.4, v1.16.5