Python SDK: worker exception causes silent stuck RUNNING workflow — execute() returns RUNNING with no error #41

@nthmost-orkes

Description

Summary

When a worker raises an exception, executor.execute() returns a workflow run with status: RUNNING and output: {} — with no indication that anything went wrong. The actual error is only visible in background logs. A first-time user sees the workflow silently hang and has no idea what failed.

Steps to reproduce

from conductor.client.automator.task_handler import TaskHandler
from conductor.client.configuration.configuration import Configuration
from conductor.client.orkes_clients import OrkesClients
from conductor.client.workflow.conductor_workflow import ConductorWorkflow
from conductor.client.worker.worker_task import worker_task

@worker_task(task_definition_name='bad_worker', register_task_def=True)
def bad_worker(name: str) -> str:
    raise ValueError("intentional failure")

config = Configuration()
clients = OrkesClients(configuration=config)
executor = clients.get_workflow_executor()

workflow = ConductorWorkflow(name='fail_test', version=1, executor=executor)
t = bad_worker(task_ref_name='t', name=workflow.input('name'))
workflow >> t
workflow.register(overwrite=True)

with TaskHandler(configuration=config, scan_for_annotated_workers=True) as th:
    th.start_processes()
    run = executor.execute(name='fail_test', version=1, workflow_input={'name': 'x'})
    print('Status:', run.status)  # prints: RUNNING
    print('Output:', run.output)  # prints: {}

What the user sees

Status: RUNNING
Output: {}

The actual error traceback (ValueError: intentional failure) is logged to stderr from a background worker process, but:

  • run.status is RUNNING not FAILED
  • run.output is empty {}
  • run.reason_for_incompletion returns None (and is deprecated with a warning)

Why this happens

executor.execute() defaults to wait_for_seconds=10. When a worker fails, Conductor schedules retries (default: 3 retries with 60-second delays). The workflow is genuinely still RUNNING (waiting for retry) when execute() times out after 10 seconds. The workflow won't fail for at least 3 × 60 = 180 seconds.
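As a workaround under the current behavior, a caller can keep polling until the workflow reaches a terminal state instead of trusting the status returned by the initial execute() call. A minimal sketch; fetch_status here is a stand-in for whatever status query the client exposes (e.g. wrapping a get-workflow call), and the function name is illustrative, not part of the SDK:

```python
import time

# Conductor terminal workflow states
TERMINAL_STATES = {'COMPLETED', 'FAILED', 'TERMINATED', 'TIMED_OUT'}

def wait_for_terminal(fetch_status, timeout_seconds=300, poll_interval=5):
    """Poll fetch_status() until it reports a terminal workflow state.

    fetch_status: zero-argument callable returning the current status
    string; in practice it would wrap a workflow-status query against
    the server. Returns the final status, or the last observed status
    if timeout_seconds elapses first.
    """
    deadline = time.monotonic() + timeout_seconds
    status = fetch_status()
    while status not in TERMINAL_STATES and time.monotonic() < deadline:
        time.sleep(poll_interval)
        status = fetch_status()
    return status
```

Note that with the default retry policy described above (3 retries at 60-second delays), the timeout has to comfortably exceed 180 seconds to ever observe the eventual FAILED state.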

Impact on first-time users

This is a silent failure mode. New users:

  1. Write a worker that has a bug
  2. Run their app and see Status: RUNNING, Output: {}
  3. Have no idea why the workflow isn't completing
  4. Must know to check the background INFO/ERROR logs emitted by a separate worker process

Expected behavior

At minimum, the SDK should provide a clear path to surface the failure reason. Possible improvements:

  • If the workflow status is RUNNING after the timeout, check for failed tasks and surface the failure reason in the exception or returned object
  • Document wait_for_seconds prominently and suggest increasing it for debugging
  • Suppress or fix the deprecation warning on reason_for_incompletion and ensure it returns useful info
  • Add a run.failed_tasks or similar accessor
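A helper along the lines of the last two suggestions could look like the sketch below. The run/task attribute names (tasks, reference_task_name, status, reason_for_incompletion) are assumptions about the workflow-run shape, used here for illustration rather than taken from the SDK's actual API:

```python
class WorkflowStuckError(RuntimeError):
    """Raised when a run is still RUNNING but already contains failed tasks."""

def surface_failed_tasks(run):
    """Return (task_ref, reason) pairs for failed tasks in a workflow run.

    `run` is assumed to expose `.status` and a `.tasks` list whose items
    have `.reference_task_name`, `.status`, and `.reason_for_incompletion`
    attributes (illustrative names). If the run is still RUNNING but has
    failed tasks, raise instead of returning, so the error is not silent.
    """
    failed = [
        (t.reference_task_name, t.reason_for_incompletion)
        for t in (run.tasks or [])
        if t.status in ('FAILED', 'FAILED_WITH_TERMINAL_ERROR')
    ]
    if run.status == 'RUNNING' and failed:
        details = '; '.join(f'{ref}: {reason}' for ref, reason in failed)
        raise WorkflowStuckError(
            f'workflow still RUNNING but has failed tasks: {details}')
    return failed
```

With something like this built in (or documented as a recipe), the repro above would raise a descriptive error mentioning "ValueError: intentional failure" instead of printing RUNNING with an empty output.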

Environment

  • Python 3.14, conductor-python 1.3.8, Conductor OSS server
  • Tested on macOS 15


Labels

  • area: sdk (Any language SDK)
  • bug (Something isn't working)
  • critical (Seriously impacts zero-to-one onboarding)
  • fix: code (Fix requires a code change in a repo)
  • sdk: python
