[Bug] Workflow remains READY_STOP when task kill status check misclassifies child processes #18311

@zhuxiangyi

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When stopping a running workflow instance from the Web UI, the workflow can remain in READY_STOP and the task instance can remain RUNNING_EXECUTION even though the worker logs report that the process tree was killed successfully.

In the observed Web UI screenshot, the workflow instance row showed the stop status tooltip as "Preparing Stop" after clicking stop, and it stayed in that state.

The database state was:

select id, name, state, host, start_time, end_time, update_time, state_history
from t_ds_workflow_instance
where id = 94;

id: 94
name: test111-20260601075309950
state: 4
host: 10.57.0.80:5678
start_time: 2026-06-01 07:53:10
end_time: NULL
update_time: 2026-06-01 15:54:12
state_history: SUBMITTED_SUCCESS -> RUNNING_EXECUTION

The running task instance was:

select id, name, state, host, start_time, end_time, log_path
from t_ds_task_instance
where workflow_instance_id = 94
order by id;

id: 136
name: Test
state: 1
host: 10.57.0.209:1234
start_time: 2026-06-01 07:53:10
end_time: NULL
log_path: /data/soft/apache-dolphinscheduler-3.4.1-bin/worker-server/logs/20260601/174970851976672/1/94/136.log

In the affected case, the worker process tree was similar to:

sudo -u airflow -i /tmp/dolphinscheduler/exec/process/136/136.sh
  bash --login -c /tmp/dolphinscheduler/exec/process/136/136.sh
    /bin/bash /tmp/dolphinscheduler/exec/process/136/136.sh
      python3 success_file_check.py ...

More specifically, ps still showed the processes alive after the stop operation:

PID    PPID  PGID   SID   STAT  ELAPSED  CMD
30448   424   372 30429   S     33:21    sudo -u airflow -i /tmp/dolphinscheduler/exec/process/136/136.sh
30451 30448   372 30429   S     33:21    -bash --login -c /tmp/dolphinscheduler/exec/process/136/136.sh
30481 30451   372 30429   S     33:21    /bin/bash /tmp/dolphinscheduler/exec/process/136/136.sh
30514 30481   372 30429   S     33:21    python3 /home/airflow/codes/common/utils/success_file_check.py ...

The master published the stop event and task kill event:

WorkflowStopLifecycleEvent{workflow=test111-20260601075309950}
Success set WorkflowExecuteRunnable state from RUNNING_EXECUTION to READY_STOP
Publish event: TaskKillLifecycleEvent{task=Test, delayTime=0}
Kill task Test on executor 10.57.0.209:1234 successfully

The worker logged:

Begin killing task instance, processId: 30448
prepare to parse pid, raw pid string: sudo(30448)---bash(30451)---136.sh(30481)---python3(30514)
All processes already terminated.
Successfully killed process tree by SIGINT, processId: 30448
Process tree for task: 136 is killed or already finished, pid: 30448

But ps still showed the task process tree alive afterwards.
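
The raw pid string in the worker log above has the shape name(pid)---name(pid). As a minimal sketch of how such a string can be turned into the list of PIDs to check, assuming that shape always holds, the following helper extracts every parenthesised number. PidTreeParser and parsePids are hypothetical names for illustration, not DolphinScheduler's actual parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the project's real implementation. */
final class PidTreeParser {
    // Matches the numeric pid inside each "name(pid)" segment.
    private static final Pattern PID_IN_PARENS = Pattern.compile("\\((\\d+)\\)");

    /** Returns the pids in the order they appear, root of the tree first. */
    static List<Long> parsePids(String rawPidString) {
        List<Long> pids = new ArrayList<>();
        Matcher matcher = PID_IN_PARENS.matcher(rawPidString);
        while (matcher.find()) {
            pids.add(Long.parseLong(matcher.group(1)));
        }
        return pids;
    }

    public static void main(String[] args) {
        String raw = "sudo(30448)---bash(30451)---136.sh(30481)---python3(30514)";
        // prints [30448, 30451, 30481, 30514]
        System.out.println(parsePids(raw));
    }
}
```

All four PIDs from the ps output are recovered, so parsing is not where the tree is lost; the misclassification happens in the liveness check applied to each PID afterwards.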

The tenant process check also showed that the root-owned sudo process existed but was not signalable by the tenant user:

sudo -u airflow -i kill -0 30448; echo $?
/etc/profile.d/history.sh: line 1: HISTSIZE: readonly variable
/etc/profile.d/history.sh: line 2: HISTTIMEFORMAT: readonly variable
-bash: line 0: kill: (30448) - Operation not permitted
1

What you expected to happen

The worker should not report kill success while killable child processes are still alive. It should continue to send the kill signal to child processes that the tenant user can operate on, even if the root-owned sudo parent returns Operation not permitted for kill -0.
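
The expected behaviour can be sketched as two small decisions: which PIDs still deserve a kill signal, and when it is safe to report the tree as terminated. This is only an illustration under stated assumptions. KillPlanner, PidState, pidsToSignal, and treeTerminated are hypothetical names, and the per-PID states in main are assumed for the example rather than taken from a real probe.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the expected kill behaviour; not DolphinScheduler code. */
final class KillPlanner {
    enum PidState { SIGNALABLE, NOT_PERMITTED, GONE }

    /** Only pids the current user can signal are worth sending the kill signal to. */
    static List<Long> pidsToSignal(Map<Long, PidState> states) {
        List<Long> targets = new ArrayList<>();
        for (Map.Entry<Long, PidState> entry : states.entrySet()) {
            if (entry.getValue() == PidState.SIGNALABLE) {
                targets.add(entry.getKey());
            }
        }
        return targets;
    }

    /** The tree counts as terminated only when every pid is gone. */
    static boolean treeTerminated(Map<Long, PidState> states) {
        for (PidState state : states.values()) {
            if (state != PidState.GONE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed states for the tree in this report: root-owned sudo parent,
        // tenant-owned children.
        Map<Long, PidState> states = new LinkedHashMap<>();
        states.put(30448L, PidState.NOT_PERMITTED);
        states.put(30451L, PidState.SIGNALABLE);
        states.put(30481L, PidState.SIGNALABLE);
        states.put(30514L, PidState.SIGNALABLE);
        System.out.println(pidsToSignal(states));
        System.out.println(treeTerminated(states));
    }
}
```

With these assumed states, the three child PIDs are still selected for signalling and the tree is not reported as terminated, which is the opposite of the "All processes already terminated." outcome in the worker log.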

How to reproduce

  1. Run a shell task as a tenant through sudo -u <tenant> -i.
  2. Let the shell script start a long-running child process, for example a Python script that sleeps or polls in a loop.
  3. Configure the tenant login shell/profile so that sudo -u <tenant> -i kill -0 <pid> writes to stderr even when the exit code is 0, for example readonly variable warnings from /etc/profile.d.
  4. Stop the workflow instance from the Web UI.
  5. Observe that the worker can mark the process tree as already terminated while child processes remain alive, and the workflow instance can stay in READY_STOP / "Preparing Stop".

Anything else

Root cause:

  • ProcessUtils.isProcessAlive treats every exception from OSUtils.exeCmd("kill -0 <pid>") as "not alive".
  • AbstractShell throws ExitCodeException when stderr is non-empty, even if the command exit code is 0.
  • kill -0 returning Operation not permitted means the process exists but cannot be signalled by the current user, not that the process has exited.

The fix should distinguish alive, not-alive, and no-permission states when checking PIDs before sending kill signals.
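
One possible shape for that three-way distinction is sketched below. It classifies a kill -0 result from its exit code and stderr instead of treating any exception as "not alive". KillProbe, ProcessState, and classify are hypothetical names, and matching on the English "Operation not permitted" text is an assumption that depends on the shell and locale; this is not the actual patch.

```java
import java.util.Locale;

/** Hypothetical classifier for the result of "kill -0 <pid>"; a sketch, not the real fix. */
final class KillProbe {
    enum ProcessState { ALIVE, NOT_ALIVE, NO_PERMISSION }

    /**
     * Exit code 0 means the process exists and is signalable, even if the login
     * profile wrote unrelated warnings to stderr. A non-zero exit code with an
     * EPERM message means the process exists but cannot be signalled by this user.
     */
    static ProcessState classify(int exitCode, String stderr) {
        if (exitCode == 0) {
            return ProcessState.ALIVE;
        }
        String err = stderr == null ? "" : stderr.toLowerCase(Locale.ROOT);
        // Assumption: the shell reports EPERM with this English message.
        if (err.contains("operation not permitted")) {
            return ProcessState.NO_PERMISSION;
        }
        return ProcessState.NOT_ALIVE;
    }

    public static void main(String[] args) {
        // Profile warnings on stderr with exit code 0, as in the reproduction steps.
        System.out.println(classify(0, "/etc/profile.d/history.sh: line 1: HISTSIZE: readonly variable"));
        // The root-owned sudo parent from this report.
        System.out.println(classify(1, "-bash: line 0: kill: (30448) - Operation not permitted"));
    }
}
```

Checking the exit code before looking at stderr is the key design choice here: it keeps profile noise from turning a live process into a dead one, while the EPERM branch keeps a non-signalable parent from being counted as exited.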

Version

3.4.1 and current dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
