Search before asking
What happened
When stopping a running workflow instance from the Web UI, the workflow can remain in READY_STOP and the task instance can remain RUNNING_EXECUTION even though the worker logs report that the process tree was killed successfully.
In the observed Web UI screenshot, the workflow instance row showed the stop status tooltip as "Preparing Stop" after clicking stop, and it stayed in that state.
The database state was:
select id, name, state, host, start_time, end_time, update_time, state_history
from t_ds_workflow_instance
where id = 94;
id: 94
name: test111-20260601075309950
state: 4
host: 10.57.0.80:5678
start_time: 2026-06-01 07:53:10
end_time: NULL
update_time: 2026-06-01 15:54:12
state_history: SUBMITTED_SUCCESS -> RUNNING_EXECUTION
The running task instance was:
select id, name, state, host, start_time, end_time, log_path
from t_ds_task_instance
where workflow_instance_id = 94
order by id;
id: 136
name: Test
state: 1
host: 10.57.0.209:1234
start_time: 2026-06-01 07:53:10
end_time: NULL
log_path: /data/soft/apache-dolphinscheduler-3.4.1-bin/worker-server/logs/20260601/174970851976672/1/94/136.log
In the affected case, the worker process tree was similar to:
sudo -u airflow -i /tmp/dolphinscheduler/exec/process/136/136.sh
bash --login -c /tmp/dolphinscheduler/exec/process/136/136.sh
/bin/bash /tmp/dolphinscheduler/exec/process/136/136.sh
python3 success_file_check.py ...
More specifically, ps still showed the processes alive after the stop operation:
PID PPID PGID SID STAT ELAPSED CMD
30448 424 372 30429 S 33:21 sudo -u airflow -i /tmp/dolphinscheduler/exec/process/136/136.sh
30451 30448 372 30429 S 33:21 -bash --login -c /tmp/dolphinscheduler/exec/process/136/136.sh
30481 30451 372 30429 S 33:21 /bin/bash /tmp/dolphinscheduler/exec/process/136/136.sh
30514 30481 372 30429 S 33:21 python3 /home/airflow/codes/common/utils/success_file_check.py ...
The master published the stop event and task kill event:
WorkflowStopLifecycleEvent{workflow=test111-20260601075309950}
Success set WorkflowExecuteRunnable state from RUNNING_EXECUTION to READY_STOP
Publish event: TaskKillLifecycleEvent{task=Test, delayTime=0}
Kill task Test on executor 10.57.0.209:1234 successfully
The worker logged:
Begin killing task instance, processId: 30448
prepare to parse pid, raw pid string: sudo(30448)---bash(30451)---136.sh(30481)---python3(30514)
All processes already terminated.
Successfully killed process tree by SIGINT, processId: 30448
Process tree for task: 136 is killed or already finished, pid: 30448
But ps still showed the task process tree alive afterwards.
The tenant process check also showed that the root-owned sudo process existed but was not signalable by the tenant user:
sudo -u airflow -i kill -0 30448; echo $?
/etc/profile.d/history.sh: line 1: HISTSIZE: readonly variable
/etc/profile.d/history.sh: line 2: HISTTIMEFORMAT: readonly variable
-bash: line 0: kill: (30448) - Operation not permitted
1
What you expected to happen
The worker should not report kill success while killable child processes are still alive. It should continue to send the kill signal to child processes that the tenant user can operate on, even if the root-owned sudo parent returns Operation not permitted for kill -0.
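The expected behaviour can be sketched roughly as below. This is only an illustration, not DolphinScheduler code: `KillPlan`, `Liveness`, `pidsToSignal` and `canReportSuccess` are hypothetical names, and the sketch assumes each PID in the parsed process tree has already been probed and given one of three states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of DolphinScheduler.
final class KillPlan {

    // Assumed three-way probe result for a single PID.
    enum Liveness { ALIVE, NOT_ALIVE, NO_PERMISSION }

    // PIDs that should still receive the kill signal: only those probed as ALIVE.
    // A NO_PERMISSION parent (for example the root-owned sudo) is skipped,
    // but it does not stop its children from being signalled.
    static List<Integer> pidsToSignal(Map<Integer, Liveness> probed) {
        List<Integer> targets = new ArrayList<>();
        for (Map.Entry<Integer, Liveness> entry : probed.entrySet()) {
            if (entry.getValue() == Liveness.ALIVE) {
                targets.add(entry.getKey());
            }
        }
        return targets;
    }

    // Success may be reported only when no killable process is still alive.
    static boolean canReportSuccess(Map<Integer, Liveness> probed) {
        return pidsToSignal(probed).isEmpty();
    }
}
```

With the process tree from this report, 30448 would be NO_PERMISSION and 30451, 30481 and 30514 would be ALIVE, so the three children would still be signalled and success would not be reported yet.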
How to reproduce
- Run a shell task as a tenant through sudo -u <tenant> -i.
- Let the shell script start a long-running child process, for example a Python script that sleeps or polls in a loop.
- Configure the tenant login shell/profile so sudo -u <tenant> -i kill -0 <pid> can emit stderr even with exit code 0, for example readonly variable warnings from /etc/profile.d.
- Stop the workflow instance from the Web UI.
- Observe that the worker can mark the process tree as already terminated while child processes remain alive, and the workflow instance can stay in READY_STOP / "Preparing Stop".
Anything else
Root cause:
ProcessUtils.isProcessAlive treats every exception from OSUtils.exeCmd("kill -0 <pid>") as "not alive".
AbstractShell throws ExitCodeException when stderr is non-empty, even if the command exit code is 0.
kill -0 returning Operation not permitted means the process exists but cannot be signalled by the current user, not that the process has exited.
The fix should distinguish alive, not-alive, and no-permission states when checking PIDs before sending kill signals.
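A minimal sketch of such a three-way check is shown below. It is not the actual ProcessUtils or OSUtils API: `PidProbe`, `Liveness` and `classify` are hypothetical names. The sketch assumes the caller can get both the exit code and the stderr text of `kill -0 <pid>`, and that the shell prints the usual English messages ("Operation not permitted", "No such process"), which may differ under other locales.

```java
import java.util.Locale;

// Hypothetical helper, not part of DolphinScheduler.
final class PidProbe {

    enum Liveness { ALIVE, NOT_ALIVE, NO_PERMISSION }

    // Classify the outcome of `kill -0 <pid>` from its exit code and stderr.
    // Stderr noise alone (for example readonly variable warnings from
    // /etc/profile.d) never means the process has exited.
    static Liveness classify(int exitCode, String stderr) {
        String err = stderr == null ? "" : stderr.toLowerCase(Locale.ROOT);
        if (exitCode == 0) {
            // The signal check succeeded, so the process exists and is signalable.
            return Liveness.ALIVE;
        }
        if (err.contains("operation not permitted")) {
            // The process exists but the current user may not signal it.
            return Liveness.NO_PERMISSION;
        }
        if (err.contains("no such process")) {
            return Liveness.NOT_ALIVE;
        }
        // Assumption: an unrecognized failure is treated as still alive,
        // so the worker does not report a successful kill prematurely.
        return Liveness.ALIVE;
    }
}
```

For the tenant check shown in this report (exit code 1, stderr containing the readonly variable warnings plus "Operation not permitted"), this sketch would return NO_PERMISSION rather than "not alive".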
Version
3.4.1 and current dev
Are you willing to submit PR?