fix(computer): enforce shell timeout on Windows by killing the process tree#8820
fix(computer): enforce shell timeout on Windows by killing the process tree#8820he-yufeng wants to merge 1 commit into
Conversation
…s tree LocalShellComponent.exec ran commands through subprocess.run with capture_output and a timeout. On Windows the shell process keeps running while the command it spawned holds the captured stdout/stderr pipes open, so on timeout subprocess.run kills only the shell and then blocks in communicate() until the child exits. The configured timeout is effectively ignored: a 2s timeout on a 30s command still takes ~30s to return. Switch to Popen + communicate(timeout) and, on TimeoutExpired, kill the whole process tree via taskkill /T before draining and re-raising the same exception. POSIX keeps the existing single-process kill, so its behavior is unchanged. The re-raised TimeoutExpired message is identical, so the upper-layer error string stays the same.
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In the timeout handling path, the post-kill
proc.communicate()/proc.wait()calls have no timeout and could hang iftaskkill/killfails or a zombie process sticks around; consider adding a short, bounded timeout (and logging/fallback) to this cleanup phase to avoid blocking a worker thread indefinitely. - The integration-style timeout test currently uses a 30s sleep, which makes failures slow and slightly increases normal runtime; you could shorten the child sleep (e.g., 5–10s) while keeping the shell timeout at 2s to preserve the behavior under test but make the suite faster and less painful when it regresses.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the timeout handling path, the post-kill `proc.communicate()`/`proc.wait()` calls have no timeout and could hang if `taskkill`/`kill` fails or a zombie process sticks around; consider adding a short, bounded timeout (and logging/fallback) to this cleanup phase to avoid blocking a worker thread indefinitely.
- The integration-style timeout test currently uses a 30s sleep, which makes failures slow and slightly increases normal runtime; you could shorten the child sleep (e.g., 5–10s) while keeping the shell timeout at 2s to preserve the behavior under test but make the suite faster and less painful when it regresses.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Code Review
This pull request replaces subprocess.run with subprocess.Popen in the local booter to better handle timeouts on Windows by introducing a _kill_process_tree helper that forcefully terminates a process and all its children. The tests have been updated accordingly to mock Popen and verify timeout enforcement. The review feedback highlights two important robustness improvements: wrapping proc.kill() in a try-except block to safely handle potential OSError exceptions, and replacing the blocking proc.communicate() call in the Windows timeout handler with a timed proc.wait() to prevent indefinite hangs if child processes fail to terminate.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if os.name == "nt": | ||
| proc.communicate() | ||
| else: | ||
| proc.wait() |
There was a problem hiding this comment.
Calling proc.communicate() without a timeout on Windows inside the TimeoutExpired exception handler can block indefinitely if _kill_process_tree fails to terminate all child processes (e.g., if taskkill is unavailable and we fall back to proc.kill(), which leaves children alive).
In CPython's standard library subprocess.run implementation, Windows does not call communicate() after a timeout precisely because it cannot reliably wait for the process to exit and close its pipes without blocking. Instead, it only calls proc.wait(). We should align with this behavior and call proc.wait() on Windows as well, or at least provide a short timeout to proc.communicate() to prevent an infinite hang.
_kill_process_tree(proc)
try:
proc.wait(timeout=5)
except subprocess.TimeoutExpired:
pass
raise| if os.name == "nt": | ||
| try: | ||
| subprocess.run( | ||
| ["taskkill", "/F", "/T", "/PID", str(proc.pid)], | ||
| stdout=subprocess.DEVNULL, | ||
| stderr=subprocess.DEVNULL, | ||
| check=False, | ||
| ) | ||
| return | ||
| except OSError: | ||
| # taskkill missing/unavailable: fall back to a single-process kill. | ||
| pass | ||
| proc.kill() |
There was a problem hiding this comment.
In _kill_process_tree, if os.name == "nt" is true, we attempt to run taskkill. If taskkill is missing or fails, we catch OSError and fall back to proc.kill(). However, proc.kill() itself can raise an OSError (such as PermissionError or ProcessLookupError) if the process has already exited or if access is denied. It is safer to wrap proc.kill() in a try...except OSError: block to prevent unexpected crashes during cleanup.
if os.name == "nt":
try:
subprocess.run(
["taskkill", "/F", "/T", "/PID", str(proc.pid)],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=False,
)
return
except OSError:
# taskkill missing/unavailable: fall back to a single-process kill.
pass
try:
proc.kill()
except OSError:
pass
On Windows, the local computer-use shell tool ignores its
timeout. A command giventimeout=2that spawns a child sleeping for 30s still blocks for ~30s before returning.Root cause is in
LocalShellComponent.exec. It runs the command viasubprocess.run(..., shell=True, capture_output=True, timeout=...). Withshell=Truethe command runs as a child ofcmd.exe. When the timeout fires,subprocess.runcallsPopen.kill(), which terminates onlycmd.exe. The real worker keeps running and keeps the captured stdout/stderr pipes open, so the follow-upcommunicate()blocks until that child finally exits. The timeout is effectively a no-op, and sinceexecruns in a worker thread (asyncio.to_thread), the thread stays tied up the whole time.This does not bite on POSIX, where the shell typically
execs a single command in place so there is no surviving child to leak the pipe.Modifications / 改动点
astrbot/core/computer/booters/local.py:Non-background branch now uses
subprocess.Popen(...) as proc+proc.communicate(timeout=...).On
TimeoutExpired, kill the whole process tree, then drain/reap and re-raise. New helper_kill_process_treerunstaskkill /F /T /PID <pid>on Windows (tears down the tree) and falls back toproc.kill()on POSIX, so POSIX behavior is unchanged. The drain mirrors CPython's ownsubprocess.run(communicate()on Windows,wait()on POSIX).The same
TimeoutExpiredis re-raised, so the upper-layer error string is unchanged.This is NOT a breaking change. / 这不是一个破坏性变更。
Verification Steps / Test Results
Reproduced on current
master(Windows 11, Python 3.13) with a real subprocess: a 30s sleep undertimeout=2returned after 30.1s.Added
tests/test_local_shell_component.py::test_local_shell_component_timeout_returns_promptly(sleeps 30s,timeout=2, asserts it returns in <15s):FAILED ... AssertionError: shell timeout was not enforced (took 30.1s)Command '...' timed out after 2 secondsmessage.The 4 existing decode tests mocked
subprocess.run; since the code now usesPopen, they were switched to a small_FakePopencontext manager exposingcommunicate()/returncode.ruff format --checkandruff checkpass on the changed source file.Checklist / 检查清单
Summary by Sourcery
Ensure local shell commands respect execution timeouts by correctly handling process termination across platforms.
Bug Fixes:
Tests: