Skip to content

fix(computer): enforce shell timeout on Windows by killing the process tree#8820

Open
he-yufeng wants to merge 1 commit into
AstrBotDevs:masterfrom
he-yufeng:fix/local-shell-timeout-process-tree
Open

fix(computer): enforce shell timeout on Windows by killing the process tree#8820
he-yufeng wants to merge 1 commit into
AstrBotDevs:masterfrom
he-yufeng:fix/local-shell-timeout-process-tree

Conversation

@he-yufeng

@he-yufeng he-yufeng commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

On Windows, the local computer-use shell tool ignores its timeout. A command given timeout=2 that spawns a child sleeping for 30s still blocks for ~30s before returning.

Root cause is in LocalShellComponent.exec. It runs the command via subprocess.run(..., shell=True, capture_output=True, timeout=...). With shell=True the command runs as a child of cmd.exe. When the timeout fires, subprocess.run calls Popen.kill(), which terminates only cmd.exe. The real worker keeps running and keeps the captured stdout/stderr pipes open, so the follow-up communicate() blocks until that child finally exits. The timeout is effectively a no-op, and since exec runs in a worker thread (asyncio.to_thread), the thread stays tied up the whole time.

This does not bite on POSIX, where the shell typically execs a single command in place so there is no surviving child to leak the pipe.

Modifications / 改动点

astrbot/core/computer/booters/local.py:

  • Non-background branch now uses subprocess.Popen(...) as proc + proc.communicate(timeout=...).

  • On TimeoutExpired, kill the whole process tree, then drain/reap and re-raise. New helper _kill_process_tree runs taskkill /F /T /PID <pid> on Windows (tears down the tree) and falls back to proc.kill() on POSIX, so POSIX behavior is unchanged. The drain mirrors CPython's own subprocess.run (communicate() on Windows, wait() on POSIX).

  • The same TimeoutExpired is re-raised, so the upper-layer error string is unchanged.

  • This is NOT a breaking change. / 这不是一个破坏性变更。

Verification Steps / Test Results

Reproduced on current master (Windows 11, Python 3.13) with a real subprocess: a 30s sleep under timeout=2 returned after 30.1s.

Added tests/test_local_shell_component.py::test_local_shell_component_timeout_returns_promptly (sleeps 30s, timeout=2, asserts it returns in <15s):

  • Before the fix: FAILED ... AssertionError: shell timeout was not enforced (took 30.1s)
  • After the fix: same command returns in ~2.5s with the identical Command '...' timed out after 2 seconds message.

The 4 existing decode tests mocked subprocess.run; since the code now uses Popen, they were switched to a small _FakePopen context manager exposing communicate()/returncode.

$ python -m pytest tests/test_local_shell_component.py -q
5 passed in ~6.8s

ruff format --check and ruff check pass on the changed source file.

Checklist / 检查清单

  • My changes have been well-tested, and verification steps/results are provided above.
  • No new dependencies are introduced.
  • My changes do not introduce malicious code.

Summary by Sourcery

Ensure local shell commands respect execution timeouts by correctly handling process termination across platforms.

Bug Fixes:

  • Fix Windows shell executions ignoring timeouts by killing the entire process tree when a timeout occurs.

Tests:

  • Add an integration-style timeout test for LocalShellComponent to verify that long-running shell commands return promptly on timeout.
  • Update LocalShellComponent decoding tests to use a fake Popen context manager instead of mocking subprocess.run.

…s tree

LocalShellComponent.exec ran commands through subprocess.run with
capture_output and a timeout. On Windows the shell process keeps running
while the command it spawned holds the captured stdout/stderr pipes open,
so on timeout subprocess.run kills only the shell and then blocks in
communicate() until the child exits. The configured timeout is effectively
ignored: a 2s timeout on a 30s command still takes ~30s to return.

Switch to Popen + communicate(timeout) and, on TimeoutExpired, kill the
whole process tree via taskkill /T before draining and re-raising the same
exception. POSIX keeps the existing single-process kill, so its behavior is
unchanged. The re-raised TimeoutExpired message is identical, so the
upper-layer error string stays the same.
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. area:core The bug / feature is about astrbot's core, backend labels Jun 16, 2026

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In the timeout handling path, the post-kill proc.communicate()/proc.wait() calls have no timeout and could hang if taskkill/kill fails or a zombie process sticks around; consider adding a short, bounded timeout (and logging/fallback) to this cleanup phase to avoid blocking a worker thread indefinitely.
  • The integration-style timeout test currently uses a 30s sleep, which makes failures slow and slightly increases normal runtime; you could shorten the child sleep (e.g., 5–10s) while keeping the shell timeout at 2s to preserve the behavior under test but make the suite faster and less painful when it regresses.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the timeout handling path, the post-kill `proc.communicate()`/`proc.wait()` calls have no timeout and could hang if `taskkill`/`kill` fails or a zombie process sticks around; consider adding a short, bounded timeout (and logging/fallback) to this cleanup phase to avoid blocking a worker thread indefinitely.
- The integration-style timeout test currently uses a 30s sleep, which makes failures slow and slightly increases normal runtime; you could shorten the child sleep (e.g., 5–10s) while keeping the shell timeout at 2s to preserve the behavior under test but make the suite faster and less painful when it regresses.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces subprocess.run with subprocess.Popen in the local booter to better handle timeouts on Windows by introducing a _kill_process_tree helper that forcefully terminates a process and all its children. The tests have been updated accordingly to mock Popen and verify timeout enforcement. The review feedback highlights two important robustness improvements: wrapping proc.kill() in a try-except block to safely handle potential OSError exceptions, and replacing the blocking proc.communicate() call in the Windows timeout handler with a timed proc.wait() to prevent indefinite hangs if child processes fail to terminate.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +161 to +164
if os.name == "nt":
proc.communicate()
else:
proc.wait()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Calling proc.communicate() without a timeout on Windows inside the TimeoutExpired exception handler can block indefinitely if _kill_process_tree fails to terminate all child processes (e.g., if taskkill is unavailable and we fall back to proc.kill(), which leaves children alive).

In CPython's standard library subprocess.run implementation, Windows does not call communicate() after a timeout precisely because it cannot reliably wait for the process to exit and close its pipes without blocking. Instead, it only calls proc.wait(). We should align with this behavior and call proc.wait() on Windows as well, or at least provide a short timeout to proc.communicate() to prevent an infinite hang.

                    _kill_process_tree(proc)
                    try:
                        proc.wait(timeout=5)
                    except subprocess.TimeoutExpired:
                        pass
                    raise

Comment on lines +56 to +68
if os.name == "nt":
try:
subprocess.run(
["taskkill", "/F", "/T", "/PID", str(proc.pid)],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=False,
)
return
except OSError:
# taskkill missing/unavailable: fall back to a single-process kill.
pass
proc.kill()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _kill_process_tree, if os.name == "nt" is true, we attempt to run taskkill. If taskkill is missing or fails, we catch OSError and fall back to proc.kill(). However, proc.kill() itself can raise an OSError (such as PermissionError or ProcessLookupError) if the process has already exited or if access is denied. It is safer to wrap proc.kill() in a try...except OSError: block to prevent unexpected crashes during cleanup.

    if os.name == "nt":
        try:
            subprocess.run(
                ["taskkill", "/F", "/T", "/PID", str(proc.pid)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                check=False,
            )
            return
        except OSError:
            # taskkill missing/unavailable: fall back to a single-process kill.
            pass
    try:
        proc.kill()
    except OSError:
        pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:core The bug / feature is about astrbot's core, backend size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant