bug(hooks): tokio::join! in fire_shell_hook can hang past timeout — read_fut blocks if stdout not closed before kill #4011

@bug-ops

Description

In `crates/zeph-subagent/src/hooks.rs`, `fire_shell_hook` uses `tokio::join!` to concurrently await child process exit (with a timeout) and read stdout:

```rust
let (wait_res, stdout_bytes) = tokio::join!(
    timeout(Duration::from_secs(timeout_secs), child.wait()),
    read_fut, // <-- reads up to HOOK_STDOUT_CAP bytes from child stdout
);
match wait_res {
    // ...
    Err(_) => {
        let _ = child.kill().await; // kill happens AFTER join! returns
        Err(HookError::Timeout { ... })
    }
}
```

`tokio::join!` only returns when both futures complete. The timeout governs `child.wait()`, but `read_fut` has no independent timeout. If the hook process:

  1. Writes a large amount of stdout (keeping the pipe open), AND
  2. Takes longer than `timeout_secs`

then `wait_res` times out, but `read_fut` keeps blocking because the child has not closed its stdout. `child.kill()` is called only after `join!` returns, `join!` cannot return until `read_fut` completes, and `read_fut` completes only at EOF on stdout, which arrives once the process exits or is killed. The result is a circular wait: `read_fut` waits for EOF, while the kill that would produce EOF is held back until `join!` returns.

The hang therefore lasts for as long as the child keeps stdout open after the wait timeout fires, and nothing bounds that duration.
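The EOF dependency can be observed outside the codebase. The following is a minimal sketch, not the code from `hooks.rs`: it uses `std::process` and a reader thread in place of tokio, and a hypothetical `sleep 30` child standing in for a slow hook (Unix only). The helper name `reader_blocked_until_kill` is invented for illustration.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Returns true if the reader was still waiting for EOF when `probe`
/// elapsed, i.e. the read finished only after the explicit kill.
fn reader_blocked_until_kill(probe: Duration) -> std::io::Result<bool> {
    // Hypothetical stand-in for a slow hook: keeps stdout open, writes nothing.
    let mut child = Command::new("sleep")
        .arg("30")
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdout = child.stdout.take().expect("stdout was piped");

    let (tx, rx) = mpsc::channel();
    let reader = thread::spawn(move || {
        let mut buf = Vec::new();
        // read_to_end returns only at EOF, like a read-until-closed future.
        let _ = stdout.read_to_end(&mut buf);
        let _ = tx.send(());
    });

    // No kill has been issued yet, so no message should arrive within `probe`.
    let blocked = rx.recv_timeout(probe).is_err();

    // Killing the child closes its end of the pipe and unblocks the reader.
    child.kill()?;
    child.wait()?;
    reader.join().expect("reader thread panicked");
    Ok(blocked)
}
```

Calling `reader_blocked_until_kill(Duration::from_millis(200))` is expected to report that the reader was still blocked before the kill, which is the same ordering problem the `join!` call runs into.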

Reproduction Steps

  1. Configure a PostToolUse shell hook that: (a) writes a continuous stream to stdout and (b) takes longer than `timeout_secs`.
  2. Run the agent and trigger the hooked tool.
  3. Observe: hook does not return after `timeout_secs`; agent loop stalls.

Expected Behavior

The hook must return within `timeout_secs` regardless of stdout activity. One correct approach:

```rust
// Kill child first, then read any buffered stdout.
match timeout(Duration::from_secs(timeout_secs), child.wait()).await {
    Ok(Ok(status)) if status.success() => {
        let stdout_bytes = read_stdout(child.stdout.take(), HOOK_STDOUT_CAP).await;
        Ok(parse_hook_stdout(command, &stdout_bytes))
    }
    Err(_) => {
        let _ = child.kill().await;
        // Read any bytes already buffered (bounded by HOOK_STDOUT_CAP).
        Err(HookError::Timeout { command: command.to_owned(), timeout_secs })
    }
    // ...
}
```

Alternatively, wrap the entire `join!` in an outer timeout.
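The kill-first ordering can be sketched without tokio. The example below is an illustration under stated assumptions, not a patch for `fire_shell_hook`: it swaps in `std::process`, a reader thread, and a polling loop, and the names `run_with_deadline`, `Outcome`, and `STDOUT_CAP` are hypothetical. On timeout it kills and reaps the child and returns without waiting for the reader, so the call stays bounded even if stdout remains open.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical cap, standing in for HOOK_STDOUT_CAP.
const STDOUT_CAP: u64 = 64 * 1024;

#[derive(Debug, PartialEq)]
enum Outcome {
    Exited { success: bool, stdout: Vec<u8> },
    TimedOut,
}

fn run_with_deadline(mut cmd: Command, limit: Duration) -> std::io::Result<Outcome> {
    let mut child = cmd.stdout(Stdio::piped()).spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");

    // Drain stdout concurrently, reading at most STDOUT_CAP bytes.
    let reader = thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = stdout.take(STDOUT_CAP).read_to_end(&mut buf);
        buf
    });

    let deadline = Instant::now() + limit;
    loop {
        if let Some(status) = child.try_wait()? {
            // Child exited on its own: collect what the reader gathered.
            let stdout = reader.join().expect("reader thread panicked");
            return Ok(Outcome::Exited { success: status.success(), stdout });
        }
        if Instant::now() >= deadline {
            // Kill first, then reap. The reader is deliberately not joined,
            // so a pipe that stays open cannot delay the return.
            let _ = child.kill();
            child.wait()?;
            return Ok(Outcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

For example, running `sh -c "while :; do echo tick; sleep 1; done"` with a 300 ms limit should yield `Outcome::TimedOut` shortly after the limit. Draining concurrently is a design choice: if stdout is read only after the child exits, a child that writes more than the pipe buffer holds may block on write and hit the timeout even though it would otherwise finish. One limitation of this sketch is that the `Exited` branch still joins the reader, which could wait if another process inherited the pipe.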

Actual Behavior

Agent loop can stall indefinitely on hook timeout when the child keeps stdout open past the wait timeout.

Environment

Labels

P2 (High value, medium complexity), bug (Something isn't working)
