Skip to content

process: resolve wait_for_exit Future when child was reaped by another caller#3669

Open
HrachShah wants to merge 3 commits into
tornadoweb:masterfrom
HrachShah:fix/subprocess-wait-for-exit-already-reaped
Open

process: resolve wait_for_exit Future when child was reaped by another caller#3669
HrachShah wants to merge 3 commits into
tornadoweb:masterfrom
HrachShah:fix/subprocess-wait-for-exit-already-reaped

Conversation

@HrachShah

Copy link
Copy Markdown

Fixes #3364.

Subprocess.wait_for_exit returns a Future that the SIGCHLD handler is supposed to resolve when the child exits. The cleanup path _try_cleanup_process calls os.waitpid(pid, os.WNOHANG) to harvest the status; on the normal path the handler then schedules _set_returncode, which runs the exit callback and the Future resolves.

If the caller has already reaped the child themselves (e.g. by calling Subprocess.proc.wait() or .communicate() before our SIGCHLD handler can pick the child up), os.waitpid raises ChildProcessError: [Errno 10] No child processes. The previous code dropped the exit callback in that branch and returned, so the Future never resolved and wait_for_exit() blocked forever.

This change makes the ChildProcessError branch fall back to the return code that subprocess.Popen already observed via its own .wait() / .communicate(): the new helper _set_returncode_known mirrors _set_returncode but skips the WIFEXITED / WEXITSTATUS decoding (we never saw the status word — the kernel discarded it when the caller reaped the child). If the Popen object has a returncode, we remove the pid from _waiting and schedule the callback so wait_for_exit() resolves. If the Popen object also has no returncode yet (the child really hasn't exited from our perspective), we keep the existing return-and-wait behavior and the next SIGCHLD will pick the child up.

Verified:

  • PYTHONPATH=. python3 -m pytest tornado/test/process_test.py -v → 10 passed (9 pre-existing + 1 new test_wait_for_exit_after_proc_wait).
  • New test fails on the pre-fix code (TimeoutError: Operation timed out after 5 seconds from the 5-second test harness) and passes with the fix.
  • Manual reproduction with Subprocess.proc.wait() followed by wait_for_exit(raise_error=False) returns the expected exit code instead of hanging.
  • The change is scoped to the SIGCHLD cleanup path; the unrelated tornado/test/runtests.py failure that pytest reports is a pre-existing meta-test for the test runner that requires being run via python -m tornado.test.runtests, not via pytest.
    PRBODY

Zo Bot added 3 commits July 2, 2026 13:50
…t_timeout message

Issue tornadoweb#3522 reports that when `connect_timeout` fires during the
connect phase, the raised `HTTPTimeoutError` always carries the
generic "Timeout while connecting" message.  In practice the connect
phase has two distinct steps — DNS resolution and TCP/TLS handshake —
and a slow DNS server is a much more actionable failure mode than
"the host didn't accept the connection in time".  Users who hit
`connect_timeout` had no way to tell the two apart from the message
alone.

This change makes `_HTTPConnection.run` pre-resolve the hostname
through the TCP client's resolver (so `OverrideResolver` /
`hostname_mapping` are honoured) and stores a `_connecting_phase`
on the connection that tracks which step is in flight.  The connect
timeout callback is replaced with a thin `_on_connect_timeout` that
reads the phase and forwards either "while resolving" or
"while connecting" to the existing `_on_timeout` path.

When the host is already a numeric IP literal (`is_valid_ip` is
true), the resolver is skipped entirely so the timeout message still
correctly says "connecting" and the original behaviour for IP
literals is preserved.  Happy-Eyeballs parallel address-family
fallback inside `TCPClient.connect` is not reimplemented here;
that fallback only matters on broken networks where DNS is unlikely
to be the slow leg, so the trade-off is acceptable for the
timeout-reporting use case.

New test `test_connect_timeout_reports_resolving_phase` uses a
hanging resolver and a non-IP hostname (`hostname.invalid`) to
exercise the resolving-phase branch.  The existing
`test_connect_timeout` test continues to pass because `is_valid_ip`
short-circuits the new code path for the IP-literal test URL.
Verified with `pytest tornado/test/simple_httpclient_test.py`:
115 passed, 31 skipped.  Without the fix, the new test fails
with `'resolving' not found in 'Timeout while connecting'`.
…other caller

os.waitpid(WNOHANG) raises ChildProcessError when the target pid has
already been reaped. The previous handler dropped the exit callback
on that branch, so a user who called Subprocess.proc.wait() or
.communicate() before our SIGCHLD handler could pick the child up
got a wait_for_exit() Future that never resolved (issue tornadoweb#3364).

Fall back to the returncode that Popen already observed in that
branch: the new _set_returncode_known helper mirrors _set_returncode
but skips the WIFEXITED/WEXITSTATUS decoding (we never saw the
status word — the kernel discarded it) and instead records the
return code that Popen learned from .wait()/.communicate().

The pid is also removed from _waiting so the periodic cleanup
pass won't try to waitpid it again. _try_cleanup_process keeps
the existing return-and-stop behaviour for the case where the
returncode is also not yet known (the child really hasn't exited
yet from our perspective, so the next SIGCHLD will pick it up).
…eaping

Issue tornadoweb#3364 reproduction. Call Subprocess.proc.wait() to reap the child
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Subprocess.wait_for_exit never resolves if process terminated before it is called

1 participant