fix(rtmg): reap dead-client sessions so a half-open WS can't wedge a pod #293
Open
leszko wants to merge 1 commit into
A client whose transport half-opens (tab closed without a close frame, network drop, or, commonly in prod, a Cloudflare-tunnel half-open) leaves ws.send buffering instead of raising ConnectionClosed. The session then never tears down: the runner keeps generating at full GPU and holds the one-session-per-pod seat forever. The pod keeps heartbeating "healthy", so the pool keeps routing users to it, they get preempt/"another connection took over", and the only recovery is destroying the pod.

The server already widened the keepalive ping_timeout to 90s (to avoid GIL-starvation false-disconnects), and its comment explicitly calls for an "app-level idle-session reaper" to pair with it. This adds that reaper.

Signal: the client acks every received slice (monotonic byte count). The reaper acts only while we're actively SENDING slices (`sent` advancing) and the ack count has stalled for _DEAD_CLIENT_ACK_TIMEOUT_S (default 30s, env DEMON_DEAD_CLIENT_ACK_TIMEOUT_S). That's the exact dead-client fingerprint we observed in prod: 70-95% GPU generating, zero connected sockets, acks frozen. On detection it flips state.running (runner tears down on its next iteration) and closes the ws to unblock the hung send/recv.

False-positive-proof: an idle-paused session sends no slices, so `sent` doesn't advance and the reaper never fires. Idle+dead is left to the WS keepalive (the GPU is free then, so the keepalive thread isn't starved). The reaper is a daemon thread that self-exits when state.running flips, so it adds no teardown coupling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
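The ack-stall rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SliceFlow`, `ack_stalled`, `reaper`, and `close_ws` are hypothetical names, and the real `_dead_client_reaper` may track its counters differently.

```python
import threading
import time
from dataclasses import dataclass, field


@dataclass
class SliceFlow:
    """Hypothetical per-session counters; the real fields may differ."""
    sent: int = 0          # bytes of slices handed to ws.send
    acked: int = 0         # monotonic byte count acked by the client
    acked_ts: float = field(default_factory=time.monotonic)  # last ack progress
    running: bool = True   # stands in for state.running


def ack_stalled(flow, prev_sent, prev_acked, now, timeout_s=30.0):
    """Return True when slices are still going out but acks have frozen."""
    if flow.acked != prev_acked:
        flow.acked_ts = now           # ack progress: reset the stall clock
        return False
    sending = flow.sent != prev_sent  # an idle-paused session never trips this
    return sending and (now - flow.acked_ts) >= timeout_s


def reaper(flow, close_ws, timeout_s=30.0, poll_s=1.0):
    """Daemon loop; exits on its own once flow.running flips."""
    prev_sent, prev_acked = flow.sent, flow.acked
    while flow.running:
        time.sleep(poll_s)
        if ack_stalled(flow, prev_sent, prev_acked, time.monotonic(), timeout_s):
            flow.running = False  # runner tears down on its next iteration
            close_ws()            # unblock a send/recv hung on the dead peer
            return
        prev_sent, prev_acked = flow.sent, flow.acked
```

Because `acked_ts` is seeded when the counters are created, a client that never acks at all is also caught once the timeout elapses, provided slices keep going out.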
leszko added a commit that referenced this pull request on Jun 25, 2026
* fix(rtmg): reap dead WS clients that wedge the send path
The websockets sync send path (send_context → send_data) holds the
connection's protocol lock across a *blocking* socket.sendall with no
send timeout. When a client stops reading — a killed tab, a slept
laptop, or a Cloudflare named-tunnel half-open where the pod↔edge TCP
stays ESTABLISHED while the browser is gone — the next slice ws.send
wedges on a full TCP window and pins the protocol lock forever. The
library's own keepalive ping then deadlocks acquiring that same lock,
so the dead client is never detected: state.running stays True, the
runner holds the pod's one-session seat, and /sessions never empties
(lockHeld stays true). Measured: a frozen client left ~2.6 MB stuck in
the server's send queue with the session still registered 13+ minutes
later. The recv loop can't help either (a half-open peer sends no close
frame) and our own ConnectionClosed detectors are the threads stuck in
sendall.
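The wedge described above can be reproduced in miniature. The sketch below is an illustration only: it uses a local `socket.socketpair()` rather than a TCP connection through a tunnel, and a 0.2s timeout stands in for the library's unbounded blocking `sendall`. `fill_until_blocked` is a hypothetical helper.

```python
import socket


def fill_until_blocked(sock, chunk=64 * 1024, limit=64 * 1024 * 1024):
    """Send without blocking until the kernel refuses more; return bytes queued."""
    sock.setblocking(False)
    queued = 0
    payload = b"\0" * chunk
    try:
        while queued < limit:
            queued += sock.send(payload)
    except BlockingIOError:
        pass  # buffers are full: a blocking sendall would hang at this point
    return queued


server, client = socket.socketpair()  # the "client" end never calls recv()
queued = fill_until_blocked(server)
print(f"{queued} bytes queued with nobody reading")

server.settimeout(0.2)  # bounded stand-in for a sendall with no send timeout
try:
    server.sendall(b"\0" * 64 * 1024)
    wedged = False
except (socket.timeout, BlockingIOError):
    wedged = True  # without the timeout this call would never return

server.close()
client.close()
```

The peer is still open from the kernel's point of view, so no error is raised; the sender simply waits for buffer space that never appears.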
Add a per-session dead-client watchdog that detects this from OUTSIDE
the wedged Python I/O threads, via GIL- and lock-independent syscalls on
the socket FD:
- send stall (SIOCOUTQ): the kernel send queue stays positive and
drains zero bytes for a grace window. A live client's kernel ACKs and
drains regardless of the client app's responsiveness, so a slow link,
a GC pause, or a backgrounded tab keeps draining and is never reaped;
only a client reading nothing trips it. Keying on kernel drainage
rather than app-level ack progress is what avoids false-reaping a
GIL-starved-but-alive client, which an ack-stall timeout would reap.
- peer close (TCP_INFO state != ESTABLISHED / FD error): the
"0 established sockets but session still registered" fingerprint,
where the library's recv thread missed the EOF.
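The two signals above could be read roughly as in the following Linux-only sketch. The helper names (`send_queue_bytes`, `tcp_state`, `probe`) are hypothetical; it assumes `termios.TIOCOUTQ` carries the same ioctl number as SIOCOUTQ on Linux and that `tcpi_state` is the first byte of `struct tcp_info`.

```python
import fcntl
import socket
import struct
import sys
import termios

TCP_ESTABLISHED = 1  # tcpi_state value for an established connection on Linux


def send_queue_bytes(sock):
    """Bytes still sitting in the kernel send queue (SIOCOUTQ)."""
    raw = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def tcp_state(sock):
    """Read tcpi_state, the first byte of struct tcp_info."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    return raw[0]


def probe(sock):
    """Return (outq, established), or None where the syscalls are unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        return send_queue_bytes(sock), tcp_state(sock) == TCP_ESTABLISHED
    except OSError:
        return 0, False  # FD error: treat the peer as gone
```

Both calls go straight to the kernel through the file descriptor, so they do not need the connection's protocol lock that the wedged `sendall` is holding.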
On either signal it flips state.running False (runner exits → registry
unregister fires → /sessions empties → lockHeld flips false and the pod
auto-returns to the pool) and force-shuts the socket so the wedged
sendall and recv thread unblock at once. Grace is configurable via
DEMON_DEAD_CLIENT_STALL_S (default 30s).
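The unblocking step can be illustrated with a thread parked in `recv` on a local socket pair. This is a sketch of the general mechanism, not the watchdog's code, and it assumes the force-shut is done with `socket.shutdown(SHUT_RDWR)`.

```python
import socket
import threading

server, client = socket.socketpair()
result = {}


def blocked_recv():
    # Blocks: the peer end is open but never sends anything.
    result["data"] = server.recv(4096)


t = threading.Thread(target=blocked_recv, daemon=True)
t.start()
t.join(timeout=0.2)
still_blocked = t.is_alive()  # the thread is parked inside recv

# shutdown() acts on the kernel socket, so it wakes threads waiting on it.
server.shutdown(socket.SHUT_RDWR)
t.join(timeout=2.0)
unblocked = not t.is_alive()

server.close()
client.close()
```

After the shutdown, the parked `recv` returns an empty bytes object, which the blocked thread can treat as end-of-stream and exit.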
Supersedes #293 (the ack-stall reaper), whose app-level ack-progress
signal false-reaps live-but-GIL-starved clients.
Validated on a 5090 pod: a genuinely half-open (frozen) client is reaped
via send_stall; clean close, an active client, a paused-but-alive client
(draining, zero acks, 78s), and a backpressured client with acks
trickling (78s, window-drops engaged) are all left untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(rtmg): harden dead-client watchdog (Linux gate + ack-progress)
Off Linux the SIOCOUTQ/TCP_INFO syscalls fail and would reap every
session, so gate the watchdog to Linux and no-op elsewhere. Add
TCP_INFO.tcpi_bytes_acked as the drain signal so a backpressured-but-
alive client whose send queue plateaus at a constant positive level
isn't false-reaped. Hoist struct/sys imports.
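The hardened stall rule might look like the sketch below. `DrainTracker` and `WATCHDOG_ENABLED` are hypothetical names; the sketch takes `outq` (SIOCOUTQ) and `bytes_acked` (`tcpi_bytes_acked`) as already-sampled integers and does not show how they are read from the socket.

```python
import sys
from dataclasses import dataclass

# Gate: the syscalls behind these samples exist only on Linux, so a caller
# would skip the watchdog entirely when this is False.
WATCHDOG_ENABLED = sys.platform.startswith("linux")


@dataclass
class DrainTracker:
    """Tracks kernel-level drain progress for one session."""
    stall_s: float = 30.0      # grace window (DEMON_DEAD_CLIENT_STALL_S)
    last_acked: int = -1       # last seen tcpi_bytes_acked; -1 means no sample
    last_progress: float = 0.0

    def stalled(self, outq, bytes_acked, now):
        """True when data is queued and nothing new was acked for stall_s."""
        if self.last_acked < 0 or bytes_acked > self.last_acked or outq == 0:
            self.last_acked = bytes_acked
            self.last_progress = now  # drained, or nothing is waiting
            return False
        return (now - self.last_progress) >= self.stall_s
```

With this rule, a send queue that plateaus at a constant positive level does not trip the watchdog as long as `bytes_acked` keeps advancing; only a queue that is positive while acks are frozen for the whole grace window does.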
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem (observed in prod)
Two fleet pods became unconnectable while still heartbeating "healthy" — the only recovery was destroying them. SSH diagnosis:
Pod A was a zombie: the runner kept generating audio at full GPU with zero connected clients, still holding the one-session-per-pod seat. Its session log showed a session that took the seat hours earlier and never tore down.
Root cause
When a client's transport half-opens (tab closed without a close frame, a network drop, or, commonly with our Cloudflare named tunnels, a half-open TCP connection), `ws.send` buffers into the dead socket instead of raising `ConnectionClosed`. Every teardown path keys off `state.running`, which is only flipped on `ConnectionClosed`, so the session runs forever.

The WS keepalive can't be the sole backstop: `server.py` deliberately widened `ping_timeout` to 90s to avoid GIL-starvation false-disconnects, and its own comment says to "pair with an app-level idle-session reaper." That reaper didn't exist; this adds it.

Fix
A per-session daemon reaper (`_dead_client_reaper`) in `_handle_client_body`:
- Signal: the client acks every received slice (monotonic byte count, `_slice_flow`). The reaper acts only while we're actively sending slices (`sent` advancing) and the ack count has stalled for `_DEAD_CLIENT_ACK_TIMEOUT_S` (default 30s, env `DEMON_DEAD_CLIENT_ACK_TIMEOUT_S`). That's the exact zombie fingerprint: generating hard, acks frozen.
- On detection it flips `state.running` (runner tears down on its next iteration) and calls `ws.close()` to unblock the hung send/recv. It closes without `send_lock` on purpose: closing is what unblocks a send hung inside that lock.

Why it won't false-reap
- An idle-paused session sends no slices, so `sent` doesn't advance and the reaper never fires. Idle+dead is left to the WS keepalive (the GPU is free while idle, so the keepalive thread isn't GIL-starved).
- `acked_ts` seeds at session start, so a client that never acks at all is also caught after the timeout.
- The reaper is a daemon thread that self-exits when `state.running` flips, so it adds no teardown coupling and can't itself leak.

Test notes
- `py_compile` clean. The change is additive (one new daemon thread + one timestamp field); the existing flow-control/drop path is untouched.
- On a `:dev` pod: start a session, kill the client's network (or close the tab uncleanly), confirm GPU returns to idle within ~30–35s and `dead_client_reap` is logged, and that a fresh connection then succeeds.

Follow-ups (not in this PR)
- `conn_handler` threads look like a second issue (handshake/HTTP-probe connections not being reaped); worth a separate look.