Skip to content

fix(rtmg): reap dead-client sessions so a half-open WS can't wedge a pod#293

Open
leszko wants to merge 1 commit into
mainfrom
rafal/ws-dead-client-reaper
Open

fix(rtmg): reap dead-client sessions so a half-open WS can't wedge a pod#293
leszko wants to merge 1 commit into
mainfrom
rafal/ws-dead-client-reaper

Conversation

@leszko

@leszko leszko commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Problem (observed in prod)

Two fleet pods became unconnectable while still heartbeating "healthy" — the only recovery was destroying them. SSH diagnosis:

Pod GPU Connected WS sockets conn_handler threads
A 74–95% 0 236
B 0% 0 242

Pod A was a zombie: the runner kept generating audio at full GPU with zero connected clients, still holding the one-session-per-pod seat. Its session log showed a session that took the seat hours earlier and never tore down.

Root cause

When a client's transport half-opens — tab closed without a close frame, a network drop, or (common with our Cloudflare named tunnels) a half-open TCPws.send buffers into the dead socket instead of raising ConnectionClosed. Every teardown path keys off state.running, which is only flipped on ConnectionClosed, so the session runs forever.

The WS keepalive can't be the sole backstop: server.py deliberately widened ping_timeout to 90s to avoid GIL-starvation false-disconnects, and its own comment says to "pair with an app-level idle-session reaper." That reaper didn't exist — this adds it.

Fix

A per-session daemon reaper (_dead_client_reaper) in _handle_client_body:

  • Signal: the client acks every received slice (monotonic byte count, already tracked in _slice_flow). The reaper acts only while we're actively sending slices (sent advancing) and the ack count has stalled for _DEAD_CLIENT_ACK_TIMEOUT_S (default 30s, env DEMON_DEAD_CLIENT_ACK_TIMEOUT_S). That's the exact zombie fingerprint: generating hard, acks frozen.
  • Action: flip state.running (runner tears down on its next iteration) and ws.close() to unblock the hung send/recv. Closes without send_lock on purpose — closing is what unblocks a send hung inside that lock.

Why it won't false-reap

  • An idle-paused session sends no slices, so sent doesn't advance and the reaper never fires. Idle+dead is left to the WS keepalive (the GPU is free while idle, so the keepalive thread isn't GIL-starved).
  • acked_ts seeds at session start, so a client that never acks at all is also caught after the timeout.
  • The reaper is a daemon thread that self-exits when state.running flips, so it adds no teardown coupling and can't itself leak.

Test notes

  • py_compile clean. The change is additive (one new daemon thread + one timestamp field); the existing flow-control/drop path is untouched.
  • Recommend validating on a :dev pod: start a session, kill the client's network (or close the tab uncleanly), confirm GPU returns to idle within ~30–35s and dead_client_reap is logged, and that a fresh connection then succeeds.

Follow-ups (not in this PR)

  • The ~240 leaked conn_handler threads look like a second issue (handshake/HTTP-probe connections not being reaped); worth a separate look.
  • Fleet-side safety net (in demon-public-demo): have the orchestrator/pool detect a wedged-but-heartbeating pod and bounce its session process instead of relying on manual destroy.

A client whose transport half-opens (tab closed without a close frame,
network drop, or — common in prod — a Cloudflare-tunnel half-open) leaves
ws.send buffering instead of raising ConnectionClosed. The session then
never tears down: the runner keeps generating at full GPU and holds the
one-session-per-pod seat forever. The pod keeps heartbeating "healthy",
so the pool keeps routing users to it, they get preempt/"another
connection took over", and the only recovery is destroying the pod.

The server already widened the keepalive ping_timeout to 90s (to avoid
GIL-starvation false-disconnects) and its comment explicitly calls for an
"app-level idle-session reaper" to pair with it. This adds that reaper.

Signal: the client acks every received slice (monotonic byte count). The
reaper runs only while we're actively SENDING slices (sent advancing) and
the ack count has stalled for _DEAD_CLIENT_ACK_TIMEOUT_S (default 30s, env
DEMON_DEAD_CLIENT_ACK_TIMEOUT_S). That's the exact dead-client fingerprint
we observed in prod: 70-95% GPU generating, zero connected sockets, acks
frozen. On detection it flips state.running (runner tears down on its next
iteration) and closes the ws to unblock the hung send/recv.

False-positive-proof: an idle-paused session sends no slices, so `sent`
doesn't advance and the reaper never fires — idle+dead is left to the WS
keepalive (GPU is free then, so the keepalive thread isn't starved). The
reaper is a daemon thread that self-exits when state.running flips, so it
adds no teardown coupling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
leszko added a commit that referenced this pull request Jun 25, 2026
* fix(rtmg): reap dead WS clients that wedge the send path

The websockets sync send path (send_context → send_data) holds the
connection's protocol lock across a *blocking* socket.sendall with no
send timeout. When a client stops reading — a killed tab, a slept
laptop, or a Cloudflare named-tunnel half-open where the pod↔edge TCP
stays ESTABLISHED while the browser is gone — the next slice ws.send
wedges on a full TCP window and pins the protocol lock forever. The
library's own keepalive ping then deadlocks acquiring that same lock,
so the dead client is never detected: state.running stays True, the
runner holds the pod's one-session seat, and /sessions never empties
(lockHeld stays true). Measured: a frozen client left ~2.6 MB stuck in
the server's send queue with the session still registered 13+ minutes
later. The recv loop can't help either (a half-open peer sends no close
frame) and our own ConnectionClosed detectors are the threads stuck in
sendall.

Add a per-session dead-client watchdog that detects this from OUTSIDE
the wedged Python I/O threads, via GIL- and lock-independent syscalls on
the socket FD:

  - send stall (SIOCOUTQ): the kernel send queue stays positive and
    drains zero bytes for a grace window. A live client's kernel ACKs and
    drains regardless of the client app's responsiveness, so a slow link,
    a GC pause, or a backgrounded tab keeps draining and is never reaped;
    only a client reading nothing trips it. Keying on kernel drainage —
    not app-level ack progress — is what avoids the false-reap of a
    GIL-starved-but-alive client that sinks an ack-stall timeout.
  - peer close (TCP_INFO state != ESTABLISHED / FD error): the
    "0 established sockets but session still registered" fingerprint,
    where the library's recv thread missed the EOF.

On either signal it flips state.running False (runner exits → registry
unregister fires → /sessions empties → lockHeld flips false and the pod
auto-returns to the pool) and force-shuts the socket so the wedged
sendall and recv thread unblock at once. Grace is configurable via
DEMON_DEAD_CLIENT_STALL_S (default 30s).

Supersedes #293 (the ack-stall reaper), whose app-level ack-progress
signal false-reaps live-but-GIL-starved clients.

Validated on a 5090 pod: a genuinely half-open (frozen) client is reaped
via send_stall; clean close, an active client, a paused-but-alive client
(draining, zero acks, 78s), and a backpressured client with acks
trickling (78s, window-drops engaged) are all left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(rtmg): harden dead-client watchdog (Linux gate + ack-progress)

Off Linux the SIOCOUTQ/TCP_INFO syscalls fail and would reap every
session, so gate the watchdog to Linux and no-op elsewhere. Add
TCP_INFO.tcpi_bytes_acked as the drain signal so a backpressured-but-
alive client whose send queue plateaus at a constant positive level
isn't false-reaped. Hoist struct/sys imports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant