fix(rtmg): reap dead-client sessions so a half-open WS can't wedge a pod by leszko · Pull Request #293 · daydreamlive/DEMON

leszko · 2026-06-23T13:01:49Z

Problem (observed in prod)

Two fleet pods became unconnectable while still heartbeating "healthy" — the only recovery was destroying them. SSH diagnosis:

Pod	GPU	Connected WS sockets	conn_handler threads
A	74–95%	0	236
B	0%	0	242

Pod A was a zombie: the runner kept generating audio at full GPU with zero connected clients, still holding the one-session-per-pod seat. Its session log showed a session that took the seat hours earlier and never tore down.

Root cause

When a client's transport half-opens — tab closed without a close frame, a network drop, or (common with our Cloudflare named tunnels) a half-open TCP — ws.send buffers into the dead socket instead of raising ConnectionClosed. Every teardown path keys off state.running, which is only flipped on ConnectionClosed, so the session runs forever.

The WS keepalive can't be the sole backstop: server.py deliberately widened ping_timeout to 90s to avoid GIL-starvation false-disconnects, and its own comment says to "pair with an app-level idle-session reaper." That reaper didn't exist — this adds it.

Fix

A per-session daemon reaper (_dead_client_reaper) in _handle_client_body:

Signal: the client acks every received slice (monotonic byte count, already tracked in _slice_flow). The reaper acts only while we're actively sending slices (sent advancing) and the ack count has stalled for _DEAD_CLIENT_ACK_TIMEOUT_S (default 30s, env DEMON_DEAD_CLIENT_ACK_TIMEOUT_S). That's the exact zombie fingerprint: generating hard, acks frozen.
Action: flip state.running (runner tears down on its next iteration) and ws.close() to unblock the hung send/recv. Closes without send_lock on purpose — closing is what unblocks a send hung inside that lock.

Why it won't false-reap

An idle-paused session sends no slices, so sent doesn't advance and the reaper never fires. Idle+dead is left to the WS keepalive (the GPU is free while idle, so the keepalive thread isn't GIL-starved).
acked_ts seeds at session start, so a client that never acks at all is also caught after the timeout.
The reaper is a daemon thread that self-exits when state.running flips, so it adds no teardown coupling and can't itself leak.

Test notes

py_compile clean. The change is additive (one new daemon thread + one timestamp field); the existing flow-control/drop path is untouched.
Recommend validating on a :dev pod: start a session, kill the client's network (or close the tab uncleanly), confirm GPU returns to idle within ~30–35s and dead_client_reap is logged, and that a fresh connection then succeeds.

Follow-ups (not in this PR)

The ~240 leaked conn_handler threads look like a second issue (handshake/HTTP-probe connections not being reaped); worth a separate look.
Fleet-side safety net (in demon-public-demo): have the orchestrator/pool detect a wedged-but-heartbeating pod and bounce its session process instead of relying on manual destroy.

A client whose transport half-opens (tab closed without a close frame, network drop, or — common in prod — a Cloudflare-tunnel half-open) leaves ws.send buffering instead of raising ConnectionClosed. The session then never tears down: the runner keeps generating at full GPU and holds the one-session-per-pod seat forever. The pod keeps heartbeating "healthy", so the pool keeps routing users to it, they get preempt/"another connection took over", and the only recovery is destroying the pod. The server already widened the keepalive ping_timeout to 90s (to avoid GIL-starvation false-disconnects) and its comment explicitly calls for an "app-level idle-session reaper" to pair with it. This adds that reaper. Signal: the client acks every received slice (monotonic byte count). The reaper runs only while we're actively SENDING slices (sent advancing) and the ack count has stalled for _DEAD_CLIENT_ACK_TIMEOUT_S (default 30s, env DEMON_DEAD_CLIENT_ACK_TIMEOUT_S). That's the exact dead-client fingerprint we observed in prod: 70-95% GPU generating, zero connected sockets, acks frozen. On detection it flips state.running (runner tears down on its next iteration) and closes the ws to unblock the hung send/recv. False-positive-proof: an idle-paused session sends no slices, so `sent` doesn't advance and the reaper never fires — idle+dead is left to the WS keepalive (GPU is free then, so the keepalive thread isn't starved). The reaper is a daemon thread that self-exits when state.running flips, so it adds no teardown coupling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(rtmg): reap dead WS clients that wedge the send path The websockets sync send path (send_context → send_data) holds the connection's protocol lock across a *blocking* socket.sendall with no send timeout. When a client stops reading — a killed tab, a slept laptop, or a Cloudflare named-tunnel half-open where the pod↔edge TCP stays ESTABLISHED while the browser is gone — the next slice ws.send wedges on a full TCP window and pins the protocol lock forever. The library's own keepalive ping then deadlocks acquiring that same lock, so the dead client is never detected: state.running stays True, the runner holds the pod's one-session seat, and /sessions never empties (lockHeld stays true). Measured: a frozen client left ~2.6 MB stuck in the server's send queue with the session still registered 13+ minutes later. The recv loop can't help either (a half-open peer sends no close frame) and our own ConnectionClosed detectors are the threads stuck in sendall. Add a per-session dead-client watchdog that detects this from OUTSIDE the wedged Python I/O threads, via GIL- and lock-independent syscalls on the socket FD: - send stall (SIOCOUTQ): the kernel send queue stays positive and drains zero bytes for a grace window. A live client's kernel ACKs and drains regardless of the client app's responsiveness, so a slow link, a GC pause, or a backgrounded tab keeps draining and is never reaped; only a client reading nothing trips it. Keying on kernel drainage — not app-level ack progress — is what avoids the false-reap of a GIL-starved-but-alive client that sinks an ack-stall timeout. - peer close (TCP_INFO state != ESTABLISHED / FD error): the "0 established sockets but session still registered" fingerprint, where the library's recv thread missed the EOF. On either signal it flips state.running False (runner exits → registry unregister fires → /sessions empties → lockHeld flips false and the pod auto-returns to the pool) and force-shuts the socket so the wedged sendall and recv thread unblock at once. Grace is configurable via DEMON_DEAD_CLIENT_STALL_S (default 30s). Supersedes #293 (the ack-stall reaper), whose app-level ack-progress signal false-reaps live-but-GIL-starved clients. Validated on a 5090 pod: a genuinely half-open (frozen) client is reaped via send_stall; clean close, an active client, a paused-but-alive client (draining, zero acks, 78s), and a backpressured client with acks trickling (78s, window-drops engaged) are all left untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(rtmg): harden dead-client watchdog (Linux gate + ack-progress) Off Linux the SIOCOUTQ/TCP_INFO syscalls fail and would reap every session, so gate the watchdog to Linux and no-op elsewhere. Add TCP_INFO.tcpi_bytes_acked as the drain signal so a backpressured-but- alive client whose send queue plateaus at a constant positive level isn't false-reaped. Hoist struct/sys imports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

leszko mentioned this pull request Jun 24, 2026

fix(rtmg): reap dead WS clients that wedge the send path #295

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(rtmg): reap dead-client sessions so a half-open WS can't wedge a pod#293

fix(rtmg): reap dead-client sessions so a half-open WS can't wedge a pod#293
leszko wants to merge 1 commit into
mainfrom
rafal/ws-dead-client-reaper

leszko commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

leszko commented Jun 23, 2026

Problem (observed in prod)

Root cause

Fix

Why it won't false-reap

Test notes

Follow-ups (not in this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant