fix(rtmg): reap dead WS clients that wedge the send path by leszko · Pull Request #295 · daydreamlive/DEMON

leszko · 2026-06-24T09:13:33Z

The websockets sync send path holds the connection's protocol lock across a blocking socket.sendall. A half-open client (killed tab, slept laptop, Cloudflare tunnel still ESTABLISHED while the browser is gone) wedges the next ws.send on a full TCP window and pins that lock forever. The keepalive ping then deadlocks on the same lock, so the dead client is never detected: state.running stays True, the pod's one session seat is never released, and /sessions never empties (measured: ~2.6 MB stuck, session still registered 13+ min later).

Add a per-session watchdog that detects this from outside the wedged I/O threads, via GIL-/lock-independent syscalls on the socket FD:

- Send stall: SIOCOUTQ (queued bytes) stays positive while TCP_INFO.tcpi_bytes_acked makes no progress and the queue doesn't drain for a grace window. Keying on the kernel's acked-bytes counter (not app-level acks) leaves slow links, GC pauses, backgrounded tabs, and backpressured-but-alive clients untouched; only a client reading nothing trips it.
- Peer close: TCP_INFO state leaves ESTABLISHED, or the FD errors.
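As a rough illustration of the two kernel reads described above, here is a minimal sketch of how they might be taken from Python. It is not the PR's code: `Sample`, `parse_tcp_info`, and `probe` are hypothetical names, and the offset of `tcpi_bytes_acked` (120 bytes into `struct tcp_info`) is an assumption taken from the Linux UAPI header layout that should be checked against the target kernel.

```python
import socket
import struct
from typing import NamedTuple, Tuple

TCP_ESTABLISHED = 1        # TCP_ESTABLISHED in the kernel's TCP state enum
BYTES_ACKED_OFFSET = 120   # assumption: offset of tcpi_bytes_acked (u64) in struct tcp_info
TCP_INFO_LEN = BYTES_ACKED_OFFSET + 8


class Sample(NamedTuple):
    """One watchdog observation of a socket (hypothetical helper type)."""
    queued: int       # bytes still in the kernel send queue (SIOCOUTQ)
    state: int        # tcpi_state
    bytes_acked: int  # tcpi_bytes_acked


def parse_tcp_info(buf: bytes) -> Tuple[int, int]:
    """Extract (tcpi_state, tcpi_bytes_acked) from a raw struct tcp_info."""
    if len(buf) < TCP_INFO_LEN:
        raise ValueError("tcp_info too short to contain tcpi_bytes_acked")
    (bytes_acked,) = struct.unpack_from("=Q", buf, BYTES_ACKED_OFFSET)
    return buf[0], bytes_acked  # tcpi_state is the first byte


def probe(sock: socket.socket) -> Sample:
    """Read queue depth and TCP state straight from the kernel. Linux only."""
    # Imported lazily so this module still imports on platforms without them.
    import fcntl
    import termios

    # On Linux, SIOCOUTQ has the same request number as TIOCOUTQ.
    raw = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, b"\0" * 4)
    (queued,) = struct.unpack("=i", raw)
    info = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, TCP_INFO_LEN)
    state, bytes_acked = parse_tcp_info(info)
    return Sample(queued, state, bytes_acked)
```

Neither call takes the websockets protocol lock, which is why a probe of this kind can still run while the send thread is wedged. An `OSError` raised by `probe` would correspond to the "FD errors" case.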

On either signal it flips state.running False (runner exits → registry unregister → pod returns to pool) and force-shuts the socket so the wedged sendall and recv thread unblock at once. Grace is configurable via DEMON_DEAD_CLIENT_STALL_S (default 30s). Linux-only; no-ops elsewhere.
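The decision and teardown steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the merged implementation: `StallDetector` and `reap` are hypothetical names, the `"peer_close"` reason string is assumed by analogy with the `send_stall` reason mentioned in this PR, and `socket.shutdown(SHUT_RDWR)` is one plausible way to "force-shut" the socket.

```python
import os
import socket
from typing import Optional

# Grace window from the environment variable named in this PR (default 30s).
GRACE_S = float(os.environ.get("DEMON_DEAD_CLIENT_STALL_S", "30"))
TCP_ESTABLISHED = 1


class StallDetector:
    """Tracks kernel-level send progress for one session (hypothetical)."""

    def __init__(self, grace_s: float = GRACE_S) -> None:
        self.grace_s = grace_s
        self._last_queued: Optional[int] = None
        self._last_acked: Optional[int] = None
        self._stalled_since: Optional[float] = None

    def update(self, queued: int, state: int, bytes_acked: int,
               now: float) -> Optional[str]:
        """Return a reap reason, or None if the client still looks alive."""
        if state != TCP_ESTABLISHED:
            return "peer_close"
        # Progress means the kernel acked more bytes or the queue shrank.
        progressed = (
            self._last_acked is None
            or bytes_acked > self._last_acked
            or queued < self._last_queued
        )
        self._last_queued, self._last_acked = queued, bytes_acked
        if queued == 0 or progressed:
            self._stalled_since = None
            return None
        if self._stalled_since is None:
            self._stalled_since = now
        if now - self._stalled_since >= self.grace_s:
            return "send_stall"
        return None


def reap(state, sock: socket.socket) -> None:
    """Stop the runner and unblock any thread stuck in sendall/recv."""
    state.running = False
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # socket already closed or reset
```

Because progress is defined as "acked bytes advanced or the queue shrank", a backpressured client whose queue plateaus at a constant positive level keeps resetting the timer as long as the kernel is still acknowledging data.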

Supersedes #293 (the app-level ack-stall reaper, which false-reaps GIL-starved-but-alive clients). Validated on a 5090 pod: a frozen client is reaped via send_stall; clean close, active, paused-but-alive, and backpressured clients are all left untouched.

The websockets sync send path (send_context → send_data) holds the connection's protocol lock across a *blocking* socket.sendall with no send timeout. When a client stops reading (a killed tab, a slept laptop, or a Cloudflare named-tunnel half-open where the pod↔edge TCP stays ESTABLISHED while the browser is gone), the next slice ws.send wedges on a full TCP window and pins the protocol lock forever. The library's own keepalive ping then deadlocks acquiring that same lock, so the dead client is never detected: state.running stays True, the runner holds the pod's one-session seat, and /sessions never empties (lockHeld stays true). Measured: a frozen client left ~2.6 MB stuck in the server's send queue with the session still registered 13+ minutes later. The recv loop can't help either (a half-open peer sends no close frame) and our own ConnectionClosed detectors are the threads stuck in sendall.

Add a per-session dead-client watchdog that detects this from OUTSIDE the wedged Python I/O threads, via GIL- and lock-independent syscalls on the socket FD:

- send stall (SIOCOUTQ): the kernel send queue stays positive and drains zero bytes for a grace window. A live client's kernel ACKs and drains regardless of the client app's responsiveness, so a slow link, a GC pause, or a backgrounded tab keeps draining and is never reaped; only a client reading nothing trips it. Keying on kernel drainage, not app-level ack progress, is what avoids the false-reap of a GIL-starved-but-alive client that sinks an ack-stall timeout.
- peer close (TCP_INFO state != ESTABLISHED / FD error): the "0 established sockets but session still registered" fingerprint, where the library's recv thread missed the EOF.

On either signal it flips state.running False (runner exits → registry unregister fires → /sessions empties → lockHeld flips false and the pod auto-returns to the pool) and force-shuts the socket so the wedged sendall and recv thread unblock at once.

Grace is configurable via DEMON_DEAD_CLIENT_STALL_S (default 30s).

Supersedes #293 (the ack-stall reaper), whose app-level ack-progress signal false-reaps live-but-GIL-starved clients.

Validated on a 5090 pod: a genuinely half-open (frozen) client is reaped via send_stall; clean close, an active client, a paused-but-alive client (draining, zero acks, 78s), and a backpressured client with acks trickling (78s, window-drops engaged) are all left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Off Linux the SIOCOUTQ/TCP_INFO syscalls fail and would reap every session, so gate the watchdog to Linux and no-op elsewhere. Add TCP_INFO.tcpi_bytes_acked as the drain signal so a backpressured-but-alive client whose send queue plateaus at a constant positive level isn't false-reaped. Hoist struct/sys imports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
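The platform gate and the "outside the I/O threads" polling loop might look roughly like this. It is a sketch under assumptions: `start_watchdog`, its parameters, and the one-second poll interval are hypothetical, and the `platform` argument exists only so the gate can be exercised on any OS.

```python
import sys
import threading
from typing import Callable, Optional


def start_watchdog(
    check: Callable[[], Optional[str]],
    on_dead: Callable[[str], None],
    stop: threading.Event,
    poll_s: float = 1.0,
    platform: str = sys.platform,
) -> Optional[threading.Thread]:
    """Poll `check` on its own thread; call `on_dead(reason)` once if it fires.

    Returns None (a no-op) off Linux, where the SIOCOUTQ/TCP_INFO reads
    would fail and every session would look dead.
    """
    if not platform.startswith("linux"):
        return None

    def loop() -> None:
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(poll_s):
            try:
                reason = check()
            except OSError:
                reason = "peer_close"  # an FD error counts as a peer-close signal
            if reason:
                on_dead(reason)
                return

    thread = threading.Thread(target=loop, name="dead-client-watchdog", daemon=True)
    thread.start()
    return thread
```

Running the check on a dedicated daemon thread keeps it independent of the send and recv threads, so it can still fire when both are blocked.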

leszko and others added 2 commits June 24, 2026 08:38

leszko marked this pull request as ready for review June 25, 2026 06:35

leszko merged commit 5b8b585 into main Jun 25, 2026

leszko deleted the rafal/ws-teardown-fix branch June 25, 2026 06:36
