Skip to content

fix(rtmg): reap dead WS clients that wedge the send path#295

Merged
leszko merged 2 commits into
mainfrom
rafal/ws-teardown-fix
Jun 25, 2026
Merged

fix(rtmg): reap dead WS clients that wedge the send path#295
leszko merged 2 commits into
mainfrom
rafal/ws-teardown-fix

Conversation

@leszko

@leszko leszko commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

The websockets sync send path holds the connection's protocol lock across a blocking socket.sendall. A half-open client (killed tab, slept laptop, Cloudflare tunnel still ESTABLISHED while the browser is gone) wedges the next ws.send on a full TCP window and pins that lock forever. The keepalive ping then deadlocks on the same lock, so the dead client is never detected: state.running stays True, the pod's one session seat is never released, and /sessions never empties (measured: ~2.6 MB stuck, session still registered 13+ min later).

Add a per-session watchdog that detects this from outside the wedged I/O threads, via GIL-/lock-independent syscalls on the socket FD:

  • Send stallSIOCOUTQ (queued bytes) stays positive while TCP_INFO.tcpi_bytes_acked makes no progress and the queue doesn't drain for a grace window. Keying on the kernel's acked-bytes counter (not app-level acks) leaves slow links, GC pauses, backgrounded tabs, and backpressured-but-alive clients untouched; only a client reading nothing trips it.
  • Peer closeTCP_INFO state leaves ESTABLISHED, or the FD errors.

On either signal it flips state.running False (runner exits → registry unregister → pod returns to pool) and force-shuts the socket so the wedged sendall and recv thread unblock at once. Grace is configurable via DEMON_DEAD_CLIENT_STALL_S (default 30s). Linux-only; no-ops elsewhere.

Supersedes #293 (the app-level ack-stall reaper, which false-reaps GIL-starved-but-alive clients). Validated on a 5090 pod: a frozen client is reaped via send_stall; clean close, active, paused-but-alive, and backpressured clients are all left untouched.

leszko and others added 2 commits June 24, 2026 08:38
The websockets sync send path (send_context → send_data) holds the
connection's protocol lock across a *blocking* socket.sendall with no
send timeout. When a client stops reading — a killed tab, a slept
laptop, or a Cloudflare named-tunnel half-open where the pod↔edge TCP
stays ESTABLISHED while the browser is gone — the next slice ws.send
wedges on a full TCP window and pins the protocol lock forever. The
library's own keepalive ping then deadlocks acquiring that same lock,
so the dead client is never detected: state.running stays True, the
runner holds the pod's one-session seat, and /sessions never empties
(lockHeld stays true). Measured: a frozen client left ~2.6 MB stuck in
the server's send queue with the session still registered 13+ minutes
later. The recv loop can't help either (a half-open peer sends no close
frame) and our own ConnectionClosed detectors are the threads stuck in
sendall.

Add a per-session dead-client watchdog that detects this from OUTSIDE
the wedged Python I/O threads, via GIL- and lock-independent syscalls on
the socket FD:

  - send stall (SIOCOUTQ): the kernel send queue stays positive and
    drains zero bytes for a grace window. A live client's kernel ACKs and
    drains regardless of the client app's responsiveness, so a slow link,
    a GC pause, or a backgrounded tab keeps draining and is never reaped;
    only a client reading nothing trips it. Keying on kernel drainage —
    not app-level ack progress — is what avoids the false-reap of a
    GIL-starved-but-alive client that sinks an ack-stall timeout.
  - peer close (TCP_INFO state != ESTABLISHED / FD error): the
    "0 established sockets but session still registered" fingerprint,
    where the library's recv thread missed the EOF.

On either signal it flips state.running False (runner exits → registry
unregister fires → /sessions empties → lockHeld flips false and the pod
auto-returns to the pool) and force-shuts the socket so the wedged
sendall and recv thread unblock at once. Grace is configurable via
DEMON_DEAD_CLIENT_STALL_S (default 30s).

Supersedes #293 (the ack-stall reaper), whose app-level ack-progress
signal false-reaps live-but-GIL-starved clients.

Validated on a 5090 pod: a genuinely half-open (frozen) client is reaped
via send_stall; clean close, an active client, a paused-but-alive client
(draining, zero acks, 78s), and a backpressured client with acks
trickling (78s, window-drops engaged) are all left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Off Linux the SIOCOUTQ/TCP_INFO syscalls fail and would reap every
session, so gate the watchdog to Linux and no-op elsewhere. Add
TCP_INFO.tcpi_bytes_acked as the drain signal so a backpressured-but-
alive client whose send queue plateaus at a constant positive level
isn't false-reaped. Hoist struct/sys imports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@leszko leszko marked this pull request as ready for review June 25, 2026 06:35
@leszko leszko merged commit 5b8b585 into main Jun 25, 2026
@leszko leszko deleted the rafal/ws-teardown-fix branch June 25, 2026 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant