fix(relay): disk-backed segments, transient-error tolerance, SIGTERM #69
Merged
Three production resilience issues observed in trio-relay, fixed together because they touch the same upload loop:

1. **Cloud blip = 30-60s of dead air.** Any non-204 response killed the whole ffmpeg session and the wrapper script had to respawn. Classify outcomes as OK/TRANSIENT/FATAL: 5xx and httpx TransportError are transient (log + skip + continue), only 401/403 abort (auth won't self-heal). Adds three regression tests.
2. **Worker pinned at 0% CPU after HTTP error.** httpx per-byte timeouts don't fire on a half-closed connection (server stopped reading the body but never sent FIN). Wrap each POST in `asyncio.wait_for(..., timeout=30)` (seconds) and classify the deadline as transient.
3. **SIGTERM'd runs leak segment files in /tmp.** Production wrapper uses `timeout --signal=TERM`; the CLI only registered SIGINT, so the default SIGTERM disposition terminated the process before the `TemporaryDirectory` context exited. Register SIGTERM alongside SIGINT, and add a startup sweep that removes `trio-relay-segments-*` dirs older than 1h (concurrent relays are protected by mtime: they touch their dir every segment).

Bonus correctness changes that fell out of the rewrite:

- Capture and upload now run concurrently via an asyncio.Queue, so a slow POST no longer stalls the ffmpeg reader (which would back up the kernel pipe and stutter capture).
- Segments are written to disk during capture instead of buffered in memory, which bounds memory under upload backpressure.
- httpx read/write timeouts bumped 30s → 120s for the upload client.
- Camera registration no longer leaks RTSP URLs (with credentials) into the registry payload.
- End-to-end regression test delivers a real SIGTERM and asserts teardown fires.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
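For readers skimming the commit message, the OK/TRANSIENT/FATAL classification and the 30-second outer deadline can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this PR: `Outcome`, `classify_status`, `upload_once`, and the `send` callable are hypothetical names, `OSError` stands in for `httpx.TransportError` so the sketch runs on the standard library alone, and treating every status other than 204, 401, and 403 as transient is a guess for codes the message does not mention.

```python
import asyncio
import enum
from typing import Awaitable, Callable, Tuple, Type


class Outcome(enum.Enum):
    OK = "ok"                # segment accepted
    TRANSIENT = "transient"  # log, skip this segment, keep the session alive
    FATAL = "fatal"          # abort the session


def classify_status(status_code: int) -> Outcome:
    """Map an upload response status to an outcome."""
    if status_code == 204:
        return Outcome.OK
    if status_code in (401, 403):
        return Outcome.FATAL  # auth won't self-heal
    # 5xx is transient per the commit message; lumping every other
    # status in here as well is an assumption of this sketch.
    return Outcome.TRANSIENT


async def upload_once(
    send: Callable[[], Awaitable[int]],
    deadline_s: float = 30.0,
    transient_errors: Tuple[Type[BaseException], ...] = (OSError,),
) -> Outcome:
    """Run one POST under an outer deadline and classify the result.

    `send` is a hypothetical callable that performs the POST and returns
    the HTTP status code. `transient_errors` stands in for
    httpx.TransportError in this standard-library-only sketch.
    """
    try:
        # Outer deadline: fires even if per-byte timeouts never trigger.
        status = await asyncio.wait_for(send(), timeout=deadline_s)
    except asyncio.TimeoutError:
        return Outcome.TRANSIENT
    except transient_errors:
        return Outcome.TRANSIENT
    return classify_status(status)
```

An upload loop built on this would log and move on to the next segment for `Outcome.TRANSIENT`, and only tear down the ffmpeg session for `Outcome.FATAL`.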
Summary
Three production resilience issues in `trio-relay`, fixed together because they touch the same upload loop:

1. Cloud blip = 30-60s of dead air: 5xx and `httpx.TransportError` are transient (log + skip + continue); only 401/403 abort.
2. Worker pinned at 0% CPU after HTTP error: wrap each POST in `asyncio.wait_for(..., timeout=30s)`.
3. Killed runs leak segment files in /tmp: the production wrapper uses `timeout --signal=TERM`; the CLI only registered SIGINT, so the `TemporaryDirectory` never unwound. Register SIGTERM, and add a startup sweep that removes `trio-relay-segments-*` dirs older than 1h.

Bonus correctness changes that fell out of the rewrite:

- Capture and upload run concurrently via an `asyncio.Queue`: a slow POST no longer stalls the ffmpeg reader.

Test plan
🤖 Generated with Claude Code
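As a reviewer aid, the startup sweep from the summary might look something like the sketch below. It is an approximation, not the code merged here: `sweep_stale_segment_dirs` and its `root`, `max_age_s`, and `now` parameters are hypothetical names, and the only details taken from the description are the `trio-relay-segments-` prefix, the 1h threshold, and the reliance on directory mtime to spare relays that are still running.

```python
import shutil
import tempfile
import time
from pathlib import Path
from typing import List, Optional

PREFIX = "trio-relay-segments-"  # directory prefix named in the PR description
STALE_AFTER_S = 3600             # "older than 1h"


def sweep_stale_segment_dirs(
    root: Optional[Path] = None,
    max_age_s: float = STALE_AFTER_S,
    now: Optional[float] = None,
) -> List[Path]:
    """Remove leftover segment dirs whose mtime is older than max_age_s.

    A live relay writes a new segment file into its directory regularly,
    which refreshes the directory mtime, so only abandoned dirs age out.
    Returns the paths that were removed.
    """
    root = Path(root) if root is not None else Path(tempfile.gettempdir())
    now = time.time() if now is None else now
    removed: List[Path] = []
    for path in sorted(root.glob(PREFIX + "*")):
        try:
            if path.is_symlink() or not path.is_dir():
                continue  # only real directories are candidates
            if now - path.stat().st_mtime <= max_age_s:
                continue  # recently touched: probably a concurrent relay
        except OSError:
            continue  # vanished between listing and stat; nothing to do
        shutil.rmtree(path, ignore_errors=True)
        removed.append(path)
    return removed
```

Passing `now` explicitly keeps the age check testable without touching file timestamps; a CLI would presumably call the function with defaults once at startup, before creating its own segment directory.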