
fix(relay): disk-backed segments, transient-error tolerance, SIGTERM #69

Merged
Liuhaai merged 2 commits into main from feat/relay-disk-segments-transient-tolerance on May 18, 2026

@Liuhaai Liuhaai commented May 18, 2026


Summary

Three production resilience issues in trio-relay, fixed together because they touch the same upload loop:

  1. Cloud blip = 30-60s of dead air. Previously, any non-204 response killed the ffmpeg session and the wrapper script had to respawn it. Upload outcomes are now classified as OK/TRANSIENT/FATAL: 5xx responses and httpx.TransportError are transient (log + skip + continue); only 401/403 abort.
  2. Worker hung at 0% CPU after an HTTP error. httpx's per-chunk read/write timeouts don't fire on a half-closed connection, so each POST is now wrapped in asyncio.wait_for(..., timeout=30), a 30s overall deadline.
  3. Killed runs leak segment files in /tmp. The production wrapper uses timeout --signal=TERM, but the CLI only registered SIGINT, so the TemporaryDirectory never unwound. The CLI now registers SIGTERM as well, and a startup sweep removes trio-relay-segments-* dirs older than 1h.
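
The OK/TRANSIENT/FATAL split in item 1 can be sketched as a small pure function. This is an illustration under stated assumptions, not the code from this PR: the names `Outcome` and `classify_upload` are hypothetical, `None` stands in for a transport error or a missed deadline, and treating every other status code as transient is an assumption drawn from "only 401/403 abort".

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "ok"
    TRANSIENT = "transient"
    FATAL = "fatal"


def classify_upload(status: Optional[int]) -> Outcome:
    """Map one upload attempt to an outcome.

    `status` is the HTTP status code, or None when no response arrived
    (transport error or deadline). Hypothetical helper, not the PR's API.
    """
    if status is None:
        return Outcome.TRANSIENT  # no response: log, skip the segment, continue
    if status == 204:
        return Outcome.OK  # the success response named in the description
    if status in (401, 403):
        return Outcome.FATAL  # auth won't self-heal, so abort
    # 5xx is transient per the description; other codes are ASSUMED
    # transient here because only 401/403 are said to abort.
    return Outcome.TRANSIENT
```

Keeping the decision in a pure function means the 5xx, 401, and transport-error cases can be tested without a network.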

Bonus correctness changes that fell out of the rewrite:

  • Capture and upload run concurrently via an asyncio.Queue, so a slow POST no longer stalls the ffmpeg reader.
  • Segments are written to disk during capture instead of buffered in memory, which bounds memory under upload backpressure.
  • httpx upload timeouts bumped 30s → 120s.
  • Camera registration no longer leaks RTSP credentials into the registry payload.
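
The concurrent capture/upload arrangement with disk-backed segments might look roughly like the sketch below. It is illustrative only: `capture`, `upload_loop`, `relay`, and the `post` callable are hypothetical names, an in-memory iterable stands in for the ffmpeg reader, and the queue carries file paths rather than segment bytes so that memory stays small while uploads lag.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Awaitable, Callable, Iterable, Optional


async def capture(segments: Iterable[bytes], out_dir: Path,
                  queue: "asyncio.Queue[Optional[Path]]") -> None:
    """Write each segment to disk and enqueue only its path."""
    for index, payload in enumerate(segments):
        path = out_dir / f"segment-{index:05d}.bin"
        path.write_bytes(payload)  # blocking write; acceptable for a sketch
        await queue.put(path)
    await queue.put(None)  # sentinel: capture has finished


async def upload_loop(queue: "asyncio.Queue[Optional[Path]]",
                      post: Callable[[bytes], Awaitable[None]]) -> int:
    """Drain the queue, posting each segment and deleting its file."""
    sent = 0
    while True:
        path = await queue.get()
        if path is None:
            return sent
        try:
            await post(path.read_bytes())
            sent += 1
        finally:
            path.unlink(missing_ok=True)  # never keep a handled segment


async def relay(segments: Iterable[bytes],
                post: Callable[[bytes], Awaitable[None]]) -> int:
    """Run capture and upload concurrently; return the number posted."""
    queue: "asyncio.Queue[Optional[Path]]" = asyncio.Queue()
    with tempfile.TemporaryDirectory(prefix="trio-relay-segments-") as tmp:
        _, sent = await asyncio.gather(
            capture(segments, Path(tmp), queue),
            upload_loop(queue, post),
        )
    return sent
```

The queue is left unbounded on purpose in this sketch: a bounded queue would make the capture side wait whenever uploads fall behind, which is the stall the change is meant to remove.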

Test plan

  • 3 new tests for transient/fatal classification (5xx, 401, transport error)
  • Upload deadline test with a hung POST
  • Concurrent capture/upload test (second segment captured while first POST is in-flight)
  • 3 stale-tmpdir sweep tests (removes old, leaves fresh, threshold)
  • End-to-end SIGTERM regression test delivers a real signal and asserts teardown fires
  • All previous segmented-post tests updated for new disk-backed reader
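
A hung-POST deadline test along the lines of the second bullet could be built on a sketch like this one. The names `post_with_deadline` and `hung_post` are hypothetical, and the short deadline is only there to keep the example fast; the description uses 30s.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def post_with_deadline(post: Callable[[bytes], Awaitable[int]],
                             body: bytes,
                             deadline_s: float = 30.0) -> Optional[int]:
    """Return the status code, or None if the overall deadline expired.

    asyncio.wait_for cancels the inner coroutine on timeout, so a POST
    stuck on a half-closed connection cannot hold the worker forever.
    """
    try:
        return await asyncio.wait_for(post(body), timeout=deadline_s)
    except asyncio.TimeoutError:
        return None  # the caller would treat this as a transient failure


async def hung_post(body: bytes) -> int:
    """Stand-in for a POST that never completes."""
    await asyncio.Event().wait()  # this event is never set
    return 204  # unreachable
```

Calling `asyncio.run(post_with_deadline(hung_post, b"x", deadline_s=0.05))` returns `None` once the deadline passes instead of blocking.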

🤖 Generated with Claude Code

@Liuhaai Liuhaai force-pushed the feat/relay-disk-segments-transient-tolerance branch from 8f4612a to 3da0e5f Compare May 18, 2026 16:06
Liuhaai and others added 2 commits May 18, 2026 09:10
Three production resilience issues observed in trio-relay, fixed
together because they touch the same upload loop:

1. **Cloud blip = 30-60s of dead air.** Any non-204 response killed
   the whole ffmpeg session and the wrapper script had to respawn.
   Classify outcomes as OK/TRANSIENT/FATAL: 5xx and httpx
   TransportError are transient (log + skip + continue), only 401/403
   abort (auth won't self-heal). Adds three regression tests.

2. **Worker pinned at 0% CPU after HTTP error.** httpx per-chunk
   timeouts don't fire on a half-closed connection (server stopped
   reading body but never sent FIN). Wrap each POST in
   `asyncio.wait_for(..., timeout=30)` (a 30s overall deadline) and
   classify the deadline as transient.

3. **Killed runs leak segment files in /tmp.** Production wrapper
   uses `timeout --signal=TERM`; the CLI only registered SIGINT, so
   the default SIGTERM action terminated the process before the
   `TemporaryDirectory` context exited. Register SIGTERM alongside
   SIGINT, and add a startup sweep that removes
   `trio-relay-segments-*` dirs older than 1h (concurrent relays are
   protected by mtime: they touch their dir every segment).
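
The startup sweep in item 3 can be sketched as follows. The function name, its parameters, and the strict "older than" comparison at the one-hour boundary are assumptions for illustration; only the `trio-relay-segments-` prefix, the 1h age, and the use of mtime come from the description above.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def sweep_stale_segment_dirs(root: Path,
                             prefix: str = "trio-relay-segments-",
                             max_age_s: float = 3600.0,
                             now: Optional[float] = None) -> List[Path]:
    """Remove directories under `root` matching `prefix` whose mtime is
    older than `max_age_s`. Returns the directories that were removed."""
    now = time.time() if now is None else now
    removed: List[Path] = []
    for entry in sorted(root.glob(prefix + "*")):
        if not entry.is_dir():
            continue
        try:
            age = now - entry.stat().st_mtime
        except FileNotFoundError:
            continue  # another process cleaned it up in the meantime
        if age > max_age_s:
            shutil.rmtree(entry, ignore_errors=True)
            removed.append(entry)
    return removed
```

A live relay that touches its directory on every segment keeps a recent mtime, so the sweep leaves it alone.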

Bonus correctness changes that fell out of the rewrite:
- Capture and upload now run concurrently via an asyncio.Queue, so a
  slow POST no longer stalls the ffmpeg reader (which would back up
  the kernel pipe and stutter capture).
- Segments are written to disk during capture instead of buffered in
  memory — bounds memory under upload backpressure.
- httpx read/write timeouts bumped 30s → 120s for the upload client.
- Camera registration no longer leaks RTSP URLs (with credentials)
  into the registry payload.
- End-to-end regression test delivers a real SIGTERM and asserts
  teardown fires.
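
How registering SIGTERM lets the `TemporaryDirectory` unwind can be sketched as below. This is a minimal, Unix-only illustration run from the main thread; `run_until_signalled` is a hypothetical name and the real CLI wiring is not shown in this PR text.

```python
import asyncio
import os
import signal
import tempfile


async def run_until_signalled(prefix: str = "trio-relay-segments-") -> str:
    """Hold a temporary directory open until SIGINT or SIGTERM arrives,
    then leave the context normally so the directory is removed."""
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    handled = (signal.SIGINT, signal.SIGTERM)
    for sig in handled:
        # Without a SIGTERM handler the process is terminated on the
        # spot and the `with` block below never gets to clean up.
        loop.add_signal_handler(sig, stop.set)
    try:
        with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
            await stop.wait()  # the capture/upload work would run here
            return tmp
    finally:
        for sig in handled:
            loop.remove_signal_handler(sig)
```

A regression test in this style can start the coroutine as a task, send the process a real SIGTERM with `os.kill(os.getpid(), signal.SIGTERM)`, and assert that the returned path no longer exists.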

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Liuhaai Liuhaai force-pushed the feat/relay-disk-segments-transient-tolerance branch from 3da0e5f to 716529b Compare May 18, 2026 16:10
@Liuhaai Liuhaai merged commit d3022ad into main May 18, 2026
7 checks passed
@Liuhaai Liuhaai deleted the feat/relay-disk-segments-transient-tolerance branch May 18, 2026 16:13