fix: resolve packet_split instability (connection leak, idle hang, tunnel stall) #21

Merged
ParsaKSH merged 1 commit into main from dev on Mar 29, 2026

Conversation

@ParsaKSH
Owner

Five root causes of degradation under load and after restart:

  1. readLoop 5s backpressure blocked the entire tunnel per stuck connection: one ConnID with a full buffer caused ALL other ConnIDs on the same TunnelConn to stall for 5 seconds per frame. Reverted to an immediate drop (with a log line) so readLoop never blocks.

  2. Reorderer gap-skip only triggered on new frame arrival: if no more frames arrived, Next() was never called again, so the gap timeout never took effect and the connection hung forever. Added a 500ms gapTicker in RelayUpstreamToClient to drive gap-skip and detect idle connections (after 30s idle, give up and close).

  3. CentralServer had no idle-based cleanup after the 5-min hard limit was removed, so stuck connections accumulated indefinitely. Added lastActive tracking with 60s idle eviction, checked every 15 seconds.

  4. IncrConns() was called on all N instances per user connection, which inflated the dashboard count N-fold (40 users × 10 instances = 400 shown, not 40). Fixed to track each connection on one instance only.

  5. After restart, user connections arrived before any tunnel was ready, causing immediate SYN failures and reconnect storms. Added a HasTunnels() fast-fail check and a WaitReady() helper on TunnelPool.
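
For reviewers, the drop-instead-of-block behaviour from item 1 can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the `Frame` type, the `dispatch` helper and the per-ConnID channel map are assumed names, and the real readLoop may be organised differently.

```go
package main

import (
	"fmt"
	"log"
)

// Frame is a hypothetical demultiplexed unit; the real frame type is not shown in this PR.
type Frame struct {
	ConnID uint32
	Data   []byte
}

// dispatch hands a frame to the buffered channel of its connection.
// If the buffer is full (or the ConnID is unknown) it returns false
// immediately instead of waiting, so the caller's read loop never blocks.
func dispatch(queues map[uint32]chan Frame, f Frame) bool {
	q, ok := queues[f.ConnID]
	if !ok {
		return false
	}
	select {
	case q <- f:
		return true
	default:
		log.Printf("conn %d: buffer full, dropping frame", f.ConnID)
		return false
	}
}

func main() {
	queues := map[uint32]chan Frame{
		1: make(chan Frame, 1), // pretend the consumer of conn 1 is stuck
		2: make(chan Frame, 1),
	}
	fmt.Println(dispatch(queues, Frame{ConnID: 1})) // true: buffer had room
	fmt.Println(dispatch(queues, Frame{ConnID: 1})) // false: dropped, no wait
	fmt.Println(dispatch(queues, Frame{ConnID: 2})) // true: conn 2 is unaffected
}
```

The `select` with a `default` branch is what keeps one full buffer from delaying frames that belong to other ConnIDs on the same tunnel.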
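
The counting fix from item 4 comes down to incrementing once per user connection rather than once per instance. A minimal sketch, assuming a hypothetical `Instance` type and assuming the first instance is the one chosen to carry the count (the PR only says "one instance"):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Instance is a hypothetical stand-in for one tunnel instance with a connection counter.
type Instance struct {
	conns atomic.Int64
}

func (i *Instance) IncrConns() { i.conns.Add(1) }

// trackConn records one user connection on exactly one instance
// (the first one here, which is an assumption of this sketch).
func trackConn(instances []*Instance) {
	if len(instances) > 0 {
		instances[0].IncrConns()
	}
}

// dashboardTotal sums the per-instance counters, as a dashboard might.
func dashboardTotal(instances []*Instance) int64 {
	var total int64
	for _, in := range instances {
		total += in.conns.Load()
	}
	return total
}

func main() {
	instances := make([]*Instance, 10)
	for i := range instances {
		instances[i] = &Instance{}
	}
	for user := 0; user < 40; user++ {
		trackConn(instances)
	}
	fmt.Println(dashboardTotal(instances)) // 40
}
```

Calling `IncrConns()` inside a loop over all ten instances would make the same total come out as 400.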

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ParsaKSH merged commit 00648a7 into main on Mar 29, 2026
18 checks passed