refactor: harden DNS recovery and interactive transport behavior #76

Open
ebpfx wants to merge 1 commit into net2share:main from ebpfx:fix/recovery

@ebpfx commented Apr 15, 2026

Summary

This PR consolidates the reliability and transport-tuning work needed to make the DNS tunnel more usable on degraded or shutdown-style networks.

The main goals are:

  1. Recover faster when the resolver path is partially dead instead of requiring manual client restarts.
  2. Stop silently losing valid server responses under burst load.
  3. Improve interactive downstream latency without making throughput defaults too aggressive.
  4. Make the important transport knobs runtime tunable so production tuning does not require rebuilds.

Why This Patch Is Needed

The original code path assumed a relatively normal DNS environment. In real shutdown conditions that assumption breaks:

  • per-query UDP can become unusable without fully crashing the transport layer
  • session creation can keep retrying on a poisoned resolver path
  • idle sessions can get stuck in a bad state and repeatedly fail new stream opens
  • server-side response work can pile up under bursts
  • silently dropping valid responses is catastrophic because it amplifies retries and congestion
  • interactive apps need faster downstream polling than bulk traffic, but hardcoded timing makes that tradeoff rigid

This patch set addresses those problems without changing the tunnel wire format or the single-session model.

Files

  • README.md
  • client/client.go
  • client/dns.go
  • client/udp.go
  • man/vaydns-client.1
  • man/vaydns-server.1
  • vaydns-client/main.go
  • vaydns-server/main.go

What Changed

1. Rebuild transport on failed session creation

Changes:

  • When createSession fails, the client now tears down and rebuilds the resolver/DNS transport stack before retrying session creation.
  • Managed reconnect logging was adjusted so that "session ready" is logged only after a session is actually usable.

Reason:

  • Reusing the same possibly poisoned transport after a failed KCP/Noise/smux setup can keep the client stuck on a bad path.

Impact:

  • Better recovery when the resolver path or lower DNS transport is degraded during session establishment.
  • Less need for repeated manual restarts.

2. Detect stale per-query UDP transport from the session layer

Changes:

  • UDPPacketConn now tracks the time of the last successful DNS response.
  • The managed client loop uses that timestamp to detect stale per-query UDP transport.
  • Sessions are only retired for staleness when streams actually need transport.

Reason:

  • Per-query UDP workers can keep failing while the higher DNS packet layer still appears alive.
  • Before this patch, the session manager had no direct visibility into that failure mode.

Impact:

  • Faster recovery from dead or poisoned resolver paths.
  • Avoids killing a healthy idle session just because it has been quiet for a long time.

3. Retire only idle poisoned sessions after repeated stream-open failures

Changes:

  • Added an idle-only consecutive stream-open failure counter.
  • The client retires a session only after repeated OpenStream failures when there are no active streams.
  • Successful stream opens reset the counter.

Reason:

  • An idle poisoned session should not remain forever.
  • A busy session should not be killed because one new stream failed during a short resolver problem.

Impact:

  • Improves recovery for new connections.
  • Preserves already-working streams.
  • Keeps the single-session design intact.
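
The counter logic might look something like the sketch below. `openFailureTracker`, `record`, and `errOpen` are hypothetical names. Whether a failure on a busy session leaves the counter untouched (as here) or resets it is an assumption, since the description does not say.

```go
package main

import "errors"

// errOpen is a placeholder error for a failed OpenStream call.
var errOpen = errors.New("open stream failed")

// openFailureTracker counts consecutive stream-open failures that
// happen while the session has no active streams.
type openFailureTracker struct {
	limit    int // e.g. the -open-stream-failure-limit value
	failures int
}

// record takes the result of one OpenStream attempt and returns true
// when the session should be retired.
func (t *openFailureTracker) record(openErr error, activeStreams int) bool {
	if openErr == nil {
		t.failures = 0 // a successful open resets the counter
		return false
	}
	if activeStreams > 0 {
		// Busy session: existing streams still work, so one failed
		// open is not counted against it (assumption).
		return false
	}
	t.failures++
	return t.failures >= t.limit
}
```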

4. Make client polling stream-aware and tunable

Changes:

  • Added stream-aware DNS polling with separate idle and active timing.
  • Added new client flags:
    • -poll-delay
    • -active-poll-delay
    • -poll-max-delay
    • -udp-transport-stale-timeout
    • -open-stream-failure-limit
  • Managed and library Handle() paths now both contribute to active-stream tracking.

Reason:

  • Interactive apps need downstream polling to stay aggressive while streams are active.
  • Idle traffic should still back off to avoid unnecessary resolver load.

Impact:

  • Better downstream responsiveness for interactive traffic.
  • Conservative idle behavior remains intact.
  • Production tuning can happen without rebuilds.

5. Fix fractional -rps handling

Changes:

  • NewRateLimiter now uses a minimum bucket capacity of 1.0 token even when rps < 1.

Reason:

  • The old token bucket could not accumulate a whole token for fractional rates, so -rps values below 1 were effectively broken.

Impact:

  • Fractional rate limits now behave correctly.

6. Make server response queue overload visible

Changes:

  • Response enqueue remains explicitly drop-based: when the queue is full, the response is dropped rather than blocking the sender.
  • Added a server runtime flag:
    • -response-queue-size
  • Added a dropped-response counter in periodic server stats.

Reason:

  • Silent response drops are one of the worst failure modes in a DNS tunnel.
  • The server needs to stay responsive under burst load instead of turning queue pressure into hidden behavior.
  • For this queue, drop-with-visibility is the safer failure mode.

Impact:

  • Server stays responsive under burst load.
  • Response loss is now visible through the server stats counter.
  • Response queue depth remains tunable in production.
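
A drop-with-visibility queue can be sketched as a non-blocking channel send plus a counter. `responseQueue`, `enqueue`, and `droppedCount` are hypothetical names; the actual server code and its stats output may be structured differently.

```go
package main

import "sync/atomic"

// responseQueue is a bounded queue that drops on overflow and counts
// every drop so the loss shows up in periodic stats.
type responseQueue struct {
	dropped uint64 // kept first so 64-bit atomics stay aligned on 32-bit platforms
	ch      chan []byte
}

// newResponseQueue creates a queue with the given depth
// (the -response-queue-size value in this sketch).
func newResponseQueue(size int) *responseQueue {
	return &responseQueue{ch: make(chan []byte, size)}
}

// enqueue never blocks. It returns false and increments the drop
// counter when the queue is full.
func (q *responseQueue) enqueue(resp []byte) bool {
	select {
	case q.ch <- resp:
		return true
	default:
		atomic.AddUint64(&q.dropped, 1)
		return false
	}
}

// droppedCount returns the number of responses dropped so far.
func (q *responseQueue) droppedCount() uint64 {
	return atomic.LoadUint64(&q.dropped)
}
```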

7. Reduce server response hold from 1s to 200ms

Changes:

  • Replaced the old hardcoded 1s downstream response wait with a configurable response delay.
  • Default is now 200ms.

Reason:

  • A 1-second response hold is too slow for interactive apps over a DNS tunnel.
  • 200ms keeps the wait window comfortably below the client UDP timeout while still leaving enough room to bundle downstream packets.

Impact:

  • Better downstream latency for interactive traffic.
  • Safer default than the old 1-second hold.
  • Throughput holds up better than it would with more aggressive, latency-only values.

8. Reduce edge TCP latency with TCP_NODELAY

Changes:

  • Enable TCP_NODELAY on the client local TCP side.
  • Enable TCP_NODELAY on the server upstream TCP side.

Reason:

  • Small interactive packets should not sit behind Nagle delays when the DNS tunnel is already the main bottleneck.

Impact:

  • Lower latency for chat/proxy-style traffic with many small writes.

Defaults After This Patch

Client

  • poll-delay = 500ms
  • active-poll-delay = 200ms
  • poll-max-delay = 2s
  • udp-transport-stale-timeout = 3s
  • open-stream-failure-limit = 3

Server

  • response-delay = 200ms
  • response-workers = 2
  • response-queue-size = 0, which falls back to queue-size (effective default 512)
  • response queue overflow policy is fixed to drop (not configurable)

Backward Compatibility

  • No wire format changes.
  • No protocol framing changes in DNS/KCP/Noise/smux.
  • Single-session design is preserved.
  • Public client constructor compatibility is preserved.
  • The new runtime flags are additive.

Expected Improvements

  • Fewer manual client restarts after resolver degradation
  • Faster recovery from dead per-query UDP paths
  • Better recovery from poisoned idle sessions
  • No more silent loss of valid response work under burst load
  • Better interactive latency for apps that rely on small, frequent packets
  • Runtime-tunable transport behavior for production resolver differences

Tighten the client and server DNS transport path for shutdown-style
networks where resolver behavior is bursty, forged, delayed, or
partially dead.

Key changes:
- rebuild the transport stack when session creation fails instead of
  retrying on a poisoned resolver path
- track per-query UDP health and retire stale sessions only when streams
  actually need transport
- retire idle poisoned sessions after repeated stream-open failures
  without dropping busy sessions
- lower the server response hold default from 1s to 200ms and default to
  two response workers for better interactive behavior without hurting
  downstream bundling too aggressively
- make client polling stream-aware and tunable, with a 200ms active poll
  cap and a 2s idle max backoff
- enable TCP_NODELAY on client local TCP sockets and server upstream TCP
  sockets to reduce small-packet latency
- fix fractional -rps handling so rates below 1 query/sec still work
  correctly
- document the new tuning knobs and defaults

This keeps the single-session model and tunnel wire format intact while
improving recovery, stability, and responsiveness under censored or
unreliable resolver paths.