refactor: harden DNS recovery and interactive transport behavior by ebpfx · Pull Request #76 · net2share/vaydns

ebpfx · 2026-04-15T01:32:19Z

Summary

This PR consolidates the reliability and transport-tuning work needed to make the DNS tunnel more usable on degraded or shutdown-style networks.

The main goals are:

Recover faster when the resolver path is partially dead instead of requiring manual client restarts.
Stop silently losing valid server responses under burst load.
Improve interactive downstream latency without making throughput defaults too aggressive.
Make the important transport knobs runtime tunable so production tuning does not require rebuilds.

Why This Patch Is Needed

The original code path assumed a relatively normal DNS environment. In real shutdown conditions that assumption breaks:

per-query UDP can become unusable without fully crashing the transport layer
session creation can keep retrying on a poisoned resolver path
idle sessions can get stuck in a bad state and repeatedly fail new stream opens
server-side response work can pile up under bursts
silently dropping valid responses is catastrophic because it amplifies retries and congestion
interactive apps need faster downstream polling than bulk traffic, but hardcoded timing makes that tradeoff rigid

This patch set addresses those problems without changing the tunnel wire format or the single-session model.

Files

README.md
client/client.go
client/dns.go
client/udp.go
man/vaydns-client.1
man/vaydns-server.1
vaydns-client/main.go
vaydns-server/main.go

What Changed

1. Rebuild transport on failed session creation

Changes:

When createSession fails, the client now tears down and rebuilds the resolver/DNS transport stack before retrying session creation.
Managed reconnect logging was adjusted so that "session ready" is emitted only after a session is actually usable.

Reason:

Reusing the same possibly poisoned transport after a failed KCP/Noise/smux setup can keep the client stuck on a bad path.

Impact:

Better recovery when the resolver path or lower DNS transport is degraded during session establishment.
Less need for repeated manual restarts.
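
The retry behavior described in this section can be sketched roughly as follows. This is a minimal illustration, not the code from client/client.go: `Transport`, `Session`, `closerFunc`, and `connectWithRebuild` are hypothetical names, and the real client presumably adds delays and logging between attempts.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Transport stands in for the resolver/DNS transport stack (hypothetical).
type Transport interface{ io.Closer }

// closerFunc adapts a plain function to Transport, handy for stubs.
type closerFunc func() error

func (f closerFunc) Close() error { return f() }

// Session stands in for an established KCP/Noise/smux session (hypothetical).
type Session struct{ ID int }

var errNoAttempts = errors.New("connect: maxAttempts must be positive")

// connectWithRebuild tries to establish a session up to maxAttempts times.
// Every attempt builds a fresh transport; a transport whose session setup
// failed is closed and never reused.
func connectWithRebuild(
	dial func() (Transport, error),
	create func(Transport) (*Session, error),
	maxAttempts int,
) (*Session, Transport, error) {
	lastErr := errNoAttempts
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		tr, err := dial()
		if err != nil {
			lastErr = fmt.Errorf("attempt %d: build transport: %w", attempt, err)
			continue
		}
		sess, err := create(tr)
		if err != nil {
			_ = tr.Close() // tear down the possibly poisoned transport
			lastErr = fmt.Errorf("attempt %d: create session: %w", attempt, err)
			continue
		}
		return sess, tr, nil
	}
	return nil, nil, lastErr
}
```

The point of the sketch is the ownership rule: the transport is created inside the loop, so a failed setup can never leak its transport into the next attempt.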

2. Detect stale per-query UDP transport from the session layer

Changes:

UDPPacketConn now tracks the time of the last successful DNS response.
The managed client loop uses that timestamp to detect stale per-query UDP transport.
Sessions are only retired for staleness when streams actually need transport.

Reason:

Per-query UDP workers can keep failing while the higher DNS packet layer still appears alive.
Before this patch, the session manager had no direct visibility into that failure mode.

Impact:

Faster recovery from dead or poisoned resolver paths.
Avoids killing a healthy idle session just because it has been quiet for a long time.
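
A minimal sketch of the staleness check, assuming a hypothetical `responseClock` type in place of the timestamp that UDPPacketConn tracks. The names and the "no baseline yet means not stale" rule are assumptions made for illustration; only the 3s default comes from this PR.

```go
package main

import (
	"sync"
	"time"
)

// defaultStaleTimeout mirrors the udp-transport-stale-timeout default (3s).
const defaultStaleTimeout = 3 * time.Second

// responseClock records when the last successful DNS response arrived.
// Storing a time.Time keeps the monotonic reading from time.Now, so
// wall-clock jumps do not cause false staleness.
type responseClock struct {
	mu   sync.Mutex
	last time.Time
}

// Mark records a successful response at the given instant.
func (c *responseClock) Mark(now time.Time) {
	c.mu.Lock()
	c.last = now
	c.mu.Unlock()
}

// Last returns the instant of the last successful response (zero if none).
func (c *responseClock) Last() time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.last
}

// transportStale reports whether the session should be retired for staleness.
// It never fires while no stream needs the transport, so a quiet but healthy
// idle session is left alone.
func transportStale(c *responseClock, now time.Time, timeout time.Duration, streamsNeedTransport bool) bool {
	if !streamsNeedTransport || timeout <= 0 {
		return false
	}
	last := c.Last()
	if last.IsZero() {
		return false // assumption: no baseline yet, so do not retire
	}
	return now.Sub(last) > timeout
}
```

A managed loop would call `Mark(time.Now())` on every successful response and evaluate `transportStale` only when a stream is waiting on the transport.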

3. Retire only idle poisoned sessions after repeated stream-open failures

Changes:

Added an idle-only consecutive stream-open failure counter.
The client retires a session only after repeated OpenStream failures when there are no active streams.
Successful stream opens reset the counter.

Reason:

An idle poisoned session should not remain forever.
A busy session should not be killed because one new stream failed during a short resolver problem.

Impact:

Improves recovery for new connections.
Preserves already-working streams.
Keeps the single-session design intact.
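
The counter logic can be sketched as below. `openFailureTracker` is a hypothetical name, and leaving the counter untouched on failures that happen while streams are active is an assumption; the PR text only states that the counter is idle-only and that successes reset it. The sketch is not safe for concurrent use without external locking.

```go
package main

// openFailureTracker counts consecutive OpenStream failures that occur
// while the session has no active streams.
type openFailureTracker struct {
	limit       int // e.g. the open-stream-failure-limit value (default 3)
	consecutive int
}

// RecordSuccess resets the counter after a successful stream open.
func (t *openFailureTracker) RecordSuccess() { t.consecutive = 0 }

// RecordFailure notes a failed stream open and reports whether the session
// should be retired. Failures on a busy session are ignored so that
// already-working streams are preserved.
func (t *openFailureTracker) RecordFailure(activeStreams int) bool {
	if activeStreams > 0 {
		return false // assumption: busy failures neither count nor reset
	}
	t.consecutive++
	return t.limit > 0 && t.consecutive >= t.limit
}
```

With a limit of 3, two idle failures followed by a failure on a busy session do not retire anything; the next idle failure does.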

4. Make client polling stream-aware and tunable

Changes:

Added stream-aware DNS polling with separate idle and active timing.
Added new client flags:
- -poll-delay
- -active-poll-delay
- -poll-max-delay
- -udp-transport-stale-timeout
- -open-stream-failure-limit
Managed and library Handle() paths now both contribute to active-stream tracking.

Reason:

Interactive apps need downstream polling to stay aggressive while streams are active.
Idle traffic should still back off to avoid unnecessary resolver load.

Impact:

Better downstream responsiveness for interactive traffic.
Conservative idle behavior remains intact.
Production tuning can happen without rebuilds.
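
One way to picture the idle and active timing is the sketch below. The struct, the function name, and the doubling backoff are assumptions for illustration; the PR only states the three flags and their defaults, not the exact backoff curve.

```go
package main

import "time"

// pollConfig groups the three polling knobs (hypothetical struct).
type pollConfig struct {
	IdleDelay   time.Duration // -poll-delay
	ActiveDelay time.Duration // -active-poll-delay
	MaxDelay    time.Duration // -poll-max-delay
}

// defaultPollConfig mirrors the client defaults listed in this PR.
var defaultPollConfig = pollConfig{
	IdleDelay:   500 * time.Millisecond,
	ActiveDelay: 200 * time.Millisecond,
	MaxDelay:    2 * time.Second,
}

// nextPollDelay returns how long to wait before the next poll query.
// prev is the delay used before the previous poll (0 on the first call).
func nextPollDelay(cfg pollConfig, prev time.Duration, activeStreams int, gotData bool) time.Duration {
	if activeStreams > 0 {
		return cfg.ActiveDelay // stay aggressive while streams are active
	}
	if gotData || prev < cfg.IdleDelay {
		return cfg.IdleDelay // restart the idle backoff
	}
	next := prev * 2 // assumption: exponential backoff while idle
	if next > cfg.MaxDelay {
		next = cfg.MaxDelay
	}
	return next
}
```

Under these assumptions an idle client polls at 500ms, 1s, then 2s and stays there, and drops back to 200ms as soon as a stream becomes active.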

5. Fix fractional `-rps` handling

Changes:

NewRateLimiter now uses a minimum bucket capacity of 1.0 token even when rps < 1.

Reason:

The old token bucket could not accumulate a whole token for fractional rates, so -rps values below 1 were effectively broken.

Impact:

Fractional rate limits now behave correctly.
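
To see why the capacity floor matters, here is a small token-bucket sketch. It is not the project's NewRateLimiter; the type, the constructor signature, and starting with a full bucket are assumptions. With capacity equal to a fractional rps, the token count is clamped below 1 and a query is never allowed; flooring the capacity at 1.0 lets one whole token accumulate.

```go
package main

import "time"

// rateLimiter is a simple token bucket (hypothetical, single-goroutine use).
type rateLimiter struct {
	rate     float64 // tokens added per second
	capacity float64 // maximum stored tokens
	tokens   float64
	last     time.Time
}

// newRateLimiter builds a limiter for rps queries per second.
// The capacity is floored at 1.0 so fractional rates can still
// accumulate a whole token.
func newRateLimiter(rps float64, now time.Time) *rateLimiter {
	capacity := rps
	if capacity < 1.0 {
		capacity = 1.0
	}
	return &rateLimiter{rate: rps, capacity: capacity, tokens: capacity, last: now}
}

// Allow refills the bucket for the elapsed time and spends one token
// if a whole token is available.
func (r *rateLimiter) Allow(now time.Time) bool {
	if elapsed := now.Sub(r.last).Seconds(); elapsed > 0 {
		r.tokens += elapsed * r.rate
		if r.tokens > r.capacity {
			r.tokens = r.capacity
		}
		r.last = now
	}
	if r.tokens >= 1.0 {
		r.tokens -= 1.0
		return true
	}
	return false
}
```

For example, with rps = 0.5 the bucket refills one token every two seconds, so at most one query passes per two-second window after the initial token is spent.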

6. Make server response queue overload visible

Changes:

Response enqueue remains explicitly drop-based: when the response queue is full, the response is dropped and counted.
Added a server runtime flag:
- -response-queue-size
Added a dropped-response counter in periodic server stats.

Reason:

Silent response drops are one of the worst failure modes in a DNS tunnel.
The server needs to stay responsive under burst load instead of turning queue pressure into hidden behavior.
For this queue, drop-with-visibility is the safer failure mode.

Impact:

Server stays responsive under burst load.
Response loss is now visible through the server stats counter.
Response queue depth remains tunable in production.
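
Drop-with-visibility can be sketched with a buffered channel and an atomic counter. `responseQueue` and its methods are hypothetical names, and the fallback to 512 when the size is not positive is an assumption modeled on the "effective default 512" note in the defaults; the server's actual wiring may differ. The sketch needs Go 1.19 or newer for `atomic.Uint64`.

```go
package main

import "sync/atomic"

// responseQueue is a bounded queue that drops on overflow and counts drops.
type responseQueue struct {
	ch      chan []byte
	dropped atomic.Uint64
}

// newResponseQueue creates a queue with the given capacity.
// A non-positive size falls back to 512 (assumed effective default).
func newResponseQueue(size int) *responseQueue {
	if size <= 0 {
		size = 512
	}
	return &responseQueue{ch: make(chan []byte, size)}
}

// Enqueue never blocks. It returns false and increments the drop
// counter when the queue is full.
func (q *responseQueue) Enqueue(resp []byte) bool {
	select {
	case q.ch <- resp:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

// Dropped returns the number of responses dropped so far, suitable for
// a periodic stats line.
func (q *responseQueue) Dropped() uint64 { return q.dropped.Load() }
```

Because `Enqueue` uses a `select` with a `default` branch, the receive path is never stalled by a full queue, and the loss shows up in `Dropped()` instead of disappearing.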

7. Reduce server response hold from 1s to 200ms

Changes:

Replaced the old hardcoded 1s downstream response wait with a configurable response delay.
Default is now 200ms.

Reason:

A 1-second response hold is too slow for interactive apps over a DNS tunnel.
200ms keeps the wait window comfortably below the client UDP timeout while still leaving enough room to bundle downstream packets.

Impact:

Better downstream latency for interactive traffic.
Safer default than the old 1-second hold.
Throughput holds up better than it would with more aggressive, latency-only values.
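
The trade-off between hold time and bundling can be illustrated with the sketch below. It assumes the server waits up to the response delay for a first downstream packet and then bundles whatever is already queued; `gatherDownstream` and its packet-count limit are hypothetical and are not taken from vaydns-server/main.go.

```go
package main

import "time"

// defaultResponseDelay mirrors the new 200ms default from this PR.
const defaultResponseDelay = 200 * time.Millisecond

// gatherDownstream waits up to hold for the first downstream packet, then
// bundles packets that are already queued, up to maxPackets in total.
// It returns nil when the hold expires with nothing to send, in which case
// the caller would answer the query with an empty payload.
func gatherDownstream(pending <-chan []byte, hold time.Duration, maxPackets int) [][]byte {
	timer := time.NewTimer(hold)
	defer timer.Stop()

	var out [][]byte
	// Phase 1: block until the first packet arrives or the hold expires.
	select {
	case p, ok := <-pending:
		if !ok {
			return nil // queue closed
		}
		out = append(out, p)
	case <-timer.C:
		return nil
	}
	// Phase 2: take what is immediately available, without waiting further.
	for len(out) < maxPackets {
		select {
		case p, ok := <-pending:
			if !ok {
				return out
			}
			out = append(out, p)
		default:
			return out
		}
	}
	return out
}
```

A shorter hold returns empty answers sooner, which helps interactive latency, while a longer hold gives more packets a chance to share one response.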

8. Reduce edge TCP latency with `TCP_NODELAY`

Changes:

Enable TCP_NODELAY on the client local TCP side.
Enable TCP_NODELAY on the server upstream TCP side.

Reason:

Small interactive packets should not sit behind Nagle delays when the DNS tunnel is already the main bottleneck.

Impact:

Lower latency for chat/proxy-style traffic with many small writes.
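
In Go this is typically a type assertion followed by `(*net.TCPConn).SetNoDelay`. The helper below is a sketch with a hypothetical name. Note that Go's `net` package documents no-delay as the default for TCP connections, so an explicit call mainly makes the intent visible and guards against code that changed the setting earlier.

```go
package main

import "net"

// setNoDelay disables Nagle's algorithm when conn is a TCP connection.
// Non-TCP connections (for example Unix sockets or in-memory pipes) are
// left untouched and reported as success.
func setNoDelay(conn net.Conn) error {
	tcp, ok := conn.(*net.TCPConn)
	if !ok {
		return nil
	}
	return tcp.SetNoDelay(true)
}
```

A caller would apply it right after `Accept` on the client's local listener and right after dialing the server's upstream address.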

Defaults After This Patch

Client

poll-delay = 500ms
active-poll-delay = 200ms
poll-max-delay = 2s
udp-transport-stale-timeout = 3s
open-stream-failure-limit = 3

Server

response-delay = 200ms
response-workers = 2
response-queue-size = 0 (inherits queue-size; effective default 512)
response queue overflow policy: always drop (not configurable)

Backward Compatibility

No wire format changes.
No protocol framing changes in DNS/KCP/Noise/smux.
Single-session design is preserved.
Public client constructor compatibility is preserved.
The new runtime flags are additive.

Expected Improvements

Fewer manual client restarts after resolver degradation
Faster recovery from dead per-query UDP paths
Better recovery from poisoned idle sessions
No more silent loss of valid response work under burst load
Better interactive latency for apps that rely on small, frequent packets
Runtime-tunable transport behavior for production resolver differences

Tighten the client and server DNS transport path for shutdown-style networks where resolver behavior is bursty, forged, delayed, or partially dead.

Key changes:

- rebuild the transport stack when session creation fails instead of retrying on a poisoned resolver path
- track per-query UDP health and retire stale sessions only when streams actually need transport
- retire idle poisoned sessions after repeated stream-open failures without dropping busy sessions
- lower the server response hold default from 1s to 200ms and default to two response workers for better interactive behavior without hurting downstream bundling too aggressively
- make client polling stream-aware and tunable, with a 200ms active poll cap and a 2s idle max backoff
- enable TCP_NODELAY on client local TCP sockets and server upstream TCP sockets to reduce small-packet latency
- fix fractional -rps handling so rates below 1 query/sec still work correctly
- document the new tuning knobs and defaults

This keeps the single-session model and tunnel wire format intact while improving recovery, stability, and responsiveness under censored or unreliable resolver paths.

ebpfx force-pushed the fix/recovery branch from b10bfd9 to c408f93 Compare April 15, 2026 04:35

ebpfx wants to merge 1 commit into net2share:main from ebpfx:fix/recovery
