Skip to content

Fix tunmux wgconf disconnect deadlock (and the cleanup fallout it caused)#3

Open
pansen wants to merge 15 commits into
CaddyGlow:mainfrom
pansen:fix/andi/cleanup_issue
Open

Fix tunmux wgconf disconnect deadlock (and the cleanup fallout it caused)#3
pansen wants to merge 15 commits into
CaddyGlow:mainfrom
pansen:fix/andi/cleanup_issue

Conversation

@pansen

@pansen pansen commented Jun 15, 2026

Copy link
Copy Markdown

Summary

wgconf disconnect could hang and leave the machine wedged: the userspace helper
ignored SIGTERM, never ran its network cleanup, and so left routes and DNS
overrides in place — hijacking the LAN until a reboot. This PR fixes the root-cause
deadlock and bundles the surrounding cleanup/observability work that the bug exposed.

The deadlock

The macOS helper runs on a single-threaded (current_thread) Tokio runtime. Its
once-per-second shutdown loop ran a diagnostic probe (wg show <iface> transfer)
inline as a blocking call. That probe connects to gotatun's own in-process UAPI
server, whose reply is produced by an async task on the very runtime thread now
blocked in the probe — a circular wait:

runtime thread → blocked in Command::output() waiting for `wg`
        `wg`   → waiting for gotatun's UAPI thread to reply
 UAPI thread   → blocking_recv, waiting for handle_api
  handle_api   → needs the runtime thread … which is blocked in `wg`

While wedged, the select! arms for SIGTERM/SIGINT and the control-socket
removal check were never polled, and the tunnel's packet tasks stopped — so the
process survived disconnect, its routes blackholed the LAN, and DNS overrides
persisted.

Fix: run the probe on spawn_blocking bounded by a timeout, so it can no
longer stall the executor. The shutdown select! (signals + socket removal) is now
always reachable, and device stop is likewise bounded by a timeout. SIGTERM is
honored, cleanup_network runs, and routes/DNS are restored on disconnect.

Resilience to stale state

To stop the system getting stuck in a phantom "connected" state after a crash or
reboot, ConnectionState gained is_live() (verifies the tunnel process/interface
still exists), and stale or corrupted state files are pruned automatically on the
next connect.

Other changes in this branch

  • wgconf status — report current connection state.
  • Streamed logs — a framed response protocol lets the privileged service capture
    and stream setup/teardown logs (including the gotatun helper's, via a tailed log
    file) back to the CLI in real time.
  • --debug flag — enable debug logging from the CLI.
  • --mtu for userspace wgconfwgconf reads MTU = from [Interface];
    --mtu overrides it for direct kernel/userspace mode and kernel --proxy (not
    --local-proxy, which creates no host TUN). Includes MTU validation and macOS
    data-plane testing notes in the README.
  • LAN reconciliation — when the VPN overlaps the local network, reconcile against
    the current LAN rather than blackholing it.
  • Dependencies — bump gotatun to 0.7.1 (and restore the smoltcp submodule
    pin after an accidental downgrade).

pansen added 12 commits June 14, 2026 09:13
### The deadlock chain

The helper runs on a **`new_current_thread`** Tokio runtime (`src/userspace_helper.rs:212`). Inside `wait_for_shutdown` it ticks once per second and, on macOS, runs a diagnostic probe **synchronously, inline, blocking the executor**:

- `src/userspace_helper.rs:392` — `next_diag_at = Instant::now()`
- `src/userspace_helper.rs:388` — `interval(1s)`; tokio's first `tick()` returns **immediately**
- `src/userspace_helper.rs:428-433` — first tick → `log_macos_dataplane_probe(...)` (a plain blocking call, not awaited)
- `src/userspace_helper.rs:451 → 505-508` — `read_wg_transfer_bytes` runs `wg show <iface> transfer` via `Command::output()` (no timeout), where `<iface>` = `control_interface_name` = `wgconf0`.

`wg show wgconf0 transfer` connects to `/var/run/wireguard/wgconf0.sock` — **gotatun's own in-process UAPI server**. Servicing that request requires gotatun's async `handle_api` task to run:

- gotatun's UAPI accept/IO runs on a dedicated `std::thread` (`uapi/mod.rs:113,203`), which calls **`send_sync` → `blocking_send` + `blocking_recv`** waiting for a response (`uapi/mod.rs:89-97`)
- the response is produced by `handle_api`, an **async task on the current-thread runtime** (`uapi/mod.rs:262`, `respond.send(...)`)

So:

```
runtime thread → blocked in Command::output() waiting for `wg` to exit
        `wg`   → blocked waiting for UAPI std-thread to write a reply
 UAPI thread   → blocking_recv, waiting for handle_api to answer
  handle_api   → needs the runtime thread to poll it … which is blocked in `wg`
```

A circular wait. The runtime thread is wedged forever, which means:

1. The `sigterm.recv()` / `sigint.recv()` arms (`:408`, `:404`) are never polled → **SIGTERM is silently ignored**.
2. The `control_socket_path.exists()` check (`:413`) never runs → socket removal wouldn't help either.
3. The device packet tasks (`handle_incoming`/`handle_outgoing`) can't be polled → **the tunnel passes no traffic**, and the routes it installed blackhole the LAN.

At disconnect, the privileged side dutifully `SIGTERM`s pid 78780, waits exactly 8 s, the wedged helper never moves, and you get `gotatun helper 78780 did not exit after SIGTERM; cleanup is incomplete` — the same pid surviving across both disconnect attempts. Cleanup (`cleanup_network`) never runs, so the routes + DNS overrides persist → LAN hijack → reboot. Same end state as before, different proximate failure — exactly your "different, but same system-broken result."
Extend `--mtu` support to include the userspace backend for `wgconf`.
Update the privileged helper to pass MTU overrides to `gotatun` and add
input validation for the MTU value. Update documentation with usage
details and macOS-specific troubleshooting for userspace data planes.
Implement a framed response protocol that allows the privileged service
to
capture log output during request processing and stream it to the CLI.
Updates the `gotatun` helper to write logs to a file that the service
tails, ensuring setup and teardown activity is reported back to the user
in real-time.
* fix/andi/cleanup_issue_mtu:
  Enable --mtu support for userspace WireGuard backends
Add `is_live()` to `ConnectionState` to verify if a tunnel process or
interface is still active. Automatically prune stale or corrupted state
files during connection attempts to prevent the system from getting
wedged in a "connected" state after a reboot.
Copilot AI review requested due to automatic review settings June 15, 2026 19:47

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR expands userspace WireGuard support (gotatun) with MTU override handling, improved helper lifecycle/logging, and better stale-state detection so reconnects don’t get wedged after reboots/crashes.

Changes:

  • Add MTU parsing/validation and plumb MTU overrides through wgconf/userspace gotatun flows.
  • Improve privileged<->CLI observability by streaming privileged-side log frames and adding per-interface helper log files + PID tracking/cleanup confirmation.
  • Add stale direct-connection detection (is_live / direct_connection_active) and a new wgconf status command.

Reviewed changes

Copilot reviewed 26 out of 28 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/wireguard/userspace.rs Adds MTU-aware up/down wrappers and passes MTU override to privileged gotatun run
src/wireguard/proxy_tunnel/mod.rs Formatting-only changes to spawned proxy handlers
src/wireguard/kernel.rs Moves MTU validation into shared config module
src/wireguard/connection.rs Adds ConnectionState::is_live() + tests for stale-state detection
src/wireguard/config.rs Parses MTU= from [Interface], adds parse_mtu/validate_mtu, and tests
src/wgconf/handlers.rs Adds wgconf status, MTU validation, and routes MTU to userspace connect
src/userspace_helper.rs Adds MTU support, helper PID/name/log/cleanup status files, macOS route reconcile, and structured cleanup
src/shared/util.rs Test formatting change
src/shared/connection_ops.rs Makes disconnect remove state even on teardown error; adds stale direct cleanup helper
src/proton/handlers.rs Formatting-only change
src/privileged_client/util.rs Import reordering
src/privileged_client/transport.rs Adds --debug propagation, detaches daemon stdio, and implements framed log/response reader
src/privileged_client/mod.rs Extends gotatun_run request with MTU override + debug flag
src/privileged_api.rs Adds MTU override + debug fields to requests and validates MTU override
src/privileged/mod.rs Streams per-request logs back to clients via framed responses; merges helper + service logs
src/privileged/dispatch.rs Passes MTU/debug through to gotatun up implementation
src/privileged/commands.rs Implements gotatun helper PID tracking, cleanup confirmation, and MTU/debug env propagation
src/mullvad/handlers.rs Formatting-only change
src/main.rs Enables debug via env and uses service logging initializer in privileged mode
src/logging.rs Adds debug env flag, service log capture writer, and synchronous file logger
src/local_proxy.rs Exposes proc_alive for connection liveness checks
src/ivpn/handlers.rs Formatting-only change
src/cli.rs Adds --debug alias for verbose and adds wgconf status subcommand + tests
src/airvpn/handlers.rs Formatting-only change
README.md Documents --debug and MTU behavior; adds macOS testing note
Cargo.toml Updates gotatun dependency version
.gitignore Ignores repo-local /var directory

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/shared/connection_ops.rs
Comment thread src/shared/connection_ops.rs Outdated
Comment thread src/userspace_helper.rs
Comment thread src/userspace_helper.rs
Comment thread src/privileged/mod.rs
Comment thread src/privileged/mod.rs
Comment thread src/userspace_helper.rs
@pansen pansen requested a review from Copilot June 15, 2026 20:07
@pansen pansen changed the title Fix: tunmux wgconf disconnect deadlock Fix tunmux wgconf disconnect deadlock (and the cleanup fallout it caused) Jun 15, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 28 changed files in this pull request and generated 5 comments.

Comment thread src/userspace_helper.rs Outdated
Comment thread src/privileged/mod.rs
Comment thread src/logging.rs
Comment thread src/privileged_client/transport.rs
Comment thread src/privileged_api.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants