Fix tunmux wgconf disconnect deadlock (and the cleanup fallout it caused)#3
Open
pansen wants to merge 15 commits into
Open
Fix tunmux wgconf disconnect deadlock (and the cleanup fallout it caused)#3pansen wants to merge 15 commits into
tunmux wgconf disconnect deadlock (and the cleanup fallout it caused)#3pansen wants to merge 15 commits into
Conversation
### The deadlock chain
The helper runs on a **`new_current_thread`** Tokio runtime (`src/userspace_helper.rs:212`). Inside `wait_for_shutdown` it ticks once per second and, on macOS, runs a diagnostic probe **synchronously, inline, blocking the executor**:
- `src/userspace_helper.rs:392` — `next_diag_at = Instant::now()`
- `src/userspace_helper.rs:388` — `interval(1s)`; tokio's first `tick()` returns **immediately**
- `src/userspace_helper.rs:428-433` — first tick → `log_macos_dataplane_probe(...)` (a plain blocking call, not awaited)
- `src/userspace_helper.rs:451 → 505-508` — `read_wg_transfer_bytes` runs `wg show <iface> transfer` via `Command::output()` (no timeout), where `<iface>` = `control_interface_name` = `wgconf0`.
`wg show wgconf0 transfer` connects to `/var/run/wireguard/wgconf0.sock` — **gotatun's own in-process UAPI server**. Servicing that request requires gotatun's async `handle_api` task to run:
- gotatun's UAPI accept/IO runs on a dedicated `std::thread` (`uapi/mod.rs:113,203`), which calls **`send_sync` → `blocking_send` + `blocking_recv`** waiting for a response (`uapi/mod.rs:89-97`)
- the response is produced by `handle_api`, an **async task on the current-thread runtime** (`uapi/mod.rs:262`, `respond.send(...)`)
So:
```
runtime thread → blocked in Command::output() waiting for `wg` to exit
`wg` → blocked waiting for UAPI std-thread to write a reply
UAPI thread → blocking_recv, waiting for handle_api to answer
handle_api → needs the runtime thread to poll it … which is blocked in `wg`
```
A circular wait. The runtime thread is wedged forever, which means:
1. The `sigterm.recv()` / `sigint.recv()` arms (`:408`, `:404`) are never polled → **SIGTERM is silently ignored**.
2. The `control_socket_path.exists()` check (`:413`) never runs → socket removal wouldn't help either.
3. The device packet tasks (`handle_incoming`/`handle_outgoing`) can't be polled → **the tunnel passes no traffic**, and the routes it installed blackhole the LAN.
At disconnect, the privileged side dutifully `SIGTERM`s pid 78780, waits exactly 8 s, the wedged helper never moves, and you get `gotatun helper 78780 did not exit after SIGTERM; cleanup is incomplete` — the same pid surviving across both disconnect attempts. Cleanup (`cleanup_network`) never runs, so the routes + DNS overrides persist → LAN hijack → reboot. Same end state as before, different proximate failure — exactly your "different, but same system-broken result."
Extend `--mtu` support to include the userspace backend for `wgconf`. Update the privileged helper to pass MTU overrides to `gotatun` and add input validation for the MTU value. Update documentation with usage details and macOS-specific troubleshooting for userspace data planes.
Implement a framed response protocol that allows the privileged service to capture log output during request processing and stream it to the CLI. Updates the `gotatun` helper to write logs to a file that the service tails, ensuring setup and teardown activity is reported back to the user in real-time.
* fix/andi/cleanup_issue_mtu: Enable --mtu support for userspace WireGuard backends
Add `is_live()` to `ConnectionState` to verify if a tunnel process or interface is still active. Automatically prune stale or corrupted state files during connection attempts to prevent the system from getting wedged in a "connected" state after a reboot.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR expands userspace WireGuard support (gotatun) with MTU override handling, improved helper lifecycle/logging, and better stale-state detection so reconnects don’t get wedged after reboots/crashes.
Changes:
- Add MTU parsing/validation and plumb MTU overrides through wgconf/userspace gotatun flows.
- Improve privileged<->CLI observability by streaming privileged-side log frames and adding per-interface helper log files + PID tracking/cleanup confirmation.
- Add stale direct-connection detection (
is_live/direct_connection_active) and a newwgconf statuscommand.
Reviewed changes
Copilot reviewed 26 out of 28 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/wireguard/userspace.rs | Adds MTU-aware up/down wrappers and passes MTU override to privileged gotatun run |
| src/wireguard/proxy_tunnel/mod.rs | Formatting-only changes to spawned proxy handlers |
| src/wireguard/kernel.rs | Moves MTU validation into shared config module |
| src/wireguard/connection.rs | Adds ConnectionState::is_live() + tests for stale-state detection |
| src/wireguard/config.rs | Parses MTU= from [Interface], adds parse_mtu/validate_mtu, and tests |
| src/wgconf/handlers.rs | Adds wgconf status, MTU validation, and routes MTU to userspace connect |
| src/userspace_helper.rs | Adds MTU support, helper PID/name/log/cleanup status files, macOS route reconcile, and structured cleanup |
| src/shared/util.rs | Test formatting change |
| src/shared/connection_ops.rs | Makes disconnect remove state even on teardown error; adds stale direct cleanup helper |
| src/proton/handlers.rs | Formatting-only change |
| src/privileged_client/util.rs | Import reordering |
| src/privileged_client/transport.rs | Adds --debug propagation, detaches daemon stdio, and implements framed log/response reader |
| src/privileged_client/mod.rs | Extends gotatun_run request with MTU override + debug flag |
| src/privileged_api.rs | Adds MTU override + debug fields to requests and validates MTU override |
| src/privileged/mod.rs | Streams per-request logs back to clients via framed responses; merges helper + service logs |
| src/privileged/dispatch.rs | Passes MTU/debug through to gotatun up implementation |
| src/privileged/commands.rs | Implements gotatun helper PID tracking, cleanup confirmation, and MTU/debug env propagation |
| src/mullvad/handlers.rs | Formatting-only change |
| src/main.rs | Enables debug via env and uses service logging initializer in privileged mode |
| src/logging.rs | Adds debug env flag, service log capture writer, and synchronous file logger |
| src/local_proxy.rs | Exposes proc_alive for connection liveness checks |
| src/ivpn/handlers.rs | Formatting-only change |
| src/cli.rs | Adds --debug alias for verbose and adds wgconf status subcommand + tests |
| src/airvpn/handlers.rs | Formatting-only change |
| README.md | Documents --debug and MTU behavior; adds macOS testing note |
| Cargo.toml | Updates gotatun dependency version |
| .gitignore | Ignores repo-local /var directory |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tunmux wgconf disconnect deadlocktunmux wgconf disconnect deadlock (and the cleanup fallout it caused)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
wgconf disconnectcould hang and leave the machine wedged: the userspace helperignored
SIGTERM, never ran its network cleanup, and so left routes and DNSoverrides in place — hijacking the LAN until a reboot. This PR fixes the root-cause
deadlock and bundles the surrounding cleanup/observability work that the bug exposed.
The deadlock
The macOS helper runs on a single-threaded (
current_thread) Tokio runtime. Itsonce-per-second shutdown loop ran a diagnostic probe (
wg show <iface> transfer)inline as a blocking call. That probe connects to gotatun's own in-process UAPI
server, whose reply is produced by an async task on the very runtime thread now
blocked in the probe — a circular wait:
While wedged, the
select!arms forSIGTERM/SIGINTand the control-socketremoval check were never polled, and the tunnel's packet tasks stopped — so the
process survived disconnect, its routes blackholed the LAN, and DNS overrides
persisted.
Fix: run the probe on
spawn_blockingbounded by atimeout, so it can nolonger stall the executor. The shutdown
select!(signals + socket removal) is nowalways reachable, and device stop is likewise bounded by a timeout. SIGTERM is
honored,
cleanup_networkruns, and routes/DNS are restored on disconnect.Resilience to stale state
To stop the system getting stuck in a phantom "connected" state after a crash or
reboot,
ConnectionStategainedis_live()(verifies the tunnel process/interfacestill exists), and stale or corrupted state files are pruned automatically on the
next connect.
Other changes in this branch
wgconf status— report current connection state.and stream setup/teardown logs (including the gotatun helper's, via a tailed log
file) back to the CLI in real time.
--debugflag — enable debug logging from the CLI.--mtufor userspacewgconf—wgconfreadsMTU =from[Interface];--mtuoverrides it for direct kernel/userspace mode and kernel--proxy(not--local-proxy, which creates no host TUN). Includes MTU validation and macOSdata-plane testing notes in the README.
the current LAN rather than blackholing it.
gotatunto0.7.1(and restore thesmoltcpsubmodulepin after an accidental downgrade).