capsem shell (temp): Ctrl-D never exits cleanly; terminal floods with control/audit frames
Environment
- macOS 26.4
- capsem 1.2.1779673506
Problem
Pressing Ctrl-D or typing exit in a temp capsem shell does not end the session. The terminal then fills with audit-tail churn and garbled control frames ending in pong.
Logs
Session honest-tiger-tmp. serial.log/pty.log carry no wall-clock timestamps; all three windows are aligned via process.log, whose first reconnect re-key lands at 19:59:26.683Z — immediately after the guest exit. The storm runs 19:59:26.68 → 20:03:03.11 (216s): 1217 reconnects ≈ 337/min, peaking ~390/min in the first 10s.
1. Guest agent — serial.log (storm onset ≈ 19:59:26.68, the bridge exited → reconnected pair repeating):
root@honest-tiger-tmp:~# [capsem-agent] bridge exited
[reconnect] succeeded after 1 attempt(s) (1ms)
[capsem-agent] reconnected successfully
[capsem-agent] rebinding workspace after resume
[capsem-agent] rebind: /root reconnected to host workspace
[capsem-agent] bridge exited
[capsem-agent] audit: connected to host, tailing /var/log/audit/audit.log
[reconnect] succeeded after 1 attempt(s) (916µs)
[capsem-agent] reconnected successfully
[capsem-agent] rebinding workspace after resume
[capsem-agent] rebind: /root reconnected to host workspace
[capsem-agent] bridge exited
[capsem-agent] audit: connected to host, tailing /var/log/audit/audit.log
[reconnect] succeeded after 1 attempt(s) (964µs)
[capsem-agent] reconnected successfully
Each cycle re-runs audit: connected … tailing /var/log/audit/audit.log: 1217 audit tails started, 0 teardown lines (plus 34× audit: write failed against dead bridges) — leaked audit threads accumulating, one per reconnect.
2. Host — process.log (re-keying terminal/control/audit bridges on every reconnect; one cycle, 19:59:26.683–.714Z):
19:59:26.683 terminal port: connection accepted, re-keying bridge
19:59:26.684 control port: connection accepted, performing handshake
19:59:26.706 control port: handshake successful, re-keying bridge
19:59:26.706 control bridge: active
19:59:26.706 terminal port: connection accepted, re-keying bridge
19:59:26.707 audit port: connected, reading audit records
19:59:26.707 control port: connection accepted, performing handshake
19:59:26.714 control port: handshake successful, re-keying bridge
Host counts confirm the 1:1 storm: terminal port: connection accepted ×1217, control port ×1216, audit port: connected ×1217, vsock: accepted connection ×3652 — none ever closed.
3. Terminal stream — pty.log: guest exit (Ctrl-D), then leaked Pong control frames; and serial.log:1575 showing the same frames interleaved with agent output:
exit
00 00 00 08 | 81 a1 74 a4 70 6f 6e 67 (len=8, msgpack {"t":"pong"}) → renders ☐☐☐☐☐t☐pong
00 00 00 08 | 81 a1 74 a4 70 6f 6e 67 → ☐☐☐☐☐t☐pong
…01 5c …M-^GS^F… 0008 81a1 74a4 706f6e67 … (per-message frame headers + repeated pong)
# serial.log:1575
\x00\x00\x00\x08\x81\xa1t\xa4pong\x00\x00\x00\x08\x81\xa1t\xa4pong[capsem-agent] audit: connected to host, tailing /var/log/audit/audit.log
The pong bytes are GuestToHost::Pong MessagePack {"t":"pong"} written into the recycled terminal fd by stale heartbeat threads. (Note: literal kernel SYSCALL/capsem_exec/comm="uv" audit records are not present in any captured host log — what floods these captures is the per-reconnect Pong frames plus the 1217× audit re-tail churn above.)
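To make the byte pattern above easier to recognise, here is a minimal std-only sketch that rebuilds the frame by hand. It assumes the 4-byte prefix is a big-endian u32 length, which is what the dump suggests but is not confirmed from the source. The helper names (encode_tag_map, frame, render) are illustrative and are not taken from capsem-proto.

```rust
/// Encode a one-entry MessagePack map {"t": <tag>} by hand (fixmap + fixstr only).
fn encode_tag_map(tag: &str) -> Vec<u8> {
    assert!(tag.len() <= 31, "fixstr holds at most 31 bytes");
    // 0x81 = fixmap with 1 entry, 0xa1 = fixstr of length 1 ("t"),
    // 0xa0 | n = fixstr of length n for the value.
    let mut out = vec![0x81, 0xa1, b't', 0xa0 | tag.len() as u8];
    out.extend_from_slice(tag.as_bytes());
    out
}

/// Prefix a payload with a 4-byte big-endian length (assumed framing).
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Rough terminal view: printable ASCII as-is, everything else as '.'.
fn render(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| if b.is_ascii_graphic() || b == b' ' { b as char } else { '.' })
        .collect()
}

fn main() {
    let wire = frame(&encode_tag_map("pong"));
    let hex: Vec<String> = wire.iter().map(|b| format!("{b:02x}")).collect();
    println!("{}", hex.join(" ")); // 00 00 00 08 81 a1 74 a4 70 6f 6e 67
    println!("{}", render(&wire)); // ......t.pong
}
```

The only printable bytes in the frame are t and pong, which is why the captures show a run of unprintable glyphs around those five letters.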
Root cause
The guest agent cannot distinguish bash exit from a transport disconnect, so Ctrl-D triggers an infinite reconnect storm instead of a clean teardown.
- Bash exits → PTY master POLLHUP → bridge_loop breaks: main.rs#L1047-L1049
- The interactive bash child is never reaped/checked (forked at main.rs#L427; the only child.wait() is run_exec's subprocess).
- Outer loop treats every run_bridge return as a disconnect and reconnects unconditionally: main.rs#L520-L521
- Host mirrors this: terminal bridge treats a closed FD as a reconnect-wait, never closing term_out: vsock.rs#L352, vsock.rs#L361
- No GuestToHost "shell exited" message exists for guest-initiated exit: proto lib.rs#L497
Terminal garbage (secondary, same cause)
GuestToHost::Pong encodes to MessagePack {"t":"pong"} = \x81\xa1t\xa4pong → renders as ��t�pong. During the storm, run_bridge re-spawns writer/heartbeat/audit threads each iteration without stopping them (heartbeat main.rs#L847-L852, audit main.rs#L875); the loop closes and recycles the terminal/control FD numbers (main.rs#L651-L652), so stale threads write Pong/audit frames into a reused terminal fd.
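A minimal sketch of how the writer/heartbeat/audit threads could be tied to a single bridge iteration, so that none survives to write into a recycled descriptor. BridgeWorkers and its methods are hypothetical and do not come from the capsem source. The sink is a shared buffer standing in for the terminal fd.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Threads owned by one bridge iteration (hypothetical type).
struct BridgeWorkers {
    stop: Arc<AtomicBool>,
    handles: Vec<JoinHandle<()>>,
}

impl BridgeWorkers {
    fn new() -> Self {
        Self { stop: Arc::new(AtomicBool::new(false)), handles: Vec::new() }
    }

    /// Spawn a worker that calls `tick` periodically until the stop flag is set.
    fn spawn<F>(&mut self, period: Duration, mut tick: F)
    where
        F: FnMut() + Send + 'static,
    {
        let stop = Arc::clone(&self.stop);
        self.handles.push(thread::spawn(move || {
            while !stop.load(Ordering::Acquire) {
                tick();
                thread::sleep(period);
            }
        }));
    }

    /// Signal every worker and join it, so none can outlive the iteration.
    fn shutdown(self) {
        self.stop.store(true, Ordering::Release);
        for handle in self.handles {
            let _ = handle.join();
        }
    }
}

fn main() {
    // Stand-in for the terminal/control fd: a buffer that records writes.
    let sink = Arc::new(Mutex::new(Vec::<&'static str>::new()));

    let mut workers = BridgeWorkers::new();
    let s = Arc::clone(&sink);
    workers.spawn(Duration::from_millis(5), move || s.lock().unwrap().push("pong"));
    thread::sleep(Duration::from_millis(30));
    workers.shutdown();

    // After shutdown, no thread is left to write into the sink.
    let before = sink.lock().unwrap().len();
    thread::sleep(Duration::from_millis(30));
    assert_eq!(before, sink.lock().unwrap().len());
    println!("writes before shutdown: {before}, none after");
}
```

A polling stop flag only works for threads that wake up regularly. A thread blocked in a read, such as an audit tail, would also need a read timeout or an explicit shutdown of its descriptor before the join can complete.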
Fix
1. bridge_loop returns why it exited (master-closed vs vsock-closed).
2. On master-closed, waitpid(child, WNOHANG); if bash exited, do not reconnect.
3. Add GuestToHost::ShellExited { exit_code }; agent sends it before process::exit; host closes term_out and tears down the session (auto-destroy for temp VMs).
4. Stop/join per-iteration writer/heartbeat/audit threads on run_bridge return.
Items 1–2 fix the exit; 3 makes the host exit cleanly; 4 removes the frame leak.
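Items 1 and 2 could be sketched roughly as follows. This is an illustration under assumptions, not the agent's actual code. BridgeExit, NextStep and after_bridge_exit are hypothetical names. The sketch uses std::process::Child::try_wait as a stand-in for waitpid(child, WNOHANG), whereas the real agent forks bash itself and would call waitpid on the stored pid.

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::Duration;

/// Why the bridge loop returned (hypothetical names).
#[derive(Debug, PartialEq)]
enum BridgeExit {
    MasterClosed,
    VsockClosed,
}

/// What the outer loop should do next (hypothetical).
#[derive(Debug, PartialEq)]
enum NextStep {
    Reconnect,
    ShellExited { exit_code: i32 },
}

/// Decide between reconnecting and tearing down. `try_wait` never blocks,
/// like waitpid with WNOHANG.
fn after_bridge_exit(reason: BridgeExit, shell: &mut Child) -> std::io::Result<NextStep> {
    if reason == BridgeExit::MasterClosed {
        if let Some(status) = shell.try_wait()? {
            // A shell killed by a signal has no exit code; -1 is a placeholder.
            return Ok(NextStep::ShellExited { exit_code: status.code().unwrap_or(-1) });
        }
    }
    Ok(NextStep::Reconnect)
}

fn main() -> std::io::Result<()> {
    // Stand-in for the interactive bash child: a shell that exits with status 3.
    let mut shell = Command::new("sh").args(["-c", "exit 3"]).spawn()?;
    // The hangup can be seen slightly before the child is reapable, so poll briefly.
    let step = loop {
        match after_bridge_exit(BridgeExit::MasterClosed, &mut shell)? {
            NextStep::Reconnect => thread::sleep(Duration::from_millis(10)),
            done => break done,
        }
    };
    assert_eq!(step, NextStep::ShellExited { exit_code: 3 });

    // A transport drop with a live shell still reconnects.
    let mut live = Command::new("sleep").arg("5").spawn()?;
    assert_eq!(after_bridge_exit(BridgeExit::VsockClosed, &mut live)?, NextStep::Reconnect);
    live.kill()?;
    live.wait()?;
    println!("{step:?}");
    Ok(())
}
```

One design point the demo loop hints at: a non-blocking wait issued immediately after POLLHUP may find the child not yet reapable, so the agent would likely need a short bounded retry before concluding that the shell is still alive and reconnecting.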
Complexity
Medium. Touches 3 crates (capsem-proto, capsem-agent, capsem-process) incl. a protocol addition. ~150–250 LoC + VM/hypervisor tests.