capsem shell (temp): Ctrl-D never exits cleanly; terminal floods with control/audit frames #66

@davidlabianca


Environment

  • macOS 26.4
  • capsem 1.2.1779673506

Problem

Ctrl-D or exit in a temp capsem shell does not end the session. Instead, the terminal fills with audit-tail churn and garbled control frames ending in pong.

Logs

Session honest-tiger-tmp. serial.log/pty.log carry no wall-clock timestamps; all three windows are aligned via process.log, whose first reconnect re-key lands at 19:59:26.683Z — immediately after the guest exit. The storm runs 19:59:26.68 → 20:03:03.11 (216s): 1217 reconnects ≈ 337/min, peaking ~390/min in the first 10s.
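
The duration and average rate quoted above can be re-derived from the two endpoint timestamps. This is only a minimal sketch for checking the arithmetic; seconds_of_day and reconnects_per_minute are hypothetical helper names, not part of capsem.

```rust
/// Convert an "HH:MM:SS.fff" timestamp into seconds since midnight.
fn seconds_of_day(ts: &str) -> Option<f64> {
    let mut parts = ts.split(':');
    let h: f64 = parts.next()?.parse().ok()?;
    let m: f64 = parts.next()?.parse().ok()?;
    let s: f64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three fields: not a time of day
    }
    Some(h * 3600.0 + m * 60.0 + s)
}

/// Average reconnects per minute between two timestamps on the same day.
fn reconnects_per_minute(count: u32, start: &str, end: &str) -> Option<f64> {
    let elapsed = seconds_of_day(end)? - seconds_of_day(start)?;
    if elapsed <= 0.0 {
        return None;
    }
    Some(f64::from(count) / elapsed * 60.0)
}
```

With the values from this session, reconnects_per_minute(1217, "19:59:26.68", "20:03:03.11") covers 216.43 s and rounds to 337 per minute.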

1. Guest agent — serial.log (storm onset ≈ 19:59:26.68, the bridge exited → reconnected pair repeating):

root@honest-tiger-tmp:~# [capsem-agent] bridge exited
[reconnect] succeeded after 1 attempt(s) (1ms)
[capsem-agent] reconnected successfully
[capsem-agent] rebinding workspace after resume
[capsem-agent] rebind: /root reconnected to host workspace
[capsem-agent] bridge exited
[capsem-agent] audit: connected to host, tailing /var/log/audit/audit.log
[reconnect] succeeded after 1 attempt(s) (916µs)
[capsem-agent] reconnected successfully
[capsem-agent] rebinding workspace after resume
[capsem-agent] rebind: /root reconnected to host workspace
[capsem-agent] bridge exited
[capsem-agent] audit: connected to host, tailing /var/log/audit/audit.log
[reconnect] succeeded after 1 attempt(s) (964µs)
[capsem-agent] reconnected successfully

Each cycle re-runs audit: connected … tailing /var/log/audit/audit.log: 1217 audit tails started, 0 teardown lines (plus 34× audit: write failed against dead bridges) — leaked audit threads accumulating, one per reconnect.

2. Host — process.log (re-keying terminal/control/audit bridges on every reconnect; one cycle, 19:59:26.683–.714Z):

19:59:26.683 terminal port: connection accepted, re-keying bridge
19:59:26.684 control port: connection accepted, performing handshake
19:59:26.706 control port: handshake successful, re-keying bridge
19:59:26.706 control bridge: active
19:59:26.706 terminal port: connection accepted, re-keying bridge
19:59:26.707 audit port: connected, reading audit records
19:59:26.707 control port: connection accepted, performing handshake
19:59:26.714 control port: handshake successful, re-keying bridge

Host counts confirm the 1:1 storm: terminal port: connection accepted ×1217, control port ×1216, audit port: connected ×1217, vsock: accepted connection ×3652 — none ever closed.

3. Terminal stream — pty.log: guest exit (Ctrl-D), then leaked Pong control frames; and serial.log:1575 showing the same frames interleaved with agent output:

exit
00 00 00 08 | 81 a1 74 a4 70 6f 6e 67   (len=8, msgpack {"t":"pong"})  → renders ☐☐☐☐☐t☐pong
00 00 00 08 | 81 a1 74 a4 70 6f 6e 67                                   → ☐☐☐☐☐t☐pong
…01 5c …M-^GS^F… 0008 81a1 74a4 706f6e67 …  (per-message frame headers + repeated pong)
# serial.log:1575
\x00\x00\x00\x08\x81\xa1t\xa4pong\x00\x00\x00\x08\x81\xa1t\xa4pong[capsem-agent] audit: connected to host, tailing /var/log/audit/audit.log

The pong bytes are GuestToHost::Pong MessagePack {"t":"pong"} written into the recycled terminal fd by stale heartbeat threads. (Note: literal kernel SYSCALL/capsem_exec/comm="uv" audit records are not present in any captured host log — what floods these captures is the per-reconnect Pong frames plus the 1217× audit re-tail churn above.)
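
To make the byte dump above easier to check, here is a minimal sketch that splits one captured frame and decodes its payload by hand. It assumes only what the capture shows: a 4-byte big-endian length prefix followed by a MessagePack fixmap with one fixstr key and one fixstr value. The function names are hypothetical and this is not the capsem-proto decoder.

```rust
/// Split one frame off the front of `buf`: 4-byte big-endian length, then payload.
/// Returns (payload, remaining bytes), or None if the buffer is too short.
fn split_frame(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let rest = &buf[4..];
    if rest.len() < len {
        return None;
    }
    Some(rest.split_at(len))
}

/// Read one MessagePack fixstr (marker 101xxxxx, low 5 bits = length).
fn read_fixstr(buf: &[u8]) -> Option<(&str, &[u8])> {
    let (&marker, rest) = buf.split_first()?;
    if marker & 0xe0 != 0xa0 {
        return None;
    }
    let len = (marker & 0x1f) as usize;
    if rest.len() < len {
        return None;
    }
    let (s, rest) = rest.split_at(len);
    Some((std::str::from_utf8(s).ok()?, rest))
}

/// Decode a single-entry fixmap (0x81) whose key and value are both fixstr.
fn decode_tagged(payload: &[u8]) -> Option<(&str, &str)> {
    let (&map, rest) = payload.split_first()?;
    if map != 0x81 {
        return None;
    }
    let (key, rest) = read_fixstr(rest)?;
    let (value, rest) = read_fixstr(rest)?;
    if !rest.is_empty() {
        return None;
    }
    Some((key, value))
}

/// Roughly what a UTF-8 terminal does with the raw payload bytes:
/// each invalid byte becomes U+FFFD.
fn render_lossy(payload: &[u8]) -> String {
    String::from_utf8_lossy(payload).into_owned()
}
```

Fed the 12 captured bytes, split_frame yields an 8-byte payload, decode_tagged yields ("t", "pong"), and render_lossy shows why only t and pong stay readable on screen.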

Root cause

The guest agent cannot distinguish bash exit from a transport disconnect, so Ctrl-D triggers an infinite reconnect storm instead of a clean teardown.

  • Bash exits → PTY master POLLHUP → bridge_loop breaks: main.rs#L1047-L1049
  • The interactive bash child is never reaped/checked (forked at main.rs#L427; the only child.wait() is run_exec's subprocess).
  • Outer loop treats every run_bridge return as a disconnect and reconnects unconditionally: main.rs#L520-L521
  • Host mirrors this: terminal bridge treats a closed FD as a reconnect-wait, never closing term_out: vsock.rs#L352, vsock.rs#L361
  • No GuestToHost "shell exited" message exists for guest-initiated exit: proto lib.rs#L497

Terminal garbage (secondary, same cause)

GuestToHost::Pong encodes to MessagePack {"t":"pong"} = \x81\xa1t\xa4pong → renders as ��t�pong. During the storm, run_bridge re-spawns writer/heartbeat/audit threads each iteration without stopping them (heartbeat main.rs#L847-L852, audit main.rs#L875); the loop closes and recycles the terminal/control FD numbers (main.rs#L651-L652), so stale threads write Pong/audit frames into a reused terminal fd.
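
One way to keep per-iteration threads from outliving their bridge is to give each iteration an owner that signals and joins them before the FDs are closed. The following is a minimal sketch under that assumption; BridgeWorkers and its methods are hypothetical names, and the real heartbeat/audit threads in main.rs may be structured differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Owns the worker threads spawned for one bridge iteration (hypothetical type).
struct BridgeWorkers {
    stop: Arc<AtomicBool>,
    handles: Vec<JoinHandle<()>>,
}

impl BridgeWorkers {
    fn new() -> Self {
        BridgeWorkers {
            stop: Arc::new(AtomicBool::new(false)),
            handles: Vec::new(),
        }
    }

    /// Spawn a worker that calls `tick` every `interval` until asked to stop.
    fn spawn<F>(&mut self, interval: Duration, mut tick: F)
    where
        F: FnMut() + Send + 'static,
    {
        let stop = Arc::clone(&self.stop);
        self.handles.push(thread::spawn(move || {
            while !stop.load(Ordering::Acquire) {
                tick();
                thread::sleep(interval);
            }
        }));
    }

    /// Signal every worker and wait for it to finish. Call this before the
    /// iteration's FDs are closed, so no worker can write to a recycled fd.
    fn shutdown(self) {
        self.stop.store(true, Ordering::Release);
        for handle in self.handles {
            let _ = handle.join();
        }
    }
}
```

A stop flag only helps workers that wake up periodically. A thread blocked in a read on the audit log or a vsock would also need a timeout or a wake-up fd before it can observe the flag.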

Fix

  1. bridge_loop returns why it exited (master-closed vs vsock-closed).
  2. On master-closed, waitpid(child, WNOHANG); if bash exited, do not reconnect.
  3. Add GuestToHost::ShellExited { exit_code }; agent sends it before process::exit; host closes term_out and tears down the session (auto-destroy for temp VMs).
  4. Stop/join per-iteration writer/heartbeat/audit threads on run_bridge return.

Items 1–2 fix the exit; 3 makes the host exit cleanly; 4 removes the frame leak.
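
Items 1 and 2 can be sketched as a small control-flow change in the outer loop. This is an illustration only: BridgeExit, Outcome and agent_loop are hypothetical names, run_bridge is modeled as a closure that reports why it returned, and poll_child stands in for waitpid(child, WNOHANG), returning Some(exit_code) once bash has exited.

```rust
/// Why the bridge loop returned (item 1). Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BridgeExit {
    /// PTY master hung up: the shell side went away.
    MasterClosed,
    /// The vsock transport dropped: the host side went away.
    VsockClosed,
}

/// Result of the outer loop once the shell has really exited.
#[derive(Debug, PartialEq, Eq)]
struct Outcome {
    exit_code: i32,
    reconnects: u32,
}

/// Outer agent loop: reconnect on transport loss, stop when bash has exited (item 2).
fn agent_loop(
    mut run_bridge: impl FnMut() -> BridgeExit,
    mut poll_child: impl FnMut() -> Option<i32>,
) -> Outcome {
    let mut reconnects = 0u32;
    loop {
        let reason = run_bridge();
        if reason == BridgeExit::MasterClosed {
            // Only a master-side hangup can mean "bash exited"; confirm it
            // with a non-blocking wait before giving up on the session.
            if let Some(exit_code) = poll_child() {
                // Item 3 would send GuestToHost::ShellExited { exit_code }
                // here, before the agent calls process::exit.
                return Outcome { exit_code, reconnects };
            }
        }
        // Transport loss, or master closed while the child is still alive:
        // treat it as a disconnect and reconnect, as today.
        reconnects += 1;
    }
}
```

With this shape, two vsock drops followed by a Ctrl-D produce two reconnects and then a single clean return carrying the shell's exit code, rather than an unbounded reconnect loop.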

Complexity

Medium. Touches 3 crates (capsem-proto, capsem-agent, capsem-process) incl. a protocol addition. ~150–250 LoC + VM/hypervisor tests.
