Skip to content

fix(daemon): stop orchestrator id-increment + OS-native daemon liveness link#2185

Open
harshitsinghbhandari wants to merge 23 commits into
AgentWrapper:mainfrom
harshitsinghbhandari:fix/backend-daemon-issues
Open

fix(daemon): stop orchestrator id-increment + OS-native daemon liveness link#2185
harshitsinghbhandari wants to merge 23 commits into
AgentWrapper:mainfrom
harshitsinghbhandari:fix/backend-daemon-issues

Conversation

@harshitsinghbhandari

Copy link
Copy Markdown
Collaborator

Problem

On macOS the orchestrator session kept getting recreated with an incrementing id (ao-agents-14 -> 15 -> 16) across daemon restarts, app quits, and reboots, losing its context each time. The daemon was also orphaned on app close. Windows did not have the bug.

Root cause (proven by sandbox experiment): the graceful daemon-shutdown path ran SaveAndTeardownAll, which killed live tmux sessions. On restart the promptless orchestrator could not be restored (ErrNotResumable), so the frontend minted a new one (id++). The hard-kill path left tmux alive and the boot reconcile adopted it cleanly. This makes graceful shutdown behave like hard-kill, and adds an OS-native liveness link so the daemon stops cleanly (not orphaned) when the frontend dies.

What changed

Core increment fix

  • Shutdown no longer tears down live sessions; boot Reconcile adopts the surviving tmux/ConPTY sessions. SaveAndTeardownAll is dropped from the daemon's sessionLifecycle interface, so re-adding teardown-on-shutdown is now a compile error.
  • SpawnOrchestrator(clean=false) is idempotent server-side: it returns the existing live orchestrator instead of minting a duplicate (previously only a client-side guard existed).

OS-native liveness link (no orphan, no clean-close dependency)

  • A transport-agnostic supervisor watchdog: arms on first client connect (headless-safe, an ao start daemon with no frontend never self-stops), fires once after a 5s grace when the last client drops, reconnect cancels.
  • Listeners: Unix socket at ~/.ao/supervise.sock (sibling of running.json) and a Windows named pipe (go-winio), wired with onLastClientGone = RequestShutdown. Non-fatal if it cannot bind.
  • Electron holds one socket for the app lifetime (retry + reconnect, no heartbeat). Quit-time daemon teardown removed, so Cmd+Q, crash, and SIGKILL behave identically: socket drops, daemon self-stops after grace, sessions survive. A guarded last-resort kill covers the case where the link never connected.

Reboot recovery

  • When tmux is genuinely gone (real reboot), a promptless session is relaunched fresh in the same id instead of being abandoned via ErrNotResumable (which is removed). Orchestrator == worker, no special-casing.

Verification

  • go build ./... && go vet ./... && go test ./... -race green; frontend vitest 307/307 and tsc clean.
  • Live sandbox repros: graceful stop -> tmux survives -> restart adopts same id; ensure-POST returns same id (no duplicate); supervisor headless-safe + held-connection-keeps-alive + client-drop self-stops in 5.0s + reconnect cancels; reboot-sim (tmux kill-server) -> restored fresh, same id.
  • A signed + notarized macOS arm64 test build was produced and Gatekeeper-verified for manual QA.

Notes

  • Stacked on the now-merged Zellij to tmux + ConPTY runtime, session save/restore, crash-proof reconcile (port #404) #2183 (tmux/ConPTY migration); the branch is being rebased onto main to drop the already-merged commits, after which this diff shows only the daemon-lifecycle change.
  • The Windows named-pipe path cross-compiles but has not been runtime-tested on a Windows box.
  • The supervisor link attaches only on the daemon-spawn path; the attach path is intentionally not linked to preserve headless-ao start persistence.

🤖 Generated with Claude Code

harshitsinghbhandari and others added 15 commits June 25, 2026 19:30
…n boot

Remove SaveAndTeardownAll from the graceful shutdown path. Live tmux/ConPTY
sessions survive daemon exit; Reconcile on the next boot adopts them via
reconcileLive, preserving session IDs and preventing the id-increment bug.
Add runShutdownSessionLifecycle as a testable seam and narrow the
sessionLifecycle interface to Reconcile+RestoreAll only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-add SaveAndTeardownAll to the sessionLifecycle interface and to
fakeSessionLifecycle so TestShutdown_DoesNotCallSaveAndTeardownAll is
genuinely falsifiable: the flag flips if runShutdownSessionLifecycle
ever calls sl.SaveAndTeardownAll, making the assertion meaningful.
Name the sl parameter (was discarded with _) so the seam is visible.

RED: with a temporary sl.SaveAndTeardownAll call, test fails.
GREEN: without it, test passes. go test ./internal/daemon/... -race: 23/23.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d interface

Remove SaveAndTeardownAll from sessionLifecycle so daemon.Run physically
cannot call teardown on shutdown. Delete the no-op runShutdownSessionLifecycle
seam and its test. The narrowed interface is the guard: re-introducing teardown
requires a visible, reviewable interface change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… a duplicate

When clean=false, SpawnOrchestrator now checks for an existing active orchestrator
and returns it directly instead of always calling Spawn. Adds RED/GREEN tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nNoneExists

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds backend/internal/daemon/supervisor: a transport-agnostic watchdog
that fires onLastClientGone() exactly once when the live connection count
drops to zero and stays zero for a configurable grace period. Arms only
after the first accepted connection (headless safety: CLI ao start never
self-stops). Reconnect before grace elapses cancels the pending timer.
Mutex guards liveCount/armed/pendingTimer; sync.Once guards the callback.
Tests use net.Pipe + a fake listener (no OS sockets); all three behavior
contracts verified with -race: never fires pre-connect, fires once on
last disconnect, reconnect within grace cancels and re-arms correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…leak, cleanup)

- Fix data race: capture liveCount into local inside lock before logging
- Replace io.EOF with errors.Is(err, net.ErrClosed) for correct production behavior; update fakeListener to match
- Fix goroutine leak: derive cancellable child context in Serve so watcher always unblocks on return
- Simplify makePipe: drop dead error return, update 3 call sites
- Remove dead defensive pendingTimer nil-check in armGrace

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Creates platform-split Listen() in supervisor package (Unix UDS sibling
of run-file; Windows named pipe via go-winio). Wires it into daemon.Run
before srv.Run: listener failure is non-fatal so headless ao-start works.
Adds RequestShutdown() on Server and three unit tests for listen_unix.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… exit

Add connectSupervisor() in frontend/src/main/supervisor-link.ts: holds one
client connection to the daemon's OS-native supervisor socket (Unix domain
socket on macOS/Linux, named pipe on Windows). Retries with bounded backoff
until the daemon is up; reconnects automatically on drop.

In main.ts, connect the link from reportBoundPort (once per daemon ready
transition). Remove the quit-time daemon teardown: the before-quit handler
now only disposes the browser view. The process.on("exit") killDaemon call
is also removed. When Electron exits for any reason the OS closes the fd,
the daemon detects EOF, and self-stops after its 5s grace period. tmux and
ConPTY sessions survive and are adopted on the next boot.

killDaemon and stopDaemon are kept for the explicit user-stop path
(ipcMain.handle("daemon:stop", ...)).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n; tidy test setup

Expose a live `connected` getter on SupervisorLinkHandle and add a
last-resort process.on("exit") kill that fires only when the link is
not actually connected (UDS never bound or addr was null), preventing
orphan daemons while keeping the OS-fd teardown path for the normal
case. Log a warning when addr is null so the skip is diagnosable.
Invert the setup.ts guard to the natural form, dropping the empty
if/else skeleton.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…increment)

Drop the empty-prompt early return in restoreArgv that returned ErrNotResumable
when the adapter could not resume and no prompt was saved. Now control falls
through to GetLaunchCommand unconditionally: a saved prompt is replayed, an
empty prompt (orchestrator) launches fresh with the system prompt only, same id,
same workspace. Removes the only producer of ErrNotResumable and deletes the
now-dead sentinel, its service mapping, and the corresponding tests.

Frontend reference to SESSION_NOT_RESUMABLE in TerminalPane.tsx is left intact
(the handler becomes harmlessly dead; the API simply stops returning that code).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r reconcile

Task 5 made promptless sessions relaunch fresh instead of hitting ErrNotResumable.
The reconcile crash-recovery path (documented in reconcileLive) terminates a
dead-live session then RestoreAll relaunches it on the same boot. This test
asserted the old promptless-stays-terminated artifact; update it to the intended
restored end state (live again, same id, one fresh runtime Create).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…xplicit stop

Document that the liveness link is established only when the app spawned the
daemon: the attach path intentionally does not link, to keep an `ao start`
daemon persistent (headless safety). Also dispose the link on an explicit
daemon:stop so its reconnect loop does not retry a deliberately-stopped daemon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@harshitsinghbhandari harshitsinghbhandari force-pushed the fix/backend-daemon-issues branch from a5a31be to 6ed9bc5 Compare June 25, 2026 14:02
github-actions Bot and others added 8 commits June 25, 2026 14:02
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rrors

Review of AgentWrapper#2185: Serve previously returned on the first unexpected Accept error,
silently disabling the watchdog (the 'restart is caller's job' comment described
a contract the caller never fulfilled). Back off and keep accepting instead, so
a transient error (e.g. EMFILE) cannot leave the daemon unable to self-stop on
frontend death. Also drop the stale Task 3/4 planning comment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ors still relaunch fresh)

A promptless, unresumable KindWorker session had no prompt to replay and
no native session id to resume from. Blank-relaunching it via GetLaunchCommand
would silently drop its task. restoreArgv now returns ErrNotResumable for this
case, gated on (meta.Prompt == "" && kind != domain.KindOrchestrator). Orchestrators
are promptless by design and continue to relaunch fresh with the system prompt only.
Re-introduces ErrNotResumable sentinel and its Conflict API mapping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…se lingering-daemon gap)

- runfile.Info gets Owner field (omitempty): "app" when Electron spawned,
  empty for headless `ao start`. server.go reads AO_OWNER env to set it.
- daemonEnv() injects AO_OWNER=app so spawned daemons self-identify.
- Extracted establishSupervisorLink() from inline reportBoundPort code.
  Spawn path calls it unconditionally; attach paths call it only when
  shouldLinkOnAttach(owner) is true (owner === "app").
- Both attach paths (inspectExistingDaemon, resolveDaemonFromPort) now
  read the owner from the run-file and re-link when app-owned, preventing
  a lingering app-spawned daemon from self-stopping mid-session.
- Headless ao start daemons stay unlinked: persistent across app quit.
- New daemon-owner.ts + daemon-owner.test.ts (4 vitest cases, all pass).
- Go: 9 tests pass with -race; vet clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…CTOU

Review polish on AgentWrapper#2185: a promptless worker left terminated on boot is expected,
not an operational error, so log it at Warn not Error. Document the narrow
run-file re-read TOCTOU on the port-attach re-link path (worst case is linking a
headless daemon; the dispose-idempotency guard prevents any leak).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant