fix(daemon): stop orchestrator id-increment + OS-native daemon liveness link#2185
Open
harshitsinghbhandari wants to merge 23 commits into
Open
fix(daemon): stop orchestrator id-increment + OS-native daemon liveness link#2185harshitsinghbhandari wants to merge 23 commits into
harshitsinghbhandari wants to merge 23 commits into
Conversation
…n boot Remove SaveAndTeardownAll from the graceful shutdown path. Live tmux/ConPTY sessions survive daemon exit; Reconcile on the next boot adopts them via reconcileLive, preserving session IDs and preventing the id-increment bug. Add runShutdownSessionLifecycle as a testable seam and narrow the sessionLifecycle interface to Reconcile+RestoreAll only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-add SaveAndTeardownAll to the sessionLifecycle interface and to fakeSessionLifecycle so TestShutdown_DoesNotCallSaveAndTeardownAll is genuinely falsifiable: the flag flips if runShutdownSessionLifecycle ever calls sl.SaveAndTeardownAll, making the assertion meaningful. Name the sl parameter (was discarded with _) so the seam is visible. RED: with a temporary sl.SaveAndTeardownAll call, test fails. GREEN: without it, test passes. go test ./internal/daemon/... -race: 23/23. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d interface Remove SaveAndTeardownAll from sessionLifecycle so daemon.Run physically cannot call teardown on shutdown. Delete the no-op runShutdownSessionLifecycle seam and its test. The narrowed interface is the guard: re-introducing teardown requires a visible, reviewable interface change. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… a duplicate When clean=false, SpawnOrchestrator now checks for an existing active orchestrator and returns it directly instead of always calling Spawn. Adds RED/GREEN tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nNoneExists Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds backend/internal/daemon/supervisor: a transport-agnostic watchdog that fires onLastClientGone() exactly once when the live connection count drops to zero and stays zero for a configurable grace period. Arms only after the first accepted connection (headless safety: CLI ao start never self-stops). Reconnect before grace elapses cancels the pending timer. Mutex guards liveCount/armed/pendingTimer; sync.Once guards the callback. Tests use net.Pipe + a fake listener (no OS sockets); all three behavior contracts verified with -race: never fires pre-connect, fires once on last disconnect, reconnect within grace cancels and re-arms correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…leak, cleanup) - Fix data race: capture liveCount into local inside lock before logging - Replace io.EOF with errors.Is(err, net.ErrClosed) for correct production behavior; update fakeListener to match - Fix goroutine leak: derive cancellable child context in Serve so watcher always unblocks on return - Simplify makePipe: drop dead error return, update 3 call sites - Remove dead defensive pendingTimer nil-check in armGrace Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Creates platform-split Listen() in supervisor package (Unix UDS sibling of run-file; Windows named pipe via go-winio). Wires it into daemon.Run before srv.Run: listener failure is non-fatal so headless ao-start works. Adds RequestShutdown() on Server and three unit tests for listen_unix.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… exit
Add connectSupervisor() in frontend/src/main/supervisor-link.ts: holds one
client connection to the daemon's OS-native supervisor socket (Unix domain
socket on macOS/Linux, named pipe on Windows). Retries with bounded backoff
until the daemon is up; reconnects automatically on drop.
In main.ts, connect the link from reportBoundPort (once per daemon ready
transition). Remove the quit-time daemon teardown: the before-quit handler
now only disposes the browser view. The process.on("exit") killDaemon call
is also removed. When Electron exits for any reason the OS closes the fd,
the daemon detects EOF, and self-stops after its 5s grace period. tmux and
ConPTY sessions survive and are adopted on the next boot.
killDaemon and stopDaemon are kept for the explicit user-stop path
(ipcMain.handle("daemon:stop", ...)).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n; tidy test setup
Expose a live `connected` getter on SupervisorLinkHandle and add a
last-resort process.on("exit") kill that fires only when the link is
not actually connected (UDS never bound or addr was null), preventing
orphan daemons while keeping the OS-fd teardown path for the normal
case. Log a warning when addr is null so the skip is diagnosable.
Invert the setup.ts guard to the natural form, dropping the empty
if/else skeleton.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…increment) Drop the empty-prompt early return in restoreArgv that returned ErrNotResumable when the adapter could not resume and no prompt was saved. Now control falls through to GetLaunchCommand unconditionally: a saved prompt is replayed, an empty prompt (orchestrator) launches fresh with the system prompt only, same id, same workspace. Removes the only producer of ErrNotResumable and deletes the now-dead sentinel, its service mapping, and the corresponding tests. Frontend reference to SESSION_NOT_RESUMABLE in TerminalPane.tsx is left intact (the handler becomes harmlessly dead; the API simply stops returning that code). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r reconcile Task 5 made promptless sessions relaunch fresh instead of hitting ErrNotResumable. The reconcile crash-recovery path (documented in reconcileLive) terminates a dead-live session then RestoreAll relaunches it on the same boot. This test asserted the old promptless-stays-terminated artifact; update it to the intended restored end state (live again, same id, one fresh runtime Create). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…xplicit stop Document that the liveness link is established only when the app spawned the daemon: the attach path intentionally does not link, to keep an `ao start` daemon persistent (headless safety). Also dispose the link on an explicit daemon:stop so its reconnect loop does not retry a deliberately-stopped daemon. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
a5a31be to
6ed9bc5
Compare
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rrors Review of AgentWrapper#2185: Serve previously returned on the first unexpected Accept error, silently disabling the watchdog (the 'restart is caller's job' comment described a contract the caller never fulfilled). Back off and keep accepting instead, so a transient error (e.g. EMFILE) cannot leave the daemon unable to self-stop on frontend death. Also drop the stale Task 3/4 planning comment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ors still relaunch fresh) A promptless, unresumable KindWorker session had no prompt to replay and no native session id to resume from. Blank-relaunching it via GetLaunchCommand would silently drop its task. restoreArgv now returns ErrNotResumable for this case, gated on (meta.Prompt == "" && kind != domain.KindOrchestrator). Orchestrators are promptless by design and continue to relaunch fresh with the system prompt only. Re-introduces ErrNotResumable sentinel and its Conflict API mapping. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…se lingering-daemon gap) - runfile.Info gets Owner field (omitempty): "app" when Electron spawned, empty for headless `ao start`. server.go reads AO_OWNER env to set it. - daemonEnv() injects AO_OWNER=app so spawned daemons self-identify. - Extracted establishSupervisorLink() from inline reportBoundPort code. Spawn path calls it unconditionally; attach paths call it only when shouldLinkOnAttach(owner) is true (owner === "app"). - Both attach paths (inspectExistingDaemon, resolveDaemonFromPort) now read the owner from the run-file and re-link when app-owned, preventing a lingering app-spawned daemon from self-stopping mid-session. - Headless ao start daemons stay unlinked: persistent across app quit. - New daemon-owner.ts + daemon-owner.test.ts (4 vitest cases, all pass). - Go: 9 tests pass with -race; vet clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…CTOU Review polish on AgentWrapper#2185: a promptless worker left terminated on boot is expected, not an operational error, so log it at Warn not Error. Document the narrow run-file re-read TOCTOU on the port-attach re-link path (worst case is linking a headless daemon; the dispose-idempotency guard prevents any leak). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On macOS the orchestrator session kept getting recreated with an incrementing id (
ao-agents-14 -> 15 -> 16) across daemon restarts, app quits, and reboots, losing its context each time. The daemon was also orphaned on app close. Windows did not have the bug.Root cause (proven by sandbox experiment): the graceful daemon-shutdown path ran
SaveAndTeardownAll, which killed live tmux sessions. On restart the promptless orchestrator could not be restored (ErrNotResumable), so the frontend minted a new one (id++). The hard-kill path left tmux alive and the boot reconcile adopted it cleanly. This makes graceful shutdown behave like hard-kill, and adds an OS-native liveness link so the daemon stops cleanly (not orphaned) when the frontend dies.What changed
Core increment fix
Reconcileadopts the surviving tmux/ConPTY sessions.SaveAndTeardownAllis dropped from the daemon'ssessionLifecycleinterface, so re-adding teardown-on-shutdown is now a compile error.SpawnOrchestrator(clean=false)is idempotent server-side: it returns the existing live orchestrator instead of minting a duplicate (previously only a client-side guard existed).OS-native liveness link (no orphan, no clean-close dependency)
ao startdaemon with no frontend never self-stops), fires once after a 5s grace when the last client drops, reconnect cancels.~/.ao/supervise.sock(sibling ofrunning.json) and a Windows named pipe (go-winio), wired withonLastClientGone = RequestShutdown. Non-fatal if it cannot bind.Reboot recovery
ErrNotResumable(which is removed). Orchestrator == worker, no special-casing.Verification
go build ./... && go vet ./... && go test ./... -racegreen; frontendvitest307/307 andtscclean.tmux kill-server) -> restored fresh, same id.Notes
mainto drop the already-merged commits, after which this diff shows only the daemon-lifecycle change.ao startpersistence.🤖 Generated with Claude Code