
Zellij to tmux + ConPTY runtime, session save/restore, crash-proof reconcile (port #404) #2183

Merged
harshitsinghbhandari merged 4 commits into AgentWrapper:main from harshitsinghbhandari:port-tmux-conpty-404
Jun 25, 2026
Conversation

@harshitsinghbhandari

Collaborator

What

Ports ReverbCode PR #404 (9ae0573) into agent-orchestrator: migrate the terminal runtime from Zellij to tmux (Darwin/Linux) and ConPTY (Windows), plus the session save/restore lifecycle and crash-proof reconcile work that shipped with it.

This was a clean cherry-pick (-x) onto main with zero conflicts; the parent commit (e970f72d, PR #403) is already in our history and the pre-state (the zellij/ adapter) matched exactly.

Highlights

  • tmux adapter (runtime/tmux) implementing ports.Runtime via the tmux CLI; drop-in replacement for zellij on Darwin/Linux.
  • ConPTY runtime (runtime/conpty): pure-Go named-pipe framing protocol + output ring buffer, pty-host serve engine over loopback TCP, sideband pty registry, and a runtime.Runtime adapter. New hidden ao pty-host subcommand.
  • runtimeselect: picks tmux on non-Windows, conpty on Windows. zellij/ deleted entirely.
  • Stream-based Attach for tmux/zellij/conpty via shared ptyexec package (PTY spawn moved out of terminal).
  • Session lifecycle persistence: ForceDestroy, StashUncommitted/ApplyPreserved (preserve uncommitted work to refs/ao/preserved/<id>), SaveAndTeardownAll/RestoreAll, wired into daemon boot/shutdown and the frontend POST /shutdown on quit.
  • Crash-proof reconcile on boot: adopt-alive / stash+terminate-dead / reap-orphans, restore promptless orchestrators, and a frontend kill+replace for a wedged orphan daemon.
  • Fixes: tmux mouse scroll + link clicking in the embedded terminal, default TERM for Finder-launched attach, typed SESSION_NOT_RESUMABLE (no more opaque 500), local-keychain macOS signing in forge.config.ts.

Verification

  • go build ./... clean on darwin, linux, and windows (cross-compiled).
  • go vet ./... clean.
  • go test -race ./... green across the full backend suite.

Frontend changes are a verbatim carry of the merged commit (including regenerated pnpm-lock.yaml and routeTree.gen.ts); not separately rebuilt.

Upstream: aoagents/ReverbCode@9ae0573

🤖 Generated with Claude Code

harshitsinghbhandari and others added 4 commits June 25, 2026 11:42
* feat(runtime): add tmux adapter package

Adds backend/internal/adapters/runtime/tmux implementing ports.Runtime via
the tmux CLI. Drop-in replacement for the zellij adapter on Darwin/Linux.

Key design points:
- Handle is a plain session id string (no pane-id split needed for tmux).
- Exact-match session targeting via = prefix for kill-session and has-session.
- Keep-alive shell appended to launch command so sessions survive agent exit.
- send-keys -l chunked for literal text delivery (no key-name interpretation).
- IsAlive distinguishes definitive-dead (missing/no-server output) from probe
  errors so the reaper never kills a session on a transient tmux failure.
- 34 tests pass: 32 unit tests via fakeRunner seam, 2 integration tests on
  real tmux 3.6b (TestRuntimeIntegration, TestRuntimeIntegrationExactSessionParsing).
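
A minimal sketch of two of the design points above (exact-match targeting and chunked literal delivery), not the adapter's actual code. The helper names `exactTarget`, `hasSessionArgs`, `chunkLiteral`, and `sendKeysArgs`, and the chunk size used in `main`, are hypothetical; only the tmux subcommands and the `-t`/`-l` flags come from the notes above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// exactTarget prefixes the id with "=" so tmux matches the session name
// exactly rather than as a prefix or pattern.
func exactTarget(id string) string { return "=" + id }

// hasSessionArgs builds the argv for an exact-match liveness probe.
func hasSessionArgs(id string) []string {
	return []string{"has-session", "-t", exactTarget(id)}
}

// chunkLiteral splits text into pieces of at most limit bytes without
// cutting a multi-byte UTF-8 sequence in half.
func chunkLiteral(text string, limit int) []string {
	if limit < 1 {
		limit = 1
	}
	var chunks []string
	for len(text) > 0 {
		n := limit
		if n >= len(text) {
			n = len(text)
		} else {
			for n > 0 && !utf8.RuneStart(text[n]) {
				n-- // back up to the start of the rune we would have split
			}
			if n == 0 { // limit is smaller than this rune; emit the rune whole
				_, n = utf8.DecodeRuneInString(text)
			}
		}
		chunks = append(chunks, text[:n])
		text = text[n:]
	}
	return chunks
}

// sendKeysArgs builds one literal send-keys call; -l disables key-name lookup.
func sendKeysArgs(target, chunk string) []string {
	return []string{"send-keys", "-t", target, "-l", chunk}
}

func main() {
	fmt.Println(hasSessionArgs("ao-42"))
	for _, c := range chunkLiteral("echo héllo", 4) {
		fmt.Println(sendKeysArgs("ao-42", c))
	}
}
```

Splitting on rune boundaries is an assumption of this sketch; the notes above only say the text is chunked.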

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(tmux): address four code-review findings in tmux runtime adapter

- Remove em dash from tmux_test.go:462 (project hard rule); replace with semicolon
- Derive integration test session IDs from t.Name() so concurrent runs do not collide on the same tmux session
- Remove dead scaffolding variables (r/fr, r2/fr2) in TestCreateDestroysAndReturnsErrorWhenNotAlive
- Quote ${SHELL:-/bin/sh} in buildLaunchCommand and update all asserting tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(runtime): wire tmux on Darwin/Linux via runtimeselect, keep zellij on Windows

- New package runtimeselect: Runtime union interface (ports.Runtime +
  SendMessage/GetOutput/AttachCommand) with compile-time assertions for
  both adapters. New(log) returns tmux on non-Windows, zellij on Windows
  (replicating the old daemon socket-dir setup).
- daemon.go: replace zellij-specific socket-dir block with
  runtimeselect.New(log); update comment to be runtime-neutral.
- lifecycle_wiring.go: startSession param changed from *zellij.Runtime
  to runtimeselect.Runtime.
- cli/doctor.go: runtime-aware checkTerminalRuntime (tmux on Darwin/Linux,
  zellij on Windows); added checkTmux.
- cli/spawn.go: attach hint prints tmux attach -t <name> on non-Windows,
  keeps zellij attach hint on Windows.
- wiring_test.go: startSession test uses runtimeselect.New(nil); zellij
  direct tests retained for zellij-specific coverage.
- doctor_test.go: replaced three zellij tool tests with tmux equivalents.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore: tidy runtime-neutral comments and doctor import grouping

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(tmux): drop unused runner.Start seam

tmux creates sessions detached via new-session -d, so the Start method
(carried over from the zellij runner shape, where it backs the Windows
fire-and-forget spawn) is never called. Remove it from the interface and
its implementations to shrink the seam.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(conpty): add protocol codec and output ring buffer (pure Go, OS-agnostic)

Ports the ConPTY named-pipe binary framing protocol and rolling output
buffer from pty-host.ts to Go. Implements EncodeMessage, MessageParser
(handles arbitrary chunk boundaries, payload copy guarantee), and Ring
(MaxOutputLines=1000, ANSI-safe, concurrent Append+Snapshot). All 15
unit tests pass on Darwin; GOOS=windows build is also clean.
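
To illustrate what "handles arbitrary chunk boundaries" and "payload copy guarantee" mean, here is a sketch of a length-prefixed codec. The frame layout (1-byte type, 4-byte big-endian length, payload) is an assumption made for this example; the real wire format ported from pty-host.ts is not shown in this PR and may differ. The names mirror the commit message, but the signatures are guesses.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// headerLen is the assumed header size: 1 type byte + 4 length bytes.
const headerLen = 5

// Message is one decoded frame.
type Message struct {
	Type    byte
	Payload []byte
}

// EncodeMessage frames a payload, checking bounds before narrowing the
// length to uint32.
func EncodeMessage(typ byte, payload []byte) ([]byte, error) {
	if uint64(len(payload)) > math.MaxUint32 {
		return nil, errors.New("payload too large for a 32-bit length prefix")
	}
	buf := make([]byte, headerLen+len(payload))
	buf[0] = typ
	binary.BigEndian.PutUint32(buf[1:headerLen], uint32(len(payload)))
	copy(buf[headerLen:], payload)
	return buf, nil
}

// MessageParser accumulates bytes until whole frames are available.
type MessageParser struct{ buf []byte }

// Feed accepts a chunk cut at any byte boundary and returns every frame
// completed so far. Each payload is copied out of the internal buffer so
// later Feed calls cannot overwrite it.
func (p *MessageParser) Feed(chunk []byte) []Message {
	p.buf = append(p.buf, chunk...)
	var out []Message
	for len(p.buf) >= headerLen {
		n := int(binary.BigEndian.Uint32(p.buf[1:headerLen]))
		if len(p.buf) < headerLen+n {
			break // frame is still incomplete; wait for more bytes
		}
		payload := make([]byte, n)
		copy(payload, p.buf[headerLen:headerLen+n])
		out = append(out, Message{Type: p.buf[0], Payload: payload})
		p.buf = p.buf[headerLen+n:]
	}
	return out
}

func main() {
	frame, err := EncodeMessage(1, []byte("hello"))
	if err != nil {
		panic(err)
	}
	p := &MessageParser{}
	fmt.Println(len(p.Feed(frame[:3]))) // header split mid-way: no frame yet
	for _, m := range p.Feed(frame[3:]) {
		fmt.Printf("type=%d payload=%q\n", m.Type, m.Payload)
	}
}
```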

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(conpty): harden copy-safety and add concurrent ring test

Strengthen TestParserPayloadIsCopy to catch internal-buffer aliasing:
feed frame1, capture its payload, feed frame2 of the same length so the
parser's buffer overwrites the frame1 region, then assert frame1's bytes
are unchanged. The prior test only mutated the input slice post-Feed and
did not exercise the real aliasing risk.

Add TestRingConcurrent: 10 writer goroutines (Append) and 10 reader
goroutines (Snapshot + Tail) running concurrently with a WaitGroup. The
test is meaningful only under the race detector and catches any missing
mu coverage on Ring's exported methods.
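
A reduced sketch of the pattern this test guards, assuming a line-based ring protected by one mutex. It omits the ANSI handling and the Tail method mentioned above, and the field layout is hypothetical; only MaxOutputLines=1000 and the Append/Snapshot names come from the commit messages.

```go
package main

import (
	"fmt"
	"sync"
)

// MaxOutputLines caps how many lines the ring retains.
const MaxOutputLines = 1000

// Ring keeps the most recent lines; every exported method takes mu.
type Ring struct {
	mu    sync.Mutex
	lines []string
}

// Append adds a line and drops the oldest ones beyond the cap.
func (r *Ring) Append(line string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lines = append(r.lines, line)
	if extra := len(r.lines) - MaxOutputLines; extra > 0 {
		r.lines = r.lines[extra:]
	}
}

// Snapshot returns a copy so callers never share the backing array.
func (r *Ring) Snapshot() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]string(nil), r.lines...)
}

func main() {
	var r Ring
	var wg sync.WaitGroup
	for w := 0; w < 10; w++ {
		wg.Add(2)
		go func(w int) { // writer
			defer wg.Done()
			for i := 0; i < 200; i++ {
				r.Append(fmt.Sprintf("w%d-%d", w, i))
			}
		}(w)
		go func() { // reader
			defer wg.Done()
			for i := 0; i < 200; i++ {
				_ = r.Snapshot()
			}
		}()
	}
	wg.Wait()
	fmt.Println("retained lines:", len(r.Snapshot()))
}
```

Running this with `go run -race` is what makes the exercise meaningful: without the mutex in either method the race detector would report the concurrent slice access.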

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(ptyregistry): port Windows pty-host sideband registry to Go

Adds package ptyregistry under backend/internal/adapters/runtime/conpty/ptyregistry.
Ports windows-pty-registry.ts: defensive read, atomic temp+rename write,
delete-on-empty, register-replaces-same-ID, and auto-pruning List.
PID liveness isolated behind build tags (syscall.Kill on Unix,
OpenProcess on Windows). 10 tests all green on Darwin.
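
The atomic temp+rename write, delete-on-empty, and defensive read can be sketched as follows. The JSON encoding, the `Entry` fields, and the function names are assumptions for illustration; the actual registry format ported from windows-pty-registry.ts is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Entry is a hypothetical registry record.
type Entry struct {
	ID   string `json:"id"`
	PID  int    `json:"pid"`
	Addr string `json:"addr"`
}

// writeRegistry replaces the file atomically: write a temp file in the same
// directory, then rename it over the target. An empty list deletes the file.
func writeRegistry(path string, entries []Entry) error {
	if len(entries) == 0 {
		if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
			return err
		}
		return nil
	}
	data, err := json.Marshal(entries)
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".registry-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// readRegistry is defensive: a missing or corrupt file reads as empty.
func readRegistry(path string) []Entry {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil
	}
	var entries []Entry
	if json.Unmarshal(data, &entries) != nil {
		return nil
	}
	return entries
}

// demo writes one entry, reads it back, then writes an empty list and
// reports whether the file was removed.
func demo() (int, bool, error) {
	dir, err := os.MkdirTemp("", "ptyregistry")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "registry.json")
	if err := writeRegistry(path, []Entry{{ID: "s1", PID: 123, Addr: "127.0.0.1:4000"}}); err != nil {
		return 0, false, err
	}
	n := len(readRegistry(path))
	if err := writeRegistry(path, nil); err != nil {
		return n, false, err
	}
	_, statErr := os.Stat(path)
	return n, os.IsNotExist(statErr), nil
}

func main() {
	n, gone, err := demo()
	fmt.Println(n, gone, err)
}
```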

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(sdd): phase B briefs and progress for B1-B3

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(conpty): add pty-host serve engine with loopback TCP transport (B3)

Ports pty-host.ts behavior to Go: ptyConn interface seam, Serve engine
with ring replay, fan-out broadcast, MSG_* handlers, PTY-exit keep-alive,
and graceful shutdown (ConPTY dispose first, 50ms grace, then clients and
listener). Real conptyConn is Windows-only via build tag; non-Windows stub
keeps the package importable on Darwin/Linux. Tests use a fake ptyConn
with real loopback sockets and the B1 MessageParser, passing with -race.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(conpty): deliver scrollback snapshot and register client atomically

Review of Task B3 found one Important bug and two minors.

Important: in handleConn the ring Snapshot and the client registration
ran under two separate h.mu acquisitions. A PTY chunk arriving in that
gap was in neither the snapshot nor that client's broadcast, so it was
silently dropped (a hole in the client's stream). Now take the snapshot,
write it to the conn, and add the conn to the clients set all under a
single h.mu hold; broadcast also takes h.mu so it cannot interleave.
Added TestScrollbackLiveOrdering_NoDrop, which emits a contiguous
numbered stream while a client connects and asserts the client's stream
has no internal gap. It reliably fails against the old two-step code and
passes under -race -count=20.
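
The fix can be pictured with a stripped-down hub, offered as a sketch rather than the handleConn code itself. The `hub` and `client` types are hypothetical and clients are plain slices instead of network connections; the point is that snapshot replay and registration happen under one lock hold that broadcast also takes.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// client stands in for a connected socket; got records what it received.
type client struct{ got []string }

type hub struct {
	mu      sync.Mutex
	ring    []string
	clients map[*client]struct{}
}

func newHub() *hub { return &hub{clients: make(map[*client]struct{})} }

// broadcast records the chunk and fans it out while holding mu, so it can
// never interleave with attach.
func (h *hub) broadcast(chunk string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.ring = append(h.ring, chunk)
	for c := range h.clients {
		c.got = append(c.got, chunk)
	}
}

// attach replays the scrollback and registers the client in a single
// critical section; a chunk lands either in the snapshot or in the live
// stream, never in the gap between them.
func (h *hub) attach(c *client) {
	h.mu.Lock()
	defer h.mu.Unlock()
	c.got = append(c.got, h.ring...)
	h.clients[c] = struct{}{}
}

// contiguous reports whether got is exactly "0", "1", "2", ...
func contiguous(got []string) bool {
	for i, s := range got {
		if s != strconv.Itoa(i) {
			return false
		}
	}
	return true
}

// runScenario attaches a client while a numbered stream is being emitted.
func runScenario() (int, bool) {
	h := newHub()
	c := &client{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			h.broadcast(strconv.Itoa(i))
		}
	}()
	go func() {
		defer wg.Done()
		h.attach(c)
	}()
	wg.Wait()
	return len(c.got), contiguous(c.got)
}

func main() {
	n, ok := runScenario()
	fmt.Println(n, ok)
}
```

Holding the lock while writing to a client is the trade-off of this approach: a slow client can delay the broadcast, which real code would typically bound with write deadlines.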

Minor (faithfulness): conptyConn.Close() now also best-effort
Process.Kill() (nil-guarded) so a child that ignores ConPTY EOF still
exits and Done() fires, mirroring pty.kill() in pty-host.ts.

Minor (simplify): use os.Environ() instead of
exec.Command(shellCmd).Environ() for the child env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(sdd): B4 brief

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(conpty): add runtime adapter with loopback pty-client and session management (B4)

Implements the conpty Runtime adapter: injectable spawn seam, loopback
TCP client helpers (SendMessage/GetOutput/IsAlive/Kill), and Runtime
methods (Create/Destroy/IsAlive/SendMessage/GetOutput). Session resolution
uses an in-memory map with B2 registry fallback for daemon-restart
recovery. Windows-only detached spawn in spawn_windows.go; stub errors
on other OSes. All adapter methods are unit-tested on Darwin against an
in-process B3 Serve and fakePTY. 48 tests pass, all three GOOS builds
succeed, vet clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(conpty): split IsAlive dead-vs-transient for reaper safety

clientIsAlive collapsed every probe failure (dial timeout, read-deadline
expiry, write error, connection-refused) to false, which the reaper turns
into ProbeDead and the LCM can promote to a permanent reap. A single
transient 2s loopback timeout would spuriously kill a live idle session.

Now clientIsAlive returns (alive bool, transientErr error): a refused dial
is definitively gone (false, nil); a timeout or any connected-then-failed
I/O error is transient (false, err) so the reaper records ProbeFailed and
retries. Wire IsAlive to propagate it. Add regression test covering both
the refused-is-gone and timeout-is-transient paths.
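
A sketch of the dead-versus-transient split, assuming the classification hinges on whether the error unwraps to a connection-refused errno. `classifyProbe`, `refusedErr`, and `timeoutErr` are hypothetical names, and the actual clientIsAlive may detect refusal differently (particularly on Windows, where the underlying error code differs).

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// classifyProbe maps a probe error to (alive, transientErr):
//   - nil error: the session answered, so it is alive
//   - connection refused: nothing is listening, so it is definitively gone
//   - anything else (timeouts, I/O errors): unknown, so report it as transient
func classifyProbe(err error) (alive bool, transientErr error) {
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, syscall.ECONNREFUSED):
		return false, nil
	default:
		return false, err
	}
}

// refusedErr builds the error shape a refused dial typically has.
func refusedErr() error {
	return &net.OpError{Op: "dial", Net: "tcp",
		Err: os.NewSyscallError("connect", syscall.ECONNREFUSED)}
}

// timeoutErr builds the error shape of an expired read deadline.
func timeoutErr() error {
	return &net.OpError{Op: "read", Net: "tcp", Err: os.ErrDeadlineExceeded}
}

func main() {
	for _, err := range []error{nil, refusedErr(), timeoutErr()} {
		alive, transient := classifyProbe(err)
		fmt.Printf("alive=%v transient=%v\n", alive, transient != nil)
	}
}
```

With this shape a caller such as the reaper can treat `(false, nil)` as dead and `(false, err)` as a failed probe to retry.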

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(sdd): B5 brief + ledger

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(terminal): stream-based Attach for tmux/zellij/conpty

Evolve the terminal layer from argv-based attach (PTYSource.AttachCommand
+ injected spawnFunc) to stream-based attach (Source embedding
ports.Attacher). tmux/zellij keep spawning their attach CLI on a local
PTY via the new shared ptyexec.Spawn; conpty attaches by dialing its
loopback pty-host directly with a loopbackStream over the B1 framing
protocol. Reattach/backoff/size/SIGWINCH/detach semantics are unchanged.

- ports: add Stream + Attacher.
- ptyexec: new shared package holding the creack/pty (unix) and ConPTY
  (windows) spawn, moved verbatim from terminal with its tests.
- terminal: PTYSource -> Source, drop spawnFunc/WithSpawn, run loop calls
  src.Attach and uses ports.Stream.
- tmux/zellij: add Attach (argv via ptyexec.Spawn); conpty: add Attach
  (loopbackStream); ports.Attacher assertions on all three.
- runtimeselect: union embeds ports.Attacher in place of AttachCommand.
- tests migrated; new conpty attach_test against in-process Serve+fakePTY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* style(ptyexec): replace em dashes carried from moved pty files

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(sdd): B6 brief + B5 ledger

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(runtime): select conpty on Windows, register pty-host subcommand, delete zellij

- runtimeselect.New: Windows branch now returns conpty.New(conpty.Options{}) instead
  of zellij; compile-time assertion updated to conpty.Runtime.
- cli/ptyhost.go: new hidden "ao pty-host" subcommand (DisableFlagParsing so agent
  shell args with leading dashes survive); calls conpty.RunHost and exits with its code.
- cli/root.go: wires newPtyHostCommand alongside newLaunchCommand.
- cli/doctor.go: Windows terminal-runtime check replaced with a static ConPTY
  built-in pass; zellij import and checkZellij function removed.
- cli/spawn.go: Windows attach hint updated to dashboard message (ConPTY has no
  CLI attach); zellij import removed.
- daemon/lifecycle_wiring.go: stale zellij comment updated to tmux/conpty.
- daemon/wiring_test.go: zellij import and TestDaemonZellijSocketDir test removed;
  TestWiring_StartLifecycleThreadsMessengerIntoLCM now uses tmux.New.
- terminal/attachment_integration_test.go: re-pointed at real tmux
  (TestAttachmentStreamsRealTmuxPane + TestAttachmentReattachAdoptsNewSize);
  sessions cleaned up in t.Cleanup.
- internal/adapters/runtime/zellij: deleted entirely.

All three GOOS builds pass; go test -race ./... 1607 passed; go vet clean;
grep -rn "runtime/zellij" returns nothing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(daemon): correct terminal-runtime comment to conpty on Windows

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(ptyexec): drop stale zellij reference in Windows spawn comment

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(sdd): final phase B ledger

* build(desktop): support local keychain signing for macOS builds

Bridge forge.config.ts to accept the local keychain flow (APPLE_SIGNING_IDENTITY
identity + AO_NOTARY_PROFILE notarytool profile) in addition to the existing CI
secrets path (CSC_LINK + APPLE_ID/app-specific-password). Enables a signed +
notarized macOS build from a developer Mac without exporting a .p12 or the Apple
ID app-specific password.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(daemon): default TERM so Finder-launched tmux attach works

A Finder/Dock launch starts the supervisor under launchd with no
controlling tty, so TERM is unset. The daemon inherits that, and its
tmux attach client (spawned with env=nil, inheriting the daemon env)
dies immediately with "open terminal failed: terminal does not support
clear", so the orchestrator terminal pane never opens.

Seed TERM=xterm-256color (what the renderer's xterm.js emulates) as the
base of buildDaemonEnv, the same place PATH is reconstructed for the same
class of "Finder launch lacks a terminal's env" bug. A real TERM from the
shell/process env still wins.
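
The "seed a default, let the real value win" idea, sketched in Go. The real buildDaemonEnv is not shown in this PR and may live in another language; the signature below and the `lookup` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildDaemonEnv starts from a base that seeds TERM, then overlays the
// process environment so any TERM that is actually set takes precedence.
func buildDaemonEnv(processEnv []string) []string {
	merged := map[string]string{"TERM": "xterm-256color"}
	for _, kv := range processEnv {
		k, v, ok := strings.Cut(kv, "=")
		if !ok || k == "" {
			continue // skip malformed entries
		}
		merged[k] = v
	}
	out := make([]string, 0, len(merged))
	for k, v := range merged {
		out = append(out, k+"="+v)
	}
	sort.Strings(out) // deterministic order for readability
	return out
}

// lookup returns the value for key in a KEY=VALUE list, or "" if absent.
func lookup(env []string, key string) string {
	for _, kv := range env {
		if k, v, ok := strings.Cut(kv, "="); ok && k == key {
			return v
		}
	}
	return ""
}

func main() {
	finder := buildDaemonEnv([]string{"PATH=/usr/bin"})
	shell := buildDaemonEnv([]string{"PATH=/usr/bin", "TERM=screen-256color"})
	fmt.Println(lookup(finder, "TERM"), lookup(shell, "TERM"))
}
```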

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(lifecycle): plan for save-on-close/restore-on-open sessions

Captures the intended daemon lifecycle: on shutdown save every running
session (worker and orchestrator) plus its gitignore-respecting uncommitted
work to refs/ao/preserved/<id>, then force-remove worktrees; on boot recreate
worktrees, replay the preserved work, and restore all sessions. Reuses
existing SQLite state, session_worktrees.preserved_ref, manager.Restore, and
the /shutdown endpoint (no new file, migration, or route).

Also gitignore the built daemon binary copied into frontend/daemon/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(frontend): sync regenerated pnpm-lock and routeTree

Working-tree regeneration of the pnpm lockfile and TanStack Router generated
route tree. No hand edits; generated output only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(workspace): add ForceDestroy for shutdown-path worktree removal

Adds ForceDestroy(ctx, info) to ports.Workspace and the gitworktree
adapter. It runs `git worktree remove --force`, then prune, then
os.RemoveAll as a backstop. A new worktreeForceRemoveArgs builder in
commands.go emits --force; the existing worktreeRemoveArgs is untouched
so Destroy still refuses dirty worktrees via ErrWorkspaceDirty.

TDD: test first creates a dirty worktree, confirms Destroy refuses with
ErrWorkspaceDirty, then confirms ForceDestroy succeeds and the path is
gone and deregistered. All 1609 backend tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(workspace): add StashUncommitted and ApplyPreserved for session lifecycle

Implements the correctness-critical save-on-close / restore-on-open pair
in the gitworktree adapter:

- StashUncommitted: captures uncommitted work (tracked edits and new
  non-ignored files) via a temp GIT_INDEX_FILE into a real commit stored
  at refs/ao/preserved/<session-id>. Never touches the real index or
  stash stack. Returns empty string for clean worktrees. Logs the count
  of .gitignore-skipped paths.
- ApplyPreserved: replays the preserve commit onto a freshly re-added
  worktree via "git checkout <SHA> -- .". Deletes the ref on clean
  success; keeps it and returns ErrPreservedConflict (wrapped) on
  content conflicts.
- Adds both methods to ports.Workspace interface and stubs them in
  integration and session_manager test doubles.
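
One way the temp-index capture could be sequenced is sketched below. This is an illustration under assumptions, not the adapter's code: the `gitRunner` seam, the clean-tree check via `rev-parse HEAD^{tree}`, the commit message, and returning the ref name are all choices made for the example. The git subcommands themselves are standard plumbing.

```go
package main

import (
	"fmt"
	"strings"
)

// gitRunner runs git in the worktree with extra env vars and returns
// trimmed stdout. It is a hypothetical seam so the flow can be faked.
type gitRunner func(env []string, args ...string) (string, error)

// stashUncommitted snapshots the working tree through a temporary index so
// the real index and the stash stack are never modified.
func stashUncommitted(run gitRunner, tmpIndex, sessionID string) (string, error) {
	env := []string{"GIT_INDEX_FILE=" + tmpIndex}
	if _, err := run(env, "read-tree", "HEAD"); err != nil { // seed temp index
		return "", err
	}
	if _, err := run(env, "add", "-A"); err != nil { // stage edits + new non-ignored files
		return "", err
	}
	tree, err := run(env, "write-tree")
	if err != nil {
		return "", err
	}
	headTree, err := run(nil, "rev-parse", "HEAD^{tree}")
	if err != nil {
		return "", err
	}
	if tree == headTree {
		return "", nil // clean worktree: nothing to preserve
	}
	commit, err := run(nil, "commit-tree", tree, "-p", "HEAD", "-m", "preserve "+sessionID)
	if err != nil {
		return "", err
	}
	ref := "refs/ao/preserved/" + sessionID
	if _, err := run(nil, "update-ref", ref, commit); err != nil {
		return "", err
	}
	return ref, nil
}

// fakeGit returns canned output and records each invocation.
func fakeGit(dirty bool, calls *[]string) gitRunner {
	return func(env []string, args ...string) (string, error) {
		*calls = append(*calls, strings.Join(args, " "))
		switch args[0] {
		case "write-tree":
			if dirty {
				return "tree-dirty", nil
			}
			return "tree-head", nil
		case "rev-parse":
			return "tree-head", nil
		case "commit-tree":
			return "c0ffee", nil
		}
		return "", nil
	}
}

// demo runs the flow against the fake and reports the ref and call count.
func demo(dirty bool) (string, int) {
	var calls []string
	ref, _ := stashUncommitted(fakeGit(dirty, &calls), "/tmp/ao-index", "s1")
	return ref, len(calls)
}

func main() {
	fmt.Println(demo(true))
	fmt.Println(demo(false))
}
```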

TDD: wrote two failing tests first (RED confirmed via build failure on
undefined methods), then implemented to GREEN. All 39 adapter tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(workspace): replace path-checkout with cherry-pick in ApplyPreserved

git checkout <sha> -- . is a path-checkout that always exits 0 for
content divergence, making ErrPreservedConflict unreachable. Replace
with git cherry-pick --no-commit which performs a true three-way merge,
leaves textual conflict markers on conflict, and exits non-zero so the
sentinel is correctly returned. Conflict detection now uses exit code
only (locale-independent). Add TestWorkspaceIntegrationApplyPreservedConflict
to assert: error is ErrPreservedConflict, preserve ref is kept, conflict
markers appear in the file. All 40 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(session-manager): add SaveAndTeardownAll and RestoreAll for shutdown lifecycle

Implements Task 3: capture-then-destroy on shutdown and restore-all on startup.

- Adds ErrPreservedConflict to ports as a named sentinel; gitworktree aliases it
  (following the same pattern as ErrBranchCheckedOutElsewhere).
- Extends the Store interface with UpsertSessionWorktree and ListSessionWorktrees
  so the session manager can write the shutdown-saved marker and read it back.
- SaveAndTeardownAll: for every live session with a workspace path, stash
  uncommitted work, write the session_worktrees row (DB commit before worktree
  removal, crash-safety invariant), mark terminated, destroy runtime, force-remove
  the worktree. Best-effort per session; no kind filter.
- RestoreAll: for every terminated session that has a session_worktrees row (the
  marker written by SaveAndTeardownAll), re-create the worktree, apply any
  preserved ref (conflict logs and continues), then relaunch via the existing
  single-session Restore. Sessions killed by the user before shutdown (no row)
  are skipped. Best-effort per session; no kind filter.
- TDD: 9 new tests (RED confirmed via build failure, GREEN confirmed 63 pass).
  Full suite: 1621 tests across 77 packages.
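
The per-session ordering described for SaveAndTeardownAll can be sketched like this. The `deps` struct of function fields is a hypothetical stand-in for the store, runtime, and workspace ports, and aborting a session on the first failed step is an assumption of the sketch; the invariant it illustrates (marker written before the worktree is removed) is the one stated above.

```go
package main

import (
	"errors"
	"fmt"
)

// deps bundles the collaborators as plain functions (hypothetical shape).
type deps struct {
	stash          func(id string) (string, error)
	upsertMarker   func(id, ref string) error
	markTerminated func(id string) error
	destroyRuntime func(id string) error
	forceDestroy   func(id string) error
}

// saveAndTeardownOne captures work, commits the marker row, and only then
// performs the destructive steps. If the marker write fails the worktree is
// left in place, so a crash at any point cannot lose un-preserved work.
func saveAndTeardownOne(d deps, id string) error {
	ref, err := d.stash(id)
	if err != nil {
		return fmt.Errorf("stash %s: %w", id, err)
	}
	if err := d.upsertMarker(id, ref); err != nil {
		return fmt.Errorf("marker %s: %w", id, err)
	}
	if err := d.markTerminated(id); err != nil {
		return err
	}
	if err := d.destroyRuntime(id); err != nil {
		return err
	}
	return d.forceDestroy(id)
}

// recordingDeps appends each step name to calls; failUpsert injects a DB error.
func recordingDeps(calls *[]string, failUpsert bool) deps {
	note := func(step string) { *calls = append(*calls, step) }
	return deps{
		stash: func(string) (string, error) { note("stash"); return "refs/ao/preserved/s1", nil },
		upsertMarker: func(string, string) error {
			note("upsert")
			if failUpsert {
				return errors.New("db unavailable")
			}
			return nil
		},
		markTerminated: func(string) error { note("terminate"); return nil },
		destroyRuntime: func(string) error { note("destroyRuntime"); return nil },
		forceDestroy:   func(string) error { note("forceDestroy"); return nil },
	}
}

// indexOf returns the position of step in calls, or -1.
func indexOf(calls []string, step string) int {
	for i, s := range calls {
		if s == step {
			return i
		}
	}
	return -1
}

func main() {
	var calls []string
	err := saveAndTeardownOne(recordingDeps(&calls, false), "s1")
	fmt.Println(calls, err)
}
```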

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(terminal): enable tmux mouse scroll and fix link clicking

On macOS the runtime is tmux, but two mouse interactions were broken in
the embedded terminal while copy/paste kept working:

- Scroll: the renderer drives scrolling by writing SGR mouse-wheel
  reports into the pane (the zellij `--mouse-mode true` model), but tmux
  ignores those reports unless mouse mode is on. Create only set `status
  off`, never `mouse on`, so wheel scrolling silently no-opped. Enable
  `set-option -t <id> mouse on`, mirroring the existing status-off step.

- Link clicking: the default WebLinksAddon handler calls window.open()
  with an empty URL and then assigns location.href. Electron's
  setWindowOpenHandler denies every window.open and only forwards the URL
  passed to it, so the empty open is dropped and clicks no-op. Pass the
  matched URL to window.open directly so the main process routes it to
  shell.openExternal (the OS browser).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(session-manager): assert UpsertSessionWorktree precedes ForceDestroy

Add a shared ordered call log (sharedLog *[]string) to both fakeStore
and fakeWorkspace. TestSaveAndTeardownAll_CaptureOrderAndMarker now
wires both fakes to the same slice and asserts upsertIdx < forceIdx,
enforcing the crash-safety invariant that the DB write is committed
before the worktree is force-destroyed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(daemon): wire RestoreAll/SaveAndTeardownAll into boot/shutdown sequence

Exposes session manager through a minimal sessionLifecycle interface
(RestoreAll, SaveAndTeardownAll) returned from startSession, then calls
RestoreAll (best-effort) before srv.Run and SaveAndTeardownAll with a
fresh 30s-bounded context after srv.Run returns. Both SIGTERM and POST
/shutdown funnel through srv.Run returning, so the single save call site
covers both paths.
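
A sketch of that boot/shutdown sequencing. The method signatures on `sessionLifecycle`, the `runDaemon` wrapper, and the fake are hypothetical; the demo cancels the parent context to show why the save step needs its own context, which is an assumption about how a signal-driven shutdown would look.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// sessionLifecycle is the minimal surface the daemon needs (signatures assumed).
type sessionLifecycle interface {
	RestoreAll(ctx context.Context) error
	SaveAndTeardownAll(ctx context.Context) error
}

// runDaemon restores before serving and saves after serve returns. The save
// uses a fresh, bounded context because the serving context may already be
// cancelled by the time shutdown starts.
func runDaemon(ctx context.Context, sl sessionLifecycle, serve func(context.Context) error) error {
	if err := sl.RestoreAll(ctx); err != nil {
		fmt.Println("restore failed, continuing:", err) // best-effort
	}
	runErr := serve(ctx)
	saveCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := sl.SaveAndTeardownAll(saveCtx); err != nil {
		fmt.Println("save failed:", err)
	}
	return runErr
}

// fakeLifecycle records call order and whether the save context was usable.
type fakeLifecycle struct {
	order       []string
	saveCtxLive bool
}

func (f *fakeLifecycle) RestoreAll(context.Context) error {
	f.order = append(f.order, "restore")
	return nil
}

func (f *fakeLifecycle) SaveAndTeardownAll(ctx context.Context) error {
	f.order = append(f.order, "save")
	f.saveCtxLive = ctx.Err() == nil
	return nil
}

// demo simulates a shutdown that cancels the serving context.
func demo() (string, bool) {
	f := &fakeLifecycle{}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	serve := func(context.Context) error {
		f.order = append(f.order, "serve")
		cancel() // stand-in for a signal cancelling the daemon context
		return nil
	}
	_ = runDaemon(ctx, f, serve)
	return strings.Join(f.order, ","), f.saveCtxLive
}

func main() {
	fmt.Println(demo())
}
```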

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(daemon): fix seam-test tautology and lifecycle variable shadow

Finding 1: dispatch both sessionLifecycle methods through an interface
variable (var sl sessionLifecycle = fake) so the runtime body exercises
interface dispatch, not just direct struct method calls.

Finding 2: rename local variable 'lifecycle' to 'lc' in
TestWiring_StartSessionBuildsSessionService to remove the shadow of the
imported lifecycle package.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(frontend): call POST /shutdown before killing daemon on quit

In before-quit, POST /shutdown (8s AbortSignal.timeout) so the daemon
saves sessions gracefully before the SIGTERM kill. Adds a re-entrancy
guard (quitting flag) so a concurrent app.quit() cannot double-preventDefault.
Falls back to killDaemon on fetch failure or timeout: quit is never blocked.
Keeps the process.on('exit') SIGTERM fallback intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(storage): guard session_worktrees.state against empty-string CHECK violation; add ponytail comments

The save path (saveAndTeardownOne) never sets domain.SessionWorktreeRecord.State,
so it arrives at UpsertSessionWorktree as "". The generated upsert includes state
in the INSERT column list, so the DB default ('active') is never applied and the
CHECK constraint (state IN ('active', ...)) would fire at the first real shutdown.

Fix: default to 'active' in the store adapter when row.State is "". No schema
change, no migration, no gen edit.

Also add ponytail: comments on the State field (domain type), the write path, and
the read path, documenting that state is unused multi-repo scaffolding and that the
upgrade path is to wire a real value when multi-repo worktree lifecycle states ship.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(storage): add real-SQLite test for empty-State guard in UpsertSessionWorktree

Adds TestUpsertSessionWorktreeEmptyStateDefaultsToActive to the store
test file. It inserts a SessionWorktreeRecord with State at zero value
"" via UpsertSessionWorktree against a real SQLite DB, then reads the
row back and asserts State == "active". This directly exercises the
guard added in the prior commit and would fail if the guard were
removed (the CHECK constraint rejects ""). Mirrors the helpers and
setup pattern of TestSessionWorktreesRoundTrip exactly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(comments): correct shutdown-mechanism and task-ref inaccuracies

Fix 1: daemon.go comment near SaveAndTeardownAll now correctly states
that POST /shutdown closes the shutdownRequested channel (not cancel ctx).
Also tighten the RestoreAll comment to remove the inaccurate claim.

Fix 2: remove "Task 2's" phrasing from ForceDestroy ponytail comment in
workspace.go; condition still references StashUncommitted by name.

Fix 3: add note in main.ts that the 8s fetch timeout is shorter than the
daemon's 30s save bound, so a SIGTERM after fetch abort does not cut the
in-flight save short.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore: remove .superpowers workflow scratch from repo

These SDD workflow artifacts (task briefs, agent reports, progress ledger,
review packages) were committed by accident in prior work, against the
.superpowers/sdd/.gitignore intent. Remove them from the repo; they remain
local-only scratch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(spec): graceful restore + post-failure orchestrator recreate

Fix the opaque 500 when restoring an un-resumable session (typed 409
SESSION_NOT_RESUMABLE), and add a post-failure popup that offers to recreate a
fresh orchestrator on the same branch (cleaning the worktree, preserving
committed history). Orchestrators only; recreate fires only after a restore
attempt confirms the session cannot be resumed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(plan): restore-recreate orchestrator; reuse existing /orchestrators clean=true

Planning discovery: the recreate capability already ships via POST /orchestrators
(clean=true), which kills the dead orchestrator and re-spawns on the canonical
branch (addWorktree reattaches an existing branch). So the feature collapses to a
typed-error fix plus a frontend popup. Spec updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(session): return typed SESSION_NOT_RESUMABLE instead of 500 on un-resumable restore

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(renderer): offer recreate-orchestrator popup when a session cannot be restored

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(spec): drop stale OpenAPI-regen note (feature adds no route)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(ci): gofmt/goimports, golangci-lint hygiene, and Windows-aware doctor tests

Formatting: ran gofmt and goimports (with local-prefixes) on the 8 listed
files plus ptyexec/spawn_unix.go which the linter also flagged.

Lint (25 issues fixed):
- gosec G115: EncodeMessage now returns ([]byte, error) with an explicit
  bounds check before the int->uint32 conversion; all callers updated.
- govet nilness: removed dead `if lastErr == nil` branch in clientIsAlive;
  lastErr is provably non-nil at that point (real bug).
- nilerr: extracted runAcceptLoop helper so Accept-error-on-close is not
  flagged; listener close is normal shutdown, not a caller error.
- staticcheck SA4010: removed dead `full = append(...)` loop in host_test.
- revive var-declaration: `var prev int = -1` -> `prev := -1`.
- revive redefines-builtin-id: deleted local `min` helper; builtin covers it.
- unparam (2): dropped always-nil env return from attachCommand; dropped
  unused shellPath param from buildLaunchCommand; updated callers.
- errcheck (8): deferred Close/Remove calls wrapped in func(){_ = ...}();
  type assertion in host_main.go uses ok-form; fmt.Fprintf to stdout uses
  _, _ = pattern; workspace.go tmpIdx.Close() uses _ =.
- gocritic nestingReduce: inverted if+continue in runtime.go resolve loop.

Windows E2E: skip TestDoctorChecksTmuxVersion,
TestDoctorChecksTmuxVersionFailsOnError, TestDoctorWarnsWhenTmuxMissing on
windows (ao doctor emits a conpty check there, not tmux).

Verified: gofmt -l . clean, golangci-lint 0 issues, go build ok,
go test -race 1624/1624 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(ci): set git identity in worktree clone fixture; loosen tmux reattach timeouts

The preserve round-trip/conflict tests commit inside a worktree of the cloned
repo, which had no git identity; CI runners cannot auto-derive one, failing with
"empty ident name". Set user.email/user.name on the clone in setupOriginClone so
its worktrees inherit it.

The tmux reattach test drives a real shell and parses stty output, which is slow
under -race on CI; raise its echo-write and SIZE-output waits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(terminal): resend size probe on tmux reattach until the shell answers

Bumping timeouts was the wrong fix: a 30s wait still failed, so the failure was
deterministic (the probe output never appeared), not slowness. onOpen signals the stream accepts
input, not that the reattached sh -i is at a prompt, so the first echo keystroke
can be dropped. Resend the probe each poll until SIZE output lands, and on timeout
dump the captured pane buffer so a remaining failure is self-explaining.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(terminal): set TERM for real-tmux attach tests so they run in CI

Root cause (from the buffer dump the prior commit added): with TERM unset on CI
runners, tmux refuses to attach a client and prints "open terminal failed:
terminal does not support clear", so the pane never runs the size probe. The
daemon defaults TERM in production; the tests bypass it. Set TERM=xterm-256color
in both real-tmux tests. Reproduced locally with `env -u TERM` (fails the same
way) and verified the fix passes under it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(spec): crash-proof session reconcile design

Boot-time reconcile makes live tmux + worktree state match the DB on every
daemon start, so a SIGKILL/crash/force-quit that skips SaveAndTeardownAll no
longer leaks an orphaned daemon, tmux sessions, or worktrees. Adopt
crash-surviving tmux sessions, preserve-and-terminate dead ones, reap
in-namespace orphans, and add a frontend kill+replace branch for a wedged
orphan daemon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(spec): simplify reconcile to per-session IsAlive, drop ListSessions

Every leak in the incident maps to a DB row, so orphan-reap is a per-session
IsAlive+Destroy over terminated rows; no runtime enumeration, no ports/conpty/
runtimeselect changes. Reaping a tmux session with no DB row is deferred (YAGNI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(plan): crash-proof session reconcile implementation plan

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(session): reconcile live pass (adopt alive, stash+terminate dead)

* feat(session): reconcile reap pass and Reconcile entry point

* feat(daemon): run Reconcile on boot in place of bare RestoreAll

* test(integration): reconcile terminates dead-live sessions and reaps leaked tmux

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(integration): correct misleading CreateSession comment in reconcile test

* feat(frontend): kill+replace a wedged orphan daemon on launch

When both inspectExistingDaemon and resolveDaemonFromPort return null but
a process still holds the daemon port (a crashed/orphaned daemon), spawning
a new Go child would collide on the port and exit 1. Detect this case, SIGTERM
the holder (via the run-file PID, falling back to the probe PID), poll until the
port is free (up to 8s), clear the stale run-file, then proceed to spawn fresh.
The healthy-daemon reuse path is unchanged.

Pure helper: src/shared/daemon-takeover.ts (planDaemonTakeover)
Unit tests:  src/shared/daemon-takeover.test.ts (3 tests, TDD red-green)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(frontend): fire orphan-daemon takeover when a holder actually exists

Replace planDaemonTakeover (inverted logic: ran kill block only when probe
was null) with shouldReplacePortHolder(probe, holderPidAlive) which returns
true when a real holder exists: non-null probe (rejected responder) OR a
run-file PID that is still alive (hung holder). Update main.ts call site to
compute PID liveness before gating the kill block. Update tests to cover all
three distinct outcomes non-vacuously.
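
The corrected gate is small enough to state directly. The helper in this PR lives in the frontend; the Go rendering below is only a transliteration of the described logic, and the `probeResult` type is hypothetical.

```go
package main

import "fmt"

// probeResult stands in for whatever the port probe returns when something
// answers on the daemon port but is rejected as a usable daemon.
type probeResult struct{ PID int }

// shouldReplacePortHolder reports whether a real holder exists that must be
// killed before spawning a fresh daemon: either a rejected responder
// (non-nil probe) or a hung process whose run-file PID is still alive.
func shouldReplacePortHolder(probe *probeResult, holderPidAlive bool) bool {
	return probe != nil || holderPidAlive
}

func main() {
	fmt.Println(shouldReplacePortHolder(&probeResult{PID: 4242}, false)) // rejected responder
	fmt.Println(shouldReplacePortHolder(nil, true))                      // hung holder
	fmt.Println(shouldReplacePortHolder(nil, false))                     // nothing to replace
}
```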

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs+test: accurate takeover comments, reconcileLive probe-error test, Reconcile doc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(session): restore promptless orchestrators and crash-orphaned sessions

The orchestrator was abandoned on every app open: a fresh orchestrator
spawned each launch and the prior conversation appeared lost (it was not;
the transcript stays in ~/.claude, resumable by the deterministic
--session-id AO pins). Two defects combined:

1. Restore's guard rejected any session with no agentSessionId AND no
   prompt as ErrNotResumable. But Claude resumes via a deterministic
   session id regardless of those fields, so promptless orchestrators
   were perfectly resumable yet always rejected. Workers slipped through
   only because they carry a prompt. Move the resumability decision to the
   adapter: restoreArgv returns ErrNotResumable only when GetRestoreCommand
   reports it cannot resume AND there is no prompt to fresh-launch from.

2. reconcileLive marked a crash-orphaned (dead-runtime) session terminated
   without a restore marker, so RestoreAll skipped it and it stayed dead.
   It now saves-and-tears-down to the same end state a graceful shutdown
   produces (capture work, write the session_worktrees marker, terminate,
   remove the worktree), so RestoreAll relaunches it on the same boot,
   resuming history. Crash recovery now matches graceful restart. If work
   capture fails it terminates without a marker rather than risk losing
   un-preserved work.
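
The resumability decision in item 1 can be sketched as follows. The `restoreCommander` interface, the shape of `GetRestoreCommand`, and the placeholder `agent` argv with its flags are hypothetical; only the rule itself (not resumable only when the adapter cannot resume and there is no prompt) is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotResumable mirrors the sentinel named in the commit message.
var ErrNotResumable = errors.New("session not resumable")

// restoreCommander is a hypothetical slice of the agent adapter.
type restoreCommander interface {
	GetRestoreCommand(sessionID string) (argv []string, ok bool)
}

// restoreArgv prefers the adapter's resume command, falls back to a fresh
// launch from the prompt, and fails only when neither is possible.
func restoreArgv(a restoreCommander, sessionID, prompt string) ([]string, error) {
	if argv, ok := a.GetRestoreCommand(sessionID); ok {
		return argv, nil
	}
	if prompt != "" {
		return []string{"agent", "--prompt", prompt}, nil // placeholder argv
	}
	return nil, ErrNotResumable
}

// resumingAgent can always resume by session id.
type resumingAgent struct{}

func (resumingAgent) GetRestoreCommand(id string) ([]string, bool) {
	return []string{"agent", "--resume", id}, true // placeholder argv
}

// nonResumingAgent never can.
type nonResumingAgent struct{}

func (nonResumingAgent) GetRestoreCommand(string) ([]string, bool) { return nil, false }

func main() {
	fmt.Println(restoreArgv(resumingAgent{}, "s1", ""))           // promptless orchestrator resumes
	fmt.Println(restoreArgv(nonResumingAgent{}, "s1", "fix bug")) // fresh launch from prompt
	fmt.Println(restoreArgv(nonResumingAgent{}, "s1", ""))        // not resumable
}
```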

Tests: promptless orchestrator restores via adapter resume; promptless
session with a non-resuming adapter still returns ErrNotResumable;
reconcileLive writes the marker + tears down the worktree. Full backend
suite green (1632), gofmt/vet clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 9ae05735d6f06ac989857534bae2766392772c71)
…tWrapper#409)

PR AgentWrapper#404 migrated the runtime adapter from Zellij to tmux (Darwin/Linux)
plus conpty (Windows), selected via runtimeselect, but ~30 stale zellij
references lingered in comments and docs describing zellij as the current
runtime. This is a comments/docs-only cleanup with no behavioral change:
comments now say tmux (or tmux/conpty when both platforms are relevant),
terminal/doc.go and docs/backend-code-structure.md are rewritten to
reflect the tmux + conpty + runtimeselect attach model, and the daemon
environment, STATUS, stack, architecture, and CLI docs are updated.

Also gitignore the local .codegraph/ and .cursor/ tooling dirs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Vaibhaav-Tiwari Vaibhaav-Tiwari self-requested a review June 25, 2026 09:00

@Vaibhaav-Tiwari Vaibhaav-Tiwari left a comment

Collaborator


lgtm


2 participants