Summary
After orch_abort() gracefully terminates a batch, leaving it in the stopped state with failedTasks > 0, a subsequent orch_resume(force=true) cannot reattach. The orch reports "Resume initiated. Phase: launching", then immediately reports "No batch is running", and the operator-visible batch summary is empty (no batch ID, no duration tracked). This forces a fallback to orch_start against the same PROMPT.md as the only recovery path.
The current behaviour silently drops the operator's intent. There's no error, no clear documentation of the abort/resume incompatibility, and no fallback handling.
Reproduction
1. Run any batch.
2. While it is running, call orch_abort(). Wait for it to complete.
3. Confirm that orch_status() reports the batch as stopped.
4. Make any external change to the worktree (e.g., supervisor manually addresses a code-review revision).
5. Call orch_resume(force=true).
Expected: the batch resumes from the worktree's preserved state, picks up the next pending step.
Actual: ephemeral "Resume initiated" message, then orch_status() reports no active batch and the supervisor receives an empty batch summary.
Concrete evidence
Production sequence from batch 20260506T105850:
1. orch_abort() →
"Graceful abort complete for batch 20260506T105850: 0 exited gracefully, 2 force-killed (60s)"
"In-memory batch state set to 'stopped'"
"Worktrees and branches are preserved for inspection."
2. (supervisor manually addresses R006-code-step2 in the worktree, commits, etc.)
3. orch_resume(force=true) →
"🔄 Resume initiated for batch. Phase: launching."
4. (immediately, no time gap)
"📊 Batch Summary —
- Result: 0/0 tasks succeeded
- Duration: in progress
- Cost: not tracked
Batch '' ended (idle)."
5. orch_status() →
"No batch is running."
The worktree at .worktrees/henrylach-20260506T105850/lane-1 was fully intact (8 commits, all source/test files, STATUS.md). The orch had everything it needed to resume; it just couldn't find an in-memory batch entry to attach to.
Root cause hypothesis
orch_abort() is destructive to in-memory state by design — it kills processes and clears registry entries. orch_resume(force=true) looks up batch state via the in-memory registry, doesn't find it, and falls through to a no-op. The disk artifacts (worktree commits, STATUS.md, .reviews/, batch-state.json snapshots) are present but unused by the resume path.
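The hypothesis can be illustrated with a toy model. This is a minimal sketch, not taskplane's actual code: the `registry` dict and the `start`, `abort`, and `resume` functions are hypothetical stand-ins for the orchestrator's in-memory state and tools. Only the message strings are taken from the console output above.

```python
from typing import Dict, List

# Hypothetical in-memory batch registry, keyed by batch ID.
registry: Dict[str, dict] = {}


def start(batch_id: str) -> None:
    registry[batch_id] = {"phase": "running", "failedTasks": 0}


def abort(batch_id: str) -> str:
    # Abort clears the registry entry. Disk artifacts (worktree, STATUS.md)
    # survive, but nothing in memory points at them anymore.
    registry.pop(batch_id, None)
    return f"Graceful abort complete for batch {batch_id}"


def resume(force: bool = False) -> List[str]:
    # The "initiated" message is emitted before the lookup happens.
    messages = ["Resume initiated for batch. Phase: launching."]
    batch = next(iter(registry.values()), None)
    if batch is None:
        # Silent fall-through: no error and no hint about the next step.
        messages.append("No batch is running.")
        return messages
    batch["phase"] = "launching"
    return messages
```

In this model, calling `start`, then `abort`, then `resume(force=True)` yields the same two-message sequence seen in production, because `resume` never consults anything outside `registry`.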
Why this matters
The operator's natural mental model after orch_abort is: "I aborted, fixed the issue manually, now I want to resume from where I was." The current behaviour breaks that workflow without any error message indicating what's wrong. Operators discover the limitation only by trial and error, and the recovery path (orch_start against the same PROMPT.md) requires the supervisor to fast-forward feature branches manually to bring worker commits into the working tree before the new batch starts.
In the production failure that motivated this issue, recovery required:
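git merge --ff-only task/henrylach-lane-1-20260506T105850 from the dead worktree
git push to durable-store the work
git worktree remove --force (which itself failed with a Windows MAX_PATH error, see issue #543)
cmd rd /s /q as Windows-specific cleanup
Update STATUS.md with a "supervisor recovery note" preamble so the new batch's worker doesn't re-litigate completed steps
Commit the STATUS update + push
orch_start(target=PROMPT.md)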
Net: ~15 minutes of supervisor time plus nontrivial git surgery, all to do what orch_resume(force=true) should have done.
Fix proposals
A. Make orch_resume(force=true) reconstruct from disk
When invoked with force=true and no matching in-memory batch state, the resume path should:
Scan .pi/runtime/<batch-id>/ for the most recent failed batch
Read the persisted batch-state.json (or .pi/batch-history.json)
Inspect the worktree(s) — read STATUS.md, .reviews/, recent commits
Reconstruct enough state to relaunch the worker with the existing worktree
Only fall back to "no batch found" if the disk state is also gone.
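The disk scan in proposal A could look roughly like the sketch below. It rests on assumptions that are not confirmed by the taskplane source: one sub-directory per batch ID under the runtime directory, a batch-state.json file inside each, and timestamp-style batch IDs that sort chronologically. The function name `find_resumable_batch` and the `batchId` key are hypothetical. Filtering for failed batches is left out because the snapshot's field names are not known here.

```python
import json
from pathlib import Path
from typing import Optional


def find_resumable_batch(runtime_dir: Path) -> Optional[dict]:
    """Return the newest readable batch snapshot under runtime_dir, or None.

    Assumed layout (hypothetical): runtime_dir/<batch-id>/batch-state.json
    """
    if not runtime_dir.is_dir():
        return None
    # Newest first: IDs such as 20260506T105850 sort lexicographically.
    for batch_dir in sorted(runtime_dir.iterdir(), reverse=True):
        state_file = batch_dir / "batch-state.json"
        if not state_file.is_file():
            continue
        try:
            state = json.loads(state_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # skip a corrupt snapshot and try the next-newest batch
        if not isinstance(state, dict):
            continue
        # Fall back to the directory name if the snapshot carries no ID.
        state.setdefault("batchId", batch_dir.name)
        return state
    return None
```

A resume path built on this would call `find_resumable_batch(Path(".pi/runtime"))` only after the in-memory lookup misses, and report "no batch found" only when it returns `None`.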
B. Document the abort/resume incompatibility prominently
In the supervisor primer + orch_abort tool description, add:
After orch_abort(): the in-memory batch state is destroyed. To pick up where you left off, use orch_start <PROMPT.md> instead of orch_resume(force=true) — the new batch will read the worktree's STATUS.md and skip already-completed steps.
Use orch_pause() instead of orch_abort() when you intend to resume.
C. Distinguish "soft abort" from "hard abort"
If the supervisor_takeover() tool from issue #538 is added, expose a clear hierarchy:
orch_pause(): pause but preserve all in-memory state. Resumable.
supervisor_takeover(): pause + drain alerts. Resumable.
orch_abort(): terminate processes, preserve worktree, reset in-memory state. Not resumable; use orch_start to continue.
orch_abort(hard=true): terminate everything, no preservation.
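One way to make the hierarchy explicit is a small lookup table that tools and documentation could share. This is an illustrative sketch only: the `StopMode` class, the `STOP_MODES` table, and the `next_step` strings are hypothetical, and it assumes that the pausing modes also leave the worktree in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopMode:
    keeps_memory_state: bool  # can orch_resume reattach afterwards?
    keeps_worktree: bool      # are worktrees and branches left on disk?

    def next_step(self) -> str:
        """Name the tool an operator should reach for after this stop mode."""
        if self.keeps_memory_state:
            return "orch_resume"
        if self.keeps_worktree:
            return "orch_start"
        return "nothing to continue from"


# Table mirroring the hierarchy in proposal C.
STOP_MODES = {
    "orch_pause()": StopMode(keeps_memory_state=True, keeps_worktree=True),
    "supervisor_takeover()": StopMode(keeps_memory_state=True, keeps_worktree=True),
    "orch_abort()": StopMode(keeps_memory_state=False, keeps_worktree=True),
    "orch_abort(hard=true)": StopMode(keeps_memory_state=False, keeps_worktree=False),
}
```

With a table like this, each tool's completion message could append its own `next_step()` so the operator is told the recovery path at the moment of the abort.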
Recommendation
A is the most user-friendly. C clarifies the model long-term. B is the cheap fallback if A is too much work. I'd ship A + B together.
Acceptance criteria
orch_resume(force=true) after orch_abort() either succeeds (reconstructs state from disk) or fails loudly with an error message and a recommended next step (e.g., "Run orch_start <PROMPT.md> to relaunch from the preserved worktree state").
The case is covered by an integration test (abort then force-resume) that asserts either successful reconstruction or the loud error.
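The two acceptable outcomes can be pinned down in a test along these lines. This is a sketch under stated assumptions, not the real integration test: `ResumeError` and `resume_or_fail` are hypothetical, and batch state is modelled as a plain dict (or `None` when absent) rather than taskplane's actual types.

```python
class ResumeError(RuntimeError):
    """Raised when a forced resume finds nothing to reattach to."""


def resume_or_fail(memory_state, disk_state):
    """Pick the state to resume from, or fail loudly.

    Both arguments are hypothetical: a dict describing a batch, or None.
    """
    if memory_state is not None:
        return memory_state
    if disk_state is not None:
        return disk_state  # state reconstructed from disk
    raise ResumeError(
        "No batch state found in memory or on disk. "
        "Run orch_start <PROMPT.md> to relaunch from the preserved worktree state."
    )


def test_abort_then_force_resume():
    # After an abort the in-memory state is gone (None in this model).
    on_disk = {"batchId": "20260506T105850", "phase": "stopped"}
    # Outcome 1: reconstruction from disk succeeds.
    assert resume_or_fail(None, on_disk) == on_disk
    # Outcome 2: nothing on disk either, so the call must fail loudly
    # and name a recommended next step.
    try:
        resume_or_fail(None, None)
    except ResumeError as err:
        assert "orch_start" in str(err)
    else:
        raise AssertionError("expected a loud ResumeError")
```

The point of the test is that a silent no-op is never an accepted result: every path either returns a usable state or raises with an actionable message.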
Related
orch_abort as a frustration response, which then trips this issue.
Affected version: taskplane@0.28.4. Full operator console log of the failed resume attempt available on request.