orch_resume(force=true) cannot reattach to a stopped batch after orch_abort() #539

@HenryLach

Summary

After orch_abort() gracefully terminates a batch and leaves it in the stopped state with failedTasks > 0, a subsequent orch_resume(force=true) cannot reattach. The orch reports "Resume initiated. Phase: launching", then immediately reports "No batch is running", and the operator-visible batch summary is empty (no batch ID, no duration tracked). This forces a fallback to orch_start against the same PROMPT.md as the only recovery path.

The current behaviour silently drops the operator's intent. There's no error, no clear documentation of the abort/resume incompatibility, and no fallback handling.

Reproduction

  1. Run any batch.
  2. While it is running, call orch_abort() and wait for it to complete.
  3. orch_status() reports the batch as stopped.
  4. Make any external change to the worktree (e.g., supervisor manually addresses a code-review revision).
  5. orch_resume(force=true).

Expected: the batch resumes from the worktree's preserved state and picks up the next pending step.

Actual: a transient "Resume initiated" message appears, then orch_status() reports no active batch and the supervisor receives an empty batch summary.

Concrete evidence

Production sequence from batch 20260506T105850:

1. orch_abort() →
   "Graceful abort complete for batch 20260506T105850: 0 exited gracefully, 2 force-killed (60s)"
   "In-memory batch state set to 'stopped'"
   "Worktrees and branches are preserved for inspection."

2. (supervisor manually addresses R006-code-step2 in the worktree, commits, etc.)

3. orch_resume(force=true) →
   "🔄 Resume initiated for batch. Phase: launching."

4. (immediately, no time gap)
   "📊 Batch Summary —
   - Result: 0/0 tasks succeeded
   - Duration: in progress
   - Cost: not tracked
   Batch '' ended (idle)."

5. orch_status() →
   "No batch is running."

The worktree at .worktrees/henrylach-20260506T105850/lane-1 was fully intact (8 commits, all source/test files, STATUS.md). The orch had everything it needed to resume; it just couldn't find an in-memory batch entry to attach to.

Root cause hypothesis

orch_abort() is destructive to in-memory state by design — it kills processes and clears registry entries. orch_resume(force=true) looks up batch state via the in-memory registry, doesn't find it, and falls through to a no-op. The disk artifacts (worktree commits, STATUS.md, .reviews/, batch-state.json snapshots) are present but unused by the resume path.
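The hypothesis can be illustrated with a toy model. This is only a sketch of the suspected control flow: the class, its fields, and its method names are hypothetical and are not taken from the taskplane source.

```typescript
// Toy model of the hypothesized failure. Every name here is hypothetical;
// none of it comes from the taskplane code base.
interface BatchEntry {
  batchId: string;
  phase: "launching" | "running" | "stopped";
}

class InMemoryRegistry {
  private entries: Record<string, BatchEntry | undefined> = {};
  private activeBatchId: string | null = null;

  start(batchId: string): void {
    this.entries[batchId] = { batchId, phase: "running" };
    this.activeBatchId = batchId;
  }

  // Hypothesis: abort kills the workers and then clears the registry entry.
  abort(): void {
    if (this.activeBatchId !== null) {
      delete this.entries[this.activeBatchId];
      this.activeBatchId = null;
    }
  }

  // Hypothesis: resume consults only the registry and never looks at disk,
  // so the force flag has nothing to act on once the entry is gone.
  resume(_force: boolean): string {
    const entry =
      this.activeBatchId === null ? undefined : this.entries[this.activeBatchId];
    if (entry === undefined) {
      return "No batch is running."; // silent no-op, disk artifacts ignored
    }
    entry.phase = "launching";
    return "Resume initiated for batch " + entry.batchId + ". Phase: launching.";
  }
}

const toyRegistry = new InMemoryRegistry();
toyRegistry.start("20260506T105850");
toyRegistry.abort();
const toyOutcome = toyRegistry.resume(true);
```

In this model the worktree and any batch-state.json snapshot play no role at all, which matches the observed behaviour if the hypothesis is right.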

Why this matters

The operator's natural mental model after orch_abort is: "I aborted, fixed the issue manually, now I want to resume from where I was." The current behaviour breaks that workflow without any error message indicating what's wrong. Operators discover the limitation only by trial and error, and the recovery path (orch_start against the same PROMPT.md) requires the supervisor to fast-forward feature branches manually to bring worker commits into the working tree before the new batch starts.

In the production failure that motivated this issue, recovery required:

  1. git merge --ff-only task/henrylach-lane-1-20260506T105850 from the dead worktree
  2. git push to durable-store the work
  3. git worktree remove --force (which itself failed; see issue #543, "Windows MAX_PATH error blocks 'git worktree remove' — orch should fall back to 'cmd rd /s /q'")
  4. cmd rd /s /q Windows-specific cleanup
  5. Update STATUS.md with a "supervisor recovery note" preamble so the new batch's worker doesn't re-litigate completed steps
  6. Commit the STATUS update + push
  7. orch_start(target=PROMPT.md)

Net: ~15 minutes of supervisor time + nontrivial git surgery — all to do what orch_resume(force=true) should have done.

Fix proposals

A. Make orch_resume(force=true) reconstruct from disk

When invoked with force=true and no matching in-memory batch state, the resume path should:

  1. Scan .pi/runtime/<batch-id>/ for the most recent failed batch
  2. Read the persisted batch-state.json (or .pi/batch-history.json)
  3. Inspect the worktree(s) — read STATUS.md, .reviews/, recent commits
  4. Reconstruct enough state to relaunch the worker with the existing worktree

Only fall back to "no batch found" if the disk state is also gone.
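The steps above could be sketched roughly as follows. This is a minimal illustration, not a proposed patch: the PersistedBatch shape, the BatchStore interface, and the function name are all assumptions, and disk access is hidden behind an interface so the selection logic can be exercised without a filesystem.

```typescript
// Sketch of proposal A. All type, field, and function names are hypothetical.
interface PersistedBatch {
  batchId: string; // timestamp-style, e.g. "20260506T105850"
  phase: "completed" | "stopped" | "failed";
  worktreePath: string;
  // Assumed to be parsed from the worktree's STATUS.md.
  steps: { id: string; done: boolean }[];
}

// Stands in for scanning .pi/runtime/<batch-id>/ and reading batch-state.json.
interface BatchStore {
  listPersisted(): PersistedBatch[];
}

type ResumePlan =
  | { kind: "resume"; batchId: string; worktreePath: string; nextStep: string }
  | { kind: "not-found"; reason: string };

// Called only when the in-memory lookup has already failed.
function reconstructFromDisk(store: BatchStore): ResumePlan {
  const candidates = store
    .listPersisted()
    .filter((b) => b.phase !== "completed")
    // Timestamp-style IDs sort chronologically as strings; newest first.
    .sort((a, b) => (a.batchId < b.batchId ? 1 : a.batchId > b.batchId ? -1 : 0));

  const latest = candidates[0];
  if (latest === undefined) {
    return { kind: "not-found", reason: "no resumable batch state on disk" };
  }

  const pending = latest.steps.filter((s) => !s.done);
  const next = pending[0];
  if (next === undefined) {
    return {
      kind: "not-found",
      reason: "batch " + latest.batchId + " has no pending steps",
    };
  }

  return {
    kind: "resume",
    batchId: latest.batchId,
    worktreePath: latest.worktreePath,
    nextStep: next.id,
  };
}
```

Returning a plan rather than launching directly keeps the "what would be resumed" decision inspectable, so the caller can show it to the operator before relaunching the worker.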

B. Document the abort/resume incompatibility prominently

In the supervisor primer + orch_abort tool description, add:

After orch_abort(): the in-memory batch state is destroyed. To pick up where you left off, use orch_start <PROMPT.md> instead of orch_resume(force=true) — the new batch will read the worktree's STATUS.md and skip already-completed steps.

Use orch_pause() instead of orch_abort() when you intend to resume.

C. Distinguish "soft abort" from "hard abort"

If the supervisor_takeover() tool from issue #538 is added, expose a clear hierarchy:

  • orch_pause() — pause but preserve all in-memory state. Resumable.
  • supervisor_takeover() — pause + drain alerts. Resumable.
  • orch_abort() — terminate processes, preserve worktree, reset in-memory state. Not resumable; use orch_start to continue.
  • orch_abort(hard=true) — terminate everything, no preservation.
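The hierarchy above could be captured as a small capability table. This is a sketch under assumptions: supervisor_takeover() and orch_abort(hard=true) are only proposals, the field names are hypothetical, and entries the list does not state explicitly are marked as assumed in the comments.

```typescript
// Hypothetical capability table for proposal C; not an existing taskplane API.
interface StopSemantics {
  terminatesProcesses: boolean;
  preservesWorktree: boolean;
  preservesInMemoryState: boolean;
  resumable: boolean;
}

type StopTool =
  | "orch_pause"
  | "supervisor_takeover"
  | "orch_abort"
  | "orch_abort(hard=true)";

const STOP_SEMANTICS: Record<StopTool, StopSemantics> = {
  // Assumed: pausing leaves processes and the worktree in place.
  orch_pause: {
    terminatesProcesses: false,
    preservesWorktree: true,
    preservesInMemoryState: true,
    resumable: true,
  },
  // Proposed in #538; assumed to behave like pause apart from draining alerts.
  supervisor_takeover: {
    terminatesProcesses: false,
    preservesWorktree: true,
    preservesInMemoryState: true,
    resumable: true,
  },
  orch_abort: {
    terminatesProcesses: true,
    preservesWorktree: true,
    preservesInMemoryState: false,
    resumable: false,
  },
  "orch_abort(hard=true)": {
    terminatesProcesses: true,
    preservesWorktree: false,
    preservesInMemoryState: false,
    resumable: false,
  },
};

// Which tool the operator should reach for to continue after each stop.
function recoveryTool(tool: StopTool): "orch_resume" | "orch_start" {
  return STOP_SEMANTICS[tool].resumable ? "orch_resume" : "orch_start";
}
```

A table like this could also drive the tool descriptions, so the documentation and the behaviour cannot drift apart.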

Recommendation

A is the most user-friendly. C clarifies the model long-term. B is the cheap fallback if A is too much work. I'd ship A + B together.

Acceptance criteria

  • orch_resume(force=true) after orch_abort() either succeeds (reconstructs state from disk) or fails loudly with an error message and a recommended next step (e.g., "Run orch_start <PROMPT.md> to relaunch from the preserved worktree state").
  • The case is covered by an integration test (abort then force-resume) that asserts either successful reconstruction or the loud error.
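The "succeeds or fails loudly" criterion can be expressed as a check an integration test might run against the result of a force-resume. This is a sketch only: the ForceResumeResult shape and checkAcceptance are hypothetical and do not describe taskplane's current return values.

```typescript
// Hypothetical result shape for orch_resume(force=true).
type ForceResumeResult =
  | { ok: true; batchId: string; phase: "launching" }
  | { ok: false; error: string; nextStep: string };

// Returns a list of violations; an empty list means the result is acceptable.
function checkAcceptance(result: ForceResumeResult): string[] {
  const problems: string[] = [];
  if (result.ok) {
    // A success with an empty batch ID is the silent failure from this issue.
    if (result.batchId.trim() === "") {
      problems.push("success reported without a batch ID");
    }
  } else {
    if (result.error.trim() === "") {
      problems.push("failure reported without an error message");
    }
    if (result.nextStep.trim() === "") {
      problems.push("failure reported without a recommended next step");
    }
  }
  return problems;
}
```

Under this check, the behaviour reported above (a "Resume initiated" message followed by a summary for batch '') would be flagged, while both a real reconstruction and an explicit error that names orch_start would pass.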

Related

Affected version: taskplane@0.28.4. Full operator console log of the failed resume attempt available on request.
