orch_resume(force=true) cannot reattach to a stopped batch after orch_abort() #539

## Summary

After `orch_abort()` gracefully terminates a batch and leaves it in the `stopped` state with `failedTasks > 0`, a subsequent `orch_resume(force=true)` cannot reattach. The orch reports "Resume initiated. Phase: launching", then immediately reports "No batch is running", and the operator-visible batch summary is empty (no batch ID, no tracked duration). The only recovery path is to fall back to `orch_start` against the same PROMPT.md.

The current behaviour silently drops the operator's intent. There's no error, no clear documentation of the abort/resume incompatibility, and no fallback handling.

## Reproduction

1. Run any batch.
2. While it is running, call `orch_abort()` and wait for the abort to complete.
3. Confirm that `orch_status()` reports the batch as `stopped`.
4. Make any external change to the worktree (e.g., the supervisor manually addresses a code-review revision).
5. Call `orch_resume(force=true)`.

Expected: the batch resumes from the worktree's preserved state and picks up the next pending step.

Actual: a transient "Resume initiated" message appears, then `orch_status()` reports no active batch and the supervisor receives an empty batch summary.

## Concrete evidence

Production sequence from batch `20260506T105850`:

```
1. orch_abort() →
   "Graceful abort complete for batch 20260506T105850: 0 exited gracefully, 2 force-killed (60s)"
   "In-memory batch state set to 'stopped'"
   "Worktrees and branches are preserved for inspection."

2. (supervisor manually addresses R006-code-step2 in the worktree, commits, etc.)

3. orch_resume(force=true) →
   "🔄 Resume initiated for batch. Phase: launching."

4. (immediately, no time gap)
   "📊 Batch Summary —
   - Result: 0/0 tasks succeeded
   - Duration: in progress
   - Cost: not tracked
   Batch '' ended (idle)."

5. orch_status() →
   "No batch is running."
```

The worktree at `.worktrees/henrylach-20260506T105850/lane-1` was fully intact (8 commits, all source/test files, STATUS.md). The orch had everything it needed to resume; it just couldn't find an in-memory batch entry to attach to.

## Root cause hypothesis

`orch_abort()` is destructive to in-memory state by design — it kills processes and clears registry entries. `orch_resume(force=true)` looks up batch state via the in-memory registry, doesn't find it, and falls through to a no-op. The disk artifacts (worktree commits, STATUS.md, .reviews/, batch-state.json snapshots) are present but unused by the resume path.
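The hypothesis can be modelled in a few lines. This is a minimal TypeScript sketch, not taskplane's code: `liveBatches`, `abortBatch`, and `forceResume` are hypothetical names, and the assumption that abort removes the registry entry is exactly the unverified part of the hypothesis.

```typescript
// Hypothetical model of the suspected failure mode. None of these names
// are taken from the taskplane source.
interface LiveBatch {
  batchId: string;
  phase: "running" | "stopped";
}

// Stand-in for the in-memory registry the resume path is assumed to consult.
const liveBatches = new Map<string, LiveBatch>();

function abortBatch(batchId: string): void {
  // Assumption: abort drops the registry entry. Worktrees and state files
  // on disk are untouched, but nothing below ever looks at them.
  liveBatches.delete(batchId);
}

function forceResume(batchId: string): string {
  const entry = liveBatches.get(batchId);
  if (entry === undefined) {
    // Suspected current behaviour: a silent no-op instead of an error
    // or a disk lookup.
    return "No batch is running.";
  }
  return `Resumed batch ${entry.batchId}`;
}

liveBatches.set("20260506T105850", { batchId: "20260506T105850", phase: "running" });
abortBatch("20260506T105850");
const resumeMessage = forceResume("20260506T105850");
```

If the real resume path has this shape, the fix is to replace the `undefined` branch with either a disk lookup or an explicit error.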

## Why this matters

The operator's natural mental model after `orch_abort` is: "I aborted, fixed the issue manually, now I want to resume from where I was." The current behaviour breaks that workflow without any error message indicating what's wrong. Operators discover the limitation only by trial and error, and the recovery path (`orch_start` against the same PROMPT.md) requires the supervisor to fast-forward feature branches manually to bring worker commits into the working tree before the new batch starts.

In the production failure that motivated this issue, recovery required:
1. `git merge --ff-only task/henrylach-lane-1-20260506T105850` from the dead worktree
2. `git push` to store the work durably
3. `git worktree remove --force` (which itself failed, see issue #543)
4. Windows-specific cleanup with `rd /s /q` from `cmd`
5. Update `STATUS.md` with a "supervisor recovery note" preamble so the new batch's worker doesn't re-litigate completed steps
6. Commit the STATUS update + push
7. `orch_start(target=PROMPT.md)`

Net: ~15 minutes of supervisor time + nontrivial git surgery — all to do what `orch_resume(force=true)` *should* have done.

## Fix proposals

### A. Make `orch_resume(force=true)` reconstruct from disk

When invoked with `force=true` and no matching in-memory batch state, the resume path should:
1. Scan `.pi/runtime/<batch-id>/` for the most recent failed batch
2. Read the persisted `batch-state.json` (or `.pi/batch-history.json`)
3. Inspect the worktree(s) — read STATUS.md, `.reviews/`, recent commits
4. Reconstruct enough state to relaunch the worker with the existing worktree

Only fall back to "no batch found" if the disk state is also gone.
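The selection logic in the steps above could look roughly like the following sketch. It assumes the snapshots have already been read from disk. The `PersistedBatch` shape, the `planForceResume` name, and the wording of the error message are hypothetical, since the real `batch-state.json` schema is not shown here.

```typescript
// Hypothetical snapshot shape; the real batch-state.json schema is not
// shown in this issue.
interface PersistedBatch {
  batchId: string; // assumed to be a sortable timestamp, e.g. "20260506T105850"
  phase: string; // e.g. "stopped"
  worktreePaths: string[]; // worktrees that still exist on disk
}

type ResumePlan =
  | { kind: "resume"; batchId: string; worktreePaths: string[] }
  | { kind: "not-found"; message: string };

// Decide what a forced resume should do once the in-memory lookup has missed.
// `snapshots` is whatever the caller managed to read from disk.
function planForceResume(snapshots: PersistedBatch[]): ResumePlan {
  const candidates = snapshots
    .filter((b) => b.phase === "stopped" && b.worktreePaths.length > 0)
    // Newest first; plain string comparison is chronological for
    // fixed-width timestamp IDs.
    .sort((a, b) => (a.batchId < b.batchId ? 1 : a.batchId > b.batchId ? -1 : 0));

  const latest = candidates[0];
  if (latest === undefined) {
    // Disk state is gone too: fail loudly with a recommended next step.
    return {
      kind: "not-found",
      message: "No resumable batch found on disk. Run orch_start <PROMPT.md> to relaunch.",
    };
  }
  return { kind: "resume", batchId: latest.batchId, worktreePaths: latest.worktreePaths };
}
```

Keeping the decision separate from the file I/O makes it easy to unit-test without a real worktree.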

### B. Document the abort/resume incompatibility prominently

In the supervisor primer + `orch_abort` tool description, add:

> **After `orch_abort()`:** the in-memory batch state is destroyed. To pick up where you left off, use `orch_start <PROMPT.md>` instead of `orch_resume(force=true)` — the new batch will read the worktree's STATUS.md and skip already-completed steps.
>
> Use `orch_pause()` instead of `orch_abort()` when you intend to resume.

### C. Distinguish "soft abort" from "hard abort"

If the `supervisor_takeover()` tool from issue #538 is added, expose a clear hierarchy:

- `orch_pause()` — pause but preserve all in-memory state. Resumable.
- `supervisor_takeover()` — pause + drain alerts. Resumable.
- `orch_abort()` — terminate processes, preserve worktree, reset in-memory state. **Not** resumable; use `orch_start` to continue.
- `orch_abort(hard=true)` — terminate everything, no preservation.
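One way to keep this hierarchy unambiguous is to encode it as data that both the tool descriptions and the resume path read from. The sketch below is illustrative only: `supervisor_takeover()` and `orch_abort(hard=true)` are proposals, the field names are hypothetical, and the assumption that `supervisor_takeover()` keeps in-memory state follows from it being described as a pause.

```typescript
// Proposed semantics only: supervisor_takeover() and orch_abort(hard=true)
// do not exist yet, and the field names are hypothetical.
type StopMode = "orch_pause" | "supervisor_takeover" | "orch_abort" | "orch_abort_hard";

interface StopSemantics {
  keepsInMemoryState: boolean;
  keepsWorktree: boolean;
  // Which tool continues the work; null means nothing is left to continue.
  continueWith: "orch_resume" | "orch_start" | null;
}

const STOP_SEMANTICS: Record<StopMode, StopSemantics> = {
  orch_pause: { keepsInMemoryState: true, keepsWorktree: true, continueWith: "orch_resume" },
  supervisor_takeover: { keepsInMemoryState: true, keepsWorktree: true, continueWith: "orch_resume" },
  orch_abort: { keepsInMemoryState: false, keepsWorktree: true, continueWith: "orch_start" },
  orch_abort_hard: { keepsInMemoryState: false, keepsWorktree: false, continueWith: null },
};

// A mode is resumable only if orch_resume is the documented way to continue.
function isResumable(mode: StopMode): boolean {
  return STOP_SEMANTICS[mode].continueWith === "orch_resume";
}
```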

### Recommendation

A is the most user-friendly. C clarifies the model long-term. B is the cheap fallback if A is too much work. I'd ship A + B together.

## Acceptance criteria

- [ ] `orch_resume(force=true)` after `orch_abort()` either succeeds (reconstructs state from disk) or fails loudly with an error message and a recommended next step (e.g., "Run `orch_start <PROMPT.md>` to relaunch from the preserved worktree state").
- [ ] The case is covered by an integration test (`abort then force-resume`) that asserts either successful reconstruction or the loud error.
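The core assertion of such a test might be sketched as follows. `ResumeOutcome` and `checkForceResumeOutcome` are hypothetical names, and the sketch assumes the test harness can classify what `orch_resume(force=true)` did into one of three outcomes.

```typescript
// Hypothetical outcome type: how a test harness might classify the result
// of orch_resume(force=true) after an abort.
type ResumeOutcome =
  | { status: "resumed"; batchId: string }
  | { status: "error"; message: string }
  | { status: "noop" }; // the behaviour reported in this issue

// Returns a list of problems; an empty list means the criteria are met.
function checkForceResumeOutcome(outcome: ResumeOutcome, abortedBatchId: string): string[] {
  const problems: string[] = [];
  switch (outcome.status) {
    case "resumed":
      if (outcome.batchId !== abortedBatchId) {
        problems.push(`resumed ${outcome.batchId}, expected ${abortedBatchId}`);
      }
      break;
    case "error":
      // A loud failure must also recommend a next step.
      if (!outcome.message.includes("orch_start")) {
        problems.push("error message does not recommend a next step");
      }
      break;
    case "noop":
      problems.push("resume silently did nothing");
      break;
  }
  return problems;
}
```

Either accepted outcome yields an empty list, so the same check can stay in place whether proposal A or only the loud error ships.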

## Related

- Issue #538 (zombie alerts) often pushes operators into `orch_abort` as a frustration response, which then trips this issue.
- Issue #537 is the upstream cause of the death-spiral that motivated the operator to abort in the first place.

Affected version: `taskplane@0.28.4`. Full operator console log of the failed resume attempt available on request.

