Post-failure event queue replays zombie 'wants to exit' alerts to supervisor; no clean drain mechanism

## Summary

After the orch terminates a lane (e.g., via the no-progress safety mechanism or `orch_abort()`), the supervisor continues to receive a cascade of buffered "Worker on lane N wants to exit with no progress" alerts for several minutes. None of the documented operator-side responses (`steer`, `skip`, `let it fail`, `orch_abort`, `orch_skip_task`) reliably drain the alert queue. The supervisor experience during recovery is degraded — every zombie alert looks like a fresh interrupt requiring action.

## Reproduction

1. Run any task at `Review Level: 2` and induce a death-spiral (cf. issue #537: trigger by marking a step complete in STATUS before the code review fires, then make the reviewer return REVISE).
2. Let the orch's no-progress mechanism kill the lane (3 iterations, default).
3. Observe: even *after* the lane is killed, the supervisor receives multiple buffered prompts asking for a steering message.
4. Try to silence the alerts: respond with `"skip"` or `"let it fail"` as plain text. Result: another zombie alert arrives anyway.
5. Try `send_agent_message(type='abort', ...)`. Result: `Agent "orch-..." is not currently running. Use orch_status() or orch_resume() before sending messages.`
6. Try `orch_skip_task("MIG-002")`. Result: `No batch state found.`
7. Try `orch_abort()`. Result: graceful abort completes — but more zombie alerts continue afterward.

The alerts only stop when the engine itself crashes with `Channel closed`.

## Concrete evidence

Production batch `20260506T105850`:

- Lane killed at ~14:21 UTC after 3 no-progress iterations
- Supervisor received 5 distinct "wants to exit" alerts between 14:21 and 14:28 UTC (some matching iteration counts, some replaying earlier iterations)
- 3 of the 5 had `Worker said: ""` (no payload — see issue #540)
- `list_active_agents()` returned "No active agents" the entire time the alerts were arriving

Operator's intervention attempts and outcomes:

| Attempt | Result |
|---|---|
| Plain text reply `"let it fail"` inside a sentence | Ignored (alert came through anyway) |
| `send_agent_message(type='abort', ...)` | `Agent is not currently running` |
| `orch_skip_task("MIG-002")` | `No batch state found` |
| Standalone `"skip"` reply | Another zombie arrived |
| `orch_abort()` | "graceful abort complete" — but more zombies followed |
| Standalone `"let it fail"` reply (much later) | Engine finally exited with `Channel closed` |

Total elapsed time spent fighting the alert queue: ~7 minutes.

## Why this matters

A failed batch is a stressful moment for the operator, who needs to switch from monitoring to diagnosing and recovering. Instead, the supervisor spends that time fighting an alert system that won't acknowledge the lane is dead, and the operator-perceived state diverges from the actual engine state.

This also makes the supervisor more likely to take destructive recovery actions out of frustration (e.g., `orch_abort`), which then triggers issue #539 (resume cannot reattach after abort).

## Root cause hypothesis

Two contributing factors:

1. **Mailbox events are not drained** when the orch terminates a lane via the no-progress mechanism. The events generated by the dying iterations remain in the supervisor's inbox and are delivered after the lane is already dead.
2. **The supervisor's text-reply parser is brittle.** `skip` and `let it fail` must be standalone literal replies; when they are embedded in a longer message, they are ignored. There's no documented contract for the parser, so operators discover the strict format only by trial and error.

## Fix proposals

### A. Drain on lane termination

When the orch's `monitor` decides to terminate a lane (no-progress timeout, hard-fail, etc.), it should:
- Synchronously purge any pending mailbox events for that agent's outbox before marking the lane terminal
- Mark all in-flight "wants to exit" prompts for that agent as resolved/dropped

### B. Soft-abort tool for operator-controlled handoff

Add a tool `supervisor_takeover(reason: string)` that the supervisor calls when it intends to manually recover a failing batch. Effect:
- Pause the current wave (no new work scheduled)
- Drain all per-agent alert queues (no further "wants to exit" prompts)
- Preserve worktrees + state for the manual recovery
- Distinct from `orch_abort` which is destructive
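
One possible shape for the proposed tool, as a sketch only. The `Batch` model and the return payload are assumptions made for illustration; only the name `supervisor_takeover` and its `reason` parameter come from this proposal:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Batch:
    """Hypothetical in-memory model of a running batch."""
    wave_paused: bool = False
    alert_queues: dict = field(default_factory=dict)  # agent id -> list of alerts
    worktrees: list = field(default_factory=list)     # paths kept for recovery
    takeover_reason: Optional[str] = None


def supervisor_takeover(batch: Batch, reason: str) -> dict:
    # 1. Pause the current wave so no new work is scheduled.
    batch.wave_paused = True
    # 2. Drain every per-agent alert queue.
    drained = sum(len(queue) for queue in batch.alert_queues.values())
    for queue in batch.alert_queues.values():
        queue.clear()
    # 3. Leave worktrees and state untouched; only record why we took over.
    batch.takeover_reason = reason
    return {
        "paused": True,
        "drained_alerts": drained,
        "worktrees_preserved": list(batch.worktrees),
    }
```

Returning the drained count gives the supervisor a confirmation that the queue is actually empty, which is the signal that was missing during the incident above.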

### C. Document the text-reply parser strictly

If the parser stays as-is, document its rules in the supervisor primer:

> When responding to a "wants to exit" alert, the orch parser accepts these standalone replies (whole message, no other text):
> - `skip` — mark the step skipped, continue
> - `let it fail` — let the lane die, continue to next task
>
> Any other free-form text is interpreted as a steering message and forwarded to the worker via `send_agent_message`.
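
The contract above could be expressed roughly as follows. This is a sketch of the documented rules, not the actual parser; `ReplyAction` and `parse_reply` are hypothetical names, and trimming whitespace and ignoring case are assumptions the docs would need to confirm:

```python
from enum import Enum
from typing import Tuple


class ReplyAction(Enum):
    SKIP = "skip"
    LET_IT_FAIL = "let it fail"
    STEER = "steer"


def parse_reply(message: str) -> Tuple[ReplyAction, str]:
    """Classify a supervisor reply to a "wants to exit" alert."""
    # Assumption: surrounding whitespace and letter case are not significant.
    normalized = message.strip().casefold()
    if normalized == "skip":
        return ReplyAction.SKIP, ""
    if normalized == "let it fail":
        return ReplyAction.LET_IT_FAIL, ""
    # Anything else, including a keyword embedded in a sentence,
    # is treated as a steering message and passed through unchanged.
    return ReplyAction.STEER, message
```

Under these rules `parse_reply("I think we should let it fail")` is classified as a steering message, which matches the behavior observed in the first row of the intervention table.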

### Recommendation

Ship **A + B**. A fixes the immediate UX bug. B gives operators a clean handoff path for manual recovery (which is what we wanted but couldn't get). C is a docs-only patch that has value regardless.

## Acceptance criteria

- [ ] After a lane is killed by the no-progress mechanism, the supervisor receives **at most one** "lane terminated" notification (not a cascade of "wants to exit" prompts).
- [ ] Supervisor can call `supervisor_takeover()` to silence further per-agent alerts cleanly.
- [ ] `orch_abort()` consistently silences further zombie alerts (no more replays after the abort completes).

## Related

- Issue #537 is the upstream death-spiral trigger that exposed this UX.
- Issue #539 (`orch_resume` after abort) is downstream of the operator's frustrated `orch_abort` reflex caused by this issue.

Affected version: `taskplane@0.28.4`. A reproduction repo and the supervisor conversation log are available on request.

