
Post-failure event queue replays zombie 'wants to exit' alerts to supervisor; no clean drain mechanism #538

@HenryLach

Summary

After the orch terminates a lane (e.g., via the no-progress safety mechanism or orch_abort()), the supervisor continues to receive a cascade of buffered "Worker on lane N wants to exit with no progress" alerts for several minutes. None of the documented operator-side responses (steer, skip, let it fail, orch_abort, orch_skip_task) reliably drains the alert queue. This degrades the supervisor experience during recovery: every zombie alert looks like a fresh interrupt requiring action.

Reproduction

  1. Run any task at Review Level: 2 and induce a death-spiral (cf. issue #537, "Worker spins indefinitely when code review returns REVISE on a step already marked complete in STATUS"). To trigger it, mark a step complete in STATUS before the code review fires, then make the reviewer return REVISE.
  2. Let the orch's no-progress mechanism kill the lane (3 iterations, default).
  3. Observe: even after the lane is killed, the supervisor receives multiple buffered prompts asking for a steering message.
  4. Try to silence: respond with "skip" or "let it fail" as plain text. Result: another zombie alert arrives anyway.
  5. Try send_agent_message(type='abort', ...). Result: Agent "orch-..." is not currently running. Use orch_status() or orch_resume() before sending messages.
  6. Try orch_skip_task("MIG-002"). Result: No batch state found.
  7. Try orch_abort(). Result: graceful abort completes — but more zombie alerts continue afterward.

The alerts only stop when the engine itself crashes with Channel closed.

Concrete evidence

Production batch 20260506T105850:

Operator's intervention attempts and outcomes:

| Attempt | Result |
|---|---|
| Plain text reply "let it fail" inside a sentence | Ignored (alert came through anyway) |
| send_agent_message(type='abort', ...) | Agent is not currently running |
| orch_skip_task("MIG-002") | No batch state found |
| Standalone "skip" reply | Another zombie arrived |
| orch_abort() | "graceful abort complete", but more zombies followed |
| Standalone "let it fail" reply (much later) | Engine finally exited with Channel closed |

Total elapsed time spent fighting the alert queue: ~7 minutes.

Why this matters

A failed batch is a stressful moment for the operator, who needs to switch from "monitoring" to "diagnose + recover". Instead, the supervisor spends that time fighting an alert system that won't acknowledge the lane is dead, and the operator-perceived state diverges from the actual engine state.

This also makes the supervisor more likely to take destructive recovery actions out of frustration (e.g., orch_abort), which then triggers issue #539 (resume cannot reattach after abort).

Root cause hypothesis

Two contributing factors:

  1. Mailbox events are not drained when the orch terminates a lane via the no-progress mechanism. The events generated by the dying iterations remain in the supervisor's inbox and are delivered after the lane is already dead.
  2. The supervisor's text-reply parser is brittle. "skip" and "let it fail" are recognized only as standalone literal replies; when they appear inside a longer message, they are ignored. There's no documented contract for the parser, so operators discover the strict format only by trial and error.

Fix proposals

A. Drain on lane termination

When the orch's monitor decides to terminate a lane (no-progress timeout, hard-fail, etc.), it should:

  • Synchronously purge any pending mailbox events for that agent's outbox before marking the lane terminal
  • Mark all in-flight "wants to exit" prompts for that agent as resolved/dropped
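
To illustrate the ordering proposed in A, here is a minimal sketch in Python. The `Mailbox`, `Alert`, and `terminate_lane` names are hypothetical and are not taken from taskplane's internals; the only point is that buffered alerts are purged before the lane is marked terminal, and that a single "lane terminated" notification replaces them.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Alert:
    lane: int
    kind: str  # e.g. "wants_to_exit" or "lane_terminated" (hypothetical kinds)


@dataclass
class Mailbox:
    """Hypothetical supervisor inbox; not taskplane's real data structure."""
    pending: deque = field(default_factory=deque)

    def post(self, alert: Alert) -> None:
        self.pending.append(alert)

    def purge_lane(self, lane: int) -> int:
        """Drop every pending alert for `lane`; return how many were dropped."""
        before = len(self.pending)
        self.pending = deque(a for a in self.pending if a.lane != lane)
        return before - len(self.pending)


def terminate_lane(mailbox: Mailbox, lane: int, terminal_lanes: set) -> int:
    """Purge buffered alerts first, then mark the lane terminal."""
    dropped = mailbox.purge_lane(lane)
    terminal_lanes.add(lane)
    # Exactly one notification replaces the cascade of prompts.
    mailbox.post(Alert(lane, "lane_terminated"))
    return dropped


mailbox = Mailbox()
for _ in range(3):  # three no-progress iterations on lane 1
    mailbox.post(Alert(lane=1, kind="wants_to_exit"))
mailbox.post(Alert(lane=2, kind="wants_to_exit"))  # unrelated lane, must survive

terminal = set()
dropped = terminate_lane(mailbox, 1, terminal)
remaining = [(a.lane, a.kind) for a in mailbox.pending]
print(dropped, remaining)
```

A real implementation would presumably also need to reject late posts for lanes already in the terminal set, since events from dying iterations can arrive after the purge.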

B. Soft-abort tool for operator-controlled handoff

Add a tool supervisor_takeover(reason: string) that the supervisor calls when it intends to manually recover a failing batch. Effect:

  • Pause the current wave (no new work scheduled)
  • Drain all per-agent alert queues (no further "wants to exit" prompts)
  • Preserve worktrees + state for the manual recovery
  • Distinct from orch_abort which is destructive
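
The intended semantics of B could be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: `BatchState` and its fields are hypothetical, and only the `supervisor_takeover(reason)` name comes from this proposal.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BatchState:
    """Hypothetical in-memory batch state; all field names are assumptions."""
    wave_paused: bool = False
    worktrees_preserved: bool = True
    alert_queues: dict = field(default_factory=dict)  # agent id -> list of alerts
    takeover_reason: Optional[str] = None


def supervisor_takeover(state: BatchState, reason: str) -> int:
    """Pause scheduling and drain alerts; leave worktrees and state untouched."""
    state.wave_paused = True  # no new work is scheduled
    drained = sum(len(queue) for queue in state.alert_queues.values())
    for queue in state.alert_queues.values():
        queue.clear()  # no further "wants to exit" prompts are delivered
    state.takeover_reason = reason  # recorded for the manual recovery
    return drained


state = BatchState(
    alert_queues={
        "lane-1": ["wants to exit"] * 3,  # hypothetical agent ids
        "lane-2": ["wants to exit"],
    }
)
drained = supervisor_takeover(state, "manual recovery of MIG-002")
print(drained, state.wave_paused, state.worktrees_preserved)
```

Unlike orch_abort, nothing in this sketch deletes worktrees or batch state; the call only stops scheduling and silences the queues.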

C. Document the text-reply parser strictly

If the parser stays as-is, document its rules in the supervisor primer:

When responding to a "wants to exit" alert, the orch parser accepts these standalone replies (whole message, no other text):

  • skip — mark the step skipped, continue
  • let it fail — let the lane die, continue to next task

Any other free-form text is interpreted as a steering message and forwarded to the worker via send_agent_message.
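
The contract described above can be written down as a small sketch. It reflects the behavior observed in this report, not the actual parser source; `parse_reply` and `Reply` are hypothetical names, and trimming surrounding whitespace is an assumption.

```python
from enum import Enum


class Reply(Enum):
    SKIP = "skip"
    LET_IT_FAIL = "let it fail"
    STEER = "steer"  # anything else is forwarded to the worker


def parse_reply(message: str) -> Reply:
    """Classify a supervisor reply to a "wants to exit" alert.

    Only a whole-message match counts as a command. Stripping
    surrounding whitespace is an assumption, not observed behavior.
    """
    text = message.strip()
    if text == "skip":
        return Reply.SKIP
    if text == "let it fail":
        return Reply.LET_IT_FAIL
    return Reply.STEER


print(parse_reply("skip"))
print(parse_reply("let it fail"))
print(parse_reply("I think we should let it fail here"))
```

The third call shows the trap operators hit: a command embedded in a sentence is treated as a steering message rather than as "let it fail".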

Recommendation

Ship A + B. A fixes the immediate UX bug. B gives operators a clean handoff path for manual recovery (which is what we wanted but couldn't get). C is a docs-only patch that has value regardless.

Acceptance criteria

  • After a lane is killed by the no-progress mechanism, the supervisor receives at most one "lane terminated" notification (not a cascade of "wants to exit" prompts).
  • Supervisor can call supervisor_takeover() to silence further per-agent alerts cleanly.
  • orch_abort() consistently silences further zombie alerts (no more replays after the abort completes).

Related

Affected version: taskplane@0.28.4. Reproducible repo + supervisor conversation log available on request.
