Summary
After the orch terminates a lane (e.g., via the no-progress safety mechanism or orch_abort()), the supervisor continues to receive a cascade of buffered "Worker on lane N wants to exit with no progress" alerts for several minutes. None of the documented operator-side responses (steer, skip, let it fail, orch_abort, orch_skip_task) reliably drain the alert queue. The supervisor experience during recovery is degraded — every zombie alert looks like a fresh interrupt requiring action.
Reproduction
Run a batch at Review Level: 2 and induce a death-spiral (cf. issue #537, "Worker spins indefinitely when code review returns REVISE on a step already marked complete in STATUS": trigger by marking a step complete in STATUS before the code review fires, then make the reviewer return REVISE).
Let the orch's no-progress mechanism kill the lane (3 iterations, default).
Observe: even after the lane is killed, the supervisor receives multiple buffered prompts asking for a steering message.
Try to silence: respond with "skip" or "let it fail" as plain text. Result: another zombie alert arrives anyway.
Try send_agent_message(type='abort', ...). Result: Agent "orch-..." is not currently running. Use orch_status() or orch_resume() before sending messages.
Try orch_skip_task("MIG-002"). Result: No batch state found.
Try orch_abort(). Result: graceful abort completes — but more zombie alerts continue afterward.
The alerts only stop when the engine itself crashes with Channel closed.
Concrete evidence
Production batch 20260506T105850:
Lane killed at ~14:21 UTC after 3 no-progress iterations
Supervisor received 5 distinct "wants to exit" alerts between 14:21 and 14:28 UTC (some matching iteration counts, some replaying earlier iterations)
Worker said: "" (no payload — see issue #540: "Worker said:" empty in early no-progress exit alerts — supervisor lacks signal for early intervention)
list_active_agents() returned "No active agents" the entire time the alerts were arriving
Operator's intervention attempts and outcomes:
| Attempt | Result |
| --- | --- |
| Plain text reply "let it fail" inside a sentence | Ignored (alert came through anyway) |
| send_agent_message(type='abort', ...) | "Agent is not currently running" |
| orch_skip_task("MIG-002") | "No batch state found" |
| Standalone "skip" reply | Another zombie arrived |
| orch_abort() | "graceful abort complete" — but more zombies followed |
| Standalone "let it fail" reply (much later) | Engine finally exited with Channel closed |
Total elapsed time spent fighting the alert queue: ~7 minutes.
Why this matters
A failed batch is a stressful operator moment — they need to switch from "monitoring" to "diagnose + recover". Instead, the supervisor spends time fighting an alert system that won't acknowledge the lane is dead. The operator-perceived state diverges from the actual engine state.
This also makes the supervisor more likely to take destructive recovery actions out of frustration (e.g., orch_abort), which then triggers issue #539 (resume cannot reattach after abort).
Root cause hypothesis
Two contributing factors:
Mailbox events are not drained when the orch terminates a lane via the no-progress mechanism. The events generated by the dying iterations remain in the supervisor's inbox and are delivered after the lane is already dead.
The supervisor's text-reply parser is brittle. "skip" and "let it fail" must be standalone, literal replies; if they appear inside a longer message they are silently treated as free-form steering text. There's no documented contract for the parser, so operators discover the strict format only by trial and error.
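For illustration, a minimal sketch of what such a strict standalone-reply parser plausibly looks like; the names and structure below are assumptions, not actual taskplane code:

```python
# Hypothetical reconstruction of the strict reply parser described above.
# None of these names come from the taskplane source; this only illustrates
# why "skip" or "let it fail" embedded in a longer sentence is ignored.

STANDALONE_REPLIES = {
    "skip": "SKIP_STEP",
    "let it fail": "LET_LANE_FAIL",
}

def classify_reply(text: str) -> str:
    """Return a control action only if the reply is exactly a known phrase."""
    normalized = text.strip().lower()
    if normalized in STANDALONE_REPLIES:
        return STANDALONE_REPLIES[normalized]
    # Anything else, including "please just let it fail", is treated as
    # free-form steering text and forwarded to the worker.
    return "FORWARD_AS_STEERING"

assert classify_reply("let it fail") == "LET_LANE_FAIL"
assert classify_reply("ok, let it fail please") == "FORWARD_AS_STEERING"
```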
Fix proposals
A. Drain on lane termination
When the orch's monitor decides to terminate a lane (no-progress timeout, hard-fail, etc.), it should (see the sketch after this list):
Synchronously purge any pending mailbox events in that agent's outbox before marking the lane terminal
Mark all in-flight "wants to exit" prompts for that agent as resolved/dropped
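A minimal sketch of that drain step, assuming a per-agent mailbox keyed by agent id; the class and function names are hypothetical, not existing orch internals:

```python
# Hypothetical sketch of proposal A: purge pending prompts before a lane
# is marked terminal. The mailbox structure is assumed, not taskplane's.
from collections import defaultdict, deque

class SupervisorMailbox:
    def __init__(self):
        self._pending = defaultdict(deque)  # agent_id -> queued alert events

    def push(self, agent_id: str, event: dict) -> None:
        self._pending[agent_id].append(event)

    def drain_agent(self, agent_id: str) -> int:
        """Drop every queued event for one agent; return how many were dropped."""
        return len(self._pending.pop(agent_id, ()))

def terminate_lane(mailbox: SupervisorMailbox, agent_id: str, reason: str) -> None:
    # 1. Purge buffered "wants to exit" prompts so they can never replay.
    dropped = mailbox.drain_agent(agent_id)
    # 2. Mark the lane terminal and emit a single terminal notification.
    mailbox.push("supervisor", {
        "type": "lane_terminated",
        "agent": agent_id,
        "reason": reason,
        "suppressed_prompts": dropped,
    })
```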
B. Soft-abort tool for operator-controlled handoff
Add a tool supervisor_takeover(reason: string) that the supervisor calls when it intends to manually recover a failing batch. Effect (a code sketch follows this list):
Pause the current wave (no new work scheduled)
Drain all per-agent alert queues (no further "wants to exit" prompts)
Preserve worktrees + state for manual recovery
Distinct from orch_abort, which is destructive
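A sketch of the proposed tool under stated assumptions: only the name supervisor_takeover comes from this proposal, and the engine interfaces (scheduler, mailbox, state store) are invented for illustration.

```python
# Hypothetical sketch of the proposed supervisor_takeover tool.
# Only the tool name comes from this proposal; the engine interfaces
# used below are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class TakeoverResult:
    paused_wave: str
    drained_alerts: int
    preserved_worktrees: list[str]

def supervisor_takeover(engine, reason: str) -> TakeoverResult:
    """Hand a failing batch over to the operator without destroying state."""
    # 1. Pause the current wave: no new work scheduled, running lanes untouched.
    wave = engine.scheduler.pause_current_wave(reason=reason)
    # 2. Drain every per-agent alert queue so no further "wants to exit"
    #    prompts reach the supervisor.
    drained = sum(engine.mailbox.drain_agent(agent) for agent in engine.known_agents())
    # 3. Preserve worktrees and batch state on disk for manual recovery;
    #    unlike orch_abort, nothing is torn down.
    worktrees = engine.state.preserve_worktrees()
    engine.state.record_event("supervisor_takeover", reason=reason)
    return TakeoverResult(wave, drained, worktrees)
```

The design point is that nothing is torn down: the operator gets silence plus preserved worktrees and state, which orch_abort cannot offer today.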
C. Document the text-reply parser strictly
If the parser stays as-is, document its rules in the supervisor primer:
When responding to a "wants to exit" alert, the orch parser accepts these standalone replies (whole message, no other text):
skip — mark the step skipped, continue
let it fail — let the lane die, continue to next task
Any other free-form text is interpreted as a steering message and forwarded to the worker via send_agent_message.
Recommendation
Ship A + B. A fixes the immediate UX bug. B gives operators a clean handoff path for manual recovery (which is what we wanted but couldn't get). C is a docs-only patch that has value regardless.
Acceptance criteria
After a lane is killed by the no-progress mechanism, the supervisor receives at most one "lane terminated" notification, not a cascade of "wants to exit" prompts (exercised in the test sketch after this list).
Supervisor can call supervisor_takeover() to silence further per-agent alerts cleanly.
orch_abort() consistently silences further zombie alerts (no more replays after the abort completes).
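A hedged sketch of how the first criterion could be checked, reusing the hypothetical SupervisorMailbox and terminate_lane from the proposal A sketch above; these are not real taskplane test fixtures:

```python
# Hypothetical acceptance test for criterion 1, built on the proposal A
# sketch above; none of these names are real taskplane test APIs.
def test_no_progress_kill_emits_single_notification():
    mailbox = SupervisorMailbox()
    agent = "orch-lane-3"

    # Simulate three no-progress iterations, each buffering a zombie prompt.
    for i in range(3):
        mailbox.push(agent, {"type": "wants_to_exit", "iteration": i + 1})

    terminate_lane(mailbox, agent, reason="no-progress")

    # The buffered prompts for the dead lane must be gone...
    assert mailbox.drain_agent(agent) == 0
    # ...and the supervisor must see exactly one terminal notification
    # (peeking at the internal queue for brevity).
    events = list(mailbox._pending["supervisor"])
    assert [e["type"] for e in events] == ["lane_terminated"]
```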
Related
Issue #539 (resume cannot reattach via orch_resume after abort) is downstream of the operator's frustrated orch_abort reflex caused by this issue.
Affected version: taskplane@0.28.4. Reproducible repo + supervisor conversation log available on request.