
Orchestrator-specialist coordination: failure modes surfaced in long/remediation-heavy sessions #484

@michael-wojcik

Context

Surfaced during PR #477 (#401 teachback gate, session pact-b791f352, 2026-04-20). A 6+ hour session with 3 remediation cycles, a blind round 2, and a rebase-then-force-push exposed 3 orchestrator-specialist coordination failure modes that had not been salient in shorter sessions. All 3 are observable and repeatable, and all 3 are fixable at the platform or protocol level.

Failure Mode 1 — Name-collision on specialist reuse

Pattern: Orchestrator spawns a fresh agent with name=X while X is already alive (for example, an idle consultant from an earlier phase). The platform silently auto-appends a suffix (X-2) and creates a distinct agent rather than failing or aliasing.

Observed instances this session (3):

  • secretary at harvest dispatch → secretary-2 spawned; original + new both ran harvest
  • blind-architect-2 at fix dispatch → blind-architect-2-2 spawned; both instances produced teachbacks
  • blind-backend-coder-2 at fix dispatch → blind-backend-coder-2-2 spawned; both produced nearly identical commits (one landed; the other's staging was a no-op)

Consequence: SendMessage traffic and task ownership go to one instance while the dispatch prompt goes to the other; the orchestrator cannot tell which instance is "the" fixer; context is wasted on duplicate work.

Mitigation (orchestrator): Always use SendMessage for reuse-as-fixer rather than re-spawning with same name. Or spawn with explicitly differentiated names (arch-fixer-cycle3 vs blind-architect-2).

Mitigation (platform/PACT): Make TeamJoin fail or alias (not silently suffix) when name matches an existing teammate. Alternative: add documentation to pact-agent-teams skill explicitly warning against this anti-pattern.
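
The two mitigations above can be sketched as follows. This is a minimal illustration, not the real TeamJoin: `TeamRegistry`, `NameCollisionError`, the `on_collision` parameter, and `fixer_name` are all hypothetical names invented for this example.

```python
class NameCollisionError(Exception):
    """Raised when a spawn requests a name that a live teammate already holds."""


class TeamRegistry:
    """Hypothetical in-memory stand-in for the platform's teammate table."""

    def __init__(self):
        self._live = {}  # name -> agent handle

    def join(self, name, agent, on_collision="fail"):
        # Assumed policy: never silently suffix. Either alias to the
        # existing teammate or refuse the join outright.
        if name in self._live:
            if on_collision == "alias":
                return self._live[name]
            raise NameCollisionError(f"teammate {name!r} is already alive")
        self._live[name] = agent
        return agent


def fixer_name(role, cycle):
    """Orchestrator-side alternative: build an explicitly differentiated name."""
    return f"{role}-fixer-cycle{cycle}"
```

With a policy like this, re-spawning `secretary` while the original is idle would either return the original handle or raise, so the orchestrator learns about the collision at dispatch time instead of discovering two live instances later.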


Failure Mode 2 — Async stale-state replies from teammates

Pattern: SendMessage is fire-and-forget async. When orchestrator sends "stand down" or "stop," teammate may compose their next reply based on state from 2-3 turns ago, seeing the stand-down only after they've acted.

Observed instances this session (~6):

  • Multiple teammates responded with "proceeding" replies that crossed my redirect messages
  • blind-backend-coder-2 committed despite my stand-down (their reply acknowledged the stand-down AFTER commit landed)
  • blind-architect-2 composed a fix teachback after being told to stand down; it arrived after my redirect, but from their point of view it was written before the redirect was seen

Consequence: wasted cycles, duplicated work, confused ownership signals, erosion of orchestrator directive authority.

Mitigation (orchestrator): For genuinely terminal directives (stop acting, shut down), use structured shutdown_request rather than text. Platform treats shutdown_request unambiguously; text replies are agent-interpreted.

Mitigation (PACT): Document "stand down vs shutdown_request" in pact-agent-teams skill. Text "stand down" is advisory; shutdown_request is structural. Teammates should treat text directives as best-effort under async conditions.
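
The advisory-versus-structural distinction could look roughly like this on the teammate side. It is a sketch under assumptions: `Directive`, `DirectiveKind`, and `triage` are hypothetical, and the real message shapes for SendMessage and shutdown_request are not shown in this issue.

```python
from dataclasses import dataclass
from enum import Enum


class DirectiveKind(Enum):
    TEXT = "text"                          # advisory, agent-interpreted
    SHUTDOWN_REQUEST = "shutdown_request"  # structural, unambiguous


@dataclass(frozen=True)
class Directive:
    kind: DirectiveKind
    body: str = ""


def triage(inbox):
    """Split pending messages into a hard-stop flag and advisory texts.

    Intended to run right before an irreversible step (such as a commit),
    so that a directive that arrived mid-turn is seen before acting.
    """
    must_halt = any(m.kind is DirectiveKind.SHUTDOWN_REQUEST for m in inbox)
    advisory = [m.body for m in inbox if m.kind is DirectiveKind.TEXT]
    return must_halt, advisory
```

A text "stand down" lands in the advisory list and is honored on a best-effort basis; only a shutdown_request sets the hard-stop flag.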


Failure Mode 3 — Completion-gate ambiguity (blocked vs done)

Pattern: teammate_completion_gate.py fires on idle + HANDOFF-metadata presence, interpreting "has produced output" as "done." Agents under sustained gate pressure either procedurally mark complete despite partial work, OR push through unauthorized scope to stop the nudges.

Observed instances this session:

  • backend-coder-1 at 3-commit context wall: gate fired 60+ times while work was PARTIAL (12 of 15 commits remained). Coder held the integrity line via the SACROSANCT completion rule but eventually closed the task procedurally to break the idle loop.
  • review-architect at docs/ gitignored blocker: gate fired repeatedly while waiting for lead decision on option 1/2/3; the agent flagged "marked completed procedurally to stop the loop but NO artifact produced."

Consequence: False "completed" status on task list; true state masked by procedural close; orchestrator loses visibility into partial-work situations.

Mitigation (PACT): Extend handoff metadata schema with a status field (partial | blocked | done). The completion gate should distinguish these states: only fire the "mark completed" nudge on done; on partial / blocked, surface the blocker to the orchestrator instead.
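
A minimal sketch of the proposed gate decision, assuming handoff metadata is available as a dict with an optional `status` key. `HandoffStatus`, `gate_action`, and the two returned action strings are hypothetical; the current internals of teammate_completion_gate.py are not shown in this issue.

```python
from enum import Enum


class HandoffStatus(Enum):
    PARTIAL = "partial"
    BLOCKED = "blocked"
    DONE = "done"


def gate_action(handoff):
    """Decide what the gate does for an idle teammate's handoff metadata."""
    try:
        status = HandoffStatus(handoff.get("status"))
    except ValueError:
        # Assumption: a missing or unknown status is treated as "not done",
        # so the gate escalates rather than nudging toward completion.
        return "surface_to_orchestrator"
    if status is HandoffStatus.DONE:
        return "nudge_mark_completed"
    # partial / blocked: report the blocker instead of repeating the nudge.
    return "surface_to_orchestrator"
```

How to treat handoffs that lack the field is a design choice. Escalating is the conservative option shown here; falling back to the current nudge behavior would preserve compatibility with existing handoffs.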


Shared root cause

All 3 failure modes are amplified by session complexity: long sessions, multiple remediation cycles, reviewer-as-fixer reuse, and parallel teammate dispatch each add coordination risk, and the risks compound. PACT's current skills cover single-cycle orchestration well; the multi-cycle / long-running case has gaps.

Suggested scope for implementation

Rather than 3 separate fixes, I'd propose one coordinated patch:

  1. Documentation: update pact-agent-teams skill with an "orchestrator coordination patterns" section covering reuse-via-SendMessage, shutdown_request usage, and completion-gate semantics.
  2. Schema addition: handoff.status: partial | blocked | done — small schema change, enables gate refinement.
  3. Gate logic: teammate_completion_gate.py reads new status field; only fires completion nudge on done; surfaces partial/blocked to orchestrator.
  4. Optional platform ask: file a Claude Code issue requesting TeamJoin behavior on name collision (fail/alias rather than silent suffix).

Evidence artifacts from this session

  • Session journal: ~/.claude/pact-sessions/PACT-prompt/b791f352-ab87-4e6a-a618-05227e63284c/session-journal.jsonl (remediation cycle events + review_dispatch events)
  • Memories saved by secretary capturing the 3 failure modes as institutional knowledge (visible in pact-memory under entities shared-worktree-race, force-push-state-sync, procedural-close-pattern)
  • PR #477, feat(#401): variety-tiered teachback gate with generation-shaped content validation. Its commit history shows the rebase, the force-push, and the 3 remediation cycles.

Priority

Medium. Session ran to completion and shipped correct work; no viability threat. But patterns will recur in future multi-cycle sessions and are worth cleaning up structurally rather than ad-hoc per session.
