Orchestrator-specialist coordination: failure modes surfaced in long/remediation-heavy sessions #484

## Context

Surfaced during PR #477 (#401 teachback gate, session `pact-b791f352`, 2026-04-20). A 6+ hour session with 3 remediation cycles, a blind round 2, and a rebase-then-force-push exposed 3 orchestrator-specialist coordination failure modes that hadn't been salient in shorter sessions. All 3 are observable, repeatable, and fixable at the platform or protocol level.

## Failure Mode 1 — Name-collision on specialist reuse

**Pattern**: Orchestrator spawns a fresh agent with `name=X` while `X` is already alive (idle consultant from earlier phase). Platform silently auto-appends a suffix (`X` → `X-2`) and creates a distinct agent rather than failing or aliasing.

**Observed instances this session** (3):
- `secretary` at harvest dispatch → `secretary-2` spawned; original + new both ran harvest
- `blind-architect-2` at fix dispatch → `blind-architect-2-2` spawned; both instances produced teachbacks
- `blind-backend-coder-2` at fix dispatch → `blind-backend-coder-2-2` spawned; both produced nearly-identical commits (one actually landed, other's staging was no-op)

**Consequence**: SendMessages + task ownership go to one instance; dispatch prompt goes to the other; orchestrator confusion about which instance is "the" fixer; wasted context on duplicate work.

**Mitigation (orchestrator)**: Always use SendMessage for reuse-as-fixer rather than re-spawning with same name. Or spawn with explicitly differentiated names (`arch-fixer-cycle3` vs `blind-architect-2`).

**Mitigation (platform/PACT)**: Make TeamJoin fail or alias (not silently suffix) when `name` matches an existing teammate. Alternative: add documentation to `pact-agent-teams` skill explicitly warning against this anti-pattern.
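
A minimal sketch of the fail-or-alias behavior proposed above. This is not the platform's TeamJoin implementation, which I haven't seen; `TeamRegistry`, `join`, `on_collision`, and `NameCollisionError` are hypothetical names used only to illustrate the semantics.

```python
from dataclasses import dataclass, field


class NameCollisionError(ValueError):
    """Raised when a spawn request reuses the name of a live teammate."""


@dataclass
class TeamRegistry:
    # Hypothetical in-memory stand-in for the platform's teammate table:
    # maps teammate name -> agent id.
    live: dict = field(default_factory=dict)

    def join(self, name: str, agent_id: str, on_collision: str = "fail") -> str:
        """Register a teammate without ever silently renaming it.

        on_collision="fail"  -> raise NameCollisionError
        on_collision="alias" -> return the id of the already-live teammate
        """
        if on_collision not in ("fail", "alias"):
            raise ValueError(f"unknown on_collision mode: {on_collision!r}")
        existing = self.live.get(name)
        if existing is None:
            self.live[name] = agent_id
            return agent_id
        if on_collision == "alias":
            # Route the caller to the existing instance instead of creating X-2.
            return existing
        raise NameCollisionError(
            f"teammate {name!r} is already alive; reuse it via SendMessage "
            "or spawn with a differentiated name"
        )
```

Either mode keeps a single instance per name, so messages, task ownership, and the dispatch prompt cannot split across `X` and `X-2`.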

---

## Failure Mode 2 — Async stale-state replies from teammates

**Pattern**: SendMessage is fire-and-forget async. When orchestrator sends "stand down" or "stop," teammate may compose their next reply based on state from 2-3 turns ago, seeing the stand-down only after they've acted.

**Observed instances this session** (~6):
- Multiple teammates responded with "proceeding" replies that crossed my redirect messages
- `blind-backend-coder-2` committed despite my stand-down (their reply acknowledged the stand-down AFTER commit landed)
- `blind-architect-2` composed a fix teachback after being told to stand down; it arrived after my redirect, but from their POV it was composed before the redirect was visible

**Consequence**: wasted cycles, duplicated work, confused ownership signals, erosion of orchestrator directive authority.

**Mitigation (orchestrator)**: For genuinely terminal directives (stop acting, shut down), use structured `shutdown_request` rather than text. Platform treats `shutdown_request` unambiguously; text directives are agent-interpreted.

**Mitigation (PACT)**: Document "stand down vs shutdown_request" in `pact-agent-teams` skill. Text "stand down" is advisory; `shutdown_request` is structural. Teammates should treat text directives as best-effort under async conditions.
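
To make the advisory-vs-structural distinction concrete, here is a toy model. It is a sketch under assumptions, not the platform's message handling: `Message`, `Teammate`, and the `kind` field are hypothetical, and the only thing taken from this issue is that `shutdown_request` is a structured message while "stand down" is plain text.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    kind: str        # "text" or "shutdown_request" (field name is an assumption)
    body: str = ""


class Teammate:
    """Toy teammate: text sits in the inbox until read; shutdown is structural."""

    def __init__(self) -> None:
        self.inbox = deque()
        self.actions = []
        self.stopped = False

    def deliver(self, msg: Message) -> None:
        self.inbox.append(msg)

    def act(self, action: str) -> bool:
        # Structural check: a pending shutdown_request blocks the action
        # without the agent having to read or interpret anything.
        if any(m.kind == "shutdown_request" for m in self.inbox):
            self.stopped = True
        if self.stopped:
            return False
        # Unread text (e.g. "stand down") does not block: it only takes
        # effect once the agent reads and interprets it on a later turn.
        self.actions.append(action)
        return True
```

In this model a text "stand down" queued before `act("commit fix")` does not prevent the commit, which is the `blind-backend-coder-2` case above; a queued `shutdown_request` does.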

---

## Failure Mode 3 — Completion-gate ambiguity (blocked vs done)

**Pattern**: `teammate_completion_gate.py` fires on idle + HANDOFF-metadata presence, interpreting "has produced output" as "done." Agents under sustained gate pressure either procedurally mark complete despite partial work, OR push through unauthorized scope to stop the nudges.

**Observed instances this session**:
- `backend-coder-1` at 3-commit context wall: gate fired 60+ times while work was PARTIAL (12 of 15 commits remained). Coder held the integrity line via SACROSANCT completion rule but eventually closed the task procedurally to break the idle loop.
- `review-architect` at docs/ gitignored blocker: gate fired repeatedly while waiting for lead decision on option 1/2/3; the agent flagged "marked completed procedurally to stop the loop but NO artifact produced."

**Consequence**: False "completed" status on task list; true state masked by procedural close; orchestrator loses visibility into partial-work situations.

**Mitigation (PACT)**: Extend handoff metadata schema with a `status` field (`partial` | `blocked` | `done`). The completion gate should distinguish these states: only fire the "mark completed" nudge on `done`; on `partial` / `blocked`, surface the blocker to the orchestrator instead.
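
A sketch of the gate decision with the proposed `status` field. I'm not reproducing the actual contents of `teammate_completion_gate.py` here; `gate_action`, the returned strings, and the handling of a missing status are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class HandoffStatus(str, Enum):
    PARTIAL = "partial"
    BLOCKED = "blocked"
    DONE = "done"


def gate_action(is_idle: bool, handoff: Optional[dict]) -> str:
    """Decide what the gate does: 'none', 'nudge_complete', or 'escalate'."""
    if not is_idle or handoff is None:
        return "none"
    try:
        status = HandoffStatus(handoff.get("status"))
    except ValueError:
        # Missing or unrecognized status. Assumption: escalate to the
        # orchestrator rather than infer "done" from mere output presence.
        return "escalate"
    if status is HandoffStatus.DONE:
        return "nudge_complete"
    # partial / blocked: surface to the orchestrator instead of nudging.
    return "escalate"
```

The open design question is the missing-status case: escalating is the conservative choice, but defaulting legacy handoffs to `done` would preserve current behavior during migration.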

---

## Shared root cause

All 3 failure modes are amplified by **session complexity**: long sessions, multiple remediation cycles, reviewer-as-fixer reuse, and parallel teammate dispatch stack coordination risk. PACT's current skills cover single-cycle orchestration well; the multi-cycle / long-running case has gaps.

## Suggested scope for implementation

Rather than 3 separate fixes, I'd propose one coordinated patch:

1. **Documentation**: update `pact-agent-teams` skill with an "orchestrator coordination patterns" section covering reuse-via-SendMessage, shutdown_request usage, and completion-gate semantics.
2. **Schema addition**: `handoff.status: partial | blocked | done` — small schema change, enables gate refinement.
3. **Gate logic**: `teammate_completion_gate.py` reads new status field; only fires completion nudge on `done`; surfaces partial/blocked to orchestrator.
4. **Optional platform ask**: file a Claude Code issue requesting that TeamJoin fail or alias on name collision rather than silently appending a suffix.

## Evidence artifacts from this session

- Session journal: `~/.claude/pact-sessions/PACT-prompt/b791f352-ab87-4e6a-a618-05227e63284c/session-journal.jsonl` (remediation cycle events + review_dispatch events)
- Memories saved by secretary capturing the 3 failure modes as institutional knowledge (visible in pact-memory under entities `shared-worktree-race`, `force-push-state-sync`, `procedural-close-pattern`)
- PR #477 commit history shows rebase + force-push + 3 remediation cycles

## Priority

Medium. Session ran to completion and shipped correct work; no viability threat. But patterns will recur in future multi-cycle sessions and are worth cleaning up structurally rather than ad-hoc per session.
