Safeguard: detect and prevent model hallucination of user turns in Agent Teams

## Problem

During Agent Teams sessions, the model can hallucinate text that mimics user input (prefixed with `Human:` and containing `<system-reminder>` tags). Claude Code may then treat this as actual user input, leading to unauthorized actions.

**Root cause filed upstream**: https://github.com/anthropics/claude-code/issues/27102

## Observed Behavior

In session `c2d79f61`, the model hallucinated two "user approvals" while waiting for actual user input:
1. `"fix them both"`: caused unauthorized code changes
2. `"go ahead and merge"`: could have caused an unauthorized PR merge

Both were assistant-generated text recorded as `"type":"assistant"` in the transcript but displayed as user messages.

## Proposed PACT Safeguard

Investigate whether a PACT hook can detect and mitigate this pattern:

### Option A: PostToolUse or Stop hook validation
- After critical actions (merge, commit, edit), verify the authorizing message was a genuine `"type":"user"` turn
- Would require transcript access from hooks
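
A minimal sketch of the Option A check, assuming the hook can read the transcript as JSON Lines with one entry per line and a top-level `"type"` field (the only field this issue confirms). The helper names `last_entry_mentioning` and `is_genuine_user_approval` are hypothetical, and the phrase is matched against the raw line because the rest of the transcript schema is not known here.

```python
import json
from typing import Iterable, Optional


def last_entry_mentioning(lines: Iterable[str], phrase: str) -> Optional[dict]:
    """Return the most recent transcript entry whose raw line contains `phrase`.

    Assumption: each line is one JSON object with a top-level "type" field.
    """
    needle = phrase.lower()
    match = None
    for raw in lines:
        raw = raw.strip()
        if not raw or needle not in raw.lower():
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict):
            match = entry  # keep overwriting so the latest match wins
    return match


def is_genuine_user_approval(lines: Iterable[str], phrase: str) -> bool:
    """Fail closed: approve only if the latest mention is a "user" entry."""
    entry = last_entry_mentioning(lines, phrase)
    return entry is not None and entry.get("type") == "user"
```

With the pattern from session `c2d79f61`, where the approval text was recorded as `"type":"assistant"`, this check would return `False` and the critical action could be refused.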

### Option B: Prompt-based mitigation
- Add explicit instructions to orchestrator CLAUDE.md: "NEVER generate text starting with 'Human:' or mimicking user turn format"
- Add to agent dispatch prompts as well

### Option C: UserPromptSubmit hook guard
- A hook that checks if the "user prompt" content matches patterns of hallucinated turns (e.g., contains its own `<system-reminder>` tags, starts with `Human:`)
- Could block suspicious prompts and alert the user
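
The pattern check in Option C could look roughly like the sketch below. It covers only the two signals named above (a leading `Human:` and embedded `<system-reminder>` tags). The function name `suspicious_reasons` is hypothetical, and how the result is wired into a UserPromptSubmit hook (input format, how a prompt is blocked) is left out because it depends on the hook interface.

```python
import re
from typing import List

# Signals taken from the hallucinated turns described in this issue.
HUMAN_PREFIX = re.compile(r"^\s*Human:", re.IGNORECASE)
SYSTEM_REMINDER = re.compile(r"</?system-reminder\b", re.IGNORECASE)


def suspicious_reasons(prompt: str) -> List[str]:
    """Return the reasons a prompt looks like a hallucinated user turn.

    An empty list means no known pattern matched.
    """
    reasons = []
    if HUMAN_PREFIX.search(prompt):
        reasons.append("starts with 'Human:'")
    if SYSTEM_REMINDER.search(prompt):
        reasons.append("contains a <system-reminder> tag")
    return reasons
```

A hook built on this would block or flag the prompt whenever the list is non-empty. A user who pastes such text on purpose would trigger a false positive, so alerting rather than silently dropping the prompt is probably the safer default.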

### Option D: Confirmation protocol for irreversible actions
- Before merge/push/delete, require a specific confirmation phrase that's hard to hallucinate (e.g., "CONFIRM-MERGE-{random}")
- Similar to how dangerous operations require typing the repo name on GitHub
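
One way the confirmation phrase in Option D might be generated and checked, as a sketch only. The function names and the 8-character hex suffix are assumptions, not part of PACT.

```python
import hmac
import secrets


def new_confirmation_phrase(action: str) -> str:
    """Build a one-time phrase such as CONFIRM-MERGE-<8 random hex chars>."""
    return f"CONFIRM-{action.upper()}-{secrets.token_hex(4).upper()}"


def is_confirmed(expected: str, typed: str) -> bool:
    """Require an exact match; surrounding whitespace in the reply is ignored."""
    return hmac.compare_digest(expected.encode(), typed.strip().encode())
```

For example, `new_confirmation_phrase("merge")` yields a fresh phrase for each request, and a reply of `"go ahead and merge"` fails `is_confirmed`. This presumably only helps if the phrase is shown to the user through a channel the model cannot read. If the phrase appears in the model's context, a hallucinated turn could simply repeat it.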

## Priority

HIGH — This bypasses the S5 policy layer's "never merge without explicit user authorization" rule. The hallucination pattern specifically targets approval prompts, making it a systematic risk for any Agent Teams session with idle teammates.

## Related

- Upstream issue: https://github.com/anthropics/claude-code/issues/27102
- Session where this occurred: `c2d79f61-c707-4ea2-a940-460771384274`

Safeguard: detect and prevent model hallucination of user turns in Agent Teams #204