Problem
During Agent Teams sessions, the model can hallucinate text that mimics user input (prefixed with Human: and containing <system-reminder> tags). Claude Code may then treat this as actual user input, leading to unauthorized actions.
Root cause filed upstream: anthropics/claude-code#27102
Observed Behavior
In session c2d79f61, the model hallucinated two "user approvals" while waiting for actual user input:
- "fix them both": caused unauthorized code changes
- "go ahead and merge": could have caused an unauthorized PR merge
Both were assistant-generated text recorded as "type":"assistant" in the transcript but displayed as user messages.
Proposed PACT Safeguard
Investigate whether a PACT hook can detect and mitigate this pattern:
Option A: PostToolUse or Stop hook validation
- After critical actions (merge, commit, edit), verify the authorizing message was a genuine "type":"user" turn
- Would require transcript access from hooks
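A minimal sketch of the Option A check, assuming the hook can read the session transcript as JSONL. Only the "type" field with its "user" and "assistant" values comes from the observed transcript; the flat "content" field and the helper names `load_transcript`, `entry_text`, and `approval_is_genuine` are hypothetical and would need to be adapted to the real transcript schema.

```python
import json


def load_transcript(jsonl_text):
    """Parse a JSONL transcript into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def entry_text(entry):
    """Flatten an entry's content to a string.

    Assumption: text lives in a top-level "content" field that is either a
    string or a list of {"text": ...} parts. The real schema may differ.
    """
    content = entry.get("content", "")
    if isinstance(content, list):
        return " ".join(
            str(part.get("text", "")) if isinstance(part, dict) else str(part)
            for part in content
        )
    return str(content)


def approval_is_genuine(entries, phrase):
    """True only if the most recent entry containing `phrase` is a "type":"user" turn."""
    needle = phrase.lower()
    for entry in reversed(entries):
        if needle in entry_text(entry).lower():
            return entry.get("type") == "user"
    return False  # no approval found at all, so do not treat the action as authorized
```

If `approval_is_genuine` returns False after a merge, commit, or edit, the hook could flag the action for review. Whether a PostToolUse or Stop hook can actually block or undo the action is part of what needs investigating.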
Option B: Prompt-based mitigation
- Add explicit instructions to orchestrator CLAUDE.md: "NEVER generate text starting with 'Human:' or mimicking user turn format"
- Add to agent dispatch prompts as well
Option C: UserPromptSubmit hook guard
- A hook that checks if the "user prompt" content matches patterns of hallucinated turns (e.g., contains its own <system-reminder> tags, starts with Human:)
- Could block suspicious prompts and alert the user
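The pattern check in Option C could look roughly like the sketch below. It assumes the hook input is a dict with the submitted text under a "prompt" key and that a non-zero exit code of 2 signals "block"; both should be verified against the Claude Code hooks documentation. `looks_hallucinated` and `guard` are hypothetical names.

```python
import re

# The two markers named above: a line that begins with "Human:" and an
# embedded <system-reminder> tag. Both patterns are illustrative, not exhaustive.
SUSPICIOUS_PATTERNS = [
    re.compile(r"^\s*Human:", re.MULTILINE),
    re.compile(r"<system-reminder>", re.IGNORECASE),
]


def looks_hallucinated(prompt):
    """True if the prompt text matches any marker of a hallucinated user turn."""
    return any(pattern.search(prompt) for pattern in SUSPICIOUS_PATTERNS)


def guard(hook_input):
    """Return (exit_code, message) for a prompt-submission hook.

    Assumption: the prompt is in hook_input["prompt"] and exit code 2 blocks it.
    """
    prompt = hook_input.get("prompt", "")
    if looks_hallucinated(prompt):
        return 2, "Blocked: prompt resembles a hallucinated user turn."
    return 0, ""
```

A wrapper script would parse the hook's JSON from stdin, call `guard`, print the message to stderr, and exit with the returned code. This guard only helps if the hallucinated text actually passes through the prompt-submission path, which the observed "type":"assistant" transcript entries leave open.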
Option D: Confirmation protocol for irreversible actions
- Before merge/push/delete, require a specific confirmation phrase that's hard to hallucinate (e.g., "CONFIRM-MERGE-{random}")
- Similar to how dangerous operations require typing the repo name on GitHub
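A minimal sketch of the confirmation phrase in Option D, using Python's standard `secrets` and `hmac` modules. The function names and the 8-character token length are illustrative choices, not part of the proposal.

```python
import hmac
import secrets


def new_confirmation_phrase(action):
    """Build a one-time phrase such as CONFIRM-MERGE-<8 hex chars>."""
    token = secrets.token_hex(4).upper()  # 4 random bytes -> 8 hex characters
    return f"CONFIRM-{action.upper()}-{token}"


def is_confirmed(expected, typed):
    """Compare the typed phrase to the expected one, ignoring surrounding whitespace."""
    return hmac.compare_digest(expected.encode(), typed.strip().encode())
```

The phrase is only hard to hallucinate if the model never sees it, so this sketch assumes it is shown to the user through a channel outside the model's context (for example, printed by the hook itself) and checked by the hook rather than by the model.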
Priority
HIGH — This bypasses the S5 policy layer's "never merge without explicit user authorization" rule. The hallucination pattern specifically targets approval prompts, making it a systematic risk for any Agent Teams session with idle teammates.
Related
c2d79f61-c707-4ea2-a940-460771384274