Safeguard: detect and prevent model hallucination of user turns in Agent Teams #204

@michael-wojcik

Problem

During Agent Teams sessions, the model can hallucinate text that mimics user input (prefixed with Human: and containing <system-reminder> tags). Claude Code may then treat this as actual user input, leading to unauthorized actions.

Root cause filed upstream: anthropics/claude-code#27102

Observed Behavior

In session c2d79f61, the model hallucinated two "user approvals" while waiting for actual user input:

  1. "fix them both" — caused unauthorized code changes
  2. "go ahead and merge" — could have caused unauthorized PR merge

Both were assistant-generated: the transcript records them as "type":"assistant", but they were displayed as user messages.

Proposed PACT Safeguard

Investigate whether a PACT hook can detect and mitigate this pattern:

Option A: PostToolUse or Stop hook validation

  • After critical actions (merge, commit, edit), verify the authorizing message was a genuine "type":"user" turn
  • Would require transcript access from hooks
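A minimal sketch of the Option A check, assuming the hook can read the session transcript as JSONL. Only the "type" field is taken from the observed behavior above; the "message"/"content" layout and the function names are hypothetical and would need to be checked against a real transcript.

```python
import json
from typing import Iterable, Optional


def entry_text(entry: dict) -> str:
    """Flatten an entry's message content to plain text (assumed schema)."""
    message = entry.get("message")
    if not isinstance(message, dict):
        return ""
    content = message.get("content")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Assumed: a list of blocks, some of which carry a "text" field.
        return " ".join(
            str(block.get("text", "")) for block in content if isinstance(block, dict)
        )
    return ""


def authorizing_turn_type(jsonl_lines: Iterable[str], phrase: str) -> Optional[str]:
    """Return the "type" of the most recent entry whose text contains phrase."""
    found = None
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than fail the whole check
        if isinstance(entry, dict) and phrase.lower() in entry_text(entry).lower():
            found = entry.get("type")
    return found


def is_genuinely_authorized(jsonl_lines: Iterable[str], phrase: str) -> bool:
    """True only if the latest turn containing phrase is a "type":"user" turn."""
    return authorizing_turn_type(jsonl_lines, phrase) == "user"
```

A hook would call `is_genuinely_authorized(open(path), "go ahead and merge")` before allowing the critical action, where `path` is whatever transcript location the hook is given.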

Option B: Prompt-based mitigation

  • Add explicit instructions to orchestrator CLAUDE.md: "NEVER generate text starting with 'Human:' or mimicking user turn format"
  • Add to agent dispatch prompts as well

Option C: UserPromptSubmit hook guard

  • A hook that checks if the "user prompt" content matches patterns of hallucinated turns (e.g., contains its own <system-reminder> tags, starts with Human:)
  • Could block suspicious prompts and alert the user
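The pattern check in Option C could look roughly like the sketch below. It assumes the hook receives a JSON payload whose "prompt" field holds the submitted text and that exit code 2 blocks the prompt; both should be confirmed against the Claude Code hooks documentation. The function names are hypothetical.

```python
import json
import re

# The two patterns named in Option C.
HUMAN_PREFIX = re.compile(r"\s*Human:")
SYSTEM_REMINDER = re.compile(r"</?system-reminder>")


def suspicious_reasons(prompt: str) -> list[str]:
    """List the reasons a prompt looks like a hallucinated user turn."""
    reasons = []
    if HUMAN_PREFIX.match(prompt):
        reasons.append("starts with 'Human:'")
    if SYSTEM_REMINDER.search(prompt):
        reasons.append("contains <system-reminder> tags")
    return reasons


def check_payload(raw_json: str) -> int:
    """Return a hook exit code: 0 to allow, 2 to block (assumed convention)."""
    payload = json.loads(raw_json)
    prompt = str(payload.get("prompt", ""))  # "prompt" field name is an assumption
    return 2 if suspicious_reasons(prompt) else 0
```

A wrapper script would pass `sys.stdin.read()` to `check_payload`, write the reasons to stderr to alert the user, and exit with the returned code.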

Option D: Confirmation protocol for irreversible actions

  • Before merge/push/delete, require a specific confirmation phrase that's hard to hallucinate (e.g., "CONFIRM-MERGE-{random}")
  • Similar to how dangerous operations require typing the repo name on GitHub
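One way to sketch the Option D phrase, assuming the random part is generated outside the model and shown to the user through a channel the model cannot read (otherwise the model could simply echo it). The helper names are hypothetical.

```python
import hmac
import secrets


def new_confirmation_phrase(action: str, nbytes: int = 4) -> str:
    """Build a one-time phrase such as CONFIRM-MERGE-<random hex>."""
    token = secrets.token_hex(nbytes).upper()
    return f"CONFIRM-{action.upper()}-{token}"


def is_confirmed(expected: str, user_input: str) -> bool:
    """Compare the typed phrase to the expected one in constant time."""
    return hmac.compare_digest(
        expected.encode("utf-8"), user_input.strip().encode("utf-8")
    )
```

The guard would generate a fresh phrase per irreversible action and discard it after one use, so a previously seen phrase cannot authorize a later merge.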

Priority

HIGH — This bypasses the S5 policy layer's "never merge without explicit user authorization" rule. The hallucination pattern specifically targets approval prompts, making it a systematic risk for any Agent Teams session with idle teammates.
