feat(harness): detect redundant tool-call loops (3 consecutive identical outputs) #63
…cal outputs)

Production trace 2026-04-26: Slack DM "explore the codebase structure deeply" hit max_iterations_reached after sage's harness called workspace_list with the same path 7 times in a row, getting back byte-identical 12,604-char responses every time. The harness had no way to detect "same tool + same output again, make a decision", so it burned the budget and produced no useful reply.

This adds a 5-entry ring buffer of (toolName, outputHash) per turn. When the last 3 entries match on BOTH toolName and outputHash, the harness exits with outcome=failed, stopReason=redundant_tool_loop. The outcome is failed (not deferred) because more budget will not help: the model is dead-stuck on the same tool call and needs a different prompt or tool surface to make progress.

User-facing copy added to stopReasonToUserMessage so consumers (sage, specialist surfaces) automatically pick up a useful error message instead of the generic "could not complete" fallback.

Self-reviewed and peer-reviewed. Tests + typecheck green (163/163).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
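The ring-buffer check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `RedundantLoopDetector`, `ToolCallRecord`, and the two constants are hypothetical names, and the `djb2Hash` body is the textbook djb2 variant, assumed rather than copied from the PR.

```typescript
// Hypothetical sketch: names and structure are illustrative assumptions.

/** Textbook djb2 string hash (hash * 33 + charCode), as an unsigned 32-bit integer. */
function djb2Hash(text: string): number {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    // (hash << 5) + hash equals hash * 33; "| 0" truncates to 32 bits.
    hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0;
  }
  return hash >>> 0;
}

interface ToolCallRecord {
  toolName: string;
  outputHash: number;
}

const RING_CAPACITY = 5; // entries kept per turn
const LOOP_THRESHOLD = 3; // consecutive identical entries that trip the detector

class RedundantLoopDetector {
  private readonly entries: ToolCallRecord[] = [];

  /** Records one tool result; returns true when the last 3 entries all match. */
  record(toolName: string, output: string): boolean {
    const latest: ToolCallRecord = { toolName, outputHash: djb2Hash(output) };
    this.entries.push(latest);
    if (this.entries.length > RING_CAPACITY) {
      this.entries.shift(); // drop the oldest entry
    }
    if (this.entries.length < LOOP_THRESHOLD) {
      return false;
    }
    // Both toolName and outputHash must match across the last 3 entries.
    return this.entries
      .slice(-LOOP_THRESHOLD)
      .every((e) => e.toolName === latest.toolName && e.outputHash === latest.outputHash);
  }
}
```

Under these assumptions, replaying the production trace (the same `workspace_list` output over and over) would return `true` on the third call, which is where the harness would stop with `redundant_tool_loop` instead of continuing to the seventh.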
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b44038280
    }
  } else {
    await emit(config, input, state, { type: 'tool_finished', result });
    const hash = djb2Hash(result.output ?? JSON.stringify(result.structuredOutput ?? {}));
Skip loop hashing when tool result has no payload
This hashes JSON.stringify({}) whenever both result.output and result.structuredOutput are absent, and HarnessToolResult allows that shape. As a result, any three successful calls from the same tool that intentionally return no payload (common for side-effect tools) are treated as identical and the turn is terminated with redundant_tool_loop, even if each call had different inputs and made progress. Only run redundant-loop comparison when there is an actual deterministic payload, or include call-specific data in the signature.
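The second option in this comment, including call-specific data in the signature, might look something like the sketch below. Everything here is an assumption made for illustration: `SketchToolCall`, `SketchToolResult`, `callSignature`, and the `workspace_write` tool name are hypothetical, and the real `HarnessToolResult` may have a different shape.

```typescript
// Hypothetical shapes for illustration only.
interface SketchToolCall {
  toolName: string;
  input: Record<string, unknown>;
}

interface SketchToolResult {
  output?: string;
  structuredOutput?: Record<string, unknown>;
}

/** Builds a comparison key from the call AND its result, not the result alone. */
function callSignature(call: SketchToolCall, result: SketchToolResult): string {
  const payload = result.output ?? JSON.stringify(result.structuredOutput ?? {});
  // Caveat: JSON.stringify is key-order sensitive and can throw on
  // non-JSON-safe inputs, so a real implementation would need a guard.
  return JSON.stringify([call.toolName, call.input, payload]);
}

// Three payload-less writes with different inputs: every signature differs.
const writeSignatures = ["a.txt", "b.txt", "c.txt"].map((path) =>
  callSignature({ toolName: "workspace_write", input: { path } }, {}),
);

// The same listing call repeated three times: the signatures are identical.
const listSignatures = [1, 2, 3].map(() =>
  callSignature({ toolName: "workspace_list", input: { path: "src" } }, { output: "same listing" }),
);
```

With a signature like this, side-effect tools that return nothing would only match when their inputs also repeat, while a genuinely stuck loop (same tool, same input, same output) would still be caught.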
Guard structuredOutput hashing against stringify failures
When result.output is undefined, the detector hashes JSON.stringify(result.structuredOutput ?? {}), but structuredOutput is typed as Record<string, unknown> and may contain non-JSON-safe values (for example BigInt or circular references). In that case JSON.stringify throws, the outer catch converts the turn to runtime_error, and a successful tool execution can fail the whole run. The loop detector should use a safe serialization fallback so hashing cannot throw.
…stringify

Two codex review issues on PR #63:

P1: false positive on side-effect tools. Pre-fix, the detector hashed `JSON.stringify({})` whenever both `result.output` and `result.structuredOutput` were absent. Three successful calls from a side-effect tool (writes/notifications/etc) all hashed identically and falsely tripped redundant_tool_loop after the third call, even though each call had different inputs and made real progress. Fix: skip the detector entirely when the result has no payload to compare.

P2: JSON.stringify can throw on non-serializable values. structuredOutput is typed as Record<string, unknown> and may carry BigInt or circular refs. A throw inside the detector bubbled to the outer harness catch and converted a successful tool execution into a runtime_error that failed the whole turn. Fix: wrap stringify in try/catch and treat serialization failures as "no comparable signature", so the detector skips and the turn continues.

Both behaviors live in a new `computeToolResultSignature()` helper that returns null when there's no comparable payload (or the payload can't be serialized) and a number otherwise. The redundant-loop check is gated on the helper returning a non-null signature.

Two new regression tests covering each case. 165/165 harness suite passes; typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
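A helper with the null-or-number contract described in this commit could be sketched as below. Only the name `computeToolResultSignature` and its return contract come from the commit message; the `SketchResult` type, the `hashPayload` function (a textbook djb2), and the choice to hash an empty-string `output` rather than skip it are assumptions.

```typescript
// Hypothetical sketch; the real helper's internals are not shown in the PR text.
interface SketchResult {
  output?: string;
  structuredOutput?: Record<string, unknown>;
}

/** Textbook djb2 hash as an unsigned 32-bit integer (assumed, not taken from the PR). */
function hashPayload(text: string): number {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0;
  }
  return hash >>> 0;
}

/** Returns a numeric signature, or null when there is nothing comparable. */
function computeToolResultSignature(result: SketchResult): number | null {
  if (result.output !== undefined) {
    // Assumption: an empty string still counts as a payload here.
    return hashPayload(result.output);
  }
  if (result.structuredOutput === undefined) {
    return null; // P1: no payload, so the detector should skip this result
  }
  try {
    return hashPayload(JSON.stringify(result.structuredOutput));
  } catch {
    return null; // P2: BigInt, circular refs, or other non-serializable values
  }
}
```

The caller would then record a ring-buffer entry and run the redundant-loop comparison only when the returned signature is not `null`, so payload-less or unserializable results never count toward the 3-in-a-row threshold.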
Summary
Adds redundant-tool-loop detection to the harness. After every successful tool call, hashes the output and checks the last 3 entries in a 5-slot ring buffer. If 3 consecutive entries are from the same toolName and have the same outputHash, the harness exits with `outcome: failed`, `stopReason: redundant_tool_loop`.

Production failure this catches

Slack DM "explore the codebase structure deeply" → sage harness called `workspace_list` with the same path 7 times in a row (each returning byte-identical 12,604-char output) before hitting max_iterations_reached. With this PR, the harness exits at the 3rd identical response with a clear stopReason instead of burning the iteration budget.

Validation
Downstream
After merge + republish, sage bumps `@agent-assistant/harness` (workflow F:bump-sage-after-harness-release.ts) and the next sage release picks up the new stopReason copy automatically.

🤖 Generated with Claude Code