fix(core,web): normalize inner Claude tool input instead of rejecting it#16
fix(core,web): normalize inner Claude tool input instead of rejecting it#16That1Drifter merged 3 commits intomasterfrom
Conversation
A fresh-eyes Sonnet playthrough surfaced that ~half of first attempts
were 500ing with "inner claude tool call did not match contract" — the
throw at apps/web/lib/inner-claude.ts:401. Root cause: the runtime
guard isInnerClaudeResponse was stricter than the emit_turn_response
tool's input_schema, so when the model omitted an empty optional
field (e.g. no stakeholder messages that turn) the API accepted the
tool call but the guard rejected it.
Replaces the boolean guard with a normalizer that fills sane defaults
for everything except visible_effects (the only field whose absence
genuinely means there's nothing to render to the trainee). Specifically:
- environment_delta missing or null → {}
- stakeholder_messages missing or non-array → []
- surprise_triggered missing → null
- hidden_state_updates missing → undefined
- visible_effects missing or whitespace-only → reject (real failure)
Also fixes the typeof-null trap that the old guard fell into:
typeof null === 'object', so an environment_delta: null would have
passed the old check then crashed at the first .new_tickets access
downstream. Normalize coerces it to {}.
isInnerClaudeResponse is kept as a thin wrapper over normalize so any
external callers stay source-compatible. Both call sites in
inner-claude.ts switch to normalize directly and use the normalized
value as the response.
Adds 9 unit tests covering the partial-input cases that broke the
playthrough, plus the typeof-null trap and the hard visible_effects
requirement.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion Smoke testing the previous commit on staging surfaced two recoverable failure modes the lenient normalizer alone couldn't fix: 1. Haiku occasionally returns an undercomplete tool call — only environment_delta, no stakeholder_messages, no visible_effects, no surprise_triggered. The normalizer correctly rejects (no visible_effects = nothing to render). 2. Haiku also occasionally emits stakeholder_messages as a stringified JSON blob containing the array AND the sibling visible_effects / hidden_state_updates / surprise_triggered fields, leaving the top level missing those keys. Same rejection. Both fail roughly 1 in 5 turns on the cheap model and almost never recur on retry. The existing retry path (only triggered on stop_reason === 'max_tokens') already had everything needed — re-prompt with a corrective hint, fresh tool call. Just had to widen the trigger to also fire on normalize failure. Two retry messages now: one for truncation (existing) and one for malformed tool input (explicit "emit each field as a top-level key, not a stringified blob" instruction). Also adds a permanent console.error log of the rejected input whenever normalize fails. Useful both for the retry decision path and for debugging if a future failure mode evades both layers. Verified on staging: 10 consecutive turns through support-triage, 0 user-facing errors, 2 recovered via retry. Both the undercomplete case and the stringified-blob case were caught and recovered. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Smoke results + follow-up commitDeployed the original fix and ran 8 turns. 2 of 8 still failed with the contract error, but with the new diagnostic `console.error` we can finally see exactly what the model is returning. What Haiku is actually doing wrongTwo distinct failure modes, both ~1-in-5 frequency on the cheap model, almost never recurring on retry: Mode 1 — undercomplete tool call. The model emits ONLY `environment_delta` and skips everything else (no stakeholder_messages, no visible_effects, no surprise_triggered). The lenient normalizer correctly rejects because visible_effects is missing — without it the trainee literally sees nothing. Mode 2 — stringified blob. The model emits `stakeholder_messages` as a JSON-encoded string containing the array AND the sibling fields: Fix in 9475d7bWidens the existing retry path to fire on normalize failure, not just on `stop_reason === 'max_tokens'`. Adds a corrective hint message specifically for the malformed-tool-input case ("emit each field as a top-level key, NOT packed into a stringified blob"). The truncation hint stays unchanged for the truncation path. Also keeps the diagnostic `console.error` as a permanent observability hook — if a future failure mode evades both layers, we'll see the offending input in the server log instead of having to add print statements again. Re-smoke results10 consecutive turns through support-triage with the new commit: ``` TOTAL: 10 attempts, 0 user-facing errors, 2 recovered via retry The retry rate (~20%) matches the per-call failure rate the playthrough originally surfaced, but it's now invisible to the user — the failures get auto-recovered and only show up as a `retried: true` flag and an extra cost line. Ready to merge. |
Adds an explicit "TOOL CALL SHAPE" section to the inner Claude system prompt calling out the two failure modes seen on staging: 1. Packing multiple fields into a single stringified blob (Haiku sometimes emits stakeholder_messages as a JSON-encoded string containing the array AND visible_effects/surprise_triggered). 2. Omitting visible_effects on no-op turns. The retry path catches both, but a clearer prompt is cheaper than paying for a retry on every fifth turn. Sample size on staging is small (15 turns: 2 retries with the new prompt vs 2/8 before), so this isn't a clean A/B win — but the prompt is in the cached prefix and costs nothing per turn, and it documents the contract more explicitly for future model versions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Re-smoke with the prompt tightening: 15 turns, 0 user-facing errors, 2 recovered via retry (~13%, vs ~20% in the previous run). Sample's too small to claim a clean win on the prompt change alone, but it's cached, zero-cost per turn, and documents the contract more explicitly. Net positive. Ready to merge. |
Re-ran the fresh-eyes Sonnet playthrough on the patched build via Playwright (against a local dev server, since the staging basic-auth header trick blocks fetch in headless browsers). Both critical bugs from the original report are fixed: - 0 contract errors out of 10 turns (down from ~50% in the original). One turn showed the silent retry, otherwise clean. - Mid-scenario reload preserved session: turn counter, cost, trust scores, objective state, and inbox all restored exactly. Same ?session=<id> URL. The agent completed 10 turns (vs 3 in the original where contract errors blocked progress at turn 4), saw both surprises fire, and the debrief rendered cleanly. Two new findings worth fixing: - Last-turn work area narrative is empty after reload because lastEffects lives in component state, not the session store. Visible in frame-06-turn3-post-reload.png. Small follow-on to the session URL persistence work in #17. - The `retried` badge added by the contract guard fix (#16) shows up in the turn metadata line with no explanation, which is mildly confusing for users since the retry is meant to be silent. Either drop it from the UI or add a tooltip. Updates the demo GIF entry to point at the v2 frames captured by the playthrough, which include both surprises firing and the reload restore in action. Also updates scripts/stitch-demo-gif.py with auto-discovery: if the hardcoded curated frame list matches fewer than 5 frames in the target dir, fall back to globbing all frame-*.png files in lexical order with uniform timing. Makes the script work with any new playthrough run without requiring a code edit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
A fresh-eyes Sonnet playthrough surfaced that ~half of first attempts on staging were 500ing with `inner claude tool call did not match contract` (the throw at `apps/web/lib/inner-claude.ts:401`). This is a regression from the tool_use migration in #14 — the runtime guard `isInnerClaudeResponse` was stricter than the `emit_turn_response` tool's `input_schema`, so when the model omitted an empty optional field (no stakeholder messages that turn), the API accepted the tool call but our runtime guard rejected it.
What changed
Tests
Adds 9 unit tests in `packages/core/src/tests/contract.test.ts` covering the partial-input cases that broke the playthrough (each missing field, the typeof-null trap, hard visible_effects requirement, non-object inputs, and the wrapper-stays-in-sync invariant).
Test plan
This is one of the three critical bugs the fresh-eyes playthrough surfaced (TODO Now section). The other two — page-reload session loss, debrief visual structure — will follow in separate PRs.