fix(core,web): normalize inner Claude tool input instead of rejecting it by That1Drifter · Pull Request #16 · That1Drifter/fieldwork

That1Drifter · 2026-04-10T23:16:40Z

Summary

A fresh-eyes Sonnet playthrough surfaced that ~half of first attempts on staging were 500ing with `inner claude tool call did not match contract` (the throw at `apps/web/lib/inner-claude.ts:401`). This is a regression from the tool_use migration in #14 — the runtime guard `isInnerClaudeResponse` was stricter than the `emit_turn_response` tool's `input_schema`, so when the model omitted an empty optional field (no stakeholder messages that turn), the API accepted the tool call but our runtime guard rejected it.

What changed

Replaces the boolean guard with `normalizeInnerClaudeResponse` that fills sane defaults instead of rejecting:
- `environment_delta` missing or null → `{}`
- `stakeholder_messages` missing or non-array → `[]`
- `surprise_triggered` missing → `null`
- `hidden_state_updates` missing → `undefined`
- `visible_effects` missing or whitespace-only → reject (the only hard requirement, since without it the trainee literally sees nothing this turn)
Fixes the `typeof null === 'object'` trap the old guard fell into. `environment_delta: null` would have passed the old check then crashed at the first `.new_tickets` access. Normalize coerces it to `{}`.
`isInnerClaudeResponse` is kept as a thin wrapper over `normalize` so external callers stay source-compatible.
Both call sites in `inner-claude.ts` (first call + retry) switch to normalize and use the normalized value.

Tests

Adds 9 unit tests in `packages/core/src/tests/contract.test.ts` covering the partial-input cases that broke the playthrough (each missing field, the typeof-null trap, hard visible_effects requirement, non-object inputs, and the wrapper-stays-in-sync invariant).

Test plan

`pnpm test` — 26 core tests pass (17 existing + 9 new)
`pnpm typecheck` — clean across 6 packages
`pnpm lint` — clean
Smoke on staging: drive 5+ turns and confirm zero contract errors in normal play

This is one of the three critical bugs the fresh-eyes playthrough surfaced (TODO Now section). The other two — page-reload session loss, debrief visual structure — will follow in separate PRs.

A fresh-eyes Sonnet playthrough surfaced that ~half of first attempts were 500ing with "inner claude tool call did not match contract" — the throw at apps/web/lib/inner-claude.ts:401. Root cause: the runtime guard isInnerClaudeResponse was stricter than the emit_turn_response tool's input_schema, so when the model omitted an empty optional field (e.g. no stakeholder messages that turn) the API accepted the tool call but the guard rejected it. Replaces the boolean guard with a normalizer that fills sane defaults for everything except visible_effects (the only field whose absence genuinely means there's nothing to render to the trainee). Specifically: - environment_delta missing or null → {} - stakeholder_messages missing or non-array → [] - surprise_triggered missing → null - hidden_state_updates missing → undefined - visible_effects missing or whitespace-only → reject (real failure) Also fixes the typeof-null trap that the old guard fell into: typeof null === 'object', so an environment_delta: null would have passed the old check then crashed at the first .new_tickets access downstream. Normalize coerces it to {}. isInnerClaudeResponse is kept as a thin wrapper over normalize so any external callers stay source-compatible. Both call sites in inner-claude.ts switch to normalize directly and use the normalized value as the response. Adds 9 unit tests covering the partial-input cases that broke the playthrough, plus the typeof-null trap and the hard visible_effects requirement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion Smoke testing the previous commit on staging surfaced two recoverable failure modes the lenient normalizer alone couldn't fix: 1. Haiku occasionally returns an undercomplete tool call — only environment_delta, no stakeholder_messages, no visible_effects, no surprise_triggered. The normalizer correctly rejects (no visible_effects = nothing to render). 2. Haiku also occasionally emits stakeholder_messages as a stringified JSON blob containing the array AND the sibling visible_effects / hidden_state_updates / surprise_triggered fields, leaving the top level missing those keys. Same rejection. Both fail roughly 1 in 5 turns on the cheap model and almost never recur on retry. The existing retry path (only triggered on stop_reason === 'max_tokens') already had everything needed — re-prompt with a corrective hint, fresh tool call. Just had to widen the trigger to also fire on normalize failure. Two retry messages now: one for truncation (existing) and one for malformed tool input (explicit "emit each field as a top-level key, not a stringified blob" instruction). Also adds a permanent console.error log of the rejected input whenever normalize fails. Useful both for the retry decision path and for debugging if a future failure mode evades both layers. Verified on staging: 10 consecutive turns through support-triage, 0 user-facing errors, 2 recovered via retry. Both the undercomplete case and the stringified-blob case were caught and recovered. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

That1Drifter · 2026-04-10T23:26:24Z

Smoke results + follow-up commit

Deployed the original fix and ran 8 turns. 2 of 8 still failed with the contract error, but with the new diagnostic `console.error` we can finally see exactly what the model is returning.

What Haiku is actually doing wrong

Two distinct failure modes, both ~1-in-5 frequency on the cheap model, almost never recurring on retry:

Mode 1 — undercomplete tool call. The model emits ONLY `environment_delta` and skips everything else (no stakeholder_messages, no visible_effects, no surprise_triggered). The lenient normalizer correctly rejects because visible_effects is missing — without it the trainee literally sees nothing.

Mode 2 — stringified blob. The model emits `stakeholder_messages` as a JSON-encoded string containing the array AND the sibling fields:
```
"stakeholder_messages": "[\n {\n \"from\": \"marcus\", ... }\n],\n \"visible_effects\": \"...\",\n \"surprise_triggered\": null\n}"
```
Top-level visible_effects is then missing because the model packed it inside the string. Same rejection.

Fix in `9475d7b`

Widens the existing retry path to fire on normalize failure, not just on `stop_reason === 'max_tokens'`. Adds a corrective hint message specifically for the malformed-tool-input case ("emit each field as a top-level key, NOT packed into a stringified blob"). The truncation hint stays unchanged for the truncation path.

Also keeps the diagnostic `console.error` as a permanent observability hook — if a future failure mode evades both layers, we'll see the offending input in the server log instead of having to add print statements again.

Re-smoke results

10 consecutive turns through support-triage with the new commit:

```
turn 1: ok (first)
turn 2: ok (first)
turn 3: ok (first)
turn 4: ok (first)
turn 5: ok (retried) ← undercomplete tool call recovered
turn 6: ok (first)
turn 7: ok (first)
turn 8: ok (first)
turn 9: ok (retried) ← stringified blob recovered
turn 10: ok (first)

TOTAL: 10 attempts, 0 user-facing errors, 2 recovered via retry
```

The retry rate (~20%) matches the per-call failure rate the playthrough originally surfaced, but it's now invisible to the user — the failures get auto-recovered and only show up as a `retried: true` flag and an extra cost line.

Ready to merge.

Adds an explicit "TOOL CALL SHAPE" section to the inner Claude system prompt calling out the two failure modes seen on staging: 1. Packing multiple fields into a single stringified blob (Haiku sometimes emits stakeholder_messages as a JSON-encoded string containing the array AND visible_effects/surprise_triggered). 2. Omitting visible_effects on no-op turns. The retry path catches both, but a clearer prompt is cheaper than paying for a retry on every fifth turn. Sample size on staging is small (15 turns: 2 retries with the new prompt vs 2/8 before), so this isn't a clean A/B win — but the prompt is in the cached prefix and costs nothing per turn, and it documents the contract more explicitly for future model versions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

That1Drifter · 2026-04-10T23:32:30Z

Re-smoke with the prompt tightening: 15 turns, 0 user-facing errors, 2 recovered via retry (~13%, vs ~20% in the previous run). Sample's too small to claim a clean win on the prompt change alone, but it's cached, zero-cost per turn, and documents the contract more explicitly. Net positive. Ready to merge.

Re-ran the fresh-eyes Sonnet playthrough on the patched build via Playwright (against a local dev server, since the staging basic-auth header trick blocks fetch in headless browsers). Both critical bugs from the original report are fixed: - 0 contract errors out of 10 turns (down from ~50% in the original). One turn showed the silent retry, otherwise clean. - Mid-scenario reload preserved session: turn counter, cost, trust scores, objective state, and inbox all restored exactly. Same ?session=<id> URL. The agent completed 10 turns (vs 3 in the original where contract errors blocked progress at turn 4), saw both surprises fire, and the debrief rendered cleanly. Two new findings worth fixing: - Last-turn work area narrative is empty after reload because lastEffects lives in component state, not the session store. Visible in frame-06-turn3-post-reload.png. Small follow-on to the session URL persistence work in #17. - The `retried` badge added by the contract guard fix (#16) shows up in the turn metadata line with no explanation, which is mildly confusing for users since the retry is meant to be silent. Either drop it from the UI or add a tooltip. Updates the demo GIF entry to point at the v2 frames captured by the playthrough, which include both surprises firing and the reload restore in action. Also updates scripts/stitch-demo-gif.py with auto-discovery: if the hardcoded curated frame list matches fewer than 5 frames in the target dir, fall back to globbing all frame-*.png files in lexical order with uniform timing. Makes the script work with any new playthrough run without requiring a code edit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

That1Drifter and others added 2 commits April 10, 2026 18:16

That1Drifter merged commit cc59ed8 into master Apr 10, 2026
1 check passed

That1Drifter deleted the fix-contract-guard branch April 10, 2026 23:33

That1Drifter mentioned this pull request Apr 10, 2026

fix(web): persist session id in URL so reload restores instead of resets #17

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core,web): normalize inner Claude tool input instead of rejecting it#16

fix(core,web): normalize inner Claude tool input instead of rejecting it#16
That1Drifter merged 3 commits intomasterfrom
fix-contract-guard

That1Drifter commented Apr 10, 2026

Uh oh!

That1Drifter commented Apr 10, 2026

Uh oh!

That1Drifter commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

That1Drifter commented Apr 10, 2026

Summary

What changed

Tests

Test plan

Uh oh!

That1Drifter commented Apr 10, 2026

Smoke results + follow-up commit

What Haiku is actually doing wrong

Fix in 9475d7b

Re-smoke results

Uh oh!

That1Drifter commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix in `9475d7b`