feat(api): tool_use, split cache breakpoints, 1h TTL, pricing fixes #14

Merged
That1Drifter merged 2 commits into master from api-cache-toolify
Apr 10, 2026

@That1Drifter
Owner

Summary

Tightens API usage and prompt caching in the inner-Claude turn loop and the debrief generator.

  • tool_use for structured output: inner-claude.ts now defines an emit_turn_response tool whose input_schema mirrors InnerClaudeTurnResponse, and forces it via tool_choice. The JSON-in-text parse path and the malformed-retry branch are gone; only the truncation retry remains. isInnerClaudeResponse stays as the runtime guard against tool_input drift.
  • Split cache breakpoints — system prompt is now two ephemeral blocks. Block 1 is the static CONTRACT (5m TTL, reused globally across every scenario and session). Block 2 is the per-session scenario context + ticket sample (1h TTL via the extended-cache-ttl-2025-04-11 beta header). Previously the contract was lumped into the scenario block, so each new session re-paid cache-write on the contract too.
  • Pricing table — re-keyed to exact dated prefixes (claude-sonnet-4-5, claude-sonnet-4-6, claude-opus-4-5, claude-opus-4-6, claude-haiku-4-5) so sonnet-4-5 and a future sonnet-4-6 don't collide on a claude-sonnet-4 prefix. startsWith lookup still resolves dated suffixes like claude-haiku-4-5-20251001.
  • Model defaults bumped: MODEL_STAKEHOLDER and DEBRIEF_MODEL now default to claude-sonnet-4-6. Env overrides preserved.
  • Client hoist: new Anthropic({ apiKey }) moved to a lazy module-level singleton in both inner-claude.ts and debrief.ts. The "ANTHROPIC_API_KEY not configured" error still throws at call time, so import-time behavior is unchanged.
  • Debrief caching — debrief system prompt extracted to SYSTEM_PROMPT and marked cache_control: ephemeral.
  • TODO — added followups for the broken pnpm lint (next lint deprecation) and the now-misnamed TurnCallResult.rawText.
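
The tool_use change in the first bullet can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: the fields of InnerClaudeTurnResponse (reply, trustDelta), the helpers buildTurnRequest and extractTurnResponse, and the max_tokens value are assumptions. Only emit_turn_response, input_schema, tool_choice, and isInnerClaudeResponse are named in the summary above.

```typescript
// Hypothetical response shape; the real InnerClaudeTurnResponse is not shown in this PR.
interface InnerClaudeTurnResponse {
  reply: string;
  trustDelta: number;
}

// Tool definition whose input_schema mirrors the response type.
const EMIT_TURN_RESPONSE_TOOL = {
  name: "emit_turn_response",
  description: "Emit the structured response for this turn.",
  input_schema: {
    type: "object" as const,
    properties: {
      reply: { type: "string" },
      trustDelta: { type: "number" },
    },
    required: ["reply", "trustDelta"],
  },
};

// Request body that forces the model to call the tool instead of answering in free text.
function buildTurnRequest(model: string, userMessage: string) {
  return {
    model,
    max_tokens: 1024, // assumed value
    tools: [EMIT_TURN_RESPONSE_TOOL],
    tool_choice: { type: "tool" as const, name: EMIT_TURN_RESPONSE_TOOL.name },
    messages: [{ role: "user" as const, content: userMessage }],
  };
}

// Runtime guard: tool input can still drift from the schema, so check it before use.
function isInnerClaudeResponse(value: unknown): value is InnerClaudeTurnResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.reply === "string" && typeof v.trustDelta === "number";
}

type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Pulls the forced tool call out of the response content and validates it.
function extractTurnResponse(content: ContentBlock[]): InnerClaudeTurnResponse {
  for (const block of content) {
    if (block.type === "tool_use" && block.name === EMIT_TURN_RESPONSE_TOOL.name) {
      const input: unknown = block.input;
      if (isInnerClaudeResponse(input)) return input;
    }
  }
  throw new Error("emit_turn_response tool input missing or malformed");
}
```

With a forced tool_choice there is no JSON-in-text to parse, which is why the malformed-retry branch can go away while the guard stays.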

Test plan

  • pnpm typecheck — clean across 6 packages
  • pnpm test — 17 core + 7 scenarios + 10 rubric tests passing
  • Smoke a real session end-to-end on staging to confirm tool_use + 1h TTL header are accepted by the API and cache hits show in usage.cache_read_input_tokens on turn 2+
  • Confirm the debrief still renders narrative output with the new claude-sonnet-4-6 default

🤖 Generated with Claude Code

That1Drifter and others added 2 commits April 10, 2026 16:52
Inner Claude now uses tool_use to guarantee structured output, removing the
JSON-parse + malformed-retry path. Cache prefix split into a global CONTRACT
breakpoint (5m) and a per-session scenario breakpoint (1h via extended TTL),
so new sessions stop re-paying cache writes on the static contract. Pricing
table re-keyed with exact dated prefixes so sonnet-4-5/4-6 don't collide.
Anthropic client hoisted to a lazy singleton in both inner-claude and
debrief; default stakeholder/debrief models bumped to claude-sonnet-4-6;
debrief system prompt marked ephemeral. TODO updated with the broken
next-lint and rawText followups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The Anthropic API rejects cache_control blocks ordered with a longer TTL
after a shorter one (system.1.cache_control.ttl error). The previous split
put CONTRACT at default 5m before scenario context at 1h. Bumping CONTRACT
to 1h is strictly better anyway — it's static globally, so the longer TTL
is paid once and read constantly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@That1Drifter
Owner Author

Staging smoke results ✅

Deployed fcf7860 to staging and exercised the full turn loop + debrief on support-triage.

Test plan results

  • ✅ tool_use accepted by API across haiku-4-5 and sonnet-4-6 — every call returned valid structured output, retried: false on every turn
  • ✅ 1h cache header accepted — turn 1 wrote 6357-token prefix (cache_write: 6357, cache_read: 0)
  • ✅ Cache hits on turn 2+ — every subsequent turn shows cache_write: 0, cache_read: 6357-6358
  • ✅ Surprise tier auto-routes to claude-sonnet-4-6: the spam_reroute surprise fired with modelUsed: claude-sonnet-4-6, modelTier: stakeholder
  • ✅ Debrief renders narrative on claude-sonnet-4-6 — multi-paragraph critique with per-turn references
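
The cache pattern reported above (one write on turn 1, reads only afterwards) could be checked automatically with something like the sketch below. The function name checkCacheBehavior is hypothetical. The field names follow the Anthropic Messages API usage object, which the app's log appears to shorten to cache_write and cache_read.

```typescript
// Cache-related usage fields as named in the Anthropic Messages API response.
interface CacheUsage {
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

// Returns a list of problems; an empty list means the cache behaved as expected.
function checkCacheBehavior(turns: CacheUsage[]): string[] {
  const problems: string[] = [];
  turns.forEach((usage, index) => {
    const turn = index + 1;
    if (turn === 1) {
      // The first turn should write the prefix into the cache.
      if (usage.cache_creation_input_tokens === 0) {
        problems.push("turn 1 wrote no cache prefix");
      }
      return;
    }
    // Later turns should read the prefix and not pay for another write.
    if (usage.cache_read_input_tokens === 0) {
      problems.push(`turn ${turn} had no cache read`);
    }
    if (usage.cache_creation_input_tokens > 0) {
      problems.push(`turn ${turn} re-wrote the cache prefix`);
    }
  });
  return problems;
}

// Values taken from the smoke results: turn 1 write of 6357, later read of 6358.
const problems = checkCacheBehavior([
  { cache_creation_input_tokens: 6357, cache_read_input_tokens: 0 },
  { cache_creation_input_tokens: 0, cache_read_input_tokens: 6358 },
]);
console.log(problems.length === 0 ? "cache OK" : problems.join("; "));
```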

Bug caught and fixed mid-test (fcf7860)
The first deploy hit 400 system.1.cache_control.ttl: a ttl='1h' cache_control block must not come after a ttl='5m' cache_control block. The Anthropic API requires longer-TTL cache blocks to be ordered before shorter ones. Bumped CONTRACT to 1h too — strictly better since it's static globally.
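
The ordering rule behind that 400 can be illustrated with a small local check. This is a sketch under assumptions: assertTtlOrder and buildSystemBlocks are hypothetical names, the check only looks at system blocks (the API applies its ordering rule across the whole request), and the error text is this sketch's own wording, not the API's.

```typescript
type Ttl = "5m" | "1h";

interface SystemBlock {
  type: "text";
  text: string;
  // ttl falls back to the default 5m when omitted.
  cache_control?: { type: "ephemeral"; ttl?: Ttl };
}

const TTL_SECONDS: Record<Ttl, number> = { "5m": 300, "1h": 3600 };

// Throws if a longer-TTL block follows a shorter-TTL one.
function assertTtlOrder(blocks: SystemBlock[]): void {
  let shortestSoFar = Infinity;
  blocks.forEach((block, index) => {
    if (!block.cache_control) return; // uncached blocks do not affect ordering here
    const ttl: Ttl = block.cache_control.ttl ?? "5m";
    const seconds = TTL_SECONDS[ttl];
    if (seconds > shortestSoFar) {
      throw new Error(`system.${index}: ttl='${ttl}' block follows a shorter-TTL block`);
    }
    shortestSoFar = Math.min(shortestSoFar, seconds);
  });
}

// Post-fix layout: both the static CONTRACT and the scenario context use the 1h TTL.
function buildSystemBlocks(contract: string, scenarioContext: string): SystemBlock[] {
  const blocks: SystemBlock[] = [
    { type: "text", text: contract, cache_control: { type: "ephemeral", ttl: "1h" } },
    { type: "text", text: scenarioContext, cache_control: { type: "ephemeral", ttl: "1h" } },
  ];
  assertTtlOrder(blocks);
  return blocks;
}
```

Running a check like this in a unit test would surface the ordering mistake before a deploy instead of during the staging smoke.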

Sample turn 4 usage (sonnet-4-6, surprise tier)

usage: { input: 820, output: 565, cache_write: 0, cache_read: 6358 }
turnCostUsd: 0.0128
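
For reference, a cost figure like the one above can be derived from a prefix-keyed pricing table. The sketch below is not the PR's table: the per-million-token rates are assumptions for illustration, and lookupPrice and computeTurnCostUsd are hypothetical names. Under these assumed rates the sample turn works out to roughly 0.0128 USD.

```typescript
// USD per million tokens. These rates are assumptions for illustration, not values from the PR.
interface ModelPrice {
  input: number;
  output: number;
  cacheWrite1h: number;
  cacheRead: number;
}

const PRICING: Record<string, ModelPrice> = {
  "claude-sonnet-4-5": { input: 3, output: 15, cacheWrite1h: 6, cacheRead: 0.3 },
  "claude-sonnet-4-6": { input: 3, output: 15, cacheWrite1h: 6, cacheRead: 0.3 },
  "claude-haiku-4-5": { input: 1, output: 5, cacheWrite1h: 2, cacheRead: 0.1 },
};

// startsWith lookup so dated ids such as claude-haiku-4-5-20251001 still resolve.
// Longest prefix is tried first so one key can never shadow a more specific one.
function lookupPrice(model: string): ModelPrice {
  const entry = Object.entries(PRICING)
    .sort(([a], [b]) => b.length - a.length)
    .find(([prefix]) => model.startsWith(prefix));
  if (!entry) throw new Error(`no pricing entry for model ${model}`);
  return entry[1];
}

// Usage shape as it appears in the log line above.
interface TurnUsage {
  input: number;
  output: number;
  cache_write: number;
  cache_read: number;
}

function computeTurnCostUsd(model: string, usage: TurnUsage): number {
  const price = lookupPrice(model);
  const microDollars =
    usage.input * price.input +
    usage.output * price.output +
    usage.cache_write * price.cacheWrite1h +
    usage.cache_read * price.cacheRead;
  return microDollars / 1_000_000;
}

const sampleCost = computeTurnCostUsd("claude-sonnet-4-6", {
  input: 820,
  output: 565,
  cache_write: 0,
  cache_read: 6358,
});
console.log(sampleCost.toFixed(4));
```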

Pre-existing bug spotted in server log (not from this PR — flagging as separate issue)

SyntaxError: Invalid regular expression: /(?i)(deploy|ship|launch|rollout|production)/: Invalid group
  at .next/server/app/api/debrief/route.js:39:10591
  called from app/api/turn/route.js

Looks like a payload_regex rubric rule is using Python's (?i) inline flag, which JS regex doesn't support. Worth a follow-up fix in @fieldwork/rubric to either flip to the i flag or strip (?i) prefixes when compiling rules.
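
One of the two options mentioned above, translating a leading (?i) into the JS i flag at compile time, might look like the following sketch. compileRulePattern is a hypothetical name, and this is not necessarily how @fieldwork/rubric handles it.

```typescript
// Compiles a rubric pattern, turning a leading Python-style (?i) into the JS "i" flag.
function compileRulePattern(pattern: string): RegExp {
  const inlineFlag = "(?i)";
  if (pattern.startsWith(inlineFlag)) {
    return new RegExp(pattern.slice(inlineFlag.length), "i");
  }
  return new RegExp(pattern);
}

// The pattern from the stack trace now compiles and matches case-insensitively.
const deployRule = compileRulePattern("(?i)(deploy|ship|launch|rollout|production)");
console.log(deployRule.test("Ship it to Production")); // true
```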

@That1Drifter
Owner Author

Correction: the regex bug I flagged in the previous comment was already fixed in d791a3d. The two log entries I saw were from restarts before this branch's deploy — the current build (fcf7860, restart 21:57:31) has had zero errors. Disregard that followup.

That1Drifter merged commit 3be7b21 into master on Apr 10, 2026
1 check passed
That1Drifter deleted the api-cache-toolify branch on April 10, 2026 22:03
That1Drifter added a commit that referenced this pull request Apr 10, 2026
A Sonnet subagent with no project context played support-triage on
staging via gstack browse and surfaced two critical bugs the existing
suite never would have caught: an Inner Claude contract guard that
rejects valid tool inputs intermittently (regression from the tool_use
migration in #14) and a page-reload session destruction. Pulling the
staging logs alongside the playthrough also surfaced a separate infra
issue: the box is being OOM-killed under modest load, six SIGKILL
events on 2026-04-10 alone.

- Promote the three critical bugs to the top of the Now section with
  source-line references and proposed fix sketches.
- Add five medium UX items to Polish (debrief structure, trust deltas,
  cost tooltip, objective badges, action log auto-expand).
- Add a tech-debt note documenting the gstack-browse-with-in-URL-creds
  fetch artifact that caused at least one false-positive bug report,
  so future smoke runs use header-based auth instead.
- Record verified false alarms (picker click works in real Chromium,
  debrief button doesn't submit a turn, the (?i) regex was already
  fixed in d791a3d, reset has a confirm guard) so future fresh-eyes
  reviews don't re-investigate them.
- Mark back-nav and lint migration as done (shipped in #15).
- Note that the demo GIF has a draft generated via the new stitching
  script.

Also adds scripts/stitch-demo-gif.py (Pillow-based curated frame
assembler) and gitignores .playwright-mcp/ to keep snapshot artifacts
out of the repo, matching the existing .gstack/ convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>