Merged
1 change: 0 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -36,5 +36,4 @@ apps/web/data/
data/sessions.json

# Local-only
CLAUDE.md
resume-claude.bat
83 changes: 83 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,83 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this is

fieldwork is a self-hosted FDE training simulator built on the Anthropic API. See `README.md` for the product pitch and `docs/architecture.md` for the full engine breakdown — read that file before making non-trivial changes to the turn loop.

## Commands

pnpm workspaces are managed by Turbo. Run all commands from the repo root unless noted otherwise.

```bash
pnpm install # bootstrap workspace
pnpm dev # turbo dev (starts apps/web Next.js dev server)
pnpm build # turbo build across workspace
pnpm test # vitest across packages
pnpm lint # next lint (web) — other packages are no-ops
pnpm typecheck # tsc --noEmit across workspace
pnpm validate-scenarios # schema-check all scenario manifests
pnpm format # prettier --write .
```

Scoped commands:

```bash
pnpm --filter @fieldwork/web dev # web only
pnpm --filter @fieldwork/core test # engine tests only
pnpm --filter @fieldwork/core test -- -t "surprise" # single vitest by name
pnpm --filter @fieldwork/scenarios validate # validate just scenarios
```

CLI (authoring harness, from `apps/cli`). Only `validate` is implemented; the
other commands are registered but are TODO stubs.

```bash
pnpm --filter @fieldwork/cli dev validate packages/scenarios/support-triage/manifest.yaml
pnpm --filter @fieldwork/cli dev init <name> # stub
pnpm --filter @fieldwork/cli dev dryrun <path> # stub
pnpm --filter @fieldwork/cli dev play <path> # stub
pnpm --filter @fieldwork/cli dev cost-estimate <path> # stub
```

Deploy to the staging server: `./scripts/deploy-staging.sh` (requires `FW_HOST` env var; supports `--fast`, `--restart`, `--logs`, `--set-key`, `--set-auth`). See README for the full option list.

## Architecture

Monorepo layout (pnpm + Turbo):

- `apps/web` — Next.js 15 App Router. API routes in `app/api/{session,turn,debrief,health}` are the only surface the UI talks to. `middleware.ts` gates every route except `/api/health` with HTTP Basic Auth when `FIELDWORK_AUTH_PASS` is set (default-off for local dev). `lib/session-store.ts` is a singleton `Map` hydrated from `apps/web/data/sessions.json` via atomic write-and-rename — single-writer only, swap for SQLite if concurrency is ever needed.
- `apps/cli` — commander-based authoring CLI (`validate`, `init`, `dryrun`, `play`, `cost-estimate`). Entry: `src/index.ts`.
- `packages/core` — the engine. Key files: `engine.ts` (turn loop), `state.ts` (`SimState` shape), `contract.ts` (inner-Claude JSON contract), `surprises.ts` (trigger evaluator), `tickets.ts` + `rng.ts` (deterministic mulberry32 ticket generator), `schema/scenario.schema.json` (AJV-validated manifest schema). Vitest lives under `src/__tests__`.
- `packages/rubric` — `score.ts` (per-turn deterministic checks via manifest `rubric` rules; case-insensitive `payload_contains` and `payload_regex`, fails closed on malformed regex) and `debrief.ts` (end-of-scenario Sonnet call).
- `packages/scenarios` — one directory per scenario with `manifest.yaml` + README. All six scenarios are authored and schema-valid: `support-triage`, `doc-qa-rag`, `pipeline-automation`, `workflow-agent`, `legacy-migration`, `incident-response`. Schema tests in `__tests__`.
- `packages/ui` — shared React components used by `apps/web`.
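
The atomic write-and-rename persistence noted for `apps/web` above can be sketched as follows (hypothetical helper name; the real store is `lib/session-store.ts`):

```typescript
import { renameSync, writeFileSync } from "node:fs";

// Write the full session map to a temp file, then rename over the target.
// rename() is atomic within a filesystem, so a reader never observes a
// half-written sessions.json. Single-writer only, as noted above.
function persistSessions(path: string, sessions: Map<string, unknown>): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(Object.fromEntries(sessions)), "utf8");
  renameSync(tmp, path);
}
```

The rename step is what makes crash-safety cheap here; if the process dies mid-write, only the `.tmp` file is corrupt and the last good `sessions.json` survives.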

### The two-Claude model

The outer shell (your code) owns state, scoring, and progression. Inner Claude plays the simulated customer environment and is constrained to a strict JSON output contract (see `docs/inner-claude-contract.md` and `packages/core/src/contract.ts`). **Never delegate rendering or scoring to inner Claude** — parse its JSON, validate it, and merge into `SimState` yourself.

Each turn builds a three-part prompt: (1) cached system content (contract + world bible + ticket sample) marked `cache_control: ephemeral` so the bytes are reused across turns, (2) a dynamic user message (objective statuses, trust map, last 5 actions, current action, fired surprises), (3) model selection — Haiku 4.5 for routine turns, Sonnet when a surprise fires or for the debrief. The first call uses `max_tokens: 4096`; on a JSON parse error or `stop_reason === 'max_tokens'` the engine retries once with 8192 (truncation → drop the bad output, malformed JSON → feed it back for correction).
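
The retry policy can be sketched like this (names and shapes are hypothetical; the real loop lives in `packages/core/src/engine.ts`):

```typescript
// Sketch of the one-retry policy: first attempt at 4096 max_tokens, then a
// single retry at 8192 on truncation or malformed JSON.
type InnerResult =
  | { kind: "ok"; value: unknown }
  | { kind: "truncated" }
  | { kind: "malformed"; raw: string };

async function runTurn(
  call: (maxTokens: number, feedback?: string) => Promise<InnerResult>,
): Promise<unknown | null> {
  const first = await call(4096);
  if (first.kind === "ok") return first.value;
  // Truncated output is dropped; malformed output is fed back for correction.
  const feedback = first.kind === "malformed" ? first.raw : undefined;
  const second = await call(8192, feedback);
  return second.kind === "ok" ? second.value : null; // give up after one retry
}
```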

Model IDs are overridable via `FIELDWORK_MODEL_ENV`, `FIELDWORK_MODEL_STAKEHOLDER`, `FIELDWORK_MODEL_DEBRIEF`.

### State invariants

- `stakeholderTrust` is a `[0, 1]` scalar per stakeholder, initialized to 0.5. Deltas from inner Claude are clamped to ±0.2 per turn and the result clamped to `[0, 1]` before persisting.
- `discoveredObjectives` only grows (objectives flagged `discoverable: true` stay hidden from the trainee until surfaced). An objective that's `NEVER DISCOVERED` at debrief time triggers a specific critique.
- `surprisesFired` is append-only — a surprise id never fires twice.
- `turn_budget` (optional, per-manifest) is enforced at the `/api/turn` route with `403 budgetExhausted: true`.
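
A minimal sketch of the trust-clamp invariant (hypothetical helper name, not the engine's actual API):

```typescript
// Cap the per-turn delta at ±0.2, then clamp the result to [0, 1] before
// persisting, matching the invariant above.
function applyTrustDelta(current: number, delta: number): number {
  const capped = Math.max(-0.2, Math.min(0.2, delta));
  return Math.max(0, Math.min(1, current + capped));
}
```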

### Cost tracking

Every inner-Claude response carries usage data. The engine applies a pricing table (Haiku/Sonnet/Opus × input/output/cache-write/cache-read) to compute per-turn USD and accumulates it on the session. Both per-turn and cumulative are returned in the turn API response.
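
As a hedged sketch of the accounting (the rates below are made-up placeholders, not real Anthropic prices; the engine keeps its own table):

```typescript
// Apply a per-model pricing table to one turn's usage. Rates are illustrative
// placeholders in USD per million tokens; do not treat them as real prices.
type Usage = { input: number; output: number; cacheWrite: number; cacheRead: number };

const RATES_PER_MTOK = { input: 1.0, output: 5.0, cacheWrite: 1.25, cacheRead: 0.1 };

function turnCostUSD(u: Usage): number {
  return (
    (u.input * RATES_PER_MTOK.input +
      u.output * RATES_PER_MTOK.output +
      u.cacheWrite * RATES_PER_MTOK.cacheWrite +
      u.cacheRead * RATES_PER_MTOK.cacheRead) /
    1_000_000
  );
}
```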

## Conventions

- Node 20+, pnpm 9+, ESM (`"type": "module"`) across all packages. Imports between workspace packages use `workspace:*`.
- TypeScript strict, shared base in `tsconfig.base.json`. Run `pnpm typecheck` before shipping.
- Prettier for formatting (config in `.prettierrc.json`). Run `pnpm format` before committing anything touched.
- Comments explain **why**, not **what** — match existing code style. Don't add docstrings that restate the identifier name.
- Scenario IDs must match `^[a-z0-9][a-z0-9-]*$`.
- New scenarios: create `packages/scenarios/<id>/manifest.yaml` + `README.md`, then `pnpm validate-scenarios` before opening a PR. Target <$2 API cost per session and include at least one surprise.
27 changes: 19 additions & 8 deletions README.md
@@ -35,6 +35,9 @@ pnpm dev

Open http://localhost:3000, pick a scenario, and get to work.

Want more detail on setup, prerequisites, and running the CLI/tests? See
[docs/getting-started.md](docs/getting-started.md).

## How it works

Every scenario is a YAML manifest describing a simulated company: its industry,
@@ -141,10 +144,12 @@ API key and your data never leaves your machine.

## Project status

**Phase 1 playable.** The Support Ticket Classifier scenario runs end-to-end
with live inner-Claude turns, stakeholder dialogue, discoverable objectives,
trust tracking, per-turn objective transitions, and a sharpened debrief that
cites specific turns and proposes alternative prompts.
**Phase 1 playable.** All six scenarios in the catalog are authored, schema-valid,
and loadable through the web app. Support Ticket Classifier and Internal Doc Q&A
are the scenarios exercised end-to-end on a live API (turn loop, stakeholder
dialogue, discoverable objectives, trust tracking, per-turn objective
transitions, cost tracking). The other four Tier 2–3 scenarios are authored
but haven't been played on a live API yet.

Working:

@@ -153,16 +158,22 @@ Working:
- JSON file session persistence (`data/sessions.json`, survives server restart)
- Stakeholder trust meter with colored bars in the briefing panel
- Turn budget and live USD cost display
- Deterministic per-turn objective scoring via manifest `rubric` rules
(`action_kind` / `payload_contains` / case-insensitive `payload_regex`)
- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and
`random` triggers
- Collapsible action log viewer
- End-of-scenario debrief with turn-specific, alternative-prompt critiques
- `fieldwork validate` CLI wired to the scenario schema
- 17+ passing unit tests (scenario schema, ticket generator, surprise engine)
- 34 passing unit tests across core, rubric, and scenarios

Not yet built (see [TODO.md](TODO.md)):

- The 5 other scenario manifests (currently README stubs only)
- Per-turn deterministic rubric checks (currently LLM-driven via inner Claude)
- Stakeholder conflict surprises / political pressure mechanics
- Inner Claude response streaming (currently blocks the UI 5–15s per turn)
- SQLite session persistence (to replace the single-writer JSON file store)
- Action log summarization for long runs (prompt bloats past ~20 turns)
- Cross-session history and analytics
- Demo GIF in the README

## Contributing

19 changes: 14 additions & 5 deletions docs/architecture.md
@@ -132,7 +132,8 @@ Trigger types:

- `turn_count` — fires at or after a specific turn number
- `objective_state` — fires when a named objective reaches a target state
- `action_pattern` — fires when the action log matches a regex
- `action_pattern` — fires when the action log matches a regex (case-insensitive,
fails closed on a malformed pattern)
- `random` — fires with a given probability (per turn)

Already-fired ids are tracked in `state.surprisesFired` and never fire twice.
@@ -145,10 +146,18 @@ Unit-tested in `packages/core/src/__tests__/surprises.test.ts`.

Two-tier scoring:

- **Per-turn** — LLM-driven via inner Claude returning
`hidden_state_updates.objective_transitions`. The engine validates and
merges these into `state.objectives`. Deterministic pattern-based checks in
`@fieldwork/rubric/score` are scaffolded but not wired yet (TODO).
- **Per-turn** — two paths run in parallel and the deterministic one wins when
both fire on the same objective:
- Inner Claude proposes objective state transitions via
`hidden_state_updates.objective_transitions`. The engine validates and
merges these into `state.objectives`.
- Manifest-level `rubric` rules on each objective run through
`@fieldwork/rubric/score`. A rule matches on `action_kind`,
`payload_contains` (case-insensitive substring), or `payload_regex`
(case-insensitive regex, fails closed on malformed patterns). The first
matching rule sets the objective state and overrides inner Claude's
judgment for the turn. Rules should only upgrade state
(`open → attempted → met`); the engine does not prevent downgrades.
- **End-of-scenario** — `@fieldwork/rubric/debrief` sends the full action log,
final objective states, stakeholder trust, and discovered/undiscovered
splits to Sonnet. The prompt demands every critique cite a specific turn,
13 changes: 11 additions & 2 deletions docs/getting-started.md
@@ -43,12 +43,21 @@ pnpm test

## What's playable today

- Support Ticket Classifier scenario runs end-to-end
- All six scenarios in the catalog (Support Ticket Classifier, Internal Doc Q&A,
Data Pipeline Automation, Multi-Step Workflow Agent, Legacy System Migration,
Production Incident Response) are authored, schema-valid, and loadable through
the web app. Support Ticket Classifier and Internal Doc Q&A are verified
end-to-end against a live API; the other four are authored but not yet exercised
on a live API.
- Full turn loop through inner Claude with JSON contract validation + retry
- Prompt caching and Haiku/Sonnet model tiering
- Stakeholder dialogue with trust meter (0-1 per stakeholder)
- Discoverable objectives (hidden until the trainee surfaces them)
- Per-turn objective state transitions driven by inner Claude
- Per-turn objective state transitions: inner Claude proposes + deterministic
manifest `rubric` rules (`action_kind` / `payload_contains` / `payload_regex`)
override when they match
- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and
`random` triggers
- Turn budget + cumulative USD cost display
- Collapsible action log viewer
- End-of-scenario debrief with turn-specific, alternative-prompt critiques
35 changes: 35 additions & 0 deletions docs/writing-scenarios.md
@@ -52,6 +52,41 @@ Each entry in `objectives`:
See [`support-triage/manifest.yaml`](../packages/scenarios/support-triage/manifest.yaml)
for a fully worked example covering all fields.

## Objective `rubric` rules

Objectives can carry a `rubric` array of deterministic rules that run every
turn. The first matching rule sets the objective state, overriding inner
Claude's judgment for that turn. Each rule has a `match` block and a `set`
value (`open` | `attempted` | `met` | `failed`).

`match` fields (all specified conditions must match):

- `action_kind` — exact match on `action.kind`
- `payload_contains` — case-insensitive substring match against the
JSON-serialized action payload
- `payload_regex` — regex tested against the JSON-serialized payload. **Write
  plain regex — do NOT prefix `(?i)` or any other inline flag.** The engine
  constructs patterns with the JavaScript `i` flag automatically, and inline
  flags like `(?i)` are Perl-flavored syntax that JavaScript's regex engine
  rejects. A malformed pattern fails closed (the rule is skipped instead of
  fired), but keep patterns plain regardless.

Rules should only upgrade state (`open → attempted → met`); the engine does
not prevent downgrades.

## Surprise triggers

The `surprises` array defines mid-scenario twists. Each surprise has an `id`,
a `trigger`, a `detail` string, and optional `visible`, `channel`, and `from`
fields. Trigger types:

- `turn_count` — `value: N` fires at or after turn N
- `objective_state` — `objective: <id>, state: <state>` fires when the named
objective reaches that state
- `action_pattern` — `pattern: <regex>` fires when the serialized action log
matches. Case-insensitive; no inline flags; fails closed on malformed input.
- `random` — `probability: 0.3` fires stochastically per turn
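
An illustrative surprise entry (values are invented, and the exact nesting of the trigger fields is an assumption based on the list above):

```yaml
surprises:
  - id: cfo-budget-freeze
    trigger:
      type: turn_count
      value: 6
    detail: The CFO freezes all vendor spend pending a cost review.
    visible: true
    channel: email
    from: cfo
```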

## Quality checklist

- [ ] The trainee objective is specific and measurable