diff --git a/.gitignore b/.gitignore index 3fe4c3c..8d1420b 100644 --- a/.gitignore +++ b/.gitignore @@ -36,5 +36,4 @@ apps/web/data/ data/sessions.json # Local-only -CLAUDE.md resume-claude.bat diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..2179b92 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,83 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## What this is + +fieldwork is a self-hosted FDE training simulator built on the Anthropic API. See `README.md` for the product pitch and `docs/architecture.md` for the full engine breakdown — read that file before making non-trivial changes to the turn loop. + +## Commands + +pnpm workspaces managed by Turbo. Run from repo root unless noted. + +```bash +pnpm install # bootstrap workspace +pnpm dev # turbo dev (starts apps/web Next.js dev server) +pnpm build # turbo build across workspace +pnpm test # vitest across packages +pnpm lint # next lint (web) — other packages are no-ops +pnpm typecheck # tsc --noEmit across workspace +pnpm validate-scenarios # schema-check all scenario manifests +pnpm format # prettier --write . +``` + +Scoped commands: + +```bash +pnpm --filter @fieldwork/web dev # web only +pnpm --filter @fieldwork/core test # engine tests only +pnpm --filter @fieldwork/core test -- -t "surprise" # single vitest by name +pnpm --filter @fieldwork/scenarios validate # validate just scenarios +``` + +CLI (authoring harness, from `apps/cli`). Only `validate` is implemented; the +other commands are registered but are TODO stubs. + +```bash +pnpm --filter @fieldwork/cli dev validate packages/scenarios/support-triage/manifest.yaml +pnpm --filter @fieldwork/cli dev init # stub +pnpm --filter @fieldwork/cli dev dryrun # stub +pnpm --filter @fieldwork/cli dev play # stub +pnpm --filter @fieldwork/cli dev cost-estimate # stub +``` + +Deploy to the staging server: `./scripts/deploy-staging.sh` (requires `FW_HOST` env var; supports `--fast`, `--restart`, `--logs`, `--set-key`, `--set-auth`). See README for the full option list. + +## Architecture + +Monorepo layout (pnpm + Turbo): + +- `apps/web` — Next.js 15 App Router. API routes in `app/api/{session,turn,debrief,health}` are the only surface the UI talks to. `middleware.ts` gates every route except `/api/health` with HTTP Basic Auth when `FIELDWORK_AUTH_PASS` is set (default-off for local dev). `lib/session-store.ts` is a singleton `Map` hydrated from `apps/web/data/sessions.json` via atomic write-and-rename — single-writer only, swap for SQLite if concurrency is ever needed. +- `apps/cli` — commander-based authoring CLI (`validate`, `init`, `dryrun`, `play`, `cost-estimate`). Entry: `src/index.ts`. +- `packages/core` — the engine. Key files: `engine.ts` (turn loop), `state.ts` (`SimState` shape), `contract.ts` (inner-Claude JSON contract), `surprises.ts` (trigger evaluator), `tickets.ts` + `rng.ts` (deterministic mulberry32 ticket generator), `schema/scenario.schema.json` (AJV-validated manifest schema). Vitest lives under `src/__tests__`. +- `packages/rubric` — `score.ts` (per-turn deterministic checks via manifest `rubric` rules; case-insensitive `payload_contains` and `payload_regex`, fails closed on malformed regex) and `debrief.ts` (end-of-scenario Sonnet call). +- `packages/scenarios` — one directory per scenario with `manifest.yaml` + README. All six scenarios are authored and schema-valid: `support-triage`, `doc-qa-rag`, `pipeline-automation`, `workflow-agent`, `legacy-migration`, `incident-response`. Schema tests in `__tests__`. +- `packages/ui` — shared React components used by `apps/web`. + +### The two-Claude model + +The outer shell (your code) owns state, scoring, and progression. Inner Claude plays the simulated customer environment and is constrained to a strict JSON output contract (see `docs/inner-claude-contract.md` and `packages/core/src/contract.ts`). **Never delegate rendering or scoring to inner Claude** — parse its JSON, validate it, and merge into `SimState` yourself. + +Each turn builds a three-part prompt: (1) cached system content (contract + world bible + ticket sample) marked `cache_control: ephemeral` so bytes reuse across turns, (2) dynamic user message (objective statuses, trust map, last 5 actions, current action, fired surprises), (3) model selection — Haiku 4.5 routine, Sonnet when a surprise fires or for debrief. First call uses `max_tokens: 4096`; on JSON parse error or `stop_reason === 'max_tokens'` the engine retries once with 8192 (truncation → omit bad output, malformed → feed it back for correction). + +Model IDs are overridable via `FIELDWORK_MODEL_ENV`, `FIELDWORK_MODEL_STAKEHOLDER`, `FIELDWORK_MODEL_DEBRIEF`. + +### State invariants + +- `stakeholderTrust` is a `[0, 1]` scalar per stakeholder, initialized to 0.5. Deltas from inner Claude are clamped to ±0.2 per turn and the result clamped to `[0, 1]` before persisting. +- `discoveredObjectives` only grows (objectives flagged `discoverable: true` stay hidden from the trainee until surfaced). An objective that's `NEVER DISCOVERED` at debrief time triggers a specific critique. +- `surprisesFired` is append-only — a surprise id never fires twice. +- `turn_budget` (optional, per-manifest) is enforced at the `/api/turn` route with `403 budgetExhausted: true`. + +### Cost tracking + +Every inner-Claude response carries usage data. The engine applies a pricing table (Haiku/Sonnet/Opus × input/output/cache-write/cache-read) to compute per-turn USD and accumulates it on the session. Both per-turn and cumulative are returned in the turn API response. + +## Conventions + +- Node 20+, pnpm 9+, ESM (`"type": "module"`) across all packages. Imports between workspace packages use `workspace:*`. +- TypeScript strict, shared base in `tsconfig.base.json`. Run `pnpm typecheck` before shipping. +- Prettier for formatting (config in `.prettierrc.json`). Run `pnpm format` before committing anything touched. +- Comments explain **why**, not **what** — match existing code style. Don't add docstrings that restate the identifier name. +- Scenario IDs must match `^[a-z0-9][a-z0-9-]*$`. +- New scenarios: create `packages/scenarios//manifest.yaml` + `README.md`, then `pnpm validate-scenarios` before opening a PR. Target <$2 API cost per session and include at least one surprise. diff --git a/README.md b/README.md index cfcaca0..b732ba3 100644 --- a/README.md +++ b/README.md @@ -35,6 +35,9 @@ pnpm dev Open http://localhost:3000, pick a scenario, and get to work. +Want more detail on setup, prerequisites, and running the CLI/tests? See +[docs/getting-started.md](docs/getting-started.md). + ## How it works Every scenario is a YAML manifest describing a simulated company: its industry, @@ -141,10 +144,12 @@ API key and your data never leaves your machine. ## Project status -**Phase 1 playable.** The Support Ticket Classifier scenario runs end-to-end -with live inner-Claude turns, stakeholder dialogue, discoverable objectives, -trust tracking, per-turn objective transitions, and a sharpened debrief that -cites specific turns and proposes alternative prompts. +**Phase 1 playable.** All six scenarios in the catalog are authored, schema-valid, +and loadable through the web app. Support Ticket Classifier and Internal Doc Q&A +are the scenarios exercised end-to-end on a live API (turn loop, stakeholder +dialogue, discoverable objectives, trust tracking, per-turn objective +transitions, cost tracking). The other four Tier 2–3 scenarios are authored +but haven't been played on live API yet. Working: @@ -153,16 +158,22 @@ Working: - JSON file session persistence (`data/sessions.json`, survives server restart) - Stakeholder trust meter with colored bars in the briefing panel - Turn budget and live USD cost display +- Deterministic per-turn objective scoring via manifest `rubric` rules + (`action_kind` / `payload_contains` / case-insensitive `payload_regex`) +- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and + `random` triggers - Collapsible action log viewer +- End-of-scenario debrief with turn-specific, alternative-prompt critiques - `fieldwork validate` CLI wired to the scenario schema -- 17+ passing unit tests (scenario schema, ticket generator, surprise engine) +- 34 passing unit tests across core, rubric, and scenarios Not yet built (see [TODO.md](TODO.md)): -- The 5 other scenario manifests (currently README stubs only) -- Per-turn deterministic rubric checks (currently LLM-driven via inner Claude) -- Stakeholder conflict surprises / political pressure mechanics +- Inner Claude response streaming (currently blocks the UI 5–15s per turn) +- SQLite session persistence (to replace the single-writer JSON file store) +- Action log summarization for long runs (prompt bloats past ~20 turns) - Cross-session history and analytics +- Demo GIF in the README ## Contributing diff --git a/docs/architecture.md b/docs/architecture.md index 770f499..d2a74ea 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -132,7 +132,8 @@ Trigger types: - `turn_count` — fires at or after a specific turn number - `objective_state` — fires when a named objective reaches a target state -- `action_pattern` — fires when the action log matches a regex +- `action_pattern` — fires when the action log matches a regex (case-insensitive, + fails closed on a malformed pattern) - `random` — fires with a given probability (per turn) Already-fired ids are tracked in `state.surprisesFired` and never fire twice. @@ -145,10 +146,18 @@ Unit-tested in `packages/core/src/__tests__/surprises.test.ts`. Two-tier scoring: -- **Per-turn** — LLM-driven via inner Claude returning - `hidden_state_updates.objective_transitions`. The engine validates and - merges these into `state.objectives`. Deterministic pattern-based checks in - `@fieldwork/rubric/score` are scaffolded but not wired yet (TODO). +- **Per-turn** — two paths run in parallel and the deterministic one wins when + both fire on the same objective: + - Inner Claude proposes objective state transitions via + `hidden_state_updates.objective_transitions`. The engine validates and + merges these into `state.objectives`. + - Manifest-level `rubric` rules on each objective run through + `@fieldwork/rubric/score`. A rule matches on `action_kind`, + `payload_contains` (case-insensitive substring), or `payload_regex` + (case-insensitive regex, fails closed on malformed patterns). The first + matching rule sets the objective state and overrides inner Claude's + judgment for the turn. Rules should only upgrade state + (`open → attempted → met`); the engine does not prevent downgrades. - **End-of-scenario** — `@fieldwork/rubric/debrief` sends the full action log, final objective states, stakeholder trust, and discovered/undiscovered splits to Sonnet. The prompt demands every critique cite a specific turn, diff --git a/docs/getting-started.md b/docs/getting-started.md index d51e587..62354f2 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -43,12 +43,21 @@ pnpm test ## What's playable today -- Support Ticket Classifier scenario runs end-to-end +- All six scenarios in the catalog (Support Ticket Classifier, Internal Doc Q&A, + Data Pipeline Automation, Multi-Step Workflow Agent, Legacy System Migration, + Production Incident Response) are authored, schema-valid, and loadable through + the web app. Support Ticket Classifier and Internal Doc Q&A are verified + end-to-end against a live API; the other four are authored but not yet exercised + on live API. - Full turn loop through inner Claude with JSON contract validation + retry - Prompt caching and Haiku/Sonnet model tiering - Stakeholder dialogue with trust meter (0-1 per stakeholder) - Discoverable objectives (hidden until the trainee surfaces them) -- Per-turn objective state transitions driven by inner Claude +- Per-turn objective state transitions: inner Claude proposes + deterministic + manifest `rubric` rules (`action_kind` / `payload_contains` / `payload_regex`) + override when they match +- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and + `random` triggers - Turn budget + cumulative USD cost display - Collapsible action log viewer - End-of-scenario debrief with turn-specific, alternative-prompt critiques diff --git a/docs/writing-scenarios.md b/docs/writing-scenarios.md index de72d8d..748c075 100644 --- a/docs/writing-scenarios.md +++ b/docs/writing-scenarios.md @@ -52,6 +52,41 @@ Each entry in `objectives`: See [`support-triage/manifest.yaml`](../packages/scenarios/support-triage/manifest.yaml) for a fully worked example covering all fields. +## Objective `rubric` rules + +Objectives can carry a `rubric` array of deterministic rules that run every +turn. The first matching rule sets the objective state, overriding inner +Claude's judgment for that turn. Each rule has a `match` block and a `set` +value (`open` | `attempted` | `met` | `failed`). + +`match` fields (all specified conditions must match): + +- `action_kind` — exact match on `action.kind` +- `payload_contains` — case-insensitive substring match against the + JSON-serialized action payload +- `payload_regex` — regex tested against the JSON-serialized payload. **Written + as plain regex — do NOT prefix with `(?i)` or any inline flag.** The engine + constructs patterns with the JavaScript `i` flag automatically, and inline + flags like `(?i)` are Perl-flavor and will crash the turn handler if the + engine didn't also fail closed. A malformed pattern now fails closed (the + rule skips instead of firing), but still — keep patterns plain. + +Rules should only upgrade state (`open → attempted → met`); the engine does +not prevent downgrades. + +## Surprise triggers + +The `surprises` array defines mid-scenario twists. Each surprise has an `id`, +a `trigger`, a `detail` string, and optional `visible`, `channel`, and `from` +fields. Trigger types: + +- `turn_count` — `value: N` fires at or after turn N +- `objective_state` — `objective: , state: ` fires when the named + objective reaches that state +- `action_pattern` — `pattern: ` fires when the serialized action log + matches. Case-insensitive; no inline flags; fails closed on malformed input. +- `random` — `probability: 0.3` fires stochastically per turn + ## Quality checklist - [ ] The trainee objective is specific and measurable