From 01ca4a60faf6a9c3b309d585629f76831a5d2528 Mon Sep 17 00:00:00 2001 From: That1Drifter Date: Fri, 10 Apr 2026 16:33:40 -0500 Subject: [PATCH 1/2] docs: sync docs with shipped scenario catalog and rubric wiring MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - README: prune "not yet built" items that shipped (5 scenario manifests, deterministic per-turn scoring, stakeholder conflict surprise) and refresh the Working list to reflect the current engine. Link getting-started.md from Quick start for discoverability. - docs/architecture.md: rewrite the scoring-pipeline section — deterministic manifest rubric rules are wired now (not scaffolded), and document the case-insensitive + fail-closed regex behavior for both rubric and surprises. - docs/getting-started.md: refresh "What's playable today" — all six scenarios are loadable, Support Triage and Doc Q&A are smoke-tested on live API, the other four are authored but not yet run. - docs/writing-scenarios.md: add an Objective rubric rules section and a Surprise triggers section with explicit guidance not to use Perl-style (?i) inline flags — the engine now supplies the i flag automatically. Co-Authored-By: Claude Opus 4.6 (1M context) --- README.md | 27 +++++++++++++++++++-------- docs/architecture.md | 19 ++++++++++++++----- docs/getting-started.md | 13 +++++++++++-- docs/writing-scenarios.md | 35 +++++++++++++++++++++++++++++++++++ 4 files changed, 79 insertions(+), 15 deletions(-) diff --git a/README.md b/README.md index cfcaca0..b732ba3 100644 --- a/README.md +++ b/README.md @@ -35,6 +35,9 @@ pnpm dev Open http://localhost:3000, pick a scenario, and get to work. +Want more detail on setup, prerequisites, and running the CLI/tests? See +[docs/getting-started.md](docs/getting-started.md). + ## How it works Every scenario is a YAML manifest describing a simulated company: its industry, @@ -141,10 +144,12 @@ API key and your data never leaves your machine. ## Project status -**Phase 1 playable.** The Support Ticket Classifier scenario runs end-to-end -with live inner-Claude turns, stakeholder dialogue, discoverable objectives, -trust tracking, per-turn objective transitions, and a sharpened debrief that -cites specific turns and proposes alternative prompts. +**Phase 1 playable.** All six scenarios in the catalog are authored, schema-valid, +and loadable through the web app. Support Ticket Classifier and Internal Doc Q&A +are the scenarios exercised end-to-end on a live API (turn loop, stakeholder +dialogue, discoverable objectives, trust tracking, per-turn objective +transitions, cost tracking). The other four Tier 2–3 scenarios are authored +but haven't been played on live API yet. Working: @@ -153,16 +158,22 @@ Working: - JSON file session persistence (`data/sessions.json`, survives server restart) - Stakeholder trust meter with colored bars in the briefing panel - Turn budget and live USD cost display +- Deterministic per-turn objective scoring via manifest `rubric` rules + (`action_kind` / `payload_contains` / case-insensitive `payload_regex`) +- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and + `random` triggers - Collapsible action log viewer +- End-of-scenario debrief with turn-specific, alternative-prompt critiques - `fieldwork validate` CLI wired to the scenario schema -- 17+ passing unit tests (scenario schema, ticket generator, surprise engine) +- 34 passing unit tests across core, rubric, and scenarios Not yet built (see [TODO.md](TODO.md)): -- The 5 other scenario manifests (currently README stubs only) -- Per-turn deterministic rubric checks (currently LLM-driven via inner Claude) -- Stakeholder conflict surprises / political pressure mechanics +- Inner Claude response streaming (currently blocks the UI 5–15s per turn) +- SQLite session persistence (to replace the single-writer JSON file store) +- Action log summarization for long runs (prompt bloats past ~20 turns) - Cross-session history and analytics +- Demo GIF in the README ## Contributing diff --git a/docs/architecture.md b/docs/architecture.md index 770f499..d2a74ea 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -132,7 +132,8 @@ Trigger types: - `turn_count` — fires at or after a specific turn number - `objective_state` — fires when a named objective reaches a target state -- `action_pattern` — fires when the action log matches a regex +- `action_pattern` — fires when the action log matches a regex (case-insensitive, + fails closed on a malformed pattern) - `random` — fires with a given probability (per turn) Already-fired ids are tracked in `state.surprisesFired` and never fire twice. @@ -145,10 +146,18 @@ Unit-tested in `packages/core/src/__tests__/surprises.test.ts`. Two-tier scoring: -- **Per-turn** — LLM-driven via inner Claude returning - `hidden_state_updates.objective_transitions`. The engine validates and - merges these into `state.objectives`. Deterministic pattern-based checks in - `@fieldwork/rubric/score` are scaffolded but not wired yet (TODO). +- **Per-turn** — two paths run in parallel and the deterministic one wins when + both fire on the same objective: + - Inner Claude proposes objective state transitions via + `hidden_state_updates.objective_transitions`. The engine validates and + merges these into `state.objectives`. + - Manifest-level `rubric` rules on each objective run through + `@fieldwork/rubric/score`. A rule matches on `action_kind`, + `payload_contains` (case-insensitive substring), or `payload_regex` + (case-insensitive regex, fails closed on malformed patterns). The first + matching rule sets the objective state and overrides inner Claude's + judgment for the turn. Rules should only upgrade state + (`open → attempted → met`); the engine does not prevent downgrades. - **End-of-scenario** — `@fieldwork/rubric/debrief` sends the full action log, final objective states, stakeholder trust, and discovered/undiscovered splits to Sonnet. The prompt demands every critique cite a specific turn, diff --git a/docs/getting-started.md b/docs/getting-started.md index d51e587..62354f2 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -43,12 +43,21 @@ pnpm test ## What's playable today -- Support Ticket Classifier scenario runs end-to-end +- All six scenarios in the catalog (Support Ticket Classifier, Internal Doc Q&A, + Data Pipeline Automation, Multi-Step Workflow Agent, Legacy System Migration, + Production Incident Response) are authored, schema-valid, and loadable through + the web app. Support Ticket Classifier and Internal Doc Q&A are verified + end-to-end against a live API; the other four are authored but not yet exercised + on live API. - Full turn loop through inner Claude with JSON contract validation + retry - Prompt caching and Haiku/Sonnet model tiering - Stakeholder dialogue with trust meter (0-1 per stakeholder) - Discoverable objectives (hidden until the trainee surfaces them) -- Per-turn objective state transitions driven by inner Claude +- Per-turn objective state transitions: inner Claude proposes + deterministic + manifest `rubric` rules (`action_kind` / `payload_contains` / `payload_regex`) + override when they match +- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and + `random` triggers - Turn budget + cumulative USD cost display - Collapsible action log viewer - End-of-scenario debrief with turn-specific, alternative-prompt critiques diff --git a/docs/writing-scenarios.md b/docs/writing-scenarios.md index de72d8d..748c075 100644 --- a/docs/writing-scenarios.md +++ b/docs/writing-scenarios.md @@ -52,6 +52,41 @@ Each entry in `objectives`: See [`support-triage/manifest.yaml`](../packages/scenarios/support-triage/manifest.yaml) for a fully worked example covering all fields. +## Objective `rubric` rules + +Objectives can carry a `rubric` array of deterministic rules that run every +turn. The first matching rule sets the objective state, overriding inner +Claude's judgment for that turn. Each rule has a `match` block and a `set` +value (`open` | `attempted` | `met` | `failed`). + +`match` fields (all specified conditions must match): + +- `action_kind` — exact match on `action.kind` +- `payload_contains` — case-insensitive substring match against the + JSON-serialized action payload +- `payload_regex` — regex tested against the JSON-serialized payload. **Written + as plain regex — do NOT prefix with `(?i)` or any inline flag.** The engine + constructs patterns with the JavaScript `i` flag automatically, and inline + flags like `(?i)` are Perl-flavor and will crash the turn handler if the + engine didn't also fail closed. A malformed pattern now fails closed (the + rule skips instead of firing), but still — keep patterns plain. + +Rules should only upgrade state (`open → attempted → met`); the engine does +not prevent downgrades. + +## Surprise triggers + +The `surprises` array defines mid-scenario twists. Each surprise has an `id`, +a `trigger`, a `detail` string, and optional `visible`, `channel`, and `from` +fields. Trigger types: + +- `turn_count` — `value: N` fires at or after turn N +- `objective_state` — `objective: , state: ` fires when the named + objective reaches that state +- `action_pattern` — `pattern: ` fires when the serialized action log + matches. Case-insensitive; no inline flags; fails closed on malformed input. +- `random` — `probability: 0.3` fires stochastically per turn + ## Quality checklist - [ ] The trainee objective is specific and measurable From 0fb2d6991deab2fe7551d0975ac96a2c163cbaae Mon Sep 17 00:00:00 2001 From: That1Drifter Date: Fri, 10 Apr 2026 16:36:32 -0500 Subject: [PATCH 2/2] docs: commit CLAUDE.md with project guidance Previously gitignored as "local-only" but the content is pure project guidance (monorepo layout, commands, two-Claude model, conventions) with no secrets or personal paths. Committing makes it discoverable to new contributors and Claude Code picks it up the same way. Synced with this PR's other fixes while I was at it: rubric score.ts is wired (not scaffolded), all six scenarios are authored (not just support-triage), and the CLI section flags dryrun/play/cost-estimate/init as TODO stubs so readers don't waste time on them. Co-Authored-By: Claude Opus 4.6 (1M context) --- .gitignore | 1 - CLAUDE.md | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 83 insertions(+), 1 deletion(-) create mode 100644 CLAUDE.md diff --git a/.gitignore b/.gitignore index 3fe4c3c..8d1420b 100644 --- a/.gitignore +++ b/.gitignore @@ -36,5 +36,4 @@ apps/web/data/ data/sessions.json # Local-only -CLAUDE.md resume-claude.bat diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..2179b92 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,83 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## What this is + +fieldwork is a self-hosted FDE training simulator built on the Anthropic API. See `README.md` for the product pitch and `docs/architecture.md` for the full engine breakdown — read that file before making non-trivial changes to the turn loop. + +## Commands + +pnpm workspaces managed by Turbo. Run from repo root unless noted. + +```bash +pnpm install # bootstrap workspace +pnpm dev # turbo dev (starts apps/web Next.js dev server) +pnpm build # turbo build across workspace +pnpm test # vitest across packages +pnpm lint # next lint (web) — other packages are no-ops +pnpm typecheck # tsc --noEmit across workspace +pnpm validate-scenarios # schema-check all scenario manifests +pnpm format # prettier --write . +``` + +Scoped commands: + +```bash +pnpm --filter @fieldwork/web dev # web only +pnpm --filter @fieldwork/core test # engine tests only +pnpm --filter @fieldwork/core test -- -t "surprise" # single vitest by name +pnpm --filter @fieldwork/scenarios validate # validate just scenarios +``` + +CLI (authoring harness, from `apps/cli`). Only `validate` is implemented; the +other commands are registered but are TODO stubs. + +```bash +pnpm --filter @fieldwork/cli dev validate packages/scenarios/support-triage/manifest.yaml +pnpm --filter @fieldwork/cli dev init # stub +pnpm --filter @fieldwork/cli dev dryrun # stub +pnpm --filter @fieldwork/cli dev play # stub +pnpm --filter @fieldwork/cli dev cost-estimate # stub +``` + +Deploy to the staging server: `./scripts/deploy-staging.sh` (requires `FW_HOST` env var; supports `--fast`, `--restart`, `--logs`, `--set-key`, `--set-auth`). See README for the full option list. + +## Architecture + +Monorepo layout (pnpm + Turbo): + +- `apps/web` — Next.js 15 App Router. API routes in `app/api/{session,turn,debrief,health}` are the only surface the UI talks to. `middleware.ts` gates every route except `/api/health` with HTTP Basic Auth when `FIELDWORK_AUTH_PASS` is set (default-off for local dev). `lib/session-store.ts` is a singleton `Map` hydrated from `apps/web/data/sessions.json` via atomic write-and-rename — single-writer only, swap for SQLite if concurrency is ever needed. +- `apps/cli` — commander-based authoring CLI (`validate`, `init`, `dryrun`, `play`, `cost-estimate`). Entry: `src/index.ts`. +- `packages/core` — the engine. Key files: `engine.ts` (turn loop), `state.ts` (`SimState` shape), `contract.ts` (inner-Claude JSON contract), `surprises.ts` (trigger evaluator), `tickets.ts` + `rng.ts` (deterministic mulberry32 ticket generator), `schema/scenario.schema.json` (AJV-validated manifest schema). Vitest lives under `src/__tests__`. +- `packages/rubric` — `score.ts` (per-turn deterministic checks via manifest `rubric` rules; case-insensitive `payload_contains` and `payload_regex`, fails closed on malformed regex) and `debrief.ts` (end-of-scenario Sonnet call). +- `packages/scenarios` — one directory per scenario with `manifest.yaml` + README. All six scenarios are authored and schema-valid: `support-triage`, `doc-qa-rag`, `pipeline-automation`, `workflow-agent`, `legacy-migration`, `incident-response`. Schema tests in `__tests__`. +- `packages/ui` — shared React components used by `apps/web`. + +### The two-Claude model + +The outer shell (your code) owns state, scoring, and progression. Inner Claude plays the simulated customer environment and is constrained to a strict JSON output contract (see `docs/inner-claude-contract.md` and `packages/core/src/contract.ts`). **Never delegate rendering or scoring to inner Claude** — parse its JSON, validate it, and merge into `SimState` yourself. + +Each turn builds a three-part prompt: (1) cached system content (contract + world bible + ticket sample) marked `cache_control: ephemeral` so bytes reuse across turns, (2) dynamic user message (objective statuses, trust map, last 5 actions, current action, fired surprises), (3) model selection — Haiku 4.5 routine, Sonnet when a surprise fires or for debrief. First call uses `max_tokens: 4096`; on JSON parse error or `stop_reason === 'max_tokens'` the engine retries once with 8192 (truncation → omit bad output, malformed → feed it back for correction). + +Model IDs are overridable via `FIELDWORK_MODEL_ENV`, `FIELDWORK_MODEL_STAKEHOLDER`, `FIELDWORK_MODEL_DEBRIEF`. + +### State invariants + +- `stakeholderTrust` is a `[0, 1]` scalar per stakeholder, initialized to 0.5. Deltas from inner Claude are clamped to ±0.2 per turn and the result clamped to `[0, 1]` before persisting. +- `discoveredObjectives` only grows (objectives flagged `discoverable: true` stay hidden from the trainee until surfaced). An objective that's `NEVER DISCOVERED` at debrief time triggers a specific critique. +- `surprisesFired` is append-only — a surprise id never fires twice. +- `turn_budget` (optional, per-manifest) is enforced at the `/api/turn` route with `403 budgetExhausted: true`. + +### Cost tracking + +Every inner-Claude response carries usage data. The engine applies a pricing table (Haiku/Sonnet/Opus × input/output/cache-write/cache-read) to compute per-turn USD and accumulates it on the session. Both per-turn and cumulative are returned in the turn API response. + +## Conventions + +- Node 20+, pnpm 9+, ESM (`"type": "module"`) across all packages. Imports between workspace packages use `workspace:*`. +- TypeScript strict, shared base in `tsconfig.base.json`. Run `pnpm typecheck` before shipping. +- Prettier for formatting (config in `.prettierrc.json`). Run `pnpm format` before committing anything touched. +- Comments explain **why**, not **what** — match existing code style. Don't add docstrings that restate the identifier name. +- Scenario IDs must match `^[a-z0-9][a-z0-9-]*$`. +- New scenarios: create `packages/scenarios//manifest.yaml` + `README.md`, then `pnpm validate-scenarios` before opening a PR. Target <$2 API cost per session and include at least one surprise.