Merged
1 change: 0 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -36,5 +36,4 @@ apps/web/data/
data/sessions.json

# Local-only
CLAUDE.md
resume-claude.bat
83 changes: 83 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,83 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this is

fieldwork is a self-hosted FDE training simulator built on the Anthropic API. See `README.md` for the product pitch and `docs/architecture.md` for the full engine breakdown — read that file before making non-trivial changes to the turn loop.

## Commands

pnpm workspaces are managed by Turbo. Run all commands from the repo root unless noted otherwise.

```bash
pnpm install # bootstrap workspace
pnpm dev # turbo dev (starts apps/web Next.js dev server)
pnpm build # turbo build across workspace
pnpm test # vitest across packages
pnpm lint # next lint (web) — other packages are no-ops
pnpm typecheck # tsc --noEmit across workspace
pnpm validate-scenarios # schema-check all scenario manifests
pnpm format # prettier --write .
```

Scoped commands:

```bash
pnpm --filter @fieldwork/web dev # web only
pnpm --filter @fieldwork/core test # engine tests only
pnpm --filter @fieldwork/core test -- -t "surprise" # single vitest by name
pnpm --filter @fieldwork/scenarios validate # validate just scenarios
```

CLI (authoring harness, from `apps/cli`). Only `validate` is implemented; the
other commands are registered but are TODO stubs.

```bash
pnpm --filter @fieldwork/cli dev validate packages/scenarios/support-triage/manifest.yaml
pnpm --filter @fieldwork/cli dev init <name> # stub
pnpm --filter @fieldwork/cli dev dryrun <path> # stub
pnpm --filter @fieldwork/cli dev play <path> # stub
pnpm --filter @fieldwork/cli dev cost-estimate <path> # stub
```

Deploy to the staging server: `./scripts/deploy-staging.sh` (requires `FW_HOST` env var; supports `--fast`, `--restart`, `--logs`, `--set-key`, `--set-auth`). See README for the full option list.

## Architecture

Monorepo layout (pnpm + Turbo):

- `apps/web` — Next.js 15 App Router. API routes in `app/api/{session,turn,debrief,health}` are the only surface the UI talks to. `middleware.ts` gates every route except `/api/health` with HTTP Basic Auth when `FIELDWORK_AUTH_PASS` is set (default-off for local dev). `lib/session-store.ts` is a singleton `Map` hydrated from `apps/web/data/sessions.json` via atomic write-and-rename — single-writer only, swap for SQLite if concurrency is ever needed.
- `apps/cli` — commander-based authoring CLI (`validate`, `init`, `dryrun`, `play`, `cost-estimate`). Entry: `src/index.ts`.
- `packages/core` — the engine. Key files: `engine.ts` (turn loop), `state.ts` (`SimState` shape), `contract.ts` (inner-Claude JSON contract), `surprises.ts` (trigger evaluator), `tickets.ts` + `rng.ts` (deterministic mulberry32 ticket generator), `schema/scenario.schema.json` (AJV-validated manifest schema). Vitest lives under `src/__tests__`.
- `packages/rubric` — `score.ts` (per-turn deterministic checks via manifest `rubric` rules; case-insensitive `payload_contains` and `payload_regex`, fails closed on malformed regex) and `debrief.ts` (end-of-scenario Sonnet call).
- `packages/scenarios` — one directory per scenario with `manifest.yaml` + README. All six scenarios are authored and schema-valid: `support-triage`, `doc-qa-rag`, `pipeline-automation`, `workflow-agent`, `legacy-migration`, `incident-response`. Schema tests in `__tests__`.
- `packages/ui` — shared React components used by `apps/web`.
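
The atomic write-and-rename persistence noted for `apps/web` above can be sketched as follows (hypothetical helper name; the real store is `lib/session-store.ts`):

```typescript
import { renameSync, writeFileSync } from "node:fs";

// Write the full session map to a temp file, then rename over the target.
// rename() is atomic within a filesystem, so a reader never observes a
// half-written sessions.json. Single-writer only, as noted above.
function persistSessions(path: string, sessions: Map<string, unknown>): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(Object.fromEntries(sessions)), "utf8");
  renameSync(tmp, path);
}
```

The rename step is what makes crash-safety cheap here; if the process dies mid-write, only the `.tmp` file is corrupt and the last good `sessions.json` survives.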

### The two-Claude model

The outer shell (your code) owns state, scoring, and progression. Inner Claude plays the simulated customer environment and is constrained to a strict JSON output contract (see `docs/inner-claude-contract.md` and `packages/core/src/contract.ts`). **Never delegate rendering or scoring to inner Claude** — parse its JSON, validate it, and merge into `SimState` yourself.

Each turn builds a three-part prompt: (1) cached system content (contract + world bible + ticket sample) marked `cache_control: ephemeral` so the bytes are reused across turns, (2) a dynamic user message (objective statuses, trust map, last 5 actions, current action, fired surprises), (3) model selection — Haiku 4.5 for routine turns, Sonnet when a surprise fires or for the debrief. The first call uses `max_tokens: 4096`; on a JSON parse error or `stop_reason === 'max_tokens'` the engine retries once with 8192 (truncation → drop the bad output, malformed JSON → feed it back for correction).
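
The retry policy can be sketched like this (names and shapes are hypothetical; the real loop lives in `packages/core/src/engine.ts`):

```typescript
// Sketch of the one-retry policy: first attempt at 4096 max_tokens, then a
// single retry at 8192 on truncation or malformed JSON.
type InnerResult =
  | { kind: "ok"; value: unknown }
  | { kind: "truncated" }
  | { kind: "malformed"; raw: string };

async function runTurn(
  call: (maxTokens: number, feedback?: string) => Promise<InnerResult>,
): Promise<unknown | null> {
  const first = await call(4096);
  if (first.kind === "ok") return first.value;
  // Truncated output is dropped; malformed output is fed back for correction.
  const feedback = first.kind === "malformed" ? first.raw : undefined;
  const second = await call(8192, feedback);
  return second.kind === "ok" ? second.value : null; // give up after one retry
}
```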

Model IDs are overridable via `FIELDWORK_MODEL_ENV`, `FIELDWORK_MODEL_STAKEHOLDER`, `FIELDWORK_MODEL_DEBRIEF`.

### State invariants

- `stakeholderTrust` is a `[0, 1]` scalar per stakeholder, initialized to 0.5. Deltas from inner Claude are clamped to ±0.2 per turn and the result clamped to `[0, 1]` before persisting.
- `discoveredObjectives` only grows (objectives flagged `discoverable: true` stay hidden from the trainee until surfaced). An objective that's `NEVER DISCOVERED` at debrief time triggers a specific critique.
- `surprisesFired` is append-only — a surprise id never fires twice.
- `turn_budget` (optional, per-manifest) is enforced at the `/api/turn` route with `403 budgetExhausted: true`.
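
A minimal sketch of the trust-clamp invariant (hypothetical helper name, not the engine's actual API):

```typescript
// Cap the per-turn delta at ±0.2, then clamp the result to [0, 1] before
// persisting, matching the invariant above.
function applyTrustDelta(current: number, delta: number): number {
  const capped = Math.max(-0.2, Math.min(0.2, delta));
  return Math.max(0, Math.min(1, current + capped));
}
```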

### Cost tracking

Every inner-Claude response carries usage data. The engine applies a pricing table (Haiku/Sonnet/Opus × input/output/cache-write/cache-read) to compute per-turn USD and accumulates it on the session. Both per-turn and cumulative are returned in the turn API response.
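
As a hedged sketch of the accounting (the rates below are made-up placeholders, not real Anthropic prices; the engine keeps its own table):

```typescript
// Apply a per-model pricing table to one turn's usage. Rates are illustrative
// placeholders in USD per million tokens; do not treat them as real prices.
type Usage = { input: number; output: number; cacheWrite: number; cacheRead: number };

const RATES_PER_MTOK = { input: 1.0, output: 5.0, cacheWrite: 1.25, cacheRead: 0.1 };

function turnCostUSD(u: Usage): number {
  return (
    (u.input * RATES_PER_MTOK.input +
      u.output * RATES_PER_MTOK.output +
      u.cacheWrite * RATES_PER_MTOK.cacheWrite +
      u.cacheRead * RATES_PER_MTOK.cacheRead) /
    1_000_000
  );
}
```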

## Conventions

- Node 20+, pnpm 9+, ESM (`"type": "module"`) across all packages. Imports between workspace packages use `workspace:*`.
- TypeScript strict, shared base in `tsconfig.base.json`. Run `pnpm typecheck` before shipping.
- Prettier for formatting (config in `.prettierrc.json`). Run `pnpm format` before committing anything touched.
- Comments explain **why**, not **what** — match existing code style. Don't add docstrings that restate the identifier name.
- Scenario IDs must match `^[a-z0-9][a-z0-9-]*$`.
- New scenarios: create `packages/scenarios/<id>/manifest.yaml` + `README.md`, then `pnpm validate-scenarios` before opening a PR. Target <$2 API cost per session and include at least one surprise.
27 changes: 19 additions & 8 deletions README.md
@@ -35,6 +35,9 @@ pnpm dev

Open http://localhost:3000, pick a scenario, and get to work.

Want more detail on setup, prerequisites, and running the CLI/tests? See
[docs/getting-started.md](docs/getting-started.md).

## How it works

Every scenario is a YAML manifest describing a simulated company: its industry,
@@ -141,10 +144,12 @@ API key and your data never leaves your machine.

## Project status

**Phase 1 playable.** The Support Ticket Classifier scenario runs end-to-end
with live inner-Claude turns, stakeholder dialogue, discoverable objectives,
trust tracking, per-turn objective transitions, and a sharpened debrief that
cites specific turns and proposes alternative prompts.
**Phase 1 playable.** All six scenarios in the catalog are authored, schema-valid,
and loadable through the web app. Support Ticket Classifier and Internal Doc Q&A
are the scenarios exercised end-to-end on a live API (turn loop, stakeholder
dialogue, discoverable objectives, trust tracking, per-turn objective
transitions, cost tracking). The other four Tier 2–3 scenarios are authored
but haven't been played on a live API yet.

Working:

@@ -153,16 +158,22 @@ Working:
- JSON file session persistence (`data/sessions.json`, survives server restart)
- Stakeholder trust meter with colored bars in the briefing panel
- Turn budget and live USD cost display
- Deterministic per-turn objective scoring via manifest `rubric` rules
(`action_kind` / `payload_contains` / case-insensitive `payload_regex`)
- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and
`random` triggers
- Collapsible action log viewer
- End-of-scenario debrief with turn-specific, alternative-prompt critiques
- `fieldwork validate` CLI wired to the scenario schema
- 17+ passing unit tests (scenario schema, ticket generator, surprise engine)
- 34 passing unit tests across core, rubric, and scenarios

Not yet built (see [TODO.md](TODO.md)):

- The 5 other scenario manifests (currently README stubs only)
- Per-turn deterministic rubric checks (currently LLM-driven via inner Claude)
- Stakeholder conflict surprises / political pressure mechanics
- Inner Claude response streaming (currently blocks the UI 5–15s per turn)
- SQLite session persistence (to replace the single-writer JSON file store)
- Action log summarization for long runs (prompt bloats past ~20 turns)
- Cross-session history and analytics
- Demo GIF in the README

## Contributing

19 changes: 14 additions & 5 deletions docs/architecture.md
@@ -132,7 +132,8 @@ Trigger types:

- `turn_count` — fires at or after a specific turn number
- `objective_state` — fires when a named objective reaches a target state
- `action_pattern` — fires when the action log matches a regex
- `action_pattern` — fires when the action log matches a regex (case-insensitive,
fails closed on a malformed pattern)
- `random` — fires with a given probability (per turn)

Already-fired ids are tracked in `state.surprisesFired` and never fire twice.
@@ -145,10 +146,18 @@ Unit-tested in `packages/core/src/__tests__/surprises.test.ts`.

Two-tier scoring:

- **Per-turn** — LLM-driven via inner Claude returning
`hidden_state_updates.objective_transitions`. The engine validates and
merges these into `state.objectives`. Deterministic pattern-based checks in
`@fieldwork/rubric/score` are scaffolded but not wired yet (TODO).
- **Per-turn** — two paths run in parallel and the deterministic one wins when
both fire on the same objective:
- Inner Claude proposes objective state transitions via
`hidden_state_updates.objective_transitions`. The engine validates and
merges these into `state.objectives`.
- Manifest-level `rubric` rules on each objective run through
`@fieldwork/rubric/score`. A rule matches on `action_kind`,
`payload_contains` (case-insensitive substring), or `payload_regex`
(case-insensitive regex, fails closed on malformed patterns). The first
matching rule sets the objective state and overrides inner Claude's
judgment for the turn. Rules should only upgrade state
(`open → attempted → met`); the engine does not prevent downgrades.
- **End-of-scenario** — `@fieldwork/rubric/debrief` sends the full action log,
final objective states, stakeholder trust, and discovered/undiscovered
splits to Sonnet. The prompt demands every critique cite a specific turn,
13 changes: 11 additions & 2 deletions docs/getting-started.md
@@ -43,12 +43,21 @@ pnpm test

## What's playable today

- Support Ticket Classifier scenario runs end-to-end
- All six scenarios in the catalog (Support Ticket Classifier, Internal Doc Q&A,
Data Pipeline Automation, Multi-Step Workflow Agent, Legacy System Migration,
Production Incident Response) are authored, schema-valid, and loadable through
the web app. Support Ticket Classifier and Internal Doc Q&A are verified
end-to-end against a live API; the other four are authored but not yet exercised
on a live API.
- Full turn loop through inner Claude with JSON contract validation + retry
- Prompt caching and Haiku/Sonnet model tiering
- Stakeholder dialogue with trust meter (0-1 per stakeholder)
- Discoverable objectives (hidden until the trainee surfaces them)
- Per-turn objective state transitions driven by inner Claude
- Per-turn objective state transitions: inner Claude proposes + deterministic
manifest `rubric` rules (`action_kind` / `payload_contains` / `payload_regex`)
override when they match
- Surprise engine with `turn_count`, `objective_state`, `action_pattern`, and
`random` triggers
- Turn budget + cumulative USD cost display
- Collapsible action log viewer
- End-of-scenario debrief with turn-specific, alternative-prompt critiques
35 changes: 35 additions & 0 deletions docs/writing-scenarios.md
@@ -52,6 +52,41 @@ Each entry in `objectives`:
See [`support-triage/manifest.yaml`](../packages/scenarios/support-triage/manifest.yaml)
for a fully worked example covering all fields.

## Objective `rubric` rules

Objectives can carry a `rubric` array of deterministic rules that run every
turn. The first matching rule sets the objective state, overriding inner
Claude's judgment for that turn. Each rule has a `match` block and a `set`
value (`open` | `attempted` | `met` | `failed`).

`match` fields (all specified conditions must match):

- `action_kind` — exact match on `action.kind`
- `payload_contains` — case-insensitive substring match against the
JSON-serialized action payload
- `payload_regex` — regex tested against the JSON-serialized payload. **Write
  plain regex — do NOT prefix `(?i)` or any other inline flag.** The engine
  constructs patterns with the JavaScript `i` flag automatically, and inline
  flags like `(?i)` are Perl-flavored syntax that JavaScript's regex engine
  rejects. A malformed pattern fails closed (the rule is skipped instead of
  fired), but keep patterns plain regardless.

Rules should only upgrade state (`open → attempted → met`); the engine does
not prevent downgrades.

## Surprise triggers

The `surprises` array defines mid-scenario twists. Each surprise has an `id`,
a `trigger`, a `detail` string, and optional `visible`, `channel`, and `from`
fields. Trigger types:

- `turn_count` — `value: N` fires at or after turn N
- `objective_state` — `objective: <id>, state: <state>` fires when the named
objective reaches that state
- `action_pattern` — `pattern: <regex>` fires when the serialized action log
matches. Case-insensitive; no inline flags; fails closed on malformed input.
- `random` — `probability: 0.3` fires stochastically per turn
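
An illustrative surprise entry (values are invented, and the exact nesting of the trigger fields is an assumption based on the list above):

```yaml
surprises:
  - id: cfo-budget-freeze
    trigger:
      type: turn_count
      value: 6
    detail: The CFO freezes all vendor spend pending a cost review.
    visible: true
    channel: email
    from: cfo
```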

## Quality checklist

- [ ] The trainee objective is specific and measurable