From b67214569aeafa186ba8588a467de00a9fa5b18f Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 12:40:29 -0400
Subject: [PATCH 01/12] docs(planning): register v1.5 'Build with Codex'; Phase
 0 env + R1/R2 research notes

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 .planning/ROADMAP.md                          |  14 +-
 .planning/STATE.md                            |  25 ++-
 .../2026-06-12-claude-json-envelope.md        |  71 ++++++
 .../2026-06-12-codex-headless-facts.md        | 207 ++++++++++++++++++
 .planning/v1.5-DESIGN.md                      | 119 ++++++++++
 5 files changed, 425 insertions(+), 11 deletions(-)
 create mode 100644 .planning/research/2026-06-12-claude-json-envelope.md
 create mode 100644 .planning/research/2026-06-12-codex-headless-facts.md
 create mode 100644 .planning/v1.5-DESIGN.md

diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index 5c4b660..941dd2c 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -10,7 +10,19 @@ Co-Evolution is a tooling repo for structured iterative refinement between AI ag
 - [x] **v1.1 Polish & Ergonomics** (shipped 2026-04-17) — v1.0 code review fixes (WR-01/02/03) + runtime ergonomics (REVISE auto-loop, visible live mode, branch/worktree management). 4 phases, 6 requirements closed. PR [#2](https://github.com/alanshurafa/co-evolution/pull/2) · See [`milestones/v1.1-ROADMAP.md`](milestones/v1.1-ROADMAP.md) · [`milestones/v1.1-SUMMARY.md`](milestones/v1.1-SUMMARY.md) · [`milestones/v1.1-REQUIREMENTS.md`](milestones/v1.1-REQUIREMENTS.md)
 - [x] **v1.3 Reliability, Measurement & Cross-Platform** (shipped 2026-06-11) — stranded-fix landing, macOS/bash-3.2+5.2 portability with 3-OS CI, silent-failure hardening, and the bounce measurement stack (state.json, deterministic scorer + marker-fate ledger, blind judge, human report). Headline: 17.6% deletion-convergence measured; Fable-5 judge 7/7 improved. 9 phases. See [`milestones/v1.3-SUMMARY.md`](milestones/v1.3-SUMMARY.md) · audit at `docs/audits/2026-06-10-v13-audit.md`
 
-## Active Milestone: v1.4 Distribution — npm + MCP (2026-06-11)
+## Active Milestone: v1.5 Build with Codex — model ladder + orchestrated execution (2026-06-12)
+
+**Goal:** Adopt the Codex-execution / Fable-orchestration split (per @cjzafir's pattern) in the dev-review runner: fix 3 latent env-export bugs, add per-seat model/effort config, a `--preset codex-build` shortcut, detached background execution with harness-exit-wakeup, a status-reader script, token capture to measure the 50% cost claim, and a `/codex-build` orchestration skill for both the runner and plugin transports. Design basis: `.planning/v1.5-DESIGN.md` (approved 2026-06-12).
+
+- [ ] **Phase 0: Environment + research** (2026-06-12, in progress) — codex symlink + smoke; plugin install; R1 pin `claude -p --output-format json` envelope; R2 pin codex end-of-run token line; register milestone in .planning/; research notes filed.
+- [ ] **Phase 1: Seat plumbing + env-export correctness** — `lib/co-evolution.sh` effort knobs + `invoke_codex_schema` move (B2); `dev-review.sh` `export CODEX_MODEL` (B1) + `export WORKDIR` (B3) + `--verifier/--claude-model` flags + per-seat env via `apply_seat_env`. Gate: `tests/run-all.sh` green; byte-parity with knobs off.
+- [ ] **Phase 2: Claude-verifier hardening + `--preset codex-build`** — fenced-JSON verdict fallback; preset expansion (fable/high → codex/xhigh → fable/max, bounces=2, revise-loop=1); banner; `tests/preset-expansion-simulation.sh`.
+- [ ] **Phase 3: Runner observability + status reader** — `state.json` additions (`current_phase`, `runner_pid`, `pre/post_execute_sha`, `orchestration.parent_run_id`); new `dev-review-status.sh` (~120 lines, exit codes 0/2/3/4/5); `tests/status-reader-simulation.sh`.
+- [ ] **Phase 4: Token capture** — `CO_EVOLVE_TOKEN_CAPTURE=1` (default off); `invoke_claude` gated JSON mode; codex stderr harvest; `collect_token_usage` → `state.json.tokens`; `tests/token-capture-simulation.sh`.
+- [ ] **Phase 5: `/codex-build` skill + docs** — new `skills/codex-build/SKILL.md` (preflight → plan → kick → wake/gate loop, both runner and plugin transports); CLAUDE.md Default Rule update; routing doc updates.
+- [ ] **Phase 6: Dogfood + evidence** — 2–3 real `/codex-build` tasks (ACCEPT / REVISE→ACCEPT / ESCALATE); token evidence note; MCP parity (`vendor.sh` + `npm test`); memory update.
+
+## Previous Milestone: v1.4 Distribution — npm + MCP (2026-06-11)
 
 **Goal:** Make the bounce protocol invocable without `git clone`: a Node/TS
 MCP server (`@alanshurafa/co-evolution-mcp`, one `co_evolve` tool) published
diff --git a/.planning/STATE.md b/.planning/STATE.md
index c429f60..8a914df 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -1,17 +1,17 @@
 ---
 gsd_state_version: 1.0
-milestone: v1.4
-milestone_name: Distribution — npm + MCP
+milestone: v1.5
+milestone_name: Build with Codex — model ladder + orchestrated execution
 status: executing
-stopped_at: v1.4 Phases 0-4 EXECUTED (2026-06-11). mcp/ package built + 4/4 hermetic smoke tests + stdio handshake verified; CI gains 3-OS mcp job; publish-mcp.yml ready. Remaining = Phase 5 (HUMAN: verify npm scope @alanshurafa, add NPM_TOKEN secret, git tag -> auto-publish, Claude Desktop round-trip) + Phase 6 post-ship registry/awesome-list.
-last_updated: "2026-06-10T23:45:00.000Z"
-last_activity: 2026-06-11 -- v1.4 milestone registered; v1.3 archived
+stopped_at: v1.5 Phase 0 IN PROGRESS (2026-06-12). Codex symlink created, smoke pass, claude -p envelope pinned, plugin installed. v1.4 Phase 5 BLOCKED ON HUMAN (npm scope + NPM_TOKEN + git tag) — running in parallel, untouched.
+last_updated: "2026-06-12T12:28:00.000Z"
+last_activity: 2026-06-12 -- v1.5 milestone registered; Phase 0 executing
 progress:
-  total_phases: 8
-  completed_phases: 8
-  total_plans: 17
-  completed_plans: 18
-  percent: 100
+  total_phases: 7
+  completed_phases: 0
+  total_plans: 0
+  completed_plans: 0
+  percent: 0
 ---
 
 # Project State
@@ -30,6 +30,11 @@ Milestone: v1.4 Distribution — npm + MCP
 Phase: 0-4 complete; Phase 5 (publish) blocked on human items
 Status: npm scope verification + NPM_TOKEN secret + git tag are Alan's; everything else built and CI-gated. v1.2 SC-4 gate still open (VERIFY-SC4.md).
 Last activity: 2026-06-10 -- Phase 0 merges + LF policy + audit report
+
+Milestone: v1.5 Build with Codex — model ladder + orchestrated execution
+Phase: 0 (env + research) in progress
+Status: codex symlink created + smoke pass; claude -p envelope pinned; codex plugin installed (openai-codex marketplace, v1.0.4); Phase 0 research notes filed. Branch: feat/v1.5-codex-build.
+Last activity: 2026-06-12 -- Phase 0 start; design file committed at .planning/v1.5-DESIGN.md
 Working directories: `~/co-evolution-v13/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC
 
 macOS baseline before v1.3 fixes: scorer-verification 11/14; code-proposer sim 1/16; pr-emitter sim 4/12; template-proposer sim 1/8; revise-loop sim aborts. Root causes: bash 3.2 (mapfile, source <(…)), BSD sed GNU-isms. Target after Phase 0.5: all green on macOS.
diff --git a/.planning/research/2026-06-12-claude-json-envelope.md b/.planning/research/2026-06-12-claude-json-envelope.md
new file mode 100644
index 0000000..766fa02
--- /dev/null
+++ b/.planning/research/2026-06-12-claude-json-envelope.md
@@ -0,0 +1,71 @@
+# R1: claude -p JSON Envelope — Field Reference
+
+**Date:** 2026-06-12  
+**Phase:** v1.5 Phase 0  
+**Purpose:** Pin the exact JSON output structure of `claude -p --output-format json` for Phase 4 token-capture parsing.
+
+## Command run
+
+```
+claude -p --output-format json --model claude-haiku-4-5-20251001 "Reply with exactly: PING"
+```
+
+## Verbatim output (not-logged-in state; all usage fields present and zero)
+
+```json
+{"type":"result","subtype":"success","is_error":true,"api_error_status":null,"duration_ms":1103,"duration_api_ms":0,"num_turns":1,"result":"Not logged in · Please run /login","stop_reason":"stop_sequence","session_id":"4642f382-c299-4d72-bcaf-3e7bca396c7d","total_cost_usd":0,"usage":{"input_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":0,"server_tool_use":{"web_search_requests":0,"web_fetch_requests":0},"service_tier":"standard","cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"inference_geo":"","iterations":[],"speed":"standard"},"modelUsage":{},"permission_denials":[],"terminal_reason":"completed","fast_mode_state":"off","uuid":"def45115-84e3-4e92-a601-20e0dae05dda"}
+```
+
+Exit code: 1 (error because not logged in; envelope still emitted on stdout)
+
+## Top-level fields
+
+| Field | Type | Notes |
+|---|---|---|
+| `type` | string | Always `"result"` |
+| `subtype` | string | `"success"` even on error |
+| `is_error` | bool | `true` when auth fails or API error |
+| `api_error_status` | null/int | HTTP status code on API errors; null here |
+| `duration_ms` | int | Wall time in ms |
+| `duration_api_ms` | int | API time in ms |
+| `num_turns` | int | Number of conversation turns |
+| `result` | string | The model's text output (or error message) |
+| `stop_reason` | string | e.g. `"stop_sequence"`, `"end_turn"` |
+| `session_id` | string | UUID |
+| `total_cost_usd` | float | Total cost in USD (0 when not logged in) |
+| `usage` | object | Token usage breakdown — see below |
+| `modelUsage` | object | Per-model usage breakdown (empty when not logged in) |
+| `permission_denials` | array | Tool permission denial events |
+| `terminal_reason` | string | `"completed"` |
+| `fast_mode_state` | string | `"off"` |
+| `uuid` | string | Run UUID |
+
+## usage subfields
+
+| Field | Type | Notes |
+|---|---|---|
+| `input_tokens` | int | Prompt input tokens |
+| `cache_creation_input_tokens` | int | Cache write tokens |
+| `cache_read_input_tokens` | int | Cache hit tokens |
+| `output_tokens` | int | Response tokens |
+| `server_tool_use.web_search_requests` | int | Web search count |
+| `server_tool_use.web_fetch_requests` | int | Web fetch count |
+| `service_tier` | string | `"standard"` or `"priority"` |
+| `cache_creation.ephemeral_1h_input_tokens` | int | 1-hour ephemeral cache write tokens |
+| `cache_creation.ephemeral_5m_input_tokens` | int | 5-min ephemeral cache write tokens |
+| `inference_geo` | string | Inference geography code |
+| `iterations` | array | Per-iteration usage (for multi-turn) |
+| `speed` | string | `"standard"` |
+
+## Notes for Phase 4
+
+- All token fields live under `.usage`. Phase 4's `invoke_claude` gated JSON mode should extract: `.usage.input_tokens`, `.usage.output_tokens`, `.usage.cache_creation_input_tokens`, `.usage.cache_read_input_tokens`.
+- `.total_cost_usd` is a direct top-level field, not nested.
+- The envelope is always emitted to **stdout** even when `is_error=true`.
+- The exit code is 1 on auth error; Phase 4 must handle non-zero exit with usable envelope (capture stdout regardless of exit code using `|| true`).
+- **Limitation:** This capture is from a not-logged-in shell. When logged in, `modelUsage` will be populated and `iterations` may have per-turn breakdown. Field names are stable across auth state.
+
+## Auth status at capture time
+
+`claude whoami` returns: `Not logged in · Please run /login`  
+(The Mac's interactive Claude Code sessions authenticate through the Electron app, not this shell. The sub-agent shell does not carry the session token.)
diff --git a/.planning/research/2026-06-12-codex-headless-facts.md b/.planning/research/2026-06-12-codex-headless-facts.md
new file mode 100644
index 0000000..62ab02a
--- /dev/null
+++ b/.planning/research/2026-06-12-codex-headless-facts.md
@@ -0,0 +1,207 @@
+# Codex Headless Research: Smoke, Token Line, Plugin Route
+
+**Date:** 2026-06-12  
+**Phase:** v1.5 Phase 0 (T1, R2, T4)  
+**Purpose:** Pin environment facts needed by Phases 1–5.
+
+---
+
+## T1 — Codex CLI symlink + smoke
+
+### Binary location
+
+```
+/Applications/Codex.app/Contents/Resources/codex
+-rwxr-xr-x@ 1 shurafa  staff  213429648 Jun 11 16:11
+```
+
+### Symlink created
+
+```
+ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex
+```
+
+Result: `lrwxr-xr-x@ 1 shurafa  staff  48 Jun 12 12:28 /Users/shurafa/.local/bin/codex -> /Applications/Codex.app/Contents/Resources/codex`
+
+### Version
+
+```
+codex-cli 0.140.0-alpha.2
+```
+
+### Config (~/.codex/config.toml relevant fields)
+
+```toml
+model = "gpt-5.5"
+model_reasoning_effort = "xhigh"
+service_tier = "priority"
+```
+
+### Headless smoke command
+
+```bash
+mkdir -p "$TMPDIR/codex-smoke" && cd "$TMPDIR/codex-smoke" && \
+  ~/.local/bin/codex exec --full-auto --skip-git-repo-check \
+  "Reply with exactly: SMOKE-OK" > out.txt 2> err.txt; echo "exit=$?"
+```
+
+**Exit code: 0**
+
+### stdout (out.txt — verbatim)
+
+```
+SMOKE-OK
+```
+
+### stderr (err.txt — verbatim)
+
+```
+warning: `--full-auto` is deprecated; use `--sandbox workspace-write` instead.
+Reading additional input from stdin...
+OpenAI Codex v0.140.0-alpha.2
+--------
+workdir: /private/var/folders/20/vl4p6fsd2dsbf5r68611c6nc0000gn/T/codex-smoke
+model: gpt-5.5
+provider: openai
+approval: never
+sandbox: workspace-write [workdir, /tmp, $TMPDIR]
+reasoning effort: xhigh
+session id: 019ebcaa-5766-7c60-beb8-824b921d69a1
+--------
+user
+Reply with exactly: SMOKE-OK
+codex
+SMOKE-OK
+tokens used
+11,681
+```
+
+**Conclusion:** Auth works headless. gpt-5.5 + xhigh accepted. ChatGPT-plan tier confirmed (`service_tier = "priority"` in config).
+
+---
+
+## R2 — Codex end-of-run token usage line (verbatim + stream)
+
+**Stream:** stderr  
+**Exact lines (verbatim, from smoke run above):**
+
+```
+tokens used
+11,681
+```
+
+Two lines: the label `tokens used` on one line, then the formatted integer with comma separator on the next line.
+
+**Format for Phase 4 grep/parse:**  
+The token count follows the literal line `tokens used` (no colon) as the immediately subsequent line. Parse with:
+```bash
+grep -A1 "^tokens used$" err.txt | tail -1
+```
+or equivalently with `awk '/^tokens used$/{getline; print}'`.  
+The number uses comma formatting (e.g., `11,681`); strip commas before arithmetic: `tr -d ','`.
+
+**Note on deprecation:** `--full-auto` is deprecated in favor of `--sandbox workspace-write`. Both flags produce the same token output format. The runner may want to migrate to `--sandbox workspace-write` in Phase 1.
+
+---
+
+## T4 — Plugin install route
+
+### `claude plugin --help` subcommands (relevant)
+
+```
+claude plugin install|i [options] <plugin>
+  Install a plugin from available marketplaces
+  Options:
+    --config <key=value>  Set a userConfig option (repeatable)
+    -s, --scope <scope>   user, project, or local (default: "user")
+
+claude plugin marketplace [options] [command]
+  Commands:
+    add [options] <source>      Add a marketplace from a URL, path, or GitHub repo
+    list [options]              List all configured marketplaces
+    remove|rm [options] <name>  Remove a configured marketplace
+    update [options] [name]     Update marketplace(s)
+```
+
+### Non-interactive install route (confirmed working)
+
+Step 1 — Add the marketplace:
+```bash
+claude plugin marketplace add openai/codex-plugin-cc
+```
+Output:
+```
+Adding marketplace…SSH not configured, cloning via HTTPS: https://github.com/openai/codex-plugin-cc.git
+Refreshing marketplace cache (timeout: 120s)…
+Cloning repository (timeout: 120s): https://github.com/openai/codex-plugin-cc.git
+Clone complete, validating marketplace…
+Cleaning up old marketplace cache…
+✔ Successfully added marketplace: openai-codex (declared in user settings)
+```
+
+Step 2 — Install the plugin:
+```bash
+claude plugin install codex@openai-codex
+```
+Output:
+```
+Installing plugin "codex@openai-codex"...✔ Successfully installed plugin: codex@openai-codex (scope: user)
+```
+
+### Verification (`claude plugin list`)
+
+```
+Installed plugins:
+
+  ❯ codex@openai-codex
+    Version: 1.0.4
+    Scope: user
+    Status: ✔ enabled
+```
+
+### Plugin details (component inventory)
+
+```
+codex 1.0.4
+  Use Codex from Claude Code to review code or delegate tasks.
+  Source: codex@openai-codex
+
+Component inventory
+  Skills (10)  adversarial-review, cancel, codex-cli-runtime, codex-result-handling,
+               gpt-5-4-prompting, rescue, result, review, setup, status
+  Agents (1)   codex-rescue
+  Hooks (3)    SessionStart, SessionEnd, Stop  (harness-only — no model context cost)
+  MCP servers (0)
+  LSP servers (0)
+
+Projected token cost
+  Always-on:   ~306 tok   added to every session
+```
+
+**Key skills for v1.5:** `/codex:rescue` (delegate tasks to Codex), `/codex:status` (check job status), `/codex:review` (adversarial review), `/codex:cancel` (cancel running job).
+
+### Conclusion
+
+A fully **non-interactive** install route exists and works. No `/plugin` UI required. The plan's note about "if it requires the interactive /plugin UI" is moot — `claude plugin marketplace add` + `claude plugin install` suffice. The install persists at user scope.
+
+### Interactive equivalent (for reference only, not needed)
+
+Would be: open Claude Code, type `/plugin marketplace add openai/codex-plugin-cc`, then `/plugin install codex`. The CLI route above is strictly equivalent.
+
+---
+
+## Summary
+
+| Item | Fact |
+|---|---|
+| codex binary | `/Applications/Codex.app/Contents/Resources/codex` (213 MB, Jun 11) |
+| codex version | 0.140.0-alpha.2 |
+| symlink | `~/.local/bin/codex` → binary (created 2026-06-12) |
+| smoke exit | 0 |
+| smoke stdout | `SMOKE-OK` |
+| token line stream | **stderr** |
+| token line format | two lines: `tokens used` then `11,681` (comma-formatted integer) |
+| `--full-auto` | deprecated; equivalent `--sandbox workspace-write` |
+| plugin install | non-interactive, fully CLI-driven |
+| plugin name | `codex@openai-codex` v1.0.4, user scope, enabled |
+| plugin skills | rescue, status, review, cancel, adversarial-review (+ 5 more) |
diff --git a/.planning/v1.5-DESIGN.md b/.planning/v1.5-DESIGN.md
new file mode 100644
index 0000000..9d661b9
--- /dev/null
+++ b/.planning/v1.5-DESIGN.md
@@ -0,0 +1,119 @@
+# v1.5 "Build with Codex" — model ladder + Fable-orchestrated execution
+<!-- Approved 2026-06-12; copied from ~/.claude/plans/evaluate-this-workflow-as-rippling-adleman.md -->
+
+## Context: the evaluation
+
+**The post** ([@cjzafir, 2026-06-11](https://x.com/cjzafir/status/2065104422762684745), ~98k views): cut weekly Claude Code limits ~50% by installing OpenAI's official [codex-plugin-cc](https://github.com/openai/codex-plugin-cc) (20.7k stars, Apache 2.0) and splitting roles — **Fable 5 high plans, Codex 5.5 xhigh executes on the ChatGPT plan (no API), Fable 5 max reviews and gates**. "Fable never writes the code, Codex never decides the design." Mechanically the plugin's `codex-rescue` agent shells out to `codex exec --full-auto` — **the exact invocation our dev-review runner already uses**.
+
+**Verdict: adopt.** dev-review already implements ~80% of this (compose → bounce → execute → verify, Codex executor, review-verdict.json gate, revise loop, worktree isolation). What's genuinely missing: per-seat model/effort selection, a blessed preset, detached execution with gate-based check-ins (the user's "Fable orchestrates, checks in periodically" vision), and token instrumentation to test the 50% claim.
+
+**Where we improve on the post:**
+1. **Bounce the brief** — Codex critiques the plan with `[CONTESTED]`/`[CLARIFY]` markers before execution (post has no plan hardening).
+2. **Schema-bound verdict gate with bounded convergence** — APPROVED/REVISE + confidence + issues[], max 2 orchestrator revise rounds (mirrors the 2-pass marker-expiry rule). The post's "cycle repeats until it passes" has no termination guarantee.
+3. **Detached execution, not babysitting** — the post keeps a Fable session supervising inline, burning cache reads every turn (against the token policy). We kick the runner via harness background Bash, **end the turn**, and get woken on exit — Fable pays only for plan + review gates.
+4. **Worktree/branch isolation** (`--worktree auto` exists) — post runs Codex loose in the repo.
+5. **Evidence, not vibes** — per-phase token capture into state.json so runs measure whether the 50% claim holds here.
+
+**Environment discoveries (this Mac):**
+- Codex.app installed, `~/.codex/config.toml` already `model = "gpt-5.5"`, `model_reasoning_effort = "xhigh"`, fresh auth.json; CLI binary at `/Applications/Codex.app/Contents/Resources/codex` (not on PATH — symlink needed).
+- claude CLI 2.1.162; `--effort` proven headless in-repo (`evals/judge-bounce.sh:139` uses `--effort high` with `claude-fable-5`).
+- `~/Project/co-evolution` clone current at 215dc83, clean.
+- **3 latent bugs found** (must fix; they break per-seat config under the `timeout bash -c` process boundary): **B1** `--model` sets `CODEX_MODEL` without `export` (dev-review.sh:976) so it never reaches `invoke_codex`; **B2** `invoke_codex_schema` defined in dev-review.sh but called in a child that sources only lib → exit 127 when verifier=codex; **B3** `WORKDIR` never exported → agents fall back to launch cwd.
+
+**User decisions:** full v1.5 scope · both transports (own runner core + plugin path) · skill named `/codex-build` · Mac codex setup yes.
+
+## Ground rules
+
+- Implement in **`~/Project/co-evolution`** (GitHub-first; the SMB mount `/Volumes/Project/co-evolution` is read-only/sync-only). Branch `feat/v1.5-codex-build`; push when sessions end.
+- Register milestone **v1.5** per `.planning/` conventions (STATE.md, ROADMAP.md, `phases/NN-*/`). v1.4 Phase 5 (npm publish) stays human-blocked in parallel — untouched.
+- All new knobs default empty/off → **byte-parity** with today's argv and artifacts (existing sims must stay green).
+- lib/co-evolution.sh changes must stay additive (it's vendored into mcp/; re-run `vendor.sh` + mcp tests to prove parity).
+
+## Phases
+
+### Phase 0 — Environment + research spikes
+- `ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex`; `codex --version`; headless smoke `codex exec --full-auto` in a scratch dir (confirms ChatGPT-plan auth + gpt-5.5/xhigh accepted).
+- Install the plugin in Claude Code: `/plugin marketplace add openai/codex-plugin-cc` → install `codex` → smoke `/codex:rescue` + `/codex:status`.
+- R1: pin `claude -p --output-format json` envelope fields (usage, total_cost_usd). R2: pin codex end-of-run token line format on stderr. R3: confirm harness background-exit notification with a ~5-min stub run.
+- Output: `.planning/research/` notes + v1.5 milestone registration.
+
+### Phase 1 — Seat plumbing + env-export correctness (runner)
+- `lib/co-evolution.sh`: new empty-default knobs `CLAUDE_EFFORT` (→ `--effort` in `invoke_claude`) and `CODEX_REASONING_EFFORT` (→ `-c model_reasoning_effort=` in `invoke_codex`); **move `invoke_codex_schema` into lib** (fixes B2).
+- `dev-review/codex/dev-review.sh`: `export CODEX_MODEL` (B1) + `export WORKDIR` (B3); new flags `--verifier opus|claude|codex`, `--claude-model` (+ `fable` → `claude-fable-5` alias); per-seat env `COMPOSER_/EXECUTOR_/VERIFIER_MODEL|EFFORT` via `apply_seat_env` called at the 5 invocation sites (compose, bounce both turns, execute, verify); `select_verifier` honors `VERIFIER_OVERRIDE`.
+- Gate: `tests/run-all.sh` green; default invocation argv byte-identical.
+
+### Phase 2 — Claude-verifier hardening + `--preset codex-build`
+- Verify phase: brace-block extraction fallback for prose-wrapped verdicts; persist normalized verdict back to the contract path `verdict.json` (no-op for codex `--output-schema`).
+- `apply_preset codex-build`: composer=claude(fable, high) · executor=codex(model unpinned — user's codex config rules, effort xhigh) · verifier=claude(fable, max) · `--verify` on · bounces=2 · revise-loop=1. Banner prints seats; optional `seat_models` state field; RUNNER-CONTRACT.md → v1.1 (additive rows).
+- New `tests/preset-expansion-simulation.sh` (PATH-stub claude/codex, assert seat argv incl. `--effort`/`-c model_reasoning_effort=xhigh`, preset expansion, B1/B2/B3 regressions, fenced-JSON verdict parse).
+
+### Phase 3 — Runner observability + status reader
+- state.json additions (single-writer, additive): `current_phase {name, started_at}` written at phase **start** (null at EOF), `runner_pid`, `pre/post_execute_sha`, `orchestration.parent_run_id` via new `--parent-run` flag (lineage across revise re-kicks; re-kicks always get fresh run dirs — no `--resume` in v1.5).
+- New `dev-review/codex/dev-review-status.sh` (~120 lines, read-only): run summary, phase progress, heartbeat = stderr-log mtime/size (BSD/GNU `stat` feature-detect), marker counts, verdict, diffstat, dead-runner detection. Exit codes: 0 done / 2 partial / 5 running / 4 presumed-dead / 3 no-run; `--json` mode. This is the **progress-file protocol** any fresh Fable session, Haiku watcher, or human reads.
+- New `tests/status-reader-simulation.sh` (synthetic run dirs) + lifecycle sim.
+
+### Phase 4 — Token capture (`CO_EVOLVE_TOKEN_CAPTURE=1`, default off)
+- `invoke_claude`: gated `--output-format json` mode — `.result` → output file (contract preserved), usage → `outputs/usage-<phase>.json` sidecar.
+- Codex: harvest the token line from already-captured per-phase stderr logs (format pinned in R2).
+- `collect_token_usage` at EOF → `state.json.tokens {phases, totals}`; scorer copies totals into report details (informational). New `tests/token-capture-simulation.sh`; parity guard when off.
+
+### Phase 5 — `/codex-build` skill + docs (both transports)
+- **New `skills/codex-build/SKILL.md`** (~250 lines) — the orchestration loop:
+  1. PREFLIGHT: codex present, no other active run (`dev-review-status.sh --list`), clean tree → `--branch auto`, dirty → `--worktree auto` (dirty tree otherwise makes verify silently skip).
+  2. PLAN in-session (Fable = composer; plan file in `$TMPDIR`, **never** inside the workdir, content embedded inline per repo convention) + optional ≤2 synchronous bounce passes; resolve markers.
+  3. KICK: one Bash call, `run_in_background: true`: `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --skip-plan --plan $PLAN --preset codex-build [--parent-run <id>] -- "<task>"` → report run id → **END TURN** (no polling, no sleep; harness wakes the session on exit).
+  4. WAKE → REVIEW GATE: status script → read verdict.json + diffstat first, targeted diff hunks only → **ACCEPT** (report branch/diffstat/tokens, suggest merge/PR) / **REVISE** (append "Reviewer Corrections (Round N)" to a copy of the plan, re-kick with `--parent-run`; **max 2 rounds**) / **ESCALATE** (missing verdict, exit 1, dead runner, rounds exhausted, scope_creep_detected — never auto-merge).
+  5. Budget rules + troubleshooting (incl. `pkill -f 'codex exec'` for orphans; optional one-shot scheduled check for very long runs).
+- **Plugin transport section** (second supported path, interactive flavor): when mid-session or for ad-hoc rescue work, delegate via `codex-rescue` / `/codex:rescue --background`, check `/codex:status`, then Fable reviews using the same review template + verdict JSON contract and the same ≤2-round cap. Documented protocol sharing our templates/schema; not used by evals/CI (no `--output-schema` contract through the plugin).
+- Docs: `CLAUDE.md` Default Rule gains the third option ("build with codex" — Fable plans/reviews, Codex executes); `dev-review/codex/instructions.md` routing rows (runner preset + plugin alternative); `skills/dev-review/SKILL.md` preset mention, `$CODEX_MODEL`/`-c` args gap fix, live-mode limitation note; sync-token grep test pinning the preset table across files.
+
+### Phase 6 — Dogfood + evidence
+- 2–3 real tasks through `/codex-build` on the Mac (now codex-capable); exercise ACCEPT, REVISE→ACCEPT, and ESCALATE paths for real; tune defaults.
+- **Evidence note** on the 50% claim: runner-side tokens (state.json) + Fable session turn counts vs. an interactive `/dev-review` baseline on the same task. Honest scope: harness-side session spend visible only as turn counts.
+- Re-run `mcp/scripts/vendor.sh` + `(cd mcp && npm test)` (lib changed); update auto-memory with the outcome.
+
+## Critical files
+
+| File | Change |
+|---|---|
+| `dev-review/codex/dev-review.sh` | B1/B3 exports, seat flags/env, preset, `--parent-run`, phase-start state writes, EOF token collect |
+| `lib/co-evolution.sh` | effort knobs, `invoke_codex_schema` relocation (B2), `begin_state_phase`, gated JSON mode, `collect_token_usage` |
+| `dev-review/codex/dev-review-status.sh` | **new** status reader |
+| `skills/codex-build/SKILL.md` | **new** orchestration skill (both transports) |
+| `skills/dev-review/SKILL.md`, `CLAUDE.md`, `dev-review/codex/instructions.md` | docs/routing |
+| `evals/RUNNER-CONTRACT.md` | v1.1 additive rows (`seat_models`, `current_phase`, `runner_pid`, `tokens`, lineage) |
+| `tests/preset-expansion-simulation.sh`, `tests/status-reader-simulation.sh`, `tests/token-capture-simulation.sh` | **new** hermetic sims |
+| `.planning/` (STATE.md, ROADMAP.md, `phases/`, `research/`) | v1.5 registration + R1–R3 notes |
+
+## Verification
+
+1. **Hermetic (Mac, no real agents):** `bash -n` on touched scripts; `bash tests/run-all.sh` — all existing + 3 new sims green; byte-parity assertions with knobs off.
+2. **Claude-only live dry run:** scratch repo, all three seats = claude with `VERIFIER_MODEL=fable VERIFIER_EFFORT=max` — proves `--effort max` headless and the claude-verdict path end-to-end (fallback: `high`).
+3. **Codex smoke (Mac, post-symlink):** headless `codex exec` + one full `--preset codex-build` run on a toy repo, incl. one forced REVISE round.
+4. **Orchestrated loop:** kick via background Bash, end turn, confirm wake-on-exit + status script output; plugin-transport smoke via `/codex:rescue`.
+5. **MCP parity:** `vendor.sh` + `cd mcp && npm test`.
+6. **Dogfood matrix (Phase 6):** ACCEPT / REVISE→ACCEPT / ESCALATE each exercised once; token evidence note written.
+
+## Execution strategy — sub-agent deployment
+
+This session (Fable) acts as orchestrator only: it briefs one executor sub-agent per phase, reviews the diff + test output at each gate, and never writes the code itself. Each agent gets the plan file path, its phase section, and the repo conventions; context stays small in the main session.
+
+| Phase | Agent model | Rationale |
+|---|---|---|
+| 0 — env setup, R1–R3 spikes, v1.5 `.planning` registration | **Sonnet** | Mechanical: symlink, smoke commands, record outputs, templated planning docs. Plugin install (`/plugin`) may need the main session or user if no CLI route. |
+| 1 — seat plumbing + B1/B2/B3 fixes | **Opus** | Careful surgery in a 1471-line bash script with byte-parity invariants. |
+| 2 — verifier hardening + preset + sim | **Opus** | Verdict-path correctness + new hermetic simulation. |
+| 3 — observability + status reader + sims | **Opus** | state.json single-writer weaving is delicate. |
+| 4 — token capture | **Opus** | Gated invoke_claude changes against the vendored lib. |
+| 5 — `/codex-build` skill + docs | **Opus** | Spec is fully written in this plan; converting it to SKILL.md prose + doc routing is execution, not design. |
+| 6 — dogfood + evidence | **Fable gates, Opus drafts** | Real runs go through the runner itself; Fable (main session) reviews verdicts/evidence; an Opus agent drafts the evidence note. |
+
+Phases run **sequentially** (1→2 and 3→4 edit the same files; parallelism would conflict). Gate protocol between phases: executor reports diff summary + test results → Fable spot-checks the diff → commit on the `feat/v1.5-codex-build` branch → next phase. All work in `~/Project/co-evolution`, never the SMB mount.
+
+## Risks
+
+- `--effort max` on claude CLI unverified (in-repo proof exists only for `high`) → dry-run step 2 resolves; one-constant fallback.
+- `xhigh` × model mismatch in codex → surfaced by existing stderr/error-payload gates; `EXECUTOR_EFFORT` override.
+- Seat env is process-global → every call site re-applies; sim pins all 5 sites.
+- Plugin drift (OpenAI-owned formats) → plugin path is prose-level protocol only, pinned to plugin major version in docs; runner path carries CI.
+- Missed wake notification (machine sleep) → status script's dead-runner detection + optional one-shot scheduled check.

From fb6862ab3d8b84c5f576d2904853d7eb5e270ffd Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 12:40:29 -0400
Subject: [PATCH 02/12] feat: per-seat model/effort seats + B1/B2/B3 env-export
 fixes (v1.5 Phase 1)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 dev-review/codex/dev-review.sh | 138 +++++++++++++++++++++++++--------
 lib/co-evolution.sh            |  72 ++++++++++++++++-
 2 files changed, 175 insertions(+), 35 deletions(-)

diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 2d96675..7d7595f 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -27,6 +27,9 @@ LIVE_MODE="${LIVE_MODE:-false}"
 # exclusive — both non-empty after parsing = die. Default empty = Phase 3 byte-parity.
 BRANCH_SPEC="${DEV_REVIEW_BRANCH:-}"
 WORKTREE_SPEC="${DEV_REVIEW_WORKTREE:-}"
+# v1.5: explicit verifier seat override. Empty = derive from executor via the
+# existing select_verifier logic (byte-parity default). Env default + --verifier flag.
+VERIFIER_OVERRIDE="${DEV_REVIEW_VERIFIER:-}"
 BRANCH_CREATED=""
 WORKTREE_PATH=""
 WORKDIR="$(pwd)"
@@ -75,6 +78,8 @@ Options:
   --skip-plan              Skip compose+bounce and execute an existing plan
   --plan FILE              Existing plan file to use with --skip-plan
   --model MODEL            Override Codex model
+  --verifier AGENT         Force the verify-phase agent (codex|opus|claude); default derives from --executor
+  --claude-model MODEL     Override Claude model (alias: fable -> claude-fable-5; else passthrough)
   --workdir DIR            Working directory (default: current directory)
   --timeout SECONDS        Per-phase timeout in seconds (default: 1800)
   --revise-loop N          Auto-retry on REVISE verdict up to N extra passes (default: 0 = disabled)
@@ -110,6 +115,17 @@ normalize_agent() {
   esac
 }
 
+# v1.5: resolve a friendly Claude model alias to its CLI model id. Only `fable`
+# is aliased today (→ claude-fable-5); anything else passes through verbatim so
+# explicit ids (claude-opus-4-6, claude-opus-4-8[1m], …) and an empty value are
+# untouched. Used by the --claude-model flag and the per-seat env layer.
+resolve_claude_model_alias() {
+  case "$1" in
+    fable) echo "claude-fable-5" ;;
+    *)     echo "$1" ;;
+  esac
+}
+
 invoke_agent() {
   local agent="$1"
   shift
@@ -169,6 +185,12 @@ require_agent_cli() {
 }
 
 select_verifier() {
+  # v1.5: an explicit --verifier / DEV_REVIEW_VERIFIER override wins; otherwise
+  # derive the opposite-of-executor default (byte-parity when unset).
+  if [[ -n "${VERIFIER_OVERRIDE:-}" ]]; then
+    echo "$VERIFIER_OVERRIDE"
+    return 0
+  fi
   if [[ "$EXECUTOR" == "codex" ]]; then
     echo "opus"
   else
@@ -214,6 +236,14 @@ ensure_codex_compatible_workdir() {
     needs_codex="true"
   fi
 
+  # v1.5: with --verifier codex (or executor=opus default) the verify phase runs
+  # codex too, so its workdir must satisfy the same WSL constraint. Only matters
+  # when verification is requested. Byte-parity holds: default executor=codex →
+  # verifier=opus → this branch is a no-op.
+  if [[ "$VERIFY" == "true" && "$(select_verifier)" == "codex" ]]; then
+    needs_codex="true"
+  fi
+
   if [[ "$needs_codex" != "true" ]]; then
     return 0
   fi
@@ -226,38 +256,12 @@ ensure_codex_compatible_workdir() {
   fi
 }
 
-invoke_codex_schema() {
-  local prompt_file="$1"
-  local output_file="$2"
-  local stderr_file="$3"
-  local schema_file="$4"
-  local workdir="${WORKDIR:-$PWD}"
-  local -a cmd
-  local windows_workdir=""
-  local windows_output=""
-  local windows_schema=""
-
-  if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1 && command -v wslpath >/dev/null 2>&1; then
-    windows_workdir=$(wslpath -w "$workdir")
-    windows_output=$(wslpath -w "$output_file")
-    windows_schema=$(wslpath -w "$schema_file")
-    cmd=(cmd.exe /c codex exec --full-auto --skip-git-repo-check -C "$windows_workdir")
-  else
-    cmd=(codex exec --full-auto --skip-git-repo-check -C "$workdir")
-  fi
-
-  if [[ -n "${CODEX_MODEL:-}" ]]; then
-    cmd+=(-c "model=${CODEX_MODEL}")
-  fi
-
-  if [[ -n "$windows_schema" ]]; then
-    cmd+=(--output-schema "$windows_schema" -o "$windows_output")
-  else
-    cmd+=(--output-schema "$schema_file" -o "$output_file")
-  fi
-
-  "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true
-}
+# v1.5 (fixes B2): invoke_codex_schema now lives in lib/co-evolution.sh (sourced
+# at the top of this script). It was previously defined here, but the verify
+# phase dispatches it inside a `bash -c 'source lib/co-evolution.sh; invoke_...'`
+# child that can only see lib-defined functions — so the local copy was invisible
+# there (exit 127 when a timeout binary exists and verifier=codex). Moved to lib
+# so both the timeout child and the no-timeout fallback resolve the same function.
 
 write_text_file() {
   local output_path="$1"
@@ -565,6 +569,8 @@ Output ONLY the plan document. No preamble."
   # abort_on_timeout will fire from main flow if the dispatcher reports 124.
   # RTUX-01: Launch a live-tail window before the agent call (no-op unless --live).
   maybe_launch_live_window "compose" "$compose_stderr_file"
+  # v1.5: layer the composer seat's model/effort (no-op when no per-seat env set).
+  apply_seat_env composer "$COMPOSER"
   invoke_agent_with_timeout "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$(phase_is_writable compose)"
   abort_on_timeout "compose" "$phase_start"
   ensure_valid_plan_output "compose phase" "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$compose_retry_stderr_file" "" "compose" || return $?
@@ -655,6 +661,14 @@ run_bounce_phase() {
     # so state.json records which pass hit the timeout.
     # RTUX-01: Per-pass live-tail window (bounce-01, bounce-02, ...) when --live.
     maybe_launch_live_window "bounce-${pass_padded}" "$stderr_file"
+    # v1.5: composer turns get the composer seat's model/effort; reviewer (the
+    # bounce counterparty) turns use globals only (the `bounce` seat is a no-op
+    # arm in apply_seat_env). No-op overall when no per-seat env is set.
+    if [[ "$role" == "composer" ]]; then
+      apply_seat_env composer "$current_agent"
+    else
+      apply_seat_env bounce "$current_agent"
+    fi
     invoke_agent_with_timeout "$current_agent" "$prompt_file" "$output_file" "$stderr_file" "$(phase_is_writable bounce)"
     abort_on_timeout "bounce-${pass_padded}" "$bounce_pass_start"
     ensure_valid_plan_output "bounce pass ${pass}" "$current_agent" "$prompt_file" "$output_file" "$stderr_file" "$retry_stderr_file" "$PLAN_PATH" "bounce" || return $?
@@ -751,6 +765,10 @@ run_execute_phase() {
   # RNPT-05: timeout-wrapped. abort_on_timeout uses _execute_phase_start from main flow.
   # RTUX-01: Live-tail window for execute phase (no-op unless --live).
   maybe_launch_live_window "execute" "$execute_stderr_file"
+  # v1.5: layer the executor seat's model/effort. The in-phase empty-output retry
+  # below inherits these exported values (no second apply needed). No-op when no
+  # per-seat env is set.
+  apply_seat_env executor "$EXECUTOR"
   invoke_agent_with_timeout "$EXECUTOR" "$execute_prompt_file" "$execute_output_file" "$execute_stderr_file" "$(phase_is_writable execute)"
   abort_on_timeout "execute" "$phase_start"
 
@@ -857,6 +875,10 @@ run_verify_phase() {
   fi
 
   verifier=$(select_verifier)
+  # v1.5: layer the verifier seat's model/effort, AFTER the verifier agent is
+  # resolved (the seat env keys off the agent type). Covers both the codex-schema
+  # and opus branches below. No-op when no per-seat env is set.
+  apply_seat_env verifier "$verifier"
 
   plan_content=$(cat "$PLAN_PATH")
   diff_content=$(cat "$diff_file")
@@ -973,7 +995,23 @@ while [[ $# -gt 0 ]]; do
       ;;
     --model)
       [[ $# -gt 1 ]] || die "--model requires a value"
-      CODEX_MODEL="$2"
+      # v1.5 (fixes B1): export so the `bash -c` child inside
+      # invoke_agent_with_timeout inherits it (mirrors the --timeout arm). Without
+      # export, CODEX_MODEL never reached invoke_codex in the dispatched child.
+      export CODEX_MODEL="$2"
+      shift 2
+      ;;
+    --verifier)
+      [[ $# -gt 1 ]] || die "--verifier requires a value"
+      # v1.5: normalize so `claude` maps to the opus seat name the rest of the
+      # script uses; reject anything normalize_agent doesn't accept.
+      VERIFIER_OVERRIDE=$(normalize_agent "$2") || die "Unsupported verifier: $2"
+      shift 2
+      ;;
+    --claude-model)
+      [[ $# -gt 1 ]] || die "--claude-model requires a value"
+      # v1.5: resolve friendly alias (fable -> claude-fable-5) else passthrough.
+      CLAUDE_MODEL=$(resolve_claude_model_alias "$2")
       shift 2
       ;;
     --workdir)
@@ -1147,6 +1185,12 @@ if [[ -n "$PLAN_SOURCE" ]]; then
 fi
 
 WORKDIR="$(cd "$WORKDIR" && pwd)"
+# v1.5 (fixes B3): export WORKDIR so the `bash -c` child inside
+# invoke_agent_with_timeout (and the verify-codex `bash -c` block) inherits it;
+# otherwise ${WORKDIR:-$PWD} in invoke_claude/invoke_codex falls back to the
+# child's launch cwd. The export attribute persists across the later worktree-mode
+# reassignment of WORKDIR (~line 1300), so the worktree path is exported too.
+export WORKDIR
 
 if [[ "$SKIP_PLAN" == "true" && -z "$PLAN_SOURCE" ]]; then
   die "--skip-plan requires --plan FILE"
@@ -1192,6 +1236,34 @@ ensure_codex_compatible_workdir
 
 require_selected_agent_clis
 
+# v1.5 per-seat env layer. Snapshot the post-parse base model/effort values (set
+# by globals / --model / --claude-model / CLAUDE_EFFORT / CODEX_REASONING_EFFORT),
+# then apply_seat_env layers an optional per-seat override before each invocation.
+# Byte-parity: with no COMPOSER_/EXECUTOR_/VERIFIER_ envs set, every seat resolves
+# to its base, so exported CLAUDE_MODEL/CLAUDE_EFFORT/CODEX_MODEL/CODEX_REASONING_EFFORT
+# match what the unlayered globals already were — argv is unchanged.
+CLAUDE_MODEL_BASE="$CLAUDE_MODEL"; CODEX_MODEL_BASE="${CODEX_MODEL:-}"
+CLAUDE_EFFORT_BASE="${CLAUDE_EFFORT:-}"; CODEX_EFFORT_BASE="${CODEX_REASONING_EFFORT:-}"
+
+# Export is load-bearing: invoke_agent_with_timeout crosses a `timeout bash -c`
+# process boundary; only exported values survive it.
+apply_seat_env() {
+  local seat="$1" agent="$2" model="" effort=""
+  case "$seat" in
+    composer) model="${COMPOSER_MODEL:-}"; effort="${COMPOSER_EFFORT:-}" ;;
+    executor) model="${EXECUTOR_MODEL:-}"; effort="${EXECUTOR_EFFORT:-}" ;;
+    verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;;
+    *) : ;;  # bounce counterparty: globals only
+  esac
+  if [[ "$agent" == "codex" ]]; then
+    export CODEX_MODEL="${model:-$CODEX_MODEL_BASE}"
+    export CODEX_REASONING_EFFORT="${effort:-$CODEX_EFFORT_BASE}"
+  else
+    export CLAUDE_MODEL="$(resolve_claude_model_alias "${model:-$CLAUDE_MODEL_BASE}")"
+    export CLAUDE_EFFORT="${effort:-$CLAUDE_EFFORT_BASE}"
+  fi
+}
+
 # Phase 8.1 WR-04: honor --run-dir when set; otherwise preserve v1.2 default (byte-parity).
 if [[ -n "$RUN_DIR_OVERRIDE" ]]; then
   RUN_DIR="$RUN_DIR_OVERRIDE"
diff --git a/lib/co-evolution.sh b/lib/co-evolution.sh
index 65aa7e0..ef0aeed 100644
--- a/lib/co-evolution.sh
+++ b/lib/co-evolution.sh
@@ -58,6 +58,15 @@ LIVE_MODE_WARNING_LOGGED=false
 # appends -c model= when it is set.)
 : "${CLAUDE_MODEL:=claude-opus-4-6}"
 
+# v1.5: Per-seat reasoning-effort knobs. Both default empty = OFF, so invoke_*
+# omit the effort flag entirely and argv stays byte-identical to today (parity
+# invariant). When non-empty, invoke_claude appends `--effort "$CLAUDE_EFFORT"`
+# and invoke_codex appends `-c model_reasoning_effort=...`. `--effort high` is
+# proven headless in -p mode at evals/judge-bounce.sh:58 (built into JUDGE_ARGS,
+# consumed by the `"$JUDGE_CMD" -p ...` call at evals/judge-bounce.sh:139).
+: "${CLAUDE_EFFORT:=}"
+: "${CODEX_REASONING_EFFORT:=}"
+
 # RNPT-02: Authoritative list of phases that require write access to the workdir.
 # Phase code MUST NOT pass a hard-coded "true"/"false" to invoke_agent; it must
 # call `phase_is_writable "<phase-name>"` instead. To add a new writable phase
@@ -408,11 +417,19 @@ invoke_claude() {
     tool_flags=(--disallowedTools "Edit,Write,Bash,Glob,Grep,WebSearch,WebFetch")
   fi
 
+  # v1.5: model + optional reasoning effort. CLAUDE_EFFORT defaults empty (OFF)
+  # so model_flags is exactly `--model "$CLAUDE_MODEL"` and argv is byte-identical
+  # to the prior single-flag form. The --effort element is appended only when set.
+  local -a model_flags=(--model "${CLAUDE_MODEL}")
+  if [[ -n "${CLAUDE_EFFORT:-}" ]]; then
+    model_flags+=(--effort "${CLAUDE_EFFORT}")
+  fi
+
   if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1; then
     # Under WSL, reuse the Windows Claude session because WSL and Windows keep separate auth state.
-    cmd=(cmd.exe /c claude -p --output-format text --model "${CLAUDE_MODEL}" "${tool_flags[@]}")
+    cmd=(cmd.exe /c claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}")
   else
-    cmd=(claude -p --output-format text --model "${CLAUDE_MODEL}" "${tool_flags[@]}")
+    cmd=(claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}")
   fi
 
   "${cmd[@]}" < "$prompt_file" > "$output_file" 2>"$stderr_file" || true
@@ -439,6 +456,12 @@ invoke_codex() {
     cmd+=(-c "model=${CODEX_MODEL}")
   fi
 
+  # v1.5: optional reasoning effort. Defaults empty (OFF) → no -c append, so argv
+  # is byte-identical to today. Appended only when CODEX_REASONING_EFFORT is set.
+  if [[ -n "${CODEX_REASONING_EFFORT:-}" ]]; then
+    cmd+=(-c "model_reasoning_effort=${CODEX_REASONING_EFFORT}")
+  fi
+
   if [[ -n "${windows_output:-}" ]]; then
     cmd+=(-o "$windows_output")
   else
@@ -448,6 +471,51 @@ invoke_codex() {
   "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true
 }
 
+# v1.5 (fixes B2): relocated verbatim from dev-review/codex/dev-review.sh so the
+# verify-phase `bash -c 'source lib/co-evolution.sh; invoke_codex_schema ...'`
+# child can actually find it (it was defined only in dev-review.sh → exit 127
+# whenever a timeout binary exists and verifier=codex). Behavior is identical to
+# the prior local copy (--output-schema / -o / --skip-git-repo-check / WSL path
+# handling), plus the same conditional model + reasoning-effort appends that
+# invoke_codex grew above — both default-off so argv is unchanged when unset.
+invoke_codex_schema() {
+  local prompt_file="$1"
+  local output_file="$2"
+  local stderr_file="$3"
+  local schema_file="$4"
+  local workdir="${WORKDIR:-$PWD}"
+  local -a cmd
+  local windows_workdir=""
+  local windows_output=""
+  local windows_schema=""
+
+  if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1 && command -v wslpath >/dev/null 2>&1; then
+    windows_workdir=$(wslpath -w "$workdir")
+    windows_output=$(wslpath -w "$output_file")
+    windows_schema=$(wslpath -w "$schema_file")
+    cmd=(cmd.exe /c codex exec --full-auto --skip-git-repo-check -C "$windows_workdir")
+  else
+    cmd=(codex exec --full-auto --skip-git-repo-check -C "$workdir")
+  fi
+
+  if [[ -n "${CODEX_MODEL:-}" ]]; then
+    cmd+=(-c "model=${CODEX_MODEL}")
+  fi
+
+  # v1.5: optional reasoning effort. Defaults empty (OFF) → no -c append.
+  if [[ -n "${CODEX_REASONING_EFFORT:-}" ]]; then
+    cmd+=(-c "model_reasoning_effort=${CODEX_REASONING_EFFORT}")
+  fi
+
+  if [[ -n "$windows_schema" ]]; then
+    cmd+=(--output-schema "$windows_schema" -o "$windows_output")
+  else
+    cmd+=(--output-schema "$schema_file" -o "$output_file")
+  fi
+
+  "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true
+}
+
 file_contains_auth_failure() {
   local file_path="$1"
 

From ffe765f23e7a684b1176a8dff279946975f6f5e1 Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 12:53:02 -0400
Subject: [PATCH 03/12] feat: claude-verifier verdict hardening + --preset
 codex-build (v1.5 Phase 2)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 dev-review/codex/dev-review.sh       | 104 ++++++-
 evals/RUNNER-CONTRACT.md             |   3 +-
 tests/preset-expansion-simulation.sh | 402 +++++++++++++++++++++++++++
 3 files changed, 504 insertions(+), 5 deletions(-)
 create mode 100644 tests/preset-expansion-simulation.sh

diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 7d7595f..0bf0ff7 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -30,6 +30,10 @@ WORKTREE_SPEC="${DEV_REVIEW_WORKTREE:-}"
 # v1.5: explicit verifier seat override. Empty = derive from executor via the
 # existing select_verifier logic (byte-parity default). Env default + --verifier flag.
 VERIFIER_OVERRIDE="${DEV_REVIEW_VERIFIER:-}"
+# v1.5: named seat preset (e.g. codex-build). Empty = no preset (byte-parity
+# default). Applied in-parser so AFTER-preset flags win (last-wins) and model/
+# effort values fill-if-empty so pre-set env vars win. See apply_preset().
+PRESET=""
 BRANCH_CREATED=""
 WORKTREE_PATH=""
 WORKDIR="$(pwd)"
@@ -80,6 +84,10 @@ Options:
   --model MODEL            Override Codex model
   --verifier AGENT         Force the verify-phase agent (codex|opus|claude); default derives from --executor
   --claude-model MODEL     Override Claude model (alias: fable -> claude-fable-5; else passthrough)
+  --preset NAME            Expand a named seat preset (available: codex-build).
+                           codex-build = Fable plans (high) + Codex executes (xhigh) + Fable reviews (max),
+                           --verify on, bounces 2, revise-loop 1. Precedence: flags placed AFTER --preset
+                           override it (last-wins); pre-set model/effort env vars win over the preset.
   --workdir DIR            Working directory (default: current directory)
   --timeout SECONDS        Per-phase timeout in seconds (default: 1800)
   --revise-loop N          Auto-retry on REVISE verdict up to N extra passes (default: 0 = disabled)
@@ -126,6 +134,26 @@ resolve_claude_model_alias() {
   esac
 }
 
+# v1.5: expand a named seat preset into the underlying seat/model/effort knobs.
+# Called from the --preset parser arm. Two precedence rules, both deliberate:
+#   - seat/structural knobs (COMPOSER, EXECUTOR, VERIFIER_OVERRIDE, VERIFY,
+#     BOUNCES, REVISE_LOOP_MAX) are set hard, so flags placed AFTER --preset
+#     override them via last-wins parsing.
+#   - model/effort knobs use fill-if-empty (`: "${VAR:=…}"`) so a value already
+#     present in the environment (e.g. COMPOSER_EFFORT=low) wins over the preset.
+apply_preset() {
+  case "$1" in
+    codex-build)   # "build with codex": Fable plans/reviews, Codex executes.
+      COMPOSER="opus"; EXECUTOR="codex"; VERIFIER_OVERRIDE="opus"
+      VERIFY=true; BOUNCES=2; REVISE_LOOP_MAX=1
+      : "${COMPOSER_MODEL:=fable}";  : "${COMPOSER_EFFORT:=high}"
+      : "${VERIFIER_MODEL:=fable}";  : "${VERIFIER_EFFORT:=max}"
+      : "${EXECUTOR_EFFORT:=xhigh}"   # codex model stays the CLI's configured default — deliberately unpinned
+      ;;
+    *) die "Unknown preset: $1 (available: codex-build)" ;;
+  esac
+}
+
 invoke_agent() {
   local agent="$1"
   shift
@@ -927,16 +955,31 @@ run_verify_phase() {
     return 2
   fi
 
-  verdict_data=$(normalize_json_artifact "$verdict_file" "$normalized_verdict_file") || {
-    log "WARNING: verifier output was unusable: ${verdict_data}. Review manually."
-    return 2
-  }
+  if ! verdict_data=$(normalize_json_artifact "$verdict_file" "$normalized_verdict_file"); then
+    # v1.5: claude verifiers (no --output-schema) sometimes wrap the verdict in
+    # prose, which normalize_json_artifact rejects. Try a brace-block extraction
+    # (same idiom as evals/judge-bounce.sh:149) before giving up. codex verdicts
+    # come from --output-schema and never hit this branch (they normalize clean).
+    sed -n '/^{/,/^}/p' "$verdict_file" > "$normalized_verdict_file"
+    if [[ ! -s "$normalized_verdict_file" ]]; then
+      log "WARNING: verifier output was unusable: ${verdict_data}. Review manually."
+      return 2
+    fi
+  fi
 
   verdict_data=$(validate_review_verdict "$normalized_verdict_file") || {
     log "WARNING: verifier output was unusable: ${verdict_data}. Review manually."
     return 2
   }
 
+  # v1.5: persist the normalized verdict back to the contract path so downstream
+  # consumers (evals/score-run.sh jq-parses verdict.json raw; mapping unparseable
+  # -> FAIL) read clean JSON. Byte-no-op for codex --output-schema verdicts, which
+  # are already clean. The dot-prefixed .verdict-normalized.json copy stays for the
+  # revise-loop read; cleanup_runtime_artifacts sweeps it but verdict.json (plain
+  # path, no leading dot) survives as the contract artifact.
+  cp "$normalized_verdict_file" "$verdict_file"
+
   eval "$verdict_data"
 
   review_status="$VERDICT"
@@ -961,6 +1004,15 @@ cleanup_runtime_artifacts() {
 
 while [[ $# -gt 0 ]]; do
   case "$1" in
+    --preset)
+      # v1.5: expand a named seat preset. Applied in-parser so flags placed
+      # AFTER --preset override it (last-wins) and pre-set model/effort env vars
+      # win over the preset's fill-if-empty defaults. See apply_preset().
+      [[ $# -gt 1 ]] || die "--preset requires a value"
+      PRESET="$2"
+      apply_preset "$2"
+      shift 2
+      ;;
     --composer)
       [[ $# -gt 1 ]] || die "--composer requires a value"
       COMPOSER=$(normalize_agent "$2") || die "Unsupported composer: $2"
@@ -1264,6 +1316,29 @@ apply_seat_env() {
   fi
 }
 
+# v1.5: resolve a seat's "agent:model@effort" descriptor using the SAME model/
+# effort precedence apply_seat_env uses, but without mutating the exported env.
+# Feeds the startup banner and the optional state.json seat_models fields.
+# model defaults to "(default)" when unpinned (codex's executor seat under the
+# codex-build preset); effort defaults to "(default)" when no effort is set.
+resolve_seat_model_string() {
+  local seat="$1" agent="$2" model="" effort="" model_str="" effort_str=""
+  case "$seat" in
+    composer) model="${COMPOSER_MODEL:-}"; effort="${COMPOSER_EFFORT:-}" ;;
+    executor) model="${EXECUTOR_MODEL:-}"; effort="${EXECUTOR_EFFORT:-}" ;;
+    verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;;
+    *) : ;;
+  esac
+  if [[ "$agent" == "codex" ]]; then
+    model_str="${model:-$CODEX_MODEL_BASE}"
+    effort_str="${effort:-$CODEX_EFFORT_BASE}"
+  else
+    model_str="$(resolve_claude_model_alias "${model:-$CLAUDE_MODEL_BASE}")"
+    effort_str="${effort:-$CLAUDE_EFFORT_BASE}"
+  fi
+  printf '%s:%s@%s' "$agent" "${model_str:-(default)}" "${effort_str:-(default)}"
+}
+
 # Phase 8.1 WR-04: honor --run-dir when set; otherwise preserve v1.2 default (byte-parity).
 if [[ -n "$RUN_DIR_OVERRIDE" ]]; then
   RUN_DIR="$RUN_DIR_OVERRIDE"
@@ -1286,12 +1361,30 @@ EXECUTE_DELTA_JSON="${RUN_DIR}/.execute-delta.json"
 RUN_ID="dev-review-${TIMESTAMP}"
 init_state_json "$STATE_JSON" "$RUN_ID" "$TASK" "$COMPOSER" "$EXECUTOR" "$REVIEWER"
 
+# v1.5: resolve the verifier seat and per-seat model/effort descriptors for the
+# banner + observability fields. _VERIFIER_SEAT reflects --verifier / executor
+# derivation (select_verifier); the seat strings reflect whatever the seats
+# resolve to under the per-seat env layer (apply_seat_env precedence).
+_VERIFIER_SEAT=$(select_verifier)
+_SEAT_COMPOSER=$(resolve_seat_model_string composer "$COMPOSER")
+_SEAT_EXECUTOR=$(resolve_seat_model_string executor "$EXECUTOR")
+_SEAT_VERIFIER=$(resolve_seat_model_string verifier "$_VERIFIER_SEAT")
+
+# v1.5: optional seat_models observability — records what each seat resolves to.
+# Unconditional + additive (no existing sim asserts an exact state.json key set,
+# so this does not break shape); cheap, and reflects the actual resolved seats.
+write_state_field "$STATE_JSON" ".seat_models.composer" "string" "$_SEAT_COMPOSER"
+write_state_field "$STATE_JSON" ".seat_models.executor" "string" "$_SEAT_EXECUTOR"
+write_state_field "$STATE_JSON" ".seat_models.verifier" "string" "$_SEAT_VERIFIER"
+
 log "============================================"
 log " DEV-REVIEW SESSION"
 log "============================================"
 log " Task:      $TASK"
+log " Preset:    ${PRESET:-<none>}"
 log " Composer:  $COMPOSER"
 log " Executor:  $EXECUTOR"
+log " Verifier:  $_VERIFIER_SEAT"
 log " Bounces:   $BOUNCES"
 log " Verify:    $VERIFY"
 log " Workdir:   $WORKDIR"
@@ -1300,6 +1393,9 @@ log " Timeout:   ${PHASE_TIMEOUT}s per phase"
 log " Live mode: $LIVE_MODE"
 log " Branch:    ${BRANCH_SPEC:-<empty>}"
 log " Worktree:  ${WORKTREE_SPEC:-<empty>}"
+log " Composer model: $_SEAT_COMPOSER"
+log " Executor model: $_SEAT_EXECUTOR"
+log " Verifier model: $_SEAT_VERIFIER"
 log "============================================"
 log ""
 
diff --git a/evals/RUNNER-CONTRACT.md b/evals/RUNNER-CONTRACT.md
index f8fe194..650f476 100644
--- a/evals/RUNNER-CONTRACT.md
+++ b/evals/RUNNER-CONTRACT.md
@@ -1,6 +1,6 @@
 ---
 title: Runner ↔ scorer contract
-version: 1.0
+version: 1.1
 status: locked
 owners:
   - dev-review/codex/dev-review.sh (runner)
@@ -43,6 +43,7 @@ historical fixture corpus under `runners/codex-ps/evals/tests/fixtures/**`.
 | `verify_verdict` | string \| null | nullable | runner (post-verify) | `score-run.sh:498` | Verify-phase verdict — `APPROVED` / `REVISE` / null |
 | `history` | array\<`{phase,status,detail,timestamp}`\> | yes | runner (per phase) | `score-run.sh:338` | Canonical bounce-trace array. Convergence needs ≥1 entry with `phase` matching `^bounce-[0-9]+$` when bounces were expected. |
 | `mode` | string | no | runner (optional tag) | — | Fake-runner sets `"fake-runner"`; real runner may set `"real"` or omit. Runner-owned metadata — NOT written by the shared `init_state_json` library. |
+| `seat_models` | object `{composer, executor, verifier}` (each string `agent:model@effort`) | no | `dev-review.sh` (post-init) | — | v1.5 observability: what each seat resolved to under the per-seat env layer (e.g. `"claude:claude-fable-5@high"`, `"codex:(default)@xhigh"`). Runner-owned metadata — NOT written by the shared `init_state_json` library; scorer ignores it. |
 
 ### Legacy alias: `phases` → `history`
 
diff --git a/tests/preset-expansion-simulation.sh b/tests/preset-expansion-simulation.sh
new file mode 100644
index 0000000..6cf10bc
--- /dev/null
+++ b/tests/preset-expansion-simulation.sh
@@ -0,0 +1,402 @@
+#!/usr/bin/env bash
+# tests/preset-expansion-simulation.sh
+# Hermetic gate for v1.5 Phase 2: --preset codex-build + claude-verifier
+# verdict hardening (dev-review/codex/dev-review.sh).
+#
+# The codex-build preset expands to the post's ladder — Fable plans (high),
+# Codex executes (xhigh), Fable reviews (max) — behind one flag. This gate
+# pins:
+#   a. seat argv under the preset (composer claude --model claude-fable-5
+#      --effort high; executor codex -c model_reasoning_effort=xhigh and NO
+#      -c model=; verifier claude --effort max).
+#   b. claude-verifier verdict hardening: a verdict wrapped in markdown
+#      fences/prose lands in verdict.json jq-parseable (brace-block fallback).
+#   c. last-wins parsing: --verifier codex after --preset wins.
+#   d. fill-if-empty: a pre-set COMPOSER_EFFORT env var beats the preset.
+#   e. unknown preset dies with the right message.
+#   f. --help mentions --preset.
+#   g. parity guard: a default run emits no --effort and no
+#      model_reasoning_effort anywhere.
+#
+# Pattern: PATH-injected claude + codex stubs append their full argv (one line
+# per invocation) to phase-agnostic logs; assertions grep the logs. Both stubs
+# drain stdin via a read-loop (NOT `cat`) to avoid SIGPIPE against the runner's
+# stdin producer under strict-mode bash (same discipline as the scorer Tier-4
+# stubs in evals/tests/scorer-verification.sh).
+
+set -uo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh"
+
+command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; }
+
+TEST_DIR="$(mktemp -d -t preset-expand-XXXXXX)"
+trap 'rm -rf "$TEST_DIR"' EXIT
+
+TOTAL=0
+FAILURES=0
+pass() { printf "PASS: %s\n" "$1"; }
+fail() { printf "FAIL: %s\n" "$1" >&2; FAILURES=$((FAILURES + 1)); }
+
+# --- hermetic git env (mirrors tests/run-all.sh so the scratch repo is
+# independent of the host's identity / autocrlf / hooks) ---------------------
+export GIT_CONFIG_GLOBAL="$TEST_DIR/gitconfig"
+export GIT_CONFIG_SYSTEM=/dev/null
+cat > "$TEST_DIR/gitconfig" <<'GITCFG'
+[user]
+	name = preset-sim
+	email = preset-sim@invalid.local
+[init]
+	defaultBranch = master
+[core]
+	autocrlf = false
+[commit]
+	gpgsign = false
+GITCFG
+
+# --- stub CLIs --------------------------------------------------------------
+mkdir -p "$TEST_DIR/bin"
+
+# claude stub: append full argv to $CLAUDE_ARGV_LOG (one line per invocation),
+# drain stdin via read-loop, then emit a phase-appropriate body. The verify
+# phase is detected by the review-prompt heading on stdin; for it the stub
+# emits a verdict whose JSON object lives at column 0 wrapped in prose + fences
+# (exercises the brace-block hardening fallback). All other phases emit a plan.
+cat > "$TEST_DIR/bin/claude" <<'STUB'
+#!/usr/bin/env bash
+for arg in "$@"; do
+  case "$arg" in
+    --version|-v|version) echo "claude 1.0.0 (preset-stub)"; exit 0 ;;
+  esac
+done
+[[ -n "${CLAUDE_ARGV_LOG:-}" ]] && printf '%s\n' "$*" >> "$CLAUDE_ARGV_LOG"
+# Drain stdin via read-loop (no SIGPIPE); capture it so we can detect the phase.
+stdin_capture=""
+while IFS= read -r _line; do stdin_capture+="$_line"$'\n'; done
+if [[ "$stdin_capture" == *"Verification Instructions"* ]]; then
+  # Verify phase: prose-wrapped, fenced verdict. The JSON object's braces sit at
+  # column 0 so the runner's `sed -n '/^{/,/^}/p'` brace-block fallback recovers
+  # clean JSON after normalize_json_artifact rejects the mixed fence+prose.
+  cat <<'VERDICT'
+Here is my assessment of the diff against the plan.
+
+```json
+{
+  "verdict": "APPROVED",
+  "confidence": 90,
+  "summary": "Stub verifier approves the hermetic change.",
+  "issues": []
+}
+```
+
+That concludes the review.
+VERDICT
+  exit 0
+fi
+# Compose / bounce phase: emit a minimal valid plan that clears the plan_quality
+# gate (validate_plan_artifact wants >= 60 words, >= 5 non-empty lines, >= 2
+# structural lines) so PLAN_EXIT stays 0 and the execute + verify phases run.
+cat <<'PLAN'
+# Preset Expansion Stub Plan
+
+## Approach
+
+This is a hermetic stub plan emitted by the preset-expansion simulation claude
+stub. It exists solely to satisfy the plan heading regex, the minimum word
+count, and the structural-line check so the compose phase validates and the run
+proceeds into execute and verify. It is not a real plan and the executor stub
+performs only a single harmless append to a tracked file so the run produces a
+diff for the verify phase to score. No external commands run and no network is
+touched anywhere in this hermetic path.
+
+## Files to Change
+
+- `touched.txt` — the executor stub appends a line so the run produces a diff.
+
+## Risks
+
+- None identified. Stubbed CLI produces a fixed plan with no side effects.
+PLAN
+STUB
+chmod +x "$TEST_DIR/bin/claude"
+
+# codex stub: append full argv to $CODEX_ARGV_LOG, parse -o FILE and -C DIR,
+# drain stdin, then (a) emit the same stub plan to -o FILE (compose/bounce) and
+# (b) make a real file change inside -C DIR so the execute phase yields a
+# non-empty diff (so verify actually runs). The execute phase is the one that
+# carries -C to the workdir; for it we append a line to a tracked file.
+cat > "$TEST_DIR/bin/codex" <<'STUB'
+#!/usr/bin/env bash
+for arg in "$@"; do
+  case "$arg" in
+    --version|-v|version) echo "codex 0.117.0 (preset-stub)"; exit 0 ;;
+  esac
+done
+[[ -n "${CODEX_ARGV_LOG:-}" ]] && printf '%s\n' "$*" >> "$CODEX_ARGV_LOG"
+output_file=""; workdir=""; prev=""
+for arg in "$@"; do
+  case "$prev" in
+    -o) output_file="$arg" ;;
+    -C) workdir="$arg" ;;
+  esac
+  prev="$arg"
+done
+# Capture stdin (no SIGPIPE) so we can detect the execute phase by its prompt.
+stdin_capture=""
+while IFS= read -r _line; do stdin_capture+="$_line"$'\n'; done
+# Only the EXECUTE phase may mutate the workdir — bounce passes (also codex
+# under the preset, since composer=opus → reviewer=codex) run BEFORE the runner
+# snapshots INITIAL_GIT_STATUS, so a bounce-time edit would make execute look
+# like a no-op ("no changes detected"). Gate on the execute-prompt heading.
+if [[ "$stdin_capture" == *"Execution Instructions (Codex)"* \
+      && -n "$workdir" && -f "$workdir/touched.txt" ]]; then
+  printf 'codex-stub-change\n' >> "$workdir/touched.txt"
+fi
+plan_content=$(cat <<'PLAN'
+# Preset Expansion Stub Plan
+
+## Approach
+
+This is a hermetic stub plan emitted by the preset-expansion simulation codex
+stub. It satisfies the plan heading regex, the minimum word count, and the
+structural-line check so the compose phase validates and the run proceeds into
+execute and verify. It is not a real plan and the only side effect is a single
+harmless append to a tracked file so the run produces a diff for the verify
+phase to score. No external commands run and no network is touched anywhere.
+
+## Files to Change
+
+- `touched.txt` — appended to by the executor stub.
+
+## Risks
+
+- None identified. Stubbed codex produces a fixed plan with no side effects.
+PLAN
+)
+if [[ -n "$output_file" ]]; then
+  printf '%s\n' "$plan_content" > "$output_file"
+else
+  printf '%s\n' "$plan_content"
+fi
+exit 0
+STUB
+chmod +x "$TEST_DIR/bin/codex"
+
+# --- helpers ----------------------------------------------------------------
+
+# Make a fresh scratch git repo with one tracked file. Echoes the repo path.
+make_scratch_repo() {
+  local repo
+  repo="$TEST_DIR/repo-$1"
+  rm -rf "$repo"
+  mkdir -p "$repo"
+  git -C "$repo" init -q
+  printf 'baseline\n' > "$repo/touched.txt"
+  git -C "$repo" add -A
+  git -C "$repo" commit -qm "baseline"
+  printf '%s' "$repo"
+}
+
+# A pre-existing plan fixture so scenarios that don't need to test compose can
+# use --skip-plan and run faster. Scenario (a) deliberately does NOT use it (it
+# must exercise compose to capture the composer argv).
+PLAN_FIXTURE="$TEST_DIR/plan-fixture.md"
+cat > "$PLAN_FIXTURE" <<'PLAN'
+# Skip-Plan Fixture
+
+## Approach
+
+A pre-baked plan so the --skip-plan scenarios bypass the compose phase entirely
+and drive only the execute and verify phases. It satisfies the plan heading
+regex, the minimum word count, and the structural-line check. The executor stub
+appends one harmless line to a tracked file so the run produces a diff for the
+verify phase to score, and nothing else happens in this hermetic path.
+
+## Files to Change
+
+- `touched.txt` — the executor stub appends a line.
+
+## Risks
+
+- None identified.
+PLAN
+
+# ===========================================================================
+# Scenario (a): --preset codex-build → seat argv (exercises compose).
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo a)
+claude_log="$TEST_DIR/a_claude.log"; codex_log="$TEST_DIR/a_codex.log"
+: > "$claude_log"; : > "$codex_log"
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  bash "$RUNNER" --preset codex-build --workdir "$repo" --timeout 60 -- "preset seat argv probe"
+) >"$TEST_DIR/a.out" 2>&1 || true
+
+a_ok=true
+# composer = claude with the fable model alias resolved + high effort.
+if ! grep -Eq -- '--model claude-fable-5' "$claude_log"; then a_ok=false; fi
+if ! grep -Eq -- '--effort high' "$claude_log"; then a_ok=false; fi
+# executor = codex with xhigh effort and NO pinned model.
+if ! grep -Eq -- '-c model_reasoning_effort=xhigh' "$codex_log"; then a_ok=false; fi
+if grep -Eq -- '-c model=' "$codex_log"; then a_ok=false; fi
+# verifier = claude with max effort.
+if ! grep -Eq -- '--effort max' "$claude_log"; then a_ok=false; fi
+if [[ "$a_ok" == true ]]; then
+  pass "preset codex-build: composer fable/high, executor codex/xhigh (no model=), verifier claude/max"
+else
+  fail "preset codex-build seat argv mismatch (claude_log + codex_log below)"
+  { echo "--- claude argv ---"; cat "$claude_log"; echo "--- codex argv ---"; cat "$codex_log"; } >&2
+fi
+
+# ===========================================================================
+# Scenario (b): claude-verifier verdict hardening — fenced/prose verdict ends
+# up jq-parseable at verdict.json. Reuses the run dir from scenario (a) (same
+# preset run already produced a claude verdict through the hardening path).
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+# Find the run dir the scenario-(a) run created (newest under runs/).
+a_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1)
+b_ok=false
+if [[ -n "$a_run_dir" && -f "$a_run_dir/verdict.json" ]]; then
+  if jq -e '.verdict == "APPROVED" and (.confidence | type) == "number" and has("summary") and has("issues")' \
+       "$a_run_dir/verdict.json" >/dev/null 2>&1; then
+    b_ok=true
+  fi
+fi
+if [[ "$b_ok" == true ]]; then
+  pass "verdict hardening: prose/fence-wrapped verdict.json is jq-parseable with expected fields"
+else
+  fail "verdict.json not jq-parseable after hardening (run dir: ${a_run_dir:-<none>})"
+  [[ -n "${a_run_dir:-}" && -f "$a_run_dir/verdict.json" ]] && { echo "--- verdict.json ---"; cat "$a_run_dir/verdict.json"; } >&2
+fi
+# Clean up the runs/ artifacts this gate created so it leaves no side effects.
+[[ -n "${a_run_dir:-}" ]] && rm -rf "$a_run_dir"
+
+# ===========================================================================
+# Scenario (c): --preset codex-build --verifier codex → verifier is codex
+# (last-wins parsing; flags after --preset override it).
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo c)
+codex_log="$TEST_DIR/c_codex.log"; claude_log="$TEST_DIR/c_claude.log"
+: > "$codex_log"; : > "$claude_log"
+run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  bash "$RUNNER" --preset codex-build --verifier codex --skip-plan --plan "$PLAN_FIXTURE" \
+    --workdir "$repo" --timeout 60 -- "preset verifier override probe"
+) >"$TEST_DIR/c.out" 2>&1 || true
+c_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1)
+c_ok=false
+if [[ -n "$c_run_dir" && "$c_run_dir" != "$run_dir_before" ]]; then
+  # state.json seat_models.verifier must report codex; banner Verifier line too.
+  if jq -e '.seat_models.verifier | startswith("codex:")' "$c_run_dir/state.json" >/dev/null 2>&1 \
+     && grep -Eq '^ Verifier:  codex' "$c_run_dir/run.log"; then
+    c_ok=true
+  fi
+fi
+if [[ "$c_ok" == true ]]; then
+  pass "preset codex-build --verifier codex: verifier seat is codex (last-wins)"
+else
+  fail "verifier override did not win over preset (run dir: ${c_run_dir:-<none>})"
+  [[ -n "${c_run_dir:-}" ]] && grep -E '^ Verifier:' "$c_run_dir/run.log" >&2 || true
+fi
+[[ -n "${c_run_dir:-}" && "$c_run_dir" != "$run_dir_before" ]] && rm -rf "$c_run_dir"
+
+# ===========================================================================
+# Scenario (d): COMPOSER_EFFORT=low pre-set + preset → composer gets
+# --effort low (fill-if-empty: env beats the preset's high default).
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo d)
+claude_log="$TEST_DIR/d_claude.log"; codex_log="$TEST_DIR/d_codex.log"
+: > "$claude_log"; : > "$codex_log"
+run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export COMPOSER_EFFORT="low"
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  bash "$RUNNER" --preset codex-build --workdir "$repo" --timeout 60 -- "preset composer effort env probe"
+) >"$TEST_DIR/d.out" 2>&1 || true
+d_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1)
+if grep -Eq -- '--effort low' "$claude_log" && ! grep -Eq -- '--effort high' "$claude_log"; then
+  pass "preset + COMPOSER_EFFORT=low: composer gets --effort low (fill-if-empty env wins)"
+else
+  fail "COMPOSER_EFFORT env did not beat preset high (claude argv below)"
+  cat "$claude_log" >&2
+fi
+[[ -n "${d_run_dir:-}" && "$d_run_dir" != "$run_dir_before" ]] && rm -rf "$d_run_dir"
+
+# ===========================================================================
+# Scenario (e): --preset bogus dies with the unknown-preset message.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+e_out=$(bash "$RUNNER" --preset bogus -- "x" 2>&1 || true)
+if printf '%s' "$e_out" | grep -Eq 'Unknown preset: bogus'; then
+  pass "unknown preset dies with the unknown-preset message"
+else
+  fail "unknown preset did not produce the expected die message (got: $e_out)"
+fi
+
+# ===========================================================================
+# Scenario (f): --help mentions --preset.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+help_out=$(bash "$RUNNER" --help 2>/dev/null || true)
+if printf '%s' "$help_out" | grep -Eq -- '--preset'; then
+  pass "--help mentions --preset"
+else
+  fail "--help does not mention --preset"
+fi
+
+# ===========================================================================
+# Scenario (g): parity guard — a default run (no preset, no new envs) emits no
+# --effort (claude) and no model_reasoning_effort (codex) anywhere.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo g)
+claude_log="$TEST_DIR/g_claude.log"; codex_log="$TEST_DIR/g_codex.log"
+: > "$claude_log"; : > "$codex_log"
+run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  # Default flavor: codex composes + executes, opus verifies. --verify so a
+  # claude verdict path runs too. No preset, no effort knobs.
+  bash "$RUNNER" --verify --bounces 0 --skip-plan --plan "$PLAN_FIXTURE" \
+    --workdir "$repo" --timeout 60 -- "parity guard probe"
+) >"$TEST_DIR/g.out" 2>&1 || true
+g_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1)
+g_ok=true
+if grep -Eq -- '--effort ' "$claude_log"; then g_ok=false; fi
+if grep -Eq -- 'model_reasoning_effort=' "$codex_log"; then g_ok=false; fi
+if [[ "$g_ok" == true ]]; then
+  pass "parity guard: default run emits no --effort and no model_reasoning_effort"
+else
+  fail "parity guard: an effort flag leaked into a default run (logs below)"
+  { echo "--- claude argv ---"; cat "$claude_log"; echo "--- codex argv ---"; cat "$codex_log"; } >&2
+fi
+[[ -n "${g_run_dir:-}" && "$g_run_dir" != "$run_dir_before" ]] && rm -rf "$g_run_dir"
+
+# --- summary ----------------------------------------------------------------
+passed=$((TOTAL - FAILURES))
+if (( FAILURES == 0 )); then
+  echo "$passed/$TOTAL scenarios passed"
+  exit 0
+else
+  echo "$passed/$TOTAL scenarios passed ($FAILURES failed)" >&2
+  exit 1
+fi

From 13c2beea4f2f7cec5a6d6dbcf4d08b97b2752833 Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 13:07:02 -0400
Subject: [PATCH 04/12] =?UTF-8?q?feat:=20runner=20observability=20?=
 =?UTF-8?q?=E2=80=94=20current=5Fphase,=20runner=5Fpid,=20shas,=20lineage?=
 =?UTF-8?q?=20+=20status=20reader=20(v1.5=20Phase=203)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 dev-review/codex/dev-review-status.sh       | 362 ++++++++++++++++++++
 dev-review/codex/dev-review.sh              |  67 ++++
 evals/RUNNER-CONTRACT.md                    |   7 +-
 lib/co-evolution.sh                         |  29 ++
 tests/observability-lifecycle-simulation.sh | 292 ++++++++++++++++
 tests/status-reader-simulation.sh           | 203 +++++++++++
 6 files changed, 959 insertions(+), 1 deletion(-)
 create mode 100755 dev-review/codex/dev-review-status.sh
 create mode 100644 tests/observability-lifecycle-simulation.sh
 create mode 100644 tests/status-reader-simulation.sh

diff --git a/dev-review/codex/dev-review-status.sh b/dev-review/codex/dev-review-status.sh
new file mode 100755
index 0000000..540584b
--- /dev/null
+++ b/dev-review/codex/dev-review-status.sh
@@ -0,0 +1,362 @@
+#!/usr/bin/env bash
+# dev-review/codex/dev-review-status.sh
+# v1.5 Phase 3 — read-only status reader for dev-review runner runs.
+#
+# A Claude Code session kicks the runner via a background Bash task and ENDS
+# ITS TURN. On wake (or via a cheap watcher, or a human) it must reconstruct
+# run state purely from disk. This script does exactly that — it writes NOTHING
+# and never touches the runner. It reports run summary, phase progress, a
+# heartbeat derived from the in-flight phase's stderr-log mtime/size, marker
+# counts, verdict, diffstat, and a liveness assessment with an actionable line.
+#
+# Usage:
+#   dev-review-status.sh [--json] [--list] [RUN_ID|RUN_DIR]
+#     (no positional)  → latest runs/dev-review-* by mtime
+#     --list           → one summary line per non-terminal run + 3 newest terminal
+#     RUN_ID           → resolved against <repo>/runs/RUN_ID
+#     RUN_DIR          → an absolute/relative dir path, used as-is
+#
+# Exit codes (liveness assessment):
+#   0  terminal-completed   (status=completed)
+#   2  terminal-partial     (status=partial/failed)
+#   5  still-running        (non-terminal + pid alive OR heartbeat fresh <120s)
+#   4  presumed-dead        (non-terminal + pid gone + heartbeat stale)
+#   3  run not found / no state.json / bad argument
+#
+# Reader, not runner: requires jq and dies (exit 3) without it.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# Repo root: dev-review/codex/ -> repo (two levels up).
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+RUNS_DIR="${CO_EVOLVE_RUNS_DIR:-$REPO_ROOT/runs}"
+
+HEARTBEAT_FRESH_SECS=120
+
+die() { printf 'dev-review-status: %s\n' "$1" >&2; exit 3; }
+
+command -v jq >/dev/null 2>&1 || die "jq is required (this is a reader, not the runner)"
+
+# --- portable file mtime (epoch) --------------------------------------------
+# Feature-detect GNU stat (supports --version) vs BSD stat, mirroring the
+# GNU-find detection convention in lib/co-evolution.sh (list_available_lab_modes).
+if stat --version >/dev/null 2>&1; then
+  _file_mtime() { stat -c %Y "$1" 2>/dev/null; }   # GNU
+else
+  _file_mtime() { stat -f %m "$1" 2>/dev/null; }   # BSD/macOS
+fi
+
+now_epoch() { date +%s; }
+
+# --- argument parsing --------------------------------------------------------
+JSON=false
+LIST=false
+ARG=""
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --json) JSON=true; shift ;;
+    --list) LIST=true; shift ;;
+    -h|--help)
+      sed -n '5,30p' "${BASH_SOURCE[0]}" | sed 's/^# \{0,1\}//'
+      exit 0
+      ;;
+    --*) die "unknown flag: $1" ;;
+    *)   ARG="$1"; shift ;;
+  esac
+done
+
+# --- run-dir resolution ------------------------------------------------------
+# A value containing a slash (or that resolves to an existing dir) is a path;
+# otherwise treat it as a RUN_ID under $RUNS_DIR.
+resolve_run_dir() {
+  local arg="$1"
+  if [[ -z "$arg" ]]; then
+    ls -dt "$RUNS_DIR"/dev-review-* 2>/dev/null | head -1
+    return 0
+  fi
+  if [[ -d "$arg" ]]; then
+    printf '%s' "$arg"
+    return 0
+  fi
+  if [[ "$arg" == */* ]]; then
+    # Looks like a path but does not exist.
+    printf '%s' "$arg"
+    return 0
+  fi
+  printf '%s' "$RUNS_DIR/$arg"
+}
+
+# --- map a phase name to its stderr-log path (heartbeat source) --------------
+# Mirrors the exact filenames the runner writes:
+#   compose      -> compose-stderr.log
+#   bounce-NN    -> pass-N-stderr.log   (N un-padded; runner names the file with
+#                                        the raw pass int, the phase with NN)
+#   execute[-N]  -> execute-stderr.log  (reused across revise passes)
+#   verify[-N]   -> review-stderr.log   (reused across revise passes)
+phase_stderr_file() {
+  local run_dir="$1" phase="$2"
+  case "$phase" in
+    compose)      printf '%s' "$run_dir/compose-stderr.log" ;;
+    bounce-*)
+      local nn="${phase#bounce-}"
+      nn=$((10#$nn))   # strip leading zero: bounce-01 -> 1
+      printf '%s' "$run_dir/pass-${nn}-stderr.log"
+      ;;
+    execute|execute-*) printf '%s' "$run_dir/execute-stderr.log" ;;
+    verify|verify-*)   printf '%s' "$run_dir/review-stderr.log" ;;
+    *) printf '' ;;
+  esac
+}
+
+# --- one-line summary for --list ---------------------------------------------
+summarize_one() {
+  local run_dir="$1"
+  local state="$run_dir/state.json"
+  local id status phase
+  id=$(jq -r '.run_id // "?"' "$state" 2>/dev/null)
+  status=$(jq -r '.status // "null"' "$state" 2>/dev/null)
+  phase=$(jq -r '.current_phase.name // "-"' "$state" 2>/dev/null)
+  printf '%-32s  status=%-9s  phase=%s\n' "$id" "$status" "$phase"
+}
+
+if [[ "$LIST" == true ]]; then
+  shopt -s nullglob
+  dirs=("$RUNS_DIR"/dev-review-*)
+  shopt -u nullglob
+  if [[ ${#dirs[@]} -eq 0 ]]; then
+    echo "(no dev-review runs under $RUNS_DIR)"
+    exit 3
+  fi
+  # newest first. bash-3.2 portable (macOS /bin/bash has no mapfile); read the
+  # ls -t output line by line into the array.
+  sorted=()
+  while IFS= read -r _d; do
+    [[ -n "$_d" ]] && sorted+=("$_d")
+  done < <(ls -dt "$RUNS_DIR"/dev-review-* 2>/dev/null)
+  echo "ACTIVE (non-terminal: pending/null):"
+  active_found=false
+  for d in "${sorted[@]}"; do
+    [[ -f "$d/state.json" ]] || continue
+    st=$(jq -r '.status // "null"' "$d/state.json" 2>/dev/null)
+    if [[ "$st" == "pending" || "$st" == "null" ]]; then
+      summarize_one "$d"; active_found=true
+    fi
+  done
+  [[ "$active_found" == false ]] && echo "  (none)"
+  echo
+  echo "RECENT (3 newest terminal):"
+  shown=0
+  for d in "${sorted[@]}"; do
+    [[ -f "$d/state.json" ]] || continue
+    st=$(jq -r '.status // "null"' "$d/state.json" 2>/dev/null)
+    if [[ "$st" != "pending" && "$st" != "null" ]]; then
+      summarize_one "$d"; shown=$((shown + 1))
+      [[ "$shown" -ge 3 ]] && break
+    fi
+  done
+  [[ "$shown" -eq 0 ]] && echo "  (none)"
+  exit 0
+fi
+
+RUN_DIR=$(resolve_run_dir "$ARG")
+[[ -n "$RUN_DIR" && -d "$RUN_DIR" ]] || die "run not found: ${ARG:-<latest>} (looked under $RUNS_DIR)"
+STATE="$RUN_DIR/state.json"
+[[ -f "$STATE" ]] || die "no state.json in $RUN_DIR"
+jq -e . "$STATE" >/dev/null 2>&1 || die "state.json is not valid JSON: $STATE"
+
+# --- read fields -------------------------------------------------------------
+RUN_ID=$(jq -r '.run_id // "?"' "$STATE")
+STATUS=$(jq -r '.status // "null"' "$STATE")
+TASK=$(jq -r '.task // ""' "$STATE" | head -c 100)
+RUNNER_PID=$(jq -r '.runner_pid // empty' "$STATE")
+CUR_PHASE=$(jq -r '.current_phase.name // empty' "$STATE")
+CUR_STARTED=$(jq -r '.current_phase.started_at // empty' "$STATE")
+PARENT_RUN=$(jq -r '.orchestration.parent_run_id // empty' "$STATE")
+VERDICT=$(jq -r '.verify_verdict // empty' "$STATE")
+PRE_SHA=$(jq -r '.pre_execute_sha // empty' "$STATE")
+POST_SHA=$(jq -r '.post_execute_sha // empty' "$STATE")
+M_CONTESTED=$(jq -r '.marker_counts.contested // 0' "$STATE")
+M_CLARIFY=$(jq -r '.marker_counts.clarify // 0' "$STATE")
+
+# --- runner liveness via kill -0 ---------------------------------------------
+# yes  = pid present and alive; no = pid present and gone; unknown = no pid.
+RUNNER_ALIVE="unknown"
+if [[ -n "$RUNNER_PID" ]]; then
+  if kill -0 "$RUNNER_PID" 2>/dev/null; then
+    RUNNER_ALIVE="yes"
+  else
+    RUNNER_ALIVE="no"
+  fi
+fi
+
+# --- heartbeat from the in-flight phase's stderr log -------------------------
+HB_AGE=""        # seconds since last write (empty if no current phase / no file)
+HB_LINES=""      # line count of the stderr file
+HB_FILE=""
+if [[ -n "$CUR_PHASE" ]]; then
+  HB_FILE=$(phase_stderr_file "$RUN_DIR" "$CUR_PHASE")
+  if [[ -n "$HB_FILE" && -f "$HB_FILE" ]]; then
+    mt=$(_file_mtime "$HB_FILE")
+    if [[ -n "$mt" ]]; then
+      HB_AGE=$(( $(now_epoch) - mt ))
+    fi
+    HB_LINES=$(wc -l < "$HB_FILE" | tr -d ' ')
+  fi
+fi
+
+# --- liveness assessment + exit code -----------------------------------------
+# Terminal states short-circuit. Non-terminal: alive pid OR fresh heartbeat
+# => still-running; pid gone + stale/absent heartbeat => presumed-dead.
+ASSESS=""
+EXIT_CODE=0
+case "$STATUS" in
+  completed)
+    ASSESS="DONE: run completed cleanly."
+    EXIT_CODE=0
+    ;;
+  partial|failed)
+    ASSESS="PARTIAL: run reached a terminal non-completed status ($STATUS) — review verdict/diffstat before acting."
+    EXIT_CODE=2
+    ;;
+  *)
+    # Non-terminal (pending / null / unexpected). Decide running vs dead.
+    hb_fresh=false
+    if [[ -n "$HB_AGE" && "$HB_AGE" -lt "$HEARTBEAT_FRESH_SECS" ]]; then
+      hb_fresh=true
+    fi
+    if [[ "$RUNNER_ALIVE" == "yes" || "$hb_fresh" == true ]]; then
+      ASSESS="RUNNING: runner alive ($RUNNER_ALIVE) / heartbeat ${HB_AGE:-?}s — in phase ${CUR_PHASE:-?}. Leave it; re-check on wake."
+      EXIT_CODE=5
+    else
+      ASSESS="DEAD: runner gone mid-phase (${CUR_PHASE:-?}) — escalate or re-kick with --skip-plan --plan ${RUN_DIR}/plan.md"
+      EXIT_CODE=4
+    fi
+    ;;
+esac
+
+# --- diffstat tail + execute stderr tail (paths) -----------------------------
+DIFFSTAT_FILE="$RUN_DIR/execute-diffstat.txt"
+EXEC_STDERR="$RUN_DIR/execute-stderr.log"
+VERDICT_JSON="$RUN_DIR/verdict.json"
+
+# ============================================================================
+# JSON mode — one machine-readable object for the orchestrating skill.
+# ============================================================================
+if [[ "$JSON" == true ]]; then
+  diffstat_tail=""
+  [[ -f "$DIFFSTAT_FILE" ]] && diffstat_tail=$(tail -5 "$DIFFSTAT_FILE")
+  exec_tail=""
+  [[ -f "$EXEC_STDERR" ]] && exec_tail=$(tail -5 "$EXEC_STDERR")
+  verdict_present=false
+  [[ -f "$VERDICT_JSON" ]] && verdict_present=true
+
+  jq -n \
+    --arg run_id "$RUN_ID" \
+    --arg run_dir "$RUN_DIR" \
+    --arg status "$STATUS" \
+    --arg task "$TASK" \
+    --arg runner_pid "${RUNNER_PID:-}" \
+    --arg runner_alive "$RUNNER_ALIVE" \
+    --arg current_phase "${CUR_PHASE:-}" \
+    --arg current_started "${CUR_STARTED:-}" \
+    --arg heartbeat_age "${HB_AGE:-}" \
+    --arg heartbeat_lines "${HB_LINES:-}" \
+    --arg heartbeat_file "${HB_FILE:-}" \
+    --arg parent_run "${PARENT_RUN:-}" \
+    --arg verdict "${VERDICT:-}" \
+    --arg verdict_json "$([[ -f "$VERDICT_JSON" ]] && printf '%s' "$VERDICT_JSON")" \
+    --argjson verdict_present "$verdict_present" \
+    --arg pre_sha "${PRE_SHA:-}" \
+    --arg post_sha "${POST_SHA:-}" \
+    --argjson contested "$M_CONTESTED" \
+    --argjson clarify "$M_CLARIFY" \
+    --arg diffstat_tail "$diffstat_tail" \
+    --arg execute_stderr_tail "$exec_tail" \
+    --arg assess "$ASSESS" \
+    --argjson exit_code "$EXIT_CODE" \
+    --slurpfile phases <(jq '.phases // []' "$STATE") \
+    '{
+      run_id: $run_id,
+      run_dir: $run_dir,
+      status: $status,
+      task: $task,
+      runner_pid: (if $runner_pid == "" then null else ($runner_pid|tonumber) end),
+      runner_alive: $runner_alive,
+      current_phase: (if $current_phase == "" then null else {name: $current_phase, started_at: $current_started} end),
+      heartbeat: {
+        age_secs: (if $heartbeat_age == "" then null else ($heartbeat_age|tonumber) end),
+        lines: (if $heartbeat_lines == "" then null else ($heartbeat_lines|tonumber) end),
+        file: (if $heartbeat_file == "" then null else $heartbeat_file end)
+      },
+      orchestration: {parent_run_id: (if $parent_run == "" then null else $parent_run end)},
+      verdict: (if $verdict == "" then null else $verdict end),
+      verdict_json: (if $verdict_json == "" then null else $verdict_json end),
+      verdict_present: $verdict_present,
+      pre_execute_sha: (if $pre_sha == "" then null else $pre_sha end),
+      post_execute_sha: (if $post_sha == "" then null else $post_sha end),
+      marker_counts: {contested: $contested, clarify: $clarify},
+      phases: $phases[0],
+      diffstat_tail: $diffstat_tail,
+      execute_stderr_tail: $execute_stderr_tail,
+      assess: $assess,
+      exit_code: $exit_code
+    }'
+  exit "$EXIT_CODE"
+fi
+
+# ============================================================================
+# Human mode.
+# ============================================================================
+printf '== dev-review run: %s ==\n' "$RUN_ID"
+printf 'run dir:       %s\n' "$RUN_DIR"
+printf 'status:        %s\n' "$STATUS"
+printf 'runner pid:    %s (alive: %s)\n' "${RUNNER_PID:-<none>}" "$RUNNER_ALIVE"
+[[ -n "$PARENT_RUN" ]] && printf 'parent run:    %s\n' "$PARENT_RUN"
+printf 'task:          %s\n' "$TASK"
+
+if [[ -n "$CUR_PHASE" ]]; then
+  printf 'current phase: %s (started %s)\n' "$CUR_PHASE" "${CUR_STARTED:-?}"
+  if [[ -n "$HB_AGE" ]]; then
+    printf 'heartbeat:     %ss ago, %s lines (%s)\n' "$HB_AGE" "${HB_LINES:-0}" "$HB_FILE"
+  else
+    printf 'heartbeat:     (no stderr activity yet for %s)\n' "$CUR_PHASE"
+  fi
+else
+  printf 'current phase: <none in flight>\n'
+fi
+
+# Completed phases table.
+phase_rows=$(jq -r '.phases // [] | .[] | "  \(.name)\t\(.status)\texit=\(.exit_code)\t\(.completed_at // "")"' "$STATE" 2>/dev/null)
+if [[ -n "$phase_rows" ]]; then
+  printf 'phases:\n%s\n' "$phase_rows"
+else
+  printf 'phases:        (none recorded yet)\n'
+fi
+
+printf 'markers:       contested=%s clarify=%s\n' "$M_CONTESTED" "$M_CLARIFY"
+
+if [[ -n "$PRE_SHA" || -n "$POST_SHA" ]]; then
+  printf 'execute SHAs:  pre=%s post=%s\n' "${PRE_SHA:-<none>}" "${POST_SHA:-<none>}"
+fi
+
+if [[ -n "$VERDICT" ]]; then
+  printf 'verdict:       %s' "$VERDICT"
+  [[ -f "$VERDICT_JSON" ]] && printf '  (%s)' "$VERDICT_JSON"
+  printf '\n'
+elif [[ -f "$VERDICT_JSON" ]]; then
+  printf 'verdict:       (verdict.json present: %s)\n' "$VERDICT_JSON"
+fi
+
+if [[ -f "$DIFFSTAT_FILE" ]]; then
+  printf 'diffstat (tail):\n'
+  tail -5 "$DIFFSTAT_FILE" | sed 's/^/  /'
+fi
+
+if [[ -f "$EXEC_STDERR" ]]; then
+  printf 'execute stderr (last 5 lines):\n'
+  tail -5 "$EXEC_STDERR" | sed 's/^/  /'
+fi
+
+printf 'ASSESS: %s\n' "$ASSESS"
+exit "$EXIT_CODE"
diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 0bf0ff7..583d8a9 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -67,6 +67,9 @@ FLAVOR_OVERRIDE=""
 # (byte-parity invariant SC-5 / D-06). Harness-side passes an absolute path rooted under
 # the eval fixture's .co-evolution/runs/ subtree.
 RUN_DIR_OVERRIDE=""
+# v1.5 Phase 3: lineage token for orchestrated re-kicks. Empty = standalone run
+# (byte-parity: .orchestration is omitted). Validated as a safe fs-ish token.
+PARENT_RUN_ID=""
 
 usage() {
   cat <<'EOF'
@@ -94,6 +97,7 @@ Options:
   --live                   Launch visible Windows terminal tailing each phase's stderr (Windows-only; warns + falls back on other OS)
   --branch auto|NAME       Create a feature branch off HEAD before execute (auto = dev-review/auto-<timestamp>-<slug>); mutually exclusive with --worktree
   --worktree auto|PATH     Create a git worktree for isolation before execute (auto = sibling dir); mutually exclusive with --branch
+  --parent-run RUN_ID      Lineage tag: record the orchestrator's parent run id in state.orchestration.parent_run_id (re-kicks always get a fresh run dir; no behavior change)
   --lab MODE               Route to lab/<MODE>/entry.sh (opt-in beta channel; see lab/README.md)
   --target FILE            PEL-only: file to mutate (used with --lab pel-proposer; must be repo-relative forward-slash path, e.g. lib/co-evolution.sh — NOT absolute or WSL/Windows-style)
   --tier TIER              PEL-only: override tier auto-detect (template|policy|code)
@@ -597,6 +601,8 @@ Output ONLY the plan document. No preamble."
   # abort_on_timeout will fire from main flow if the dispatcher reports 124.
   # RTUX-01: Launch a live-tail window before the agent call (no-op unless --live).
   maybe_launch_live_window "compose" "$compose_stderr_file"
+  # v1.5 Phase 3: mark the compose phase as STARTING (status reader heartbeat).
+  begin_state_phase "$STATE_JSON" "compose"
   # v1.5: layer the composer seat's model/effort (no-op when no per-seat env set).
   apply_seat_env composer "$COMPOSER"
   invoke_agent_with_timeout "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$(phase_is_writable compose)"
@@ -689,6 +695,9 @@ run_bounce_phase() {
     # so state.json records which pass hit the timeout.
     # RTUX-01: Per-pass live-tail window (bounce-01, bounce-02, ...) when --live.
     maybe_launch_live_window "bounce-${pass_padded}" "$stderr_file"
+    # v1.5 Phase 3: mark this bounce pass as STARTING (matches the bounce-NN
+    # name write_state_phase records on completion).
+    begin_state_phase "$STATE_JSON" "bounce-${pass_padded}"
     # v1.5: composer turns get the composer seat's model/effort; reviewer (the
     # bounce counterparty) turns use globals only (the `bounce` seat is a no-op
     # arm in apply_seat_env). No-op overall when no per-seat env is set.
@@ -775,6 +784,17 @@ run_execute_phase() {
     PRE_EXECUTE_SHA=$(git -C "$WORKDIR" rev-parse HEAD 2>/dev/null || true)
   fi
 
+  # v1.5 Phase 3: record the workdir HEAD just before the executor runs so a
+  # reviewer can pin the exact pre-change commit. Non-git / detached / failed
+  # rev-parse yields an empty PRE_EXECUTE_SHA → write null (never die).
+  if [[ -n "${STATE_JSON:-}" ]]; then
+    if [[ -n "$PRE_EXECUTE_SHA" ]]; then
+      write_state_field "$STATE_JSON" ".pre_execute_sha" "string" "$PRE_EXECUTE_SHA"
+    else
+      write_state_field "$STATE_JSON" ".pre_execute_sha" "null"
+    fi
+  fi
+
   # RNPT-03: Capture pre-execute baseline hash of every workdir file.
   # Delta computed post-execute and written into state.json.
   if [[ -n "${STATE_JSON:-}" && -n "${BASELINE_HASHES_JSON:-}" ]]; then
@@ -823,6 +843,17 @@ run_execute_phase() {
     POST_EXECUTE_SHA=$(git -C "$WORKDIR" rev-parse HEAD 2>/dev/null || true)
     status_output=$(git -C "$WORKDIR" status --short)
 
+    # v1.5 Phase 3: record the workdir HEAD just after change detection so a
+    # reviewer can see whether the executor committed (pre != post) or left the
+    # tree dirty (pre == post). Empty rev-parse → null (never die).
+    if [[ -n "${STATE_JSON:-}" ]]; then
+      if [[ -n "$POST_EXECUTE_SHA" ]]; then
+        write_state_field "$STATE_JSON" ".post_execute_sha" "string" "$POST_EXECUTE_SHA"
+      else
+        write_state_field "$STATE_JSON" ".post_execute_sha" "null"
+      fi
+    fi
+
     # RNPT-03: Post-execute delta (runs regardless of "no changes" branch below
     # so Phase 8 scorer sees an empty delta rather than a missing field).
     if [[ -n "${STATE_JSON:-}" && -n "${CURRENT_HASHES_JSON:-}" && -n "${EXECUTE_DELTA_JSON:-}" ]]; then
@@ -1110,6 +1141,19 @@ while [[ $# -gt 0 ]]; do
       WORKTREE_SPEC="$2"
       shift 2
       ;;
+    --parent-run)
+      # v1.5 Phase 3: orchestration lineage. Validate as a safe filesystem-ish
+      # token (alnum, _, -, .) so a malicious id can't traverse paths or inject
+      # shell-meta when echoed downstream (mirrors validate_lab_mode's posture,
+      # plus '.' since run ids carry dotted timestamps). Lineage only — no
+      # behavior change beyond the state.orchestration.parent_run_id write.
+      [[ $# -gt 1 ]] || die "--parent-run requires a value"
+      if ! [[ "$2" =~ ^[A-Za-z0-9._-]+$ ]] || [[ "${#2}" -gt 128 ]]; then
+        die "--parent-run must be a safe token [A-Za-z0-9._-]{1,128} (got: $2)"
+      fi
+      PARENT_RUN_ID="$2"
+      shift 2
+      ;;
     --lab)
       # Phase 3 LAB-01: opt-in routing to lab/<MODE>/entry.sh.
       # Arm sits BEFORE the `--` argv-terminator so args after `--` remain
@@ -1361,6 +1405,17 @@ EXECUTE_DELTA_JSON="${RUN_DIR}/.execute-delta.json"
 RUN_ID="dev-review-${TIMESTAMP}"
 init_state_json "$STATE_JSON" "$RUN_ID" "$TASK" "$COMPOSER" "$EXECUTOR" "$REVIEWER"
 
+# v1.5 Phase 3: observability — record the runner's PID once so a status reader
+# can probe liveness via `kill -0`. Written as a number (jq accepts $$ verbatim).
+write_state_field "$STATE_JSON" ".runner_pid" "number" "$$"
+
+# v1.5 Phase 3: lineage — when this run was re-kicked by an orchestrator under a
+# parent run, record the parent's id (validated at parse time). Lineage only;
+# no other behavior. Omitted (left absent) when --parent-run was not passed.
+if [[ -n "$PARENT_RUN_ID" ]]; then
+  write_state_field "$STATE_JSON" ".orchestration.parent_run_id" "string" "$PARENT_RUN_ID"
+fi
+
 # v1.5: resolve the verifier seat and per-seat model/effort descriptors for the
 # banner + observability fields. _VERIFIER_SEAT reflects --verifier / executor
 # derivation (select_verifier); the seat strings reflect whatever the seats
@@ -1440,6 +1495,9 @@ if [[ "$PLAN_ONLY" == "true" ]]; then
   # RNPT-04: record completion even on plan-only exit so Phase 8 eval sees a
   # terminated run rather than one that looks crashed mid-phase.
   write_state_field "$STATE_JSON" ".completed_at" "string" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+  # v1.5 Phase 3: plan-only runs also reach a clean terminal — clear the in-flight
+  # phase so a status reader does not see a stale compose/bounce as "still running".
+  write_state_field "$STATE_JSON" ".current_phase" "null"
   cleanup_runtime_artifacts
   if [[ "${PLAN_EXIT:-0}" -eq 2 ]]; then
     exit 2
@@ -1531,6 +1589,9 @@ _run_revise_loop() {
 
     _execute_phase_start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
     EXECUTE_EXIT=0
+    # v1.5 Phase 3: mark execute (or execute-N on a revise pass) as STARTING,
+    # using the same name write_state_phase records on completion below.
+    begin_state_phase "$STATE_JSON" "$exec_name"
     run_execute_phase "$_execute_phase_start" || EXECUTE_EXIT=$?
     _phase_end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
     _phase_status=$([[ "$EXECUTE_EXIT" -eq 0 ]] && echo "ok" || echo "error")
@@ -1545,6 +1606,8 @@ _run_revise_loop() {
     VERIFY_EXIT=0
     # Clear prior verdict globals so a hung/empty verify pass can't leak them.
     unset VERDICT CONFIDENCE SUMMARY
+    # v1.5 Phase 3: mark verify (or verify-N on a revise pass) as STARTING.
+    begin_state_phase "$STATE_JSON" "$verify_name"
     run_verify_phase "$_verify_phase_start" || VERIFY_EXIT=$?
     _phase_end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
     _phase_status=$([[ "$VERIFY_EXIT" -eq 0 ]] && echo "ok" || echo "error")
@@ -1595,6 +1658,10 @@ write_state_field "$STATE_JSON" ".status"       "string" "$_run_status"
 write_state_field "$STATE_JSON" ".completed_at" "string" "$_run_end_ts"
 # Phase 8.1 WR-02 supporting: .updated_at feeds Cost dimension (score-run.sh:261).
 write_state_field "$STATE_JSON" ".updated_at"   "string" "$_run_end_ts"
+# v1.5 Phase 3: the run reached EOF — no phase is in flight. The status reader
+# treats a non-null .current_phase on a non-terminal run as "still in <phase>";
+# clearing it here is the in-band "phases done" signal.
+write_state_field "$STATE_JSON" ".current_phase" "null"
 
 # Phase 8.1 / D-01: mirror phases[] into history[] (canonical contract name).
 # Field-rename transition posture — phases[] stays as legacy alias for one
diff --git a/evals/RUNNER-CONTRACT.md b/evals/RUNNER-CONTRACT.md
index 650f476..e262780 100644
--- a/evals/RUNNER-CONTRACT.md
+++ b/evals/RUNNER-CONTRACT.md
@@ -6,7 +6,7 @@ owners:
   - dev-review/codex/dev-review.sh (runner)
   - evals/score-run.sh (scorer)
   - evals/tests/fake-runner.sh (reference implementation)
-updated: 2026-04-19
+updated: 2026-06-12
 ---
 
 # Runner ↔ scorer contract
@@ -44,6 +44,11 @@ historical fixture corpus under `runners/codex-ps/evals/tests/fixtures/**`.
 | `history` | array\<`{phase,status,detail,timestamp}`\> | yes | runner (per phase) | `score-run.sh:338` | Canonical bounce-trace array. Convergence needs ≥1 entry with `phase` matching `^bounce-[0-9]+$` when bounces were expected. |
 | `mode` | string | no | runner (optional tag) | — | Fake-runner sets `"fake-runner"`; real runner may set `"real"` or omit. Runner-owned metadata — NOT written by the shared `init_state_json` library. |
 | `seat_models` | object `{composer, executor, verifier}` (each string `agent:model@effort`) | no | `dev-review.sh` (post-init) | — | v1.5 observability: what each seat resolved to under the per-seat env layer (e.g. `"claude:claude-fable-5@high"`, `"codex:(default)@xhigh"`). Runner-owned metadata — NOT written by the shared `init_state_json` library; scorer ignores it. |
+| `current_phase` | object `{name, started_at}` \| null | no | `dev-review.sh` (`begin_state_phase` at each phase start; null at EOF) | — | v1.5 additions — observability. The phase the runner is currently in; `name` matches the `phases[]`/`history[]` phase name (`compose`, `bounce-NN`, `execute[-N]`, `verify[-N]`). Set at phase **start**, cleared to null at EOF (incl. plan-only). A non-null value on a non-terminal run means the runner is in that phase or died there. Read by `dev-review-status.sh`; scorer ignores it. |
+| `runner_pid` | number | no | `dev-review.sh` (post-init, once) | — | v1.5 additions — observability. The runner process's PID (`$$`). A status reader probes liveness via `kill -0`. Runner-owned; scorer ignores it. |
+| `pre_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, before executor) | — | v1.5 additions — observability. Workdir `git rev-parse HEAD` captured just before the executor runs; null when non-git/detached. Scorer ignores it. |
+| `post_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, after change detection) | — | v1.5 additions — observability. Workdir HEAD just after change detection; `pre != post` means the executor committed. Null when non-git/detached. Scorer ignores it. |
+| `orchestration` | object `{parent_run_id}` | no | `dev-review.sh` (post-init, only when `--parent-run` passed) | — | v1.5 additions — lineage. Records the orchestrator's parent run id for re-kicks (re-kicks always get a fresh run dir; no `--resume`). Omitted entirely for standalone runs. Runner-owned; scorer ignores it. |
 
 ### Legacy alias: `phases` → `history`
 
diff --git a/lib/co-evolution.sh b/lib/co-evolution.sh
index ef0aeed..cc83a77 100644
--- a/lib/co-evolution.sh
+++ b/lib/co-evolution.sh
@@ -1277,6 +1277,35 @@ write_state_phase() {
   fi
 }
 
+# v1.5: record the phase now STARTING (write_state_phase records completions).
+# Sets .current_phase = {name, started_at} and bumps .updated_at so a status
+# reader knows which phase the runner is in mid-run. The EOF block clears it
+# back to null. Single-writer discipline: called from the runner main flow only
+# — there is NO concurrent heartbeat loop, which would race write_state_field's
+# read-modify-write. Degrades silently (return 0) when jq is absent, matching
+# write_state_phase / write_state_field.
+begin_state_phase() {
+  local state_path="${1:?state path required}"
+  local phase_name="${2:?phase name required}"
+
+  if command -v jq >/dev/null 2>&1; then
+    local tmp now
+    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+    tmp=$(mktemp)
+    # FIX-WR-02 idiom: clean up $tmp on jq failure so we don't leak files.
+    if jq --arg name "$phase_name" --arg now "$now" \
+       '.current_phase = {name: $name, started_at: $now} | .updated_at = $now' \
+       "$state_path" > "$tmp"; then
+      mv "$tmp" "$state_path"
+    else
+      rm -f "$tmp"
+      log "WARNING: jq failed in begin_state_phase ($phase_name) — state.json unchanged"
+    fi
+  else
+    log "WARNING: jq unavailable — begin_state_phase skipping ($phase_name)"
+  fi
+}
+
 # RNPT-04: Generic jq-path field setter. Supports string|number|bool|null|rawfile.
 # Usage: write_state_field state.json '.verify_verdict' string APPROVED
 #        write_state_field state.json '.execute_delta' rawfile path/to/delta.json
diff --git a/tests/observability-lifecycle-simulation.sh b/tests/observability-lifecycle-simulation.sh
new file mode 100644
index 0000000..7544674
--- /dev/null
+++ b/tests/observability-lifecycle-simulation.sh
@@ -0,0 +1,292 @@
+#!/usr/bin/env bash
+# tests/observability-lifecycle-simulation.sh
+# Hermetic gate for v1.5 Phase 3: runner observability weave-in
+# (dev-review/codex/dev-review.sh). Runs the REAL runner end-to-end with
+# PATH-stubbed claude/codex (no agents, no network) and asserts the additive
+# state.json fields the status reader depends on:
+#
+#   - .current_phase == null at a clean terminal (EOF clears the in-flight phase)
+#   - .runner_pid present (number)
+#   - .pre_execute_sha / .post_execute_sha present and 40-hex
+#   - --parent-run abc-123 lands in .orchestration.parent_run_id
+#   - a mid-run state.json capture shows .current_phase.name == "execute"
+#
+# Cheap shape: --skip-plan --plan <fixture> --bounces 0 --verify --verifier codex
+# (codex executor + codex verifier via --output-schema, so a single codex stub
+# drives both phases without needing a working claude-verdict path).
+#
+# House final-line convention: "N/N scenarios passed".
+
+set -uo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh"
+
+command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; }
+
+TEST_DIR="$(mktemp -d -t obs-lifecycle-XXXXXX)"
+trap 'rm -rf "$TEST_DIR"' EXIT
+
+TOTAL=0
+PASSED=0
+pass() { printf 'PASS: %s\n' "$1"; PASSED=$((PASSED + 1)); }
+fail() { printf 'FAIL: %s\n' "$1" >&2; }
+
+# --- hermetic git env (mirrors tests/run-all.sh) ----------------------------
+export GIT_CONFIG_GLOBAL="$TEST_DIR/gitconfig"
+export GIT_CONFIG_SYSTEM=/dev/null
+cat > "$TEST_DIR/gitconfig" <<'GITCFG'
+[user]
+	name = obs-sim
+	email = obs-sim@invalid.local
+[init]
+	defaultBranch = master
+[core]
+	autocrlf = false
+[commit]
+	gpgsign = false
+GITCFG
+
+# --- stubs ------------------------------------------------------------------
+# codex stub handles BOTH phases the default-seat skip-plan run exercises:
+#   - EXECUTE: prompt heading "Execution Instructions (Codex)" + a -C workdir →
+#     append a line to the tracked file so the run produces a diff. Honors
+#     OBS_EXECUTE_SLEEP (seconds) so the mid-run scenario has a polling window.
+#   - VERIFY: argv carries --output-schema + -o FILE → write a valid verdict.
+mkdir -p "$TEST_DIR/bin"
+cat > "$TEST_DIR/bin/codex" <<'STUB'
+#!/usr/bin/env bash
+for arg in "$@"; do
+  case "$arg" in --version|-v|version) echo "codex 0.117.0 (obs-stub)"; exit 0 ;; esac
+done
+output_file=""; workdir=""; has_schema=false; prev=""
+for arg in "$@"; do
+  case "$prev" in
+    -o) output_file="$arg" ;;
+    -C) workdir="$arg" ;;
+  esac
+  [[ "$arg" == "--output-schema" ]] && has_schema=true
+  prev="$arg"
+done
+stdin_capture=""
+while IFS= read -r _line; do stdin_capture+="$_line"$'\n'; done
+
+if [[ "$has_schema" == true ]]; then
+  # VERIFY phase — emit a schema-shaped verdict to the -o file.
+  printf '%s\n' '{"verdict":"APPROVED","confidence":90,"summary":"obs-stub approves the change.","issues":[],"scope_creep_detected":false,"iteration_notes":""}' > "$output_file"
+  exit 0
+fi
+
+if [[ "$stdin_capture" == *"Execution Instructions (Codex)"* && -n "$workdir" && -f "$workdir/touched.txt" ]]; then
+  # EXECUTE phase — optional dwell so a poller can observe current_phase=execute.
+  if [[ -n "${OBS_EXECUTE_SLEEP:-}" ]]; then
+    sleep "$OBS_EXECUTE_SLEEP"
+  fi
+  printf 'obs-stub-change\n' >> "$workdir/touched.txt"
+fi
+# Non-schema codex calls emit to -o (or stdout) so the runner sees non-empty output.
+plan='obs stub output line one\nobs stub output line two\nobs stub output line three'
+if [[ -n "$output_file" ]]; then
+  printf '%b\n' "$plan" > "$output_file"
+else
+  printf '%b\n' "$plan"
+fi
+exit 0
+STUB
+chmod +x "$TEST_DIR/bin/codex"
+# claude stub (unused by default seats, present for safety): emit a plan.
+cat > "$TEST_DIR/bin/claude" <<'STUB'
+#!/usr/bin/env bash
+for arg in "$@"; do
+  case "$arg" in --version|-v|version) echo "claude 1.0.0 (obs-stub)"; exit 0 ;; esac
+done
+while IFS= read -r _line; do :; done
+printf 'obs claude stub output\n'
+STUB
+chmod +x "$TEST_DIR/bin/claude"
+
+# --- fixtures ---------------------------------------------------------------
+make_scratch_repo() {
+  local repo="$TEST_DIR/repo-$1"
+  rm -rf "$repo"; mkdir -p "$repo"
+  git -C "$repo" init -q
+  printf 'baseline\n' > "$repo/touched.txt"
+  git -C "$repo" add -A
+  git -C "$repo" commit -qm baseline
+  printf '%s' "$repo"
+}
+
+PLAN_FIXTURE="$TEST_DIR/plan-fixture.md"
+cat > "$PLAN_FIXTURE" <<'PLAN'
+# Observability Lifecycle Fixture
+
+## Approach
+
+A pre-baked plan so the --skip-plan run bypasses compose and bounce and drives
+only the execute and verify phases. It satisfies the plan heading regex, the
+minimum word count, and the structural-line check. The executor stub appends one
+harmless line to a tracked file so the run produces a diff for verify to score.
+
+## Files to Change
+
+- `touched.txt` — the executor stub appends a line.
+
+## Risks
+
+- None identified.
+PLAN
+
+latest_run_dir() { ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1; }
+
+# ===========================================================================
+# Scenario 1: full lifecycle — current_phase=null at EOF, runner_pid present,
+# pre/post SHA 40-hex.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo s1)
+before=$(latest_run_dir || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \
+    --workdir "$repo" --timeout 60 -- "obs lifecycle probe"
+) > "$TEST_DIR/s1.out" 2>&1 || true
+run1=$(latest_run_dir || true)
+s1_state="$run1/state.json"
+s1_ok=true
+if [[ -z "$run1" || "$run1" == "$before" || ! -f "$s1_state" ]]; then
+  s1_ok=false
+else
+  jq -e '.current_phase == null' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo "  current_phase not null" >&2; }
+  jq -e '(.runner_pid | type) == "number"' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo "  runner_pid missing/non-number" >&2; }
+  jq -e '(.pre_execute_sha | test("^[0-9a-f]{40}$"))' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo "  pre_execute_sha not 40-hex" >&2; }
+  jq -e '(.post_execute_sha | test("^[0-9a-f]{40}$"))' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo "  post_execute_sha not 40-hex" >&2; }
+  jq -e '.verify_verdict == "APPROVED"' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo "  verdict not APPROVED" >&2; }
+fi
+if [[ "$s1_ok" == true ]]; then
+  pass "S1: lifecycle — current_phase=null, runner_pid number, pre/post SHA 40-hex, verdict APPROVED"
+else
+  fail "S1: state below"
+  [[ -f "$s1_state" ]] && cat "$s1_state" >&2
+  cat "$TEST_DIR/s1.out" >&2
+fi
+[[ -n "$run1" && "$run1" != "$before" ]] && rm -rf "$run1"
+
+# ===========================================================================
+# Scenario 2: --parent-run lands in .orchestration.parent_run_id.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo s2)
+before=$(latest_run_dir || true)
+(
+  export PATH="$TEST_DIR/bin:$PATH"
+  bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \
+    --parent-run "abc-123" --workdir "$repo" --timeout 60 -- "obs parent-run probe"
+) > "$TEST_DIR/s2.out" 2>&1 || true
+run2=$(latest_run_dir || true)
+if [[ -n "$run2" && "$run2" != "$before" && -f "$run2/state.json" ]] \
+   && jq -e '.orchestration.parent_run_id == "abc-123"' "$run2/state.json" >/dev/null 2>&1; then
+  pass "S2: --parent-run abc-123 recorded in .orchestration.parent_run_id"
+else
+  fail "S2: parent_run_id missing (run=$run2)"
+  [[ -f "$run2/state.json" ]] && cat "$run2/state.json" >&2
+fi
+[[ -n "$run2" && "$run2" != "$before" ]] && rm -rf "$run2"
+
+# ===========================================================================
+# Scenario 3: --parent-run rejects a garbage token (path traversal / shell meta).
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+rc3=0
+( export PATH="$TEST_DIR/bin:$PATH"
+  bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --parent-run '../evil;rm' \
+    --workdir "$TEST_DIR" --timeout 60 -- "x" ) > "$TEST_DIR/s3.out" 2>&1 || rc3=$?
+if [[ "$rc3" -ne 0 ]] && grep -qi 'parent-run' "$TEST_DIR/s3.out"; then
+  pass "S3: --parent-run rejects an unsafe token (die)"
+else
+  fail "S3: rc=$rc3 out=$(cat "$TEST_DIR/s3.out")"
+fi
+
+# ===========================================================================
+# Scenario 4: mid-run capture — while execute dwells, state.json shows
+# current_phase.name == "execute" and runner_pid is alive.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo s4)
+before=$(latest_run_dir || true)
+(
+  export PATH="$TEST_DIR/bin:$PATH"
+  export OBS_EXECUTE_SLEEP=4
+  bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \
+    --workdir "$repo" --timeout 60 -- "obs mid-run probe"
+) > "$TEST_DIR/s4.out" 2>&1 &
+runner_bg=$!
+
+# Poll for a run dir whose state.json reports current_phase.name == execute.
+captured=false
+captured_pid=""
+for _ in $(seq 1 60); do
+  rd=$(latest_run_dir || true)
+  if [[ -n "$rd" && "$rd" != "$before" && -f "$rd/state.json" ]]; then
+    name=$(jq -r '.current_phase.name // empty' "$rd/state.json" 2>/dev/null || true)
+    if [[ "$name" == "execute" ]]; then
+      captured=true
+      captured_pid=$(jq -r '.runner_pid // empty' "$rd/state.json" 2>/dev/null || true)
+      break
+    fi
+  fi
+  sleep 0.2
+done
+wait "$runner_bg" 2>/dev/null || true
+
+s4_ok=true
+[[ "$captured" == true ]] || { s4_ok=false; echo "  never observed current_phase=execute" >&2; }
+# runner_pid captured mid-run must be a positive integer.
+[[ "$captured_pid" =~ ^[0-9]+$ ]] || { s4_ok=false; echo "  captured runner_pid not numeric: $captured_pid" >&2; }
+# And the run must still terminate cleanly with current_phase cleared.
+run4=$(latest_run_dir || true)
+if [[ -n "$run4" && -f "$run4/state.json" ]]; then
+  jq -e '.current_phase == null and .status == "completed"' "$run4/state.json" >/dev/null 2>&1 \
+    || { s4_ok=false; echo "  post-run state not clean terminal" >&2; }
+fi
+if [[ "$s4_ok" == true ]]; then
+  pass "S4: mid-run capture observed current_phase=execute; run then terminated clean"
+else
+  fail "S4: capture failed"
+  [[ -f "$run4/state.json" ]] && cat "$run4/state.json" >&2
+  cat "$TEST_DIR/s4.out" >&2
+fi
+[[ -n "$run4" && "$run4" != "$before" ]] && rm -rf "$run4"
+
+# ===========================================================================
+# Scenario 5: parity — a default skip-plan run with NO new flags still produces
+# a completed state (no-new-flags regression; existing bounce-state behavior
+# is covered by tests/bounce-state-simulation.sh).
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo s5)
+before=$(latest_run_dir || true)
+(
+  export PATH="$TEST_DIR/bin:$PATH"
+  bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \
+    --workdir "$repo" --timeout 60 -- "obs parity probe"
+) > "$TEST_DIR/s5.out" 2>&1 || true
+run5=$(latest_run_dir || true)
+if [[ -n "$run5" && "$run5" != "$before" && -f "$run5/state.json" ]] \
+   && jq -e '.status == "completed" and .current_phase == null' "$run5/state.json" >/dev/null 2>&1; then
+  pass "S5: no-new-flags run reaches completed with current_phase cleared"
+else
+  fail "S5: parity run not clean (run=$run5)"
+  [[ -f "$run5/state.json" ]] && cat "$run5/state.json" >&2
+fi
+[[ -n "$run5" && "$run5" != "$before" ]] && rm -rf "$run5"
+
+# ---------------------------------------------------------------------------
+printf '%d/%d scenarios passed' "$PASSED" "$TOTAL"
+if (( PASSED != TOTAL )); then
+  printf ' (%d failed)\n' "$((TOTAL - PASSED))"
+  exit 1
+fi
+printf '\n'
diff --git a/tests/status-reader-simulation.sh b/tests/status-reader-simulation.sh
new file mode 100644
index 0000000..817e831
--- /dev/null
+++ b/tests/status-reader-simulation.sh
@@ -0,0 +1,203 @@
+#!/usr/bin/env bash
+# tests/status-reader-simulation.sh
+# Hermetic gate for v1.5 Phase 3: dev-review/codex/dev-review-status.sh.
+#
+# Fabricates synthetic run dirs (state.json + stderr heartbeat logs, contract-
+# shaped per evals/tests/fake-runner.sh) under a throwaway CO_EVOLVE_RUNS_DIR
+# and asserts the status reader's liveness assessment, exit codes, --json keys,
+# and --list output. No runner, no agents, no network.
+#
+# Exit-code contract under test (dev-review-status.sh):
+#   0 terminal-completed · 2 terminal-partial · 5 still-running ·
+#   4 presumed-dead · 3 not found / no state.json
+#
+# House final-line convention: "N/N scenarios passed".
+
+set -uo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+STATUS="$REPO_ROOT/dev-review/codex/dev-review-status.sh"
+
+command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; }
+
+TEST_DIR="$(mktemp -d -t status-reader-sim-XXXXXX)"
+trap 'rm -rf "$TEST_DIR"' EXIT
+RUNS="$TEST_DIR/runs"
+mkdir -p "$RUNS"
+export CO_EVOLVE_RUNS_DIR="$RUNS"
+
+TOTAL=0
+PASSED=0
+pass() { printf 'PASS: %s\n' "$1"; PASSED=$((PASSED + 1)); }
+fail() { printf 'FAIL: %s\n' "$1" >&2; }
+
+# Run the reader; capture stdout + the exit code it returns.
+run_status() {
+  STATUS_OUT=""
+  STATUS_RC=0
+  STATUS_OUT=$(bash "$STATUS" "$@" 2>&1) || STATUS_RC=$?
+}
+
+# Fabricate a run dir. Args: <id> <status-json-literal> <current_phase-name-or-empty>
+# Writes a contract-shaped state.json with runner_pid + (optional) current_phase.
+# Extra knobs via env: FAB_PID, FAB_VERDICT.
+fabricate() {
+  local id="$1" status="$2" phase="$3"
+  local dir="$RUNS/$id"
+  mkdir -p "$dir/outputs"
+  local pid="${FAB_PID:-$$}"
+  local cur="null"
+  if [[ -n "$phase" ]]; then
+    cur=$(jq -n --arg n "$phase" '{name: $n, started_at: "2026-06-12T00:00:00Z"}')
+  fi
+  jq -n \
+    --arg id "$id" \
+    --arg status "$status" \
+    --argjson pid "$pid" \
+    --argjson cur "$cur" \
+    --arg verdict "${FAB_VERDICT:-}" \
+    '{
+      run_id: $id,
+      task: "synthetic status-reader probe task that is fairly long to test truncation behavior in the reader output line",
+      composer: "opus", executor: "codex", reviewer: "opus",
+      phases: [
+        {name: "compose", started_at: "2026-06-12T00:00:00Z", completed_at: "2026-06-12T00:00:01Z", status: "ok", exit_code: 0}
+      ],
+      marker_counts: {contested: 1, clarify: 2},
+      verify_verdict: (if $verdict == "" then null else $verdict end),
+      runner_pid: $pid,
+      current_phase: (if $status == "null" then $cur else $cur end),
+      started_at: "2026-06-12T00:00:00Z",
+      completed_at: null,
+      status: (if $status == "null" then null else $status end),
+      updated_at: "2026-06-12T00:00:00Z"
+    }' > "$dir/state.json"
+  printf '%s' "$dir"
+}
+
+# ---------------------------------------------------------------------------
+# Scenario 1: still-running — alive pid ($$) + fresh execute stderr -> exit 5
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+d1=$(FAB_PID="$$" fabricate "dev-review-0001-running" null "execute")
+# Fresh heartbeat: execute-stderr.log just touched (mtime = now).
+printf 'executor working...\nstill going...\n' > "$d1/execute-stderr.log"
+run_status "dev-review-0001-running"
+if [[ "$STATUS_RC" -eq 5 ]] && grep -q 'RUNNING' <<<"$STATUS_OUT"; then
+  pass "S1: still-running (alive pid + fresh heartbeat) -> exit 5"
+else
+  fail "S1: rc=$STATUS_RC out=<<$STATUS_OUT>>"
+fi
+
+# ---------------------------------------------------------------------------
+# Scenario 2: presumed-dead — bogus pid + stale stderr + non-terminal -> exit 4
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+d2=$(FAB_PID=999999 fabricate "dev-review-0002-dead" null "execute")
+printf 'executor started\n' > "$d2/execute-stderr.log"
+# Make the heartbeat stale (>120s old).
+touch -t 202001010000 "$d2/execute-stderr.log" 2>/dev/null || true
+run_status "dev-review-0002-dead"
+if [[ "$STATUS_RC" -eq 4 ]] && grep -q 'DEAD' <<<"$STATUS_OUT"; then
+  pass "S2: presumed-dead (pid gone + stale heartbeat) -> exit 4"
+else
+  fail "S2: rc=$STATUS_RC out=<<$STATUS_OUT>>"
+fi
+
+# ---------------------------------------------------------------------------
+# Scenario 3: completed -> exit 0
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+d3=$(FAB_VERDICT="APPROVED" fabricate "dev-review-0003-done" completed "")
+jq -n '{verdict:"APPROVED",confidence:90,summary:"ok",issues:[]}' > "$d3/verdict.json"
+run_status "dev-review-0003-done"
+if [[ "$STATUS_RC" -eq 0 ]] && grep -q 'DONE' <<<"$STATUS_OUT"; then
+  pass "S3: completed -> exit 0"
+else
+  fail "S3: rc=$STATUS_RC out=<<$STATUS_OUT>>"
+fi
+
+# ---------------------------------------------------------------------------
+# Scenario 4: partial -> exit 2
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+d4=$(fabricate "dev-review-0004-partial" partial "")
+run_status "dev-review-0004-partial"
+if [[ "$STATUS_RC" -eq 2 ]] && grep -q 'PARTIAL' <<<"$STATUS_OUT"; then
+  pass "S4: partial -> exit 2"
+else
+  fail "S4: rc=$STATUS_RC out=<<$STATUS_OUT>>"
+fi
+
+# ---------------------------------------------------------------------------
+# Scenario 5: missing run -> exit 3
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+run_status "dev-review-9999-nope"
+if [[ "$STATUS_RC" -eq 3 ]]; then
+  pass "S5: missing run -> exit 3"
+else
+  fail "S5: rc=$STATUS_RC out=<<$STATUS_OUT>>"
+fi
+
+# ---------------------------------------------------------------------------
+# Scenario 6: --json parses with jq and has the expected keys
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+json_rc=0
+json_out=$(bash "$STATUS" --json "dev-review-0001-running" 2>/dev/null) || json_rc=$?
+j_ok=true
+if ! jq -e . <<<"$json_out" >/dev/null 2>&1; then j_ok=false; fi
+for key in run_id status runner_pid runner_alive current_phase heartbeat marker_counts assess exit_code phases; do
+  jq -e "has(\"$key\")" <<<"$json_out" >/dev/null 2>&1 || { j_ok=false; echo "  missing key: $key" >&2; }
+done
+# current_phase.name and heartbeat.age_secs must be populated for the running run.
+if [[ "$j_ok" == true ]]; then
+  jq -e '.current_phase.name == "execute"' <<<"$json_out" >/dev/null 2>&1 || j_ok=false
+  jq -e '.heartbeat.age_secs != null' <<<"$json_out" >/dev/null 2>&1 || j_ok=false
+  jq -e '.exit_code == 5' <<<"$json_out" >/dev/null 2>&1 || j_ok=false
+fi
+# --json still returns the assessment exit code.
+[[ "$json_rc" -eq 5 ]] || j_ok=false
+if [[ "$j_ok" == true ]]; then
+  pass "S6: --json is valid JSON with expected keys + populated current_phase/heartbeat"
+else
+  fail "S6: json_rc=$json_rc out=<<$json_out>>"
+fi
+
+# ---------------------------------------------------------------------------
+# Scenario 7: --list shows the active (non-terminal) run + recent terminal ones
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+list_out=$(bash "$STATUS" --list 2>&1) || true
+if grep -q 'ACTIVE' <<<"$list_out" \
+   && grep -q 'dev-review-0001-running' <<<"$list_out" \
+   && grep -q 'dev-review-0003-done' <<<"$list_out"; then
+  pass "S7: --list shows the active run and recent terminal runs"
+else
+  fail "S7: out=<<$list_out>>"
+fi
+
+# ---------------------------------------------------------------------------
+# Scenario 8: no positional arg -> latest run by mtime (the dead one we touched
+# oldest is NOT latest; ensure default resolution returns *some* run cleanly).
+# ---------------------------------------------------------------------------
+TOTAL=$((TOTAL + 1))
+# Touch a brand-new completed run so it is unambiguously newest.
+d8=$(fabricate "dev-review-0008-latest" completed "")
+touch "$d8/state.json"
+run_status   # no arg
+if [[ "$STATUS_RC" -eq 0 ]] && grep -q 'dev-review-0008-latest' <<<"$STATUS_OUT"; then
+  pass "S8: no-arg resolves the newest run by mtime"
+else
+  fail "S8: rc=$STATUS_RC out=<<$STATUS_OUT>>"
+fi
+
+# ---------------------------------------------------------------------------
+printf '%d/%d scenarios passed' "$PASSED" "$TOTAL"
+if (( PASSED != TOTAL )); then
+  printf ' (%d failed)\n' "$((TOTAL - PASSED))"
+  exit 1
+fi
+printf '\n'

From fb965ad70bd6e7abf515b5fdd168cdc5fcb070d3 Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 13:19:49 -0400
Subject: [PATCH 05/12] feat: gated per-phase token capture into state.json
 (v1.5 Phase 4)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 dev-review/codex/dev-review.sh    |  23 ++
 evals/RUNNER-CONTRACT.md          |   1 +
 evals/score-run.sh                |  12 +-
 lib/co-evolution.sh               | 184 ++++++++++++-
 tests/token-capture-simulation.sh | 413 ++++++++++++++++++++++++++++++
 5 files changed, 630 insertions(+), 3 deletions(-)
 create mode 100755 tests/token-capture-simulation.sh

diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 583d8a9..30fd33e 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -1033,6 +1033,22 @@ cleanup_runtime_artifacts() {
   find "$RUN_DIR" -maxdepth 1 -type f -name '.*' -delete 2>/dev/null || true
 }
 
+# v1.5 Phase 4: token-usage aggregation, gated on CO_EVOLVE_TOKEN_CAPTURE=1.
+# When the flag is unset/off this is a no-op and state.json never grows a
+# `tokens` key (byte-parity). Called just BEFORE cleanup_runtime_artifacts in
+# both terminal paths (plan-only early exit and normal EOF). collect_token_usage
+# reads the *.usage.json sidecars + per-phase stderr logs and writes one
+# state.json.tokens block; we then remove the transient sidecars/envelopes since
+# state.json is the durable record (some sidecars — execute-output.md.usage.json,
+# verdict.json.usage.json — have no leading dot, so cleanup_runtime_artifacts
+# would not sweep them).
+maybe_collect_token_usage() {
+  [[ "${CO_EVOLVE_TOKEN_CAPTURE:-}" == "1" ]] || return 0
+  collect_token_usage "$RUN_DIR" "$STATE_JSON"
+  find "$RUN_DIR" -maxdepth 1 -type f \
+    \( -name '*.usage.json' -o -name '*.envelope.json' \) -delete 2>/dev/null || true
+}
+
 while [[ $# -gt 0 ]]; do
   case "$1" in
     --preset)
@@ -1498,6 +1514,9 @@ if [[ "$PLAN_ONLY" == "true" ]]; then
   # v1.5 Phase 3: plan-only runs also reach a clean terminal — clear the in-flight
   # phase so a status reader does not see a stale compose/bounce as "still running".
   write_state_field "$STATE_JSON" ".current_phase" "null"
+  # v1.5 Phase 4: aggregate token usage (no-op unless CO_EVOLVE_TOKEN_CAPTURE=1)
+  # BEFORE cleanup sweeps the dot-prefixed sidecars.
+  maybe_collect_token_usage
   cleanup_runtime_artifacts
   if [[ "${PLAN_EXIT:-0}" -eq 2 ]]; then
     exit 2
@@ -1679,6 +1698,10 @@ if command -v jq >/dev/null 2>&1; then
 fi
 unset _run_status _run_end_ts
 
+# v1.5 Phase 4: aggregate token usage (no-op unless CO_EVOLVE_TOKEN_CAPTURE=1)
+# BEFORE cleanup sweeps the dot-prefixed sidecars.
+maybe_collect_token_usage
+
 cleanup_runtime_artifacts
 
 log ""
diff --git a/evals/RUNNER-CONTRACT.md b/evals/RUNNER-CONTRACT.md
index e262780..a331f5f 100644
--- a/evals/RUNNER-CONTRACT.md
+++ b/evals/RUNNER-CONTRACT.md
@@ -49,6 +49,7 @@ historical fixture corpus under `runners/codex-ps/evals/tests/fixtures/**`.
 | `pre_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, before executor) | — | v1.5 additions — observability. Workdir `git rev-parse HEAD` captured just before the executor runs; null when non-git/detached. Scorer ignores it. |
 | `post_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, after change detection) | — | v1.5 additions — observability. Workdir HEAD just after change detection; `pre != post` means the executor committed. Null when non-git/detached. Scorer ignores it. |
 | `orchestration` | object `{parent_run_id}` | no | `dev-review.sh` (post-init, only when `--parent-run` passed) | — | v1.5 additions — lineage. Records the orchestrator's parent run id for re-kicks (re-kicks always get a fresh run dir; no `--resume`). Omitted entirely for standalone runs. Runner-owned; scorer ignores it. |
+| `tokens` | object `{phases:{<phase>:{…}}, totals:{claude_input, claude_output, claude_cache_read, claude_cost_usd, codex_total_tokens}}` | no | `dev-review.sh` (`collect_token_usage` at EOF, **only when `CO_EVOLVE_TOKEN_CAPTURE=1`**) | `score-run.sh` — `.tokens.totals` copied into `details.cost.tokens`, informational only | v1.5 addition — token economics. Per-phase usage: claude phases carry `{input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens, total_cost_usd, source:"claude-json"}` from the gated `--output-format json` envelope; codex phases carry `{source:"codex-stderr", total_tokens}` harvested from the per-phase stderr token line. **Omitted entirely unless capture is enabled** (default off → byte-parity with prior state.json). Scorer surfaces totals for evidence; never affects any PASS/FAIL dimension. |
 
 ### Legacy alias: `phases` → `history`
 
diff --git a/evals/score-run.sh b/evals/score-run.sh
index 843f603..b9a6380 100644
--- a/evals/score-run.sh
+++ b/evals/score-run.sh
@@ -644,6 +644,14 @@ composite=$(jq -n --argjson scores "$scores_json" '
 # Details (sparse for Task 1; Task 2 expands per-dimension metadata)
 # ---------------------------------------------------------------------------
 
+# v1.5 Phase 4: token economics (informational only). When the runner ran with
+# CO_EVOLVE_TOKEN_CAPTURE=1, state.json carries a .tokens.totals block; surface
+# it under details.cost.tokens so the report can speak to the ~50% Claude-token
+# savings claim. The `// empty` guard means: when there is no tokens block (every
+# existing fixture, and any run with capture off), the key is simply absent and
+# details stays byte-identical — this MUST NOT touch any PASS/FAIL dimension.
+tokens_totals_json=$(jq -c '.tokens.totals // empty' "$state_json_path" 2>/dev/null)
+
 details_json=$(jq -n \
   --arg status "$status" \
   --argjson had_exception "$had_exception" \
@@ -651,9 +659,11 @@ details_json=$(jq -n \
   --argjson max_wall "$max_wall" \
   --arg composer "$composer" \
   --arg reviewer "$reviewer" \
+  --argjson tokens "${tokens_totals_json:-null}" \
   '{
     robustness: {status: $status, had_exception: $had_exception},
-    cost: {wall_clock_seconds: $wall_secs, max: $max_wall},
+    cost: ({wall_clock_seconds: $wall_secs, max: $max_wall}
+           + (if $tokens == null then {} else {tokens: $tokens} end)),
     cross_ai_diversity: {composer: $composer, reviewer: $reviewer}
   }')
 
diff --git a/lib/co-evolution.sh b/lib/co-evolution.sh
index cc83a77..8478cf7 100644
--- a/lib/co-evolution.sh
+++ b/lib/co-evolution.sh
@@ -425,11 +425,61 @@ invoke_claude() {
     model_flags+=(--effort "${CLAUDE_EFFORT}")
   fi
 
+  # v1.5 Phase 4: per-phase token capture, default OFF. Gated behind
+  # CO_EVOLVE_TOKEN_CAPTURE=1 AND jq present so the default path stays
+  # byte-identical (argv + artifacts). When ON we swap `--output-format text`
+  # for `--output-format json`: the JSON envelope (R1) carries `.usage` token
+  # counts, and `.result` holds the same text the text path would have emitted.
+  # PRTP-03 reminder: this is `--output-format json`, NOT `--output-schema`
+  # (the latter hangs claude -p on Windows). The gate keeps it default-off so
+  # the Windows precedent is respected unless an operator opts in.
+  local capture_json=false
+  if [[ "${CO_EVOLVE_TOKEN_CAPTURE:-}" == "1" ]] && command -v jq >/dev/null 2>&1; then
+    capture_json=true
+  fi
+
+  local output_format=text
+  [[ "$capture_json" == "true" ]] && output_format=json
+
   if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1; then
     # Under WSL, reuse the Windows Claude session because WSL and Windows keep separate auth state.
-    cmd=(cmd.exe /c claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}")
+    cmd=(cmd.exe /c claude -p --output-format "$output_format" "${model_flags[@]}" "${tool_flags[@]}")
   else
-    cmd=(claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}")
+    cmd=(claude -p --output-format "$output_format" "${model_flags[@]}" "${tool_flags[@]}")
+  fi
+
+  if [[ "$capture_json" == "true" ]]; then
+    local envelope_file="${output_file}.envelope.json"
+    local usage_file="${output_file}.usage.json"
+    # The envelope is emitted on stdout even on failure (R1: is_error=true still
+    # carries a body, exit nonzero) — `|| true` matches the text path and we
+    # parse regardless. Capture stdout to the envelope, NOT the output file.
+    "${cmd[@]}" < "$prompt_file" > "$envelope_file" 2>"$stderr_file" || true
+    # Extract `.result` into the output file so the external contract is
+    # preserved: callers (validate_agent_artifact, file_contains_auth_failure)
+    # only ever read $output_file. On auth failure `.result` is the "Not logged
+    # in" banner, so the existing auth gate keeps working. If `.result` is empty
+    # or the envelope is unparseable, fall back to copying the raw envelope so
+    # validate_agent_artifact sees a non-empty file rather than a silent blank.
+    if jq -e 'has("result") and (.result | type == "string") and (.result | length > 0)' \
+         "$envelope_file" >/dev/null 2>&1; then
+      jq -r '.result // empty' "$envelope_file" > "$output_file"
+    else
+      cp "$envelope_file" "$output_file"
+    fi
+    # Usage sidecar (R1 fields). `// null` guards a malformed/partial envelope so
+    # the file is always valid JSON for collect_token_usage to slurp. Never
+    # fatal — token capture must not break a run.
+    jq '{
+          input_tokens: (.usage.input_tokens // null),
+          output_tokens: (.usage.output_tokens // null),
+          cache_read_input_tokens: (.usage.cache_read_input_tokens // null),
+          cache_creation_input_tokens: (.usage.cache_creation_input_tokens // null),
+          total_cost_usd: (.total_cost_usd // null),
+          source: "claude-json"
+        }' "$envelope_file" > "$usage_file" 2>/dev/null \
+      || printf '{"input_tokens":null,"output_tokens":null,"cache_read_input_tokens":null,"cache_creation_input_tokens":null,"total_cost_usd":null,"source":"claude-json"}\n' > "$usage_file"
+    return 0
   fi
 
   "${cmd[@]}" < "$prompt_file" > "$output_file" 2>"$stderr_file" || true
@@ -1306,6 +1356,136 @@ begin_state_phase() {
   fi
 }
 
+# v1.5 Phase 4: aggregate per-phase token usage into state.json. Called ONLY
+# when CO_EVOLVE_TOKEN_CAPTURE=1 (the runner gates the call), so when the flag
+# is off there is no `tokens` key at all — state.json stays byte-identical.
+#
+# Two evidence sources, both already on disk by the time this runs:
+#   1. Claude usage sidecars (`*.usage.json`, written by invoke_claude's gated
+#      JSON path). Mapped to a phase via the sidecar's base output filename:
+#        .compose-output.md.usage.json   -> compose
+#        .bounce-output-N.md.usage.json  -> bounce-NN (zero-padded)
+#        execute-output.md.usage.json    -> execute
+#        verdict.json.usage.json         -> verify
+#   2. Codex token line (R2) harvested read-only from per-phase stderr logs:
+#        compose-stderr.log   -> compose
+#        pass-N-stderr.log     -> bounce-NN
+#        execute-stderr.log    -> execute
+#        review-stderr.log     -> verify
+#      Format: the literal line `tokens used` immediately followed by a
+#      comma-formatted integer. awk grabs the next line and strips commas.
+#
+# Writes ONE `tokens` block: {phases:{<phase>:{…}}, totals:{…}} and bumps
+# .updated_at. Degrades silently when jq is unavailable. The caller removes the
+# transient sidecars afterward (state.json is the durable record).
+collect_token_usage() {
+  local run_dir="${1:?run_dir required}"
+  local state_json="${2:?state_json required}"
+
+  command -v jq >/dev/null 2>&1 || {
+    log "WARNING: jq unavailable — collect_token_usage skipping"
+    return 0
+  }
+  [[ -f "$state_json" ]] || return 0
+
+  # Build the phases object incrementally as a JSON string.
+  local phases_json='{}'
+  local tmp
+
+  # --- 1. Claude usage sidecars -------------------------------------------
+  # Two globs: the leading-dot one matches compose/bounce sidecars whose base
+  # output file is itself dot-prefixed (.compose-output.md.usage.json,
+  # .bounce-output-N.md.usage.json) — a plain *.usage.json glob skips dotfiles.
+  local sidecar base phase pass
+  for sidecar in "$run_dir"/*.usage.json "$run_dir"/.*.usage.json; do
+    [[ -e "$sidecar" ]] || continue          # no-glob guard
+    base=$(basename "$sidecar")
+    base="${base%.usage.json}"               # strip sidecar suffix -> output filename
+    case "$base" in
+      .compose-output.md) phase="compose" ;;
+      execute-output.md)  phase="execute" ;;
+      verdict.json)       phase="verify" ;;
+      .bounce-output-*.md)
+        pass="${base#.bounce-output-}"
+        pass="${pass%.md}"
+        if [[ "$pass" =~ ^[0-9]+$ ]]; then
+          printf -v phase 'bounce-%02d' "$pass"
+        else
+          phase="bounce-${pass}"
+        fi
+        ;;
+      *) phase="$base" ;;                     # unknown shape: key by raw base
+    esac
+    # Merge this sidecar's object under .<phase>. Tolerate a malformed sidecar.
+    if tmp=$(jq --arg p "$phase" --slurpfile s "$sidecar" \
+               '. + {($p): $s[0]}' <<<"$phases_json" 2>/dev/null); then
+      phases_json="$tmp"
+    fi
+  done
+
+  # --- 2. Codex stderr token lines ----------------------------------------
+  local stderr_log log_base codex_phase tokens_val
+  for stderr_log in "$run_dir"/compose-stderr.log "$run_dir"/execute-stderr.log \
+                    "$run_dir"/review-stderr.log "$run_dir"/pass-*-stderr.log; do
+    [[ -f "$stderr_log" ]] || continue
+    log_base=$(basename "$stderr_log")
+    case "$log_base" in
+      compose-stderr.log) codex_phase="compose" ;;
+      execute-stderr.log) codex_phase="execute" ;;
+      review-stderr.log)  codex_phase="verify" ;;
+      pass-*-stderr.log)
+        pass="${log_base#pass-}"
+        pass="${pass%-stderr.log}"
+        if [[ "$pass" =~ ^[0-9]+$ ]]; then
+          printf -v codex_phase 'bounce-%02d' "$pass"
+        else
+          codex_phase="bounce-${pass}"
+        fi
+        ;;
+      *) continue ;;
+    esac
+    # Skip the retry-stderr variants (pass-N-stderr-retry.log won't match the
+    # globs above; the primary logs are authoritative for the phase total).
+    tokens_val=$(awk '/^tokens used$/{getline; gsub(",",""); print; exit}' "$stderr_log" 2>/dev/null)
+    [[ "$tokens_val" =~ ^[0-9]+$ ]] || continue
+    # A claude sidecar already claimed this phase only if claude ran it; codex
+    # phases never have a sidecar, so this is collision-free in practice. If both
+    # somehow exist, the codex entry wins last (rare/diagnostic).
+    if tmp=$(jq --arg p "$codex_phase" --argjson t "$tokens_val" \
+               '. + {($p): {source: "codex-stderr", total_tokens: $t}}' \
+               <<<"$phases_json" 2>/dev/null); then
+      phases_json="$tmp"
+    fi
+  done
+
+  # --- 3. Totals + single write into state.json ----------------------------
+  # Totals sum across phases: claude fields from claude-json sidecars, codex
+  # tokens from codex-stderr entries. Nulls are treated as 0 in the sums.
+  local now
+  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+  tmp=$(mktemp)
+  if jq --argjson phases "$phases_json" --arg now "$now" '
+        ($phases | to_entries) as $entries |
+        def num($x): ($x // 0);
+        {
+          phases: $phases,
+          totals: {
+            claude_input:      ([$entries[] | select(.value.source=="claude-json") | num(.value.input_tokens)] | add // 0),
+            claude_output:     ([$entries[] | select(.value.source=="claude-json") | num(.value.output_tokens)] | add // 0),
+            claude_cache_read: ([$entries[] | select(.value.source=="claude-json") | num(.value.cache_read_input_tokens)] | add // 0),
+            claude_cost_usd:   ([$entries[] | select(.value.source=="claude-json") | num(.value.total_cost_usd)] | add // 0),
+            codex_total_tokens:([$entries[] | select(.value.source=="codex-stderr") | num(.value.total_tokens)] | add // 0)
+          }
+        } as $tokens |
+        .tokens = $tokens | .updated_at = $now
+      ' "$state_json" > "$tmp" 2>/dev/null; then
+    mv "$tmp" "$state_json"
+  else
+    rm -f "$tmp"
+    log "WARNING: jq failed in collect_token_usage — state.json unchanged (no tokens block)"
+  fi
+}
+
 # RNPT-04: Generic jq-path field setter. Supports string|number|bool|null|rawfile.
 # Usage: write_state_field state.json '.verify_verdict' string APPROVED
 #        write_state_field state.json '.execute_delta' rawfile path/to/delta.json
diff --git a/tests/token-capture-simulation.sh b/tests/token-capture-simulation.sh
new file mode 100755
index 0000000..ecfb1a1
--- /dev/null
+++ b/tests/token-capture-simulation.sh
@@ -0,0 +1,413 @@
+#!/usr/bin/env bash
+# tests/token-capture-simulation.sh
+# Hermetic gate for v1.5 Phase 4: CO_EVOLVE_TOKEN_CAPTURE=1 token capture in
+# lib/co-evolution.sh (invoke_claude gated JSON mode + collect_token_usage) and
+# the runner's EOF aggregation into state.json.
+#
+# The capture is gated behind CO_EVOLVE_TOKEN_CAPTURE=1 AND jq present; default
+# OFF must be byte-identical to today (argv, artifacts, no state.json `tokens`
+# key). This gate pins:
+#   a. flag ON, mixed run (claude compose+verify + codex execute) → state.json
+#      .tokens.phases has both kinds of entries, totals are correct, the *.usage
+#      and *.envelope sidecars are cleaned up, and the claude output files hold
+#      the extracted `.result` text (NOT the raw envelope).
+#   b. flag ON, malformed claude envelope → the output file falls back to the
+#      envelope copy (non-empty), the run does not die, and that phase's tokens
+#      entry carries nulls.
+#   c. flag OFF (default) → claude argv contains `--output-format text` (not
+#      json), NO *.envelope.json / *.usage.json sidecars survive, and state.json
+#      has NO `tokens` key (the byte-parity guard).
+#   d. flag OFF → `--help` is unaffected (still exits 0, still emits usage) and
+#      no token machinery leaks into the help path.
+#
+# Pattern: PATH-injected claude + codex stubs (same discipline as
+# tests/preset-expansion-simulation.sh) — drain stdin via a read-loop (NOT
+# `cat`) to avoid SIGPIPE under strict-mode bash. The claude stub branches on
+# whether `--output-format json` is in its argv: when present it emits a canned
+# R1-shaped envelope (real token numbers) on stdout; otherwise the plain text
+# body. Phase (compose/bounce vs verify) is detected from the prompt on stdin.
+
+set -uo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh"
+
+command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; }
+
+TEST_DIR="$(mktemp -d -t token-capture-XXXXXX)"
+trap 'rm -rf "$TEST_DIR"' EXIT
+
+TOTAL=0
+FAILURES=0
+pass() { printf "PASS: %s\n" "$1"; }
+fail() { printf "FAIL: %s\n" "$1" >&2; FAILURES=$((FAILURES + 1)); }
+
+# --- hermetic git env (mirrors tests/run-all.sh) ----------------------------
+export GIT_CONFIG_GLOBAL="$TEST_DIR/gitconfig"
+export GIT_CONFIG_SYSTEM=/dev/null
+cat > "$TEST_DIR/gitconfig" <<'GITCFG'
+[user]
+	name = token-capture-sim
+	email = token-capture-sim@invalid.local
+[init]
+	defaultBranch = master
+[core]
+	autocrlf = false
+[commit]
+	gpgsign = false
+GITCFG
+
+# --- canned R1 envelope numbers (the assertions key off these) --------------
+# These are the per-claude-invocation token counts the stub reports. The runner
+# calls claude once for compose and once for verify in the scenario-(a) run
+# (bounces 0, so no bounce passes), so claude totals = 2x each field.
+CLAUDE_IN=1111
+CLAUDE_OUT=222
+CLAUDE_CACHE_READ=33333
+CLAUDE_CACHE_CREATE=44
+CODEX_TOKENS=12345
+
+# --- stub CLIs --------------------------------------------------------------
+mkdir -p "$TEST_DIR/bin"
+
+# claude stub. If `--output-format json` is in argv → emit a canned R1 envelope
+# on stdout whose `.result` is a phase-appropriate body (plan or verdict).
+# Otherwise → emit the body as plain text (the default text path). The
+# MALFORMED_ENVELOPE env var (scenario b) makes the json path emit a body that
+# is NOT valid JSON, exercising invoke_claude's envelope-copy fallback.
+cat > "$TEST_DIR/bin/claude" <<STUB
+#!/usr/bin/env bash
+for arg in "\$@"; do
+  case "\$arg" in
+    --version|-v|version) echo "claude 1.0.0 (token-capture-stub)"; exit 0 ;;
+  esac
+done
+[[ -n "\${CLAUDE_ARGV_LOG:-}" ]] && printf '%s\n' "\$*" >> "\$CLAUDE_ARGV_LOG"
+json_mode=false
+for arg in "\$@"; do
+  if [[ "\$arg" == "json" ]]; then json_mode=true; fi
+done
+# Drain stdin via read-loop (no SIGPIPE); capture to detect the phase.
+stdin_capture=""
+while IFS= read -r _line; do stdin_capture+="\$_line"\$'\n'; done
+
+# Phase-appropriate text body.
+if [[ "\$stdin_capture" == *"Verification Instructions"* ]]; then
+  body=\$(cat <<'VERDICT'
+{
+  "verdict": "APPROVED",
+  "confidence": 88,
+  "summary": "Token-capture stub verifier approves the hermetic change.",
+  "issues": []
+}
+VERDICT
+)
+else
+  body=\$(cat <<'PLAN'
+# Token Capture Stub Plan
+
+## Approach
+
+This is a hermetic stub plan emitted by the token-capture simulation claude
+stub. It satisfies the plan heading regex, the minimum word count, and the
+structural-line check so the compose phase validates and the run proceeds into
+execute and verify. It is not a real plan and the only side effect is a single
+harmless append to a tracked file so the run produces a diff for the verify
+phase to score. No external commands run and no network is touched anywhere.
+
+## Files to Change
+
+- \`touched.txt\` — the executor stub appends a line so the run yields a diff.
+
+## Risks
+
+- None identified. Stubbed CLI produces a fixed plan with no side effects.
+PLAN
+)
+fi
+
+if [[ "\$json_mode" == true ]]; then
+  if [[ "\${MALFORMED_ENVELOPE:-}" == "1" ]]; then
+    # NOT a valid JSON envelope — invoke_claude must copy the raw stdout to the
+    # output file (non-empty fallback) and write an all-null usage sidecar
+    # (jq can't extract .usage from non-JSON). We emit the plain plan body so the
+    # copied fallback still clears the compose plan-quality gate; the point under
+    # test is the fallback + null-usage path, not the plan validator.
+    printf '%s\n' "\$body"
+    exit 1
+  fi
+  # Canned R1-shaped envelope on stdout. .result carries the text body. jq -n
+  # builds it so .result is correctly JSON-escaped regardless of body content.
+  jq -n --arg result "\$body" \
+        --argjson it ${CLAUDE_IN} --argjson ot ${CLAUDE_OUT} \
+        --argjson cr ${CLAUDE_CACHE_READ} --argjson cc ${CLAUDE_CACHE_CREATE} '
+    {
+      type: "result", subtype: "success", is_error: false,
+      duration_ms: 100, num_turns: 1, result: \$result,
+      total_cost_usd: 0.05,
+      usage: {
+        input_tokens: \$it, output_tokens: \$ot,
+        cache_read_input_tokens: \$cr, cache_creation_input_tokens: \$cc
+      }
+    }'
+  exit 0
+fi
+# Plain text path (flag OFF): just the body.
+printf '%s\n' "\$body"
+STUB
+chmod +x "$TEST_DIR/bin/claude"
+
+# codex stub: append argv, parse -o/-C, drain stdin, emit plan to -o (or make a
+# real change in -C for the execute phase), and ALWAYS write the R2 token line
+# to stderr so the harvest path has something to find.
+cat > "$TEST_DIR/bin/codex" <<STUB
+#!/usr/bin/env bash
+for arg in "\$@"; do
+  case "\$arg" in
+    --version|-v|version) echo "codex 0.140.0 (token-capture-stub)"; exit 0 ;;
+  esac
+done
+[[ -n "\${CODEX_ARGV_LOG:-}" ]] && printf '%s\n' "\$*" >> "\$CODEX_ARGV_LOG"
+output_file=""; workdir=""; prev=""
+for arg in "\$@"; do
+  case "\$prev" in
+    -o) output_file="\$arg" ;;
+    -C) workdir="\$arg" ;;
+  esac
+  prev="\$arg"
+done
+stdin_capture=""
+while IFS= read -r _line; do stdin_capture+="\$_line"\$'\n'; done
+if [[ "\$stdin_capture" == *"Execution Instructions (Codex)"* \
+      && -n "\$workdir" && -f "\$workdir/touched.txt" ]]; then
+  printf 'codex-stub-change\n' >> "\$workdir/touched.txt"
+fi
+# R2 token line on stderr (always — codex prints it at end of every run).
+printf 'OpenAI Codex stub\ntokens used\n%s\n' "${CODEX_TOKENS}" >&2
+plan_content=\$(cat <<'PLAN'
+# Token Capture Stub Plan
+
+## Approach
+
+This is a hermetic stub plan emitted by the token-capture simulation codex
+stub. It satisfies the plan heading regex, the minimum word count, and the
+structural-line check so the compose phase validates and the run proceeds into
+execute and verify. The only side effect is a single harmless append to a
+tracked file so the run produces a diff. No external commands run and no
+network is touched anywhere in this hermetic path.
+
+## Files to Change
+
+- \`touched.txt\` — appended to by the executor stub.
+
+## Risks
+
+- None identified. Stubbed codex produces a fixed plan with no side effects.
+PLAN
+)
+if [[ -n "\$output_file" ]]; then
+  printf '%s\n' "\$plan_content" > "\$output_file"
+else
+  printf '%s\n' "\$plan_content"
+fi
+exit 0
+STUB
+chmod +x "$TEST_DIR/bin/codex"
+
+# --- helpers ----------------------------------------------------------------
+make_scratch_repo() {
+  local repo
+  repo="$TEST_DIR/repo-$1"
+  rm -rf "$repo"
+  mkdir -p "$repo"
+  git -C "$repo" init -q
+  printf 'baseline\n' > "$repo/touched.txt"
+  git -C "$repo" add -A
+  git -C "$repo" commit -qm "baseline"
+  printf '%s' "$repo"
+}
+
+latest_run_dir() { ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1; }
+
+# ===========================================================================
+# Scenario (a): flag ON, mixed run — claude composes + verifies, codex executes.
+# Asserts: both kinds of phase entries, correct totals, sidecars cleaned, claude
+# output files contain the extracted .result text (not the raw envelope).
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo a)
+claude_log="$TEST_DIR/a_claude.log"; codex_log="$TEST_DIR/a_codex.log"
+: > "$claude_log"; : > "$codex_log"
+run_dir_before=$(latest_run_dir || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT MALFORMED_ENVELOPE
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  export CO_EVOLVE_TOKEN_CAPTURE=1
+  # composer=claude, executor=codex, verifier=claude (opus alias). bounces 0 so
+  # exactly two claude calls (compose + verify) and one codex call (execute).
+  bash "$RUNNER" --composer opus --executor codex --verifier opus \
+    --verify --bounces 0 --workdir "$repo" --timeout 60 -- "token capture mixed run"
+) >"$TEST_DIR/a.out" 2>&1 || true
+a_run_dir=$(latest_run_dir)
+a_ok=true
+a_msg=""
+if [[ -z "$a_run_dir" || "$a_run_dir" == "$run_dir_before" ]]; then
+  a_ok=false; a_msg="no new run dir"
+else
+  state="$a_run_dir/state.json"
+  # tokens block present, both phase kinds.
+  if ! jq -e '.tokens.phases.compose.source == "claude-json"' "$state" >/dev/null 2>&1; then
+    a_ok=false; a_msg="compose not claude-json"
+  fi
+  if ! jq -e '.tokens.phases.verify.source == "claude-json"' "$state" >/dev/null 2>&1; then
+    a_ok=false; a_msg="verify not claude-json"
+  fi
+  if ! jq -e '.tokens.phases.execute.source == "codex-stderr"' "$state" >/dev/null 2>&1; then
+    a_ok=false; a_msg="execute not codex-stderr"
+  fi
+  # totals: claude (compose+verify) = 2x each; codex = one execute line.
+  exp_in=$((CLAUDE_IN * 2)); exp_out=$((CLAUDE_OUT * 2)); exp_cr=$((CLAUDE_CACHE_READ * 2))
+  if ! jq -e --argjson v "$exp_in"  '.tokens.totals.claude_input == $v'      "$state" >/dev/null 2>&1; then a_ok=false; a_msg="claude_input total"; fi
+  if ! jq -e --argjson v "$exp_out" '.tokens.totals.claude_output == $v'     "$state" >/dev/null 2>&1; then a_ok=false; a_msg="claude_output total"; fi
+  if ! jq -e --argjson v "$exp_cr"  '.tokens.totals.claude_cache_read == $v' "$state" >/dev/null 2>&1; then a_ok=false; a_msg="claude_cache_read total"; fi
+  if ! jq -e --argjson v "$CODEX_TOKENS" '.tokens.totals.codex_total_tokens == $v' "$state" >/dev/null 2>&1; then a_ok=false; a_msg="codex total"; fi
+  # sidecars cleaned up (neither *.usage.json nor *.envelope.json survive).
+  if ls "$a_run_dir"/*.usage.json "$a_run_dir"/.*.usage.json >/dev/null 2>&1; then a_ok=false; a_msg="usage sidecars survived"; fi
+  if ls "$a_run_dir"/*.envelope.json "$a_run_dir"/.*.envelope.json >/dev/null 2>&1; then a_ok=false; a_msg="envelope sidecars survived"; fi
+  # claude output files hold the extracted .result text — verdict.json must be
+  # the verdict object (not the raw envelope, which would carry a "usage" key).
+  if ! jq -e '.verdict == "APPROVED" and (has("usage") | not)' "$a_run_dir/verdict.json" >/dev/null 2>&1; then
+    a_ok=false; a_msg="verdict.json is not the extracted .result"
+  fi
+fi
+if [[ "$a_ok" == true ]]; then
+  pass "flag ON mixed run: claude+codex phases captured, totals correct, sidecars cleaned, .result extracted"
+else
+  fail "flag ON mixed run failed ($a_msg) — run dir: ${a_run_dir:-<none>}"
+  [[ -n "${a_run_dir:-}" ]] && jq -c '.tokens // "<no tokens key>"' "$a_run_dir/state.json" >&2 2>/dev/null || true
+fi
+[[ -n "${a_run_dir:-}" && "$a_run_dir" != "$run_dir_before" ]] && rm -rf "$a_run_dir"
+
+# ===========================================================================
+# Scenario (b): flag ON, malformed claude envelope. The compose output file
+# falls back to the envelope copy (non-empty → run does not die mid-compose),
+# and the compose tokens entry carries nulls.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo b)
+run_dir_before=$(latest_run_dir || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CO_EVOLVE_TOKEN_CAPTURE=1
+  export MALFORMED_ENVELOPE=1
+  # Only the compose phase is claude here; --bounces 0 and no --verify so the
+  # malformed compose envelope is the one under test. plan-only keeps it short:
+  # the run reaches the plan-only terminal (still calls maybe_collect_token_usage).
+  bash "$RUNNER" --composer opus --plan-only --bounces 0 \
+    --workdir "$repo" --timeout 60 -- "token capture malformed envelope"
+) >"$TEST_DIR/b.out" 2>&1 || true
+b_run_dir=$(latest_run_dir)
+b_ok=true
+b_msg=""
+if [[ -z "$b_run_dir" || "$b_run_dir" == "$run_dir_before" ]]; then
+  b_ok=false; b_msg="no new run dir"
+else
+  state="$b_run_dir/state.json"
+  # The run produced a plan.md from the envelope-copy fallback (non-empty).
+  if [[ ! -s "$b_run_dir/plan.md" ]]; then b_ok=false; b_msg="plan.md empty (fallback did not copy envelope)"; fi
+  # tokens block exists and the compose entry has null token fields.
+  if ! jq -e '.tokens.phases.compose.source == "claude-json"' "$state" >/dev/null 2>&1; then
+    b_ok=false; b_msg="compose entry missing"
+  fi
+  if ! jq -e '.tokens.phases.compose.input_tokens == null' "$state" >/dev/null 2>&1; then
+    b_ok=false; b_msg="compose input_tokens not null on malformed envelope"
+  fi
+  # totals sum nulls as 0 → claude_input 0; state.json valid.
+  if ! jq -e '.tokens.totals.claude_input == 0' "$state" >/dev/null 2>&1; then
+    b_ok=false; b_msg="malformed totals not zeroed"
+  fi
+  if ! jq empty "$state" >/dev/null 2>&1; then b_ok=false; b_msg="state.json invalid JSON"; fi
+fi
+if [[ "$b_ok" == true ]]; then
+  pass "flag ON malformed envelope: output falls back to envelope copy, run survives, tokens nulled"
+else
+  fail "flag ON malformed envelope failed ($b_msg) — run dir: ${b_run_dir:-<none>}"
+fi
+[[ -n "${b_run_dir:-}" && "$b_run_dir" != "$run_dir_before" ]] && rm -rf "$b_run_dir"
+
+# ===========================================================================
+# Scenario (c): flag OFF (default) — byte-parity guard. claude argv carries
+# `--output-format text` (not json), NO envelope/usage sidecars survive, and
+# state.json has NO `tokens` key.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo c)
+claude_log="$TEST_DIR/c_claude.log"; codex_log="$TEST_DIR/c_codex.log"
+: > "$claude_log"; : > "$codex_log"
+run_dir_before=$(latest_run_dir || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT MALFORMED_ENVELOPE
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  unset CO_EVOLVE_TOKEN_CAPTURE
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  bash "$RUNNER" --composer opus --executor codex --verifier opus \
+    --verify --bounces 0 --workdir "$repo" --timeout 60 -- "token capture parity off"
+) >"$TEST_DIR/c.out" 2>&1 || true
+c_run_dir=$(latest_run_dir)
+c_ok=true
+c_msg=""
+# claude argv: text mode, never json.
+if ! grep -Eq -- '--output-format text' "$claude_log"; then c_ok=false; c_msg="claude not in text mode"; fi
+if grep -Eq -- '--output-format json' "$claude_log"; then c_ok=false; c_msg="claude leaked json mode with flag off"; fi
+if [[ -z "$c_run_dir" || "$c_run_dir" == "$run_dir_before" ]]; then
+  c_ok=false; c_msg="no new run dir"
+else
+  # No sidecars anywhere (they were never written).
+  if ls "$c_run_dir"/*.usage.json "$c_run_dir"/.*.usage.json >/dev/null 2>&1; then c_ok=false; c_msg="usage sidecar exists with flag off"; fi
+  if ls "$c_run_dir"/*.envelope.json "$c_run_dir"/.*.envelope.json >/dev/null 2>&1; then c_ok=false; c_msg="envelope sidecar exists with flag off"; fi
+  # state.json has NO tokens key.
+  if jq -e 'has("tokens")' "$c_run_dir/state.json" >/dev/null 2>&1; then c_ok=false; c_msg="state.json has tokens key with flag off"; fi
+fi
+if [[ "$c_ok" == true ]]; then
+  pass "flag OFF parity: claude argv is --output-format text, no sidecars, no state.json tokens key"
+else
+  fail "flag OFF parity guard failed ($c_msg) — run dir: ${c_run_dir:-<none>}"
+  [[ -s "$claude_log" ]] && { echo "--- claude argv ---"; cat "$claude_log"; } >&2
+fi
+[[ -n "${c_run_dir:-}" && "$c_run_dir" != "$run_dir_before" ]] && rm -rf "$c_run_dir"
+
+# ===========================================================================
+# Scenario (d): --help is unaffected by the capture machinery (exits 0, emits
+# usage) regardless of the flag, and never references the token internals.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+d_ok=true
+d_msg=""
+# Help with flag explicitly ON must still behave exactly as the normal help.
+help_on=$(CO_EVOLVE_TOKEN_CAPTURE=1 bash "$RUNNER" --help 2>/dev/null; echo "EXIT=$?")
+help_off=$(bash "$RUNNER" --help 2>/dev/null; echo "EXIT=$?")
+if [[ "$help_on" != *"EXIT=0"* ]]; then d_ok=false; d_msg="--help exit nonzero with flag on"; fi
+if ! printf '%s' "$help_off" | grep -Eq -- '--composer|Usage|usage'; then d_ok=false; d_msg="--help did not emit usage"; fi
+# The flag must not change help output (no token sidecars, no env-dependent text).
+if [[ "${help_on%EXIT=*}" != "${help_off%EXIT=*}" ]]; then d_ok=false; d_msg="--help output differs by flag"; fi
+if [[ "$d_ok" == true ]]; then
+  pass "flag ON/OFF: --help unaffected (exit 0, usage emitted, byte-identical across the flag)"
+else
+  fail "--help affected by capture flag ($d_msg)"
+fi
+
+# --- summary ----------------------------------------------------------------
+passed=$((TOTAL - FAILURES))
+if (( FAILURES == 0 )); then
+  echo "$passed/$TOTAL scenarios passed"
+  exit 0
+else
+  echo "$passed/$TOTAL scenarios passed ($FAILURES failed)" >&2
+  exit 1
+fi

From 30a9a00b5b39999596a41404d258010459a78a6e Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 13:35:45 -0400
Subject: [PATCH 06/12] feat: /codex-build orchestration skill + routing docs,
 both transports (v1.5 Phase 5)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 AGENTS.md                        |   1 +
 CLAUDE.md                        |  25 +-
 README.md                        |   1 +
 dev-review/codex/instructions.md |   3 +
 skills/codex-build/SKILL.md      | 378 +++++++++++++++++++++++++++++++
 skills/dev-review/SKILL.md       |  43 +++-
 tests/docs-sync-simulation.sh    | 132 +++++++++++
 7 files changed, 577 insertions(+), 6 deletions(-)
 create mode 100644 skills/codex-build/SKILL.md
 create mode 100644 tests/docs-sync-simulation.sh

diff --git a/AGENTS.md b/AGENTS.md
index ee53f61..05d790b 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -17,6 +17,7 @@ Full spec: [`BOUNCE-PROTOCOL.md`](BOUNCE-PROTOCOL.md). Reference implementation:
   in the general-purpose co-evolution workflow for questions, drafts, plans,
   specs, arguments, and markdown refinement
 - If you are invoked via `/dev-review` (Claude Code) or `bash dev-review/codex/dev-review.sh` (Codex), you are inside the bounce pipeline — your role (reviewer / composer) and pass number are passed to you in the prompt template
+- If you are invoked via `/codex-build`, you are orchestrating a detached Codex build: the session plans and reviews, Codex executes in the background under `--preset codex-build`, and the session is woken at gates (see `skills/codex-build/`)
 - If you are invoked via `bash agent-bouncer/agent-bouncer.sh <doc>`, you are bouncing a single markdown document — same protocol applies
 - If you are exploring the repo directly without an explicit role, treat the protocol section above as orientation, then read the GSD-managed sections below for project meta
 
diff --git a/CLAUDE.md b/CLAUDE.md
index 8cdfe98..20d3b98 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -5,9 +5,17 @@ agents using structured `[CONTESTED]` / `[CLARIFY]` markers until convergence.
 
 ## Default Rule
 
-Use general co-evolution by default. Reach for `dev-review` only when the user
-specifically wants repo files changed, a bug fixed, a feature implemented, or a
-code diff verified against a plan.
+Three routes, lowest-ceremony first:
+
+- **co-evolution** (default) — bounce protocol for questions, drafts, plans,
+  specs, arguments, and markdown refinement.
+- **dev-review** — interactive code pipeline (compose → bounce → execute →
+  verify in one session) when the user wants repo files changed, a bug fixed, a
+  feature implemented, or a code diff verified against a plan.
+- **codex-build** — "build with codex": the session plans and reviews, Codex
+  executes detached, with check-ins only at gates. Reach for this when the user
+  wants Codex to grind on a build in the background while the session stays free
+  (the model-ladder flavor of dev-review).
 
 ## Components
 
@@ -54,6 +62,17 @@ Code-focused compose-bounce-execute-verify pipeline integrated with Claude Code.
 Key flags: `--skip-plan` executes a pre-existing plan, `--plan-only` stops after
 bounce, and `--live` opens visible Windows terminals for Codex passes.
 
+### Codex-Build Skill (`skills/codex-build/`)
+
+Detached orchestration skill: the session (typically Fable) plans and reviews,
+Codex executes in the background, and the session is woken at gates instead of
+babysitting. Kicks `dev-review.sh --preset codex-build` via a background task,
+ends the turn, then runs a schema-bound ACCEPT / REVISE / ESCALATE gate on wake.
+
+```text
+/codex-build Have codex implement the retry wrapper while I review
+```
+
 ### Codex Runtime (`dev-review/codex/`)
 
 Standalone Bash runtime for the code-focused compose-bounce-execute-verify flow
diff --git a/README.md b/README.md
index e9ffb06..fcc58bd 100644
--- a/README.md
+++ b/README.md
@@ -173,6 +173,7 @@ fallthrough.
 | Existing markdown document needs refinement | `co-evolve-bouncer.sh --bounce-only` |
 | Small, low-risk repo edit in 1-2 files | Direct execution |
 | Multi-file code change, medium/high risk, or plan/verify workflow | `dev-review/codex/dev-review.sh` |
+| Codex builds in the background while the session plans and reviews | `skills/codex-build/` (`--preset codex-build`, detached) |
 | General co-evolution inside Claude Code | `skills/co-evolution/` |
 | Code pipeline inside Claude Code | `skills/dev-review/` |
 | External MCP client (Claude Desktop, Cursor, Continue) | `npm i -g @alanshurafa/co-evolution-mcp` |
diff --git a/dev-review/codex/instructions.md b/dev-review/codex/instructions.md
index f30eb6b..92fc2dc 100644
--- a/dev-review/codex/instructions.md
+++ b/dev-review/codex/instructions.md
@@ -18,6 +18,8 @@ Pick the lowest-ceremony path that still protects correctness. **Start with `co-
 | Existing approved plan should be executed as-is | `bash dev-review/codex/dev-review.sh --skip-plan --plan <file>` | Reuses the approved plan |
 | User wants a reviewed plan before any code changes | `bash co-evolve-bouncer.sh --vanilla "task"` or `dev-review.sh --plan-only` | Plan artifact only |
 | User wants a final review against the plan and diff | `bash dev-review/codex/dev-review.sh --verify <task>` | Runs verifier after execution |
+| Orchestrated / background build — plan and review in-session, Codex executes detached | `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --preset codex-build --skip-plan --plan <file> --branch auto -- "<task>"` (kicked in the background; see `skills/codex-build/`) | Model-ladder ladder: Fable plans/reviews, Codex executes; session is not babysat |
+| Ad-hoc interactive delegation to Codex mid-session | OpenAI plugin `/codex:rescue --background` (see `skills/codex-build/` Transport B) | Quick delegation; review with the same verdict schema by hand, not via CI |
 
 ## Distinguish Coding From Everything Else
 
@@ -101,5 +103,6 @@ bash dev-review/codex/dev-review.sh --skip-plan --plan .planning/phases/04-docs-
 - Legacy document bouncer: `agent-bouncer/agent-bouncer.sh`
 - Default Claude Code skill surface: `skills/co-evolution/`
 - Code-focused Claude Code skill surface: `skills/dev-review/`
+- Detached Codex orchestration skill surface: `skills/codex-build/` (kicks `--preset codex-build` in the background)
 - Shared library: `lib/co-evolution.sh`
 - Co-evolve templates: `templates/co-evolve/`
diff --git a/skills/codex-build/SKILL.md b/skills/codex-build/SKILL.md
new file mode 100644
index 0000000..f1e731f
--- /dev/null
+++ b/skills/codex-build/SKILL.md
@@ -0,0 +1,378 @@
+---
+name: codex-build
+description: >
+  Orchestrate Codex to BUILD code in the background while this Claude Code
+  session (typically Fable) plans and reviews — never babysitting. The session
+  composes the implementation plan, kicks the dev-review runner detached via a
+  background Bash task with --preset codex-build, ENDS ITS TURN, and is woken on
+  exit to run a schema-bound review gate (ACCEPT / REVISE / ESCALATE, max 2
+  rounds). Use when the user wants to "build with codex", "have codex build",
+  "have codex implement", "orchestrate codex", "kick off codex in the
+  background", "let codex grind", or "codex builds while I review". For an
+  interactive single-session compose-bounce-execute loop with no detach, use
+  /dev-review. For non-code document refinement, use /co-evolution.
+allowed-tools: Bash, Read, Write, Edit, Grep, Glob, Agent
+---
+
+# /codex-build - Detached Codex Execution with Gate-Based Review
+
+This is the **orchestration protocol** behind the "Fable plans, Codex executes,
+Fable reviews" model ladder. The session is the composer and the reviewer; Codex
+is the executor. The defining rule: **the session is NEVER kept busy while Codex
+grinds.** You kick the runner as a background task, end your turn, and the
+harness wakes you when it exits.
+
+Use `/dev-review` instead when you want the classic interactive
+compose-bounce-execute loop in one session (you watch every pass). Use
+`/co-evolution` for questions, drafts, plans, and document refinement with no
+code execution.
+
+The protocol has five steps. Steps 1-3 happen in the FIRST turn (preflight,
+plan, kick-and-stop). Steps 4-5 happen on WAKE, after the background task exits.
+
+---
+
+## Step 1: PREFLIGHT (one turn, before planning)
+
+Run these checks. Each gate either passes, degrades with a stated fallback, or
+dies with a fix hint.
+
+### Codex present
+
+```bash
+command -v codex
+```
+
+If missing, die with the install hint:
+
+```text
+codex CLI not found. On this Mac with Codex.app installed:
+  ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex
+Otherwise install the CLI:
+  npm i -g @openai/codex
+Then re-run /codex-build.
+```
+
+### Claude CLI auth probe (decides the verifier seat)
+
+The runner's `claude` verifier seat runs in a headless shell that does NOT carry
+the interactive app's session token. Probe it:
+
+```bash
+claude -p --output-format text --model claude-haiku-4-5-20251001 "ping" 2>&1 | head -2
+```
+
+- If the output contains `Not logged in` (or `/login`): the runner's claude
+  verifier seat cannot run. **DEGRADE** — kick with `--verifier codex` so the
+  verify phase uses Codex's schema-bound review instead. Tell the user plainly:
+  the in-session review gate (Step 4, which YOU run) is then the only Claude-side
+  review of the diff, and to run `claude /login` in a terminal to restore the
+  full ladder (Fable verifier seat inside the runner).
+- Otherwise: the full ladder is available. Kick with the preset's default claude
+  verifier (Fable, max effort) — no `--verifier` override needed.
+
+### No other active run
+
+```bash
+bash dev-review/codex/dev-review-status.sh --list
+```
+
+If the `ACTIVE` section shows a non-terminal run (`status=pending` or
+`status=null`), do NOT kick a second one — one orchestrated run per workdir.
+Either wait for it (check its status on wake) or escalate to the user.
+
+### Isolation — clean vs dirty tree
+
+```bash
+git -C "$(pwd)" status --short
+```
+
+- **Clean tree** → kick with `--branch auto` (the runner cuts
+  `dev-review/auto-<ts>-<slug>` off HEAD before execute).
+- **Dirty tree** → kick with `--worktree auto` (REQUIRED). A dirty tree
+  otherwise makes the runner's verify phase silently skip: it returns early with
+  `verification skipped - workdir had pre-existing uncommitted changes` because
+  it cannot isolate this run's diff. `--worktree auto` gives Codex a clean
+  sibling checkout so verify runs against an isolatable diff.
+
+### Local clone, never an SMB mount
+
+The workdir must be a local git clone. Never run from a network/SMB mount
+(`/Volumes/...`, `\\server\...`) — it causes line-ending churn and the runner's
+diff isolation is unreliable. If the cwd is a mount, stop and ask the user to
+point at a local clone.
+
+---
+
+## Step 2: PLAN (in-session — the session is the composer)
+
+This is the cheap, interactive part. You explore and write the plan; you do NOT
+hand planning to Codex.
+
+1. Explore the codebase with Read / Grep / Glob (or an Explore Agent for
+   fan-out). Understand the task, the files involved, and the conventions.
+2. Write the implementation plan to a temp file **outside the workdir**:
+
+   ```bash
+   PLAN_FILE="${TMPDIR:-/tmp}/codex-build-plan-$(date +%Y%m%d-%H%M%S).md"
+   ```
+
+   **NEVER** write the plan inside the workdir. An untracked plan file there
+   makes the runner's verify phase skip (`run left untracked files that cannot
+   be diffed automatically`). **NEVER** pass the plan path inside a prompt to
+   Codex — per repo convention the runner embeds plan content inline; the
+   orchestrator is the sole owner of the canonical plan file.
+
+3. Shape the plan the way `--skip-plan --plan FILE` expects. The runner's plan
+   validator wants a real plan artifact: a top-level heading, >= ~60 words, >= 5
+   non-empty lines, and >= 2 structural lines (headings / bullets). Write the
+   plan with these sections:
+
+   ```text
+   # <Task title>
+
+   ## Approach
+   <what will be built and the key technical decisions>
+
+   ## Files to Change
+   - `path/to/file` — <what changes and why>
+
+   ## Steps
+   1. <ordered implementation steps>
+
+   ## Risks / Out of scope
+   - <risks, and anything explicitly NOT in scope>
+   ```
+
+   A thin or malformed plan makes the runner's plan-quality gate fail before
+   execute.
+
+### Optional plan hardening (≤ 2 synchronous bounce passes)
+
+For higher-stakes work, harden the plan against Codex BEFORE kicking, using the
+bounce protocol markers. This is the only synchronous Codex use in this skill —
+keep it short (a plan bounce is seconds-to-low-minutes, well under the 5-minute
+synchronous ceiling):
+
+- Send the plan to Codex asking it to critique with `[CONTESTED]` (disagreement
+  + counter-argument) and `[CLARIFY]` (ambiguity + two interpretations) markers.
+- Resolve EVERY `[CONTESTED]` / `[CLARIFY]` before kicking. Apply the 2-pass
+  expiry rule: any marker still open after 2 passes — decide it yourself or ask
+  the user. Do not kick with open markers.
+
+If you skip hardening, that is fine for routine tasks — go straight to Step 3.
+
+---
+
+## Step 3: KICK + END TURN (the load-bearing step)
+
+Kick the runner as a **background** Bash task, report the run id, and **end your
+turn**. This is what makes the model ladder pay off — the session pays only for
+plan + review gates, not for watching Codex grind.
+
+Run this with the Bash tool, **`run_in_background: true`**:
+
+```bash
+CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \
+  --preset codex-build --skip-plan --plan "$PLAN_FILE" \
+  [--verifier codex] \
+  [--branch auto | --worktree auto] \
+  [--parent-run <previous-run-id>] \
+  [--timeout 1800] \
+  -- "<task description>"
+```
+
+Fill the brackets from Step 1:
+- `--verifier codex` ONLY when the auth probe degraded (else omit — the preset's
+  Fable/max claude verifier is the default).
+- `--branch auto` for a clean tree; `--worktree auto` for a dirty tree.
+- `--parent-run <id>` only on a REVISE re-kick (Step 4), to tag lineage.
+- `--timeout` defaults to 1800s; raise it for large tasks.
+
+The preset expands to: composer = Fable (high), executor = Codex (xhigh, model
+left to the CLI's config), verifier = Fable (max), `--verify` on, bounces 2,
+revise-loop 1. `CO_EVOLVE_TOKEN_CAPTURE=1` records per-phase tokens into
+`state.json` so you can report spend at the gate.
+
+After kicking:
+1. Capture the run id. It is `dev-review-<timestamp>` and the run dir is
+   `runs/dev-review-<timestamp>/`. Read it from the runner's early stdout, or
+   resolve the newest run with `ls -dt runs/dev-review-* | head -1`.
+2. Report to the user: the run id, and how to check it manually —
+   `bash dev-review/codex/dev-review-status.sh <run-id>` (human) or `--json`
+   (machine).
+3. **END THE TURN.**
+
+### Hard rules for Step 3
+
+- Do **NOT** poll. Do **NOT** sleep. Do **NOT** loop waiting on status.
+- Do **NOT** run Codex synchronously for anything expected to take longer than
+  ~5 minutes — that defeats the entire purpose (the session burns cache reads
+  every turn while idle, the exact anti-pattern this skill exists to avoid).
+- The harness re-invokes this session when the background task exits. **That
+  notification IS the wake signal.** You do not build the wake mechanism.
+- Optional safety net for very long runs (reference, do not build inline): a
+  one-shot scheduled reminder that re-checks status after the expected duration,
+  in case a machine-sleep drops the wake notification.
+
+---
+
+## Step 4: WAKE → REVIEW GATE (1-2 turns, after the background task exits)
+
+When the background task exits, you are re-invoked. Reconstruct state from disk —
+read the status JSON FIRST, then read narrowly. Do not re-read the whole tree.
+
+```bash
+bash dev-review/codex/dev-review-status.sh --json <run-id>
+```
+
+This emits one object with: `status`, `verdict` (APPROVED / REVISE / null),
+`verdict_json` (path to `verdict.json`), `verdict_present`, `diffstat_tail`,
+`current_phase`, `marker_counts`, `assess`, and `exit_code` (the status reader's
+liveness code: 0 done / 2 partial / 4 presumed-dead / 5 running / 3 no-run).
+
+Then, BEFORE reading any source files:
+1. Read `verdict.json` (the path is in `.verdict_json`) — the schema-bound
+   verdict: `verdict`, `confidence`, `summary`, `issues[]`,
+   `scope_creep_detected`.
+2. Read the diffstat (`.diffstat_tail`, or
+   `runs/<run-id>/execute-diffstat.txt`).
+
+Only THEN read targeted diff hunks and the specific files named in `issues[]`.
+Never re-read the whole tree to form an opinion.
+
+Decide exactly ONE of three outcomes.
+
+### ACCEPT
+
+Conditions: verdict is `APPROVED` AND your own spot-check of the named diff
+hunks agrees. Report and finish:
+
+- Branch or worktree path (where the code landed).
+- Diffstat (files changed, insertions/deletions).
+- Verdict confidence and one-line summary.
+- Token totals:
+
+  ```bash
+  jq '.tokens.totals' runs/<run-id>/state.json
+  ```
+
+- Your own turn count for this orchestration (plan + kick + this gate).
+- Suggested next step: merge / PR commands, e.g.
+  `git checkout master && git merge <branch>` or `gh pr create`.
+
+Then you are done. **Never auto-merge** — surface the commands for the user.
+
+### REVISE (max 2 orchestrator rounds total)
+
+Conditions: verdict is `REVISE`, or your spot-check finds a real, fixable
+problem, AND you have not already used 2 revise rounds.
+
+1. Copy the plan file to a new path (keep the original; lineage is by file):
+
+   ```bash
+   REVISE_PLAN="${TMPDIR:-/tmp}/codex-build-plan-$(date +%Y%m%d-%H%M%S)-rN.md"
+   cp "$PLAN_FILE" "$REVISE_PLAN"
+   ```
+
+2. Append a corrections section with **file-specific, actionable** fixes drawn
+   from `issues[]` and your spot-check:
+
+   ```text
+   ## Reviewer Corrections (Round N)
+   - `path/to/file`: <concrete fix — what to change and to what>
+   ```
+
+3. Re-kick Step 3 with the new plan and `--parent-run <this-run-id>` (tags the
+   re-kick's lineage). Then END THE TURN again.
+
+Never re-kick on identical feedback twice — if the same issue survives a round,
+that is an ESCALATE, not another REVISE.
+
+### ESCALATE
+
+Conditions (any): verdict missing (`verdict_present: false` /
+`verdict: null`) · runner exited non-zero (status `failed`) · status reader exit
+4 (presumed-dead runner) · 2 revise rounds exhausted · `scope_creep_detected:
+true`.
+
+Write a concise human summary — do NOT auto-merge, do NOT silently retry:
+- What was attempted (task, rounds used).
+- Artifact paths (run dir, `verdict.json`, diffstat, the plan file).
+- Open issues (from `issues[]` and/or the dead-runner assessment).
+- Recommended manual step (inspect the worktree, re-kick with `--skip-plan`,
+  fix by hand, etc.).
+
+---
+
+## Step 5: BUDGET + TROUBLESHOOTING
+
+### Budget rules
+
+- **Max 2 orchestrator revise rounds.** Round 3 is an ESCALATE.
+- The runner's internal `--revise-loop 1` (set by the preset) adds at most one
+  cheap automatic retry per kick, INSIDE the runner — that is separate from and
+  cheaper than your orchestrator rounds. You still cap your own re-kicks at 2.
+- **One orchestrated run per workdir.** The Step 1 `--list` check enforces this.
+
+### Troubleshooting
+
+| Symptom | Cause | Action |
+|---|---|---|
+| status reader exits 4 (presumed-dead) | runner process gone mid-phase | `pkill -f 'codex exec'` to clear orphans, then re-kick `--skip-plan --plan "$PLAN_FILE"` (add `--parent-run <dead-run-id>`) |
+| Machine slept during the run | wake notification may have been dropped | Run the status script manually; do not assume the run failed |
+| `tokens` block missing from state.json | `CO_EVOLVE_TOKEN_CAPTURE=1` was not set, or `jq` is missing | Re-kick with the flag set; install `jq` for token capture |
+| Verify silently skipped | dirty tree kicked without `--worktree auto`, or run left untracked files | Re-kick on a clean tree with `--branch auto`, or `--worktree auto` |
+| `claude` verifier seat errors mid-run | headless shell not logged in | Re-kick with `--verifier codex`; run `claude /login` to restore the Fable seat |
+
+---
+
+## Transport B: the OpenAI plugin (secondary, interactive flavor)
+
+When you are already mid-session on small ad-hoc work and want to delegate a
+quick build/fix to Codex without the full runner, the official OpenAI plugin is
+a supported alternative.
+
+Install (one-time, non-interactive):
+
+```bash
+claude plugin marketplace add openai/codex-plugin-cc
+claude plugin install codex@openai-codex
+```
+
+It provides `/codex:rescue --background` (delegate a task to Codex detached),
+`/codex:status` (check the job), `/codex:result` (fetch the result), and
+`/codex:cancel`.
+
+Protocol when using the plugin:
+1. Delegate execution to `codex-rescue` (or `/codex:rescue --background`).
+2. Apply the **same review contract** in-session: read the diff and fill the
+   review-verdict JSON yourself (`skills/dev-review/schemas/review-verdict.json`
+   — `verdict`, `confidence`, `summary`, `issues[]`, `scope_creep_detected`).
+3. Same `≤ 2`-round revise cap. Same **never auto-merge** rule.
+
+State plainly to the user: the plugin path shares our templates/schema **by
+convention** but is **NOT** exercised by evals/CI — there is no `--output-schema`
+contract enforced through it (you fill the verdict by hand), and it is pinned to
+the plugin's major version (`v1.x`). Format drift in the plugin's output is the
+plugin's risk, not ours. The runner path (Transport A above) is the one that
+carries CI.
+
+---
+
+## Notes
+
+- **Why detached, not babysat:** the post that inspired this (Fable plans, Codex
+  executes, Fable reviews) keeps a session supervising inline, burning cache
+  reads every turn. This skill kicks the runner via background Bash, ends the
+  turn, and gets woken on exit — the session pays only for plan + review gates.
+- **The preset is the single source of truth for the seats.** `codex-build`
+  expands to Fable/high · Codex/xhigh · Fable/max · verify on · bounces 2 ·
+  revise-loop 1, inside `dev-review/codex/dev-review.sh` (`apply_preset`). The
+  preset triple is pinned by `tests/docs-sync-simulation.sh` so this doc and the
+  runner cannot drift.
+- **Live-mode caveat:** `--live` codex windows ignore model/effort overrides
+  (documented v1 limitation in /dev-review). This skill does not use `--live`.
+- **Plan ownership:** the orchestrator owns the canonical plan file; Codex only
+  ever receives plan content inline (embedded by the runner) and writes to a
+  separate output. Never let Codex see the canonical plan path.
diff --git a/skills/dev-review/SKILL.md b/skills/dev-review/SKILL.md
index 36dd9ad..719fb39 100644
--- a/skills/dev-review/SKILL.md
+++ b/skills/dev-review/SKILL.md
@@ -26,6 +26,14 @@ Use `/co-evolution` for general questions, ideas, arguments, prompt refinement,
 or document bounces. Use `/dev-review` only when the intended deliverable is a
 repo change or code-focused verification workflow.
 
+For the **detached** flavor — the session plans and reviews while Codex executes
+in the background and the session is woken at gates instead of babysitting — use
+`/codex-build` (`skills/codex-build/`). It kicks the standalone runner with
+`--preset codex-build` (Fable plans at `high`, Codex executes at `xhigh`, Fable
+reviews at `max`; `--verify` on, bounces `2`, revise-loop `1`) and applies a
+schema-bound ACCEPT / REVISE / ESCALATE gate. `/dev-review` is the interactive
+single-session loop; `/codex-build` is the same pipeline run detached.
+
 When `--live` is enabled on Windows, each Codex pass runs in a visible PowerShell
 window so the user can watch progress in real time. The handoff contract stays
 file-based: prompts are written to disk, Codex writes the final message to disk,
@@ -170,6 +178,28 @@ the warning from Step 2 is enough; still render the banner as `Live: no`.
 ### Initialize the plan document:
 Create `/tmp/dev-review-plan-{timestamp}.md` - this is the living document that bounces.
 
+### Build the shared Codex argument block: `CODEX_ARGS`
+
+`--model` and the per-seat effort knob are parsed but must actually reach
+`codex exec`. Build a `CODEX_ARGS` array ONCE here and splice it into every
+headless `codex exec` invocation (compose, bounce, execute, verify). Codex takes
+model and reasoning-effort as `-c key=value` overrides, not as named flags:
+
+```bash
+CODEX_ARGS=()
+[ -n "$CODEX_MODEL" ] && CODEX_ARGS+=( -c "model=$CODEX_MODEL" )
+[ -n "${CODEX_EFFORT:-}" ] && CODEX_ARGS+=( -c "model_reasoning_effort=$CODEX_EFFORT" )
+```
+
+`$CODEX_MODEL` comes from `--model`; `$CODEX_EFFORT` (if your invocation sets it)
+maps to the executor's reasoning effort, mirroring the standalone runner's
+`-c model_reasoning_effort=` wiring. When both are empty, `CODEX_ARGS` is empty
+and the codex commands are byte-identical to today's — Codex falls back to its
+own `~/.codex/config.toml` defaults.
+
+Splice it into each headless codex command with `"${CODEX_ARGS[@]}"`, e.g.
+`codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o <out>`.
+
 ### Shared helper: `launch_codex_live`
 
 Use one helper for every live Codex pass: compose, bounce, execute, and verify.
@@ -329,7 +359,7 @@ launch_codex_live "Codex - Compose" "exec" \
 If `$LIVE_ACTIVE=false`, keep the existing headless behavior:
 
 ```bash
-printf '%s' "$COMPOSE_PROMPT" | codex exec --full-auto -C "$(pwd)" -o /tmp/dev-review-plan-{timestamp}.md
+printf '%s' "$COMPOSE_PROMPT" | codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o /tmp/dev-review-plan-{timestamp}.md
 ```
 
 **Important:** Always pipe plan content through stdin or embed it in the prompt file.
@@ -421,7 +451,7 @@ cp /tmp/dev-review-bounce-output-{timestamp}-{N}.md /tmp/dev-review-plan-{timest
 If `$LIVE_ACTIVE=false`, keep the existing headless behavior:
 
 ```bash
-printf '%s' "$(cat /tmp/dev-review-bounce-prompt.md)" | codex exec --full-auto -C "$(pwd)" -o /tmp/dev-review-bounce-output-{timestamp}-{N}.md
+printf '%s' "$(cat /tmp/dev-review-bounce-prompt.md)" | codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o /tmp/dev-review-bounce-output-{timestamp}-{N}.md
 # Orchestrator overwrites canonical plan with Codex's output
 cp /tmp/dev-review-bounce-output-{timestamp}-{N}.md /tmp/dev-review-plan-{timestamp}.md
 ```
@@ -582,7 +612,7 @@ launch_codex_live "Codex - Execute" "exec" \
 If `$LIVE_ACTIVE=false`, keep the existing headless behavior:
 
 ```bash
-codex exec --full-auto -C "$(pwd)" < /tmp/dev-review-exec-prompt.md > /tmp/dev-review-exec-output.md 2>&1
+codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" < /tmp/dev-review-exec-prompt.md > /tmp/dev-review-exec-output.md 2>&1
 ```
 
 After execution, verify changes:
@@ -631,6 +661,8 @@ Pass 1: `codex review --uncommitted` (if available, skip if not)
 - If `$LIVE_ACTIVE=false`, keep the current headless `codex review` behavior.
 
 Pass 2: `codex exec --output-schema` with the review prompt
+- Headless: splice `"${CODEX_ARGS[@]}"` in too, e.g.
+  `codex exec --full-auto "${CODEX_ARGS[@]}" --output-schema <schema> -C "$(pwd)" -o /tmp/dev-review-verdict.json < /tmp/dev-review-review-prompt.md`
 - If `$LIVE_ACTIVE=true`, use `launch_codex_live` with title `Codex - Verify`,
   mode `exec-schema`, prompt `/tmp/dev-review-review-prompt.md`, output
   `/tmp/dev-review-verdict.json`, and schema path
@@ -745,6 +777,11 @@ Dev-review is integrated into GSD workflows:
 - **`--live` is launch-mode only**: It does not change the artifact contract.
   Prompts still come from temp files, and the final Codex message still lands in
   the same output file before Claude Code continues.
+- **`--live` ignores model/effort overrides** (documented v1 limitation): the
+  visible-window launcher (`launch_codex_live`) does not splice `CODEX_ARGS`, so
+  `--model` / the reasoning-effort knob have no effect on live Codex passes.
+  Those overrides apply only on the headless path. Use headless mode when a
+  pinned Codex model or effort matters.
 - **`--live` is Windows-only in the first release**: On non-Windows platforms,
   warn once and continue headless.
 - **Live windows auto-close after a short read delay**: The wrapper waits 5 seconds
diff --git a/tests/docs-sync-simulation.sh b/tests/docs-sync-simulation.sh
new file mode 100644
index 0000000..03d4b1e
--- /dev/null
+++ b/tests/docs-sync-simulation.sh
@@ -0,0 +1,132 @@
+#!/usr/bin/env bash
+# tests/docs-sync-simulation.sh
+# Hermetic sync-token gate for v1.5 Phase 5: pin the codex-build preset triple
+# so the runner code and the user-facing docs cannot drift.
+#
+# The `codex-build` preset is defined in ONE place (dev-review/codex/dev-review.sh
+# apply_preset) but DESCRIBED in two skill docs (skills/codex-build/SKILL.md and
+# skills/dev-review/SKILL.md). If someone edits the seats in one and forgets the
+# others, the docs lie. This gate greps the load-bearing seat tokens out of each
+# file and asserts they all agree on the same triple:
+#
+#   composer  = Fable, effort high
+#   executor  = Codex, effort xhigh (model unpinned — CLI config rules)
+#   verifier  = Fable, effort max
+#   bounces   = 2
+#
+# Pattern: pure-grep assertions over the real files (no agents, no stubs). Each
+# scenario is a (file, required-token) pair. This is a structural invariant test,
+# the same flavour as the frozen-surface grep checks in classifier-simulation.sh.
+
+set -uo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh"
+CODEX_BUILD_SKILL="$REPO_ROOT/skills/codex-build/SKILL.md"
+DEV_REVIEW_SKILL="$REPO_ROOT/skills/dev-review/SKILL.md"
+
+TOTAL=0
+FAILURES=0
+pass() { printf "PASS: %s\n" "$1"; }
+fail() { printf "FAIL: %s\n" "$1" >&2; FAILURES=$((FAILURES + 1)); }
+
+# assert_grep <label> <file> <ERE-pattern>
+# One scenario: the pattern must match somewhere in the file.
+assert_grep() {
+  local label="$1" file="$2" pat="$3"
+  TOTAL=$((TOTAL + 1))
+  if [[ ! -f "$file" ]]; then
+    fail "$label — file missing: $file"
+    return
+  fi
+  if grep -Eiq -- "$pat" "$file"; then
+    pass "$label"
+  else
+    fail "$label — pattern not found in $(basename "$file"): /$pat/"
+  fi
+}
+
+# assert_not_grep <label> <file> <ERE-pattern>
+# One scenario: the pattern must NOT match (used for the "executor model unpinned"
+# invariant — no `-c model=` is pinned for the executor seat in the preset).
+assert_not_grep() {
+  local label="$1" file="$2" pat="$3"
+  TOTAL=$((TOTAL + 1))
+  if [[ ! -f "$file" ]]; then
+    fail "$label — file missing: $file"
+    return
+  fi
+  if grep -Eq -- "$pat" "$file"; then
+    fail "$label — pattern unexpectedly found in $(basename "$file"): /$pat/"
+  else
+    pass "$label"
+  fi
+}
+
+# ===========================================================================
+# Group 1: the runner's apply_preset() is the single source of truth.
+# These pin the literal knob assignments so a doc edit can be diffed against
+# the code, and a code edit that changes the triple breaks this gate too.
+# ===========================================================================
+assert_grep "runner: composer Fable seat (COMPOSER_MODEL:=fable)" \
+  "$RUNNER" 'COMPOSER_MODEL:=fable'
+assert_grep "runner: composer effort high (COMPOSER_EFFORT:=high)" \
+  "$RUNNER" 'COMPOSER_EFFORT:=high'
+assert_grep "runner: verifier Fable seat (VERIFIER_MODEL:=fable)" \
+  "$RUNNER" 'VERIFIER_MODEL:=fable'
+assert_grep "runner: verifier effort max (VERIFIER_EFFORT:=max)" \
+  "$RUNNER" 'VERIFIER_EFFORT:=max'
+assert_grep "runner: executor effort xhigh (EXECUTOR_EFFORT:=xhigh)" \
+  "$RUNNER" 'EXECUTOR_EFFORT:=xhigh'
+assert_grep "runner: bounces pinned to 2 (BOUNCES=2)" \
+  "$RUNNER" 'BOUNCES=2'
+# Executor model is deliberately unpinned: apply_preset must NOT set EXECUTOR_MODEL
+# (the CLI's ~/.codex/config.toml model wins). Pin that invariant.
+TOTAL=$((TOTAL + 1))
+if sed -n '/^apply_preset()/,/^}/p' "$RUNNER" | grep -Eq 'EXECUTOR_MODEL'; then
+  fail "runner: executor model unexpectedly pinned in apply_preset (must stay unpinned)"
+else
+  pass "runner: executor model unpinned in apply_preset (CLI config rules)"
+fi
+
+# ===========================================================================
+# Group 2: skills/codex-build/SKILL.md must describe the SAME triple.
+# Tokens kept loose enough to survive prose rewording but tight enough to catch
+# a wrong seat (e.g. opus instead of Fable, high instead of xhigh on executor).
+# ===========================================================================
+assert_grep "codex-build skill: composer Fable at high" \
+  "$CODEX_BUILD_SKILL" 'composer = Fable \(high\)'
+assert_grep "codex-build skill: executor Codex at xhigh" \
+  "$CODEX_BUILD_SKILL" 'executor = Codex \(xhigh'
+assert_grep "codex-build skill: verifier Fable at max" \
+  "$CODEX_BUILD_SKILL" 'verifier = Fable \(max\)'
+assert_grep "codex-build skill: bounces 2" \
+  "$CODEX_BUILD_SKILL" 'bounces 2'
+assert_grep "codex-build skill: executor model unpinned (CLI config note)" \
+  "$CODEX_BUILD_SKILL" "left to the CLI's config"
+
+# ===========================================================================
+# Group 3: skills/dev-review/SKILL.md cross-references the same preset triple.
+# ===========================================================================
+assert_grep "dev-review skill: Fable plans at high" \
+  "$DEV_REVIEW_SKILL" 'Fable plans at .?high'
+assert_grep "dev-review skill: Codex executes at xhigh" \
+  "$DEV_REVIEW_SKILL" 'Codex executes at .?xhigh'
+assert_grep "dev-review skill: reviews at max" \
+  "$DEV_REVIEW_SKILL" 'reviews at .?max'
+assert_grep "dev-review skill: bounces 2" \
+  "$DEV_REVIEW_SKILL" 'bounces .?2'
+assert_grep "dev-review skill: names --preset codex-build" \
+  "$DEV_REVIEW_SKILL" '--preset codex-build'
+
+# --- summary ----------------------------------------------------------------
+passed=$((TOTAL - FAILURES))
+if (( FAILURES == 0 )); then
+  echo "$passed/$TOTAL scenarios passed"
+  exit 0
+else
+  echo "$passed/$TOTAL scenarios passed ($FAILURES failed)" >&2
+  exit 1
+fi

From c9b40f6e8893ec475433f8df96f44a7680a6a5f1 Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 14:09:50 -0400
Subject: [PATCH 07/12] =?UTF-8?q?fix:=20cross-agent=20seat=20leak=20+=20st?=
 =?UTF-8?q?rict-mode=20verdict=20schema=20=E2=80=94=20codex=20verifier=20s?=
 =?UTF-8?q?eat=20now=20real=20(v1.5=20Phase=206)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Found by live dogfood: (1) --preset codex-build --verifier codex leaked
VERIFIER_MODEL=fable into CODEX_MODEL (ChatGPT plan rejects with HTTP 400);
apply_seat_env now drops a cross-agent-kind model+effort pair as a unit.
(2) review-verdict.json lacked additionalProperties:false on issues.items
and complete required lists (OpenAI strict structured output) — the codex
--output-schema verify path could never have produced a verdict. Proven
E2E: real run exits 0 with verdict APPROVED@96, verify tokens captured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 dev-review/codex/dev-review.sh                | 34 ++++++++++++
 runners/codex-ps/schemas/review-verdict.json  |  5 +-
 schemas/review-verdict.json                   |  5 +-
 skills/dev-review/schemas/review-verdict.json |  5 +-
 tests/preset-expansion-simulation.sh          | 53 +++++++++++++++++++
 5 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 30fd33e..4fccdf2 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -1367,6 +1367,28 @@ apply_seat_env() {
     verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;;
     *) : ;;  # bounce counterparty: globals only
   esac
+  # v1.5 cross-agent leak guard. A seat's model+effort is ONE override pair
+  # configured for a specific agent kind (e.g. the codex-build preset fills
+  # VERIFIER_MODEL=fable / VERIFIER_EFFORT=max for its DEFAULT claude verifier).
+  # When --verifier codex flips that same seat to codex, the pair is now wrong:
+  # exporting CODEX_MODEL=fable makes codex/ChatGPT reject the run with
+  #   HTTP 400: The 'fable' model is not supported when using Codex with a
+  #   ChatGPT account.
+  # and CODEX_REASONING_EFFORT=max is off codex's scale (it ends at xhigh).
+  # The two halves are coupled: a model picked for the wrong kind means the
+  # effort was too — so drop the WHOLE pair (not just the model) and fall back
+  # to that agent's base values (CODEX_MODEL_BASE/CODEX_EFFORT_BASE = the user's
+  # codex config defaults, which is exactly right for the degrade path). Symmetric
+  # on the claude arm so a codex-shaped override never reaches a claude seat.
+  if [[ "$agent" == "codex" ]]; then
+    case "$model" in
+      fable|claude-*) model=""; effort="" ;;
+    esac
+  else
+    case "$model" in
+      gpt-*|codex*) model=""; effort="" ;;
+    esac
+  fi
   if [[ "$agent" == "codex" ]]; then
     export CODEX_MODEL="${model:-$CODEX_MODEL_BASE}"
     export CODEX_REASONING_EFFORT="${effort:-$CODEX_EFFORT_BASE}"
@@ -1389,6 +1411,18 @@ resolve_seat_model_string() {
     verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;;
     *) : ;;
   esac
+  # v1.5: mirror apply_seat_env's cross-agent leak guard (drop the wrong-kind
+  # model+effort pair as a unit) so the banner / state.json seat_models report
+  # what actually runs — e.g. codex:(default)@(default), not codex:fable@max.
+  if [[ "$agent" == "codex" ]]; then
+    case "$model" in
+      fable|claude-*) model=""; effort="" ;;
+    esac
+  else
+    case "$model" in
+      gpt-*|codex*) model=""; effort="" ;;
+    esac
+  fi
   if [[ "$agent" == "codex" ]]; then
     model_str="${model:-$CODEX_MODEL_BASE}"
     effort_str="${effort:-$CODEX_EFFORT_BASE}"
diff --git a/runners/codex-ps/schemas/review-verdict.json b/runners/codex-ps/schemas/review-verdict.json
index 3d4835a..3e4193e 100644
--- a/runners/codex-ps/schemas/review-verdict.json
+++ b/runners/codex-ps/schemas/review-verdict.json
@@ -3,7 +3,7 @@
   "title": "ReviewVerdict",
   "description": "Structured code review verdict for the dev-review loop",
   "type": "object",
-  "required": ["verdict", "confidence", "summary", "issues"],
+  "required": ["verdict", "confidence", "summary", "issues", "scope_creep_detected", "iteration_notes"],
   "properties": {
     "verdict": {
       "type": "string",
@@ -24,7 +24,8 @@
       "type": "array",
       "items": {
         "type": "object",
-        "required": ["severity", "description"],
+        "additionalProperties": false,
+        "required": ["severity", "file", "line_range", "description", "suggestion"],
         "properties": {
           "severity": {
             "type": "string",
diff --git a/schemas/review-verdict.json b/schemas/review-verdict.json
index 3d4835a..3e4193e 100644
--- a/schemas/review-verdict.json
+++ b/schemas/review-verdict.json
@@ -3,7 +3,7 @@
   "title": "ReviewVerdict",
   "description": "Structured code review verdict for the dev-review loop",
   "type": "object",
-  "required": ["verdict", "confidence", "summary", "issues"],
+  "required": ["verdict", "confidence", "summary", "issues", "scope_creep_detected", "iteration_notes"],
   "properties": {
     "verdict": {
       "type": "string",
@@ -24,7 +24,8 @@
       "type": "array",
       "items": {
         "type": "object",
-        "required": ["severity", "description"],
+        "additionalProperties": false,
+        "required": ["severity", "file", "line_range", "description", "suggestion"],
         "properties": {
           "severity": {
             "type": "string",
diff --git a/skills/dev-review/schemas/review-verdict.json b/skills/dev-review/schemas/review-verdict.json
index 3d4835a..3e4193e 100644
--- a/skills/dev-review/schemas/review-verdict.json
+++ b/skills/dev-review/schemas/review-verdict.json
@@ -3,7 +3,7 @@
   "title": "ReviewVerdict",
   "description": "Structured code review verdict for the dev-review loop",
   "type": "object",
-  "required": ["verdict", "confidence", "summary", "issues"],
+  "required": ["verdict", "confidence", "summary", "issues", "scope_creep_detected", "iteration_notes"],
   "properties": {
     "verdict": {
       "type": "string",
@@ -24,7 +24,8 @@
       "type": "array",
       "items": {
         "type": "object",
-        "required": ["severity", "description"],
+        "additionalProperties": false,
+        "required": ["severity", "file", "line_range", "description", "suggestion"],
         "properties": {
           "severity": {
             "type": "string",
diff --git a/tests/preset-expansion-simulation.sh b/tests/preset-expansion-simulation.sh
index 6cf10bc..4b6ccbf 100644
--- a/tests/preset-expansion-simulation.sh
+++ b/tests/preset-expansion-simulation.sh
@@ -17,6 +17,9 @@
 #   f. --help mentions --preset.
 #   g. parity guard: a default run emits no --effort and no
 #      model_reasoning_effort anywhere.
+#   h. cross-agent leak guard: --verifier codex over the preset must NOT leak
+#      the claude verifier's fable@max into the codex verify argv (live HTTP 400
+#      on a ChatGPT account); seat_models.verifier falls back to default.
 #
 # Pattern: PATH-injected claude + codex stubs append their full argv (one line
 # per invocation) to phase-agnostic logs; assertions grep the logs. Both stubs
@@ -391,6 +394,56 @@ else
 fi
 [[ -n "${g_run_dir:-}" && "$g_run_dir" != "$run_dir_before" ]] && rm -rf "$g_run_dir"
 
+# ===========================================================================
+# Scenario (h): cross-agent leak guard — --preset codex-build --verifier codex.
+# The preset fills VERIFIER_MODEL=fable / VERIFIER_EFFORT=max for its DEFAULT
+# claude verifier; --verifier codex flips that seat to codex. Without the guard
+# in apply_seat_env, the codex verify argv carried `-c model=fable` (→ live
+# HTTP 400: "The 'fable' model is not supported when using Codex with a ChatGPT
+# account.") and `-c model_reasoning_effort=max` (off codex's xhigh scale).
+# Assert the codex verify argv carries NEITHER, and state.json seat_models
+# .verifier reports the fallback (codex:(default)@(default)), not codex:fable@max.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo h)
+codex_log="$TEST_DIR/h_codex.log"; claude_log="$TEST_DIR/h_claude.log"
+: > "$codex_log"; : > "$claude_log"
+run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  bash "$RUNNER" --preset codex-build --verifier codex --skip-plan --plan "$PLAN_FIXTURE" \
+    --workdir "$repo" --timeout 60 -- "preset codex-verifier leak guard probe"
+) >"$TEST_DIR/h.out" 2>&1 || true
+h_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1)
+# The verify invocation is the codex argv line carrying --output-schema.
+h_verify_argv=$(grep -- '--output-schema' "$codex_log" || true)
+h_ok=true
+# No leaked fable model and no off-scale max effort in the codex verify argv.
+if printf '%s' "$h_verify_argv" | grep -Eq -- '-c model=fable'; then h_ok=false; fi
+if printf '%s' "$h_verify_argv" | grep -Eq -- 'model_reasoning_effort=max'; then h_ok=false; fi
+# Sanity: the codex verify invocation actually happened (so the asserts are real).
+if [[ -z "$h_verify_argv" ]]; then h_ok=false; fi
+# state.json must report the fallback seat, not the leaked fable@max pair.
+if [[ -n "$h_run_dir" && "$h_run_dir" != "$run_dir_before" ]]; then
+  if ! jq -e '.seat_models.verifier == "codex:(default)@(default)"' "$h_run_dir/state.json" >/dev/null 2>&1; then
+    h_ok=false
+  fi
+else
+  h_ok=false
+fi
+if [[ "$h_ok" == true ]]; then
+  pass "preset codex-build --verifier codex: no fable/max leak into codex verify argv; seat falls back to default"
+else
+  fail "cross-agent leak guard failed (run dir: ${h_run_dir:-<none>})"
+  { echo "--- codex verify argv ---"; printf '%s\n' "$h_verify_argv"
+    [[ -n "${h_run_dir:-}" && -f "$h_run_dir/state.json" ]] && { echo "--- seat_models ---"; jq -c '.seat_models' "$h_run_dir/state.json"; }
+  } >&2
+fi
+[[ -n "${h_run_dir:-}" && "$h_run_dir" != "$run_dir_before" ]] && rm -rf "$h_run_dir"
+
 # --- summary ----------------------------------------------------------------
 passed=$((TOTAL - FAILURES))
 if (( FAILURES == 0 )); then

From 2c4d229dde8d28255e70dcb016b248d54a092baa Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Fri, 12 Jun 2026 14:09:50 -0400
Subject: [PATCH 08/12] =?UTF-8?q?docs(planning):=20v1.5=20Phase=206=20?=
 =?UTF-8?q?=E2=80=94=20mcp=20parity=20green,=20dogfood=20ACCEPT+ESCALATE?=
 =?UTF-8?q?=20live,=20token=20evidence=20note?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 .planning/STATE.md                            |  24 +-
 .../research/2026-06-12-token-evidence.md     | 294 ++++++++++++++++++
 2 files changed, 309 insertions(+), 9 deletions(-)
 create mode 100644 .planning/research/2026-06-12-token-evidence.md

diff --git a/.planning/STATE.md b/.planning/STATE.md
index 8a914df..512dc4b 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -3,15 +3,15 @@ gsd_state_version: 1.0
 milestone: v1.5
 milestone_name: Build with Codex — model ladder + orchestrated execution
 status: executing
-stopped_at: v1.5 Phase 0 IN PROGRESS (2026-06-12). Codex symlink created, smoke pass, claude -p envelope pinned, plugin installed. v1.4 Phase 5 BLOCKED ON HUMAN (npm scope + NPM_TOKEN + git tag) — running in parallel, untouched.
-last_updated: "2026-06-12T12:28:00.000Z"
-last_activity: 2026-06-12 -- v1.5 milestone registered; Phase 0 executing
+stopped_at: v1.5 Phases 0-5 EXECUTED (b672145..30a9a00). Phase 6 PARTIAL (2026-06-12) — MCP vendor parity green; codex-verifier degrade path now FULLY GREEN end-to-end: model-leak fix + review-verdict.json schema 400 fix both landed, first real /codex-build ACCEPT produced (subtract-helper task, exit 0, APPROVED conf 96, verify tokens captured). Remaining Phase 6: claude /login (human, for full-ladder + non-zero claude_* tokens), REVISE→ACCEPT row, interactive baseline. v1.4 Phase 5 BLOCKED ON HUMAN (npm scope + NPM_TOKEN + git tag) — running in parallel, untouched.
+last_updated: "2026-06-12T17:43:00.000Z"
+last_activity: 2026-06-12 -- v1.5 Phase 6: review-verdict.json schema 400 fixed; codex-verifier degrade path E2E green (first ACCEPT)
 progress:
   total_phases: 7
-  completed_phases: 0
+  completed_phases: 6
   total_plans: 0
   completed_plans: 0
-  percent: 0
+  percent: 86
 ---
 
 # Project State
@@ -32,10 +32,16 @@ Status: npm scope verification + NPM_TOKEN secret + git tag are Alan's; everythi
 Last activity: 2026-06-10 -- Phase 0 merges + LF policy + audit report
 
 Milestone: v1.5 Build with Codex — model ladder + orchestrated execution
-Phase: 0 (env + research) in progress
-Status: codex symlink created + smoke pass; claude -p envelope pinned; codex plugin installed (openai-codex marketplace, v1.0.4); Phase 0 research notes filed. Branch: feat/v1.5-codex-build.
-Last activity: 2026-06-12 -- Phase 0 start; design file committed at .planning/v1.5-DESIGN.md
-Working directories: `~/co-evolution-v13/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC
+Phase: 0-5 EXECUTED; Phase 6 (dogfood + evidence) PARTIAL
+Status: Phases 0-5 shipped on feat/v1.5-codex-build (b672145 Phase 0 env + research · fb6862a Phase 1 seats + B1/B2/B3 fixes · ffe765f Phase 2 verifier hardening + preset · 13c2bee Phase 3 observability + status reader · fb965ad Phase 4 token capture · 30a9a00 Phase 5 /codex-build skill + docs). Phase 6 partial — see below.
+Phase 6 progress (2026-06-12):
+  - MCP vendor parity GREEN: `bash mcp/scripts/vendor.sh` clean; `(cd mcp && npm test)` 4/4 pass. `mcp/vendor/` is gitignored (generated-at-publish via `npm run build:vendor`, NOT checked in) — Phase 1/3/4 lib changes were additive and broke nothing.
+  - First real `/codex-build` dogfood (slugify task, scratch repo under $TMPDIR, `--verifier codex` degrade since headless claude is logged out): execute phase SUCCEEDED (slugify landed, all 4 scratch tests pass), but verify phase ERRORED → runner exit 2, `verdict_present: false`, ESCALATE. codex_total_tokens=21497 (token capture works); claude_* totals=0 (ladder not exercised on the degrade path). One re-kick (`--parent-run`, lineage recorded) hit the same error. NO ACCEPT data point yet.
+  - Real bug found AND FIXED (2026-06-12): the documented `--verifier codex` degrade leaked the preset's `VERIFIER_MODEL=fable` into the codex seat (`apply_seat_env`, dev-review.sh:1370-1372) → codex on a ChatGPT account returned HTTP 400 "The 'fable' model is not supported". Fix = cross-agent leak guard in `apply_seat_env` + `resolve_seat_model_string` (drop a wrong-kind model+effort pair as a unit; codex seat falls back to `codex:(default)@(default)`). Sim scenario (h) added (preset-expansion-simulation.sh now 8/8; run-all 25/25 green). Re-run proof: fable 400 GONE, execute SUCCEEDED, scratch run-tests ALL PASS, codex_total_tokens=17237, wall 54s — BUT verdict still null: the verify seat now hit a SEPARATE, pre-existing schema 400 (`invalid_json_schema`: nested `issues.items` missing `additionalProperties:false` in skills/dev-review/schemas/review-verdict.json). Seat fix proven; degrade path was blocked one layer deeper by the schema bug. Detail in .planning/research/2026-06-12-token-evidence.md.
+  - Schema 400 FIXED (2026-06-12): OpenAI strict structured-output requires `additionalProperties:false` + a `required` list covering EVERY property on EVERY object node. `issues.items` was missing both; top-level `required` omitted `scope_creep_detected`/`iteration_notes`. Tightened all THREE canonical copies identically (schemas/, runners/codex-ps/schemas/, skills/dev-review/schemas/) so the drift guard stays green; shell `validate_review_verdict` is unaffected (stays loose, independent of this file). DEGRADE-PATH E2E NOW GREEN: real /codex-build (subtract-helper task, scratch repo under $TMPDIR, `--verifier codex`, --branch auto) → exit 0, verify OK, **verdict APPROVED conf 96**, verdict.json complete (all 6 strict fields), tokens execute=30606 verify=14485 codex_total=45091, wall 59s, scratch run-tests ALL 4 PASS. First full ACCEPT-path evidence row (degrade path). run-all 25/25 green. First ACCEPT data point logged in .planning/research/2026-06-12-token-evidence.md.
+  - Remaining Phase 6 (HUMAN + follow-up): `claude /login` on this Mac (unblocks full ladder + non-zero claude_* tokens; degrade-path claude_* are 0 by design); REVISE→ACCEPT row still owed (ESCALATE + ACCEPT now evidenced); an interactive `/dev-review` baseline for the 50%-claim denominator.
+Last activity: 2026-06-12 -- Phase 6: schema 400 fixed (3 copies); degrade-path E2E green, first ACCEPT (.planning/research/2026-06-12-token-evidence.md)
+Working directories: `~/Project/co-evolution/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC
 
 macOS baseline before v1.3 fixes: scorer-verification 11/14; code-proposer sim 1/16; pr-emitter sim 4/12; template-proposer sim 1/8; revise-loop sim aborts. Root causes: bash 3.2 (mapfile, source <(…)), BSD sed GNU-isms. Target after Phase 0.5: all green on macOS.
 
diff --git a/.planning/research/2026-06-12-token-evidence.md b/.planning/research/2026-06-12-token-evidence.md
new file mode 100644
index 0000000..34cd05a
--- /dev/null
+++ b/.planning/research/2026-06-12-token-evidence.md
@@ -0,0 +1,294 @@
+# Token Evidence: the "50% Claude-limit reduction" claim
+
+**Date:** 2026-06-12
+**Phase:** v1.5 Phase 6 (dogfood + evidence)
+**Purpose:** Establish the measurement method for the post's "~50% weekly
+Claude-limit reduction" claim, record the first real `/codex-build` data point,
+and enumerate what is still needed before a verdict on the claim can be drawn.
+
+---
+
+## Measurement method
+
+The claim is that splitting roles (Claude plans/reviews, Codex executes) cuts
+weekly Claude usage roughly in half versus Claude doing the whole loop. We
+measure two distinct spend surfaces, because they live in different places:
+
+1. **Runner-side agent tokens** — captured per phase into `state.json.tokens`
+   when `CO_EVOLVE_TOKEN_CAPTURE=1` (Phase 4). Claude seats land under
+   `.tokens.totals.claude_*` (from the `claude -p --output-format json` envelope,
+   field map in `2026-06-12-claude-json-envelope.md`); Codex seats land under
+   `.tokens.totals.codex_total_tokens` (harvested from the `tokens used` stderr
+   line, format pinned in `2026-06-12-codex-headless-facts.md` R2). This is the
+   part the runner can see and attribute by seat.
+
+2. **Orchestrator-session spend** — the Fable/Claude Code session that plans and
+   runs the review gate. The harness does **not** expose a token meter to the
+   session, so this is visible only as **turn counts**: how many model turns the
+   orchestrator burned (plan + kick + each wake/review gate). Under `/codex-build`
+   the session ends its turn at KICK and is woken on exit, so the orchestrator
+   pays for plan + review gates only — not for watching Codex grind.
+
+**The comparison** is `/codex-build` (Codex executes, Claude only plans+reviews)
+vs. an **interactive `/dev-review` baseline** on the *same task* (Claude in more
+seats, session supervises inline every pass). The 50% question is whether the
+Claude-attributable spend — runner-side `claude_*` totals **plus** orchestrator
+turn count — drops by roughly half. Codex tokens are "free" against the
+ChatGPT-plan quota and so are reported but excluded from the Claude-limit math.
+
+---
+
+## First real data point (T2 dogfood, 2026-06-12)
+
+**Task:** add a `slugify` function to `utils.sh` (lowercase; spaces/underscores →
+hyphens; strip other punctuation) + two test cases in `run-tests.sh`. Honest
+scratch repo under `$TMPDIR`, clean tree, `--branch auto`, `--workdir` on the
+scratch repo, `--run-dir` outside the main repo. Auth-degraded path
+(`--verifier codex`) because the headless `claude` shell on this Mac is
+**not logged in** (expected; the in-app session token does not reach sub-shells).
+
+Exact kick (seats from `--preset codex-build`):
+
+```bash
+CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \
+  --preset codex-build --skip-plan --plan "$PLAN" \
+  --verifier codex --branch auto --workdir "$SCRATCH" \
+  --run-dir "$TMPDIR/codex-build-dogfood-run" --timeout 900 \
+  -- "Add a slugify function to utils.sh and two test cases to run-tests.sh"
+```
+
+### Result: execute SUCCEEDED, verify ERRORED (no verdict) — NOT an ACCEPT
+
+| Signal | Value |
+|---|---|
+| Runner exit code | 2 (partial) |
+| Wall clock | 65 s (run 1), 61 s (run 2 re-kick) |
+| Status reader `status` / `verdict` / `verdict_present` | `partial` / `null` / `false` |
+| Execute phase | `ok`, exit 0 — slugify landed |
+| Verify phase | `error`, exit 2 — codex verifier returned an error payload |
+| Diffstat | `utils.sh +7`, `run-tests.sh +2`, 2 files, 9 insertions |
+| Scratch tests after execute | **ALL 4 PASS** (`bash run-tests.sh` → `ALL TESTS PASSED`) |
+
+The executor's work was **correct** — the implemented `slugify` passes both new
+assertions and the two pre-existing ones. The run did not reach ACCEPT only
+because the **verify seat could not produce a verdict**.
+
+### Tokens block (verbatim, run 1 — `jq '.tokens' state.json`)
+
+```json
+{
+  "phases": {
+    "execute": {
+      "source": "codex-stderr",
+      "total_tokens": 21497
+    }
+  },
+  "totals": {
+    "claude_input": 0,
+    "claude_output": 0,
+    "claude_cache_read": 0,
+    "claude_cost_usd": 0,
+    "codex_total_tokens": 21497
+  }
+}
+```
+
+Run 2 (re-kick, `--parent-run dev-review-20260612-133912`, lineage recorded)
+re-ran execute and reported `codex_total_tokens: 34935`; verify failed the same
+way. Both runs: `claude_*` totals are **0** — by design under the codex-verifier
+degrade, the only Claude work was orchestrator-side (this session), and the
+headless claude seats were never invoked.
+
+### Seat models actually used (verbatim)
+
+```json
+{
+  "composer": "opus:claude-fable-5@high",
+  "executor": "codex:(default)@xhigh",
+  "verifier": "codex:fable@max"
+}
+```
+
+---
+
+## Finding: the `--verifier codex` degrade is broken on a ChatGPT-plan Codex
+
+Root cause, fully diagnosed (this is a real runner bug, not a plan or env fault):
+
+- The `codex-build` preset hard-fills `VERIFIER_MODEL=fable` (correct for the
+  **default** Claude verifier seat — Fable reviews).
+- When the documented degrade `--verifier codex` flips the verifier *agent* to
+  codex, `apply_seat_env verifier codex` exports `CODEX_MODEL=fable`
+  (`dev-review.sh:1370-1372`: `export CODEX_MODEL="${model:-...}"` where
+  `model` = `VERIFIER_MODEL` = `fable`). The verdict is the seat model string
+  `codex:fable@max`.
+- Codex on a **ChatGPT account** rejects an unknown model with HTTP 400:
+
+  ```
+  ERROR: {"type":"error","status":400,"error":{"type":"invalid_request_error",
+  "message":"The 'fable' model is not supported when using Codex with a ChatGPT account."}}
+  ```
+
+So the skill's own documented auth-degrade (kick with `--verifier codex` when the
+headless claude seat is logged out — the *exact* situation on this Mac) currently
+**cannot produce a verdict**: the Claude model name leaks into the Codex seat.
+`VERIFIER_MODEL=` (empty) does not fix it — the preset's `:=fable` refills an
+empty value. A real fix is a runner change (clear/override `VERIFIER_MODEL` when
+the verifier resolves to codex, or pass a codex-valid model), out of scope for
+this evidence pass per Phase 6 ground rules. Per the skill's gate logic a missing
+verdict is an **ESCALATE**, never an auto-merge — the gate behaved correctly.
+
+> **Update 2026-06-12 (v1.5 fix):** degrade-path verifier model leak FIXED in
+> `apply_seat_env` / `resolve_seat_model_string` (cross-agent leak guard drops a
+> wrong-kind model+effort pair as a unit; codex seat falls back to
+> `codex:(default)@(default)` = gpt-5.5/xhigh). Re-run proof: the `fable` HTTP 400
+> is gone, execute SUCCEEDED, scratch `run-tests.sh` ALL PASS,
+> `codex_total_tokens: 17237`, wall 54 s — but verdict is still `null` because the
+> verify seat now hits a SEPARATE, pre-existing schema 400 (`invalid_json_schema`:
+> `additionalProperties` required `false` on `issues.items` in
+> `skills/dev-review/schemas/review-verdict.json`). Seat fix proven; degrade path
+> still blocked one layer deeper by the schema bug (tracked separately).
+
+> **Update 2026-06-12 (v1.5 schema fix — FULL ACCEPT path now green):** the
+> schema 400 above is FIXED. OpenAI strict structured-output requires
+> `additionalProperties: false` and a `required` list covering every property on
+> every object node; `issues.items` was missing both, and the top-level `required`
+> omitted `scope_creep_detected` / `iteration_notes`. All three canonical copies
+> (`schemas/`, `runners/codex-ps/schemas/`, `skills/dev-review/schemas/`) were
+> tightened identically (drift guard stays green). The shell `validate_review_verdict`
+> gate is unaffected — it stays loose and is independent of this file. With both
+> the seat leak and the schema 400 fixed, the degrade path (`--verifier codex`)
+> reaches a **real verdict end-to-end** for the first time. See the data point
+> immediately below.
+
+---
+
+## First COMPLETE-RUN data point — codex-verifier ACCEPT (2026-06-12)
+
+**This is the first full ACCEPT-path evidence row.** Same harness as T2 (honest
+scratch repo under `$TMPDIR`, plan outside it, clean tree, `--branch auto`,
+`--workdir` on the scratch repo, `--run-dir` outside the main repo), auth-degraded
+`--verifier codex` seat (headless `claude` still logged out on this Mac). The only
+difference from T2 is the schema fix — so this isolates the schema 400 as the last
+blocker on the degrade path.
+
+**Task:** add a `subtract(a,b)` helper to `utils.sh` (echo `a-b`) + two assertions
+in `run-tests.sh`.
+
+Exact kick (seats from `--preset codex-build`):
+
+```bash
+CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \
+  --preset codex-build --skip-plan --plan "$PLAN" \
+  --verifier codex --branch auto --workdir "$SCRATCH" \
+  --run-dir "$TMPDIR/codex-build-schema-proof" --timeout 900 \
+  -- "Add a subtract(a,b) helper to utils.sh that echoes a-b, and add two assertions for it to run-tests.sh"
+```
+
+### Result: ACCEPT — execute OK, verify OK, real verdict produced
+
+| Signal | Value |
+|---|---|
+| Runner exit code | **0** |
+| Wall clock | **59 s** |
+| Status reader `status` / `verdict` | `completed` / `APPROVED` |
+| Execute phase | `ok`, exit 0 — `subtract()` landed |
+| Verify phase | `ok`, exit 0 — codex `--output-schema` returned a valid verdict (no schema 400) |
+| Diffstat | `utils.sh +4`, `run-tests.sh +2`, 2 files, 6 insertions |
+| Scratch tests after execute | **ALL 4 PASS** (`bash run-tests.sh` → `ALL TESTS PASSED`) |
+
+### Verdict (verbatim — `verdict.json`)
+
+```json
+{"verdict":"APPROVED","confidence":96,"summary":"Implementation matches the plan: `utils.sh` adds a `subtract()` helper in the same style as `add()`, `run-tests.sh` includes the two requested assertions, and `bash run-tests.sh` passes all checks. No logic, style, security, or scope issues found for the requested change.","issues":[],"scope_creep_detected":false,"iteration_notes":"No changes needed."}
+```
+
+The verdict carries all six strict-mode fields (including `scope_creep_detected`
+and `iteration_notes`), proving the tightened schema round-trips through Codex.
+
+### Tokens block (verbatim — `jq '.tokens' state.json`)
+
+```json
+{
+  "phases": {
+    "execute": {
+      "source": "codex-stderr",
+      "total_tokens": 30606
+    },
+    "verify": {
+      "source": "codex-stderr",
+      "total_tokens": 14485
+    }
+  },
+  "totals": {
+    "claude_input": 0,
+    "claude_output": 0,
+    "claude_cache_read": 0,
+    "claude_cost_usd": 0,
+    "codex_total_tokens": 45091
+  }
+}
+```
+
+Note this is the first row with a populated **`verify` phase** token entry
+(14,485) — T2's verify never produced one because it errored before emitting a
+`tokens used` line. `claude_*` totals remain **0** by design under the
+codex-verifier degrade (no headless Claude seat was invoked); the Claude-limit
+numerator still requires `claude /login` (see below). Codex tokens
+(45,091) are excluded from the Claude-reduction math.
+
+### Real-task matrix progress
+
+T2 evidenced the **ESCALATE** row (missing verdict). This run evidences the
+**ACCEPT** row for real on the degrade path. **REVISE→ACCEPT** and the
+full-ladder (Fable-verifier, non-zero `claude_*`) ACCEPT remain — both still
+gated on `claude /login`.
+
+---
+
+## What is still needed for a 50%-claim verdict
+
+1. **`claude /login` on this Mac (human step).** The headless `claude` shell is
+   not logged in, so the full ladder — and any non-zero `claude_*` token data —
+   is unreachable here. Until then every runner-side `claude_*` total reads 0 and
+   the Claude-limit math has no numerator. This is the single biggest gap.
+2. **A real ACCEPT data point.** ~~T2 reached execute-OK but verify-error.~~
+   **DONE on the degrade path (2026-06-12, schema fix):** the subtract-helper run
+   reached a clean `APPROVED` verdict (`verdict.json` present, exit 0) — see the
+   COMPLETE-RUN data point above. Still owed: a **full-ladder** ACCEPT with
+   non-zero `claude_*` tokens (the degrade path's `claude_*` totals are 0 by
+   design; that gap is item 1, `claude /login`).
+3. **The real-task matrix (Phase 6 design):** ACCEPT, REVISE→ACCEPT, and
+   ESCALATE each exercised once on real tasks. T2 produced an **ESCALATE** path
+   for real (missing verdict); the subtract run produced an **ACCEPT** for real
+   (degrade path). **REVISE→ACCEPT** remains.
+4. **An interactive `/dev-review` baseline on the same task** — the denominator
+   for the comparison. Without it there is no "half of what?" to measure against.
+5. **Orchestrator turn-count instrumentation.** Harness-side session spend is
+   visible only as turn counts (see Honest scope). For this orchestration the
+   count is: plan + 1 kick + 1 re-kick + this gate ≈ 4 orchestrator turns;
+   a clean ACCEPT would be plan + kick + 1 gate ≈ 3.
+
+## Honest scope
+
+- **Runner-side tokens are attributable; orchestrator-side spend is not metered.**
+  The harness gives the session no token meter — only turn counts. Any 50% figure
+  is therefore "runner-side Claude tokens (exact) + orchestrator turns (proxy)",
+  not a single dollar number. State this whenever the claim is quoted.
+- **Codex tokens do not count against the Claude limit.** They are reported
+  (`codex_total_tokens`) for completeness and to size the ChatGPT-plan quota
+  draw, but excluded from the Claude-reduction numerator.
+- **This data point is auth-degraded.** It measures the codex-only-verify path,
+  which is the fallback, not the blessed Fable-verifier ladder. Treat the
+  `claude_* = 0` totals as "ladder not exercised here," not as "Claude spend was
+  zero for this kind of task."
+
+## Artifacts
+
+| Item | Path |
+|---|---|
+| Run 1 dir (ACCEPT attempt) | `$TMPDIR/codex-build-dogfood-run/` |
+| Run 2 dir (re-kick, parent-linked) | `$TMPDIR/codex-build-dogfood-run-r2/` |
+| Plan file | `$TMPDIR/codex-build-plan-<ts>.md` |
+| Scratch repo | `$TMPDIR/codex-build-dogfood-<ts>/` |
+| Verifier error log | `<run-dir>/review-stderr.log` (the HTTP-400 fable rejection) |

From 733972a70a51acc49081dbe13985262d7818b889 Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Sat, 13 Jun 2026 13:57:27 -0400
Subject: [PATCH 09/12] fix: status reader --json works under native Windows jq

Native jq.exe cannot open Git Bash process-substitution paths
(/proc/<pid>/fd/N), so dev-review-status.sh --json failed only on
Windows. Replace --slurpfile <(jq ...) with --argjson "$(jq -c ...)".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 dev-review/codex/dev-review-status.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev-review/codex/dev-review-status.sh b/dev-review/codex/dev-review-status.sh
index 540584b..b457e35 100755
--- a/dev-review/codex/dev-review-status.sh
+++ b/dev-review/codex/dev-review-status.sh
@@ -275,7 +275,7 @@ if [[ "$JSON" == true ]]; then
     --arg execute_stderr_tail "$exec_tail" \
     --arg assess "$ASSESS" \
     --argjson exit_code "$EXIT_CODE" \
-    --slurpfile phases <(jq '.phases // []' "$STATE") \
+    --argjson phases "$(jq -c '.phases // []' "$STATE")" \
     '{
       run_id: $run_id,
       run_dir: $run_dir,
@@ -296,7 +296,7 @@ if [[ "$JSON" == true ]]; then
       pre_execute_sha: (if $pre_sha == "" then null else $pre_sha end),
       post_execute_sha: (if $post_sha == "" then null else $post_sha end),
       marker_counts: {contested: $contested, clarify: $clarify},
-      phases: $phases[0],
+      phases: $phases,
       diffstat_tail: $diffstat_tail,
       execute_stderr_tail: $execute_stderr_tail,
       assess: $assess,

From e3d7fdc39ed205cad5a7825e5079cf2e4c9c1e8e Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Sat, 13 Jun 2026 13:57:47 -0400
Subject: [PATCH 10/12] feat: codex-build defaults to Opus via best alias, not
 Fable

Fable access was lost, so the codex-build preset pointed its two Claude
seats (plan + review) at an unreachable model. Default them through a
best/opus alias (-> claude-opus-4-8) so a future model bump is one line
in resolve_claude_model_alias. Extend the cross-agent leak guard to drop
best|opus from codex seats, document a session-follow opt-in, and move
the docs-sync + preset-expansion drift guards in lockstep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 CLAUDE.md                            |  2 +-
 dev-review/codex/dev-review.sh       | 36 ++++++++-------
 dev-review/codex/instructions.md     |  2 +-
 skills/codex-build/SKILL.md          | 52 ++++++++++++++-------
 skills/dev-review/SKILL.md           |  2 +-
 tests/docs-sync-simulation.sh        | 24 +++++-----
 tests/preset-expansion-simulation.sh | 67 ++++++++++++++++++++--------
 7 files changed, 119 insertions(+), 66 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 20d3b98..ce09d4e 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -64,7 +64,7 @@ bounce, and `--live` opens visible Windows terminals for Codex passes.
 
 ### Codex-Build Skill (`skills/codex-build/`)
 
-Detached orchestration skill: the session (typically Fable) plans and reviews,
+Detached orchestration skill: the session (typically Opus) plans and reviews,
 Codex executes in the background, and the session is woken at gates instead of
 babysitting. Kicks `dev-review.sh --preset codex-build` via a background task,
 ends the turn, then runs a schema-bound ACCEPT / REVISE / ESCALATE gate on wake.
diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 4fccdf2..37bd405 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -86,9 +86,9 @@ Options:
   --plan FILE              Existing plan file to use with --skip-plan
   --model MODEL            Override Codex model
   --verifier AGENT         Force the verify-phase agent (codex|opus|claude); default derives from --executor
-  --claude-model MODEL     Override Claude model (alias: fable -> claude-fable-5; else passthrough)
+  --claude-model MODEL     Override Claude model (aliases: best/opus -> claude-opus-4-8, fable -> claude-fable-5; else passthrough)
   --preset NAME            Expand a named seat preset (available: codex-build).
-                           codex-build = Fable plans (high) + Codex executes (xhigh) + Fable reviews (max),
+                           codex-build = best (currently Opus) plans (high) + Codex executes (xhigh) + best reviews (max),
                            --verify on, bounces 2, revise-loop 1. Precedence: flags placed AFTER --preset
                            override it (last-wins); pre-set model/effort env vars win over the preset.
   --workdir DIR            Working directory (default: current directory)
@@ -127,13 +127,17 @@ normalize_agent() {
   esac
 }
 
-# v1.5: resolve a friendly Claude model alias to its CLI model id. Only `fable`
-# is aliased today (→ claude-fable-5); anything else passes through verbatim so
-# explicit ids (claude-opus-4-6, claude-opus-4-8[1m], …) and an empty value are
-# untouched. Used by the --claude-model flag and the per-seat env layer.
+# v1.5: resolve a friendly Claude model alias to its CLI model id. `best` and
+# `opus` resolve to the current Opus line (claude-opus-4-8); `fable` is retained
+# for back-compat only (currently unreachable). Anything else passes through
+# verbatim so explicit ids (claude-opus-4-6, claude-opus-4-8[1m], …) and an empty
+# value are untouched. Used by the --claude-model flag and the per-seat env layer.
+# Bumping the Claude default to a new model = edit the `best` arm here, one place.
 resolve_claude_model_alias() {
   case "$1" in
-    fable) echo "claude-fable-5" ;;
+    best)  echo "claude-opus-4-8" ;;   # strongest supported Claude default for the preset
+    opus)  echo "claude-opus-4-8" ;;   # current Opus-line model
+    fable) echo "claude-fable-5" ;;    # retained for back-compat; currently unreachable, do not default to it
     *)     echo "$1" ;;
   esac
 }
@@ -147,11 +151,11 @@ resolve_claude_model_alias() {
 #     present in the environment (e.g. COMPOSER_EFFORT=low) wins over the preset.
 apply_preset() {
   case "$1" in
-    codex-build)   # "build with codex": Fable plans/reviews, Codex executes.
+    codex-build)   # "build with codex": best Claude seats plan/review, Codex executes.
       COMPOSER="opus"; EXECUTOR="codex"; VERIFIER_OVERRIDE="opus"
       VERIFY=true; BOUNCES=2; REVISE_LOOP_MAX=1
-      : "${COMPOSER_MODEL:=fable}";  : "${COMPOSER_EFFORT:=high}"
-      : "${VERIFIER_MODEL:=fable}";  : "${VERIFIER_EFFORT:=max}"
+      : "${COMPOSER_MODEL:=best}";  : "${COMPOSER_EFFORT:=high}"
+      : "${VERIFIER_MODEL:=best}";  : "${VERIFIER_EFFORT:=max}"
       : "${EXECUTOR_EFFORT:=xhigh}"   # codex model stays the CLI's configured default — deliberately unpinned
       ;;
     *) die "Unknown preset: $1 (available: codex-build)" ;;
@@ -1369,11 +1373,11 @@ apply_seat_env() {
   esac
   # v1.5 cross-agent leak guard. A seat's model+effort is ONE override pair
   # configured for a specific agent kind (e.g. the codex-build preset fills
-  # VERIFIER_MODEL=fable / VERIFIER_EFFORT=max for its DEFAULT claude verifier).
+  # VERIFIER_MODEL=best / VERIFIER_EFFORT=max for its DEFAULT claude verifier).
   # When --verifier codex flips that same seat to codex, the pair is now wrong:
-  # exporting CODEX_MODEL=fable makes codex/ChatGPT reject the run with
-  #   HTTP 400: The 'fable' model is not supported when using Codex with a
-  #   ChatGPT account.
+  # exporting CODEX_MODEL=best (a Claude alias) makes codex/ChatGPT reject the run
+  # with HTTP 400: e.g. "The 'fable' model is not supported when using Codex with
+  #   a ChatGPT account" — any Claude alias/id is off codex's menu.
   # and CODEX_REASONING_EFFORT=max is off codex's scale (it ends at xhigh).
   # The two halves are coupled: a model picked for the wrong kind means the
   # effort was too — so drop the WHOLE pair (not just the model) and fall back
@@ -1382,7 +1386,7 @@ apply_seat_env() {
   # on the claude arm so a codex-shaped override never reaches a claude seat.
   if [[ "$agent" == "codex" ]]; then
     case "$model" in
-      fable|claude-*) model=""; effort="" ;;
+      fable|best|opus|claude-*) model=""; effort="" ;;
     esac
   else
     case "$model" in
@@ -1416,7 +1420,7 @@ resolve_seat_model_string() {
   # what actually runs — e.g. codex:(default)@(default), not codex:fable@max.
   if [[ "$agent" == "codex" ]]; then
     case "$model" in
-      fable|claude-*) model=""; effort="" ;;
+      fable|best|opus|claude-*) model=""; effort="" ;;
     esac
   else
     case "$model" in
diff --git a/dev-review/codex/instructions.md b/dev-review/codex/instructions.md
index 92fc2dc..7861c51 100644
--- a/dev-review/codex/instructions.md
+++ b/dev-review/codex/instructions.md
@@ -18,7 +18,7 @@ Pick the lowest-ceremony path that still protects correctness. **Start with `co-
 | Existing approved plan should be executed as-is | `bash dev-review/codex/dev-review.sh --skip-plan --plan <file>` | Reuses the approved plan |
 | User wants a reviewed plan before any code changes | `bash co-evolve-bouncer.sh --vanilla "task"` or `dev-review.sh --plan-only` | Plan artifact only |
 | User wants a final review against the plan and diff | `bash dev-review/codex/dev-review.sh --verify <task>` | Runs verifier after execution |
-| Orchestrated / background build — plan and review in-session, Codex executes detached | `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --preset codex-build --skip-plan --plan <file> --branch auto -- "<task>"` (kicked in the background; see `skills/codex-build/`) | Model-ladder ladder: Fable plans/reviews, Codex executes; session is not babysat |
+| Orchestrated / background build — plan and review in-session, Codex executes detached | `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --preset codex-build --skip-plan --plan <file> --branch auto -- "<task>"` (kicked in the background; see `skills/codex-build/`) | Model-ladder ladder: best Claude (Opus) plans/reviews, Codex executes; session is not babysat |
 | Ad-hoc interactive delegation to Codex mid-session | OpenAI plugin `/codex:rescue --background` (see `skills/codex-build/` Transport B) | Quick delegation; review with the same verdict schema by hand, not via CI |
 
 ## Distinguish Coding From Everything Else
diff --git a/skills/codex-build/SKILL.md b/skills/codex-build/SKILL.md
index f1e731f..921150d 100644
--- a/skills/codex-build/SKILL.md
+++ b/skills/codex-build/SKILL.md
@@ -2,7 +2,7 @@
 name: codex-build
 description: >
   Orchestrate Codex to BUILD code in the background while this Claude Code
-  session (typically Fable) plans and reviews — never babysitting. The session
+  session (typically Opus) plans and reviews — never babysitting. The session
   composes the implementation plan, kicks the dev-review runner detached via a
   background Bash task with --preset codex-build, ENDS ITS TURN, and is woken on
   exit to run a schema-bound review gate (ACCEPT / REVISE / ESCALATE, max 2
@@ -16,8 +16,9 @@ allowed-tools: Bash, Read, Write, Edit, Grep, Glob, Agent
 
 # /codex-build - Detached Codex Execution with Gate-Based Review
 
-This is the **orchestration protocol** behind the "Fable plans, Codex executes,
-Fable reviews" model ladder. The session is the composer and the reviewer; Codex
+This is the **orchestration protocol** behind the "best Claude plans, Codex
+executes, best Claude reviews" model ladder. The session is the composer and the
+reviewer; Codex
 is the executor. The defining rule: **the session is NEVER kept busy while Codex
 grinds.** You kick the runner as a background task, end your turn, and the
 harness wakes you when it exits.
@@ -67,9 +68,9 @@ claude -p --output-format text --model claude-haiku-4-5-20251001 "ping" 2>&1 | h
   verify phase uses Codex's schema-bound review instead. Tell the user plainly:
   the in-session review gate (Step 4, which YOU run) is then the only Claude-side
   review of the diff, and to run `claude /login` in a terminal to restore the
-  full ladder (Fable verifier seat inside the runner).
+  full ladder (the claude verifier seat inside the runner).
 - Otherwise: the full ladder is available. Kick with the preset's default claude
-  verifier (Fable, max effort) — no `--verifier` override needed.
+  verifier (best/Opus, max effort) — no `--verifier` override needed.
 
 ### No other active run
 
@@ -184,15 +185,32 @@ CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \
 
 Fill the brackets from Step 1:
 - `--verifier codex` ONLY when the auth probe degraded (else omit — the preset's
-  Fable/max claude verifier is the default).
+  best/max claude verifier is the default).
 - `--branch auto` for a clean tree; `--worktree auto` for a dirty tree.
 - `--parent-run <id>` only on a REVISE re-kick (Step 4), to tag lineage.
 - `--timeout` defaults to 1800s; raise it for large tasks.
 
-The preset expands to: composer = Fable (high), executor = Codex (xhigh, model
-left to the CLI's config), verifier = Fable (max), `--verify` on, bounces 2,
-revise-loop 1. `CO_EVOLVE_TOKEN_CAPTURE=1` records per-phase tokens into
-`state.json` so you can report spend at the gate.
+The preset expands to: composer = Opus (high), executor = Codex (xhigh, model
+left to the CLI's config), verifier = Opus (max), `--verify` on, bounces 2,
+revise-loop 1. The two Claude seats default through the `best` alias (currently
+`claude-opus-4-8`), so a future model bump is a one-line edit in the runner's
+`resolve_claude_model_alias`. `CO_EVOLVE_TOKEN_CAPTURE=1` records per-phase tokens
+into `state.json` so you can report spend at the gate.
+
+**Optional — make the seats follow THIS session's model.** By default the plan and
+review seats run `best` (a strong model) regardless of what you're driving the
+session with. To instead pin them to a specific model — e.g. match the session
+you're in — prepend the seat env vars to the kick (fill-if-empty means they win
+over the preset):
+
+```bash
+COMPOSER_MODEL=claude-opus-4-8 VERIFIER_MODEL=claude-opus-4-8 \
+CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \
+  --preset codex-build --skip-plan --plan "$PLAN_FILE" -- "<task>"
+```
+
+Only the Claude seats follow this; Codex always executes. A weak model here means
+weak planning/review — keep these on a strong model.
 
 After kicking:
 1. Capture the run id. It is `dev-review-<timestamp>` and the run dir is
@@ -323,7 +341,7 @@ Write a concise human summary — do NOT auto-merge, do NOT silently retry:
 | Machine slept during the run | wake notification may have been dropped | Run the status script manually; do not assume the run failed |
 | `tokens` block missing from state.json | `CO_EVOLVE_TOKEN_CAPTURE=1` was not set, or `jq` is missing | Re-kick with the flag set; install `jq` for token capture |
 | Verify silently skipped | dirty tree kicked without `--worktree auto`, or run left untracked files | Re-kick on a clean tree with `--branch auto`, or `--worktree auto` |
-| `claude` verifier seat errors mid-run | headless shell not logged in | Re-kick with `--verifier codex`; run `claude /login` to restore the Fable seat |
+| `claude` verifier seat errors mid-run | headless shell not logged in | Re-kick with `--verifier codex`; run `claude /login` to restore the claude verifier seat |
 
 ---
 
@@ -362,13 +380,15 @@ carries CI.
 
 ## Notes
 
-- **Why detached, not babysat:** the post that inspired this (Fable plans, Codex
-  executes, Fable reviews) keeps a session supervising inline, burning cache
-  reads every turn. This skill kicks the runner via background Bash, ends the
-  turn, and gets woken on exit — the session pays only for plan + review gates.
+- **Why detached, not babysat:** the post that inspired this (strong model plans,
+  Codex executes, strong model reviews) keeps a session supervising inline,
+  burning cache reads every turn. This skill kicks the runner via background Bash,
+  ends the turn, and gets woken on exit — the session pays only for plan + review
+  gates.
 - **The preset is the single source of truth for the seats.** `codex-build`
-  expands to Fable/high · Codex/xhigh · Fable/max · verify on · bounces 2 ·
+  expands to best/high · Codex/xhigh · best/max · verify on · bounces 2 ·
   revise-loop 1, inside `dev-review/codex/dev-review.sh` (`apply_preset`). The
+  `best` alias resolves to the current Opus line. The
   preset triple is pinned by `tests/docs-sync-simulation.sh` so this doc and the
   runner cannot drift.
 - **Live-mode caveat:** `--live` codex windows ignore model/effort overrides
diff --git a/skills/dev-review/SKILL.md b/skills/dev-review/SKILL.md
index 719fb39..e6bda2f 100644
--- a/skills/dev-review/SKILL.md
+++ b/skills/dev-review/SKILL.md
@@ -29,7 +29,7 @@ repo change or code-focused verification workflow.
 For the **detached** flavor — the session plans and reviews while Codex executes
 in the background and the session is woken at gates instead of babysitting — use
 `/codex-build` (`skills/codex-build/`). It kicks the standalone runner with
-`--preset codex-build` (Fable plans at `high`, Codex executes at `xhigh`, Fable
+`--preset codex-build` (Opus plans at `high`, Codex executes at `xhigh`, Opus
 reviews at `max`; `--verify` on, bounces `2`, revise-loop `1`) and applies a
 schema-bound ACCEPT / REVISE / ESCALATE gate. `/dev-review` is the interactive
 single-session loop; `/codex-build` is the same pipeline run detached.
diff --git a/tests/docs-sync-simulation.sh b/tests/docs-sync-simulation.sh
index 03d4b1e..d3ce11b 100644
--- a/tests/docs-sync-simulation.sh
+++ b/tests/docs-sync-simulation.sh
@@ -9,9 +9,9 @@
 # others, the docs lie. This gate greps the load-bearing seat tokens out of each
 # file and asserts they all agree on the same triple:
 #
-#   composer  = Fable, effort high
+#   composer  = best (currently Opus), effort high
 #   executor  = Codex, effort xhigh (model unpinned — CLI config rules)
-#   verifier  = Fable, effort max
+#   verifier  = best (currently Opus), effort max
 #   bounces   = 2
 #
 # Pattern: pure-grep assertions over the real files (no agents, no stubs). Each
@@ -70,12 +70,12 @@ assert_not_grep() {
 # These pin the literal knob assignments so a doc edit can be diffed against
 # the code, and a code edit that changes the triple breaks this gate too.
 # ===========================================================================
-assert_grep "runner: composer Fable seat (COMPOSER_MODEL:=fable)" \
-  "$RUNNER" 'COMPOSER_MODEL:=fable'
+assert_grep "runner: composer best seat (COMPOSER_MODEL:=best)" \
+  "$RUNNER" 'COMPOSER_MODEL:=best'
 assert_grep "runner: composer effort high (COMPOSER_EFFORT:=high)" \
   "$RUNNER" 'COMPOSER_EFFORT:=high'
-assert_grep "runner: verifier Fable seat (VERIFIER_MODEL:=fable)" \
-  "$RUNNER" 'VERIFIER_MODEL:=fable'
+assert_grep "runner: verifier best seat (VERIFIER_MODEL:=best)" \
+  "$RUNNER" 'VERIFIER_MODEL:=best'
 assert_grep "runner: verifier effort max (VERIFIER_EFFORT:=max)" \
   "$RUNNER" 'VERIFIER_EFFORT:=max'
 assert_grep "runner: executor effort xhigh (EXECUTOR_EFFORT:=xhigh)" \
@@ -96,12 +96,12 @@ fi
 # Tokens kept loose enough to survive prose rewording but tight enough to catch
 # a wrong seat (e.g. opus instead of Fable, high instead of xhigh on executor).
 # ===========================================================================
-assert_grep "codex-build skill: composer Fable at high" \
-  "$CODEX_BUILD_SKILL" 'composer = Fable \(high\)'
+assert_grep "codex-build skill: composer Opus at high" \
+  "$CODEX_BUILD_SKILL" 'composer = Opus \(high\)'
 assert_grep "codex-build skill: executor Codex at xhigh" \
   "$CODEX_BUILD_SKILL" 'executor = Codex \(xhigh'
-assert_grep "codex-build skill: verifier Fable at max" \
-  "$CODEX_BUILD_SKILL" 'verifier = Fable \(max\)'
+assert_grep "codex-build skill: verifier Opus at max" \
+  "$CODEX_BUILD_SKILL" 'verifier = Opus \(max\)'
 assert_grep "codex-build skill: bounces 2" \
   "$CODEX_BUILD_SKILL" 'bounces 2'
 assert_grep "codex-build skill: executor model unpinned (CLI config note)" \
@@ -110,8 +110,8 @@ assert_grep "codex-build skill: executor model unpinned (CLI config note)" \
 # ===========================================================================
 # Group 3: skills/dev-review/SKILL.md cross-references the same preset triple.
 # ===========================================================================
-assert_grep "dev-review skill: Fable plans at high" \
-  "$DEV_REVIEW_SKILL" 'Fable plans at .?high'
+assert_grep "dev-review skill: Opus plans at high" \
+  "$DEV_REVIEW_SKILL" 'Opus plans at .?high'
 assert_grep "dev-review skill: Codex executes at xhigh" \
   "$DEV_REVIEW_SKILL" 'Codex executes at .?xhigh'
 assert_grep "dev-review skill: reviews at max" \
diff --git a/tests/preset-expansion-simulation.sh b/tests/preset-expansion-simulation.sh
index 4b6ccbf..00b5068 100644
--- a/tests/preset-expansion-simulation.sh
+++ b/tests/preset-expansion-simulation.sh
@@ -3,12 +3,13 @@
 # Hermetic gate for v1.5 Phase 2: --preset codex-build + claude-verifier
 # verdict hardening (dev-review/codex/dev-review.sh).
 #
-# The codex-build preset expands to the post's ladder — Fable plans (high),
-# Codex executes (xhigh), Fable reviews (max) — behind one flag. This gate
-# pins:
-#   a. seat argv under the preset (composer claude --model claude-fable-5
-#      --effort high; executor codex -c model_reasoning_effort=xhigh and NO
-#      -c model=; verifier claude --effort max).
+# The codex-build preset expands to the post's ladder — best (currently Opus)
+# plans (high), Codex executes (xhigh), best reviews (max) — behind one flag.
+# This gate pins:
+#   a. seat argv under the preset (composer claude --model claude-opus-4-8
+#      [the `best` alias resolved] --effort high; executor codex
+#      -c model_reasoning_effort=xhigh and NO -c model=; verifier claude
+#      --effort max).
 #   b. claude-verifier verdict hardening: a verdict wrapped in markdown
 #      fences/prose lands in verdict.json jq-parseable (brace-block fallback).
 #   c. last-wins parsing: --verifier codex after --preset wins.
@@ -18,8 +19,10 @@
 #   g. parity guard: a default run emits no --effort and no
 #      model_reasoning_effort anywhere.
 #   h. cross-agent leak guard: --verifier codex over the preset must NOT leak
-#      the claude verifier's fable@max into the codex verify argv (live HTTP 400
+#      the claude verifier's best@max into the codex verify argv (live HTTP 400
 #      on a ChatGPT account); seat_models.verifier falls back to default.
+#   i. alias resolution: resolve_claude_model_alias maps best/opus ->
+#      claude-opus-4-8, keeps fable -> claude-fable-5, passes ids through.
 #
 # Pattern: PATH-injected claude + codex stubs append their full argv (one line
 # per invocation) to phase-agnostic logs; assertions grep the logs. Both stubs
@@ -242,8 +245,8 @@ claude_log="$TEST_DIR/a_claude.log"; codex_log="$TEST_DIR/a_codex.log"
 ) >"$TEST_DIR/a.out" 2>&1 || true
 
 a_ok=true
-# composer = claude with the fable model alias resolved + high effort.
-if ! grep -Eq -- '--model claude-fable-5' "$claude_log"; then a_ok=false; fi
+# composer = claude with the `best` model alias resolved (-> claude-opus-4-8) + high effort.
+if ! grep -Eq -- '--model claude-opus-4-8' "$claude_log"; then a_ok=false; fi
 if ! grep -Eq -- '--effort high' "$claude_log"; then a_ok=false; fi
 # executor = codex with xhigh effort and NO pinned model.
 if ! grep -Eq -- '-c model_reasoning_effort=xhigh' "$codex_log"; then a_ok=false; fi
@@ -251,7 +254,7 @@ if grep -Eq -- '-c model=' "$codex_log"; then a_ok=false; fi
 # verifier = claude with max effort.
 if ! grep -Eq -- '--effort max' "$claude_log"; then a_ok=false; fi
 if [[ "$a_ok" == true ]]; then
-  pass "preset codex-build: composer fable/high, executor codex/xhigh (no model=), verifier claude/max"
+  pass "preset codex-build: composer best->opus/high, executor codex/xhigh (no model=), verifier claude/max"
 else
   fail "preset codex-build seat argv mismatch (claude_log + codex_log below)"
   { echo "--- claude argv ---"; cat "$claude_log"; echo "--- codex argv ---"; cat "$codex_log"; } >&2
@@ -396,13 +399,13 @@ fi
 
 # ===========================================================================
 # Scenario (h): cross-agent leak guard — --preset codex-build --verifier codex.
-# The preset fills VERIFIER_MODEL=fable / VERIFIER_EFFORT=max for its DEFAULT
+# The preset fills VERIFIER_MODEL=best / VERIFIER_EFFORT=max for its DEFAULT
 # claude verifier; --verifier codex flips that seat to codex. Without the guard
-# in apply_seat_env, the codex verify argv carried `-c model=fable` (→ live
-# HTTP 400: "The 'fable' model is not supported when using Codex with a ChatGPT
-# account.") and `-c model_reasoning_effort=max` (off codex's xhigh scale).
-# Assert the codex verify argv carries NEITHER, and state.json seat_models
-# .verifier reports the fallback (codex:(default)@(default)), not codex:fable@max.
+# in apply_seat_env, the codex verify argv would carry `-c model=best` (a Claude
+# alias codex/ChatGPT rejects with HTTP 400) and `-c model_reasoning_effort=max`
+# (off codex's xhigh scale). Assert the codex verify argv carries NEITHER the
+# `best` alias NOR its resolved id (claude-opus-4-8), and state.json seat_models
+# .verifier reports the fallback (codex:(default)@(default)), not codex:best@max.
 # ===========================================================================
 TOTAL=$((TOTAL + 1))
 repo=$(make_scratch_repo h)
@@ -421,8 +424,10 @@ h_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1)
 # The verify invocation is the codex argv line carrying --output-schema.
 h_verify_argv=$(grep -- '--output-schema' "$codex_log" || true)
 h_ok=true
-# No leaked fable model and no off-scale max effort in the codex verify argv.
-if printf '%s' "$h_verify_argv" | grep -Eq -- '-c model=fable'; then h_ok=false; fi
+# No leaked claude model (alias `best` or its resolved id) and no off-scale max
+# effort in the codex verify argv.
+if printf '%s' "$h_verify_argv" | grep -Eq -- '-c model=best'; then h_ok=false; fi
+if printf '%s' "$h_verify_argv" | grep -Eq -- '-c model=claude-opus-4-8'; then h_ok=false; fi
 if printf '%s' "$h_verify_argv" | grep -Eq -- 'model_reasoning_effort=max'; then h_ok=false; fi
 # Sanity: the codex verify invocation actually happened (so the asserts are real).
 if [[ -z "$h_verify_argv" ]]; then h_ok=false; fi
@@ -435,7 +440,7 @@ else
   h_ok=false
 fi
 if [[ "$h_ok" == true ]]; then
-  pass "preset codex-build --verifier codex: no fable/max leak into codex verify argv; seat falls back to default"
+  pass "preset codex-build --verifier codex: no best/opus or max leak into codex verify argv; seat falls back to default"
 else
   fail "cross-agent leak guard failed (run dir: ${h_run_dir:-<none>})"
   { echo "--- codex verify argv ---"; printf '%s\n' "$h_verify_argv"
@@ -444,6 +449,30 @@ else
 fi
 [[ -n "${h_run_dir:-}" && "$h_run_dir" != "$run_dir_before" ]] && rm -rf "$h_run_dir"
 
+# ===========================================================================
+# Scenario (i): alias resolution. Extract the real resolve_claude_model_alias
+# from the runner (sed range + source — bash 3.2 silently sources nothing from
+# a process substitution, so use a temp file; same discipline as
+# tests/revise-loop-simulation.sh). Assert the `best`/`opus` aliases resolve to
+# the current Opus id, `fable` stays mapped for back-compat, and an explicit id
+# passes through untouched.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+sed -n '/^resolve_claude_model_alias() {/,/^}$/p' "$RUNNER" > "$TEST_DIR/_alias.sh"
+# shellcheck disable=SC1090,SC1091
+source "$TEST_DIR/_alias.sh"
+i_ok=true
+if ! declare -F resolve_claude_model_alias >/dev/null; then i_ok=false; fi
+[[ "$(resolve_claude_model_alias best)"  == "claude-opus-4-8" ]] || i_ok=false
+[[ "$(resolve_claude_model_alias opus)"  == "claude-opus-4-8" ]] || i_ok=false
+[[ "$(resolve_claude_model_alias fable)" == "claude-fable-5"  ]] || i_ok=false
+[[ "$(resolve_claude_model_alias claude-opus-4-8[1m])" == "claude-opus-4-8[1m]" ]] || i_ok=false
+if [[ "$i_ok" == true ]]; then
+  pass "resolve_claude_model_alias: best/opus -> claude-opus-4-8, fable kept, ids passthrough"
+else
+  fail "alias resolution mismatch (best=$(resolve_claude_model_alias best 2>/dev/null), opus=$(resolve_claude_model_alias opus 2>/dev/null), fable=$(resolve_claude_model_alias fable 2>/dev/null))"
+fi
+
 # --- summary ----------------------------------------------------------------
 passed=$((TOTAL - FAILURES))
 if (( FAILURES == 0 )); then

From 7934bcdb2b775531035430564e49354c21428ecb Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Sat, 13 Jun 2026 14:32:00 -0400
Subject: [PATCH 11/12] =?UTF-8?q?feat:=20claude-build=20preset=20=E2=80=94?=
 =?UTF-8?q?=20Codex=20orchestrates,=20Claude=20executes=20(#6)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mirror of codex-build: a Codex-hosted orchestrator that plans and reviews
while Claude executes the build. Adds the claude-build preset arm (Codex
plans/reviews at xhigh, model unpinned; Claude executes at best/high),
help + unknown-preset text, the Codex-facing protocol doc
dev-review/codex/claude-build.md (synchronous gate; HARD-requires claude
auth since Claude is the executor; detached+polling documented as future
design only), routing + AGENTS contract, and drift-guard scenarios
pinning both preset triples. Built via codex-build (Codex executed,
reviewed in-session).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 AGENTS.md                            |   1 +
 dev-review/codex/claude-build.md     | 187 +++++++++++++++++++++++++++
 dev-review/codex/dev-review.sh       |  12 +-
 dev-review/codex/instructions.md     |   1 +
 tests/docs-sync-simulation.sh        | 124 ++++++++++++++----
 tests/preset-expansion-simulation.sh |  90 ++++++++++++-
 6 files changed, 381 insertions(+), 34 deletions(-)
 create mode 100644 dev-review/codex/claude-build.md

diff --git a/AGENTS.md b/AGENTS.md
index 05d790b..ffe05de 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -18,6 +18,7 @@ Full spec: [`BOUNCE-PROTOCOL.md`](BOUNCE-PROTOCOL.md). Reference implementation:
   specs, arguments, and markdown refinement
 - If you are invoked via `/dev-review` (Claude Code) or `bash dev-review/codex/dev-review.sh` (Codex), you are inside the bounce pipeline — your role (reviewer / composer) and pass number are passed to you in the prompt template
 - If you are invoked via `/codex-build`, you are orchestrating a detached Codex build: the session plans and reviews, Codex executes in the background under `--preset codex-build`, and the session is woken at gates (see `skills/codex-build/`)
+- If you are orchestrating via `--preset claude-build`, Codex plans and reviews while Claude executes the build synchronously; this path hard-requires `claude` CLI auth because Claude is the executor (see `dev-review/codex/claude-build.md`)
 - If you are invoked via `bash agent-bouncer/agent-bouncer.sh <doc>`, you are bouncing a single markdown document — same protocol applies
 - If you are exploring the repo directly without an explicit role, treat the protocol section above as orientation, then read the GSD-managed sections below for project meta
 
diff --git a/dev-review/codex/claude-build.md b/dev-review/codex/claude-build.md
new file mode 100644
index 0000000..5013f34
--- /dev/null
+++ b/dev-review/codex/claude-build.md
@@ -0,0 +1,187 @@
+# claude-build - Codex-Orchestrated Claude Execution
+
+This is the Codex-facing orchestration protocol for `--preset claude-build`.
+Codex is the orchestrator: it plans, kicks the runner, waits synchronously, and
+reviews the result. Claude is the executor: it gets the writable execute phase.
+
+Use this when the user wants Codex to supervise a build but wants Claude to write
+the code. This is intentionally the mirror of `codex-build`, with one important
+difference: Codex has no harness wake-on-exit. The Phase 1 mode is therefore
+blocking and synchronous.
+
+The preset expands to: composer = Codex (xhigh, model left to the CLI's config),
+executor = Opus (best/high), verifier = Codex (xhigh, model left to the CLI's
+config), `--verify` on, bounces 2, revise-loop 1.
+
+## Step 1: PREFLIGHT
+
+Run these checks before writing the plan or kicking the runner.
+
+### Claude CLI present and authenticated
+
+Claude is the executor in this preset, so missing or logged-out Claude is a
+HARD STOP. Do not degrade the executor to Codex; that is just `codex-build`.
+
+```bash
+command -v claude
+claude -p --model claude-haiku-4-5-20251001 "ping"
+```
+
+If either command fails, stop and tell the user to install or authenticate the
+`claude` CLI, then re-run. A logged-out Claude executor cannot write code.
+
+### Codex CLI present
+
+Codex is still the composer and verifier.
+
+```bash
+command -v codex
+```
+
+If missing, stop and tell the user to install or expose the Codex CLI before
+running `claude-build`.
+
+### No other active dev-review run
+
+```bash
+bash dev-review/codex/dev-review-status.sh --list
+```
+
+If the `ACTIVE` section shows a non-terminal run (`status=pending` or
+`status=null`), do not kick a second orchestrated run in the same workdir.
+Escalate or wait for the existing run to reach a terminal status.
+
+### Tree state and isolation
+
+```bash
+git -C "$(pwd)" status --short
+```
+
+Record whether the tree is clean. This protocol always kicks with
+`--worktree auto` so Claude writes in an isolated sibling checkout. That matters
+because this feature edits the same runner it invokes, and because dirty
+workdirs make diff-based verification ambiguous.
+
+Never run from a network or SMB mount. Use a local git clone.
+
+## Step 2: PLAN
+
+Codex writes the implementation plan before invoking the runner.
+
+1. Inspect the repo with `rg`, shell reads, and narrow file opens.
+2. Write the plan to `$TMPDIR`, never inside the workdir:
+
+   ```bash
+   PLAN_FILE="${TMPDIR:-/tmp}/claude-build-plan-$(date +%Y%m%d-%H%M%S).md"
+   ```
+
+3. Shape the file for `--skip-plan --plan FILE`: a top-level heading, at least
+   about 60 words, at least 5 non-empty lines, and at least 2 structural lines.
+   Include:
+
+   ```text
+   # <Task title>
+
+   ## Approach
+   <what will be built and the key technical decisions>
+
+   ## Files to Change
+   - `path/to/file` - <what changes and why>
+
+   ## Steps
+   1. <ordered implementation steps>
+
+   ## Risks / Out of scope
+   - <risks, assumptions, and explicit exclusions>
+   ```
+
+Optional hardening: ask Claude for a synchronous plan critique using the bounce
+markers `[CONTESTED]` and `[CLARIFY]`. Resolve every marker before kicking. After
+2 passes, resolve remaining markers yourself or ask the user; do not run with
+open markers.
+
+## Step 3: RUN (BLOCKING)
+
+Kick the runner synchronously and wait for it to exit:
+
+```bash
+CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \
+  --preset claude-build --skip-plan --plan "$PLAN_FILE" \
+  --worktree auto \
+  -- "<task>"
+```
+
+For a revision re-kick, add `--parent-run <previous-run-id>` and use the revised
+plan file. The parent id is lineage only; every re-kick gets a fresh run dir.
+
+Do not background this command in Phase 1. Codex cannot wake itself on runner
+exit, so detached execution would only create an orphaned process without a
+reliable gate.
+
+## Step 4: GATE
+
+After the blocking runner exits, read status first:
+
+```bash
+bash dev-review/codex/dev-review-status.sh --json <run-id>
+```
+
+The status reader contract provides `status`, `verdict`, `verdict_json`,
+`verdict_present`, `diffstat_tail`, `current_phase`, `marker_counts`, `assess`,
+and its own terminal exit code. The verdict file follows
+`skills/dev-review/schemas/review-verdict.json`.
+
+Then inspect narrowly:
+
+1. Read `verdict.json` from `.verdict_json`.
+2. Read the diffstat and only the changed files needed to spot-check the verdict.
+3. Decide one gate outcome.
+
+### ACCEPT
+
+Accept only when `verdict` is `APPROVED` and Codex's own spot-check finds no
+material issue. Never auto-merge.
+
+### REVISE
+
+Revise when the verdict is `REVISE` or the spot-check finds a fixable issue, and
+the orchestrator has not exhausted its revise budget. Write a revised plan in
+`$TMPDIR`, preserve the original scope, and re-run Step 3 with
+`--parent-run <current-run-id>`.
+
+### ESCALATE
+
+Escalate on any missing verdict, non-zero runner exit, status-reader failure,
+scope creep, repeated identical feedback, or exhausted revise rounds. Report the
+run dir, plan path, verdict path if present, diffstat, and the concrete blocker.
+
+## Step 5: BUDGET
+
+Codex gets at most 2 orchestrator revise rounds. The runner's internal
+`--revise-loop 1` can add one cheap execute retry inside each run, but that does
+not increase the orchestrator budget. Round 3 is an escalation.
+
+Never auto-merge. The gate decides whether to accept, revise, or escalate; a
+human still controls merge/push policy.
+
+## Future: detached mode (design only)
+
+A future `claude-build.sh` wrapper could background the runner with `nohup`,
+print a run id, and let a polling driver watch:
+
+```bash
+nohup bash dev-review/codex/dev-review.sh \
+  --preset claude-build --skip-plan --plan "$PLAN_FILE" \
+  --worktree auto -- "<task>" &
+
+while true; do
+  sleep 30
+  bash dev-review/codex/dev-review-status.sh --json <run-id>
+  # stop on terminal status-reader exits: 0, 2, or 4
+done
+```
+
+That mode is not implemented in this milestone. It adds orphan handling,
+stale-pid detection, and no-auto-wake failure modes, and it only pays off for
+very long Claude builds. Until the wrapper and polling contract exist, use the
+synchronous Phase 1 protocol above.
diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 37bd405..8242e66 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -87,8 +87,9 @@ Options:
   --model MODEL            Override Codex model
   --verifier AGENT         Force the verify-phase agent (codex|opus|claude); default derives from --executor
   --claude-model MODEL     Override Claude model (aliases: best/opus -> claude-opus-4-8, fable -> claude-fable-5; else passthrough)
-  --preset NAME            Expand a named seat preset (available: codex-build).
+  --preset NAME            Expand a named seat preset (available: codex-build, claude-build).
                            codex-build = best (currently Opus) plans (high) + Codex executes (xhigh) + best reviews (max),
+                           claude-build = Codex plans (xhigh) + Claude executes (best/high) + Codex reviews (xhigh),
                            --verify on, bounces 2, revise-loop 1. Precedence: flags placed AFTER --preset
                            override it (last-wins); pre-set model/effort env vars win over the preset.
   --workdir DIR            Working directory (default: current directory)
@@ -158,7 +159,14 @@ apply_preset() {
       : "${VERIFIER_MODEL:=best}";  : "${VERIFIER_EFFORT:=max}"
       : "${EXECUTOR_EFFORT:=xhigh}"   # codex model stays the CLI's configured default — deliberately unpinned
       ;;
-    *) die "Unknown preset: $1 (available: codex-build)" ;;
+    claude-build)  # "build with claude": Codex plans/reviews, Claude executes.
+      COMPOSER="codex"; EXECUTOR="opus"; VERIFIER_OVERRIDE="codex"
+      VERIFY=true; BOUNCES=2; REVISE_LOOP_MAX=1
+      : "${EXECUTOR_MODEL:=best}";  : "${EXECUTOR_EFFORT:=high}"   # Claude (Opus) writes the code
+      : "${COMPOSER_EFFORT:=xhigh}"   # codex plans; model left to the CLI's config (unpinned)
+      : "${VERIFIER_EFFORT:=xhigh}"   # codex reviews; model unpinned
+      ;;
+    *) die "Unknown preset: $1 (available: codex-build, claude-build)" ;;
   esac
 }
 
diff --git a/dev-review/codex/instructions.md b/dev-review/codex/instructions.md
index 7861c51..d08c8be 100644
--- a/dev-review/codex/instructions.md
+++ b/dev-review/codex/instructions.md
@@ -19,6 +19,7 @@ Pick the lowest-ceremony path that still protects correctness. **Start with `co-
 | User wants a reviewed plan before any code changes | `bash co-evolve-bouncer.sh --vanilla "task"` or `dev-review.sh --plan-only` | Plan artifact only |
 | User wants a final review against the plan and diff | `bash dev-review/codex/dev-review.sh --verify <task>` | Runs verifier after execution |
 | Orchestrated / background build — plan and review in-session, Codex executes detached | `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --preset codex-build --skip-plan --plan <file> --branch auto -- "<task>"` (kicked in the background; see `skills/codex-build/`) | Model-ladder ladder: best Claude (Opus) plans/reviews, Codex executes; session is not babysat |
+| Orchestrated build on Claude — Codex plans/reviews, Claude executes (synchronous) | `bash dev-review/codex/dev-review.sh --preset claude-build --skip-plan --plan <file> --worktree auto -- "<task>"` (see `dev-review/codex/claude-build.md`) | Mirror ladder: Codex plans/reviews, Claude executes; hard-requires claude auth |
 | Ad-hoc interactive delegation to Codex mid-session | OpenAI plugin `/codex:rescue --background` (see `skills/codex-build/` Transport B) | Quick delegation; review with the same verdict schema by hand, not via CI |
 
 ## Distinguish Coding From Everything Else
diff --git a/tests/docs-sync-simulation.sh b/tests/docs-sync-simulation.sh
index d3ce11b..53e8d2c 100644
--- a/tests/docs-sync-simulation.sh
+++ b/tests/docs-sync-simulation.sh
@@ -1,19 +1,26 @@
 #!/usr/bin/env bash
 # tests/docs-sync-simulation.sh
-# Hermetic sync-token gate for v1.5 Phase 5: pin the codex-build preset triple
-# so the runner code and the user-facing docs cannot drift.
+# Hermetic sync-token gate for v1.5 Phase 5/Feature 6: pin the build preset
+# triples so the runner code and the user-facing docs cannot drift.
 #
 # The `codex-build` preset is defined in ONE place (dev-review/codex/dev-review.sh
 # apply_preset) but DESCRIBED in two skill docs (skills/codex-build/SKILL.md and
-# skills/dev-review/SKILL.md). If someone edits the seats in one and forgets the
-# others, the docs lie. This gate greps the load-bearing seat tokens out of each
-# file and asserts they all agree on the same triple:
+# skills/dev-review/SKILL.md). The `claude-build` preset is defined beside it and
+# described in dev-review/codex/claude-build.md. If someone edits the seats in one
+# and forgets the others, the docs lie. This gate greps the load-bearing seat
+# tokens out of each file and asserts they all agree on the same triples:
 #
 #   composer  = best (currently Opus), effort high
 #   executor  = Codex, effort xhigh (model unpinned — CLI config rules)
 #   verifier  = best (currently Opus), effort max
 #   bounces   = 2
 #
+#   claude-build:
+#   composer  = Codex, effort xhigh (model unpinned — CLI config rules)
+#   executor  = best (currently Opus), effort high
+#   verifier  = Codex, effort xhigh (model unpinned — CLI config rules)
+#   bounces   = 2
+#
 # Pattern: pure-grep assertions over the real files (no agents, no stubs). Each
 # scenario is a (file, required-token) pair. This is a structural invariant test,
 # the same flavour as the frozen-surface grep checks in classifier-simulation.sh.
@@ -25,6 +32,7 @@ REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 
 RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh"
 CODEX_BUILD_SKILL="$REPO_ROOT/skills/codex-build/SKILL.md"
+CLAUDE_BUILD_DOC="$REPO_ROOT/dev-review/codex/claude-build.md"
 DEV_REVIEW_SKILL="$REPO_ROOT/skills/dev-review/SKILL.md"
 
 TOTAL=0
@@ -65,31 +73,71 @@ assert_not_grep() {
   fi
 }
 
+# assert_text_grep <label> <text> <ERE-pattern>
+# One scenario: the pattern must match inside an already-extracted text block.
+assert_text_grep() {
+  local label="$1" text="$2" pat="$3"
+  TOTAL=$((TOTAL + 1))
+  if printf '%s\n' "$text" | grep -Eiq -- "$pat"; then
+    pass "$label"
+  else
+    fail "$label — pattern not found in extracted preset arm: /$pat/"
+  fi
+}
+
+# assert_text_not_grep <label> <text> <ERE-pattern>
+# One scenario: the pattern must NOT match inside an already-extracted text block.
+assert_text_not_grep() {
+  local label="$1" text="$2" pat="$3"
+  TOTAL=$((TOTAL + 1))
+  if printf '%s\n' "$text" | grep -Eq -- "$pat"; then
+    fail "$label — pattern unexpectedly found in extracted preset arm: /$pat/"
+  else
+    pass "$label"
+  fi
+}
+
 # ===========================================================================
 # Group 1: the runner's apply_preset() is the single source of truth.
 # These pin the literal knob assignments so a doc edit can be diffed against
 # the code, and a code edit that changes the triple breaks this gate too.
 # ===========================================================================
-assert_grep "runner: composer best seat (COMPOSER_MODEL:=best)" \
-  "$RUNNER" 'COMPOSER_MODEL:=best'
-assert_grep "runner: composer effort high (COMPOSER_EFFORT:=high)" \
-  "$RUNNER" 'COMPOSER_EFFORT:=high'
-assert_grep "runner: verifier best seat (VERIFIER_MODEL:=best)" \
-  "$RUNNER" 'VERIFIER_MODEL:=best'
-assert_grep "runner: verifier effort max (VERIFIER_EFFORT:=max)" \
-  "$RUNNER" 'VERIFIER_EFFORT:=max'
-assert_grep "runner: executor effort xhigh (EXECUTOR_EFFORT:=xhigh)" \
-  "$RUNNER" 'EXECUTOR_EFFORT:=xhigh'
-assert_grep "runner: bounces pinned to 2 (BOUNCES=2)" \
-  "$RUNNER" 'BOUNCES=2'
-# Executor model is deliberately unpinned: apply_preset must NOT set EXECUTOR_MODEL
-# (the CLI's ~/.codex/config.toml model wins). Pin that invariant.
-TOTAL=$((TOTAL + 1))
-if sed -n '/^apply_preset()/,/^}/p' "$RUNNER" | grep -Eq 'EXECUTOR_MODEL'; then
-  fail "runner: executor model unexpectedly pinned in apply_preset (must stay unpinned)"
-else
-  pass "runner: executor model unpinned in apply_preset (CLI config rules)"
-fi
+CODEX_BUILD_ARM=$(sed -n '/codex-build).*build with codex/,/^[[:space:]]*;;/p' "$RUNNER")
+CLAUDE_BUILD_ARM=$(sed -n '/claude-build).*build with claude/,/^[[:space:]]*;;/p' "$RUNNER")
+
+assert_text_grep "runner codex-build: composer opus / executor codex / verifier opus" \
+  "$CODEX_BUILD_ARM" 'COMPOSER="opus"; EXECUTOR="codex"; VERIFIER_OVERRIDE="opus"'
+assert_text_grep "runner codex-build: composer best seat (COMPOSER_MODEL:=best)" \
+  "$CODEX_BUILD_ARM" 'COMPOSER_MODEL:=best'
+assert_text_grep "runner codex-build: composer effort high (COMPOSER_EFFORT:=high)" \
+  "$CODEX_BUILD_ARM" 'COMPOSER_EFFORT:=high'
+assert_text_grep "runner codex-build: verifier best seat (VERIFIER_MODEL:=best)" \
+  "$CODEX_BUILD_ARM" 'VERIFIER_MODEL:=best'
+assert_text_grep "runner codex-build: verifier effort max (VERIFIER_EFFORT:=max)" \
+  "$CODEX_BUILD_ARM" 'VERIFIER_EFFORT:=max'
+assert_text_grep "runner codex-build: executor effort xhigh (EXECUTOR_EFFORT:=xhigh)" \
+  "$CODEX_BUILD_ARM" 'EXECUTOR_EFFORT:=xhigh'
+assert_text_grep "runner codex-build: bounces pinned to 2 (BOUNCES=2)" \
+  "$CODEX_BUILD_ARM" 'BOUNCES=2'
+assert_text_not_grep "runner codex-build: executor model unpinned in preset arm" \
+  "$CODEX_BUILD_ARM" 'EXECUTOR_MODEL'
+
+assert_text_grep "runner claude-build: composer codex / executor opus / verifier codex" \
+  "$CLAUDE_BUILD_ARM" 'COMPOSER="codex"; EXECUTOR="opus"; VERIFIER_OVERRIDE="codex"'
+assert_text_grep "runner claude-build: executor best seat (EXECUTOR_MODEL:=best)" \
+  "$CLAUDE_BUILD_ARM" 'EXECUTOR_MODEL:=best'
+assert_text_grep "runner claude-build: executor effort high (EXECUTOR_EFFORT:=high)" \
+  "$CLAUDE_BUILD_ARM" 'EXECUTOR_EFFORT:=high'
+assert_text_grep "runner claude-build: composer effort xhigh (COMPOSER_EFFORT:=xhigh)" \
+  "$CLAUDE_BUILD_ARM" 'COMPOSER_EFFORT:=xhigh'
+assert_text_grep "runner claude-build: verifier effort xhigh (VERIFIER_EFFORT:=xhigh)" \
+  "$CLAUDE_BUILD_ARM" 'VERIFIER_EFFORT:=xhigh'
+assert_text_grep "runner claude-build: bounces pinned to 2 (BOUNCES=2)" \
+  "$CLAUDE_BUILD_ARM" 'BOUNCES=2'
+assert_text_not_grep "runner claude-build: composer model unpinned in preset arm" \
+  "$CLAUDE_BUILD_ARM" 'COMPOSER_MODEL'
+assert_text_not_grep "runner claude-build: verifier model unpinned in preset arm" \
+  "$CLAUDE_BUILD_ARM" 'VERIFIER_MODEL'
 
 # ===========================================================================
 # Group 2: skills/codex-build/SKILL.md must describe the SAME triple.
@@ -108,7 +156,31 @@ assert_grep "codex-build skill: executor model unpinned (CLI config note)" \
   "$CODEX_BUILD_SKILL" "left to the CLI's config"
 
 # ===========================================================================
-# Group 3: skills/dev-review/SKILL.md cross-references the same preset triple.
+# Group 3: dev-review/codex/claude-build.md must describe the SAME reverse
+# triple and its synchronous/hard-auth orchestration contract.
+# ===========================================================================
+assert_grep "claude-build doc: composer Codex at xhigh" \
+  "$CLAUDE_BUILD_DOC" 'composer = Codex \(xhigh'
+assert_grep "claude-build doc: executor Opus at best/high" \
+  "$CLAUDE_BUILD_DOC" 'executor = Opus \(best/high\)'
+assert_grep "claude-build doc: verifier Codex at xhigh" \
+  "$CLAUDE_BUILD_DOC" 'verifier = Codex \(xhigh'
+assert_grep "claude-build doc: codex model unpinned (CLI config note)" \
+  "$CLAUDE_BUILD_DOC" "left to the CLI's config"
+assert_grep "claude-build doc: bounces 2" \
+  "$CLAUDE_BUILD_DOC" 'bounces 2'
+assert_grep "claude-build doc: revise-loop 1" \
+  "$CLAUDE_BUILD_DOC" 'revise-loop 1'
+assert_grep "claude-build doc: hard stop on missing claude auth" \
+  "$CLAUDE_BUILD_DOC" 'HARD STOP'
+assert_grep "claude-build doc: blocking run contract" \
+  "$CLAUDE_BUILD_DOC" 'RUN \(BLOCKING\)'
+assert_grep "claude-build doc: future detached mode design-only" \
+  "$CLAUDE_BUILD_DOC" 'Future: detached mode \(design only\)'
+
+# ===========================================================================
+# Group 4: skills/dev-review/SKILL.md cross-references the codex-build preset
+# triple.
 # ===========================================================================
 assert_grep "dev-review skill: Opus plans at high" \
   "$DEV_REVIEW_SKILL" 'Opus plans at .?high'
diff --git a/tests/preset-expansion-simulation.sh b/tests/preset-expansion-simulation.sh
index 00b5068..b9a5e36 100644
--- a/tests/preset-expansion-simulation.sh
+++ b/tests/preset-expansion-simulation.sh
@@ -5,6 +5,8 @@
 #
 # The codex-build preset expands to the post's ladder — best (currently Opus)
 # plans (high), Codex executes (xhigh), best reviews (max) — behind one flag.
+# The claude-build preset mirrors it: Codex plans/reviews (xhigh, model
+# unpinned), Claude executes (best/high).
 # This gate pins:
 #   a. seat argv under the preset (composer claude --model claude-opus-4-8
 #      [the `best` alias resolved] --effort high; executor codex
@@ -23,6 +25,9 @@
 #      on a ChatGPT account); seat_models.verifier falls back to default.
 #   i. alias resolution: resolve_claude_model_alias maps best/opus ->
 #      claude-opus-4-8, keeps fable -> claude-fable-5, passes ids through.
+#   j. --preset claude-build: composer=codex xhigh/no model, executor=claude
+#      best/high, verifier=codex xhigh/no model, with no codex model leaking
+#      into the claude executor argv.
 #
 # Pattern: PATH-injected claude + codex stubs append their full argv (one line
 # per invocation) to phase-agnostic logs; assertions grep the logs. Both stubs
@@ -69,7 +74,9 @@ mkdir -p "$TEST_DIR/bin"
 # drain stdin via read-loop, then emit a phase-appropriate body. The verify
 # phase is detected by the review-prompt heading on stdin; for it the stub
 # emits a verdict whose JSON object lives at column 0 wrapped in prose + fences
-# (exercises the brace-block hardening fallback). All other phases emit a plan.
+# (exercises the brace-block hardening fallback). The Opus execute phase appends
+# to touched.txt when the runner grants a writable --add-dir. All other phases
+# emit a plan.
 cat > "$TEST_DIR/bin/claude" <<'STUB'
 #!/usr/bin/env bash
 for arg in "$@"; do
@@ -78,6 +85,13 @@ for arg in "$@"; do
   esac
 done
 [[ -n "${CLAUDE_ARGV_LOG:-}" ]] && printf '%s\n' "$*" >> "$CLAUDE_ARGV_LOG"
+workdir=""; prev=""
+for arg in "$@"; do
+  case "$prev" in
+    --add-dir) workdir="$arg" ;;
+  esac
+  prev="$arg"
+done
 # Drain stdin via read-loop (no SIGPIPE); capture it so we can detect the phase.
 stdin_capture=""
 while IFS= read -r _line; do stdin_capture+="$_line"$'\n'; done
@@ -101,6 +115,10 @@ That concludes the review.
 VERDICT
   exit 0
 fi
+if [[ "$stdin_capture" == *"Execution Instructions (Opus)"* \
+      && -n "$workdir" && -f "$workdir/touched.txt" ]]; then
+  printf 'claude-stub-change\n' >> "$workdir/touched.txt"
+fi
 # Compose / bounce phase: emit a minimal valid plan that clears the plan_quality
 # gate (validate_plan_artifact wants >= 60 words, >= 5 non-empty lines, >= 2
 # structural lines) so PLAN_EXIT stays 0 and the execute + verify phases run.
@@ -349,8 +367,8 @@ fi
 # ===========================================================================
 TOTAL=$((TOTAL + 1))
 e_out=$(bash "$RUNNER" --preset bogus -- "x" 2>&1 || true)
-if printf '%s' "$e_out" | grep -Eq 'Unknown preset: bogus'; then
-  pass "unknown preset dies with the unknown-preset message"
+if printf '%s' "$e_out" | grep -Eq 'Unknown preset: bogus \(available: codex-build, claude-build\)'; then
+  pass "unknown preset dies with the unknown-preset message and available presets"
 else
   fail "unknown preset did not produce the expected die message (got: $e_out)"
 fi
@@ -360,10 +378,12 @@ fi
 # ===========================================================================
 TOTAL=$((TOTAL + 1))
 help_out=$(bash "$RUNNER" --help 2>/dev/null || true)
-if printf '%s' "$help_out" | grep -Eq -- '--preset'; then
-  pass "--help mentions --preset"
+if printf '%s' "$help_out" | grep -Eq -- '--preset' \
+   && printf '%s' "$help_out" | grep -Eq -- 'codex-build, claude-build' \
+   && printf '%s' "$help_out" | grep -Eq -- 'claude-build = Codex plans'; then
+  pass "--help mentions --preset and both build presets"
 else
-  fail "--help does not mention --preset"
+  fail "--help does not mention --preset and both build presets"
 fi
 
 # ===========================================================================
@@ -473,6 +493,64 @@ else
   fail "alias resolution mismatch (best=$(resolve_claude_model_alias best 2>/dev/null), opus=$(resolve_claude_model_alias opus 2>/dev/null), fable=$(resolve_claude_model_alias fable 2>/dev/null))"
 fi
 
+# ===========================================================================
+# Scenario (j): --preset claude-build -> seat argv. This is the reverse ladder:
+# codex composes/reviews at xhigh with no pinned model; claude executes with
+# best resolved to the current Opus id and high effort. Also asserts the
+# symmetric leak guard direction: no codex-shaped model reaches the claude
+# executor argv.
+# ===========================================================================
+TOTAL=$((TOTAL + 1))
+repo=$(make_scratch_repo j)
+claude_log="$TEST_DIR/j_claude.log"; codex_log="$TEST_DIR/j_codex.log"
+: > "$claude_log"; : > "$codex_log"
+run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true)
+(
+  unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT
+  unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT
+  export PATH="$TEST_DIR/bin:$PATH"
+  export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log"
+  bash "$RUNNER" --preset claude-build --workdir "$repo" --timeout 60 -- "preset claude-build seat argv probe"
+) >"$TEST_DIR/j.out" 2>&1 || true
+j_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1)
+j_claude_execute_argv=$(grep -- '--permission-mode bypassPermissions' "$claude_log" || true)
+j_codex_verify_argv=$(grep -- '--output-schema' "$codex_log" || true)
+j_ok=true
+
+# Composer/verifier = codex with xhigh effort and no pinned model.
+if ! grep -Eq -- '-c model_reasoning_effort=xhigh' "$codex_log"; then j_ok=false; fi
+if grep -Eq -- '-c model=' "$codex_log"; then j_ok=false; fi
+if [[ -z "$j_codex_verify_argv" ]]; then j_ok=false; fi
+if ! printf '%s' "$j_codex_verify_argv" | grep -Eq -- '-c model_reasoning_effort=xhigh'; then j_ok=false; fi
+
+# Executor = claude writable argv with best -> claude-opus-4-8 and effort high.
+if [[ -z "$j_claude_execute_argv" ]]; then j_ok=false; fi
+if ! printf '%s' "$j_claude_execute_argv" | grep -Eq -- '--model claude-opus-4-8'; then j_ok=false; fi
+if ! printf '%s' "$j_claude_execute_argv" | grep -Eq -- '--effort high'; then j_ok=false; fi
+if printf '%s' "$j_claude_execute_argv" | grep -Eq -- '--model (gpt-|codex)'; then j_ok=false; fi
+
+if [[ -n "$j_run_dir" && "$j_run_dir" != "$run_dir_before" ]]; then
+  if ! jq -e '.seat_models.composer == "codex:(default)@xhigh"
+              and .seat_models.executor == "opus:claude-opus-4-8@high"
+              and .seat_models.verifier == "codex:(default)@xhigh"' \
+       "$j_run_dir/state.json" >/dev/null 2>&1; then
+    j_ok=false
+  fi
+else
+  j_ok=false
+fi
+
+if [[ "$j_ok" == true ]]; then
+  pass "preset claude-build: composer codex/xhigh (no model), executor best->opus/high, verifier codex/xhigh (no model)"
+else
+  fail "preset claude-build seat argv mismatch (run dir: ${j_run_dir:-<none>})"
+  { echo "--- claude execute argv ---"; printf '%s\n' "$j_claude_execute_argv"
+    echo "--- codex argv ---"; cat "$codex_log"
+    [[ -n "${j_run_dir:-}" && -f "$j_run_dir/state.json" ]] && { echo "--- seat_models ---"; jq -c '.seat_models' "$j_run_dir/state.json"; }
+  } >&2
+fi
+[[ -n "${j_run_dir:-}" && "$j_run_dir" != "$run_dir_before" ]] && rm -rf "$j_run_dir"
+
 # --- summary ----------------------------------------------------------------
 passed=$((TOTAL - FAILURES))
 if (( FAILURES == 0 )); then

From deb4669881c614bf946523353e8d2fd3e3c08743 Mon Sep 17 00:00:00 2001
From: Alan Shurafa <ithelast@hotmail.com>
Date: Sat, 13 Jun 2026 16:33:58 -0400
Subject: [PATCH 12/12] fix: agent_auth_failed no longer false-positives on
 echoed auth text
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The execute/verify auth gate scanned both the output and the full stderr
log for auth substrings unconditionally, so any run whose working log
echoed auth strings — plan text, or the auth-detection source itself —
was misread as "authentication failed" and aborted with no verdict
(observed building claude-build via codex-build). Mirror lib's robust
validate_agent_artifact discriminator: an auth banner in the output
counts only when the output is short (<50 words), and a banner in stderr
counts only when the output is empty (a real work product means the CLI
authenticated). Adds reliability sim S5a-S5d covering the regression.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 dev-review/codex/dev-review.sh  | 30 ++++++++++---
 tests/reliability-simulation.sh | 75 +++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+), 6 deletions(-)

diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh
index 8242e66..1027aee 100644
--- a/dev-review/codex/dev-review.sh
+++ b/dev-review/codex/dev-review.sh
@@ -315,18 +315,36 @@ write_text_file() {
 
 agent_auth_failed() {
   local agent="$1"
-  shift
-  local file_path
-  local cli_name
+  local output_file="${2:-}"
+  local stderr_file="${3:-}"
+  local cli_name words
 
   cli_name=$(agent_cli_name "$agent")
 
-  for file_path in "$@"; do
-    if file_contains_auth_failure "$file_path"; then
+  # A genuine auth failure means the CLI bailed BEFORE doing work, so its banner
+  # is short and stands alone. Mirror validate_agent_artifact's discriminator so
+  # a substantial work product that merely echoes auth strings — e.g. plan text,
+  # or the auth-detection source itself — is never misread as an auth failure.
+  #
+  # (1) Auth banner IN THE OUTPUT, but only when the output is short (< 50
+  #     words). A long output that mentions "Unauthorized"/"Not logged in" is
+  #     real work, not the CLI's own banner.
+  if [[ -n "$output_file" ]] && file_contains_auth_failure "$output_file"; then
+    words=$(wc -w < "$output_file" | tr -d '\r\n ')
+    if (( words < 50 )); then
       log "WARNING: ${cli_name} authentication failed. Refresh the ${cli_name} CLI session and rerun."
       return 0
     fi
-  done
+  fi
+
+  # (2) Auth banner in STDERR counts only when the agent produced NO output. A
+  #     non-empty work product means the CLI authenticated and ran; auth strings
+  #     in its (possibly huge) working log are echoed content, not the banner.
+  if [[ ! -s "$output_file" && -n "$stderr_file" && -s "$stderr_file" ]] \
+     && file_contains_auth_failure "$stderr_file"; then
+    log "WARNING: ${cli_name} authentication failed. Refresh the ${cli_name} CLI session and rerun."
+    return 0
+  fi
 
   return 1
 }
diff --git a/tests/reliability-simulation.sh b/tests/reliability-simulation.sh
index 3f9462f..3bc8149 100755
--- a/tests/reliability-simulation.sh
+++ b/tests/reliability-simulation.sh
@@ -100,6 +100,81 @@ else
   fail "S5: empty output -> expected rc 1, got $rc"
 fi
 
+# ---------------------------------------------------------------------------
+# R-2b: agent_auth_failed (the dev-review.sh execute/verify auth gate, distinct
+# from lib's validate_agent_artifact). Regression for the false positive where a
+# large working log echoes auth strings (plan text, or the auth-detection source
+# itself) and trips a naive substring scan. Extract the real functions from the
+# runner via sed (dev-review.sh has no main guard, so it is not safe to source
+# whole — same discipline as tests/revise-loop-simulation.sh).
+# ---------------------------------------------------------------------------
+AUTH_FN_SRC="$TEST_DIR/agent_auth_failed.sh"
+sed -n '/^agent_cli_name() {/,/^}$/p'    "$REPO_ROOT/dev-review/codex/dev-review.sh" >  "$AUTH_FN_SRC"
+sed -n '/^agent_auth_failed() {/,/^}$/p' "$REPO_ROOT/dev-review/codex/dev-review.sh" >> "$AUTH_FN_SRC"
+# shellcheck disable=SC1090
+source "$AUTH_FN_SRC"
+
+if ! declare -F agent_auth_failed >/dev/null; then
+  TOTAL=$((TOTAL + 1))
+  fail "R-2b: agent_auth_failed not sourced — cannot test"
+else
+  # Scenario 5a: empty output + auth banner in stderr -> detected (rc 0).
+  TOTAL=$((TOTAL + 1))
+  out="$TEST_DIR/s5a-out.md"; err="$TEST_DIR/s5a-err.log"
+  : > "$out"; printf 'Not logged in. Please run /login\n' > "$err"
+  rc=0; agent_auth_failed codex "$out" "$err" >/dev/null 2>&1 || rc=$?
+  if [[ "$rc" -eq 0 ]]; then
+    pass "S5a: agent_auth_failed: empty output + stderr banner -> detected"
+  else
+    fail "S5a: expected detection (rc 0), got $rc"
+  fi
+
+  # Scenario 5b: short auth banner in the OUTPUT -> detected (rc 0).
+  TOTAL=$((TOTAL + 1))
+  out="$TEST_DIR/s5b-out.md"; err="$TEST_DIR/s5b-err.log"
+  printf 'Failed to authenticate. Please run `claude login`.\n' > "$out"; : > "$err"
+  rc=0; agent_auth_failed claude "$out" "$err" >/dev/null 2>&1 || rc=$?
+  if [[ "$rc" -eq 0 ]]; then
+    pass "S5b: agent_auth_failed: short banner in output -> detected"
+  else
+    fail "S5b: expected detection (rc 0), got $rc"
+  fi
+
+  # Scenario 5c (REGRESSION): substantial output + a huge stderr working log that
+  # echoes auth strings deep inside (plan text / the auth-detection source) ->
+  # NOT an auth failure (rc 1). This is the codex-build self-build false positive.
+  TOTAL=$((TOTAL + 1))
+  out="$TEST_DIR/s5c-out.md"; err="$TEST_DIR/s5c-err.log"
+  {
+    printf 'Implemented the feature as planned. Files changed and tests pass.\n'
+    for i in $(seq 1 80); do printf 'word%d ' "$i"; done; printf '\n'
+  } > "$out"
+  {
+    for i in $(seq 1 3000); do printf 'log line %d: working...\n' "$i"; done
+    printf '  66: - If the output contains `Not logged in` (or `/login`): degrade\n'
+    printf "  576: grep -qiE 'Not logged in|Please run /login|Unauthorized' file\n"
+    for i in $(seq 1 3000); do printf 'log line %d: more work...\n' "$i"; done
+  } > "$err"
+  rc=0; agent_auth_failed codex "$out" "$err" >/dev/null 2>&1 || rc=$?
+  if [[ "$rc" -eq 1 ]]; then
+    pass "S5c: agent_auth_failed: big working log echoing auth strings -> NOT flagged (regression)"
+  else
+    fail "S5c: expected no detection (rc 1), got $rc — false positive regressed"
+  fi
+
+  # Scenario 5d: empty output + genuine stderr banner -> still detected (rc 0).
+  # The regression guard must not mask a real failure whose only signal is stderr.
+  TOTAL=$((TOTAL + 1))
+  out="$TEST_DIR/s5d-out.md"; err="$TEST_DIR/s5d-err.log"
+  : > "$out"; printf 'authentication_error: OAuth token expired\n' > "$err"
+  rc=0; agent_auth_failed codex "$out" "$err" >/dev/null 2>&1 || rc=$?
+  if [[ "$rc" -eq 0 ]]; then
+    pass "S5d: agent_auth_failed: empty output + genuine stderr banner -> detected"
+  else
+    fail "S5d: expected detection (rc 0), got $rc"
+  fi
+fi
+
 # ---------------------------------------------------------------------------
 # C-2/S-2: fill_template metacharacter pinning (false-positive verification)
 # ---------------------------------------------------------------------------