From b67214569aeafa186ba8588a467de00a9fa5b18f Mon Sep 17 00:00:00 2001 From: Alan Shurafa Date: Fri, 12 Jun 2026 12:40:29 -0400 Subject: [PATCH 01/12] docs(planning): register v1.5 'Build with Codex'; Phase 0 env + R1/R2 research notes Co-Authored-By: Claude Fable 5 --- .planning/ROADMAP.md | 14 +- .planning/STATE.md | 25 ++- .../2026-06-12-claude-json-envelope.md | 71 ++++++ .../2026-06-12-codex-headless-facts.md | 207 ++++++++++++++++++ .planning/v1.5-DESIGN.md | 119 ++++++++++ 5 files changed, 425 insertions(+), 11 deletions(-) create mode 100644 .planning/research/2026-06-12-claude-json-envelope.md create mode 100644 .planning/research/2026-06-12-codex-headless-facts.md create mode 100644 .planning/v1.5-DESIGN.md diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index 5c4b660..941dd2c 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -10,7 +10,19 @@ Co-Evolution is a tooling repo for structured iterative refinement between AI ag - [x] **v1.1 Polish & Ergonomics** (shipped 2026-04-17) — v1.0 code review fixes (WR-01/02/03) + runtime ergonomics (REVISE auto-loop, visible live mode, branch/worktree management). 4 phases, 6 requirements closed. PR [#2](https://github.com/alanshurafa/co-evolution/pull/2) · See [`milestones/v1.1-ROADMAP.md`](milestones/v1.1-ROADMAP.md) · [`milestones/v1.1-SUMMARY.md`](milestones/v1.1-SUMMARY.md) · [`milestones/v1.1-REQUIREMENTS.md`](milestones/v1.1-REQUIREMENTS.md) - [x] **v1.3 Reliability, Measurement & Cross-Platform** (shipped 2026-06-11) — stranded-fix landing, macOS/bash-3.2+5.2 portability with 3-OS CI, silent-failure hardening, and the bounce measurement stack (state.json, deterministic scorer + marker-fate ledger, blind judge, human report). Headline: 17.6% deletion-convergence measured; Fable-5 judge 7/7 improved. 9 phases. See [`milestones/v1.3-SUMMARY.md`](milestones/v1.3-SUMMARY.md) · audit at `docs/audits/2026-06-10-v13-audit.md` -## Active Milestone: v1.4 Distribution — npm + MCP (2026-06-11) +## Active Milestone: v1.5 Build with Codex — model ladder + orchestrated execution (2026-06-12) + +**Goal:** Adopt the Codex-execution / Fable-orchestration split (per @cjzafir's pattern) in the dev-review runner: fix 3 latent env-export bugs, add per-seat model/effort config, a `--preset codex-build` shortcut, detached background execution with harness-exit-wakeup, a status-reader script, token capture to measure the 50% cost claim, and a `/codex-build` orchestration skill for both the runner and plugin transports. Design basis: `.planning/v1.5-DESIGN.md` (approved 2026-06-12). + +- [ ] **Phase 0: Environment + research** (2026-06-12, in progress) — codex symlink + smoke; plugin install; R1 pin `claude -p --output-format json` envelope; R2 pin codex end-of-run token line; register milestone in .planning/; research notes filed. +- [ ] **Phase 1: Seat plumbing + env-export correctness** — `lib/co-evolution.sh` effort knobs + `invoke_codex_schema` move (B2); `dev-review.sh` `export CODEX_MODEL` (B1) + `export WORKDIR` (B3) + `--verifier/--claude-model` flags + per-seat env via `apply_seat_env`. Gate: `tests/run-all.sh` green; byte-parity with knobs off. +- [ ] **Phase 2: Claude-verifier hardening + `--preset codex-build`** — fenced-JSON verdict fallback; preset expansion (fable/high → codex/xhigh → fable/max, bounces=2, revise-loop=1); banner; `tests/preset-expansion-simulation.sh`. +- [ ] **Phase 3: Runner observability + status reader** — `state.json` additions (`current_phase`, `runner_pid`, `pre/post_execute_sha`, `orchestration.parent_run_id`); new `dev-review-status.sh` (~120 lines, exit codes 0/2/3/4/5); `tests/status-reader-simulation.sh`. +- [ ] **Phase 4: Token capture** — `CO_EVOLVE_TOKEN_CAPTURE=1` (default off); `invoke_claude` gated JSON mode; codex stderr harvest; `collect_token_usage` → `state.json.tokens`; `tests/token-capture-simulation.sh`. +- [ ] **Phase 5: `/codex-build` skill + docs** — new `skills/codex-build/SKILL.md` (preflight → plan → kick → wake/gate loop, both runner and plugin transports); CLAUDE.md Default Rule update; routing doc updates. +- [ ] **Phase 6: Dogfood + evidence** — 2–3 real `/codex-build` tasks (ACCEPT / REVISE→ACCEPT / ESCALATE); token evidence note; MCP parity (`vendor.sh` + `npm test`); memory update. + +## Previous Milestone: v1.4 Distribution — npm + MCP (2026-06-11) **Goal:** Make the bounce protocol invocable without `git clone`: a Node/TS MCP server (`@alanshurafa/co-evolution-mcp`, one `co_evolve` tool) published diff --git a/.planning/STATE.md b/.planning/STATE.md index c429f60..8a914df 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -1,17 +1,17 @@ --- gsd_state_version: 1.0 -milestone: v1.4 -milestone_name: Distribution — npm + MCP +milestone: v1.5 +milestone_name: Build with Codex — model ladder + orchestrated execution status: executing -stopped_at: v1.4 Phases 0-4 EXECUTED (2026-06-11). mcp/ package built + 4/4 hermetic smoke tests + stdio handshake verified; CI gains 3-OS mcp job; publish-mcp.yml ready. Remaining = Phase 5 (HUMAN: verify npm scope @alanshurafa, add NPM_TOKEN secret, git tag -> auto-publish, Claude Desktop round-trip) + Phase 6 post-ship registry/awesome-list. -last_updated: "2026-06-10T23:45:00.000Z" -last_activity: 2026-06-11 -- v1.4 milestone registered; v1.3 archived +stopped_at: v1.5 Phase 0 IN PROGRESS (2026-06-12). Codex symlink created, smoke pass, claude -p envelope pinned, plugin installed. v1.4 Phase 5 BLOCKED ON HUMAN (npm scope + NPM_TOKEN + git tag) — running in parallel, untouched. +last_updated: "2026-06-12T12:28:00.000Z" +last_activity: 2026-06-12 -- v1.5 milestone registered; Phase 0 executing progress: - total_phases: 8 - completed_phases: 8 - total_plans: 17 - completed_plans: 18 - percent: 100 + total_phases: 7 + completed_phases: 0 + total_plans: 0 + completed_plans: 0 + percent: 0 --- # Project State @@ -30,6 +30,11 @@ Milestone: v1.4 Distribution — npm + MCP Phase: 0-4 complete; Phase 5 (publish) blocked on human items Status: npm scope verification + NPM_TOKEN secret + git tag are Alan's; everything else built and CI-gated. v1.2 SC-4 gate still open (VERIFY-SC4.md). Last activity: 2026-06-10 -- Phase 0 merges + LF policy + audit report + +Milestone: v1.5 Build with Codex — model ladder + orchestrated execution +Phase: 0 (env + research) in progress +Status: codex symlink created + smoke pass; claude -p envelope pinned; codex plugin installed (openai-codex marketplace, v1.0.4); Phase 0 research notes filed. Branch: feat/v1.5-codex-build. +Last activity: 2026-06-12 -- Phase 0 start; design file committed at .planning/v1.5-DESIGN.md Working directories: `~/co-evolution-v13/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC macOS baseline before v1.3 fixes: scorer-verification 11/14; code-proposer sim 1/16; pr-emitter sim 4/12; template-proposer sim 1/8; revise-loop sim aborts. Root causes: bash 3.2 (mapfile, source <(…)), BSD sed GNU-isms. Target after Phase 0.5: all green on macOS. diff --git a/.planning/research/2026-06-12-claude-json-envelope.md b/.planning/research/2026-06-12-claude-json-envelope.md new file mode 100644 index 0000000..766fa02 --- /dev/null +++ b/.planning/research/2026-06-12-claude-json-envelope.md @@ -0,0 +1,71 @@ +# R1: claude -p JSON Envelope — Field Reference + +**Date:** 2026-06-12 +**Phase:** v1.5 Phase 0 +**Purpose:** Pin the exact JSON output structure of `claude -p --output-format json` for Phase 4 token-capture parsing. + +## Command run + +``` +claude -p --output-format json --model claude-haiku-4-5-20251001 "Reply with exactly: PING" +``` + +## Verbatim output (not-logged-in state; all usage fields present and zero) + +```json +{"type":"result","subtype":"success","is_error":true,"api_error_status":null,"duration_ms":1103,"duration_api_ms":0,"num_turns":1,"result":"Not logged in · Please run /login","stop_reason":"stop_sequence","session_id":"4642f382-c299-4d72-bcaf-3e7bca396c7d","total_cost_usd":0,"usage":{"input_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":0,"server_tool_use":{"web_search_requests":0,"web_fetch_requests":0},"service_tier":"standard","cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"inference_geo":"","iterations":[],"speed":"standard"},"modelUsage":{},"permission_denials":[],"terminal_reason":"completed","fast_mode_state":"off","uuid":"def45115-84e3-4e92-a601-20e0dae05dda"} +``` + +Exit code: 1 (error because not logged in; envelope still emitted on stdout) + +## Top-level fields + +| Field | Type | Notes | +|---|---|---| +| `type` | string | Always `"result"` | +| `subtype` | string | `"success"` even on error | +| `is_error` | bool | `true` when auth fails or API error | +| `api_error_status` | null/int | HTTP status code on API errors; null here | +| `duration_ms` | int | Wall time in ms | +| `duration_api_ms` | int | API time in ms | +| `num_turns` | int | Number of conversation turns | +| `result` | string | The model's text output (or error message) | +| `stop_reason` | string | e.g. `"stop_sequence"`, `"end_turn"` | +| `session_id` | string | UUID | +| `total_cost_usd` | float | Total cost in USD (0 when not logged in) | +| `usage` | object | Token usage breakdown — see below | +| `modelUsage` | object | Per-model usage breakdown (empty when not logged in) | +| `permission_denials` | array | Tool permission denial events | +| `terminal_reason` | string | `"completed"` | +| `fast_mode_state` | string | `"off"` | +| `uuid` | string | Run UUID | + +## usage subfields + +| Field | Type | Notes | +|---|---|---| +| `input_tokens` | int | Prompt input tokens | +| `cache_creation_input_tokens` | int | Cache write tokens | +| `cache_read_input_tokens` | int | Cache hit tokens | +| `output_tokens` | int | Response tokens | +| `server_tool_use.web_search_requests` | int | Web search count | +| `server_tool_use.web_fetch_requests` | int | Web fetch count | +| `service_tier` | string | `"standard"` or `"priority"` | +| `cache_creation.ephemeral_1h_input_tokens` | int | 1-hour ephemeral cache write tokens | +| `cache_creation.ephemeral_5m_input_tokens` | int | 5-min ephemeral cache write tokens | +| `inference_geo` | string | Inference geography code | +| `iterations` | array | Per-iteration usage (for multi-turn) | +| `speed` | string | `"standard"` | + +## Notes for Phase 4 + +- All token fields live under `.usage`. Phase 4's `invoke_claude` gated JSON mode should extract: `.usage.input_tokens`, `.usage.output_tokens`, `.usage.cache_creation_input_tokens`, `.usage.cache_read_input_tokens`. +- `.total_cost_usd` is a direct top-level field, not nested. +- The envelope is always emitted to **stdout** even when `is_error=true`. +- The exit code is 1 on auth error; Phase 4 must handle non-zero exit with usable envelope (capture stdout regardless of exit code using `|| true`). +- **Limitation:** This capture is from a not-logged-in shell. When logged in, `modelUsage` will be populated and `iterations` may have per-turn breakdown. Field names are stable across auth state. + +## Auth status at capture time + +`claude whoami` returns: `Not logged in · Please run /login` +(The Mac's interactive Claude Code sessions authenticate through the Electron app, not this shell. The sub-agent shell does not carry the session token.) diff --git a/.planning/research/2026-06-12-codex-headless-facts.md b/.planning/research/2026-06-12-codex-headless-facts.md new file mode 100644 index 0000000..62ab02a --- /dev/null +++ b/.planning/research/2026-06-12-codex-headless-facts.md @@ -0,0 +1,207 @@ +# Codex Headless Research: Smoke, Token Line, Plugin Route + +**Date:** 2026-06-12 +**Phase:** v1.5 Phase 0 (T1, R2, T4) +**Purpose:** Pin environment facts needed by Phases 1–5. + +--- + +## T1 — Codex CLI symlink + smoke + +### Binary location + +``` +/Applications/Codex.app/Contents/Resources/codex +-rwxr-xr-x@ 1 shurafa staff 213429648 Jun 11 16:11 +``` + +### Symlink created + +``` +ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex +``` + +Result: `lrwxr-xr-x@ 1 shurafa staff 48 Jun 12 12:28 /Users/shurafa/.local/bin/codex -> /Applications/Codex.app/Contents/Resources/codex` + +### Version + +``` +codex-cli 0.140.0-alpha.2 +``` + +### Config (~/.codex/config.toml relevant fields) + +```toml +model = "gpt-5.5" +model_reasoning_effort = "xhigh" +service_tier = "priority" +``` + +### Headless smoke command + +```bash +mkdir -p "$TMPDIR/codex-smoke" && cd "$TMPDIR/codex-smoke" && \ + ~/.local/bin/codex exec --full-auto --skip-git-repo-check \ + "Reply with exactly: SMOKE-OK" > out.txt 2> err.txt; echo "exit=$?" +``` + +**Exit code: 0** + +### stdout (out.txt — verbatim) + +``` +SMOKE-OK +``` + +### stderr (err.txt — verbatim) + +``` +warning: `--full-auto` is deprecated; use `--sandbox workspace-write` instead. +Reading additional input from stdin... +OpenAI Codex v0.140.0-alpha.2 +-------- +workdir: /private/var/folders/20/vl4p6fsd2dsbf5r68611c6nc0000gn/T/codex-smoke +model: gpt-5.5 +provider: openai +approval: never +sandbox: workspace-write [workdir, /tmp, $TMPDIR] +reasoning effort: xhigh +session id: 019ebcaa-5766-7c60-beb8-824b921d69a1 +-------- +user +Reply with exactly: SMOKE-OK +codex +SMOKE-OK +tokens used +11,681 +``` + +**Conclusion:** Auth works headless. gpt-5.5 + xhigh accepted. ChatGPT-plan tier confirmed (`service_tier = "priority"` in config). + +--- + +## R2 — Codex end-of-run token usage line (verbatim + stream) + +**Stream:** stderr +**Exact lines (verbatim, from smoke run above):** + +``` +tokens used +11,681 +``` + +Two lines: the label `tokens used` on one line, then the formatted integer with comma separator on the next line. + +**Format for Phase 4 grep/parse:** +The token count follows the literal line `tokens used` (no colon) as the immediately subsequent line. Parse with: +```bash +grep -A1 "^tokens used$" err.txt | tail -1 +``` +or equivalently with `awk '/^tokens used$/{getline; print}'`. +The number uses comma formatting (e.g., `11,681`); strip commas before arithmetic: `tr -d ','`. + +**Note on deprecation:** `--full-auto` is deprecated in favor of `--sandbox workspace-write`. Both flags produce the same token output format. The runner may want to migrate to `--sandbox workspace-write` in Phase 1. + +--- + +## T4 — Plugin install route + +### `claude plugin --help` subcommands (relevant) + +``` +claude plugin install|i [options] + Install a plugin from available marketplaces + Options: + --config Set a userConfig option (repeatable) + -s, --scope user, project, or local (default: "user") + +claude plugin marketplace [options] [command] + Commands: + add [options] Add a marketplace from a URL, path, or GitHub repo + list [options] List all configured marketplaces + remove|rm [options] Remove a configured marketplace + update [options] [name] Update marketplace(s) +``` + +### Non-interactive install route (confirmed working) + +Step 1 — Add the marketplace: +```bash +claude plugin marketplace add openai/codex-plugin-cc +``` +Output: +``` +Adding marketplace…SSH not configured, cloning via HTTPS: https://github.com/openai/codex-plugin-cc.git +Refreshing marketplace cache (timeout: 120s)… +Cloning repository (timeout: 120s): https://github.com/openai/codex-plugin-cc.git +Clone complete, validating marketplace… +Cleaning up old marketplace cache… +✔ Successfully added marketplace: openai-codex (declared in user settings) +``` + +Step 2 — Install the plugin: +```bash +claude plugin install codex@openai-codex +``` +Output: +``` +Installing plugin "codex@openai-codex"...✔ Successfully installed plugin: codex@openai-codex (scope: user) +``` + +### Verification (`claude plugin list`) + +``` +Installed plugins: + + ❯ codex@openai-codex + Version: 1.0.4 + Scope: user + Status: ✔ enabled +``` + +### Plugin details (component inventory) + +``` +codex 1.0.4 + Use Codex from Claude Code to review code or delegate tasks. + Source: codex@openai-codex + +Component inventory + Skills (10) adversarial-review, cancel, codex-cli-runtime, codex-result-handling, + gpt-5-4-prompting, rescue, result, review, setup, status + Agents (1) codex-rescue + Hooks (3) SessionStart, SessionEnd, Stop (harness-only — no model context cost) + MCP servers (0) + LSP servers (0) + +Projected token cost + Always-on: ~306 tok added to every session +``` + +**Key skills for v1.5:** `/codex:rescue` (delegate tasks to Codex), `/codex:status` (check job status), `/codex:review` (adversarial review), `/codex:cancel` (cancel running job). + +### Conclusion + +A fully **non-interactive** install route exists and works. No `/plugin` UI required. The plan's note about "if it requires the interactive /plugin UI" is moot — `claude plugin marketplace add` + `claude plugin install` suffice. The install persists at user scope. + +### Interactive equivalent (for reference only, not needed) + +Would be: open Claude Code, type `/plugin marketplace add openai/codex-plugin-cc`, then `/plugin install codex`. The CLI route above is strictly equivalent. + +--- + +## Summary + +| Item | Fact | +|---|---| +| codex binary | `/Applications/Codex.app/Contents/Resources/codex` (213 MB, Jun 11) | +| codex version | 0.140.0-alpha.2 | +| symlink | `~/.local/bin/codex` → binary (created 2026-06-12) | +| smoke exit | 0 | +| smoke stdout | `SMOKE-OK` | +| token line stream | **stderr** | +| token line format | two lines: `tokens used` then `11,681` (comma-formatted integer) | +| `--full-auto` | deprecated; equivalent `--sandbox workspace-write` | +| plugin install | non-interactive, fully CLI-driven | +| plugin name | `codex@openai-codex` v1.0.4, user scope, enabled | +| plugin skills | rescue, status, review, cancel, adversarial-review (+ 5 more) | diff --git a/.planning/v1.5-DESIGN.md b/.planning/v1.5-DESIGN.md new file mode 100644 index 0000000..9d661b9 --- /dev/null +++ b/.planning/v1.5-DESIGN.md @@ -0,0 +1,119 @@ +# v1.5 "Build with Codex" — model ladder + Fable-orchestrated execution + + +## Context: the evaluation + +**The post** ([@cjzafir, 2026-06-11](https://x.com/cjzafir/status/2065104422762684745), ~98k views): cut weekly Claude Code limits ~50% by installing OpenAI's official [codex-plugin-cc](https://github.com/openai/codex-plugin-cc) (20.7k stars, Apache 2.0) and splitting roles — **Fable 5 high plans, Codex 5.5 xhigh executes on the ChatGPT plan (no API), Fable 5 max reviews and gates**. "Fable never writes the code, Codex never decides the design." Mechanically the plugin's `codex-rescue` agent shells out to `codex exec --full-auto` — **the exact invocation our dev-review runner already uses**. + +**Verdict: adopt.** dev-review already implements ~80% of this (compose → bounce → execute → verify, Codex executor, review-verdict.json gate, revise loop, worktree isolation). What's genuinely missing: per-seat model/effort selection, a blessed preset, detached execution with gate-based check-ins (the user's "Fable orchestrates, checks in periodically" vision), and token instrumentation to test the 50% claim. + +**Where we improve on the post:** +1. **Bounce the brief** — Codex critiques the plan with `[CONTESTED]`/`[CLARIFY]` markers before execution (post has no plan hardening). +2. **Schema-bound verdict gate with bounded convergence** — APPROVED/REVISE + confidence + issues[], max 2 orchestrator revise rounds (mirrors the 2-pass marker-expiry rule). The post's "cycle repeats until it passes" has no termination guarantee. +3. **Detached execution, not babysitting** — the post keeps a Fable session supervising inline, burning cache reads every turn (against the token policy). We kick the runner via harness background Bash, **end the turn**, and get woken on exit — Fable pays only for plan + review gates. +4. **Worktree/branch isolation** (`--worktree auto` exists) — post runs Codex loose in the repo. +5. **Evidence, not vibes** — per-phase token capture into state.json so runs measure whether the 50% claim holds here. + +**Environment discoveries (this Mac):** +- Codex.app installed, `~/.codex/config.toml` already `model = "gpt-5.5"`, `model_reasoning_effort = "xhigh"`, fresh auth.json; CLI binary at `/Applications/Codex.app/Contents/Resources/codex` (not on PATH — symlink needed). +- claude CLI 2.1.162; `--effort` proven headless in-repo (`evals/judge-bounce.sh:139` uses `--effort high` with `claude-fable-5`). +- `~/Project/co-evolution` clone current at 215dc83, clean. +- **3 latent bugs found** (must fix; they break per-seat config under the `timeout bash -c` process boundary): **B1** `--model` sets `CODEX_MODEL` without `export` (dev-review.sh:976) so it never reaches `invoke_codex`; **B2** `invoke_codex_schema` defined in dev-review.sh but called in a child that sources only lib → exit 127 when verifier=codex; **B3** `WORKDIR` never exported → agents fall back to launch cwd. + +**User decisions:** full v1.5 scope · both transports (own runner core + plugin path) · skill named `/codex-build` · Mac codex setup yes. + +## Ground rules + +- Implement in **`~/Project/co-evolution`** (GitHub-first; the SMB mount `/Volumes/Project/co-evolution` is read-only/sync-only). Branch `feat/v1.5-codex-build`; push when sessions end. +- Register milestone **v1.5** per `.planning/` conventions (STATE.md, ROADMAP.md, `phases/NN-*/`). v1.4 Phase 5 (npm publish) stays human-blocked in parallel — untouched. +- All new knobs default empty/off → **byte-parity** with today's argv and artifacts (existing sims must stay green). +- lib/co-evolution.sh changes must stay additive (it's vendored into mcp/; re-run `vendor.sh` + mcp tests to prove parity). + +## Phases + +### Phase 0 — Environment + research spikes +- `ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex`; `codex --version`; headless smoke `codex exec --full-auto` in a scratch dir (confirms ChatGPT-plan auth + gpt-5.5/xhigh accepted). +- Install the plugin in Claude Code: `/plugin marketplace add openai/codex-plugin-cc` → install `codex` → smoke `/codex:rescue` + `/codex:status`. +- R1: pin `claude -p --output-format json` envelope fields (usage, total_cost_usd). R2: pin codex end-of-run token line format on stderr. R3: confirm harness background-exit notification with a ~5-min stub run. +- Output: `.planning/research/` notes + v1.5 milestone registration. + +### Phase 1 — Seat plumbing + env-export correctness (runner) +- `lib/co-evolution.sh`: new empty-default knobs `CLAUDE_EFFORT` (→ `--effort` in `invoke_claude`) and `CODEX_REASONING_EFFORT` (→ `-c model_reasoning_effort=` in `invoke_codex`); **move `invoke_codex_schema` into lib** (fixes B2). +- `dev-review/codex/dev-review.sh`: `export CODEX_MODEL` (B1) + `export WORKDIR` (B3); new flags `--verifier opus|claude|codex`, `--claude-model` (+ `fable` → `claude-fable-5` alias); per-seat env `COMPOSER_/EXECUTOR_/VERIFIER_MODEL|EFFORT` via `apply_seat_env` called at the 5 invocation sites (compose, bounce both turns, execute, verify); `select_verifier` honors `VERIFIER_OVERRIDE`. +- Gate: `tests/run-all.sh` green; default invocation argv byte-identical. + +### Phase 2 — Claude-verifier hardening + `--preset codex-build` +- Verify phase: brace-block extraction fallback for prose-wrapped verdicts; persist normalized verdict back to the contract path `verdict.json` (no-op for codex `--output-schema`). +- `apply_preset codex-build`: composer=claude(fable, high) · executor=codex(model unpinned — user's codex config rules, effort xhigh) · verifier=claude(fable, max) · `--verify` on · bounces=2 · revise-loop=1. Banner prints seats; optional `seat_models` state field; RUNNER-CONTRACT.md → v1.1 (additive rows). +- New `tests/preset-expansion-simulation.sh` (PATH-stub claude/codex, assert seat argv incl. `--effort`/`-c model_reasoning_effort=xhigh`, preset expansion, B1/B2/B3 regressions, fenced-JSON verdict parse). + +### Phase 3 — Runner observability + status reader +- state.json additions (single-writer, additive): `current_phase {name, started_at}` written at phase **start** (null at EOF), `runner_pid`, `pre/post_execute_sha`, `orchestration.parent_run_id` via new `--parent-run` flag (lineage across revise re-kicks; re-kicks always get fresh run dirs — no `--resume` in v1.5). +- New `dev-review/codex/dev-review-status.sh` (~120 lines, read-only): run summary, phase progress, heartbeat = stderr-log mtime/size (BSD/GNU `stat` feature-detect), marker counts, verdict, diffstat, dead-runner detection. Exit codes: 0 done / 2 partial / 5 running / 4 presumed-dead / 3 no-run; `--json` mode. This is the **progress-file protocol** any fresh Fable session, Haiku watcher, or human reads. +- New `tests/status-reader-simulation.sh` (synthetic run dirs) + lifecycle sim. + +### Phase 4 — Token capture (`CO_EVOLVE_TOKEN_CAPTURE=1`, default off) +- `invoke_claude`: gated `--output-format json` mode — `.result` → output file (contract preserved), usage → `outputs/usage-.json` sidecar. +- Codex: harvest the token line from already-captured per-phase stderr logs (format pinned in R2). +- `collect_token_usage` at EOF → `state.json.tokens {phases, totals}`; scorer copies totals into report details (informational). New `tests/token-capture-simulation.sh`; parity guard when off. + +### Phase 5 — `/codex-build` skill + docs (both transports) +- **New `skills/codex-build/SKILL.md`** (~250 lines) — the orchestration loop: + 1. PREFLIGHT: codex present, no other active run (`dev-review-status.sh --list`), clean tree → `--branch auto`, dirty → `--worktree auto` (dirty tree otherwise makes verify silently skip). + 2. PLAN in-session (Fable = composer; plan file in `$TMPDIR`, **never** inside the workdir, content embedded inline per repo convention) + optional ≤2 synchronous bounce passes; resolve markers. + 3. KICK: one Bash call, `run_in_background: true`: `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --skip-plan --plan $PLAN --preset codex-build [--parent-run ] -- ""` → report run id → **END TURN** (no polling, no sleep; harness wakes the session on exit). + 4. WAKE → REVIEW GATE: status script → read verdict.json + diffstat first, targeted diff hunks only → **ACCEPT** (report branch/diffstat/tokens, suggest merge/PR) / **REVISE** (append "Reviewer Corrections (Round N)" to a copy of the plan, re-kick with `--parent-run`; **max 2 rounds**) / **ESCALATE** (missing verdict, exit 1, dead runner, rounds exhausted, scope_creep_detected — never auto-merge). + 5. Budget rules + troubleshooting (incl. `pkill -f 'codex exec'` for orphans; optional one-shot scheduled check for very long runs). +- **Plugin transport section** (second supported path, interactive flavor): when mid-session or for ad-hoc rescue work, delegate via `codex-rescue` / `/codex:rescue --background`, check `/codex:status`, then Fable reviews using the same review template + verdict JSON contract and the same ≤2-round cap. Documented protocol sharing our templates/schema; not used by evals/CI (no `--output-schema` contract through the plugin). +- Docs: `CLAUDE.md` Default Rule gains the third option ("build with codex" — Fable plans/reviews, Codex executes); `dev-review/codex/instructions.md` routing rows (runner preset + plugin alternative); `skills/dev-review/SKILL.md` preset mention, `$CODEX_MODEL`/`-c` args gap fix, live-mode limitation note; sync-token grep test pinning the preset table across files. + +### Phase 6 — Dogfood + evidence +- 2–3 real tasks through `/codex-build` on the Mac (now codex-capable); exercise ACCEPT, REVISE→ACCEPT, and ESCALATE paths for real; tune defaults. +- **Evidence note** on the 50% claim: runner-side tokens (state.json) + Fable session turn counts vs. an interactive `/dev-review` baseline on the same task. Honest scope: harness-side session spend visible only as turn counts. +- Re-run `mcp/scripts/vendor.sh` + `(cd mcp && npm test)` (lib changed); update auto-memory with the outcome. + +## Critical files + +| File | Change | +|---|---| +| `dev-review/codex/dev-review.sh` | B1/B3 exports, seat flags/env, preset, `--parent-run`, phase-start state writes, EOF token collect | +| `lib/co-evolution.sh` | effort knobs, `invoke_codex_schema` relocation (B2), `begin_state_phase`, gated JSON mode, `collect_token_usage` | +| `dev-review/codex/dev-review-status.sh` | **new** status reader | +| `skills/codex-build/SKILL.md` | **new** orchestration skill (both transports) | +| `skills/dev-review/SKILL.md`, `CLAUDE.md`, `dev-review/codex/instructions.md` | docs/routing | +| `evals/RUNNER-CONTRACT.md` | v1.1 additive rows (`seat_models`, `current_phase`, `runner_pid`, `tokens`, lineage) | +| `tests/preset-expansion-simulation.sh`, `tests/status-reader-simulation.sh`, `tests/token-capture-simulation.sh` | **new** hermetic sims | +| `.planning/` (STATE.md, ROADMAP.md, `phases/`, `research/`) | v1.5 registration + R1–R3 notes | + +## Verification + +1. **Hermetic (Mac, no real agents):** `bash -n` on touched scripts; `bash tests/run-all.sh` — all existing + 3 new sims green; byte-parity assertions with knobs off. +2. **Claude-only live dry run:** scratch repo, all three seats = claude with `VERIFIER_MODEL=fable VERIFIER_EFFORT=max` — proves `--effort max` headless and the claude-verdict path end-to-end (fallback: `high`). +3. **Codex smoke (Mac, post-symlink):** headless `codex exec` + one full `--preset codex-build` run on a toy repo, incl. one forced REVISE round. +4. **Orchestrated loop:** kick via background Bash, end turn, confirm wake-on-exit + status script output; plugin-transport smoke via `/codex:rescue`. +5. **MCP parity:** `vendor.sh` + `cd mcp && npm test`. +6. **Dogfood matrix (Phase 6):** ACCEPT / REVISE→ACCEPT / ESCALATE each exercised once; token evidence note written. + +## Execution strategy — sub-agent deployment + +This session (Fable) acts as orchestrator only: it briefs one executor sub-agent per phase, reviews the diff + test output at each gate, and never writes the code itself. Each agent gets the plan file path, its phase section, and the repo conventions; context stays small in the main session. + +| Phase | Agent model | Rationale | +|---|---|---| +| 0 — env setup, R1–R3 spikes, v1.5 `.planning` registration | **Sonnet** | Mechanical: symlink, smoke commands, record outputs, templated planning docs. Plugin install (`/plugin`) may need the main session or user if no CLI route. | +| 1 — seat plumbing + B1/B2/B3 fixes | **Opus** | Careful surgery in a 1471-line bash script with byte-parity invariants. | +| 2 — verifier hardening + preset + sim | **Opus** | Verdict-path correctness + new hermetic simulation. | +| 3 — observability + status reader + sims | **Opus** | state.json single-writer weaving is delicate. | +| 4 — token capture | **Opus** | Gated invoke_claude changes against the vendored lib. | +| 5 — `/codex-build` skill + docs | **Opus** | Spec is fully written in this plan; converting it to SKILL.md prose + doc routing is execution, not design. | +| 6 — dogfood + evidence | **Fable gates, Opus drafts** | Real runs go through the runner itself; Fable (main session) reviews verdicts/evidence; an Opus agent drafts the evidence note. | + +Phases run **sequentially** (1→2 and 3→4 edit the same files; parallelism would conflict). Gate protocol between phases: executor reports diff summary + test results → Fable spot-checks the diff → commit on the `feat/v1.5-codex-build` branch → next phase. All work in `~/Project/co-evolution`, never the SMB mount. + +## Risks + +- `--effort max` on claude CLI unverified (in-repo proof exists only for `high`) → dry-run step 2 resolves; one-constant fallback. +- `xhigh` × model mismatch in codex → surfaced by existing stderr/error-payload gates; `EXECUTOR_EFFORT` override. +- Seat env is process-global → every call site re-applies; sim pins all 5 sites. +- Plugin drift (OpenAI-owned formats) → plugin path is prose-level protocol only, pinned to plugin major version in docs; runner path carries CI. +- Missed wake notification (machine sleep) → status script's dead-runner detection + optional one-shot scheduled check. From fb6862ab3d8b84c5f576d2904853d7eb5e270ffd Mon Sep 17 00:00:00 2001 From: Alan Shurafa Date: Fri, 12 Jun 2026 12:40:29 -0400 Subject: [PATCH 02/12] feat: per-seat model/effort seats + B1/B2/B3 env-export fixes (v1.5 Phase 1) Co-Authored-By: Claude Fable 5 --- dev-review/codex/dev-review.sh | 138 +++++++++++++++++++++++++-------- lib/co-evolution.sh | 72 ++++++++++++++++- 2 files changed, 175 insertions(+), 35 deletions(-) diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh index 2d96675..7d7595f 100644 --- a/dev-review/codex/dev-review.sh +++ b/dev-review/codex/dev-review.sh @@ -27,6 +27,9 @@ LIVE_MODE="${LIVE_MODE:-false}" # exclusive — both non-empty after parsing = die. Default empty = Phase 3 byte-parity. BRANCH_SPEC="${DEV_REVIEW_BRANCH:-}" WORKTREE_SPEC="${DEV_REVIEW_WORKTREE:-}" +# v1.5: explicit verifier seat override. Empty = derive from executor via the +# existing select_verifier logic (byte-parity default). Env default + --verifier flag. +VERIFIER_OVERRIDE="${DEV_REVIEW_VERIFIER:-}" BRANCH_CREATED="" WORKTREE_PATH="" WORKDIR="$(pwd)" @@ -75,6 +78,8 @@ Options: --skip-plan Skip compose+bounce and execute an existing plan --plan FILE Existing plan file to use with --skip-plan --model MODEL Override Codex model + --verifier AGENT Force the verify-phase agent (codex|opus|claude); default derives from --executor + --claude-model MODEL Override Claude model (alias: fable -> claude-fable-5; else passthrough) --workdir DIR Working directory (default: current directory) --timeout SECONDS Per-phase timeout in seconds (default: 1800) --revise-loop N Auto-retry on REVISE verdict up to N extra passes (default: 0 = disabled) @@ -110,6 +115,17 @@ normalize_agent() { esac } +# v1.5: resolve a friendly Claude model alias to its CLI model id. Only `fable` +# is aliased today (→ claude-fable-5); anything else passes through verbatim so +# explicit ids (claude-opus-4-6, claude-opus-4-8[1m], …) and an empty value are +# untouched. Used by the --claude-model flag and the per-seat env layer. +resolve_claude_model_alias() { + case "$1" in + fable) echo "claude-fable-5" ;; + *) echo "$1" ;; + esac +} + invoke_agent() { local agent="$1" shift @@ -169,6 +185,12 @@ require_agent_cli() { } select_verifier() { + # v1.5: an explicit --verifier / DEV_REVIEW_VERIFIER override wins; otherwise + # derive the opposite-of-executor default (byte-parity when unset). + if [[ -n "${VERIFIER_OVERRIDE:-}" ]]; then + echo "$VERIFIER_OVERRIDE" + return 0 + fi if [[ "$EXECUTOR" == "codex" ]]; then echo "opus" else @@ -214,6 +236,14 @@ ensure_codex_compatible_workdir() { needs_codex="true" fi + # v1.5: with --verifier codex (or executor=opus default) the verify phase runs + # codex too, so its workdir must satisfy the same WSL constraint. Only matters + # when verification is requested. Byte-parity holds: default executor=codex → + # verifier=opus → this branch is a no-op. + if [[ "$VERIFY" == "true" && "$(select_verifier)" == "codex" ]]; then + needs_codex="true" + fi + if [[ "$needs_codex" != "true" ]]; then return 0 fi @@ -226,38 +256,12 @@ ensure_codex_compatible_workdir() { fi } -invoke_codex_schema() { - local prompt_file="$1" - local output_file="$2" - local stderr_file="$3" - local schema_file="$4" - local workdir="${WORKDIR:-$PWD}" - local -a cmd - local windows_workdir="" - local windows_output="" - local windows_schema="" - - if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1 && command -v wslpath >/dev/null 2>&1; then - windows_workdir=$(wslpath -w "$workdir") - windows_output=$(wslpath -w "$output_file") - windows_schema=$(wslpath -w "$schema_file") - cmd=(cmd.exe /c codex exec --full-auto --skip-git-repo-check -C "$windows_workdir") - else - cmd=(codex exec --full-auto --skip-git-repo-check -C "$workdir") - fi - - if [[ -n "${CODEX_MODEL:-}" ]]; then - cmd+=(-c "model=${CODEX_MODEL}") - fi - - if [[ -n "$windows_schema" ]]; then - cmd+=(--output-schema "$windows_schema" -o "$windows_output") - else - cmd+=(--output-schema "$schema_file" -o "$output_file") - fi - - "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true -} +# v1.5 (fixes B2): invoke_codex_schema now lives in lib/co-evolution.sh (sourced +# at the top of this script). It was previously defined here, but the verify +# phase dispatches it inside a `bash -c 'source lib/co-evolution.sh; invoke_...'` +# child that can only see lib-defined functions — so the local copy was invisible +# there (exit 127 when a timeout binary exists and verifier=codex). Moved to lib +# so both the timeout child and the no-timeout fallback resolve the same function. write_text_file() { local output_path="$1" @@ -565,6 +569,8 @@ Output ONLY the plan document. No preamble." # abort_on_timeout will fire from main flow if the dispatcher reports 124. # RTUX-01: Launch a live-tail window before the agent call (no-op unless --live). maybe_launch_live_window "compose" "$compose_stderr_file" + # v1.5: layer the composer seat's model/effort (no-op when no per-seat env set). + apply_seat_env composer "$COMPOSER" invoke_agent_with_timeout "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$(phase_is_writable compose)" abort_on_timeout "compose" "$phase_start" ensure_valid_plan_output "compose phase" "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$compose_retry_stderr_file" "" "compose" || return $? @@ -655,6 +661,14 @@ run_bounce_phase() { # so state.json records which pass hit the timeout. # RTUX-01: Per-pass live-tail window (bounce-01, bounce-02, ...) when --live. maybe_launch_live_window "bounce-${pass_padded}" "$stderr_file" + # v1.5: composer turns get the composer seat's model/effort; reviewer (the + # bounce counterparty) turns use globals only (the `bounce` seat is a no-op + # arm in apply_seat_env). No-op overall when no per-seat env is set. + if [[ "$role" == "composer" ]]; then + apply_seat_env composer "$current_agent" + else + apply_seat_env bounce "$current_agent" + fi invoke_agent_with_timeout "$current_agent" "$prompt_file" "$output_file" "$stderr_file" "$(phase_is_writable bounce)" abort_on_timeout "bounce-${pass_padded}" "$bounce_pass_start" ensure_valid_plan_output "bounce pass ${pass}" "$current_agent" "$prompt_file" "$output_file" "$stderr_file" "$retry_stderr_file" "$PLAN_PATH" "bounce" || return $? @@ -751,6 +765,10 @@ run_execute_phase() { # RNPT-05: timeout-wrapped. abort_on_timeout uses _execute_phase_start from main flow. # RTUX-01: Live-tail window for execute phase (no-op unless --live). maybe_launch_live_window "execute" "$execute_stderr_file" + # v1.5: layer the executor seat's model/effort. The in-phase empty-output retry + # below inherits these exported values (no second apply needed). No-op when no + # per-seat env is set. + apply_seat_env executor "$EXECUTOR" invoke_agent_with_timeout "$EXECUTOR" "$execute_prompt_file" "$execute_output_file" "$execute_stderr_file" "$(phase_is_writable execute)" abort_on_timeout "execute" "$phase_start" @@ -857,6 +875,10 @@ run_verify_phase() { fi verifier=$(select_verifier) + # v1.5: layer the verifier seat's model/effort, AFTER the verifier agent is + # resolved (the seat env keys off the agent type). Covers both the codex-schema + # and opus branches below. No-op when no per-seat env is set. + apply_seat_env verifier "$verifier" plan_content=$(cat "$PLAN_PATH") diff_content=$(cat "$diff_file") @@ -973,7 +995,23 @@ while [[ $# -gt 0 ]]; do ;; --model) [[ $# -gt 1 ]] || die "--model requires a value" - CODEX_MODEL="$2" + # v1.5 (fixes B1): export so the `bash -c` child inside + # invoke_agent_with_timeout inherits it (mirrors the --timeout arm). Without + # export, CODEX_MODEL never reached invoke_codex in the dispatched child. + export CODEX_MODEL="$2" + shift 2 + ;; + --verifier) + [[ $# -gt 1 ]] || die "--verifier requires a value" + # v1.5: normalize so `claude` maps to the opus seat name the rest of the + # script uses; reject anything normalize_agent doesn't accept. + VERIFIER_OVERRIDE=$(normalize_agent "$2") || die "Unsupported verifier: $2" + shift 2 + ;; + --claude-model) + [[ $# -gt 1 ]] || die "--claude-model requires a value" + # v1.5: resolve friendly alias (fable -> claude-fable-5) else passthrough. + CLAUDE_MODEL=$(resolve_claude_model_alias "$2") shift 2 ;; --workdir) @@ -1147,6 +1185,12 @@ if [[ -n "$PLAN_SOURCE" ]]; then fi WORKDIR="$(cd "$WORKDIR" && pwd)" +# v1.5 (fixes B3): export WORKDIR so the `bash -c` child inside +# invoke_agent_with_timeout (and the verify-codex `bash -c` block) inherits it; +# otherwise ${WORKDIR:-$PWD} in invoke_claude/invoke_codex falls back to the +# child's launch cwd. The export attribute persists across the later worktree-mode +# reassignment of WORKDIR (~line 1300), so the worktree path is exported too. +export WORKDIR if [[ "$SKIP_PLAN" == "true" && -z "$PLAN_SOURCE" ]]; then die "--skip-plan requires --plan FILE" @@ -1192,6 +1236,34 @@ ensure_codex_compatible_workdir require_selected_agent_clis +# v1.5 per-seat env layer. Snapshot the post-parse base model/effort values (set +# by globals / --model / --claude-model / CLAUDE_EFFORT / CODEX_REASONING_EFFORT), +# then apply_seat_env layers an optional per-seat override before each invocation. +# Byte-parity: with no COMPOSER_/EXECUTOR_/VERIFIER_ envs set, every seat resolves +# to its base, so exported CLAUDE_MODEL/CLAUDE_EFFORT/CODEX_MODEL/CODEX_REASONING_EFFORT +# match what the unlayered globals already were — argv is unchanged. +CLAUDE_MODEL_BASE="$CLAUDE_MODEL"; CODEX_MODEL_BASE="${CODEX_MODEL:-}" +CLAUDE_EFFORT_BASE="${CLAUDE_EFFORT:-}"; CODEX_EFFORT_BASE="${CODEX_REASONING_EFFORT:-}" + +# Export is load-bearing: invoke_agent_with_timeout crosses a `timeout bash -c` +# process boundary; only exported values survive it. +apply_seat_env() { + local seat="$1" agent="$2" model="" effort="" + case "$seat" in + composer) model="${COMPOSER_MODEL:-}"; effort="${COMPOSER_EFFORT:-}" ;; + executor) model="${EXECUTOR_MODEL:-}"; effort="${EXECUTOR_EFFORT:-}" ;; + verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;; + *) : ;; # bounce counterparty: globals only + esac + if [[ "$agent" == "codex" ]]; then + export CODEX_MODEL="${model:-$CODEX_MODEL_BASE}" + export CODEX_REASONING_EFFORT="${effort:-$CODEX_EFFORT_BASE}" + else + export CLAUDE_MODEL="$(resolve_claude_model_alias "${model:-$CLAUDE_MODEL_BASE}")" + export CLAUDE_EFFORT="${effort:-$CLAUDE_EFFORT_BASE}" + fi +} + # Phase 8.1 WR-04: honor --run-dir when set; otherwise preserve v1.2 default (byte-parity). if [[ -n "$RUN_DIR_OVERRIDE" ]]; then RUN_DIR="$RUN_DIR_OVERRIDE" diff --git a/lib/co-evolution.sh b/lib/co-evolution.sh index 65aa7e0..ef0aeed 100644 --- a/lib/co-evolution.sh +++ b/lib/co-evolution.sh @@ -58,6 +58,15 @@ LIVE_MODE_WARNING_LOGGED=false # appends -c model= when it is set.) : "${CLAUDE_MODEL:=claude-opus-4-6}" +# v1.5: Per-seat reasoning-effort knobs. Both default empty = OFF, so invoke_* +# omit the effort flag entirely and argv stays byte-identical to today (parity +# invariant). When non-empty, invoke_claude appends `--effort "$CLAUDE_EFFORT"` +# and invoke_codex appends `-c model_reasoning_effort=...`. `--effort high` is +# proven headless in -p mode at evals/judge-bounce.sh:58 (built into JUDGE_ARGS, +# consumed by the `"$JUDGE_CMD" -p ...` call at evals/judge-bounce.sh:139). +: "${CLAUDE_EFFORT:=}" +: "${CODEX_REASONING_EFFORT:=}" + # RNPT-02: Authoritative list of phases that require write access to the workdir. # Phase code MUST NOT pass a hard-coded "true"/"false" to invoke_agent; it must # call `phase_is_writable ""` instead. To add a new writable phase @@ -408,11 +417,19 @@ invoke_claude() { tool_flags=(--disallowedTools "Edit,Write,Bash,Glob,Grep,WebSearch,WebFetch") fi + # v1.5: model + optional reasoning effort. CLAUDE_EFFORT defaults empty (OFF) + # so model_flags is exactly `--model "$CLAUDE_MODEL"` and argv is byte-identical + # to the prior single-flag form. The --effort element is appended only when set. + local -a model_flags=(--model "${CLAUDE_MODEL}") + if [[ -n "${CLAUDE_EFFORT:-}" ]]; then + model_flags+=(--effort "${CLAUDE_EFFORT}") + fi + if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1; then # Under WSL, reuse the Windows Claude session because WSL and Windows keep separate auth state. - cmd=(cmd.exe /c claude -p --output-format text --model "${CLAUDE_MODEL}" "${tool_flags[@]}") + cmd=(cmd.exe /c claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}") else - cmd=(claude -p --output-format text --model "${CLAUDE_MODEL}" "${tool_flags[@]}") + cmd=(claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}") fi "${cmd[@]}" < "$prompt_file" > "$output_file" 2>"$stderr_file" || true @@ -439,6 +456,12 @@ invoke_codex() { cmd+=(-c "model=${CODEX_MODEL}") fi + # v1.5: optional reasoning effort. Defaults empty (OFF) → no -c append, so argv + # is byte-identical to today. Appended only when CODEX_REASONING_EFFORT is set. + if [[ -n "${CODEX_REASONING_EFFORT:-}" ]]; then + cmd+=(-c "model_reasoning_effort=${CODEX_REASONING_EFFORT}") + fi + if [[ -n "${windows_output:-}" ]]; then cmd+=(-o "$windows_output") else @@ -448,6 +471,51 @@ invoke_codex() { "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true } +# v1.5 (fixes B2): relocated verbatim from dev-review/codex/dev-review.sh so the +# verify-phase `bash -c 'source lib/co-evolution.sh; invoke_codex_schema ...'` +# child can actually find it (it was defined only in dev-review.sh → exit 127 +# whenever a timeout binary exists and verifier=codex). Behavior is identical to +# the prior local copy (--output-schema / -o / --skip-git-repo-check / WSL path +# handling), plus the same conditional model + reasoning-effort appends that +# invoke_codex grew above — both default-off so argv is unchanged when unset. +invoke_codex_schema() { + local prompt_file="$1" + local output_file="$2" + local stderr_file="$3" + local schema_file="$4" + local workdir="${WORKDIR:-$PWD}" + local -a cmd + local windows_workdir="" + local windows_output="" + local windows_schema="" + + if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1 && command -v wslpath >/dev/null 2>&1; then + windows_workdir=$(wslpath -w "$workdir") + windows_output=$(wslpath -w "$output_file") + windows_schema=$(wslpath -w "$schema_file") + cmd=(cmd.exe /c codex exec --full-auto --skip-git-repo-check -C "$windows_workdir") + else + cmd=(codex exec --full-auto --skip-git-repo-check -C "$workdir") + fi + + if [[ -n "${CODEX_MODEL:-}" ]]; then + cmd+=(-c "model=${CODEX_MODEL}") + fi + + # v1.5: optional reasoning effort. Defaults empty (OFF) → no -c append. + if [[ -n "${CODEX_REASONING_EFFORT:-}" ]]; then + cmd+=(-c "model_reasoning_effort=${CODEX_REASONING_EFFORT}") + fi + + if [[ -n "$windows_schema" ]]; then + cmd+=(--output-schema "$windows_schema" -o "$windows_output") + else + cmd+=(--output-schema "$schema_file" -o "$output_file") + fi + + "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true +} + file_contains_auth_failure() { local file_path="$1" From ffe765f23e7a684b1176a8dff279946975f6f5e1 Mon Sep 17 00:00:00 2001 From: Alan Shurafa Date: Fri, 12 Jun 2026 12:53:02 -0400 Subject: [PATCH 03/12] feat: claude-verifier verdict hardening + --preset codex-build (v1.5 Phase 2) Co-Authored-By: Claude Fable 5 --- dev-review/codex/dev-review.sh | 104 ++++++- evals/RUNNER-CONTRACT.md | 3 +- tests/preset-expansion-simulation.sh | 402 +++++++++++++++++++++++++++ 3 files changed, 504 insertions(+), 5 deletions(-) create mode 100644 tests/preset-expansion-simulation.sh diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh index 7d7595f..0bf0ff7 100644 --- a/dev-review/codex/dev-review.sh +++ b/dev-review/codex/dev-review.sh @@ -30,6 +30,10 @@ WORKTREE_SPEC="${DEV_REVIEW_WORKTREE:-}" # v1.5: explicit verifier seat override. Empty = derive from executor via the # existing select_verifier logic (byte-parity default). Env default + --verifier flag. VERIFIER_OVERRIDE="${DEV_REVIEW_VERIFIER:-}" +# v1.5: named seat preset (e.g. codex-build). Empty = no preset (byte-parity +# default). Applied in-parser so AFTER-preset flags win (last-wins) and model/ +# effort values fill-if-empty so pre-set env vars win. See apply_preset(). +PRESET="" BRANCH_CREATED="" WORKTREE_PATH="" WORKDIR="$(pwd)" @@ -80,6 +84,10 @@ Options: --model MODEL Override Codex model --verifier AGENT Force the verify-phase agent (codex|opus|claude); default derives from --executor --claude-model MODEL Override Claude model (alias: fable -> claude-fable-5; else passthrough) + --preset NAME Expand a named seat preset (available: codex-build). + codex-build = Fable plans (high) + Codex executes (xhigh) + Fable reviews (max), + --verify on, bounces 2, revise-loop 1. Precedence: flags placed AFTER --preset + override it (last-wins); pre-set model/effort env vars win over the preset. --workdir DIR Working directory (default: current directory) --timeout SECONDS Per-phase timeout in seconds (default: 1800) --revise-loop N Auto-retry on REVISE verdict up to N extra passes (default: 0 = disabled) @@ -126,6 +134,26 @@ resolve_claude_model_alias() { esac } +# v1.5: expand a named seat preset into the underlying seat/model/effort knobs. +# Called from the --preset parser arm. Two precedence rules, both deliberate: +# - seat/structural knobs (COMPOSER, EXECUTOR, VERIFIER_OVERRIDE, VERIFY, +# BOUNCES, REVISE_LOOP_MAX) are set hard, so flags placed AFTER --preset +# override them via last-wins parsing. +# - model/effort knobs use fill-if-empty (`: "${VAR:=…}"`) so a value already +# present in the environment (e.g. COMPOSER_EFFORT=low) wins over the preset. +apply_preset() { + case "$1" in + codex-build) # "build with codex": Fable plans/reviews, Codex executes. + COMPOSER="opus"; EXECUTOR="codex"; VERIFIER_OVERRIDE="opus" + VERIFY=true; BOUNCES=2; REVISE_LOOP_MAX=1 + : "${COMPOSER_MODEL:=fable}"; : "${COMPOSER_EFFORT:=high}" + : "${VERIFIER_MODEL:=fable}"; : "${VERIFIER_EFFORT:=max}" + : "${EXECUTOR_EFFORT:=xhigh}" # codex model stays the CLI's configured default — deliberately unpinned + ;; + *) die "Unknown preset: $1 (available: codex-build)" ;; + esac +} + invoke_agent() { local agent="$1" shift @@ -927,16 +955,31 @@ run_verify_phase() { return 2 fi - verdict_data=$(normalize_json_artifact "$verdict_file" "$normalized_verdict_file") || { - log "WARNING: verifier output was unusable: ${verdict_data}. Review manually." - return 2 - } + if ! verdict_data=$(normalize_json_artifact "$verdict_file" "$normalized_verdict_file"); then + # v1.5: claude verifiers (no --output-schema) sometimes wrap the verdict in + # prose, which normalize_json_artifact rejects. Try a brace-block extraction + # (same idiom as evals/judge-bounce.sh:149) before giving up. codex verdicts + # come from --output-schema and never hit this branch (they normalize clean). + sed -n '/^{/,/^}/p' "$verdict_file" > "$normalized_verdict_file" + if [[ ! -s "$normalized_verdict_file" ]]; then + log "WARNING: verifier output was unusable: ${verdict_data}. Review manually." + return 2 + fi + fi verdict_data=$(validate_review_verdict "$normalized_verdict_file") || { log "WARNING: verifier output was unusable: ${verdict_data}. Review manually." return 2 } + # v1.5: persist the normalized verdict back to the contract path so downstream + # consumers (evals/score-run.sh jq-parses verdict.json raw; mapping unparseable + # -> FAIL) read clean JSON. Byte-no-op for codex --output-schema verdicts, which + # are already clean. The dot-prefixed .verdict-normalized.json copy stays for the + # revise-loop read; cleanup_runtime_artifacts sweeps it but verdict.json (plain + # path, no leading dot) survives as the contract artifact. + cp "$normalized_verdict_file" "$verdict_file" + eval "$verdict_data" review_status="$VERDICT" @@ -961,6 +1004,15 @@ cleanup_runtime_artifacts() { while [[ $# -gt 0 ]]; do case "$1" in + --preset) + # v1.5: expand a named seat preset. Applied in-parser so flags placed + # AFTER --preset override it (last-wins) and pre-set model/effort env vars + # win over the preset's fill-if-empty defaults. See apply_preset(). + [[ $# -gt 1 ]] || die "--preset requires a value" + PRESET="$2" + apply_preset "$2" + shift 2 + ;; --composer) [[ $# -gt 1 ]] || die "--composer requires a value" COMPOSER=$(normalize_agent "$2") || die "Unsupported composer: $2" @@ -1264,6 +1316,29 @@ apply_seat_env() { fi } +# v1.5: resolve a seat's "agent:model@effort" descriptor using the SAME model/ +# effort precedence apply_seat_env uses, but without mutating the exported env. +# Feeds the startup banner and the optional state.json seat_models fields. +# model defaults to "(default)" when unpinned (codex's executor seat under the +# codex-build preset); effort defaults to "(default)" when no effort is set. +resolve_seat_model_string() { + local seat="$1" agent="$2" model="" effort="" model_str="" effort_str="" + case "$seat" in + composer) model="${COMPOSER_MODEL:-}"; effort="${COMPOSER_EFFORT:-}" ;; + executor) model="${EXECUTOR_MODEL:-}"; effort="${EXECUTOR_EFFORT:-}" ;; + verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;; + *) : ;; + esac + if [[ "$agent" == "codex" ]]; then + model_str="${model:-$CODEX_MODEL_BASE}" + effort_str="${effort:-$CODEX_EFFORT_BASE}" + else + model_str="$(resolve_claude_model_alias "${model:-$CLAUDE_MODEL_BASE}")" + effort_str="${effort:-$CLAUDE_EFFORT_BASE}" + fi + printf '%s:%s@%s' "$agent" "${model_str:-(default)}" "${effort_str:-(default)}" +} + # Phase 8.1 WR-04: honor --run-dir when set; otherwise preserve v1.2 default (byte-parity). if [[ -n "$RUN_DIR_OVERRIDE" ]]; then RUN_DIR="$RUN_DIR_OVERRIDE" @@ -1286,12 +1361,30 @@ EXECUTE_DELTA_JSON="${RUN_DIR}/.execute-delta.json" RUN_ID="dev-review-${TIMESTAMP}" init_state_json "$STATE_JSON" "$RUN_ID" "$TASK" "$COMPOSER" "$EXECUTOR" "$REVIEWER" +# v1.5: resolve the verifier seat and per-seat model/effort descriptors for the +# banner + observability fields. _VERIFIER_SEAT reflects --verifier / executor +# derivation (select_verifier); the seat strings reflect whatever the seats +# resolve to under the per-seat env layer (apply_seat_env precedence). +_VERIFIER_SEAT=$(select_verifier) +_SEAT_COMPOSER=$(resolve_seat_model_string composer "$COMPOSER") +_SEAT_EXECUTOR=$(resolve_seat_model_string executor "$EXECUTOR") +_SEAT_VERIFIER=$(resolve_seat_model_string verifier "$_VERIFIER_SEAT") + +# v1.5: optional seat_models observability — records what each seat resolves to. +# Unconditional + additive (no existing sim asserts an exact state.json key set, +# so this does not break shape); cheap, and reflects the actual resolved seats. +write_state_field "$STATE_JSON" ".seat_models.composer" "string" "$_SEAT_COMPOSER" +write_state_field "$STATE_JSON" ".seat_models.executor" "string" "$_SEAT_EXECUTOR" +write_state_field "$STATE_JSON" ".seat_models.verifier" "string" "$_SEAT_VERIFIER" + log "============================================" log " DEV-REVIEW SESSION" log "============================================" log " Task: $TASK" +log " Preset: ${PRESET:-}" log " Composer: $COMPOSER" log " Executor: $EXECUTOR" +log " Verifier: $_VERIFIER_SEAT" log " Bounces: $BOUNCES" log " Verify: $VERIFY" log " Workdir: $WORKDIR" @@ -1300,6 +1393,9 @@ log " Timeout: ${PHASE_TIMEOUT}s per phase" log " Live mode: $LIVE_MODE" log " Branch: ${BRANCH_SPEC:-}" log " Worktree: ${WORKTREE_SPEC:-}" +log " Composer model: $_SEAT_COMPOSER" +log " Executor model: $_SEAT_EXECUTOR" +log " Verifier model: $_SEAT_VERIFIER" log "============================================" log "" diff --git a/evals/RUNNER-CONTRACT.md b/evals/RUNNER-CONTRACT.md index f8fe194..650f476 100644 --- a/evals/RUNNER-CONTRACT.md +++ b/evals/RUNNER-CONTRACT.md @@ -1,6 +1,6 @@ --- title: Runner ↔ scorer contract -version: 1.0 +version: 1.1 status: locked owners: - dev-review/codex/dev-review.sh (runner) @@ -43,6 +43,7 @@ historical fixture corpus under `runners/codex-ps/evals/tests/fixtures/**`. | `verify_verdict` | string \| null | nullable | runner (post-verify) | `score-run.sh:498` | Verify-phase verdict — `APPROVED` / `REVISE` / null | | `history` | array\<`{phase,status,detail,timestamp}`\> | yes | runner (per phase) | `score-run.sh:338` | Canonical bounce-trace array. Convergence needs ≥1 entry with `phase` matching `^bounce-[0-9]+$` when bounces were expected. | | `mode` | string | no | runner (optional tag) | — | Fake-runner sets `"fake-runner"`; real runner may set `"real"` or omit. Runner-owned metadata — NOT written by the shared `init_state_json` library. | +| `seat_models` | object `{composer, executor, verifier}` (each string `agent:model@effort`) | no | `dev-review.sh` (post-init) | — | v1.5 observability: what each seat resolved to under the per-seat env layer (e.g. `"claude:claude-fable-5@high"`, `"codex:(default)@xhigh"`). Runner-owned metadata — NOT written by the shared `init_state_json` library; scorer ignores it. | ### Legacy alias: `phases` → `history` diff --git a/tests/preset-expansion-simulation.sh b/tests/preset-expansion-simulation.sh new file mode 100644 index 0000000..6cf10bc --- /dev/null +++ b/tests/preset-expansion-simulation.sh @@ -0,0 +1,402 @@ +#!/usr/bin/env bash +# tests/preset-expansion-simulation.sh +# Hermetic gate for v1.5 Phase 2: --preset codex-build + claude-verifier +# verdict hardening (dev-review/codex/dev-review.sh). +# +# The codex-build preset expands to the post's ladder — Fable plans (high), +# Codex executes (xhigh), Fable reviews (max) — behind one flag. This gate +# pins: +# a. seat argv under the preset (composer claude --model claude-fable-5 +# --effort high; executor codex -c model_reasoning_effort=xhigh and NO +# -c model=; verifier claude --effort max). +# b. claude-verifier verdict hardening: a verdict wrapped in markdown +# fences/prose lands in verdict.json jq-parseable (brace-block fallback). +# c. last-wins parsing: --verifier codex after --preset wins. +# d. fill-if-empty: a pre-set COMPOSER_EFFORT env var beats the preset. +# e. unknown preset dies with the right message. +# f. --help mentions --preset. +# g. parity guard: a default run emits no --effort and no +# model_reasoning_effort anywhere. +# +# Pattern: PATH-injected claude + codex stubs append their full argv (one line +# per invocation) to phase-agnostic logs; assertions grep the logs. Both stubs +# drain stdin via a read-loop (NOT `cat`) to avoid SIGPIPE against the runner's +# stdin producer under strict-mode bash (same discipline as the scorer Tier-4 +# stubs in evals/tests/scorer-verification.sh). + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" +RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh" + +command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; } + +TEST_DIR="$(mktemp -d -t preset-expand-XXXXXX)" +trap 'rm -rf "$TEST_DIR"' EXIT + +TOTAL=0 +FAILURES=0 +pass() { printf "PASS: %s\n" "$1"; } +fail() { printf "FAIL: %s\n" "$1" >&2; FAILURES=$((FAILURES + 1)); } + +# --- hermetic git env (mirrors tests/run-all.sh so the scratch repo is +# independent of the host's identity / autocrlf / hooks) --------------------- +export GIT_CONFIG_GLOBAL="$TEST_DIR/gitconfig" +export GIT_CONFIG_SYSTEM=/dev/null +cat > "$TEST_DIR/gitconfig" <<'GITCFG' +[user] + name = preset-sim + email = preset-sim@invalid.local +[init] + defaultBranch = master +[core] + autocrlf = false +[commit] + gpgsign = false +GITCFG + +# --- stub CLIs -------------------------------------------------------------- +mkdir -p "$TEST_DIR/bin" + +# claude stub: append full argv to $CLAUDE_ARGV_LOG (one line per invocation), +# drain stdin via read-loop, then emit a phase-appropriate body. The verify +# phase is detected by the review-prompt heading on stdin; for it the stub +# emits a verdict whose JSON object lives at column 0 wrapped in prose + fences +# (exercises the brace-block hardening fallback). All other phases emit a plan. +cat > "$TEST_DIR/bin/claude" <<'STUB' +#!/usr/bin/env bash +for arg in "$@"; do + case "$arg" in + --version|-v|version) echo "claude 1.0.0 (preset-stub)"; exit 0 ;; + esac +done +[[ -n "${CLAUDE_ARGV_LOG:-}" ]] && printf '%s\n' "$*" >> "$CLAUDE_ARGV_LOG" +# Drain stdin via read-loop (no SIGPIPE); capture it so we can detect the phase. +stdin_capture="" +while IFS= read -r _line; do stdin_capture+="$_line"$'\n'; done +if [[ "$stdin_capture" == *"Verification Instructions"* ]]; then + # Verify phase: prose-wrapped, fenced verdict. The JSON object's braces sit at + # column 0 so the runner's `sed -n '/^{/,/^}/p'` brace-block fallback recovers + # clean JSON after normalize_json_artifact rejects the mixed fence+prose. + cat <<'VERDICT' +Here is my assessment of the diff against the plan. + +```json +{ + "verdict": "APPROVED", + "confidence": 90, + "summary": "Stub verifier approves the hermetic change.", + "issues": [] +} +``` + +That concludes the review. +VERDICT + exit 0 +fi +# Compose / bounce phase: emit a minimal valid plan that clears the plan_quality +# gate (validate_plan_artifact wants >= 60 words, >= 5 non-empty lines, >= 2 +# structural lines) so PLAN_EXIT stays 0 and the execute + verify phases run. +cat <<'PLAN' +# Preset Expansion Stub Plan + +## Approach + +This is a hermetic stub plan emitted by the preset-expansion simulation claude +stub. It exists solely to satisfy the plan heading regex, the minimum word +count, and the structural-line check so the compose phase validates and the run +proceeds into execute and verify. It is not a real plan and the executor stub +performs only a single harmless append to a tracked file so the run produces a +diff for the verify phase to score. No external commands run and no network is +touched anywhere in this hermetic path. + +## Files to Change + +- `touched.txt` — the executor stub appends a line so the run produces a diff. + +## Risks + +- None identified. Stubbed CLI produces a fixed plan with no side effects. +PLAN +STUB +chmod +x "$TEST_DIR/bin/claude" + +# codex stub: append full argv to $CODEX_ARGV_LOG, parse -o FILE and -C DIR, +# drain stdin, then (a) emit the same stub plan to -o FILE (compose/bounce) and +# (b) make a real file change inside -C DIR so the execute phase yields a +# non-empty diff (so verify actually runs). The execute phase is the one that +# carries -C to the workdir; for it we append a line to a tracked file. +cat > "$TEST_DIR/bin/codex" <<'STUB' +#!/usr/bin/env bash +for arg in "$@"; do + case "$arg" in + --version|-v|version) echo "codex 0.117.0 (preset-stub)"; exit 0 ;; + esac +done +[[ -n "${CODEX_ARGV_LOG:-}" ]] && printf '%s\n' "$*" >> "$CODEX_ARGV_LOG" +output_file=""; workdir=""; prev="" +for arg in "$@"; do + case "$prev" in + -o) output_file="$arg" ;; + -C) workdir="$arg" ;; + esac + prev="$arg" +done +# Capture stdin (no SIGPIPE) so we can detect the execute phase by its prompt. +stdin_capture="" +while IFS= read -r _line; do stdin_capture+="$_line"$'\n'; done +# Only the EXECUTE phase may mutate the workdir — bounce passes (also codex +# under the preset, since composer=opus → reviewer=codex) run BEFORE the runner +# snapshots INITIAL_GIT_STATUS, so a bounce-time edit would make execute look +# like a no-op ("no changes detected"). Gate on the execute-prompt heading. +if [[ "$stdin_capture" == *"Execution Instructions (Codex)"* \ + && -n "$workdir" && -f "$workdir/touched.txt" ]]; then + printf 'codex-stub-change\n' >> "$workdir/touched.txt" +fi +plan_content=$(cat <<'PLAN' +# Preset Expansion Stub Plan + +## Approach + +This is a hermetic stub plan emitted by the preset-expansion simulation codex +stub. It satisfies the plan heading regex, the minimum word count, and the +structural-line check so the compose phase validates and the run proceeds into +execute and verify. It is not a real plan and the only side effect is a single +harmless append to a tracked file so the run produces a diff for the verify +phase to score. No external commands run and no network is touched anywhere. + +## Files to Change + +- `touched.txt` — appended to by the executor stub. + +## Risks + +- None identified. Stubbed codex produces a fixed plan with no side effects. +PLAN +) +if [[ -n "$output_file" ]]; then + printf '%s\n' "$plan_content" > "$output_file" +else + printf '%s\n' "$plan_content" +fi +exit 0 +STUB +chmod +x "$TEST_DIR/bin/codex" + +# --- helpers ---------------------------------------------------------------- + +# Make a fresh scratch git repo with one tracked file. Echoes the repo path. +make_scratch_repo() { + local repo + repo="$TEST_DIR/repo-$1" + rm -rf "$repo" + mkdir -p "$repo" + git -C "$repo" init -q + printf 'baseline\n' > "$repo/touched.txt" + git -C "$repo" add -A + git -C "$repo" commit -qm "baseline" + printf '%s' "$repo" +} + +# A pre-existing plan fixture so scenarios that don't need to test compose can +# use --skip-plan and run faster. Scenario (a) deliberately does NOT use it (it +# must exercise compose to capture the composer argv). +PLAN_FIXTURE="$TEST_DIR/plan-fixture.md" +cat > "$PLAN_FIXTURE" <<'PLAN' +# Skip-Plan Fixture + +## Approach + +A pre-baked plan so the --skip-plan scenarios bypass the compose phase entirely +and drive only the execute and verify phases. It satisfies the plan heading +regex, the minimum word count, and the structural-line check. The executor stub +appends one harmless line to a tracked file so the run produces a diff for the +verify phase to score, and nothing else happens in this hermetic path. + +## Files to Change + +- `touched.txt` — the executor stub appends a line. + +## Risks + +- None identified. +PLAN + +# =========================================================================== +# Scenario (a): --preset codex-build → seat argv (exercises compose). +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo a) +claude_log="$TEST_DIR/a_claude.log"; codex_log="$TEST_DIR/a_codex.log" +: > "$claude_log"; : > "$codex_log" +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT + unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + export PATH="$TEST_DIR/bin:$PATH" + export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log" + bash "$RUNNER" --preset codex-build --workdir "$repo" --timeout 60 -- "preset seat argv probe" +) >"$TEST_DIR/a.out" 2>&1 || true + +a_ok=true +# composer = claude with the fable model alias resolved + high effort. +if ! grep -Eq -- '--model claude-fable-5' "$claude_log"; then a_ok=false; fi +if ! grep -Eq -- '--effort high' "$claude_log"; then a_ok=false; fi +# executor = codex with xhigh effort and NO pinned model. +if ! grep -Eq -- '-c model_reasoning_effort=xhigh' "$codex_log"; then a_ok=false; fi +if grep -Eq -- '-c model=' "$codex_log"; then a_ok=false; fi +# verifier = claude with max effort. +if ! grep -Eq -- '--effort max' "$claude_log"; then a_ok=false; fi +if [[ "$a_ok" == true ]]; then + pass "preset codex-build: composer fable/high, executor codex/xhigh (no model=), verifier claude/max" +else + fail "preset codex-build seat argv mismatch (claude_log + codex_log below)" + { echo "--- claude argv ---"; cat "$claude_log"; echo "--- codex argv ---"; cat "$codex_log"; } >&2 +fi + +# =========================================================================== +# Scenario (b): claude-verifier verdict hardening — fenced/prose verdict ends +# up jq-parseable at verdict.json. Reuses the run dir from scenario (a) (same +# preset run already produced a claude verdict through the hardening path). +# =========================================================================== +TOTAL=$((TOTAL + 1)) +# Find the run dir the scenario-(a) run created (newest under runs/). +a_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1) +b_ok=false +if [[ -n "$a_run_dir" && -f "$a_run_dir/verdict.json" ]]; then + if jq -e '.verdict == "APPROVED" and (.confidence | type) == "number" and has("summary") and has("issues")' \ + "$a_run_dir/verdict.json" >/dev/null 2>&1; then + b_ok=true + fi +fi +if [[ "$b_ok" == true ]]; then + pass "verdict hardening: prose/fence-wrapped verdict.json is jq-parseable with expected fields" +else + fail "verdict.json not jq-parseable after hardening (run dir: ${a_run_dir:-})" + [[ -n "${a_run_dir:-}" && -f "$a_run_dir/verdict.json" ]] && { echo "--- verdict.json ---"; cat "$a_run_dir/verdict.json"; } >&2 +fi +# Clean up the runs/ artifacts this gate created so it leaves no side effects. +[[ -n "${a_run_dir:-}" ]] && rm -rf "$a_run_dir" + +# =========================================================================== +# Scenario (c): --preset codex-build --verifier codex → verifier is codex +# (last-wins parsing; flags after --preset override it). +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo c) +codex_log="$TEST_DIR/c_codex.log"; claude_log="$TEST_DIR/c_claude.log" +: > "$codex_log"; : > "$claude_log" +run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true) +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT + unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + export PATH="$TEST_DIR/bin:$PATH" + export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log" + bash "$RUNNER" --preset codex-build --verifier codex --skip-plan --plan "$PLAN_FIXTURE" \ + --workdir "$repo" --timeout 60 -- "preset verifier override probe" +) >"$TEST_DIR/c.out" 2>&1 || true +c_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1) +c_ok=false +if [[ -n "$c_run_dir" && "$c_run_dir" != "$run_dir_before" ]]; then + # state.json seat_models.verifier must report codex; banner Verifier line too. + if jq -e '.seat_models.verifier | startswith("codex:")' "$c_run_dir/state.json" >/dev/null 2>&1 \ + && grep -Eq '^ Verifier: codex' "$c_run_dir/run.log"; then + c_ok=true + fi +fi +if [[ "$c_ok" == true ]]; then + pass "preset codex-build --verifier codex: verifier seat is codex (last-wins)" +else + fail "verifier override did not win over preset (run dir: ${c_run_dir:-})" + [[ -n "${c_run_dir:-}" ]] && grep -E '^ Verifier:' "$c_run_dir/run.log" >&2 || true +fi +[[ -n "${c_run_dir:-}" && "$c_run_dir" != "$run_dir_before" ]] && rm -rf "$c_run_dir" + +# =========================================================================== +# Scenario (d): COMPOSER_EFFORT=low pre-set + preset → composer gets +# --effort low (fill-if-empty: env beats the preset's high default). +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo d) +claude_log="$TEST_DIR/d_claude.log"; codex_log="$TEST_DIR/d_codex.log" +: > "$claude_log"; : > "$codex_log" +run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true) +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT + unset EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + export COMPOSER_EFFORT="low" + export PATH="$TEST_DIR/bin:$PATH" + export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log" + bash "$RUNNER" --preset codex-build --workdir "$repo" --timeout 60 -- "preset composer effort env probe" +) >"$TEST_DIR/d.out" 2>&1 || true +d_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1) +if grep -Eq -- '--effort low' "$claude_log" && ! grep -Eq -- '--effort high' "$claude_log"; then + pass "preset + COMPOSER_EFFORT=low: composer gets --effort low (fill-if-empty env wins)" +else + fail "COMPOSER_EFFORT env did not beat preset high (claude argv below)" + cat "$claude_log" >&2 +fi +[[ -n "${d_run_dir:-}" && "$d_run_dir" != "$run_dir_before" ]] && rm -rf "$d_run_dir" + +# =========================================================================== +# Scenario (e): --preset bogus dies with the unknown-preset message. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +e_out=$(bash "$RUNNER" --preset bogus -- "x" 2>&1 || true) +if printf '%s' "$e_out" | grep -Eq 'Unknown preset: bogus'; then + pass "unknown preset dies with the unknown-preset message" +else + fail "unknown preset did not produce the expected die message (got: $e_out)" +fi + +# =========================================================================== +# Scenario (f): --help mentions --preset. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +help_out=$(bash "$RUNNER" --help 2>/dev/null || true) +if printf '%s' "$help_out" | grep -Eq -- '--preset'; then + pass "--help mentions --preset" +else + fail "--help does not mention --preset" +fi + +# =========================================================================== +# Scenario (g): parity guard — a default run (no preset, no new envs) emits no +# --effort (claude) and no model_reasoning_effort (codex) anywhere. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo g) +claude_log="$TEST_DIR/g_claude.log"; codex_log="$TEST_DIR/g_codex.log" +: > "$claude_log"; : > "$codex_log" +run_dir_before=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1 || true) +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT + unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + export PATH="$TEST_DIR/bin:$PATH" + export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log" + # Default flavor: codex composes + executes, opus verifies. --verify so a + # claude verdict path runs too. No preset, no effort knobs. + bash "$RUNNER" --verify --bounces 0 --skip-plan --plan "$PLAN_FIXTURE" \ + --workdir "$repo" --timeout 60 -- "parity guard probe" +) >"$TEST_DIR/g.out" 2>&1 || true +g_run_dir=$(ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1) +g_ok=true +if grep -Eq -- '--effort ' "$claude_log"; then g_ok=false; fi +if grep -Eq -- 'model_reasoning_effort=' "$codex_log"; then g_ok=false; fi +if [[ "$g_ok" == true ]]; then + pass "parity guard: default run emits no --effort and no model_reasoning_effort" +else + fail "parity guard: an effort flag leaked into a default run (logs below)" + { echo "--- claude argv ---"; cat "$claude_log"; echo "--- codex argv ---"; cat "$codex_log"; } >&2 +fi +[[ -n "${g_run_dir:-}" && "$g_run_dir" != "$run_dir_before" ]] && rm -rf "$g_run_dir" + +# --- summary ---------------------------------------------------------------- +passed=$((TOTAL - FAILURES)) +if (( FAILURES == 0 )); then + echo "$passed/$TOTAL scenarios passed" + exit 0 +else + echo "$passed/$TOTAL scenarios passed ($FAILURES failed)" >&2 + exit 1 +fi From 13c2beea4f2f7cec5a6d6dbcf4d08b97b2752833 Mon Sep 17 00:00:00 2001 From: Alan Shurafa Date: Fri, 12 Jun 2026 13:07:02 -0400 Subject: [PATCH 04/12] =?UTF-8?q?feat:=20runner=20observability=20?= =?UTF-8?q?=E2=80=94=20current=5Fphase,=20runner=5Fpid,=20shas,=20lineage?= =?UTF-8?q?=20+=20status=20reader=20(v1.5=20Phase=203)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Fable 5 --- dev-review/codex/dev-review-status.sh | 362 ++++++++++++++++++++ dev-review/codex/dev-review.sh | 67 ++++ evals/RUNNER-CONTRACT.md | 7 +- lib/co-evolution.sh | 29 ++ tests/observability-lifecycle-simulation.sh | 292 ++++++++++++++++ tests/status-reader-simulation.sh | 203 +++++++++++ 6 files changed, 959 insertions(+), 1 deletion(-) create mode 100755 dev-review/codex/dev-review-status.sh create mode 100644 tests/observability-lifecycle-simulation.sh create mode 100644 tests/status-reader-simulation.sh diff --git a/dev-review/codex/dev-review-status.sh b/dev-review/codex/dev-review-status.sh new file mode 100755 index 0000000..540584b --- /dev/null +++ b/dev-review/codex/dev-review-status.sh @@ -0,0 +1,362 @@ +#!/usr/bin/env bash +# dev-review/codex/dev-review-status.sh +# v1.5 Phase 3 — read-only status reader for dev-review runner runs. +# +# A Claude Code session kicks the runner via a background Bash task and ENDS +# ITS TURN. On wake (or via a cheap watcher, or a human) it must reconstruct +# run state purely from disk. This script does exactly that — it writes NOTHING +# and never touches the runner. It reports run summary, phase progress, a +# heartbeat derived from the in-flight phase's stderr-log mtime/size, marker +# counts, verdict, diffstat, and a liveness assessment with an actionable line. +# +# Usage: +# dev-review-status.sh [--json] [--list] [RUN_ID|RUN_DIR] +# (no positional) → latest runs/dev-review-* by mtime +# --list → one summary line per non-terminal run + 3 newest terminal +# RUN_ID → resolved against /runs/RUN_ID +# RUN_DIR → an absolute/relative dir path, used as-is +# +# Exit codes (liveness assessment): +# 0 terminal-completed (status=completed) +# 2 terminal-partial (status=partial/failed) +# 5 still-running (non-terminal + pid alive OR heartbeat fresh <120s) +# 4 presumed-dead (non-terminal + pid gone + heartbeat stale) +# 3 run not found / no state.json / bad argument +# +# Reader, not runner: requires jq and dies (exit 3) without it. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# Repo root: dev-review/codex/ -> repo (two levels up). +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" +RUNS_DIR="${CO_EVOLVE_RUNS_DIR:-$REPO_ROOT/runs}" + +HEARTBEAT_FRESH_SECS=120 + +die() { printf 'dev-review-status: %s\n' "$1" >&2; exit 3; } + +command -v jq >/dev/null 2>&1 || die "jq is required (this is a reader, not the runner)" + +# --- portable file mtime (epoch) -------------------------------------------- +# Feature-detect GNU stat (supports --version) vs BSD stat, mirroring the +# GNU-find detection convention in lib/co-evolution.sh (list_available_lab_modes). +if stat --version >/dev/null 2>&1; then + _file_mtime() { stat -c %Y "$1" 2>/dev/null; } # GNU +else + _file_mtime() { stat -f %m "$1" 2>/dev/null; } # BSD/macOS +fi + +now_epoch() { date +%s; } + +# --- argument parsing -------------------------------------------------------- +JSON=false +LIST=false +ARG="" +while [[ $# -gt 0 ]]; do + case "$1" in + --json) JSON=true; shift ;; + --list) LIST=true; shift ;; + -h|--help) + sed -n '5,30p' "${BASH_SOURCE[0]}" | sed 's/^# \{0,1\}//' + exit 0 + ;; + --*) die "unknown flag: $1" ;; + *) ARG="$1"; shift ;; + esac +done + +# --- run-dir resolution ------------------------------------------------------ +# A value containing a slash (or that resolves to an existing dir) is a path; +# otherwise treat it as a RUN_ID under $RUNS_DIR. +resolve_run_dir() { + local arg="$1" + if [[ -z "$arg" ]]; then + ls -dt "$RUNS_DIR"/dev-review-* 2>/dev/null | head -1 + return 0 + fi + if [[ -d "$arg" ]]; then + printf '%s' "$arg" + return 0 + fi + if [[ "$arg" == */* ]]; then + # Looks like a path but does not exist. + printf '%s' "$arg" + return 0 + fi + printf '%s' "$RUNS_DIR/$arg" +} + +# --- map a phase name to its stderr-log path (heartbeat source) -------------- +# Mirrors the exact filenames the runner writes: +# compose -> compose-stderr.log +# bounce-NN -> pass-N-stderr.log (N un-padded; runner names the file with +# the raw pass int, the phase with NN) +# execute[-N] -> execute-stderr.log (reused across revise passes) +# verify[-N] -> review-stderr.log (reused across revise passes) +phase_stderr_file() { + local run_dir="$1" phase="$2" + case "$phase" in + compose) printf '%s' "$run_dir/compose-stderr.log" ;; + bounce-*) + local nn="${phase#bounce-}" + nn=$((10#$nn)) # strip leading zero: bounce-01 -> 1 + printf '%s' "$run_dir/pass-${nn}-stderr.log" + ;; + execute|execute-*) printf '%s' "$run_dir/execute-stderr.log" ;; + verify|verify-*) printf '%s' "$run_dir/review-stderr.log" ;; + *) printf '' ;; + esac +} + +# --- one-line summary for --list --------------------------------------------- +summarize_one() { + local run_dir="$1" + local state="$run_dir/state.json" + local id status phase + id=$(jq -r '.run_id // "?"' "$state" 2>/dev/null) + status=$(jq -r '.status // "null"' "$state" 2>/dev/null) + phase=$(jq -r '.current_phase.name // "-"' "$state" 2>/dev/null) + printf '%-32s status=%-9s phase=%s\n' "$id" "$status" "$phase" +} + +if [[ "$LIST" == true ]]; then + shopt -s nullglob + dirs=("$RUNS_DIR"/dev-review-*) + shopt -u nullglob + if [[ ${#dirs[@]} -eq 0 ]]; then + echo "(no dev-review runs under $RUNS_DIR)" + exit 3 + fi + # newest first. bash-3.2 portable (macOS /bin/bash has no mapfile); read the + # ls -t output line by line into the array. + sorted=() + while IFS= read -r _d; do + [[ -n "$_d" ]] && sorted+=("$_d") + done < <(ls -dt "$RUNS_DIR"/dev-review-* 2>/dev/null) + echo "ACTIVE (non-terminal: pending/null):" + active_found=false + for d in "${sorted[@]}"; do + [[ -f "$d/state.json" ]] || continue + st=$(jq -r '.status // "null"' "$d/state.json" 2>/dev/null) + if [[ "$st" == "pending" || "$st" == "null" ]]; then + summarize_one "$d"; active_found=true + fi + done + [[ "$active_found" == false ]] && echo " (none)" + echo + echo "RECENT (3 newest terminal):" + shown=0 + for d in "${sorted[@]}"; do + [[ -f "$d/state.json" ]] || continue + st=$(jq -r '.status // "null"' "$d/state.json" 2>/dev/null) + if [[ "$st" != "pending" && "$st" != "null" ]]; then + summarize_one "$d"; shown=$((shown + 1)) + [[ "$shown" -ge 3 ]] && break + fi + done + [[ "$shown" -eq 0 ]] && echo " (none)" + exit 0 +fi + +RUN_DIR=$(resolve_run_dir "$ARG") +[[ -n "$RUN_DIR" && -d "$RUN_DIR" ]] || die "run not found: ${ARG:-} (looked under $RUNS_DIR)" +STATE="$RUN_DIR/state.json" +[[ -f "$STATE" ]] || die "no state.json in $RUN_DIR" +jq -e . "$STATE" >/dev/null 2>&1 || die "state.json is not valid JSON: $STATE" + +# --- read fields ------------------------------------------------------------- +RUN_ID=$(jq -r '.run_id // "?"' "$STATE") +STATUS=$(jq -r '.status // "null"' "$STATE") +TASK=$(jq -r '.task // ""' "$STATE" | head -c 100) +RUNNER_PID=$(jq -r '.runner_pid // empty' "$STATE") +CUR_PHASE=$(jq -r '.current_phase.name // empty' "$STATE") +CUR_STARTED=$(jq -r '.current_phase.started_at // empty' "$STATE") +PARENT_RUN=$(jq -r '.orchestration.parent_run_id // empty' "$STATE") +VERDICT=$(jq -r '.verify_verdict // empty' "$STATE") +PRE_SHA=$(jq -r '.pre_execute_sha // empty' "$STATE") +POST_SHA=$(jq -r '.post_execute_sha // empty' "$STATE") +M_CONTESTED=$(jq -r '.marker_counts.contested // 0' "$STATE") +M_CLARIFY=$(jq -r '.marker_counts.clarify // 0' "$STATE") + +# --- runner liveness via kill -0 --------------------------------------------- +# yes = pid present and alive; no = pid present and gone; unknown = no pid. +RUNNER_ALIVE="unknown" +if [[ -n "$RUNNER_PID" ]]; then + if kill -0 "$RUNNER_PID" 2>/dev/null; then + RUNNER_ALIVE="yes" + else + RUNNER_ALIVE="no" + fi +fi + +# --- heartbeat from the in-flight phase's stderr log ------------------------- +HB_AGE="" # seconds since last write (empty if no current phase / no file) +HB_LINES="" # line count of the stderr file +HB_FILE="" +if [[ -n "$CUR_PHASE" ]]; then + HB_FILE=$(phase_stderr_file "$RUN_DIR" "$CUR_PHASE") + if [[ -n "$HB_FILE" && -f "$HB_FILE" ]]; then + mt=$(_file_mtime "$HB_FILE") + if [[ -n "$mt" ]]; then + HB_AGE=$(( $(now_epoch) - mt )) + fi + HB_LINES=$(wc -l < "$HB_FILE" | tr -d ' ') + fi +fi + +# --- liveness assessment + exit code ----------------------------------------- +# Terminal states short-circuit. Non-terminal: alive pid OR fresh heartbeat +# => still-running; pid gone + stale/absent heartbeat => presumed-dead. +ASSESS="" +EXIT_CODE=0 +case "$STATUS" in + completed) + ASSESS="DONE: run completed cleanly." + EXIT_CODE=0 + ;; + partial|failed) + ASSESS="PARTIAL: run reached a terminal non-completed status ($STATUS) — review verdict/diffstat before acting." + EXIT_CODE=2 + ;; + *) + # Non-terminal (pending / null / unexpected). Decide running vs dead. + hb_fresh=false + if [[ -n "$HB_AGE" && "$HB_AGE" -lt "$HEARTBEAT_FRESH_SECS" ]]; then + hb_fresh=true + fi + if [[ "$RUNNER_ALIVE" == "yes" || "$hb_fresh" == true ]]; then + ASSESS="RUNNING: runner alive ($RUNNER_ALIVE) / heartbeat ${HB_AGE:-?}s — in phase ${CUR_PHASE:-?}. Leave it; re-check on wake." + EXIT_CODE=5 + else + ASSESS="DEAD: runner gone mid-phase (${CUR_PHASE:-?}) — escalate or re-kick with --skip-plan --plan ${RUN_DIR}/plan.md" + EXIT_CODE=4 + fi + ;; +esac + +# --- diffstat tail + execute stderr tail (paths) ----------------------------- +DIFFSTAT_FILE="$RUN_DIR/execute-diffstat.txt" +EXEC_STDERR="$RUN_DIR/execute-stderr.log" +VERDICT_JSON="$RUN_DIR/verdict.json" + +# ============================================================================ +# JSON mode — one machine-readable object for the orchestrating skill. +# ============================================================================ +if [[ "$JSON" == true ]]; then + diffstat_tail="" + [[ -f "$DIFFSTAT_FILE" ]] && diffstat_tail=$(tail -5 "$DIFFSTAT_FILE") + exec_tail="" + [[ -f "$EXEC_STDERR" ]] && exec_tail=$(tail -5 "$EXEC_STDERR") + verdict_present=false + [[ -f "$VERDICT_JSON" ]] && verdict_present=true + + jq -n \ + --arg run_id "$RUN_ID" \ + --arg run_dir "$RUN_DIR" \ + --arg status "$STATUS" \ + --arg task "$TASK" \ + --arg runner_pid "${RUNNER_PID:-}" \ + --arg runner_alive "$RUNNER_ALIVE" \ + --arg current_phase "${CUR_PHASE:-}" \ + --arg current_started "${CUR_STARTED:-}" \ + --arg heartbeat_age "${HB_AGE:-}" \ + --arg heartbeat_lines "${HB_LINES:-}" \ + --arg heartbeat_file "${HB_FILE:-}" \ + --arg parent_run "${PARENT_RUN:-}" \ + --arg verdict "${VERDICT:-}" \ + --arg verdict_json "$([[ -f "$VERDICT_JSON" ]] && printf '%s' "$VERDICT_JSON")" \ + --argjson verdict_present "$verdict_present" \ + --arg pre_sha "${PRE_SHA:-}" \ + --arg post_sha "${POST_SHA:-}" \ + --argjson contested "$M_CONTESTED" \ + --argjson clarify "$M_CLARIFY" \ + --arg diffstat_tail "$diffstat_tail" \ + --arg execute_stderr_tail "$exec_tail" \ + --arg assess "$ASSESS" \ + --argjson exit_code "$EXIT_CODE" \ + --slurpfile phases <(jq '.phases // []' "$STATE") \ + '{ + run_id: $run_id, + run_dir: $run_dir, + status: $status, + task: $task, + runner_pid: (if $runner_pid == "" then null else ($runner_pid|tonumber) end), + runner_alive: $runner_alive, + current_phase: (if $current_phase == "" then null else {name: $current_phase, started_at: $current_started} end), + heartbeat: { + age_secs: (if $heartbeat_age == "" then null else ($heartbeat_age|tonumber) end), + lines: (if $heartbeat_lines == "" then null else ($heartbeat_lines|tonumber) end), + file: (if $heartbeat_file == "" then null else $heartbeat_file end) + }, + orchestration: {parent_run_id: (if $parent_run == "" then null else $parent_run end)}, + verdict: (if $verdict == "" then null else $verdict end), + verdict_json: (if $verdict_json == "" then null else $verdict_json end), + verdict_present: $verdict_present, + pre_execute_sha: (if $pre_sha == "" then null else $pre_sha end), + post_execute_sha: (if $post_sha == "" then null else $post_sha end), + marker_counts: {contested: $contested, clarify: $clarify}, + phases: $phases[0], + diffstat_tail: $diffstat_tail, + execute_stderr_tail: $execute_stderr_tail, + assess: $assess, + exit_code: $exit_code + }' + exit "$EXIT_CODE" +fi + +# ============================================================================ +# Human mode. +# ============================================================================ +printf '== dev-review run: %s ==\n' "$RUN_ID" +printf 'run dir: %s\n' "$RUN_DIR" +printf 'status: %s\n' "$STATUS" +printf 'runner pid: %s (alive: %s)\n' "${RUNNER_PID:-}" "$RUNNER_ALIVE" +[[ -n "$PARENT_RUN" ]] && printf 'parent run: %s\n' "$PARENT_RUN" +printf 'task: %s\n' "$TASK" + +if [[ -n "$CUR_PHASE" ]]; then + printf 'current phase: %s (started %s)\n' "$CUR_PHASE" "${CUR_STARTED:-?}" + if [[ -n "$HB_AGE" ]]; then + printf 'heartbeat: %ss ago, %s lines (%s)\n' "$HB_AGE" "${HB_LINES:-0}" "$HB_FILE" + else + printf 'heartbeat: (no stderr activity yet for %s)\n' "$CUR_PHASE" + fi +else + printf 'current phase: \n' +fi + +# Completed phases table. +phase_rows=$(jq -r '.phases // [] | .[] | " \(.name)\t\(.status)\texit=\(.exit_code)\t\(.completed_at // "")"' "$STATE" 2>/dev/null) +if [[ -n "$phase_rows" ]]; then + printf 'phases:\n%s\n' "$phase_rows" +else + printf 'phases: (none recorded yet)\n' +fi + +printf 'markers: contested=%s clarify=%s\n' "$M_CONTESTED" "$M_CLARIFY" + +if [[ -n "$PRE_SHA" || -n "$POST_SHA" ]]; then + printf 'execute SHAs: pre=%s post=%s\n' "${PRE_SHA:-}" "${POST_SHA:-}" +fi + +if [[ -n "$VERDICT" ]]; then + printf 'verdict: %s' "$VERDICT" + [[ -f "$VERDICT_JSON" ]] && printf ' (%s)' "$VERDICT_JSON" + printf '\n' +elif [[ -f "$VERDICT_JSON" ]]; then + printf 'verdict: (verdict.json present: %s)\n' "$VERDICT_JSON" +fi + +if [[ -f "$DIFFSTAT_FILE" ]]; then + printf 'diffstat (tail):\n' + tail -5 "$DIFFSTAT_FILE" | sed 's/^/ /' +fi + +if [[ -f "$EXEC_STDERR" ]]; then + printf 'execute stderr (last 5 lines):\n' + tail -5 "$EXEC_STDERR" | sed 's/^/ /' +fi + +printf 'ASSESS: %s\n' "$ASSESS" +exit "$EXIT_CODE" diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh index 0bf0ff7..583d8a9 100644 --- a/dev-review/codex/dev-review.sh +++ b/dev-review/codex/dev-review.sh @@ -67,6 +67,9 @@ FLAVOR_OVERRIDE="" # (byte-parity invariant SC-5 / D-06). Harness-side passes an absolute path rooted under # the eval fixture's .co-evolution/runs/ subtree. RUN_DIR_OVERRIDE="" +# v1.5 Phase 3: lineage token for orchestrated re-kicks. Empty = standalone run +# (byte-parity: .orchestration is omitted). Validated as a safe fs-ish token. +PARENT_RUN_ID="" usage() { cat <<'EOF' @@ -94,6 +97,7 @@ Options: --live Launch visible Windows terminal tailing each phase's stderr (Windows-only; warns + falls back on other OS) --branch auto|NAME Create a feature branch off HEAD before execute (auto = dev-review/auto--); mutually exclusive with --worktree --worktree auto|PATH Create a git worktree for isolation before execute (auto = sibling dir); mutually exclusive with --branch + --parent-run RUN_ID Lineage tag: record the orchestrator's parent run id in state.orchestration.parent_run_id (re-kicks always get a fresh run dir; no behavior change) --lab MODE Route to lab//entry.sh (opt-in beta channel; see lab/README.md) --target FILE PEL-only: file to mutate (used with --lab pel-proposer; must be repo-relative forward-slash path, e.g. lib/co-evolution.sh — NOT absolute or WSL/Windows-style) --tier TIER PEL-only: override tier auto-detect (template|policy|code) @@ -597,6 +601,8 @@ Output ONLY the plan document. No preamble." # abort_on_timeout will fire from main flow if the dispatcher reports 124. # RTUX-01: Launch a live-tail window before the agent call (no-op unless --live). maybe_launch_live_window "compose" "$compose_stderr_file" + # v1.5 Phase 3: mark the compose phase as STARTING (status reader heartbeat). + begin_state_phase "$STATE_JSON" "compose" # v1.5: layer the composer seat's model/effort (no-op when no per-seat env set). apply_seat_env composer "$COMPOSER" invoke_agent_with_timeout "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$(phase_is_writable compose)" @@ -689,6 +695,9 @@ run_bounce_phase() { # so state.json records which pass hit the timeout. # RTUX-01: Per-pass live-tail window (bounce-01, bounce-02, ...) when --live. maybe_launch_live_window "bounce-${pass_padded}" "$stderr_file" + # v1.5 Phase 3: mark this bounce pass as STARTING (matches the bounce-NN + # name write_state_phase records on completion). + begin_state_phase "$STATE_JSON" "bounce-${pass_padded}" # v1.5: composer turns get the composer seat's model/effort; reviewer (the # bounce counterparty) turns use globals only (the `bounce` seat is a no-op # arm in apply_seat_env). No-op overall when no per-seat env is set. @@ -775,6 +784,17 @@ run_execute_phase() { PRE_EXECUTE_SHA=$(git -C "$WORKDIR" rev-parse HEAD 2>/dev/null || true) fi + # v1.5 Phase 3: record the workdir HEAD just before the executor runs so a + # reviewer can pin the exact pre-change commit. Non-git / detached / failed + # rev-parse yields an empty PRE_EXECUTE_SHA → write null (never die). + if [[ -n "${STATE_JSON:-}" ]]; then + if [[ -n "$PRE_EXECUTE_SHA" ]]; then + write_state_field "$STATE_JSON" ".pre_execute_sha" "string" "$PRE_EXECUTE_SHA" + else + write_state_field "$STATE_JSON" ".pre_execute_sha" "null" + fi + fi + # RNPT-03: Capture pre-execute baseline hash of every workdir file. # Delta computed post-execute and written into state.json. if [[ -n "${STATE_JSON:-}" && -n "${BASELINE_HASHES_JSON:-}" ]]; then @@ -823,6 +843,17 @@ run_execute_phase() { POST_EXECUTE_SHA=$(git -C "$WORKDIR" rev-parse HEAD 2>/dev/null || true) status_output=$(git -C "$WORKDIR" status --short) + # v1.5 Phase 3: record the workdir HEAD just after change detection so a + # reviewer can see whether the executor committed (pre != post) or left the + # tree dirty (pre == post). Empty rev-parse → null (never die). + if [[ -n "${STATE_JSON:-}" ]]; then + if [[ -n "$POST_EXECUTE_SHA" ]]; then + write_state_field "$STATE_JSON" ".post_execute_sha" "string" "$POST_EXECUTE_SHA" + else + write_state_field "$STATE_JSON" ".post_execute_sha" "null" + fi + fi + # RNPT-03: Post-execute delta (runs regardless of "no changes" branch below # so Phase 8 scorer sees an empty delta rather than a missing field). if [[ -n "${STATE_JSON:-}" && -n "${CURRENT_HASHES_JSON:-}" && -n "${EXECUTE_DELTA_JSON:-}" ]]; then @@ -1110,6 +1141,19 @@ while [[ $# -gt 0 ]]; do WORKTREE_SPEC="$2" shift 2 ;; + --parent-run) + # v1.5 Phase 3: orchestration lineage. Validate as a safe filesystem-ish + # token (alnum, _, -, .) so a malicious id can't traverse paths or inject + # shell-meta when echoed downstream (mirrors validate_lab_mode's posture, + # plus '.' since run ids carry dotted timestamps). Lineage only — no + # behavior change beyond the state.orchestration.parent_run_id write. + [[ $# -gt 1 ]] || die "--parent-run requires a value" + if ! [[ "$2" =~ ^[A-Za-z0-9._-]+$ ]] || [[ "${#2}" -gt 128 ]]; then + die "--parent-run must be a safe token [A-Za-z0-9._-]{1,128} (got: $2)" + fi + PARENT_RUN_ID="$2" + shift 2 + ;; --lab) # Phase 3 LAB-01: opt-in routing to lab//entry.sh. # Arm sits BEFORE the `--` argv-terminator so args after `--` remain @@ -1361,6 +1405,17 @@ EXECUTE_DELTA_JSON="${RUN_DIR}/.execute-delta.json" RUN_ID="dev-review-${TIMESTAMP}" init_state_json "$STATE_JSON" "$RUN_ID" "$TASK" "$COMPOSER" "$EXECUTOR" "$REVIEWER" +# v1.5 Phase 3: observability — record the runner's PID once so a status reader +# can probe liveness via `kill -0`. Written as a number (jq accepts $$ verbatim). +write_state_field "$STATE_JSON" ".runner_pid" "number" "$$" + +# v1.5 Phase 3: lineage — when this run was re-kicked by an orchestrator under a +# parent run, record the parent's id (validated at parse time). Lineage only; +# no other behavior. Omitted (left absent) when --parent-run was not passed. +if [[ -n "$PARENT_RUN_ID" ]]; then + write_state_field "$STATE_JSON" ".orchestration.parent_run_id" "string" "$PARENT_RUN_ID" +fi + # v1.5: resolve the verifier seat and per-seat model/effort descriptors for the # banner + observability fields. _VERIFIER_SEAT reflects --verifier / executor # derivation (select_verifier); the seat strings reflect whatever the seats @@ -1440,6 +1495,9 @@ if [[ "$PLAN_ONLY" == "true" ]]; then # RNPT-04: record completion even on plan-only exit so Phase 8 eval sees a # terminated run rather than one that looks crashed mid-phase. write_state_field "$STATE_JSON" ".completed_at" "string" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" + # v1.5 Phase 3: plan-only runs also reach a clean terminal — clear the in-flight + # phase so a status reader does not see a stale compose/bounce as "still running". + write_state_field "$STATE_JSON" ".current_phase" "null" cleanup_runtime_artifacts if [[ "${PLAN_EXIT:-0}" -eq 2 ]]; then exit 2 @@ -1531,6 +1589,9 @@ _run_revise_loop() { _execute_phase_start=$(date -u +%Y-%m-%dT%H:%M:%SZ) EXECUTE_EXIT=0 + # v1.5 Phase 3: mark execute (or execute-N on a revise pass) as STARTING, + # using the same name write_state_phase records on completion below. + begin_state_phase "$STATE_JSON" "$exec_name" run_execute_phase "$_execute_phase_start" || EXECUTE_EXIT=$? _phase_end=$(date -u +%Y-%m-%dT%H:%M:%SZ) _phase_status=$([[ "$EXECUTE_EXIT" -eq 0 ]] && echo "ok" || echo "error") @@ -1545,6 +1606,8 @@ _run_revise_loop() { VERIFY_EXIT=0 # Clear prior verdict globals so a hung/empty verify pass can't leak them. unset VERDICT CONFIDENCE SUMMARY + # v1.5 Phase 3: mark verify (or verify-N on a revise pass) as STARTING. + begin_state_phase "$STATE_JSON" "$verify_name" run_verify_phase "$_verify_phase_start" || VERIFY_EXIT=$? _phase_end=$(date -u +%Y-%m-%dT%H:%M:%SZ) _phase_status=$([[ "$VERIFY_EXIT" -eq 0 ]] && echo "ok" || echo "error") @@ -1595,6 +1658,10 @@ write_state_field "$STATE_JSON" ".status" "string" "$_run_status" write_state_field "$STATE_JSON" ".completed_at" "string" "$_run_end_ts" # Phase 8.1 WR-02 supporting: .updated_at feeds Cost dimension (score-run.sh:261). write_state_field "$STATE_JSON" ".updated_at" "string" "$_run_end_ts" +# v1.5 Phase 3: the run reached EOF — no phase is in flight. The status reader +# treats a non-null .current_phase on a non-terminal run as "still in "; +# clearing it here is the in-band "phases done" signal. +write_state_field "$STATE_JSON" ".current_phase" "null" # Phase 8.1 / D-01: mirror phases[] into history[] (canonical contract name). # Field-rename transition posture — phases[] stays as legacy alias for one diff --git a/evals/RUNNER-CONTRACT.md b/evals/RUNNER-CONTRACT.md index 650f476..e262780 100644 --- a/evals/RUNNER-CONTRACT.md +++ b/evals/RUNNER-CONTRACT.md @@ -6,7 +6,7 @@ owners: - dev-review/codex/dev-review.sh (runner) - evals/score-run.sh (scorer) - evals/tests/fake-runner.sh (reference implementation) -updated: 2026-04-19 +updated: 2026-06-12 --- # Runner ↔ scorer contract @@ -44,6 +44,11 @@ historical fixture corpus under `runners/codex-ps/evals/tests/fixtures/**`. | `history` | array\<`{phase,status,detail,timestamp}`\> | yes | runner (per phase) | `score-run.sh:338` | Canonical bounce-trace array. Convergence needs ≥1 entry with `phase` matching `^bounce-[0-9]+$` when bounces were expected. | | `mode` | string | no | runner (optional tag) | — | Fake-runner sets `"fake-runner"`; real runner may set `"real"` or omit. Runner-owned metadata — NOT written by the shared `init_state_json` library. | | `seat_models` | object `{composer, executor, verifier}` (each string `agent:model@effort`) | no | `dev-review.sh` (post-init) | — | v1.5 observability: what each seat resolved to under the per-seat env layer (e.g. `"claude:claude-fable-5@high"`, `"codex:(default)@xhigh"`). Runner-owned metadata — NOT written by the shared `init_state_json` library; scorer ignores it. | +| `current_phase` | object `{name, started_at}` \| null | no | `dev-review.sh` (`begin_state_phase` at each phase start; null at EOF) | — | v1.5 additions — observability. The phase the runner is currently in; `name` matches the `phases[]`/`history[]` phase name (`compose`, `bounce-NN`, `execute[-N]`, `verify[-N]`). Set at phase **start**, cleared to null at EOF (incl. plan-only). A non-null value on a non-terminal run means the runner is in that phase or died there. Read by `dev-review-status.sh`; scorer ignores it. | +| `runner_pid` | number | no | `dev-review.sh` (post-init, once) | — | v1.5 additions — observability. The runner process's PID (`$$`). A status reader probes liveness via `kill -0`. Runner-owned; scorer ignores it. | +| `pre_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, before executor) | — | v1.5 additions — observability. Workdir `git rev-parse HEAD` captured just before the executor runs; null when non-git/detached. Scorer ignores it. | +| `post_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, after change detection) | — | v1.5 additions — observability. Workdir HEAD just after change detection; `pre != post` means the executor committed. Null when non-git/detached. Scorer ignores it. | +| `orchestration` | object `{parent_run_id}` | no | `dev-review.sh` (post-init, only when `--parent-run` passed) | — | v1.5 additions — lineage. Records the orchestrator's parent run id for re-kicks (re-kicks always get a fresh run dir; no `--resume`). Omitted entirely for standalone runs. Runner-owned; scorer ignores it. | ### Legacy alias: `phases` → `history` diff --git a/lib/co-evolution.sh b/lib/co-evolution.sh index ef0aeed..cc83a77 100644 --- a/lib/co-evolution.sh +++ b/lib/co-evolution.sh @@ -1277,6 +1277,35 @@ write_state_phase() { fi } +# v1.5: record the phase now STARTING (write_state_phase records completions). +# Sets .current_phase = {name, started_at} and bumps .updated_at so a status +# reader knows which phase the runner is in mid-run. The EOF block clears it +# back to null. Single-writer discipline: called from the runner main flow only +# — there is NO concurrent heartbeat loop, which would race write_state_field's +# read-modify-write. Degrades silently (return 0) when jq is absent, matching +# write_state_phase / write_state_field. +begin_state_phase() { + local state_path="${1:?state path required}" + local phase_name="${2:?phase name required}" + + if command -v jq >/dev/null 2>&1; then + local tmp now + now=$(date -u +%Y-%m-%dT%H:%M:%SZ) + tmp=$(mktemp) + # FIX-WR-02 idiom: clean up $tmp on jq failure so we don't leak files. + if jq --arg name "$phase_name" --arg now "$now" \ + '.current_phase = {name: $name, started_at: $now} | .updated_at = $now' \ + "$state_path" > "$tmp"; then + mv "$tmp" "$state_path" + else + rm -f "$tmp" + log "WARNING: jq failed in begin_state_phase ($phase_name) — state.json unchanged" + fi + else + log "WARNING: jq unavailable — begin_state_phase skipping ($phase_name)" + fi +} + # RNPT-04: Generic jq-path field setter. Supports string|number|bool|null|rawfile. # Usage: write_state_field state.json '.verify_verdict' string APPROVED # write_state_field state.json '.execute_delta' rawfile path/to/delta.json diff --git a/tests/observability-lifecycle-simulation.sh b/tests/observability-lifecycle-simulation.sh new file mode 100644 index 0000000..7544674 --- /dev/null +++ b/tests/observability-lifecycle-simulation.sh @@ -0,0 +1,292 @@ +#!/usr/bin/env bash +# tests/observability-lifecycle-simulation.sh +# Hermetic gate for v1.5 Phase 3: runner observability weave-in +# (dev-review/codex/dev-review.sh). Runs the REAL runner end-to-end with +# PATH-stubbed claude/codex (no agents, no network) and asserts the additive +# state.json fields the status reader depends on: +# +# - .current_phase == null at a clean terminal (EOF clears the in-flight phase) +# - .runner_pid present (number) +# - .pre_execute_sha / .post_execute_sha present and 40-hex +# - --parent-run abc-123 lands in .orchestration.parent_run_id +# - a mid-run state.json capture shows .current_phase.name == "execute" +# +# Cheap shape: --skip-plan --plan --bounces 0 --verify --verifier codex +# (codex executor + codex verifier via --output-schema, so a single codex stub +# drives both phases without needing a working claude-verdict path). +# +# House final-line convention: "N/N scenarios passed". + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" +RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh" + +command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; } + +TEST_DIR="$(mktemp -d -t obs-lifecycle-XXXXXX)" +trap 'rm -rf "$TEST_DIR"' EXIT + +TOTAL=0 +PASSED=0 +pass() { printf 'PASS: %s\n' "$1"; PASSED=$((PASSED + 1)); } +fail() { printf 'FAIL: %s\n' "$1" >&2; } + +# --- hermetic git env (mirrors tests/run-all.sh) ---------------------------- +export GIT_CONFIG_GLOBAL="$TEST_DIR/gitconfig" +export GIT_CONFIG_SYSTEM=/dev/null +cat > "$TEST_DIR/gitconfig" <<'GITCFG' +[user] + name = obs-sim + email = obs-sim@invalid.local +[init] + defaultBranch = master +[core] + autocrlf = false +[commit] + gpgsign = false +GITCFG + +# --- stubs ------------------------------------------------------------------ +# codex stub handles BOTH phases the default-seat skip-plan run exercises: +# - EXECUTE: prompt heading "Execution Instructions (Codex)" + a -C workdir → +# append a line to the tracked file so the run produces a diff. Honors +# OBS_EXECUTE_SLEEP (seconds) so the mid-run scenario has a polling window. +# - VERIFY: argv carries --output-schema + -o FILE → write a valid verdict. +mkdir -p "$TEST_DIR/bin" +cat > "$TEST_DIR/bin/codex" <<'STUB' +#!/usr/bin/env bash +for arg in "$@"; do + case "$arg" in --version|-v|version) echo "codex 0.117.0 (obs-stub)"; exit 0 ;; esac +done +output_file=""; workdir=""; has_schema=false; prev="" +for arg in "$@"; do + case "$prev" in + -o) output_file="$arg" ;; + -C) workdir="$arg" ;; + esac + [[ "$arg" == "--output-schema" ]] && has_schema=true + prev="$arg" +done +stdin_capture="" +while IFS= read -r _line; do stdin_capture+="$_line"$'\n'; done + +if [[ "$has_schema" == true ]]; then + # VERIFY phase — emit a schema-shaped verdict to the -o file. + printf '%s\n' '{"verdict":"APPROVED","confidence":90,"summary":"obs-stub approves the change.","issues":[],"scope_creep_detected":false,"iteration_notes":""}' > "$output_file" + exit 0 +fi + +if [[ "$stdin_capture" == *"Execution Instructions (Codex)"* && -n "$workdir" && -f "$workdir/touched.txt" ]]; then + # EXECUTE phase — optional dwell so a poller can observe current_phase=execute. + if [[ -n "${OBS_EXECUTE_SLEEP:-}" ]]; then + sleep "$OBS_EXECUTE_SLEEP" + fi + printf 'obs-stub-change\n' >> "$workdir/touched.txt" +fi +# Non-schema codex calls emit to -o (or stdout) so the runner sees non-empty output. +plan='obs stub output line one\nobs stub output line two\nobs stub output line three' +if [[ -n "$output_file" ]]; then + printf '%b\n' "$plan" > "$output_file" +else + printf '%b\n' "$plan" +fi +exit 0 +STUB +chmod +x "$TEST_DIR/bin/codex" +# claude stub (unused by default seats, present for safety): emit a plan. +cat > "$TEST_DIR/bin/claude" <<'STUB' +#!/usr/bin/env bash +for arg in "$@"; do + case "$arg" in --version|-v|version) echo "claude 1.0.0 (obs-stub)"; exit 0 ;; esac +done +while IFS= read -r _line; do :; done +printf 'obs claude stub output\n' +STUB +chmod +x "$TEST_DIR/bin/claude" + +# --- fixtures --------------------------------------------------------------- +make_scratch_repo() { + local repo="$TEST_DIR/repo-$1" + rm -rf "$repo"; mkdir -p "$repo" + git -C "$repo" init -q + printf 'baseline\n' > "$repo/touched.txt" + git -C "$repo" add -A + git -C "$repo" commit -qm baseline + printf '%s' "$repo" +} + +PLAN_FIXTURE="$TEST_DIR/plan-fixture.md" +cat > "$PLAN_FIXTURE" <<'PLAN' +# Observability Lifecycle Fixture + +## Approach + +A pre-baked plan so the --skip-plan run bypasses compose and bounce and drives +only the execute and verify phases. It satisfies the plan heading regex, the +minimum word count, and the structural-line check. The executor stub appends one +harmless line to a tracked file so the run produces a diff for verify to score. + +## Files to Change + +- `touched.txt` — the executor stub appends a line. + +## Risks + +- None identified. +PLAN + +latest_run_dir() { ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1; } + +# =========================================================================== +# Scenario 1: full lifecycle — current_phase=null at EOF, runner_pid present, +# pre/post SHA 40-hex. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo s1) +before=$(latest_run_dir || true) +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT + unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + export PATH="$TEST_DIR/bin:$PATH" + bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \ + --workdir "$repo" --timeout 60 -- "obs lifecycle probe" +) > "$TEST_DIR/s1.out" 2>&1 || true +run1=$(latest_run_dir || true) +s1_state="$run1/state.json" +s1_ok=true +if [[ -z "$run1" || "$run1" == "$before" || ! -f "$s1_state" ]]; then + s1_ok=false +else + jq -e '.current_phase == null' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo " current_phase not null" >&2; } + jq -e '(.runner_pid | type) == "number"' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo " runner_pid missing/non-number" >&2; } + jq -e '(.pre_execute_sha | test("^[0-9a-f]{40}$"))' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo " pre_execute_sha not 40-hex" >&2; } + jq -e '(.post_execute_sha | test("^[0-9a-f]{40}$"))' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo " post_execute_sha not 40-hex" >&2; } + jq -e '.verify_verdict == "APPROVED"' "$s1_state" >/dev/null 2>&1 || { s1_ok=false; echo " verdict not APPROVED" >&2; } +fi +if [[ "$s1_ok" == true ]]; then + pass "S1: lifecycle — current_phase=null, runner_pid number, pre/post SHA 40-hex, verdict APPROVED" +else + fail "S1: state below" + [[ -f "$s1_state" ]] && cat "$s1_state" >&2 + cat "$TEST_DIR/s1.out" >&2 +fi +[[ -n "$run1" && "$run1" != "$before" ]] && rm -rf "$run1" + +# =========================================================================== +# Scenario 2: --parent-run lands in .orchestration.parent_run_id. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo s2) +before=$(latest_run_dir || true) +( + export PATH="$TEST_DIR/bin:$PATH" + bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \ + --parent-run "abc-123" --workdir "$repo" --timeout 60 -- "obs parent-run probe" +) > "$TEST_DIR/s2.out" 2>&1 || true +run2=$(latest_run_dir || true) +if [[ -n "$run2" && "$run2" != "$before" && -f "$run2/state.json" ]] \ + && jq -e '.orchestration.parent_run_id == "abc-123"' "$run2/state.json" >/dev/null 2>&1; then + pass "S2: --parent-run abc-123 recorded in .orchestration.parent_run_id" +else + fail "S2: parent_run_id missing (run=$run2)" + [[ -f "$run2/state.json" ]] && cat "$run2/state.json" >&2 +fi +[[ -n "$run2" && "$run2" != "$before" ]] && rm -rf "$run2" + +# =========================================================================== +# Scenario 3: --parent-run rejects a garbage token (path traversal / shell meta). +# =========================================================================== +TOTAL=$((TOTAL + 1)) +rc3=0 +( export PATH="$TEST_DIR/bin:$PATH" + bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --parent-run '../evil;rm' \ + --workdir "$TEST_DIR" --timeout 60 -- "x" ) > "$TEST_DIR/s3.out" 2>&1 || rc3=$? +if [[ "$rc3" -ne 0 ]] && grep -qi 'parent-run' "$TEST_DIR/s3.out"; then + pass "S3: --parent-run rejects an unsafe token (die)" +else + fail "S3: rc=$rc3 out=$(cat "$TEST_DIR/s3.out")" +fi + +# =========================================================================== +# Scenario 4: mid-run capture — while execute dwells, state.json shows +# current_phase.name == "execute" and runner_pid is alive. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo s4) +before=$(latest_run_dir || true) +( + export PATH="$TEST_DIR/bin:$PATH" + export OBS_EXECUTE_SLEEP=4 + bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \ + --workdir "$repo" --timeout 60 -- "obs mid-run probe" +) > "$TEST_DIR/s4.out" 2>&1 & +runner_bg=$! + +# Poll for a run dir whose state.json reports current_phase.name == execute. +captured=false +captured_pid="" +for _ in $(seq 1 60); do + rd=$(latest_run_dir || true) + if [[ -n "$rd" && "$rd" != "$before" && -f "$rd/state.json" ]]; then + name=$(jq -r '.current_phase.name // empty' "$rd/state.json" 2>/dev/null || true) + if [[ "$name" == "execute" ]]; then + captured=true + captured_pid=$(jq -r '.runner_pid // empty' "$rd/state.json" 2>/dev/null || true) + break + fi + fi + sleep 0.2 +done +wait "$runner_bg" 2>/dev/null || true + +s4_ok=true +[[ "$captured" == true ]] || { s4_ok=false; echo " never observed current_phase=execute" >&2; } +# runner_pid captured mid-run must be a positive integer. +[[ "$captured_pid" =~ ^[0-9]+$ ]] || { s4_ok=false; echo " captured runner_pid not numeric: $captured_pid" >&2; } +# And the run must still terminate cleanly with current_phase cleared. +run4=$(latest_run_dir || true) +if [[ -n "$run4" && -f "$run4/state.json" ]]; then + jq -e '.current_phase == null and .status == "completed"' "$run4/state.json" >/dev/null 2>&1 \ + || { s4_ok=false; echo " post-run state not clean terminal" >&2; } +fi +if [[ "$s4_ok" == true ]]; then + pass "S4: mid-run capture observed current_phase=execute; run then terminated clean" +else + fail "S4: capture failed" + [[ -f "$run4/state.json" ]] && cat "$run4/state.json" >&2 + cat "$TEST_DIR/s4.out" >&2 +fi +[[ -n "$run4" && "$run4" != "$before" ]] && rm -rf "$run4" + +# =========================================================================== +# Scenario 5: parity — a default skip-plan run with NO new flags still produces +# a completed state (no-new-flags regression; existing bounce-state behavior +# is covered by tests/bounce-state-simulation.sh). +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo s5) +before=$(latest_run_dir || true) +( + export PATH="$TEST_DIR/bin:$PATH" + bash "$RUNNER" --skip-plan --plan "$PLAN_FIXTURE" --bounces 0 --verify --verifier codex \ + --workdir "$repo" --timeout 60 -- "obs parity probe" +) > "$TEST_DIR/s5.out" 2>&1 || true +run5=$(latest_run_dir || true) +if [[ -n "$run5" && "$run5" != "$before" && -f "$run5/state.json" ]] \ + && jq -e '.status == "completed" and .current_phase == null' "$run5/state.json" >/dev/null 2>&1; then + pass "S5: no-new-flags run reaches completed with current_phase cleared" +else + fail "S5: parity run not clean (run=$run5)" + [[ -f "$run5/state.json" ]] && cat "$run5/state.json" >&2 +fi +[[ -n "$run5" && "$run5" != "$before" ]] && rm -rf "$run5" + +# --------------------------------------------------------------------------- +printf '%d/%d scenarios passed' "$PASSED" "$TOTAL" +if (( PASSED != TOTAL )); then + printf ' (%d failed)\n' "$((TOTAL - PASSED))" + exit 1 +fi +printf '\n' diff --git a/tests/status-reader-simulation.sh b/tests/status-reader-simulation.sh new file mode 100644 index 0000000..817e831 --- /dev/null +++ b/tests/status-reader-simulation.sh @@ -0,0 +1,203 @@ +#!/usr/bin/env bash +# tests/status-reader-simulation.sh +# Hermetic gate for v1.5 Phase 3: dev-review/codex/dev-review-status.sh. +# +# Fabricates synthetic run dirs (state.json + stderr heartbeat logs, contract- +# shaped per evals/tests/fake-runner.sh) under a throwaway CO_EVOLVE_RUNS_DIR +# and asserts the status reader's liveness assessment, exit codes, --json keys, +# and --list output. No runner, no agents, no network. +# +# Exit-code contract under test (dev-review-status.sh): +# 0 terminal-completed · 2 terminal-partial · 5 still-running · +# 4 presumed-dead · 3 not found / no state.json +# +# House final-line convention: "N/N scenarios passed". + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" +STATUS="$REPO_ROOT/dev-review/codex/dev-review-status.sh" + +command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; } + +TEST_DIR="$(mktemp -d -t status-reader-sim-XXXXXX)" +trap 'rm -rf "$TEST_DIR"' EXIT +RUNS="$TEST_DIR/runs" +mkdir -p "$RUNS" +export CO_EVOLVE_RUNS_DIR="$RUNS" + +TOTAL=0 +PASSED=0 +pass() { printf 'PASS: %s\n' "$1"; PASSED=$((PASSED + 1)); } +fail() { printf 'FAIL: %s\n' "$1" >&2; } + +# Run the reader; capture stdout + the exit code it returns. +run_status() { + STATUS_OUT="" + STATUS_RC=0 + STATUS_OUT=$(bash "$STATUS" "$@" 2>&1) || STATUS_RC=$? +} + +# Fabricate a run dir. Args: +# Writes a contract-shaped state.json with runner_pid + (optional) current_phase. +# Extra knobs via env: FAB_PID, FAB_VERDICT. +fabricate() { + local id="$1" status="$2" phase="$3" + local dir="$RUNS/$id" + mkdir -p "$dir/outputs" + local pid="${FAB_PID:-$$}" + local cur="null" + if [[ -n "$phase" ]]; then + cur=$(jq -n --arg n "$phase" '{name: $n, started_at: "2026-06-12T00:00:00Z"}') + fi + jq -n \ + --arg id "$id" \ + --arg status "$status" \ + --argjson pid "$pid" \ + --argjson cur "$cur" \ + --arg verdict "${FAB_VERDICT:-}" \ + '{ + run_id: $id, + task: "synthetic status-reader probe task that is fairly long to test truncation behavior in the reader output line", + composer: "opus", executor: "codex", reviewer: "opus", + phases: [ + {name: "compose", started_at: "2026-06-12T00:00:00Z", completed_at: "2026-06-12T00:00:01Z", status: "ok", exit_code: 0} + ], + marker_counts: {contested: 1, clarify: 2}, + verify_verdict: (if $verdict == "" then null else $verdict end), + runner_pid: $pid, + current_phase: (if $status == "null" then $cur else $cur end), + started_at: "2026-06-12T00:00:00Z", + completed_at: null, + status: (if $status == "null" then null else $status end), + updated_at: "2026-06-12T00:00:00Z" + }' > "$dir/state.json" + printf '%s' "$dir" +} + +# --------------------------------------------------------------------------- +# Scenario 1: still-running — alive pid ($$) + fresh execute stderr -> exit 5 +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +d1=$(FAB_PID="$$" fabricate "dev-review-0001-running" null "execute") +# Fresh heartbeat: execute-stderr.log just touched (mtime = now). +printf 'executor working...\nstill going...\n' > "$d1/execute-stderr.log" +run_status "dev-review-0001-running" +if [[ "$STATUS_RC" -eq 5 ]] && grep -q 'RUNNING' <<<"$STATUS_OUT"; then + pass "S1: still-running (alive pid + fresh heartbeat) -> exit 5" +else + fail "S1: rc=$STATUS_RC out=<<$STATUS_OUT>>" +fi + +# --------------------------------------------------------------------------- +# Scenario 2: presumed-dead — bogus pid + stale stderr + non-terminal -> exit 4 +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +d2=$(FAB_PID=999999 fabricate "dev-review-0002-dead" null "execute") +printf 'executor started\n' > "$d2/execute-stderr.log" +# Make the heartbeat stale (>120s old). +touch -t 202001010000 "$d2/execute-stderr.log" 2>/dev/null || true +run_status "dev-review-0002-dead" +if [[ "$STATUS_RC" -eq 4 ]] && grep -q 'DEAD' <<<"$STATUS_OUT"; then + pass "S2: presumed-dead (pid gone + stale heartbeat) -> exit 4" +else + fail "S2: rc=$STATUS_RC out=<<$STATUS_OUT>>" +fi + +# --------------------------------------------------------------------------- +# Scenario 3: completed -> exit 0 +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +d3=$(FAB_VERDICT="APPROVED" fabricate "dev-review-0003-done" completed "") +jq -n '{verdict:"APPROVED",confidence:90,summary:"ok",issues:[]}' > "$d3/verdict.json" +run_status "dev-review-0003-done" +if [[ "$STATUS_RC" -eq 0 ]] && grep -q 'DONE' <<<"$STATUS_OUT"; then + pass "S3: completed -> exit 0" +else + fail "S3: rc=$STATUS_RC out=<<$STATUS_OUT>>" +fi + +# --------------------------------------------------------------------------- +# Scenario 4: partial -> exit 2 +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +d4=$(fabricate "dev-review-0004-partial" partial "") +run_status "dev-review-0004-partial" +if [[ "$STATUS_RC" -eq 2 ]] && grep -q 'PARTIAL' <<<"$STATUS_OUT"; then + pass "S4: partial -> exit 2" +else + fail "S4: rc=$STATUS_RC out=<<$STATUS_OUT>>" +fi + +# --------------------------------------------------------------------------- +# Scenario 5: missing run -> exit 3 +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +run_status "dev-review-9999-nope" +if [[ "$STATUS_RC" -eq 3 ]]; then + pass "S5: missing run -> exit 3" +else + fail "S5: rc=$STATUS_RC out=<<$STATUS_OUT>>" +fi + +# --------------------------------------------------------------------------- +# Scenario 6: --json parses with jq and has the expected keys +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +json_rc=0 +json_out=$(bash "$STATUS" --json "dev-review-0001-running" 2>/dev/null) || json_rc=$? +j_ok=true +if ! jq -e . <<<"$json_out" >/dev/null 2>&1; then j_ok=false; fi +for key in run_id status runner_pid runner_alive current_phase heartbeat marker_counts assess exit_code phases; do + jq -e "has(\"$key\")" <<<"$json_out" >/dev/null 2>&1 || { j_ok=false; echo " missing key: $key" >&2; } +done +# current_phase.name and heartbeat.age_secs must be populated for the running run. +if [[ "$j_ok" == true ]]; then + jq -e '.current_phase.name == "execute"' <<<"$json_out" >/dev/null 2>&1 || j_ok=false + jq -e '.heartbeat.age_secs != null' <<<"$json_out" >/dev/null 2>&1 || j_ok=false + jq -e '.exit_code == 5' <<<"$json_out" >/dev/null 2>&1 || j_ok=false +fi +# --json still returns the assessment exit code. +[[ "$json_rc" -eq 5 ]] || j_ok=false +if [[ "$j_ok" == true ]]; then + pass "S6: --json is valid JSON with expected keys + populated current_phase/heartbeat" +else + fail "S6: json_rc=$json_rc out=<<$json_out>>" +fi + +# --------------------------------------------------------------------------- +# Scenario 7: --list shows the active (non-terminal) run + recent terminal ones +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +list_out=$(bash "$STATUS" --list 2>&1) || true +if grep -q 'ACTIVE' <<<"$list_out" \ + && grep -q 'dev-review-0001-running' <<<"$list_out" \ + && grep -q 'dev-review-0003-done' <<<"$list_out"; then + pass "S7: --list shows the active run and recent terminal runs" +else + fail "S7: out=<<$list_out>>" +fi + +# --------------------------------------------------------------------------- +# Scenario 8: no positional arg -> latest run by mtime (the dead one we touched +# oldest is NOT latest; ensure default resolution returns *some* run cleanly). +# --------------------------------------------------------------------------- +TOTAL=$((TOTAL + 1)) +# Touch a brand-new completed run so it is unambiguously newest. +d8=$(fabricate "dev-review-0008-latest" completed "") +touch "$d8/state.json" +run_status # no arg +if [[ "$STATUS_RC" -eq 0 ]] && grep -q 'dev-review-0008-latest' <<<"$STATUS_OUT"; then + pass "S8: no-arg resolves the newest run by mtime" +else + fail "S8: rc=$STATUS_RC out=<<$STATUS_OUT>>" +fi + +# --------------------------------------------------------------------------- +printf '%d/%d scenarios passed' "$PASSED" "$TOTAL" +if (( PASSED != TOTAL )); then + printf ' (%d failed)\n' "$((TOTAL - PASSED))" + exit 1 +fi +printf '\n' From fb965ad70bd6e7abf515b5fdd168cdc5fcb070d3 Mon Sep 17 00:00:00 2001 From: Alan Shurafa Date: Fri, 12 Jun 2026 13:19:49 -0400 Subject: [PATCH 05/12] feat: gated per-phase token capture into state.json (v1.5 Phase 4) Co-Authored-By: Claude Fable 5 --- dev-review/codex/dev-review.sh | 23 ++ evals/RUNNER-CONTRACT.md | 1 + evals/score-run.sh | 12 +- lib/co-evolution.sh | 184 ++++++++++++- tests/token-capture-simulation.sh | 413 ++++++++++++++++++++++++++++++ 5 files changed, 630 insertions(+), 3 deletions(-) create mode 100755 tests/token-capture-simulation.sh diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh index 583d8a9..30fd33e 100644 --- a/dev-review/codex/dev-review.sh +++ b/dev-review/codex/dev-review.sh @@ -1033,6 +1033,22 @@ cleanup_runtime_artifacts() { find "$RUN_DIR" -maxdepth 1 -type f -name '.*' -delete 2>/dev/null || true } +# v1.5 Phase 4: token-usage aggregation, gated on CO_EVOLVE_TOKEN_CAPTURE=1. +# When the flag is unset/off this is a no-op and state.json never grows a +# `tokens` key (byte-parity). Called just BEFORE cleanup_runtime_artifacts in +# both terminal paths (plan-only early exit and normal EOF). collect_token_usage +# reads the *.usage.json sidecars + per-phase stderr logs and writes one +# state.json.tokens block; we then remove the transient sidecars/envelopes since +# state.json is the durable record (some sidecars — execute-output.md.usage.json, +# verdict.json.usage.json — have no leading dot, so cleanup_runtime_artifacts +# would not sweep them). +maybe_collect_token_usage() { + [[ "${CO_EVOLVE_TOKEN_CAPTURE:-}" == "1" ]] || return 0 + collect_token_usage "$RUN_DIR" "$STATE_JSON" + find "$RUN_DIR" -maxdepth 1 -type f \ + \( -name '*.usage.json' -o -name '*.envelope.json' \) -delete 2>/dev/null || true +} + while [[ $# -gt 0 ]]; do case "$1" in --preset) @@ -1498,6 +1514,9 @@ if [[ "$PLAN_ONLY" == "true" ]]; then # v1.5 Phase 3: plan-only runs also reach a clean terminal — clear the in-flight # phase so a status reader does not see a stale compose/bounce as "still running". write_state_field "$STATE_JSON" ".current_phase" "null" + # v1.5 Phase 4: aggregate token usage (no-op unless CO_EVOLVE_TOKEN_CAPTURE=1) + # BEFORE cleanup sweeps the dot-prefixed sidecars. + maybe_collect_token_usage cleanup_runtime_artifacts if [[ "${PLAN_EXIT:-0}" -eq 2 ]]; then exit 2 @@ -1679,6 +1698,10 @@ if command -v jq >/dev/null 2>&1; then fi unset _run_status _run_end_ts +# v1.5 Phase 4: aggregate token usage (no-op unless CO_EVOLVE_TOKEN_CAPTURE=1) +# BEFORE cleanup sweeps the dot-prefixed sidecars. +maybe_collect_token_usage + cleanup_runtime_artifacts log "" diff --git a/evals/RUNNER-CONTRACT.md b/evals/RUNNER-CONTRACT.md index e262780..a331f5f 100644 --- a/evals/RUNNER-CONTRACT.md +++ b/evals/RUNNER-CONTRACT.md @@ -49,6 +49,7 @@ historical fixture corpus under `runners/codex-ps/evals/tests/fixtures/**`. | `pre_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, before executor) | — | v1.5 additions — observability. Workdir `git rev-parse HEAD` captured just before the executor runs; null when non-git/detached. Scorer ignores it. | | `post_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, after change detection) | — | v1.5 additions — observability. Workdir HEAD just after change detection; `pre != post` means the executor committed. Null when non-git/detached. Scorer ignores it. | | `orchestration` | object `{parent_run_id}` | no | `dev-review.sh` (post-init, only when `--parent-run` passed) | — | v1.5 additions — lineage. Records the orchestrator's parent run id for re-kicks (re-kicks always get a fresh run dir; no `--resume`). Omitted entirely for standalone runs. Runner-owned; scorer ignores it. | +| `tokens` | object `{phases:{:{…}}, totals:{claude_input, claude_output, claude_cache_read, claude_cost_usd, codex_total_tokens}}` | no | `dev-review.sh` (`collect_token_usage` at EOF, **only when `CO_EVOLVE_TOKEN_CAPTURE=1`**) | `score-run.sh` — `.tokens.totals` copied into `details.cost.tokens`, informational only | v1.5 addition — token economics. Per-phase usage: claude phases carry `{input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens, total_cost_usd, source:"claude-json"}` from the gated `--output-format json` envelope; codex phases carry `{source:"codex-stderr", total_tokens}` harvested from the per-phase stderr token line. **Omitted entirely unless capture is enabled** (default off → byte-parity with prior state.json). Scorer surfaces totals for evidence; never affects any PASS/FAIL dimension. | ### Legacy alias: `phases` → `history` diff --git a/evals/score-run.sh b/evals/score-run.sh index 843f603..b9a6380 100644 --- a/evals/score-run.sh +++ b/evals/score-run.sh @@ -644,6 +644,14 @@ composite=$(jq -n --argjson scores "$scores_json" ' # Details (sparse for Task 1; Task 2 expands per-dimension metadata) # --------------------------------------------------------------------------- +# v1.5 Phase 4: token economics (informational only). When the runner ran with +# CO_EVOLVE_TOKEN_CAPTURE=1, state.json carries a .tokens.totals block; surface +# it under details.cost.tokens so the report can speak to the ~50% Claude-token +# savings claim. The `// empty` guard means: when there is no tokens block (every +# existing fixture, and any run with capture off), the key is simply absent and +# details stays byte-identical — this MUST NOT touch any PASS/FAIL dimension. +tokens_totals_json=$(jq -c '.tokens.totals // empty' "$state_json_path" 2>/dev/null) + details_json=$(jq -n \ --arg status "$status" \ --argjson had_exception "$had_exception" \ @@ -651,9 +659,11 @@ details_json=$(jq -n \ --argjson max_wall "$max_wall" \ --arg composer "$composer" \ --arg reviewer "$reviewer" \ + --argjson tokens "${tokens_totals_json:-null}" \ '{ robustness: {status: $status, had_exception: $had_exception}, - cost: {wall_clock_seconds: $wall_secs, max: $max_wall}, + cost: ({wall_clock_seconds: $wall_secs, max: $max_wall} + + (if $tokens == null then {} else {tokens: $tokens} end)), cross_ai_diversity: {composer: $composer, reviewer: $reviewer} }') diff --git a/lib/co-evolution.sh b/lib/co-evolution.sh index cc83a77..8478cf7 100644 --- a/lib/co-evolution.sh +++ b/lib/co-evolution.sh @@ -425,11 +425,61 @@ invoke_claude() { model_flags+=(--effort "${CLAUDE_EFFORT}") fi + # v1.5 Phase 4: per-phase token capture, default OFF. Gated behind + # CO_EVOLVE_TOKEN_CAPTURE=1 AND jq present so the default path stays + # byte-identical (argv + artifacts). When ON we swap `--output-format text` + # for `--output-format json`: the JSON envelope (R1) carries `.usage` token + # counts, and `.result` holds the same text the text path would have emitted. + # PRTP-03 reminder: this is `--output-format json`, NOT `--output-schema` + # (the latter hangs claude -p on Windows). The gate keeps it default-off so + # the Windows precedent is respected unless an operator opts in. + local capture_json=false + if [[ "${CO_EVOLVE_TOKEN_CAPTURE:-}" == "1" ]] && command -v jq >/dev/null 2>&1; then + capture_json=true + fi + + local output_format=text + [[ "$capture_json" == "true" ]] && output_format=json + if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1; then # Under WSL, reuse the Windows Claude session because WSL and Windows keep separate auth state. - cmd=(cmd.exe /c claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}") + cmd=(cmd.exe /c claude -p --output-format "$output_format" "${model_flags[@]}" "${tool_flags[@]}") else - cmd=(claude -p --output-format text "${model_flags[@]}" "${tool_flags[@]}") + cmd=(claude -p --output-format "$output_format" "${model_flags[@]}" "${tool_flags[@]}") + fi + + if [[ "$capture_json" == "true" ]]; then + local envelope_file="${output_file}.envelope.json" + local usage_file="${output_file}.usage.json" + # The envelope is emitted on stdout even on failure (R1: is_error=true still + # carries a body, exit nonzero) — `|| true` matches the text path and we + # parse regardless. Capture stdout to the envelope, NOT the output file. + "${cmd[@]}" < "$prompt_file" > "$envelope_file" 2>"$stderr_file" || true + # Extract `.result` into the output file so the external contract is + # preserved: callers (validate_agent_artifact, file_contains_auth_failure) + # only ever read $output_file. On auth failure `.result` is the "Not logged + # in" banner, so the existing auth gate keeps working. If `.result` is empty + # or the envelope is unparseable, fall back to copying the raw envelope so + # validate_agent_artifact sees a non-empty file rather than a silent blank. + if jq -e 'has("result") and (.result | type == "string") and (.result | length > 0)' \ + "$envelope_file" >/dev/null 2>&1; then + jq -r '.result // empty' "$envelope_file" > "$output_file" + else + cp "$envelope_file" "$output_file" + fi + # Usage sidecar (R1 fields). `// null` guards a malformed/partial envelope so + # the file is always valid JSON for collect_token_usage to slurp. Never + # fatal — token capture must not break a run. + jq '{ + input_tokens: (.usage.input_tokens // null), + output_tokens: (.usage.output_tokens // null), + cache_read_input_tokens: (.usage.cache_read_input_tokens // null), + cache_creation_input_tokens: (.usage.cache_creation_input_tokens // null), + total_cost_usd: (.total_cost_usd // null), + source: "claude-json" + }' "$envelope_file" > "$usage_file" 2>/dev/null \ + || printf '{"input_tokens":null,"output_tokens":null,"cache_read_input_tokens":null,"cache_creation_input_tokens":null,"total_cost_usd":null,"source":"claude-json"}\n' > "$usage_file" + return 0 fi "${cmd[@]}" < "$prompt_file" > "$output_file" 2>"$stderr_file" || true @@ -1306,6 +1356,136 @@ begin_state_phase() { fi } +# v1.5 Phase 4: aggregate per-phase token usage into state.json. Called ONLY +# when CO_EVOLVE_TOKEN_CAPTURE=1 (the runner gates the call), so when the flag +# is off there is no `tokens` key at all — state.json stays byte-identical. +# +# Two evidence sources, both already on disk by the time this runs: +# 1. Claude usage sidecars (`*.usage.json`, written by invoke_claude's gated +# JSON path). Mapped to a phase via the sidecar's base output filename: +# .compose-output.md.usage.json -> compose +# .bounce-output-N.md.usage.json -> bounce-NN (zero-padded) +# execute-output.md.usage.json -> execute +# verdict.json.usage.json -> verify +# 2. Codex token line (R2) harvested read-only from per-phase stderr logs: +# compose-stderr.log -> compose +# pass-N-stderr.log -> bounce-NN +# execute-stderr.log -> execute +# review-stderr.log -> verify +# Format: the literal line `tokens used` immediately followed by a +# comma-formatted integer. awk grabs the next line and strips commas. +# +# Writes ONE `tokens` block: {phases:{:{…}}, totals:{…}} and bumps +# .updated_at. Degrades silently when jq is unavailable. The caller removes the +# transient sidecars afterward (state.json is the durable record). +collect_token_usage() { + local run_dir="${1:?run_dir required}" + local state_json="${2:?state_json required}" + + command -v jq >/dev/null 2>&1 || { + log "WARNING: jq unavailable — collect_token_usage skipping" + return 0 + } + [[ -f "$state_json" ]] || return 0 + + # Build the phases object incrementally as a JSON string. + local phases_json='{}' + local tmp + + # --- 1. Claude usage sidecars ------------------------------------------- + # Two globs: the leading-dot one matches compose/bounce sidecars whose base + # output file is itself dot-prefixed (.compose-output.md.usage.json, + # .bounce-output-N.md.usage.json) — a plain *.usage.json glob skips dotfiles. + local sidecar base phase pass + for sidecar in "$run_dir"/*.usage.json "$run_dir"/.*.usage.json; do + [[ -e "$sidecar" ]] || continue # no-glob guard + base=$(basename "$sidecar") + base="${base%.usage.json}" # strip sidecar suffix -> output filename + case "$base" in + .compose-output.md) phase="compose" ;; + execute-output.md) phase="execute" ;; + verdict.json) phase="verify" ;; + .bounce-output-*.md) + pass="${base#.bounce-output-}" + pass="${pass%.md}" + if [[ "$pass" =~ ^[0-9]+$ ]]; then + printf -v phase 'bounce-%02d' "$pass" + else + phase="bounce-${pass}" + fi + ;; + *) phase="$base" ;; # unknown shape: key by raw base + esac + # Merge this sidecar's object under .. Tolerate a malformed sidecar. + if tmp=$(jq --arg p "$phase" --slurpfile s "$sidecar" \ + '. + {($p): $s[0]}' <<<"$phases_json" 2>/dev/null); then + phases_json="$tmp" + fi + done + + # --- 2. Codex stderr token lines ---------------------------------------- + local stderr_log log_base codex_phase tokens_val + for stderr_log in "$run_dir"/compose-stderr.log "$run_dir"/execute-stderr.log \ + "$run_dir"/review-stderr.log "$run_dir"/pass-*-stderr.log; do + [[ -f "$stderr_log" ]] || continue + log_base=$(basename "$stderr_log") + case "$log_base" in + compose-stderr.log) codex_phase="compose" ;; + execute-stderr.log) codex_phase="execute" ;; + review-stderr.log) codex_phase="verify" ;; + pass-*-stderr.log) + pass="${log_base#pass-}" + pass="${pass%-stderr.log}" + if [[ "$pass" =~ ^[0-9]+$ ]]; then + printf -v codex_phase 'bounce-%02d' "$pass" + else + codex_phase="bounce-${pass}" + fi + ;; + *) continue ;; + esac + # Skip the retry-stderr variants (pass-N-stderr-retry.log won't match the + # globs above; the primary logs are authoritative for the phase total). + tokens_val=$(awk '/^tokens used$/{getline; gsub(",",""); print; exit}' "$stderr_log" 2>/dev/null) + [[ "$tokens_val" =~ ^[0-9]+$ ]] || continue + # A claude sidecar already claimed this phase only if claude ran it; codex + # phases never have a sidecar, so this is collision-free in practice. If both + # somehow exist, the codex entry wins last (rare/diagnostic). + if tmp=$(jq --arg p "$codex_phase" --argjson t "$tokens_val" \ + '. + {($p): {source: "codex-stderr", total_tokens: $t}}' \ + <<<"$phases_json" 2>/dev/null); then + phases_json="$tmp" + fi + done + + # --- 3. Totals + single write into state.json ---------------------------- + # Totals sum across phases: claude fields from claude-json sidecars, codex + # tokens from codex-stderr entries. Nulls are treated as 0 in the sums. + local now + now=$(date -u +%Y-%m-%dT%H:%M:%SZ) + tmp=$(mktemp) + if jq --argjson phases "$phases_json" --arg now "$now" ' + ($phases | to_entries) as $entries | + def num($x): ($x // 0); + { + phases: $phases, + totals: { + claude_input: ([$entries[] | select(.value.source=="claude-json") | num(.value.input_tokens)] | add // 0), + claude_output: ([$entries[] | select(.value.source=="claude-json") | num(.value.output_tokens)] | add // 0), + claude_cache_read: ([$entries[] | select(.value.source=="claude-json") | num(.value.cache_read_input_tokens)] | add // 0), + claude_cost_usd: ([$entries[] | select(.value.source=="claude-json") | num(.value.total_cost_usd)] | add // 0), + codex_total_tokens:([$entries[] | select(.value.source=="codex-stderr") | num(.value.total_tokens)] | add // 0) + } + } as $tokens | + .tokens = $tokens | .updated_at = $now + ' "$state_json" > "$tmp" 2>/dev/null; then + mv "$tmp" "$state_json" + else + rm -f "$tmp" + log "WARNING: jq failed in collect_token_usage — state.json unchanged (no tokens block)" + fi +} + # RNPT-04: Generic jq-path field setter. Supports string|number|bool|null|rawfile. # Usage: write_state_field state.json '.verify_verdict' string APPROVED # write_state_field state.json '.execute_delta' rawfile path/to/delta.json diff --git a/tests/token-capture-simulation.sh b/tests/token-capture-simulation.sh new file mode 100755 index 0000000..ecfb1a1 --- /dev/null +++ b/tests/token-capture-simulation.sh @@ -0,0 +1,413 @@ +#!/usr/bin/env bash +# tests/token-capture-simulation.sh +# Hermetic gate for v1.5 Phase 4: CO_EVOLVE_TOKEN_CAPTURE=1 token capture in +# lib/co-evolution.sh (invoke_claude gated JSON mode + collect_token_usage) and +# the runner's EOF aggregation into state.json. +# +# The capture is gated behind CO_EVOLVE_TOKEN_CAPTURE=1 AND jq present; default +# OFF must be byte-identical to today (argv, artifacts, no state.json `tokens` +# key). This gate pins: +# a. flag ON, mixed run (claude compose+verify + codex execute) → state.json +# .tokens.phases has both kinds of entries, totals are correct, the *.usage +# and *.envelope sidecars are cleaned up, and the claude output files hold +# the extracted `.result` text (NOT the raw envelope). +# b. flag ON, malformed claude envelope → the output file falls back to the +# envelope copy (non-empty), the run does not die, and that phase's tokens +# entry carries nulls. +# c. flag OFF (default) → claude argv contains `--output-format text` (not +# json), NO *.envelope.json / *.usage.json sidecars survive, and state.json +# has NO `tokens` key (the byte-parity guard). +# d. flag OFF → `--help` is unaffected (still exits 0, still emits usage) and +# no token machinery leaks into the help path. +# +# Pattern: PATH-injected claude + codex stubs (same discipline as +# tests/preset-expansion-simulation.sh) — drain stdin via a read-loop (NOT +# `cat`) to avoid SIGPIPE under strict-mode bash. The claude stub branches on +# whether `--output-format json` is in its argv: when present it emits a canned +# R1-shaped envelope (real token numbers) on stdout; otherwise the plain text +# body. Phase (compose/bounce vs verify) is detected from the prompt on stdin. + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" +RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh" + +command -v jq >/dev/null 2>&1 || { echo "SKIP-FAIL: jq required for this gate" >&2; exit 1; } + +TEST_DIR="$(mktemp -d -t token-capture-XXXXXX)" +trap 'rm -rf "$TEST_DIR"' EXIT + +TOTAL=0 +FAILURES=0 +pass() { printf "PASS: %s\n" "$1"; } +fail() { printf "FAIL: %s\n" "$1" >&2; FAILURES=$((FAILURES + 1)); } + +# --- hermetic git env (mirrors tests/run-all.sh) ---------------------------- +export GIT_CONFIG_GLOBAL="$TEST_DIR/gitconfig" +export GIT_CONFIG_SYSTEM=/dev/null +cat > "$TEST_DIR/gitconfig" <<'GITCFG' +[user] + name = token-capture-sim + email = token-capture-sim@invalid.local +[init] + defaultBranch = master +[core] + autocrlf = false +[commit] + gpgsign = false +GITCFG + +# --- canned R1 envelope numbers (the assertions key off these) -------------- +# These are the per-claude-invocation token counts the stub reports. The runner +# calls claude once for compose and once for verify in the scenario-(a) run +# (bounces 0, so no bounce passes), so claude totals = 2x each field. +CLAUDE_IN=1111 +CLAUDE_OUT=222 +CLAUDE_CACHE_READ=33333 +CLAUDE_CACHE_CREATE=44 +CODEX_TOKENS=12345 + +# --- stub CLIs -------------------------------------------------------------- +mkdir -p "$TEST_DIR/bin" + +# claude stub. If `--output-format json` is in argv → emit a canned R1 envelope +# on stdout whose `.result` is a phase-appropriate body (plan or verdict). +# Otherwise → emit the body as plain text (the default text path). The +# MALFORMED_ENVELOPE env var (scenario b) makes the json path emit a body that +# is NOT valid JSON, exercising invoke_claude's envelope-copy fallback. +cat > "$TEST_DIR/bin/claude" <> "\$CLAUDE_ARGV_LOG" +json_mode=false +for arg in "\$@"; do + if [[ "\$arg" == "json" ]]; then json_mode=true; fi +done +# Drain stdin via read-loop (no SIGPIPE); capture to detect the phase. +stdin_capture="" +while IFS= read -r _line; do stdin_capture+="\$_line"\$'\n'; done + +# Phase-appropriate text body. +if [[ "\$stdin_capture" == *"Verification Instructions"* ]]; then + body=\$(cat <<'VERDICT' +{ + "verdict": "APPROVED", + "confidence": 88, + "summary": "Token-capture stub verifier approves the hermetic change.", + "issues": [] +} +VERDICT +) +else + body=\$(cat <<'PLAN' +# Token Capture Stub Plan + +## Approach + +This is a hermetic stub plan emitted by the token-capture simulation claude +stub. It satisfies the plan heading regex, the minimum word count, and the +structural-line check so the compose phase validates and the run proceeds into +execute and verify. It is not a real plan and the only side effect is a single +harmless append to a tracked file so the run produces a diff for the verify +phase to score. No external commands run and no network is touched anywhere. + +## Files to Change + +- \`touched.txt\` — the executor stub appends a line so the run yields a diff. + +## Risks + +- None identified. Stubbed CLI produces a fixed plan with no side effects. +PLAN +) +fi + +if [[ "\$json_mode" == true ]]; then + if [[ "\${MALFORMED_ENVELOPE:-}" == "1" ]]; then + # NOT a valid JSON envelope — invoke_claude must copy the raw stdout to the + # output file (non-empty fallback) and write an all-null usage sidecar + # (jq can't extract .usage from non-JSON). We emit the plain plan body so the + # copied fallback still clears the compose plan-quality gate; the point under + # test is the fallback + null-usage path, not the plan validator. + printf '%s\n' "\$body" + exit 1 + fi + # Canned R1-shaped envelope on stdout. .result carries the text body. jq -n + # builds it so .result is correctly JSON-escaped regardless of body content. + jq -n --arg result "\$body" \ + --argjson it ${CLAUDE_IN} --argjson ot ${CLAUDE_OUT} \ + --argjson cr ${CLAUDE_CACHE_READ} --argjson cc ${CLAUDE_CACHE_CREATE} ' + { + type: "result", subtype: "success", is_error: false, + duration_ms: 100, num_turns: 1, result: \$result, + total_cost_usd: 0.05, + usage: { + input_tokens: \$it, output_tokens: \$ot, + cache_read_input_tokens: \$cr, cache_creation_input_tokens: \$cc + } + }' + exit 0 +fi +# Plain text path (flag OFF): just the body. +printf '%s\n' "\$body" +STUB +chmod +x "$TEST_DIR/bin/claude" + +# codex stub: append argv, parse -o/-C, drain stdin, emit plan to -o (or make a +# real change in -C for the execute phase), and ALWAYS write the R2 token line +# to stderr so the harvest path has something to find. +cat > "$TEST_DIR/bin/codex" <> "\$CODEX_ARGV_LOG" +output_file=""; workdir=""; prev="" +for arg in "\$@"; do + case "\$prev" in + -o) output_file="\$arg" ;; + -C) workdir="\$arg" ;; + esac + prev="\$arg" +done +stdin_capture="" +while IFS= read -r _line; do stdin_capture+="\$_line"\$'\n'; done +if [[ "\$stdin_capture" == *"Execution Instructions (Codex)"* \ + && -n "\$workdir" && -f "\$workdir/touched.txt" ]]; then + printf 'codex-stub-change\n' >> "\$workdir/touched.txt" +fi +# R2 token line on stderr (always — codex prints it at end of every run). +printf 'OpenAI Codex stub\ntokens used\n%s\n' "${CODEX_TOKENS}" >&2 +plan_content=\$(cat <<'PLAN' +# Token Capture Stub Plan + +## Approach + +This is a hermetic stub plan emitted by the token-capture simulation codex +stub. It satisfies the plan heading regex, the minimum word count, and the +structural-line check so the compose phase validates and the run proceeds into +execute and verify. The only side effect is a single harmless append to a +tracked file so the run produces a diff. No external commands run and no +network is touched anywhere in this hermetic path. + +## Files to Change + +- \`touched.txt\` — appended to by the executor stub. + +## Risks + +- None identified. Stubbed codex produces a fixed plan with no side effects. +PLAN +) +if [[ -n "\$output_file" ]]; then + printf '%s\n' "\$plan_content" > "\$output_file" +else + printf '%s\n' "\$plan_content" +fi +exit 0 +STUB +chmod +x "$TEST_DIR/bin/codex" + +# --- helpers ---------------------------------------------------------------- +make_scratch_repo() { + local repo + repo="$TEST_DIR/repo-$1" + rm -rf "$repo" + mkdir -p "$repo" + git -C "$repo" init -q + printf 'baseline\n' > "$repo/touched.txt" + git -C "$repo" add -A + git -C "$repo" commit -qm "baseline" + printf '%s' "$repo" +} + +latest_run_dir() { ls -dt "$REPO_ROOT"/runs/dev-review-* 2>/dev/null | head -1; } + +# =========================================================================== +# Scenario (a): flag ON, mixed run — claude composes + verifies, codex executes. +# Asserts: both kinds of phase entries, correct totals, sidecars cleaned, claude +# output files contain the extracted .result text (not the raw envelope). +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo a) +claude_log="$TEST_DIR/a_claude.log"; codex_log="$TEST_DIR/a_codex.log" +: > "$claude_log"; : > "$codex_log" +run_dir_before=$(latest_run_dir || true) +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT MALFORMED_ENVELOPE + unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + export PATH="$TEST_DIR/bin:$PATH" + export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log" + export CO_EVOLVE_TOKEN_CAPTURE=1 + # composer=claude, executor=codex, verifier=claude (opus alias). bounces 0 so + # exactly two claude calls (compose + verify) and one codex call (execute). + bash "$RUNNER" --composer opus --executor codex --verifier opus \ + --verify --bounces 0 --workdir "$repo" --timeout 60 -- "token capture mixed run" +) >"$TEST_DIR/a.out" 2>&1 || true +a_run_dir=$(latest_run_dir) +a_ok=true +a_msg="" +if [[ -z "$a_run_dir" || "$a_run_dir" == "$run_dir_before" ]]; then + a_ok=false; a_msg="no new run dir" +else + state="$a_run_dir/state.json" + # tokens block present, both phase kinds. + if ! jq -e '.tokens.phases.compose.source == "claude-json"' "$state" >/dev/null 2>&1; then + a_ok=false; a_msg="compose not claude-json" + fi + if ! jq -e '.tokens.phases.verify.source == "claude-json"' "$state" >/dev/null 2>&1; then + a_ok=false; a_msg="verify not claude-json" + fi + if ! jq -e '.tokens.phases.execute.source == "codex-stderr"' "$state" >/dev/null 2>&1; then + a_ok=false; a_msg="execute not codex-stderr" + fi + # totals: claude (compose+verify) = 2x each; codex = one execute line. + exp_in=$((CLAUDE_IN * 2)); exp_out=$((CLAUDE_OUT * 2)); exp_cr=$((CLAUDE_CACHE_READ * 2)) + if ! jq -e --argjson v "$exp_in" '.tokens.totals.claude_input == $v' "$state" >/dev/null 2>&1; then a_ok=false; a_msg="claude_input total"; fi + if ! jq -e --argjson v "$exp_out" '.tokens.totals.claude_output == $v' "$state" >/dev/null 2>&1; then a_ok=false; a_msg="claude_output total"; fi + if ! jq -e --argjson v "$exp_cr" '.tokens.totals.claude_cache_read == $v' "$state" >/dev/null 2>&1; then a_ok=false; a_msg="claude_cache_read total"; fi + if ! jq -e --argjson v "$CODEX_TOKENS" '.tokens.totals.codex_total_tokens == $v' "$state" >/dev/null 2>&1; then a_ok=false; a_msg="codex total"; fi + # sidecars cleaned up (neither *.usage.json nor *.envelope.json survive). + if ls "$a_run_dir"/*.usage.json "$a_run_dir"/.*.usage.json >/dev/null 2>&1; then a_ok=false; a_msg="usage sidecars survived"; fi + if ls "$a_run_dir"/*.envelope.json "$a_run_dir"/.*.envelope.json >/dev/null 2>&1; then a_ok=false; a_msg="envelope sidecars survived"; fi + # claude output files hold the extracted .result text — verdict.json must be + # the verdict object (not the raw envelope, which would carry a "usage" key). + if ! jq -e '.verdict == "APPROVED" and (has("usage") | not)' "$a_run_dir/verdict.json" >/dev/null 2>&1; then + a_ok=false; a_msg="verdict.json is not the extracted .result" + fi +fi +if [[ "$a_ok" == true ]]; then + pass "flag ON mixed run: claude+codex phases captured, totals correct, sidecars cleaned, .result extracted" +else + fail "flag ON mixed run failed ($a_msg) — run dir: ${a_run_dir:-}" + [[ -n "${a_run_dir:-}" ]] && jq -c '.tokens // ""' "$a_run_dir/state.json" >&2 2>/dev/null || true +fi +[[ -n "${a_run_dir:-}" && "$a_run_dir" != "$run_dir_before" ]] && rm -rf "$a_run_dir" + +# =========================================================================== +# Scenario (b): flag ON, malformed claude envelope. The compose output file +# falls back to the envelope copy (non-empty → run does not die mid-compose), +# and the compose tokens entry carries nulls. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo b) +run_dir_before=$(latest_run_dir || true) +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT + unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + export PATH="$TEST_DIR/bin:$PATH" + export CO_EVOLVE_TOKEN_CAPTURE=1 + export MALFORMED_ENVELOPE=1 + # Only the compose phase is claude here; --bounces 0 and no --verify so the + # malformed compose envelope is the one under test. plan-only keeps it short: + # the run reaches the plan-only terminal (still calls maybe_collect_token_usage). + bash "$RUNNER" --composer opus --plan-only --bounces 0 \ + --workdir "$repo" --timeout 60 -- "token capture malformed envelope" +) >"$TEST_DIR/b.out" 2>&1 || true +b_run_dir=$(latest_run_dir) +b_ok=true +b_msg="" +if [[ -z "$b_run_dir" || "$b_run_dir" == "$run_dir_before" ]]; then + b_ok=false; b_msg="no new run dir" +else + state="$b_run_dir/state.json" + # The run produced a plan.md from the envelope-copy fallback (non-empty). + if [[ ! -s "$b_run_dir/plan.md" ]]; then b_ok=false; b_msg="plan.md empty (fallback did not copy envelope)"; fi + # tokens block exists and the compose entry has null token fields. + if ! jq -e '.tokens.phases.compose.source == "claude-json"' "$state" >/dev/null 2>&1; then + b_ok=false; b_msg="compose entry missing" + fi + if ! jq -e '.tokens.phases.compose.input_tokens == null' "$state" >/dev/null 2>&1; then + b_ok=false; b_msg="compose input_tokens not null on malformed envelope" + fi + # totals sum nulls as 0 → claude_input 0; state.json valid. + if ! jq -e '.tokens.totals.claude_input == 0' "$state" >/dev/null 2>&1; then + b_ok=false; b_msg="malformed totals not zeroed" + fi + if ! jq empty "$state" >/dev/null 2>&1; then b_ok=false; b_msg="state.json invalid JSON"; fi +fi +if [[ "$b_ok" == true ]]; then + pass "flag ON malformed envelope: output falls back to envelope copy, run survives, tokens nulled" +else + fail "flag ON malformed envelope failed ($b_msg) — run dir: ${b_run_dir:-}" +fi +[[ -n "${b_run_dir:-}" && "$b_run_dir" != "$run_dir_before" ]] && rm -rf "$b_run_dir" + +# =========================================================================== +# Scenario (c): flag OFF (default) — byte-parity guard. claude argv carries +# `--output-format text` (not json), NO envelope/usage sidecars survive, and +# state.json has NO `tokens` key. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +repo=$(make_scratch_repo c) +claude_log="$TEST_DIR/c_claude.log"; codex_log="$TEST_DIR/c_codex.log" +: > "$claude_log"; : > "$codex_log" +run_dir_before=$(latest_run_dir || true) +( + unset CLAUDE_MODEL CLAUDE_EFFORT CODEX_MODEL CODEX_REASONING_EFFORT MALFORMED_ENVELOPE + unset COMPOSER_MODEL COMPOSER_EFFORT EXECUTOR_MODEL EXECUTOR_EFFORT VERIFIER_MODEL VERIFIER_EFFORT + unset CO_EVOLVE_TOKEN_CAPTURE + export PATH="$TEST_DIR/bin:$PATH" + export CLAUDE_ARGV_LOG="$claude_log" CODEX_ARGV_LOG="$codex_log" + bash "$RUNNER" --composer opus --executor codex --verifier opus \ + --verify --bounces 0 --workdir "$repo" --timeout 60 -- "token capture parity off" +) >"$TEST_DIR/c.out" 2>&1 || true +c_run_dir=$(latest_run_dir) +c_ok=true +c_msg="" +# claude argv: text mode, never json. +if ! grep -Eq -- '--output-format text' "$claude_log"; then c_ok=false; c_msg="claude not in text mode"; fi +if grep -Eq -- '--output-format json' "$claude_log"; then c_ok=false; c_msg="claude leaked json mode with flag off"; fi +if [[ -z "$c_run_dir" || "$c_run_dir" == "$run_dir_before" ]]; then + c_ok=false; c_msg="no new run dir" +else + # No sidecars anywhere (they were never written). + if ls "$c_run_dir"/*.usage.json "$c_run_dir"/.*.usage.json >/dev/null 2>&1; then c_ok=false; c_msg="usage sidecar exists with flag off"; fi + if ls "$c_run_dir"/*.envelope.json "$c_run_dir"/.*.envelope.json >/dev/null 2>&1; then c_ok=false; c_msg="envelope sidecar exists with flag off"; fi + # state.json has NO tokens key. + if jq -e 'has("tokens")' "$c_run_dir/state.json" >/dev/null 2>&1; then c_ok=false; c_msg="state.json has tokens key with flag off"; fi +fi +if [[ "$c_ok" == true ]]; then + pass "flag OFF parity: claude argv is --output-format text, no sidecars, no state.json tokens key" +else + fail "flag OFF parity guard failed ($c_msg) — run dir: ${c_run_dir:-}" + [[ -s "$claude_log" ]] && { echo "--- claude argv ---"; cat "$claude_log"; } >&2 +fi +[[ -n "${c_run_dir:-}" && "$c_run_dir" != "$run_dir_before" ]] && rm -rf "$c_run_dir" + +# =========================================================================== +# Scenario (d): --help is unaffected by the capture machinery (exits 0, emits +# usage) regardless of the flag, and never references the token internals. +# =========================================================================== +TOTAL=$((TOTAL + 1)) +d_ok=true +d_msg="" +# Help with flag explicitly ON must still behave exactly as the normal help. +help_on=$(CO_EVOLVE_TOKEN_CAPTURE=1 bash "$RUNNER" --help 2>/dev/null; echo "EXIT=$?") +help_off=$(bash "$RUNNER" --help 2>/dev/null; echo "EXIT=$?") +if [[ "$help_on" != *"EXIT=0"* ]]; then d_ok=false; d_msg="--help exit nonzero with flag on"; fi +if ! printf '%s' "$help_off" | grep -Eq -- '--composer|Usage|usage'; then d_ok=false; d_msg="--help did not emit usage"; fi +# The flag must not change help output (no token sidecars, no env-dependent text). +if [[ "${help_on%EXIT=*}" != "${help_off%EXIT=*}" ]]; then d_ok=false; d_msg="--help output differs by flag"; fi +if [[ "$d_ok" == true ]]; then + pass "flag ON/OFF: --help unaffected (exit 0, usage emitted, byte-identical across the flag)" +else + fail "--help affected by capture flag ($d_msg)" +fi + +# --- summary ---------------------------------------------------------------- +passed=$((TOTAL - FAILURES)) +if (( FAILURES == 0 )); then + echo "$passed/$TOTAL scenarios passed" + exit 0 +else + echo "$passed/$TOTAL scenarios passed ($FAILURES failed)" >&2 + exit 1 +fi From 30a9a00b5b39999596a41404d258010459a78a6e Mon Sep 17 00:00:00 2001 From: Alan Shurafa Date: Fri, 12 Jun 2026 13:35:45 -0400 Subject: [PATCH 06/12] feat: /codex-build orchestration skill + routing docs, both transports (v1.5 Phase 5) Co-Authored-By: Claude Fable 5 --- AGENTS.md | 1 + CLAUDE.md | 25 +- README.md | 1 + dev-review/codex/instructions.md | 3 + skills/codex-build/SKILL.md | 378 +++++++++++++++++++++++++++++++ skills/dev-review/SKILL.md | 43 +++- tests/docs-sync-simulation.sh | 132 +++++++++++ 7 files changed, 577 insertions(+), 6 deletions(-) create mode 100644 skills/codex-build/SKILL.md create mode 100644 tests/docs-sync-simulation.sh diff --git a/AGENTS.md b/AGENTS.md index ee53f61..05d790b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -17,6 +17,7 @@ Full spec: [`BOUNCE-PROTOCOL.md`](BOUNCE-PROTOCOL.md). Reference implementation: in the general-purpose co-evolution workflow for questions, drafts, plans, specs, arguments, and markdown refinement - If you are invoked via `/dev-review` (Claude Code) or `bash dev-review/codex/dev-review.sh` (Codex), you are inside the bounce pipeline — your role (reviewer / composer) and pass number are passed to you in the prompt template +- If you are invoked via `/codex-build`, you are orchestrating a detached Codex build: the session plans and reviews, Codex executes in the background under `--preset codex-build`, and the session is woken at gates (see `skills/codex-build/`) - If you are invoked via `bash agent-bouncer/agent-bouncer.sh `, you are bouncing a single markdown document — same protocol applies - If you are exploring the repo directly without an explicit role, treat the protocol section above as orientation, then read the GSD-managed sections below for project meta diff --git a/CLAUDE.md b/CLAUDE.md index 8cdfe98..20d3b98 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -5,9 +5,17 @@ agents using structured `[CONTESTED]` / `[CLARIFY]` markers until convergence. ## Default Rule -Use general co-evolution by default. Reach for `dev-review` only when the user -specifically wants repo files changed, a bug fixed, a feature implemented, or a -code diff verified against a plan. +Three routes, lowest-ceremony first: + +- **co-evolution** (default) — bounce protocol for questions, drafts, plans, + specs, arguments, and markdown refinement. +- **dev-review** — interactive code pipeline (compose → bounce → execute → + verify in one session) when the user wants repo files changed, a bug fixed, a + feature implemented, or a code diff verified against a plan. +- **codex-build** — "build with codex": the session plans and reviews, Codex + executes detached, with check-ins only at gates. Reach for this when the user + wants Codex to grind on a build in the background while the session stays free + (the model-ladder flavor of dev-review). ## Components @@ -54,6 +62,17 @@ Code-focused compose-bounce-execute-verify pipeline integrated with Claude Code. Key flags: `--skip-plan` executes a pre-existing plan, `--plan-only` stops after bounce, and `--live` opens visible Windows terminals for Codex passes. +### Codex-Build Skill (`skills/codex-build/`) + +Detached orchestration skill: the session (typically Fable) plans and reviews, +Codex executes in the background, and the session is woken at gates instead of +babysitting. Kicks `dev-review.sh --preset codex-build` via a background task, +ends the turn, then runs a schema-bound ACCEPT / REVISE / ESCALATE gate on wake. + +```text +/codex-build Have codex implement the retry wrapper while I review +``` + ### Codex Runtime (`dev-review/codex/`) Standalone Bash runtime for the code-focused compose-bounce-execute-verify flow diff --git a/README.md b/README.md index e9ffb06..fcc58bd 100644 --- a/README.md +++ b/README.md @@ -173,6 +173,7 @@ fallthrough. | Existing markdown document needs refinement | `co-evolve-bouncer.sh --bounce-only` | | Small, low-risk repo edit in 1-2 files | Direct execution | | Multi-file code change, medium/high risk, or plan/verify workflow | `dev-review/codex/dev-review.sh` | +| Codex builds in the background while the session plans and reviews | `skills/codex-build/` (`--preset codex-build`, detached) | | General co-evolution inside Claude Code | `skills/co-evolution/` | | Code pipeline inside Claude Code | `skills/dev-review/` | | External MCP client (Claude Desktop, Cursor, Continue) | `npm i -g @alanshurafa/co-evolution-mcp` | diff --git a/dev-review/codex/instructions.md b/dev-review/codex/instructions.md index f30eb6b..92fc2dc 100644 --- a/dev-review/codex/instructions.md +++ b/dev-review/codex/instructions.md @@ -18,6 +18,8 @@ Pick the lowest-ceremony path that still protects correctness. **Start with `co- | Existing approved plan should be executed as-is | `bash dev-review/codex/dev-review.sh --skip-plan --plan ` | Reuses the approved plan | | User wants a reviewed plan before any code changes | `bash co-evolve-bouncer.sh --vanilla "task"` or `dev-review.sh --plan-only` | Plan artifact only | | User wants a final review against the plan and diff | `bash dev-review/codex/dev-review.sh --verify ` | Runs verifier after execution | +| Orchestrated / background build — plan and review in-session, Codex executes detached | `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --preset codex-build --skip-plan --plan --branch auto -- ""` (kicked in the background; see `skills/codex-build/`) | Model-ladder ladder: Fable plans/reviews, Codex executes; session is not babysat | +| Ad-hoc interactive delegation to Codex mid-session | OpenAI plugin `/codex:rescue --background` (see `skills/codex-build/` Transport B) | Quick delegation; review with the same verdict schema by hand, not via CI | ## Distinguish Coding From Everything Else @@ -101,5 +103,6 @@ bash dev-review/codex/dev-review.sh --skip-plan --plan .planning/phases/04-docs- - Legacy document bouncer: `agent-bouncer/agent-bouncer.sh` - Default Claude Code skill surface: `skills/co-evolution/` - Code-focused Claude Code skill surface: `skills/dev-review/` +- Detached Codex orchestration skill surface: `skills/codex-build/` (kicks `--preset codex-build` in the background) - Shared library: `lib/co-evolution.sh` - Co-evolve templates: `templates/co-evolve/` diff --git a/skills/codex-build/SKILL.md b/skills/codex-build/SKILL.md new file mode 100644 index 0000000..f1e731f --- /dev/null +++ b/skills/codex-build/SKILL.md @@ -0,0 +1,378 @@ +--- +name: codex-build +description: > + Orchestrate Codex to BUILD code in the background while this Claude Code + session (typically Fable) plans and reviews — never babysitting. The session + composes the implementation plan, kicks the dev-review runner detached via a + background Bash task with --preset codex-build, ENDS ITS TURN, and is woken on + exit to run a schema-bound review gate (ACCEPT / REVISE / ESCALATE, max 2 + rounds). Use when the user wants to "build with codex", "have codex build", + "have codex implement", "orchestrate codex", "kick off codex in the + background", "let codex grind", or "codex builds while I review". For an + interactive single-session compose-bounce-execute loop with no detach, use + /dev-review. For non-code document refinement, use /co-evolution. +allowed-tools: Bash, Read, Write, Edit, Grep, Glob, Agent +--- + +# /codex-build - Detached Codex Execution with Gate-Based Review + +This is the **orchestration protocol** behind the "Fable plans, Codex executes, +Fable reviews" model ladder. The session is the composer and the reviewer; Codex +is the executor. The defining rule: **the session is NEVER kept busy while Codex +grinds.** You kick the runner as a background task, end your turn, and the +harness wakes you when it exits. + +Use `/dev-review` instead when you want the classic interactive +compose-bounce-execute loop in one session (you watch every pass). Use +`/co-evolution` for questions, drafts, plans, and document refinement with no +code execution. + +The protocol has five steps. Steps 1-3 happen in the FIRST turn (preflight, +plan, kick-and-stop). Steps 4-5 happen on WAKE, after the background task exits. + +--- + +## Step 1: PREFLIGHT (one turn, before planning) + +Run these checks. Each gate either passes, degrades with a stated fallback, or +dies with a fix hint. + +### Codex present + +```bash +command -v codex +``` + +If missing, die with the install hint: + +```text +codex CLI not found. On this Mac with Codex.app installed: + ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex +Otherwise install the CLI: + npm i -g @openai/codex +Then re-run /codex-build. +``` + +### Claude CLI auth probe (decides the verifier seat) + +The runner's `claude` verifier seat runs in a headless shell that does NOT carry +the interactive app's session token. Probe it: + +```bash +claude -p --output-format text --model claude-haiku-4-5-20251001 "ping" 2>&1 | head -2 +``` + +- If the output contains `Not logged in` (or `/login`): the runner's claude + verifier seat cannot run. **DEGRADE** — kick with `--verifier codex` so the + verify phase uses Codex's schema-bound review instead. Tell the user plainly: + the in-session review gate (Step 4, which YOU run) is then the only Claude-side + review of the diff, and to run `claude /login` in a terminal to restore the + full ladder (Fable verifier seat inside the runner). +- Otherwise: the full ladder is available. Kick with the preset's default claude + verifier (Fable, max effort) — no `--verifier` override needed. + +### No other active run + +```bash +bash dev-review/codex/dev-review-status.sh --list +``` + +If the `ACTIVE` section shows a non-terminal run (`status=pending` or +`status=null`), do NOT kick a second one — one orchestrated run per workdir. +Either wait for it (check its status on wake) or escalate to the user. + +### Isolation — clean vs dirty tree + +```bash +git -C "$(pwd)" status --short +``` + +- **Clean tree** → kick with `--branch auto` (the runner cuts + `dev-review/auto--` off HEAD before execute). +- **Dirty tree** → kick with `--worktree auto` (REQUIRED). A dirty tree + otherwise makes the runner's verify phase silently skip: it returns early with + `verification skipped - workdir had pre-existing uncommitted changes` because + it cannot isolate this run's diff. `--worktree auto` gives Codex a clean + sibling checkout so verify runs against an isolatable diff. + +### Local clone, never an SMB mount + +The workdir must be a local git clone. Never run from a network/SMB mount +(`/Volumes/...`, `\\server\...`) — it causes line-ending churn and the runner's +diff isolation is unreliable. If the cwd is a mount, stop and ask the user to +point at a local clone. + +--- + +## Step 2: PLAN (in-session — the session is the composer) + +This is the cheap, interactive part. You explore and write the plan; you do NOT +hand planning to Codex. + +1. Explore the codebase with Read / Grep / Glob (or an Explore Agent for + fan-out). Understand the task, the files involved, and the conventions. +2. Write the implementation plan to a temp file **outside the workdir**: + + ```bash + PLAN_FILE="${TMPDIR:-/tmp}/codex-build-plan-$(date +%Y%m%d-%H%M%S).md" + ``` + + **NEVER** write the plan inside the workdir. An untracked plan file there + makes the runner's verify phase skip (`run left untracked files that cannot + be diffed automatically`). **NEVER** pass the plan path inside a prompt to + Codex — per repo convention the runner embeds plan content inline; the + orchestrator is the sole owner of the canonical plan file. + +3. Shape the plan the way `--skip-plan --plan FILE` expects. The runner's plan + validator wants a real plan artifact: a top-level heading, >= ~60 words, >= 5 + non-empty lines, and >= 2 structural lines (headings / bullets). Write the + plan with these sections: + + ```text + # + + ## Approach + + + ## Files to Change + - `path/to/file` — + + ## Steps + 1. + + ## Risks / Out of scope + - + ``` + + A thin or malformed plan makes the runner's plan-quality gate fail before + execute. + +### Optional plan hardening (≤ 2 synchronous bounce passes) + +For higher-stakes work, harden the plan against Codex BEFORE kicking, using the +bounce protocol markers. This is the only synchronous Codex use in this skill — +keep it short (a plan bounce is seconds-to-low-minutes, well under the 5-minute +synchronous ceiling): + +- Send the plan to Codex asking it to critique with `[CONTESTED]` (disagreement + + counter-argument) and `[CLARIFY]` (ambiguity + two interpretations) markers. +- Resolve EVERY `[CONTESTED]` / `[CLARIFY]` before kicking. Apply the 2-pass + expiry rule: any marker still open after 2 passes — decide it yourself or ask + the user. Do not kick with open markers. + +If you skip hardening, that is fine for routine tasks — go straight to Step 3. + +--- + +## Step 3: KICK + END TURN (the load-bearing step) + +Kick the runner as a **background** Bash task, report the run id, and **end your +turn**. This is what makes the model ladder pay off — the session pays only for +plan + review gates, not for watching Codex grind. + +Run this with the Bash tool, **`run_in_background: true`**: + +```bash +CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \ + --preset codex-build --skip-plan --plan "$PLAN_FILE" \ + [--verifier codex] \ + [--branch auto | --worktree auto] \ + [--parent-run ] \ + [--timeout 1800] \ + -- "" +``` + +Fill the brackets from Step 1: +- `--verifier codex` ONLY when the auth probe degraded (else omit — the preset's + Fable/max claude verifier is the default). +- `--branch auto` for a clean tree; `--worktree auto` for a dirty tree. +- `--parent-run ` only on a REVISE re-kick (Step 4), to tag lineage. +- `--timeout` defaults to 1800s; raise it for large tasks. + +The preset expands to: composer = Fable (high), executor = Codex (xhigh, model +left to the CLI's config), verifier = Fable (max), `--verify` on, bounces 2, +revise-loop 1. `CO_EVOLVE_TOKEN_CAPTURE=1` records per-phase tokens into +`state.json` so you can report spend at the gate. + +After kicking: +1. Capture the run id. It is `dev-review-` and the run dir is + `runs/dev-review-/`. Read it from the runner's early stdout, or + resolve the newest run with `ls -dt runs/dev-review-* | head -1`. +2. Report to the user: the run id, and how to check it manually — + `bash dev-review/codex/dev-review-status.sh ` (human) or `--json` + (machine). +3. **END THE TURN.** + +### Hard rules for Step 3 + +- Do **NOT** poll. Do **NOT** sleep. Do **NOT** loop waiting on status. +- Do **NOT** run Codex synchronously for anything expected to take longer than + ~5 minutes — that defeats the entire purpose (the session burns cache reads + every turn while idle, the exact anti-pattern this skill exists to avoid). +- The harness re-invokes this session when the background task exits. **That + notification IS the wake signal.** You do not build the wake mechanism. +- Optional safety net for very long runs (reference, do not build inline): a + one-shot scheduled reminder that re-checks status after the expected duration, + in case a machine-sleep drops the wake notification. + +--- + +## Step 4: WAKE → REVIEW GATE (1-2 turns, after the background task exits) + +When the background task exits, you are re-invoked. Reconstruct state from disk — +read the status JSON FIRST, then read narrowly. Do not re-read the whole tree. + +```bash +bash dev-review/codex/dev-review-status.sh --json +``` + +This emits one object with: `status`, `verdict` (APPROVED / REVISE / null), +`verdict_json` (path to `verdict.json`), `verdict_present`, `diffstat_tail`, +`current_phase`, `marker_counts`, `assess`, and `exit_code` (the status reader's +liveness code: 0 done / 2 partial / 4 presumed-dead / 5 running / 3 no-run). + +Then, BEFORE reading any source files: +1. Read `verdict.json` (the path is in `.verdict_json`) — the schema-bound + verdict: `verdict`, `confidence`, `summary`, `issues[]`, + `scope_creep_detected`. +2. Read the diffstat (`.diffstat_tail`, or + `runs//execute-diffstat.txt`). + +Only THEN read targeted diff hunks and the specific files named in `issues[]`. +Never re-read the whole tree to form an opinion. + +Decide exactly ONE of three outcomes. + +### ACCEPT + +Conditions: verdict is `APPROVED` AND your own spot-check of the named diff +hunks agrees. Report and finish: + +- Branch or worktree path (where the code landed). +- Diffstat (files changed, insertions/deletions). +- Verdict confidence and one-line summary. +- Token totals: + + ```bash + jq '.tokens.totals' runs//state.json + ``` + +- Your own turn count for this orchestration (plan + kick + this gate). +- Suggested next step: merge / PR commands, e.g. + `git checkout master && git merge ` or `gh pr create`. + +Then you are done. **Never auto-merge** — surface the commands for the user. + +### REVISE (max 2 orchestrator rounds total) + +Conditions: verdict is `REVISE`, or your spot-check finds a real, fixable +problem, AND you have not already used 2 revise rounds. + +1. Copy the plan file to a new path (keep the original; lineage is by file): + + ```bash + REVISE_PLAN="${TMPDIR:-/tmp}/codex-build-plan-$(date +%Y%m%d-%H%M%S)-rN.md" + cp "$PLAN_FILE" "$REVISE_PLAN" + ``` + +2. Append a corrections section with **file-specific, actionable** fixes drawn + from `issues[]` and your spot-check: + + ```text + ## Reviewer Corrections (Round N) + - `path/to/file`: + ``` + +3. Re-kick Step 3 with the new plan and `--parent-run ` (tags the + re-kick's lineage). Then END THE TURN again. + +Never re-kick on identical feedback twice — if the same issue survives a round, +that is an ESCALATE, not another REVISE. + +### ESCALATE + +Conditions (any): verdict missing (`verdict_present: false` / +`verdict: null`) · runner exited non-zero (status `failed`) · status reader exit +4 (presumed-dead runner) · 2 revise rounds exhausted · `scope_creep_detected: +true`. + +Write a concise human summary — do NOT auto-merge, do NOT silently retry: +- What was attempted (task, rounds used). +- Artifact paths (run dir, `verdict.json`, diffstat, the plan file). +- Open issues (from `issues[]` and/or the dead-runner assessment). +- Recommended manual step (inspect the worktree, re-kick with `--skip-plan`, + fix by hand, etc.). + +--- + +## Step 5: BUDGET + TROUBLESHOOTING + +### Budget rules + +- **Max 2 orchestrator revise rounds.** Round 3 is an ESCALATE. +- The runner's internal `--revise-loop 1` (set by the preset) adds at most one + cheap automatic retry per kick, INSIDE the runner — that is separate from and + cheaper than your orchestrator rounds. You still cap your own re-kicks at 2. +- **One orchestrated run per workdir.** The Step 1 `--list` check enforces this. + +### Troubleshooting + +| Symptom | Cause | Action | +|---|---|---| +| status reader exits 4 (presumed-dead) | runner process gone mid-phase | `pkill -f 'codex exec'` to clear orphans, then re-kick `--skip-plan --plan "$PLAN_FILE"` (add `--parent-run `) | +| Machine slept during the run | wake notification may have been dropped | Run the status script manually; do not assume the run failed | +| `tokens` block missing from state.json | `CO_EVOLVE_TOKEN_CAPTURE=1` was not set, or `jq` is missing | Re-kick with the flag set; install `jq` for token capture | +| Verify silently skipped | dirty tree kicked without `--worktree auto`, or run left untracked files | Re-kick on a clean tree with `--branch auto`, or `--worktree auto` | +| `claude` verifier seat errors mid-run | headless shell not logged in | Re-kick with `--verifier codex`; run `claude /login` to restore the Fable seat | + +--- + +## Transport B: the OpenAI plugin (secondary, interactive flavor) + +When you are already mid-session on small ad-hoc work and want to delegate a +quick build/fix to Codex without the full runner, the official OpenAI plugin is +a supported alternative. + +Install (one-time, non-interactive): + +```bash +claude plugin marketplace add openai/codex-plugin-cc +claude plugin install codex@openai-codex +``` + +It provides `/codex:rescue --background` (delegate a task to Codex detached), +`/codex:status` (check the job), `/codex:result` (fetch the result), and +`/codex:cancel`. + +Protocol when using the plugin: +1. Delegate execution to `codex-rescue` (or `/codex:rescue --background`). +2. Apply the **same review contract** in-session: read the diff and fill the + review-verdict JSON yourself (`skills/dev-review/schemas/review-verdict.json` + — `verdict`, `confidence`, `summary`, `issues[]`, `scope_creep_detected`). +3. Same `≤ 2`-round revise cap. Same **never auto-merge** rule. + +State plainly to the user: the plugin path shares our templates/schema **by +convention** but is **NOT** exercised by evals/CI — there is no `--output-schema` +contract enforced through it (you fill the verdict by hand), and it is pinned to +the plugin's major version (`v1.x`). Format drift in the plugin's output is the +plugin's risk, not ours. The runner path (Transport A above) is the one that +carries CI. + +--- + +## Notes + +- **Why detached, not babysat:** the post that inspired this (Fable plans, Codex + executes, Fable reviews) keeps a session supervising inline, burning cache + reads every turn. This skill kicks the runner via background Bash, ends the + turn, and gets woken on exit — the session pays only for plan + review gates. +- **The preset is the single source of truth for the seats.** `codex-build` + expands to Fable/high · Codex/xhigh · Fable/max · verify on · bounces 2 · + revise-loop 1, inside `dev-review/codex/dev-review.sh` (`apply_preset`). The + preset triple is pinned by `tests/docs-sync-simulation.sh` so this doc and the + runner cannot drift. +- **Live-mode caveat:** `--live` codex windows ignore model/effort overrides + (documented v1 limitation in /dev-review). This skill does not use `--live`. +- **Plan ownership:** the orchestrator owns the canonical plan file; Codex only + ever receives plan content inline (embedded by the runner) and writes to a + separate output. Never let Codex see the canonical plan path. diff --git a/skills/dev-review/SKILL.md b/skills/dev-review/SKILL.md index 36dd9ad..719fb39 100644 --- a/skills/dev-review/SKILL.md +++ b/skills/dev-review/SKILL.md @@ -26,6 +26,14 @@ Use `/co-evolution` for general questions, ideas, arguments, prompt refinement, or document bounces. Use `/dev-review` only when the intended deliverable is a repo change or code-focused verification workflow. +For the **detached** flavor — the session plans and reviews while Codex executes +in the background and the session is woken at gates instead of babysitting — use +`/codex-build` (`skills/codex-build/`). It kicks the standalone runner with +`--preset codex-build` (Fable plans at `high`, Codex executes at `xhigh`, Fable +reviews at `max`; `--verify` on, bounces `2`, revise-loop `1`) and applies a +schema-bound ACCEPT / REVISE / ESCALATE gate. `/dev-review` is the interactive +single-session loop; `/codex-build` is the same pipeline run detached. + When `--live` is enabled on Windows, each Codex pass runs in a visible PowerShell window so the user can watch progress in real time. The handoff contract stays file-based: prompts are written to disk, Codex writes the final message to disk, @@ -170,6 +178,28 @@ the warning from Step 2 is enough; still render the banner as `Live: no`. ### Initialize the plan document: Create `/tmp/dev-review-plan-{timestamp}.md` - this is the living document that bounces. +### Build the shared Codex argument block: `CODEX_ARGS` + +`--model` and the per-seat effort knob are parsed but must actually reach +`codex exec`. Build a `CODEX_ARGS` array ONCE here and splice it into every +headless `codex exec` invocation (compose, bounce, execute, verify). Codex takes +model and reasoning-effort as `-c key=value` overrides, not as named flags: + +```bash +CODEX_ARGS=() +[ -n "$CODEX_MODEL" ] && CODEX_ARGS+=( -c "model=$CODEX_MODEL" ) +[ -n "${CODEX_EFFORT:-}" ] && CODEX_ARGS+=( -c "model_reasoning_effort=$CODEX_EFFORT" ) +``` + +`$CODEX_MODEL` comes from `--model`; `$CODEX_EFFORT` (if your invocation sets it) +maps to the executor's reasoning effort, mirroring the standalone runner's +`-c model_reasoning_effort=` wiring. When both are empty, `CODEX_ARGS` is empty +and the codex commands are byte-identical to today's — Codex falls back to its +own `~/.codex/config.toml` defaults. + +Splice it into each headless codex command with `"${CODEX_ARGS[@]}"`, e.g. +`codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o `. + ### Shared helper: `launch_codex_live` Use one helper for every live Codex pass: compose, bounce, execute, and verify. @@ -329,7 +359,7 @@ launch_codex_live "Codex - Compose" "exec" \ If `$LIVE_ACTIVE=false`, keep the existing headless behavior: ```bash -printf '%s' "$COMPOSE_PROMPT" | codex exec --full-auto -C "$(pwd)" -o /tmp/dev-review-plan-{timestamp}.md +printf '%s' "$COMPOSE_PROMPT" | codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o /tmp/dev-review-plan-{timestamp}.md ``` **Important:** Always pipe plan content through stdin or embed it in the prompt file. @@ -421,7 +451,7 @@ cp /tmp/dev-review-bounce-output-{timestamp}-{N}.md /tmp/dev-review-plan-{timest If `$LIVE_ACTIVE=false`, keep the existing headless behavior: ```bash -printf '%s' "$(cat /tmp/dev-review-bounce-prompt.md)" | codex exec --full-auto -C "$(pwd)" -o /tmp/dev-review-bounce-output-{timestamp}-{N}.md +printf '%s' "$(cat /tmp/dev-review-bounce-prompt.md)" | codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o /tmp/dev-review-bounce-output-{timestamp}-{N}.md # Orchestrator overwrites canonical plan with Codex's output cp /tmp/dev-review-bounce-output-{timestamp}-{N}.md /tmp/dev-review-plan-{timestamp}.md ``` @@ -582,7 +612,7 @@ launch_codex_live "Codex - Execute" "exec" \ If `$LIVE_ACTIVE=false`, keep the existing headless behavior: ```bash -codex exec --full-auto -C "$(pwd)" < /tmp/dev-review-exec-prompt.md > /tmp/dev-review-exec-output.md 2>&1 +codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" < /tmp/dev-review-exec-prompt.md > /tmp/dev-review-exec-output.md 2>&1 ``` After execution, verify changes: @@ -631,6 +661,8 @@ Pass 1: `codex review --uncommitted` (if available, skip if not) - If `$LIVE_ACTIVE=false`, keep the current headless `codex review` behavior. Pass 2: `codex exec --output-schema` with the review prompt +- Headless: splice `"${CODEX_ARGS[@]}"` in too, e.g. + `codex exec --full-auto "${CODEX_ARGS[@]}" --output-schema -C "$(pwd)" -o /tmp/dev-review-verdict.json < /tmp/dev-review-review-prompt.md` - If `$LIVE_ACTIVE=true`, use `launch_codex_live` with title `Codex - Verify`, mode `exec-schema`, prompt `/tmp/dev-review-review-prompt.md`, output `/tmp/dev-review-verdict.json`, and schema path @@ -745,6 +777,11 @@ Dev-review is integrated into GSD workflows: - **`--live` is launch-mode only**: It does not change the artifact contract. Prompts still come from temp files, and the final Codex message still lands in the same output file before Claude Code continues. +- **`--live` ignores model/effort overrides** (documented v1 limitation): the + visible-window launcher (`launch_codex_live`) does not splice `CODEX_ARGS`, so + `--model` / the reasoning-effort knob have no effect on live Codex passes. + Those overrides apply only on the headless path. Use headless mode when a + pinned Codex model or effort matters. - **`--live` is Windows-only in the first release**: On non-Windows platforms, warn once and continue headless. - **Live windows auto-close after a short read delay**: The wrapper waits 5 seconds diff --git a/tests/docs-sync-simulation.sh b/tests/docs-sync-simulation.sh new file mode 100644 index 0000000..03d4b1e --- /dev/null +++ b/tests/docs-sync-simulation.sh @@ -0,0 +1,132 @@ +#!/usr/bin/env bash +# tests/docs-sync-simulation.sh +# Hermetic sync-token gate for v1.5 Phase 5: pin the codex-build preset triple +# so the runner code and the user-facing docs cannot drift. +# +# The `codex-build` preset is defined in ONE place (dev-review/codex/dev-review.sh +# apply_preset) but DESCRIBED in two skill docs (skills/codex-build/SKILL.md and +# skills/dev-review/SKILL.md). If someone edits the seats in one and forgets the +# others, the docs lie. This gate greps the load-bearing seat tokens out of each +# file and asserts they all agree on the same triple: +# +# composer = Fable, effort high +# executor = Codex, effort xhigh (model unpinned — CLI config rules) +# verifier = Fable, effort max +# bounces = 2 +# +# Pattern: pure-grep assertions over the real files (no agents, no stubs). Each +# scenario is a (file, required-token) pair. This is a structural invariant test, +# the same flavour as the frozen-surface grep checks in classifier-simulation.sh. + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" + +RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh" +CODEX_BUILD_SKILL="$REPO_ROOT/skills/codex-build/SKILL.md" +DEV_REVIEW_SKILL="$REPO_ROOT/skills/dev-review/SKILL.md" + +TOTAL=0 +FAILURES=0 +pass() { printf "PASS: %s\n" "$1"; } +fail() { printf "FAIL: %s\n" "$1" >&2; FAILURES=$((FAILURES + 1)); } + +# assert_grep