diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index 5c4b660..941dd2c 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -10,7 +10,19 @@ Co-Evolution is a tooling repo for structured iterative refinement between AI ag - [x] **v1.1 Polish & Ergonomics** (shipped 2026-04-17) — v1.0 code review fixes (WR-01/02/03) + runtime ergonomics (REVISE auto-loop, visible live mode, branch/worktree management). 4 phases, 6 requirements closed. PR [#2](https://github.com/alanshurafa/co-evolution/pull/2) · See [`milestones/v1.1-ROADMAP.md`](milestones/v1.1-ROADMAP.md) · [`milestones/v1.1-SUMMARY.md`](milestones/v1.1-SUMMARY.md) · [`milestones/v1.1-REQUIREMENTS.md`](milestones/v1.1-REQUIREMENTS.md) - [x] **v1.3 Reliability, Measurement & Cross-Platform** (shipped 2026-06-11) — stranded-fix landing, macOS/bash-3.2+5.2 portability with 3-OS CI, silent-failure hardening, and the bounce measurement stack (state.json, deterministic scorer + marker-fate ledger, blind judge, human report). Headline: 17.6% deletion-convergence measured; Fable-5 judge 7/7 improved. 9 phases. See [`milestones/v1.3-SUMMARY.md`](milestones/v1.3-SUMMARY.md) · audit at `docs/audits/2026-06-10-v13-audit.md` -## Active Milestone: v1.4 Distribution — npm + MCP (2026-06-11) +## Active Milestone: v1.5 Build with Codex — model ladder + orchestrated execution (2026-06-12) + +**Goal:** Adopt the Codex-execution / Fable-orchestration split (per @cjzafir's pattern) in the dev-review runner: fix 3 latent env-export bugs, add per-seat model/effort config, a `--preset codex-build` shortcut, detached background execution with harness-exit-wakeup, a status-reader script, token capture to measure the 50% cost claim, and a `/codex-build` orchestration skill for both the runner and plugin transports. Design basis: `.planning/v1.5-DESIGN.md` (approved 2026-06-12). + +- [ ] **Phase 0: Environment + research** (2026-06-12, in progress) — codex symlink + smoke; plugin install; R1 pin `claude -p --output-format json` envelope; R2 pin codex end-of-run token line; register milestone in .planning/; research notes filed. +- [ ] **Phase 1: Seat plumbing + env-export correctness** — `lib/co-evolution.sh` effort knobs + `invoke_codex_schema` move (B2); `dev-review.sh` `export CODEX_MODEL` (B1) + `export WORKDIR` (B3) + `--verifier/--claude-model` flags + per-seat env via `apply_seat_env`. Gate: `tests/run-all.sh` green; byte-parity with knobs off. +- [ ] **Phase 2: Claude-verifier hardening + `--preset codex-build`** — fenced-JSON verdict fallback; preset expansion (fable/high → codex/xhigh → fable/max, bounces=2, revise-loop=1); banner; `tests/preset-expansion-simulation.sh`. +- [ ] **Phase 3: Runner observability + status reader** — `state.json` additions (`current_phase`, `runner_pid`, `pre/post_execute_sha`, `orchestration.parent_run_id`); new `dev-review-status.sh` (~120 lines, exit codes 0/2/3/4/5); `tests/status-reader-simulation.sh`. +- [ ] **Phase 4: Token capture** — `CO_EVOLVE_TOKEN_CAPTURE=1` (default off); `invoke_claude` gated JSON mode; codex stderr harvest; `collect_token_usage` → `state.json.tokens`; `tests/token-capture-simulation.sh`. +- [ ] **Phase 5: `/codex-build` skill + docs** — new `skills/codex-build/SKILL.md` (preflight → plan → kick → wake/gate loop, both runner and plugin transports); CLAUDE.md Default Rule update; routing doc updates. +- [ ] **Phase 6: Dogfood + evidence** — 2–3 real `/codex-build` tasks (ACCEPT / REVISE→ACCEPT / ESCALATE); token evidence note; MCP parity (`vendor.sh` + `npm test`); memory update. + +## Previous Milestone: v1.4 Distribution — npm + MCP (2026-06-11) **Goal:** Make the bounce protocol invocable without `git clone`: a Node/TS MCP server (`@alanshurafa/co-evolution-mcp`, one `co_evolve` tool) published diff --git a/.planning/STATE.md b/.planning/STATE.md index c429f60..512dc4b 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -1,17 +1,17 @@ --- gsd_state_version: 1.0 -milestone: v1.4 -milestone_name: Distribution — npm + MCP +milestone: v1.5 +milestone_name: Build with Codex — model ladder + orchestrated execution status: executing -stopped_at: v1.4 Phases 0-4 EXECUTED (2026-06-11). mcp/ package built + 4/4 hermetic smoke tests + stdio handshake verified; CI gains 3-OS mcp job; publish-mcp.yml ready. Remaining = Phase 5 (HUMAN: verify npm scope @alanshurafa, add NPM_TOKEN secret, git tag -> auto-publish, Claude Desktop round-trip) + Phase 6 post-ship registry/awesome-list. -last_updated: "2026-06-10T23:45:00.000Z" -last_activity: 2026-06-11 -- v1.4 milestone registered; v1.3 archived +stopped_at: v1.5 Phases 0-5 EXECUTED (b672145..30a9a00). Phase 6 PARTIAL (2026-06-12) — MCP vendor parity green; codex-verifier degrade path now FULLY GREEN end-to-end: model-leak fix + review-verdict.json schema 400 fix both landed, first real /codex-build ACCEPT produced (subtract-helper task, exit 0, APPROVED conf 96, verify tokens captured). Remaining Phase 6: claude /login (human, for full-ladder + non-zero claude_* tokens), REVISE→ACCEPT row, interactive baseline. v1.4 Phase 5 BLOCKED ON HUMAN (npm scope + NPM_TOKEN + git tag) — running in parallel, untouched. +last_updated: "2026-06-12T17:43:00.000Z" +last_activity: 2026-06-12 -- v1.5 Phase 6: review-verdict.json schema 400 fixed; codex-verifier degrade path E2E green (first ACCEPT) progress: - total_phases: 8 - completed_phases: 8 - total_plans: 17 - completed_plans: 18 - percent: 100 + total_phases: 7 + completed_phases: 6 + total_plans: 0 + completed_plans: 0 + percent: 86 --- # Project State @@ -30,7 +30,18 @@ Milestone: v1.4 Distribution — npm + MCP Phase: 0-4 complete; Phase 5 (publish) blocked on human items Status: npm scope verification + NPM_TOKEN secret + git tag are Alan's; everything else built and CI-gated. v1.2 SC-4 gate still open (VERIFY-SC4.md). Last activity: 2026-06-10 -- Phase 0 merges + LF policy + audit report -Working directories: `~/co-evolution-v13/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC + +Milestone: v1.5 Build with Codex — model ladder + orchestrated execution +Phase: 0-5 EXECUTED; Phase 6 (dogfood + evidence) PARTIAL +Status: Phases 0-5 shipped on feat/v1.5-codex-build (b672145 Phase 0 env + research · fb6862a Phase 1 seats + B1/B2/B3 fixes · ffe765f Phase 2 verifier hardening + preset · 13c2bee Phase 3 observability + status reader · fb965ad Phase 4 token capture · 30a9a00 Phase 5 /codex-build skill + docs). Phase 6 partial — see below. +Phase 6 progress (2026-06-12): + - MCP vendor parity GREEN: `bash mcp/scripts/vendor.sh` clean; `(cd mcp && npm test)` 4/4 pass. `mcp/vendor/` is gitignored (generated-at-publish via `npm run build:vendor`, NOT checked in) — Phase 1/3/4 lib changes were additive and broke nothing. + - First real `/codex-build` dogfood (slugify task, scratch repo under $TMPDIR, `--verifier codex` degrade since headless claude is logged out): execute phase SUCCEEDED (slugify landed, all 4 scratch tests pass), but verify phase ERRORED → runner exit 2, `verdict_present: false`, ESCALATE. codex_total_tokens=21497 (token capture works); claude_* totals=0 (ladder not exercised on the degrade path). One re-kick (`--parent-run`, lineage recorded) hit the same error. NO ACCEPT data point yet. + - Real bug found AND FIXED (2026-06-12): the documented `--verifier codex` degrade leaked the preset's `VERIFIER_MODEL=fable` into the codex seat (`apply_seat_env`, dev-review.sh:1370-1372) → codex on a ChatGPT account returned HTTP 400 "The 'fable' model is not supported". Fix = cross-agent leak guard in `apply_seat_env` + `resolve_seat_model_string` (drop a wrong-kind model+effort pair as a unit; codex seat falls back to `codex:(default)@(default)`). Sim scenario (h) added (preset-expansion-simulation.sh now 8/8; run-all 25/25 green). Re-run proof: fable 400 GONE, execute SUCCEEDED, scratch run-tests ALL PASS, codex_total_tokens=17237, wall 54s — BUT verdict still null: the verify seat now hit a SEPARATE, pre-existing schema 400 (`invalid_json_schema`: nested `issues.items` missing `additionalProperties:false` in skills/dev-review/schemas/review-verdict.json). Seat fix proven; degrade path was blocked one layer deeper by the schema bug. Detail in .planning/research/2026-06-12-token-evidence.md. + - Schema 400 FIXED (2026-06-12): OpenAI strict structured-output requires `additionalProperties:false` + a `required` list covering EVERY property on EVERY object node. `issues.items` was missing both; top-level `required` omitted `scope_creep_detected`/`iteration_notes`. Tightened all THREE canonical copies identically (schemas/, runners/codex-ps/schemas/, skills/dev-review/schemas/) so the drift guard stays green; shell `validate_review_verdict` is unaffected (stays loose, independent of this file). DEGRADE-PATH E2E NOW GREEN: real /codex-build (subtract-helper task, scratch repo under $TMPDIR, `--verifier codex`, --branch auto) → exit 0, verify OK, **verdict APPROVED conf 96**, verdict.json complete (all 6 strict fields), tokens execute=30606 verify=14485 codex_total=45091, wall 59s, scratch run-tests ALL 4 PASS. First full ACCEPT-path evidence row (degrade path). run-all 25/25 green. First ACCEPT data point logged in .planning/research/2026-06-12-token-evidence.md. + - Remaining Phase 6 (HUMAN + follow-up): `claude /login` on this Mac (unblocks full ladder + non-zero claude_* tokens; degrade-path claude_* are 0 by design); REVISE→ACCEPT row still owed (ESCALATE + ACCEPT now evidenced); an interactive `/dev-review` baseline for the 50%-claim denominator. +Last activity: 2026-06-12 -- Phase 6: schema 400 fixed (3 copies); degrade-path E2E green, first ACCEPT (.planning/research/2026-06-12-token-evidence.md) +Working directories: `~/Project/co-evolution/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC macOS baseline before v1.3 fixes: scorer-verification 11/14; code-proposer sim 1/16; pr-emitter sim 4/12; template-proposer sim 1/8; revise-loop sim aborts. Root causes: bash 3.2 (mapfile, source <(…)), BSD sed GNU-isms. Target after Phase 0.5: all green on macOS. diff --git a/.planning/research/2026-06-12-claude-json-envelope.md b/.planning/research/2026-06-12-claude-json-envelope.md new file mode 100644 index 0000000..766fa02 --- /dev/null +++ b/.planning/research/2026-06-12-claude-json-envelope.md @@ -0,0 +1,71 @@ +# R1: claude -p JSON Envelope — Field Reference + +**Date:** 2026-06-12 +**Phase:** v1.5 Phase 0 +**Purpose:** Pin the exact JSON output structure of `claude -p --output-format json` for Phase 4 token-capture parsing. + +## Command run + +``` +claude -p --output-format json --model claude-haiku-4-5-20251001 "Reply with exactly: PING" +``` + +## Verbatim output (not-logged-in state; all usage fields present and zero) + +```json +{"type":"result","subtype":"success","is_error":true,"api_error_status":null,"duration_ms":1103,"duration_api_ms":0,"num_turns":1,"result":"Not logged in · Please run /login","stop_reason":"stop_sequence","session_id":"4642f382-c299-4d72-bcaf-3e7bca396c7d","total_cost_usd":0,"usage":{"input_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":0,"server_tool_use":{"web_search_requests":0,"web_fetch_requests":0},"service_tier":"standard","cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"inference_geo":"","iterations":[],"speed":"standard"},"modelUsage":{},"permission_denials":[],"terminal_reason":"completed","fast_mode_state":"off","uuid":"def45115-84e3-4e92-a601-20e0dae05dda"} +``` + +Exit code: 1 (error because not logged in; envelope still emitted on stdout) + +## Top-level fields + +| Field | Type | Notes | +|---|---|---| +| `type` | string | Always `"result"` | +| `subtype` | string | `"success"` even on error | +| `is_error` | bool | `true` when auth fails or API error | +| `api_error_status` | null/int | HTTP status code on API errors; null here | +| `duration_ms` | int | Wall time in ms | +| `duration_api_ms` | int | API time in ms | +| `num_turns` | int | Number of conversation turns | +| `result` | string | The model's text output (or error message) | +| `stop_reason` | string | e.g. `"stop_sequence"`, `"end_turn"` | +| `session_id` | string | UUID | +| `total_cost_usd` | float | Total cost in USD (0 when not logged in) | +| `usage` | object | Token usage breakdown — see below | +| `modelUsage` | object | Per-model usage breakdown (empty when not logged in) | +| `permission_denials` | array | Tool permission denial events | +| `terminal_reason` | string | `"completed"` | +| `fast_mode_state` | string | `"off"` | +| `uuid` | string | Run UUID | + +## usage subfields + +| Field | Type | Notes | +|---|---|---| +| `input_tokens` | int | Prompt input tokens | +| `cache_creation_input_tokens` | int | Cache write tokens | +| `cache_read_input_tokens` | int | Cache hit tokens | +| `output_tokens` | int | Response tokens | +| `server_tool_use.web_search_requests` | int | Web search count | +| `server_tool_use.web_fetch_requests` | int | Web fetch count | +| `service_tier` | string | `"standard"` or `"priority"` | +| `cache_creation.ephemeral_1h_input_tokens` | int | 1-hour ephemeral cache write tokens | +| `cache_creation.ephemeral_5m_input_tokens` | int | 5-min ephemeral cache write tokens | +| `inference_geo` | string | Inference geography code | +| `iterations` | array | Per-iteration usage (for multi-turn) | +| `speed` | string | `"standard"` | + +## Notes for Phase 4 + +- All token fields live under `.usage`. Phase 4's `invoke_claude` gated JSON mode should extract: `.usage.input_tokens`, `.usage.output_tokens`, `.usage.cache_creation_input_tokens`, `.usage.cache_read_input_tokens`. +- `.total_cost_usd` is a direct top-level field, not nested. +- The envelope is always emitted to **stdout** even when `is_error=true`. +- The exit code is 1 on auth error; Phase 4 must handle non-zero exit with usable envelope (capture stdout regardless of exit code using `|| true`). +- **Limitation:** This capture is from a not-logged-in shell. When logged in, `modelUsage` will be populated and `iterations` may have per-turn breakdown. Field names are stable across auth state. + +## Auth status at capture time + +`claude whoami` returns: `Not logged in · Please run /login` +(The Mac's interactive Claude Code sessions authenticate through the Electron app, not this shell. The sub-agent shell does not carry the session token.) diff --git a/.planning/research/2026-06-12-codex-headless-facts.md b/.planning/research/2026-06-12-codex-headless-facts.md new file mode 100644 index 0000000..62ab02a --- /dev/null +++ b/.planning/research/2026-06-12-codex-headless-facts.md @@ -0,0 +1,207 @@ +# Codex Headless Research: Smoke, Token Line, Plugin Route + +**Date:** 2026-06-12 +**Phase:** v1.5 Phase 0 (T1, R2, T4) +**Purpose:** Pin environment facts needed by Phases 1–5. + +--- + +## T1 — Codex CLI symlink + smoke + +### Binary location + +``` +/Applications/Codex.app/Contents/Resources/codex +-rwxr-xr-x@ 1 shurafa staff 213429648 Jun 11 16:11 +``` + +### Symlink created + +``` +ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex +``` + +Result: `lrwxr-xr-x@ 1 shurafa staff 48 Jun 12 12:28 /Users/shurafa/.local/bin/codex -> /Applications/Codex.app/Contents/Resources/codex` + +### Version + +``` +codex-cli 0.140.0-alpha.2 +``` + +### Config (~/.codex/config.toml relevant fields) + +```toml +model = "gpt-5.5" +model_reasoning_effort = "xhigh" +service_tier = "priority" +``` + +### Headless smoke command + +```bash +mkdir -p "$TMPDIR/codex-smoke" && cd "$TMPDIR/codex-smoke" && \ + ~/.local/bin/codex exec --full-auto --skip-git-repo-check \ + "Reply with exactly: SMOKE-OK" > out.txt 2> err.txt; echo "exit=$?" +``` + +**Exit code: 0** + +### stdout (out.txt — verbatim) + +``` +SMOKE-OK +``` + +### stderr (err.txt — verbatim) + +``` +warning: `--full-auto` is deprecated; use `--sandbox workspace-write` instead. +Reading additional input from stdin... +OpenAI Codex v0.140.0-alpha.2 +-------- +workdir: /private/var/folders/20/vl4p6fsd2dsbf5r68611c6nc0000gn/T/codex-smoke +model: gpt-5.5 +provider: openai +approval: never +sandbox: workspace-write [workdir, /tmp, $TMPDIR] +reasoning effort: xhigh +session id: 019ebcaa-5766-7c60-beb8-824b921d69a1 +-------- +user +Reply with exactly: SMOKE-OK +codex +SMOKE-OK +tokens used +11,681 +``` + +**Conclusion:** Auth works headless. gpt-5.5 + xhigh accepted. ChatGPT-plan tier confirmed (`service_tier = "priority"` in config). + +--- + +## R2 — Codex end-of-run token usage line (verbatim + stream) + +**Stream:** stderr +**Exact lines (verbatim, from smoke run above):** + +``` +tokens used +11,681 +``` + +Two lines: the label `tokens used` on one line, then the formatted integer with comma separator on the next line. + +**Format for Phase 4 grep/parse:** +The token count follows the literal line `tokens used` (no colon) as the immediately subsequent line. Parse with: +```bash +grep -A1 "^tokens used$" err.txt | tail -1 +``` +or equivalently with `awk '/^tokens used$/{getline; print}'`. +The number uses comma formatting (e.g., `11,681`); strip commas before arithmetic: `tr -d ','`. + +**Note on deprecation:** `--full-auto` is deprecated in favor of `--sandbox workspace-write`. Both flags produce the same token output format. The runner may want to migrate to `--sandbox workspace-write` in Phase 1. + +--- + +## T4 — Plugin install route + +### `claude plugin --help` subcommands (relevant) + +``` +claude plugin install|i [options] + Install a plugin from available marketplaces + Options: + --config Set a userConfig option (repeatable) + -s, --scope user, project, or local (default: "user") + +claude plugin marketplace [options] [command] + Commands: + add [options] Add a marketplace from a URL, path, or GitHub repo + list [options] List all configured marketplaces + remove|rm [options] Remove a configured marketplace + update [options] [name] Update marketplace(s) +``` + +### Non-interactive install route (confirmed working) + +Step 1 — Add the marketplace: +```bash +claude plugin marketplace add openai/codex-plugin-cc +``` +Output: +``` +Adding marketplace…SSH not configured, cloning via HTTPS: https://github.com/openai/codex-plugin-cc.git +Refreshing marketplace cache (timeout: 120s)… +Cloning repository (timeout: 120s): https://github.com/openai/codex-plugin-cc.git +Clone complete, validating marketplace… +Cleaning up old marketplace cache… +✔ Successfully added marketplace: openai-codex (declared in user settings) +``` + +Step 2 — Install the plugin: +```bash +claude plugin install codex@openai-codex +``` +Output: +``` +Installing plugin "codex@openai-codex"...✔ Successfully installed plugin: codex@openai-codex (scope: user) +``` + +### Verification (`claude plugin list`) + +``` +Installed plugins: + + ❯ codex@openai-codex + Version: 1.0.4 + Scope: user + Status: ✔ enabled +``` + +### Plugin details (component inventory) + +``` +codex 1.0.4 + Use Codex from Claude Code to review code or delegate tasks. + Source: codex@openai-codex + +Component inventory + Skills (10) adversarial-review, cancel, codex-cli-runtime, codex-result-handling, + gpt-5-4-prompting, rescue, result, review, setup, status + Agents (1) codex-rescue + Hooks (3) SessionStart, SessionEnd, Stop (harness-only — no model context cost) + MCP servers (0) + LSP servers (0) + +Projected token cost + Always-on: ~306 tok added to every session +``` + +**Key skills for v1.5:** `/codex:rescue` (delegate tasks to Codex), `/codex:status` (check job status), `/codex:review` (adversarial review), `/codex:cancel` (cancel running job). + +### Conclusion + +A fully **non-interactive** install route exists and works. No `/plugin` UI required. The plan's note about "if it requires the interactive /plugin UI" is moot — `claude plugin marketplace add` + `claude plugin install` suffice. The install persists at user scope. + +### Interactive equivalent (for reference only, not needed) + +Would be: open Claude Code, type `/plugin marketplace add openai/codex-plugin-cc`, then `/plugin install codex`. The CLI route above is strictly equivalent. + +--- + +## Summary + +| Item | Fact | +|---|---| +| codex binary | `/Applications/Codex.app/Contents/Resources/codex` (213 MB, Jun 11) | +| codex version | 0.140.0-alpha.2 | +| symlink | `~/.local/bin/codex` → binary (created 2026-06-12) | +| smoke exit | 0 | +| smoke stdout | `SMOKE-OK` | +| token line stream | **stderr** | +| token line format | two lines: `tokens used` then `11,681` (comma-formatted integer) | +| `--full-auto` | deprecated; equivalent `--sandbox workspace-write` | +| plugin install | non-interactive, fully CLI-driven | +| plugin name | `codex@openai-codex` v1.0.4, user scope, enabled | +| plugin skills | rescue, status, review, cancel, adversarial-review (+ 5 more) | diff --git a/.planning/research/2026-06-12-token-evidence.md b/.planning/research/2026-06-12-token-evidence.md new file mode 100644 index 0000000..34cd05a --- /dev/null +++ b/.planning/research/2026-06-12-token-evidence.md @@ -0,0 +1,294 @@ +# Token Evidence: the "50% Claude-limit reduction" claim + +**Date:** 2026-06-12 +**Phase:** v1.5 Phase 6 (dogfood + evidence) +**Purpose:** Establish the measurement method for the post's "~50% weekly +Claude-limit reduction" claim, record the first real `/codex-build` data point, +and enumerate what is still needed before a verdict on the claim can be drawn. + +--- + +## Measurement method + +The claim is that splitting roles (Claude plans/reviews, Codex executes) cuts +weekly Claude usage roughly in half versus Claude doing the whole loop. We +measure two distinct spend surfaces, because they live in different places: + +1. **Runner-side agent tokens** — captured per phase into `state.json.tokens` + when `CO_EVOLVE_TOKEN_CAPTURE=1` (Phase 4). Claude seats land under + `.tokens.totals.claude_*` (from the `claude -p --output-format json` envelope, + field map in `2026-06-12-claude-json-envelope.md`); Codex seats land under + `.tokens.totals.codex_total_tokens` (harvested from the `tokens used` stderr + line, format pinned in `2026-06-12-codex-headless-facts.md` R2). This is the + part the runner can see and attribute by seat. + +2. **Orchestrator-session spend** — the Fable/Claude Code session that plans and + runs the review gate. The harness does **not** expose a token meter to the + session, so this is visible only as **turn counts**: how many model turns the + orchestrator burned (plan + kick + each wake/review gate). Under `/codex-build` + the session ends its turn at KICK and is woken on exit, so the orchestrator + pays for plan + review gates only — not for watching Codex grind. + +**The comparison** is `/codex-build` (Codex executes, Claude only plans+reviews) +vs. an **interactive `/dev-review` baseline** on the *same task* (Claude in more +seats, session supervises inline every pass). The 50% question is whether the +Claude-attributable spend — runner-side `claude_*` totals **plus** orchestrator +turn count — drops by roughly half. Codex tokens are "free" against the +ChatGPT-plan quota and so are reported but excluded from the Claude-limit math. + +--- + +## First real data point (T2 dogfood, 2026-06-12) + +**Task:** add a `slugify` function to `utils.sh` (lowercase; spaces/underscores → +hyphens; strip other punctuation) + two test cases in `run-tests.sh`. Honest +scratch repo under `$TMPDIR`, clean tree, `--branch auto`, `--workdir` on the +scratch repo, `--run-dir` outside the main repo. Auth-degraded path +(`--verifier codex`) because the headless `claude` shell on this Mac is +**not logged in** (expected; the in-app session token does not reach sub-shells). + +Exact kick (seats from `--preset codex-build`): + +```bash +CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \ + --preset codex-build --skip-plan --plan "$PLAN" \ + --verifier codex --branch auto --workdir "$SCRATCH" \ + --run-dir "$TMPDIR/codex-build-dogfood-run" --timeout 900 \ + -- "Add a slugify function to utils.sh and two test cases to run-tests.sh" +``` + +### Result: execute SUCCEEDED, verify ERRORED (no verdict) — NOT an ACCEPT + +| Signal | Value | +|---|---| +| Runner exit code | 2 (partial) | +| Wall clock | 65 s (run 1), 61 s (run 2 re-kick) | +| Status reader `status` / `verdict` / `verdict_present` | `partial` / `null` / `false` | +| Execute phase | `ok`, exit 0 — slugify landed | +| Verify phase | `error`, exit 2 — codex verifier returned an error payload | +| Diffstat | `utils.sh +7`, `run-tests.sh +2`, 2 files, 9 insertions | +| Scratch tests after execute | **ALL 4 PASS** (`bash run-tests.sh` → `ALL TESTS PASSED`) | + +The executor's work was **correct** — the implemented `slugify` passes both new +assertions and the two pre-existing ones. The run did not reach ACCEPT only +because the **verify seat could not produce a verdict**. + +### Tokens block (verbatim, run 1 — `jq '.tokens' state.json`) + +```json +{ + "phases": { + "execute": { + "source": "codex-stderr", + "total_tokens": 21497 + } + }, + "totals": { + "claude_input": 0, + "claude_output": 0, + "claude_cache_read": 0, + "claude_cost_usd": 0, + "codex_total_tokens": 21497 + } +} +``` + +Run 2 (re-kick, `--parent-run dev-review-20260612-133912`, lineage recorded) +re-ran execute and reported `codex_total_tokens: 34935`; verify failed the same +way. Both runs: `claude_*` totals are **0** — by design under the codex-verifier +degrade, the only Claude work was orchestrator-side (this session), and the +headless claude seats were never invoked. + +### Seat models actually used (verbatim) + +```json +{ + "composer": "opus:claude-fable-5@high", + "executor": "codex:(default)@xhigh", + "verifier": "codex:fable@max" +} +``` + +--- + +## Finding: the `--verifier codex` degrade is broken on a ChatGPT-plan Codex + +Root cause, fully diagnosed (this is a real runner bug, not a plan or env fault): + +- The `codex-build` preset hard-fills `VERIFIER_MODEL=fable` (correct for the + **default** Claude verifier seat — Fable reviews). +- When the documented degrade `--verifier codex` flips the verifier *agent* to + codex, `apply_seat_env verifier codex` exports `CODEX_MODEL=fable` + (`dev-review.sh:1370-1372`: `export CODEX_MODEL="${model:-...}"` where + `model` = `VERIFIER_MODEL` = `fable`). The verdict is the seat model string + `codex:fable@max`. +- Codex on a **ChatGPT account** rejects an unknown model with HTTP 400: + + ``` + ERROR: {"type":"error","status":400,"error":{"type":"invalid_request_error", + "message":"The 'fable' model is not supported when using Codex with a ChatGPT account."}} + ``` + +So the skill's own documented auth-degrade (kick with `--verifier codex` when the +headless claude seat is logged out — the *exact* situation on this Mac) currently +**cannot produce a verdict**: the Claude model name leaks into the Codex seat. +`VERIFIER_MODEL=` (empty) does not fix it — the preset's `:=fable` refills an +empty value. A real fix is a runner change (clear/override `VERIFIER_MODEL` when +the verifier resolves to codex, or pass a codex-valid model), out of scope for +this evidence pass per Phase 6 ground rules. Per the skill's gate logic a missing +verdict is an **ESCALATE**, never an auto-merge — the gate behaved correctly. + +> **Update 2026-06-12 (v1.5 fix):** degrade-path verifier model leak FIXED in +> `apply_seat_env` / `resolve_seat_model_string` (cross-agent leak guard drops a +> wrong-kind model+effort pair as a unit; codex seat falls back to +> `codex:(default)@(default)` = gpt-5.5/xhigh). Re-run proof: the `fable` HTTP 400 +> is gone, execute SUCCEEDED, scratch `run-tests.sh` ALL PASS, +> `codex_total_tokens: 17237`, wall 54 s — but verdict is still `null` because the +> verify seat now hits a SEPARATE, pre-existing schema 400 (`invalid_json_schema`: +> `additionalProperties` required `false` on `issues.items` in +> `skills/dev-review/schemas/review-verdict.json`). Seat fix proven; degrade path +> still blocked one layer deeper by the schema bug (tracked separately). + +> **Update 2026-06-12 (v1.5 schema fix — FULL ACCEPT path now green):** the +> schema 400 above is FIXED. OpenAI strict structured-output requires +> `additionalProperties: false` and a `required` list covering every property on +> every object node; `issues.items` was missing both, and the top-level `required` +> omitted `scope_creep_detected` / `iteration_notes`. All three canonical copies +> (`schemas/`, `runners/codex-ps/schemas/`, `skills/dev-review/schemas/`) were +> tightened identically (drift guard stays green). The shell `validate_review_verdict` +> gate is unaffected — it stays loose and is independent of this file. With both +> the seat leak and the schema 400 fixed, the degrade path (`--verifier codex`) +> reaches a **real verdict end-to-end** for the first time. See the data point +> immediately below. + +--- + +## First COMPLETE-RUN data point — codex-verifier ACCEPT (2026-06-12) + +**This is the first full ACCEPT-path evidence row.** Same harness as T2 (honest +scratch repo under `$TMPDIR`, plan outside it, clean tree, `--branch auto`, +`--workdir` on the scratch repo, `--run-dir` outside the main repo), auth-degraded +`--verifier codex` seat (headless `claude` still logged out on this Mac). The only +difference from T2 is the schema fix — so this isolates the schema 400 as the last +blocker on the degrade path. + +**Task:** add a `subtract(a,b)` helper to `utils.sh` (echo `a-b`) + two assertions +in `run-tests.sh`. + +Exact kick (seats from `--preset codex-build`): + +```bash +CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \ + --preset codex-build --skip-plan --plan "$PLAN" \ + --verifier codex --branch auto --workdir "$SCRATCH" \ + --run-dir "$TMPDIR/codex-build-schema-proof" --timeout 900 \ + -- "Add a subtract(a,b) helper to utils.sh that echoes a-b, and add two assertions for it to run-tests.sh" +``` + +### Result: ACCEPT — execute OK, verify OK, real verdict produced + +| Signal | Value | +|---|---| +| Runner exit code | **0** | +| Wall clock | **59 s** | +| Status reader `status` / `verdict` | `completed` / `APPROVED` | +| Execute phase | `ok`, exit 0 — `subtract()` landed | +| Verify phase | `ok`, exit 0 — codex `--output-schema` returned a valid verdict (no schema 400) | +| Diffstat | `utils.sh +4`, `run-tests.sh +2`, 2 files, 6 insertions | +| Scratch tests after execute | **ALL 4 PASS** (`bash run-tests.sh` → `ALL TESTS PASSED`) | + +### Verdict (verbatim — `verdict.json`) + +```json +{"verdict":"APPROVED","confidence":96,"summary":"Implementation matches the plan: `utils.sh` adds a `subtract()` helper in the same style as `add()`, `run-tests.sh` includes the two requested assertions, and `bash run-tests.sh` passes all checks. No logic, style, security, or scope issues found for the requested change.","issues":[],"scope_creep_detected":false,"iteration_notes":"No changes needed."} +``` + +The verdict carries all six strict-mode fields (including `scope_creep_detected` +and `iteration_notes`), proving the tightened schema round-trips through Codex. + +### Tokens block (verbatim — `jq '.tokens' state.json`) + +```json +{ + "phases": { + "execute": { + "source": "codex-stderr", + "total_tokens": 30606 + }, + "verify": { + "source": "codex-stderr", + "total_tokens": 14485 + } + }, + "totals": { + "claude_input": 0, + "claude_output": 0, + "claude_cache_read": 0, + "claude_cost_usd": 0, + "codex_total_tokens": 45091 + } +} +``` + +Note this is the first row with a populated **`verify` phase** token entry +(14,485) — T2's verify never produced one because it errored before emitting a +`tokens used` line. `claude_*` totals remain **0** by design under the +codex-verifier degrade (no headless Claude seat was invoked); the Claude-limit +numerator still requires `claude /login` (see below). Codex tokens +(45,091) are excluded from the Claude-reduction math. + +### Real-task matrix progress + +T2 evidenced the **ESCALATE** row (missing verdict). This run evidences the +**ACCEPT** row for real on the degrade path. **REVISE→ACCEPT** and the +full-ladder (Fable-verifier, non-zero `claude_*`) ACCEPT remain — both still +gated on `claude /login`. + +--- + +## What is still needed for a 50%-claim verdict + +1. **`claude /login` on this Mac (human step).** The headless `claude` shell is + not logged in, so the full ladder — and any non-zero `claude_*` token data — + is unreachable here. Until then every runner-side `claude_*` total reads 0 and + the Claude-limit math has no numerator. This is the single biggest gap. +2. **A real ACCEPT data point.** ~~T2 reached execute-OK but verify-error.~~ + **DONE on the degrade path (2026-06-12, schema fix):** the subtract-helper run + reached a clean `APPROVED` verdict (`verdict.json` present, exit 0) — see the + COMPLETE-RUN data point above. Still owed: a **full-ladder** ACCEPT with + non-zero `claude_*` tokens (the degrade path's `claude_*` totals are 0 by + design; that gap is item 1, `claude /login`). +3. **The real-task matrix (Phase 6 design):** ACCEPT, REVISE→ACCEPT, and + ESCALATE each exercised once on real tasks. T2 produced an **ESCALATE** path + for real (missing verdict); the subtract run produced an **ACCEPT** for real + (degrade path). **REVISE→ACCEPT** remains. +4. **An interactive `/dev-review` baseline on the same task** — the denominator + for the comparison. Without it there is no "half of what?" to measure against. +5. **Orchestrator turn-count instrumentation.** Harness-side session spend is + visible only as turn counts (see Honest scope). For this orchestration the + count is: plan + 1 kick + 1 re-kick + this gate ≈ 4 orchestrator turns; + a clean ACCEPT would be plan + kick + 1 gate ≈ 3. + +## Honest scope + +- **Runner-side tokens are attributable; orchestrator-side spend is not metered.** + The harness gives the session no token meter — only turn counts. Any 50% figure + is therefore "runner-side Claude tokens (exact) + orchestrator turns (proxy)", + not a single dollar number. State this whenever the claim is quoted. +- **Codex tokens do not count against the Claude limit.** They are reported + (`codex_total_tokens`) for completeness and to size the ChatGPT-plan quota + draw, but excluded from the Claude-reduction numerator. +- **This data point is auth-degraded.** It measures the codex-only-verify path, + which is the fallback, not the blessed Fable-verifier ladder. Treat the + `claude_* = 0` totals as "ladder not exercised here," not as "Claude spend was + zero for this kind of task." + +## Artifacts + +| Item | Path | +|---|---| +| Run 1 dir (ACCEPT attempt) | `$TMPDIR/codex-build-dogfood-run/` | +| Run 2 dir (re-kick, parent-linked) | `$TMPDIR/codex-build-dogfood-run-r2/` | +| Plan file | `$TMPDIR/codex-build-plan-.md` | +| Scratch repo | `$TMPDIR/codex-build-dogfood-/` | +| Verifier error log | `/review-stderr.log` (the HTTP-400 fable rejection) | diff --git a/.planning/v1.5-DESIGN.md b/.planning/v1.5-DESIGN.md new file mode 100644 index 0000000..9d661b9 --- /dev/null +++ b/.planning/v1.5-DESIGN.md @@ -0,0 +1,119 @@ +# v1.5 "Build with Codex" — model ladder + Fable-orchestrated execution + + +## Context: the evaluation + +**The post** ([@cjzafir, 2026-06-11](https://x.com/cjzafir/status/2065104422762684745), ~98k views): cut weekly Claude Code limits ~50% by installing OpenAI's official [codex-plugin-cc](https://github.com/openai/codex-plugin-cc) (20.7k stars, Apache 2.0) and splitting roles — **Fable 5 high plans, Codex 5.5 xhigh executes on the ChatGPT plan (no API), Fable 5 max reviews and gates**. "Fable never writes the code, Codex never decides the design." Mechanically the plugin's `codex-rescue` agent shells out to `codex exec --full-auto` — **the exact invocation our dev-review runner already uses**. + +**Verdict: adopt.** dev-review already implements ~80% of this (compose → bounce → execute → verify, Codex executor, review-verdict.json gate, revise loop, worktree isolation). What's genuinely missing: per-seat model/effort selection, a blessed preset, detached execution with gate-based check-ins (the user's "Fable orchestrates, checks in periodically" vision), and token instrumentation to test the 50% claim. + +**Where we improve on the post:** +1. **Bounce the brief** — Codex critiques the plan with `[CONTESTED]`/`[CLARIFY]` markers before execution (post has no plan hardening). +2. **Schema-bound verdict gate with bounded convergence** — APPROVED/REVISE + confidence + issues[], max 2 orchestrator revise rounds (mirrors the 2-pass marker-expiry rule). The post's "cycle repeats until it passes" has no termination guarantee. +3. **Detached execution, not babysitting** — the post keeps a Fable session supervising inline, burning cache reads every turn (against the token policy). We kick the runner via harness background Bash, **end the turn**, and get woken on exit — Fable pays only for plan + review gates. +4. **Worktree/branch isolation** (`--worktree auto` exists) — post runs Codex loose in the repo. +5. **Evidence, not vibes** — per-phase token capture into state.json so runs measure whether the 50% claim holds here. + +**Environment discoveries (this Mac):** +- Codex.app installed, `~/.codex/config.toml` already `model = "gpt-5.5"`, `model_reasoning_effort = "xhigh"`, fresh auth.json; CLI binary at `/Applications/Codex.app/Contents/Resources/codex` (not on PATH — symlink needed). +- claude CLI 2.1.162; `--effort` proven headless in-repo (`evals/judge-bounce.sh:139` uses `--effort high` with `claude-fable-5`). +- `~/Project/co-evolution` clone current at 215dc83, clean. +- **3 latent bugs found** (must fix; they break per-seat config under the `timeout bash -c` process boundary): **B1** `--model` sets `CODEX_MODEL` without `export` (dev-review.sh:976) so it never reaches `invoke_codex`; **B2** `invoke_codex_schema` defined in dev-review.sh but called in a child that sources only lib → exit 127 when verifier=codex; **B3** `WORKDIR` never exported → agents fall back to launch cwd. + +**User decisions:** full v1.5 scope · both transports (own runner core + plugin path) · skill named `/codex-build` · Mac codex setup yes. + +## Ground rules + +- Implement in **`~/Project/co-evolution`** (GitHub-first; the SMB mount `/Volumes/Project/co-evolution` is read-only/sync-only). Branch `feat/v1.5-codex-build`; push when sessions end. +- Register milestone **v1.5** per `.planning/` conventions (STATE.md, ROADMAP.md, `phases/NN-*/`). v1.4 Phase 5 (npm publish) stays human-blocked in parallel — untouched. +- All new knobs default empty/off → **byte-parity** with today's argv and artifacts (existing sims must stay green). +- lib/co-evolution.sh changes must stay additive (it's vendored into mcp/; re-run `vendor.sh` + mcp tests to prove parity). + +## Phases + +### Phase 0 — Environment + research spikes +- `ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex`; `codex --version`; headless smoke `codex exec --full-auto` in a scratch dir (confirms ChatGPT-plan auth + gpt-5.5/xhigh accepted). +- Install the plugin in Claude Code: `/plugin marketplace add openai/codex-plugin-cc` → install `codex` → smoke `/codex:rescue` + `/codex:status`. +- R1: pin `claude -p --output-format json` envelope fields (usage, total_cost_usd). R2: pin codex end-of-run token line format on stderr. R3: confirm harness background-exit notification with a ~5-min stub run. +- Output: `.planning/research/` notes + v1.5 milestone registration. + +### Phase 1 — Seat plumbing + env-export correctness (runner) +- `lib/co-evolution.sh`: new empty-default knobs `CLAUDE_EFFORT` (→ `--effort` in `invoke_claude`) and `CODEX_REASONING_EFFORT` (→ `-c model_reasoning_effort=` in `invoke_codex`); **move `invoke_codex_schema` into lib** (fixes B2). +- `dev-review/codex/dev-review.sh`: `export CODEX_MODEL` (B1) + `export WORKDIR` (B3); new flags `--verifier opus|claude|codex`, `--claude-model` (+ `fable` → `claude-fable-5` alias); per-seat env `COMPOSER_/EXECUTOR_/VERIFIER_MODEL|EFFORT` via `apply_seat_env` called at the 5 invocation sites (compose, bounce both turns, execute, verify); `select_verifier` honors `VERIFIER_OVERRIDE`. +- Gate: `tests/run-all.sh` green; default invocation argv byte-identical. + +### Phase 2 — Claude-verifier hardening + `--preset codex-build` +- Verify phase: brace-block extraction fallback for prose-wrapped verdicts; persist normalized verdict back to the contract path `verdict.json` (no-op for codex `--output-schema`). +- `apply_preset codex-build`: composer=claude(fable, high) · executor=codex(model unpinned — user's codex config rules, effort xhigh) · verifier=claude(fable, max) · `--verify` on · bounces=2 · revise-loop=1. Banner prints seats; optional `seat_models` state field; RUNNER-CONTRACT.md → v1.1 (additive rows). +- New `tests/preset-expansion-simulation.sh` (PATH-stub claude/codex, assert seat argv incl. `--effort`/`-c model_reasoning_effort=xhigh`, preset expansion, B1/B2/B3 regressions, fenced-JSON verdict parse). + +### Phase 3 — Runner observability + status reader +- state.json additions (single-writer, additive): `current_phase {name, started_at}` written at phase **start** (null at EOF), `runner_pid`, `pre/post_execute_sha`, `orchestration.parent_run_id` via new `--parent-run` flag (lineage across revise re-kicks; re-kicks always get fresh run dirs — no `--resume` in v1.5). +- New `dev-review/codex/dev-review-status.sh` (~120 lines, read-only): run summary, phase progress, heartbeat = stderr-log mtime/size (BSD/GNU `stat` feature-detect), marker counts, verdict, diffstat, dead-runner detection. Exit codes: 0 done / 2 partial / 5 running / 4 presumed-dead / 3 no-run; `--json` mode. This is the **progress-file protocol** any fresh Fable session, Haiku watcher, or human reads. +- New `tests/status-reader-simulation.sh` (synthetic run dirs) + lifecycle sim. + +### Phase 4 — Token capture (`CO_EVOLVE_TOKEN_CAPTURE=1`, default off) +- `invoke_claude`: gated `--output-format json` mode — `.result` → output file (contract preserved), usage → `outputs/usage-.json` sidecar. +- Codex: harvest the token line from already-captured per-phase stderr logs (format pinned in R2). +- `collect_token_usage` at EOF → `state.json.tokens {phases, totals}`; scorer copies totals into report details (informational). New `tests/token-capture-simulation.sh`; parity guard when off. + +### Phase 5 — `/codex-build` skill + docs (both transports) +- **New `skills/codex-build/SKILL.md`** (~250 lines) — the orchestration loop: + 1. PREFLIGHT: codex present, no other active run (`dev-review-status.sh --list`), clean tree → `--branch auto`, dirty → `--worktree auto` (dirty tree otherwise makes verify silently skip). + 2. PLAN in-session (Fable = composer; plan file in `$TMPDIR`, **never** inside the workdir, content embedded inline per repo convention) + optional ≤2 synchronous bounce passes; resolve markers. + 3. KICK: one Bash call, `run_in_background: true`: `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --skip-plan --plan $PLAN --preset codex-build [--parent-run ] -- ""` → report run id → **END TURN** (no polling, no sleep; harness wakes the session on exit). + 4. WAKE → REVIEW GATE: status script → read verdict.json + diffstat first, targeted diff hunks only → **ACCEPT** (report branch/diffstat/tokens, suggest merge/PR) / **REVISE** (append "Reviewer Corrections (Round N)" to a copy of the plan, re-kick with `--parent-run`; **max 2 rounds**) / **ESCALATE** (missing verdict, exit 1, dead runner, rounds exhausted, scope_creep_detected — never auto-merge). + 5. Budget rules + troubleshooting (incl. `pkill -f 'codex exec'` for orphans; optional one-shot scheduled check for very long runs). +- **Plugin transport section** (second supported path, interactive flavor): when mid-session or for ad-hoc rescue work, delegate via `codex-rescue` / `/codex:rescue --background`, check `/codex:status`, then Fable reviews using the same review template + verdict JSON contract and the same ≤2-round cap. Documented protocol sharing our templates/schema; not used by evals/CI (no `--output-schema` contract through the plugin). +- Docs: `CLAUDE.md` Default Rule gains the third option ("build with codex" — Fable plans/reviews, Codex executes); `dev-review/codex/instructions.md` routing rows (runner preset + plugin alternative); `skills/dev-review/SKILL.md` preset mention, `$CODEX_MODEL`/`-c` args gap fix, live-mode limitation note; sync-token grep test pinning the preset table across files. + +### Phase 6 — Dogfood + evidence +- 2–3 real tasks through `/codex-build` on the Mac (now codex-capable); exercise ACCEPT, REVISE→ACCEPT, and ESCALATE paths for real; tune defaults. +- **Evidence note** on the 50% claim: runner-side tokens (state.json) + Fable session turn counts vs. an interactive `/dev-review` baseline on the same task. Honest scope: harness-side session spend visible only as turn counts. +- Re-run `mcp/scripts/vendor.sh` + `(cd mcp && npm test)` (lib changed); update auto-memory with the outcome. + +## Critical files + +| File | Change | +|---|---| +| `dev-review/codex/dev-review.sh` | B1/B3 exports, seat flags/env, preset, `--parent-run`, phase-start state writes, EOF token collect | +| `lib/co-evolution.sh` | effort knobs, `invoke_codex_schema` relocation (B2), `begin_state_phase`, gated JSON mode, `collect_token_usage` | +| `dev-review/codex/dev-review-status.sh` | **new** status reader | +| `skills/codex-build/SKILL.md` | **new** orchestration skill (both transports) | +| `skills/dev-review/SKILL.md`, `CLAUDE.md`, `dev-review/codex/instructions.md` | docs/routing | +| `evals/RUNNER-CONTRACT.md` | v1.1 additive rows (`seat_models`, `current_phase`, `runner_pid`, `tokens`, lineage) | +| `tests/preset-expansion-simulation.sh`, `tests/status-reader-simulation.sh`, `tests/token-capture-simulation.sh` | **new** hermetic sims | +| `.planning/` (STATE.md, ROADMAP.md, `phases/`, `research/`) | v1.5 registration + R1–R3 notes | + +## Verification + +1. **Hermetic (Mac, no real agents):** `bash -n` on touched scripts; `bash tests/run-all.sh` — all existing + 3 new sims green; byte-parity assertions with knobs off. +2. **Claude-only live dry run:** scratch repo, all three seats = claude with `VERIFIER_MODEL=fable VERIFIER_EFFORT=max` — proves `--effort max` headless and the claude-verdict path end-to-end (fallback: `high`). +3. **Codex smoke (Mac, post-symlink):** headless `codex exec` + one full `--preset codex-build` run on a toy repo, incl. one forced REVISE round. +4. **Orchestrated loop:** kick via background Bash, end turn, confirm wake-on-exit + status script output; plugin-transport smoke via `/codex:rescue`. +5. **MCP parity:** `vendor.sh` + `cd mcp && npm test`. +6. **Dogfood matrix (Phase 6):** ACCEPT / REVISE→ACCEPT / ESCALATE each exercised once; token evidence note written. + +## Execution strategy — sub-agent deployment + +This session (Fable) acts as orchestrator only: it briefs one executor sub-agent per phase, reviews the diff + test output at each gate, and never writes the code itself. Each agent gets the plan file path, its phase section, and the repo conventions; context stays small in the main session. + +| Phase | Agent model | Rationale | +|---|---|---| +| 0 — env setup, R1–R3 spikes, v1.5 `.planning` registration | **Sonnet** | Mechanical: symlink, smoke commands, record outputs, templated planning docs. Plugin install (`/plugin`) may need the main session or user if no CLI route. | +| 1 — seat plumbing + B1/B2/B3 fixes | **Opus** | Careful surgery in a 1471-line bash script with byte-parity invariants. | +| 2 — verifier hardening + preset + sim | **Opus** | Verdict-path correctness + new hermetic simulation. | +| 3 — observability + status reader + sims | **Opus** | state.json single-writer weaving is delicate. | +| 4 — token capture | **Opus** | Gated invoke_claude changes against the vendored lib. | +| 5 — `/codex-build` skill + docs | **Opus** | Spec is fully written in this plan; converting it to SKILL.md prose + doc routing is execution, not design. | +| 6 — dogfood + evidence | **Fable gates, Opus drafts** | Real runs go through the runner itself; Fable (main session) reviews verdicts/evidence; an Opus agent drafts the evidence note. | + +Phases run **sequentially** (1→2 and 3→4 edit the same files; parallelism would conflict). Gate protocol between phases: executor reports diff summary + test results → Fable spot-checks the diff → commit on the `feat/v1.5-codex-build` branch → next phase. All work in `~/Project/co-evolution`, never the SMB mount. + +## Risks + +- `--effort max` on claude CLI unverified (in-repo proof exists only for `high`) → dry-run step 2 resolves; one-constant fallback. +- `xhigh` × model mismatch in codex → surfaced by existing stderr/error-payload gates; `EXECUTOR_EFFORT` override. +- Seat env is process-global → every call site re-applies; sim pins all 5 sites. +- Plugin drift (OpenAI-owned formats) → plugin path is prose-level protocol only, pinned to plugin major version in docs; runner path carries CI. +- Missed wake notification (machine sleep) → status script's dead-runner detection + optional one-shot scheduled check. diff --git a/AGENTS.md b/AGENTS.md index ee53f61..ffe05de 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -17,6 +17,8 @@ Full spec: [`BOUNCE-PROTOCOL.md`](BOUNCE-PROTOCOL.md). Reference implementation: in the general-purpose co-evolution workflow for questions, drafts, plans, specs, arguments, and markdown refinement - If you are invoked via `/dev-review` (Claude Code) or `bash dev-review/codex/dev-review.sh` (Codex), you are inside the bounce pipeline — your role (reviewer / composer) and pass number are passed to you in the prompt template +- If you are invoked via `/codex-build`, you are orchestrating a detached Codex build: the session plans and reviews, Codex executes in the background under `--preset codex-build`, and the session is woken at gates (see `skills/codex-build/`) +- If you are orchestrating via `--preset claude-build`, Codex plans and reviews while Claude executes the build synchronously; this path hard-requires `claude` CLI auth because Claude is the executor (see `dev-review/codex/claude-build.md`) - If you are invoked via `bash agent-bouncer/agent-bouncer.sh `, you are bouncing a single markdown document — same protocol applies - If you are exploring the repo directly without an explicit role, treat the protocol section above as orientation, then read the GSD-managed sections below for project meta diff --git a/CLAUDE.md b/CLAUDE.md index 8cdfe98..ce09d4e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -5,9 +5,17 @@ agents using structured `[CONTESTED]` / `[CLARIFY]` markers until convergence. ## Default Rule -Use general co-evolution by default. Reach for `dev-review` only when the user -specifically wants repo files changed, a bug fixed, a feature implemented, or a -code diff verified against a plan. +Three routes, lowest-ceremony first: + +- **co-evolution** (default) — bounce protocol for questions, drafts, plans, + specs, arguments, and markdown refinement. +- **dev-review** — interactive code pipeline (compose → bounce → execute → + verify in one session) when the user wants repo files changed, a bug fixed, a + feature implemented, or a code diff verified against a plan. +- **codex-build** — "build with codex": the session plans and reviews, Codex + executes detached, with check-ins only at gates. Reach for this when the user + wants Codex to grind on a build in the background while the session stays free + (the model-ladder flavor of dev-review). ## Components @@ -54,6 +62,17 @@ Code-focused compose-bounce-execute-verify pipeline integrated with Claude Code. Key flags: `--skip-plan` executes a pre-existing plan, `--plan-only` stops after bounce, and `--live` opens visible Windows terminals for Codex passes. +### Codex-Build Skill (`skills/codex-build/`) + +Detached orchestration skill: the session (typically Opus) plans and reviews, +Codex executes in the background, and the session is woken at gates instead of +babysitting. Kicks `dev-review.sh --preset codex-build` via a background task, +ends the turn, then runs a schema-bound ACCEPT / REVISE / ESCALATE gate on wake. + +```text +/codex-build Have codex implement the retry wrapper while I review +``` + ### Codex Runtime (`dev-review/codex/`) Standalone Bash runtime for the code-focused compose-bounce-execute-verify flow diff --git a/README.md b/README.md index e9ffb06..fcc58bd 100644 --- a/README.md +++ b/README.md @@ -173,6 +173,7 @@ fallthrough. | Existing markdown document needs refinement | `co-evolve-bouncer.sh --bounce-only` | | Small, low-risk repo edit in 1-2 files | Direct execution | | Multi-file code change, medium/high risk, or plan/verify workflow | `dev-review/codex/dev-review.sh` | +| Codex builds in the background while the session plans and reviews | `skills/codex-build/` (`--preset codex-build`, detached) | | General co-evolution inside Claude Code | `skills/co-evolution/` | | Code pipeline inside Claude Code | `skills/dev-review/` | | External MCP client (Claude Desktop, Cursor, Continue) | `npm i -g @alanshurafa/co-evolution-mcp` | diff --git a/dev-review/codex/claude-build.md b/dev-review/codex/claude-build.md new file mode 100644 index 0000000..5013f34 --- /dev/null +++ b/dev-review/codex/claude-build.md @@ -0,0 +1,187 @@ +# claude-build - Codex-Orchestrated Claude Execution + +This is the Codex-facing orchestration protocol for `--preset claude-build`. +Codex is the orchestrator: it plans, kicks the runner, waits synchronously, and +reviews the result. Claude is the executor: it gets the writable execute phase. + +Use this when the user wants Codex to supervise a build but wants Claude to write +the code. This is intentionally the mirror of `codex-build`, with one important +difference: Codex has no harness wake-on-exit. The Phase 1 mode is therefore +blocking and synchronous. + +The preset expands to: composer = Codex (xhigh, model left to the CLI's config), +executor = Opus (best/high), verifier = Codex (xhigh, model left to the CLI's +config), `--verify` on, bounces 2, revise-loop 1. + +## Step 1: PREFLIGHT + +Run these checks before writing the plan or kicking the runner. + +### Claude CLI present and authenticated + +Claude is the executor in this preset, so missing or logged-out Claude is a +HARD STOP. Do not degrade the executor to Codex; that is just `codex-build`. + +```bash +command -v claude +claude -p --model claude-haiku-4-5-20251001 "ping" +``` + +If either command fails, stop and tell the user to install or authenticate the +`claude` CLI, then re-run. A logged-out Claude executor cannot write code. + +### Codex CLI present + +Codex is still the composer and verifier. + +```bash +command -v codex +``` + +If missing, stop and tell the user to install or expose the Codex CLI before +running `claude-build`. + +### No other active dev-review run + +```bash +bash dev-review/codex/dev-review-status.sh --list +``` + +If the `ACTIVE` section shows a non-terminal run (`status=pending` or +`status=null`), do not kick a second orchestrated run in the same workdir. +Escalate or wait for the existing run to reach a terminal status. + +### Tree state and isolation + +```bash +git -C "$(pwd)" status --short +``` + +Record whether the tree is clean. This protocol always kicks with +`--worktree auto` so Claude writes in an isolated sibling checkout. That matters +because this feature edits the same runner it invokes, and because dirty +workdirs make diff-based verification ambiguous. + +Never run from a network or SMB mount. Use a local git clone. + +## Step 2: PLAN + +Codex writes the implementation plan before invoking the runner. + +1. Inspect the repo with `rg`, shell reads, and narrow file opens. +2. Write the plan to `$TMPDIR`, never inside the workdir: + + ```bash + PLAN_FILE="${TMPDIR:-/tmp}/claude-build-plan-$(date +%Y%m%d-%H%M%S).md" + ``` + +3. Shape the file for `--skip-plan --plan FILE`: a top-level heading, at least + about 60 words, at least 5 non-empty lines, and at least 2 structural lines. + Include: + + ```text + # + + ## Approach + + + ## Files to Change + - `path/to/file` - + + ## Steps + 1. + + ## Risks / Out of scope + - + ``` + +Optional hardening: ask Claude for a synchronous plan critique using the bounce +markers `[CONTESTED]` and `[CLARIFY]`. Resolve every marker before kicking. After +2 passes, resolve remaining markers yourself or ask the user; do not run with +open markers. + +## Step 3: RUN (BLOCKING) + +Kick the runner synchronously and wait for it to exit: + +```bash +CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \ + --preset claude-build --skip-plan --plan "$PLAN_FILE" \ + --worktree auto \ + -- "" +``` + +For a revision re-kick, add `--parent-run ` and use the revised +plan file. The parent id is lineage only; every re-kick gets a fresh run dir. + +Do not background this command in Phase 1. Codex cannot wake itself on runner +exit, so detached execution would only create an orphaned process without a +reliable gate. + +## Step 4: GATE + +After the blocking runner exits, read status first: + +```bash +bash dev-review/codex/dev-review-status.sh --json +``` + +The status reader contract provides `status`, `verdict`, `verdict_json`, +`verdict_present`, `diffstat_tail`, `current_phase`, `marker_counts`, `assess`, +and its own terminal exit code. The verdict file follows +`skills/dev-review/schemas/review-verdict.json`. + +Then inspect narrowly: + +1. Read `verdict.json` from `.verdict_json`. +2. Read the diffstat and only the changed files needed to spot-check the verdict. +3. Decide one gate outcome. + +### ACCEPT + +Accept only when `verdict` is `APPROVED` and Codex's own spot-check finds no +material issue. Never auto-merge. + +### REVISE + +Revise when the verdict is `REVISE` or the spot-check finds a fixable issue, and +the orchestrator has not exhausted its revise budget. Write a revised plan in +`$TMPDIR`, preserve the original scope, and re-run Step 3 with +`--parent-run `. + +### ESCALATE + +Escalate on any missing verdict, non-zero runner exit, status-reader failure, +scope creep, repeated identical feedback, or exhausted revise rounds. Report the +run dir, plan path, verdict path if present, diffstat, and the concrete blocker. + +## Step 5: BUDGET + +Codex gets at most 2 orchestrator revise rounds. The runner's internal +`--revise-loop 1` can add one cheap execute retry inside each run, but that does +not increase the orchestrator budget. Round 3 is an escalation. + +Never auto-merge. The gate decides whether to accept, revise, or escalate; a +human still controls merge/push policy. + +## Future: detached mode (design only) + +A future `claude-build.sh` wrapper could background the runner with `nohup`, +print a run id, and let a polling driver watch: + +```bash +nohup bash dev-review/codex/dev-review.sh \ + --preset claude-build --skip-plan --plan "$PLAN_FILE" \ + --worktree auto -- "" & + +while true; do + sleep 30 + bash dev-review/codex/dev-review-status.sh --json + # stop on terminal status-reader exits: 0, 2, or 4 +done +``` + +That mode is not implemented in this milestone. It adds orphan handling, +stale-pid detection, and no-auto-wake failure modes, and it only pays off for +very long Claude builds. Until the wrapper and polling contract exist, use the +synchronous Phase 1 protocol above. diff --git a/dev-review/codex/dev-review-status.sh b/dev-review/codex/dev-review-status.sh new file mode 100755 index 0000000..b457e35 --- /dev/null +++ b/dev-review/codex/dev-review-status.sh @@ -0,0 +1,362 @@ +#!/usr/bin/env bash +# dev-review/codex/dev-review-status.sh +# v1.5 Phase 3 — read-only status reader for dev-review runner runs. +# +# A Claude Code session kicks the runner via a background Bash task and ENDS +# ITS TURN. On wake (or via a cheap watcher, or a human) it must reconstruct +# run state purely from disk. This script does exactly that — it writes NOTHING +# and never touches the runner. It reports run summary, phase progress, a +# heartbeat derived from the in-flight phase's stderr-log mtime/size, marker +# counts, verdict, diffstat, and a liveness assessment with an actionable line. +# +# Usage: +# dev-review-status.sh [--json] [--list] [RUN_ID|RUN_DIR] +# (no positional) → latest runs/dev-review-* by mtime +# --list → one summary line per non-terminal run + 3 newest terminal +# RUN_ID → resolved against /runs/RUN_ID +# RUN_DIR → an absolute/relative dir path, used as-is +# +# Exit codes (liveness assessment): +# 0 terminal-completed (status=completed) +# 2 terminal-partial (status=partial/failed) +# 5 still-running (non-terminal + pid alive OR heartbeat fresh <120s) +# 4 presumed-dead (non-terminal + pid gone + heartbeat stale) +# 3 run not found / no state.json / bad argument +# +# Reader, not runner: requires jq and dies (exit 3) without it. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# Repo root: dev-review/codex/ -> repo (two levels up). +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" +RUNS_DIR="${CO_EVOLVE_RUNS_DIR:-$REPO_ROOT/runs}" + +HEARTBEAT_FRESH_SECS=120 + +die() { printf 'dev-review-status: %s\n' "$1" >&2; exit 3; } + +command -v jq >/dev/null 2>&1 || die "jq is required (this is a reader, not the runner)" + +# --- portable file mtime (epoch) -------------------------------------------- +# Feature-detect GNU stat (supports --version) vs BSD stat, mirroring the +# GNU-find detection convention in lib/co-evolution.sh (list_available_lab_modes). +if stat --version >/dev/null 2>&1; then + _file_mtime() { stat -c %Y "$1" 2>/dev/null; } # GNU +else + _file_mtime() { stat -f %m "$1" 2>/dev/null; } # BSD/macOS +fi + +now_epoch() { date +%s; } + +# --- argument parsing -------------------------------------------------------- +JSON=false +LIST=false +ARG="" +while [[ $# -gt 0 ]]; do + case "$1" in + --json) JSON=true; shift ;; + --list) LIST=true; shift ;; + -h|--help) + sed -n '5,30p' "${BASH_SOURCE[0]}" | sed 's/^# \{0,1\}//' + exit 0 + ;; + --*) die "unknown flag: $1" ;; + *) ARG="$1"; shift ;; + esac +done + +# --- run-dir resolution ------------------------------------------------------ +# A value containing a slash (or that resolves to an existing dir) is a path; +# otherwise treat it as a RUN_ID under $RUNS_DIR. +resolve_run_dir() { + local arg="$1" + if [[ -z "$arg" ]]; then + ls -dt "$RUNS_DIR"/dev-review-* 2>/dev/null | head -1 + return 0 + fi + if [[ -d "$arg" ]]; then + printf '%s' "$arg" + return 0 + fi + if [[ "$arg" == */* ]]; then + # Looks like a path but does not exist. + printf '%s' "$arg" + return 0 + fi + printf '%s' "$RUNS_DIR/$arg" +} + +# --- map a phase name to its stderr-log path (heartbeat source) -------------- +# Mirrors the exact filenames the runner writes: +# compose -> compose-stderr.log +# bounce-NN -> pass-N-stderr.log (N un-padded; runner names the file with +# the raw pass int, the phase with NN) +# execute[-N] -> execute-stderr.log (reused across revise passes) +# verify[-N] -> review-stderr.log (reused across revise passes) +phase_stderr_file() { + local run_dir="$1" phase="$2" + case "$phase" in + compose) printf '%s' "$run_dir/compose-stderr.log" ;; + bounce-*) + local nn="${phase#bounce-}" + nn=$((10#$nn)) # strip leading zero: bounce-01 -> 1 + printf '%s' "$run_dir/pass-${nn}-stderr.log" + ;; + execute|execute-*) printf '%s' "$run_dir/execute-stderr.log" ;; + verify|verify-*) printf '%s' "$run_dir/review-stderr.log" ;; + *) printf '' ;; + esac +} + +# --- one-line summary for --list --------------------------------------------- +summarize_one() { + local run_dir="$1" + local state="$run_dir/state.json" + local id status phase + id=$(jq -r '.run_id // "?"' "$state" 2>/dev/null) + status=$(jq -r '.status // "null"' "$state" 2>/dev/null) + phase=$(jq -r '.current_phase.name // "-"' "$state" 2>/dev/null) + printf '%-32s status=%-9s phase=%s\n' "$id" "$status" "$phase" +} + +if [[ "$LIST" == true ]]; then + shopt -s nullglob + dirs=("$RUNS_DIR"/dev-review-*) + shopt -u nullglob + if [[ ${#dirs[@]} -eq 0 ]]; then + echo "(no dev-review runs under $RUNS_DIR)" + exit 3 + fi + # newest first. bash-3.2 portable (macOS /bin/bash has no mapfile); read the + # ls -t output line by line into the array. + sorted=() + while IFS= read -r _d; do + [[ -n "$_d" ]] && sorted+=("$_d") + done < <(ls -dt "$RUNS_DIR"/dev-review-* 2>/dev/null) + echo "ACTIVE (non-terminal: pending/null):" + active_found=false + for d in "${sorted[@]}"; do + [[ -f "$d/state.json" ]] || continue + st=$(jq -r '.status // "null"' "$d/state.json" 2>/dev/null) + if [[ "$st" == "pending" || "$st" == "null" ]]; then + summarize_one "$d"; active_found=true + fi + done + [[ "$active_found" == false ]] && echo " (none)" + echo + echo "RECENT (3 newest terminal):" + shown=0 + for d in "${sorted[@]}"; do + [[ -f "$d/state.json" ]] || continue + st=$(jq -r '.status // "null"' "$d/state.json" 2>/dev/null) + if [[ "$st" != "pending" && "$st" != "null" ]]; then + summarize_one "$d"; shown=$((shown + 1)) + [[ "$shown" -ge 3 ]] && break + fi + done + [[ "$shown" -eq 0 ]] && echo " (none)" + exit 0 +fi + +RUN_DIR=$(resolve_run_dir "$ARG") +[[ -n "$RUN_DIR" && -d "$RUN_DIR" ]] || die "run not found: ${ARG:-} (looked under $RUNS_DIR)" +STATE="$RUN_DIR/state.json" +[[ -f "$STATE" ]] || die "no state.json in $RUN_DIR" +jq -e . "$STATE" >/dev/null 2>&1 || die "state.json is not valid JSON: $STATE" + +# --- read fields ------------------------------------------------------------- +RUN_ID=$(jq -r '.run_id // "?"' "$STATE") +STATUS=$(jq -r '.status // "null"' "$STATE") +TASK=$(jq -r '.task // ""' "$STATE" | head -c 100) +RUNNER_PID=$(jq -r '.runner_pid // empty' "$STATE") +CUR_PHASE=$(jq -r '.current_phase.name // empty' "$STATE") +CUR_STARTED=$(jq -r '.current_phase.started_at // empty' "$STATE") +PARENT_RUN=$(jq -r '.orchestration.parent_run_id // empty' "$STATE") +VERDICT=$(jq -r '.verify_verdict // empty' "$STATE") +PRE_SHA=$(jq -r '.pre_execute_sha // empty' "$STATE") +POST_SHA=$(jq -r '.post_execute_sha // empty' "$STATE") +M_CONTESTED=$(jq -r '.marker_counts.contested // 0' "$STATE") +M_CLARIFY=$(jq -r '.marker_counts.clarify // 0' "$STATE") + +# --- runner liveness via kill -0 --------------------------------------------- +# yes = pid present and alive; no = pid present and gone; unknown = no pid. +RUNNER_ALIVE="unknown" +if [[ -n "$RUNNER_PID" ]]; then + if kill -0 "$RUNNER_PID" 2>/dev/null; then + RUNNER_ALIVE="yes" + else + RUNNER_ALIVE="no" + fi +fi + +# --- heartbeat from the in-flight phase's stderr log ------------------------- +HB_AGE="" # seconds since last write (empty if no current phase / no file) +HB_LINES="" # line count of the stderr file +HB_FILE="" +if [[ -n "$CUR_PHASE" ]]; then + HB_FILE=$(phase_stderr_file "$RUN_DIR" "$CUR_PHASE") + if [[ -n "$HB_FILE" && -f "$HB_FILE" ]]; then + mt=$(_file_mtime "$HB_FILE") + if [[ -n "$mt" ]]; then + HB_AGE=$(( $(now_epoch) - mt )) + fi + HB_LINES=$(wc -l < "$HB_FILE" | tr -d ' ') + fi +fi + +# --- liveness assessment + exit code ----------------------------------------- +# Terminal states short-circuit. Non-terminal: alive pid OR fresh heartbeat +# => still-running; pid gone + stale/absent heartbeat => presumed-dead. +ASSESS="" +EXIT_CODE=0 +case "$STATUS" in + completed) + ASSESS="DONE: run completed cleanly." + EXIT_CODE=0 + ;; + partial|failed) + ASSESS="PARTIAL: run reached a terminal non-completed status ($STATUS) — review verdict/diffstat before acting." + EXIT_CODE=2 + ;; + *) + # Non-terminal (pending / null / unexpected). Decide running vs dead. + hb_fresh=false + if [[ -n "$HB_AGE" && "$HB_AGE" -lt "$HEARTBEAT_FRESH_SECS" ]]; then + hb_fresh=true + fi + if [[ "$RUNNER_ALIVE" == "yes" || "$hb_fresh" == true ]]; then + ASSESS="RUNNING: runner alive ($RUNNER_ALIVE) / heartbeat ${HB_AGE:-?}s — in phase ${CUR_PHASE:-?}. Leave it; re-check on wake." + EXIT_CODE=5 + else + ASSESS="DEAD: runner gone mid-phase (${CUR_PHASE:-?}) — escalate or re-kick with --skip-plan --plan ${RUN_DIR}/plan.md" + EXIT_CODE=4 + fi + ;; +esac + +# --- diffstat tail + execute stderr tail (paths) ----------------------------- +DIFFSTAT_FILE="$RUN_DIR/execute-diffstat.txt" +EXEC_STDERR="$RUN_DIR/execute-stderr.log" +VERDICT_JSON="$RUN_DIR/verdict.json" + +# ============================================================================ +# JSON mode — one machine-readable object for the orchestrating skill. +# ============================================================================ +if [[ "$JSON" == true ]]; then + diffstat_tail="" + [[ -f "$DIFFSTAT_FILE" ]] && diffstat_tail=$(tail -5 "$DIFFSTAT_FILE") + exec_tail="" + [[ -f "$EXEC_STDERR" ]] && exec_tail=$(tail -5 "$EXEC_STDERR") + verdict_present=false + [[ -f "$VERDICT_JSON" ]] && verdict_present=true + + jq -n \ + --arg run_id "$RUN_ID" \ + --arg run_dir "$RUN_DIR" \ + --arg status "$STATUS" \ + --arg task "$TASK" \ + --arg runner_pid "${RUNNER_PID:-}" \ + --arg runner_alive "$RUNNER_ALIVE" \ + --arg current_phase "${CUR_PHASE:-}" \ + --arg current_started "${CUR_STARTED:-}" \ + --arg heartbeat_age "${HB_AGE:-}" \ + --arg heartbeat_lines "${HB_LINES:-}" \ + --arg heartbeat_file "${HB_FILE:-}" \ + --arg parent_run "${PARENT_RUN:-}" \ + --arg verdict "${VERDICT:-}" \ + --arg verdict_json "$([[ -f "$VERDICT_JSON" ]] && printf '%s' "$VERDICT_JSON")" \ + --argjson verdict_present "$verdict_present" \ + --arg pre_sha "${PRE_SHA:-}" \ + --arg post_sha "${POST_SHA:-}" \ + --argjson contested "$M_CONTESTED" \ + --argjson clarify "$M_CLARIFY" \ + --arg diffstat_tail "$diffstat_tail" \ + --arg execute_stderr_tail "$exec_tail" \ + --arg assess "$ASSESS" \ + --argjson exit_code "$EXIT_CODE" \ + --argjson phases "$(jq -c '.phases // []' "$STATE")" \ + '{ + run_id: $run_id, + run_dir: $run_dir, + status: $status, + task: $task, + runner_pid: (if $runner_pid == "" then null else ($runner_pid|tonumber) end), + runner_alive: $runner_alive, + current_phase: (if $current_phase == "" then null else {name: $current_phase, started_at: $current_started} end), + heartbeat: { + age_secs: (if $heartbeat_age == "" then null else ($heartbeat_age|tonumber) end), + lines: (if $heartbeat_lines == "" then null else ($heartbeat_lines|tonumber) end), + file: (if $heartbeat_file == "" then null else $heartbeat_file end) + }, + orchestration: {parent_run_id: (if $parent_run == "" then null else $parent_run end)}, + verdict: (if $verdict == "" then null else $verdict end), + verdict_json: (if $verdict_json == "" then null else $verdict_json end), + verdict_present: $verdict_present, + pre_execute_sha: (if $pre_sha == "" then null else $pre_sha end), + post_execute_sha: (if $post_sha == "" then null else $post_sha end), + marker_counts: {contested: $contested, clarify: $clarify}, + phases: $phases, + diffstat_tail: $diffstat_tail, + execute_stderr_tail: $execute_stderr_tail, + assess: $assess, + exit_code: $exit_code + }' + exit "$EXIT_CODE" +fi + +# ============================================================================ +# Human mode. +# ============================================================================ +printf '== dev-review run: %s ==\n' "$RUN_ID" +printf 'run dir: %s\n' "$RUN_DIR" +printf 'status: %s\n' "$STATUS" +printf 'runner pid: %s (alive: %s)\n' "${RUNNER_PID:-}" "$RUNNER_ALIVE" +[[ -n "$PARENT_RUN" ]] && printf 'parent run: %s\n' "$PARENT_RUN" +printf 'task: %s\n' "$TASK" + +if [[ -n "$CUR_PHASE" ]]; then + printf 'current phase: %s (started %s)\n' "$CUR_PHASE" "${CUR_STARTED:-?}" + if [[ -n "$HB_AGE" ]]; then + printf 'heartbeat: %ss ago, %s lines (%s)\n' "$HB_AGE" "${HB_LINES:-0}" "$HB_FILE" + else + printf 'heartbeat: (no stderr activity yet for %s)\n' "$CUR_PHASE" + fi +else + printf 'current phase: \n' +fi + +# Completed phases table. +phase_rows=$(jq -r '.phases // [] | .[] | " \(.name)\t\(.status)\texit=\(.exit_code)\t\(.completed_at // "")"' "$STATE" 2>/dev/null) +if [[ -n "$phase_rows" ]]; then + printf 'phases:\n%s\n' "$phase_rows" +else + printf 'phases: (none recorded yet)\n' +fi + +printf 'markers: contested=%s clarify=%s\n' "$M_CONTESTED" "$M_CLARIFY" + +if [[ -n "$PRE_SHA" || -n "$POST_SHA" ]]; then + printf 'execute SHAs: pre=%s post=%s\n' "${PRE_SHA:-}" "${POST_SHA:-}" +fi + +if [[ -n "$VERDICT" ]]; then + printf 'verdict: %s' "$VERDICT" + [[ -f "$VERDICT_JSON" ]] && printf ' (%s)' "$VERDICT_JSON" + printf '\n' +elif [[ -f "$VERDICT_JSON" ]]; then + printf 'verdict: (verdict.json present: %s)\n' "$VERDICT_JSON" +fi + +if [[ -f "$DIFFSTAT_FILE" ]]; then + printf 'diffstat (tail):\n' + tail -5 "$DIFFSTAT_FILE" | sed 's/^/ /' +fi + +if [[ -f "$EXEC_STDERR" ]]; then + printf 'execute stderr (last 5 lines):\n' + tail -5 "$EXEC_STDERR" | sed 's/^/ /' +fi + +printf 'ASSESS: %s\n' "$ASSESS" +exit "$EXIT_CODE" diff --git a/dev-review/codex/dev-review.sh b/dev-review/codex/dev-review.sh index 2d96675..1027aee 100644 --- a/dev-review/codex/dev-review.sh +++ b/dev-review/codex/dev-review.sh @@ -27,6 +27,13 @@ LIVE_MODE="${LIVE_MODE:-false}" # exclusive — both non-empty after parsing = die. Default empty = Phase 3 byte-parity. BRANCH_SPEC="${DEV_REVIEW_BRANCH:-}" WORKTREE_SPEC="${DEV_REVIEW_WORKTREE:-}" +# v1.5: explicit verifier seat override. Empty = derive from executor via the +# existing select_verifier logic (byte-parity default). Env default + --verifier flag. +VERIFIER_OVERRIDE="${DEV_REVIEW_VERIFIER:-}" +# v1.5: named seat preset (e.g. codex-build). Empty = no preset (byte-parity +# default). Applied in-parser so AFTER-preset flags win (last-wins) and model/ +# effort values fill-if-empty so pre-set env vars win. See apply_preset(). +PRESET="" BRANCH_CREATED="" WORKTREE_PATH="" WORKDIR="$(pwd)" @@ -60,6 +67,9 @@ FLAVOR_OVERRIDE="" # (byte-parity invariant SC-5 / D-06). Harness-side passes an absolute path rooted under # the eval fixture's .co-evolution/runs/ subtree. RUN_DIR_OVERRIDE="" +# v1.5 Phase 3: lineage token for orchestrated re-kicks. Empty = standalone run +# (byte-parity: .orchestration is omitted). Validated as a safe fs-ish token. +PARENT_RUN_ID="" usage() { cat <<'EOF' @@ -75,12 +85,20 @@ Options: --skip-plan Skip compose+bounce and execute an existing plan --plan FILE Existing plan file to use with --skip-plan --model MODEL Override Codex model + --verifier AGENT Force the verify-phase agent (codex|opus|claude); default derives from --executor + --claude-model MODEL Override Claude model (aliases: best/opus -> claude-opus-4-8, fable -> claude-fable-5; else passthrough) + --preset NAME Expand a named seat preset (available: codex-build, claude-build). + codex-build = best (currently Opus) plans (high) + Codex executes (xhigh) + best reviews (max), + claude-build = Codex plans (xhigh) + Claude executes (best/high) + Codex reviews (xhigh), + --verify on, bounces 2, revise-loop 1. Precedence: flags placed AFTER --preset + override it (last-wins); pre-set model/effort env vars win over the preset. --workdir DIR Working directory (default: current directory) --timeout SECONDS Per-phase timeout in seconds (default: 1800) --revise-loop N Auto-retry on REVISE verdict up to N extra passes (default: 0 = disabled) --live Launch visible Windows terminal tailing each phase's stderr (Windows-only; warns + falls back on other OS) --branch auto|NAME Create a feature branch off HEAD before execute (auto = dev-review/auto--); mutually exclusive with --worktree --worktree auto|PATH Create a git worktree for isolation before execute (auto = sibling dir); mutually exclusive with --branch + --parent-run RUN_ID Lineage tag: record the orchestrator's parent run id in state.orchestration.parent_run_id (re-kicks always get a fresh run dir; no behavior change) --lab MODE Route to lab//entry.sh (opt-in beta channel; see lab/README.md) --target FILE PEL-only: file to mutate (used with --lab pel-proposer; must be repo-relative forward-slash path, e.g. lib/co-evolution.sh — NOT absolute or WSL/Windows-style) --tier TIER PEL-only: override tier auto-detect (template|policy|code) @@ -110,6 +128,48 @@ normalize_agent() { esac } +# v1.5: resolve a friendly Claude model alias to its CLI model id. `best` and +# `opus` resolve to the current Opus line (claude-opus-4-8); `fable` is retained +# for back-compat only (currently unreachable). Anything else passes through +# verbatim so explicit ids (claude-opus-4-6, claude-opus-4-8[1m], …) and an empty +# value are untouched. Used by the --claude-model flag and the per-seat env layer. +# Bumping the Claude default to a new model = edit the `best` arm here, one place. +resolve_claude_model_alias() { + case "$1" in + best) echo "claude-opus-4-8" ;; # strongest supported Claude default for the preset + opus) echo "claude-opus-4-8" ;; # current Opus-line model + fable) echo "claude-fable-5" ;; # retained for back-compat; currently unreachable, do not default to it + *) echo "$1" ;; + esac +} + +# v1.5: expand a named seat preset into the underlying seat/model/effort knobs. +# Called from the --preset parser arm. Two precedence rules, both deliberate: +# - seat/structural knobs (COMPOSER, EXECUTOR, VERIFIER_OVERRIDE, VERIFY, +# BOUNCES, REVISE_LOOP_MAX) are set hard, so flags placed AFTER --preset +# override them via last-wins parsing. +# - model/effort knobs use fill-if-empty (`: "${VAR:=…}"`) so a value already +# present in the environment (e.g. COMPOSER_EFFORT=low) wins over the preset. +apply_preset() { + case "$1" in + codex-build) # "build with codex": best Claude seats plan/review, Codex executes. + COMPOSER="opus"; EXECUTOR="codex"; VERIFIER_OVERRIDE="opus" + VERIFY=true; BOUNCES=2; REVISE_LOOP_MAX=1 + : "${COMPOSER_MODEL:=best}"; : "${COMPOSER_EFFORT:=high}" + : "${VERIFIER_MODEL:=best}"; : "${VERIFIER_EFFORT:=max}" + : "${EXECUTOR_EFFORT:=xhigh}" # codex model stays the CLI's configured default — deliberately unpinned + ;; + claude-build) # "build with claude": Codex plans/reviews, Claude executes. + COMPOSER="codex"; EXECUTOR="opus"; VERIFIER_OVERRIDE="codex" + VERIFY=true; BOUNCES=2; REVISE_LOOP_MAX=1 + : "${EXECUTOR_MODEL:=best}"; : "${EXECUTOR_EFFORT:=high}" # Claude (Opus) writes the code + : "${COMPOSER_EFFORT:=xhigh}" # codex plans; model left to the CLI's config (unpinned) + : "${VERIFIER_EFFORT:=xhigh}" # codex reviews; model unpinned + ;; + *) die "Unknown preset: $1 (available: codex-build, claude-build)" ;; + esac +} + invoke_agent() { local agent="$1" shift @@ -169,6 +229,12 @@ require_agent_cli() { } select_verifier() { + # v1.5: an explicit --verifier / DEV_REVIEW_VERIFIER override wins; otherwise + # derive the opposite-of-executor default (byte-parity when unset). + if [[ -n "${VERIFIER_OVERRIDE:-}" ]]; then + echo "$VERIFIER_OVERRIDE" + return 0 + fi if [[ "$EXECUTOR" == "codex" ]]; then echo "opus" else @@ -214,6 +280,14 @@ ensure_codex_compatible_workdir() { needs_codex="true" fi + # v1.5: with --verifier codex (or executor=opus default) the verify phase runs + # codex too, so its workdir must satisfy the same WSL constraint. Only matters + # when verification is requested. Byte-parity holds: default executor=codex → + # verifier=opus → this branch is a no-op. + if [[ "$VERIFY" == "true" && "$(select_verifier)" == "codex" ]]; then + needs_codex="true" + fi + if [[ "$needs_codex" != "true" ]]; then return 0 fi @@ -226,38 +300,12 @@ ensure_codex_compatible_workdir() { fi } -invoke_codex_schema() { - local prompt_file="$1" - local output_file="$2" - local stderr_file="$3" - local schema_file="$4" - local workdir="${WORKDIR:-$PWD}" - local -a cmd - local windows_workdir="" - local windows_output="" - local windows_schema="" - - if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1 && command -v wslpath >/dev/null 2>&1; then - windows_workdir=$(wslpath -w "$workdir") - windows_output=$(wslpath -w "$output_file") - windows_schema=$(wslpath -w "$schema_file") - cmd=(cmd.exe /c codex exec --full-auto --skip-git-repo-check -C "$windows_workdir") - else - cmd=(codex exec --full-auto --skip-git-repo-check -C "$workdir") - fi - - if [[ -n "${CODEX_MODEL:-}" ]]; then - cmd+=(-c "model=${CODEX_MODEL}") - fi - - if [[ -n "$windows_schema" ]]; then - cmd+=(--output-schema "$windows_schema" -o "$windows_output") - else - cmd+=(--output-schema "$schema_file" -o "$output_file") - fi - - "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true -} +# v1.5 (fixes B2): invoke_codex_schema now lives in lib/co-evolution.sh (sourced +# at the top of this script). It was previously defined here, but the verify +# phase dispatches it inside a `bash -c 'source lib/co-evolution.sh; invoke_...'` +# child that can only see lib-defined functions — so the local copy was invisible +# there (exit 127 when a timeout binary exists and verifier=codex). Moved to lib +# so both the timeout child and the no-timeout fallback resolve the same function. write_text_file() { local output_path="$1" @@ -267,18 +315,36 @@ write_text_file() { agent_auth_failed() { local agent="$1" - shift - local file_path - local cli_name + local output_file="${2:-}" + local stderr_file="${3:-}" + local cli_name words cli_name=$(agent_cli_name "$agent") - for file_path in "$@"; do - if file_contains_auth_failure "$file_path"; then + # A genuine auth failure means the CLI bailed BEFORE doing work, so its banner + # is short and stands alone. Mirror validate_agent_artifact's discriminator so + # a substantial work product that merely echoes auth strings — e.g. plan text, + # or the auth-detection source itself — is never misread as an auth failure. + # + # (1) Auth banner IN THE OUTPUT, but only when the output is short (< 50 + # words). A long output that mentions "Unauthorized"/"Not logged in" is + # real work, not the CLI's own banner. + if [[ -n "$output_file" ]] && file_contains_auth_failure "$output_file"; then + words=$(wc -w < "$output_file" | tr -d '\r\n ') + if (( words < 50 )); then log "WARNING: ${cli_name} authentication failed. Refresh the ${cli_name} CLI session and rerun." return 0 fi - done + fi + + # (2) Auth banner in STDERR counts only when the agent produced NO output. A + # non-empty work product means the CLI authenticated and ran; auth strings + # in its (possibly huge) working log are echoed content, not the banner. + if [[ ! -s "$output_file" && -n "$stderr_file" && -s "$stderr_file" ]] \ + && file_contains_auth_failure "$stderr_file"; then + log "WARNING: ${cli_name} authentication failed. Refresh the ${cli_name} CLI session and rerun." + return 0 + fi return 1 } @@ -565,6 +631,10 @@ Output ONLY the plan document. No preamble." # abort_on_timeout will fire from main flow if the dispatcher reports 124. # RTUX-01: Launch a live-tail window before the agent call (no-op unless --live). maybe_launch_live_window "compose" "$compose_stderr_file" + # v1.5 Phase 3: mark the compose phase as STARTING (status reader heartbeat). + begin_state_phase "$STATE_JSON" "compose" + # v1.5: layer the composer seat's model/effort (no-op when no per-seat env set). + apply_seat_env composer "$COMPOSER" invoke_agent_with_timeout "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$(phase_is_writable compose)" abort_on_timeout "compose" "$phase_start" ensure_valid_plan_output "compose phase" "$COMPOSER" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" "$compose_retry_stderr_file" "" "compose" || return $? @@ -655,6 +725,17 @@ run_bounce_phase() { # so state.json records which pass hit the timeout. # RTUX-01: Per-pass live-tail window (bounce-01, bounce-02, ...) when --live. maybe_launch_live_window "bounce-${pass_padded}" "$stderr_file" + # v1.5 Phase 3: mark this bounce pass as STARTING (matches the bounce-NN + # name write_state_phase records on completion). + begin_state_phase "$STATE_JSON" "bounce-${pass_padded}" + # v1.5: composer turns get the composer seat's model/effort; reviewer (the + # bounce counterparty) turns use globals only (the `bounce` seat is a no-op + # arm in apply_seat_env). No-op overall when no per-seat env is set. + if [[ "$role" == "composer" ]]; then + apply_seat_env composer "$current_agent" + else + apply_seat_env bounce "$current_agent" + fi invoke_agent_with_timeout "$current_agent" "$prompt_file" "$output_file" "$stderr_file" "$(phase_is_writable bounce)" abort_on_timeout "bounce-${pass_padded}" "$bounce_pass_start" ensure_valid_plan_output "bounce pass ${pass}" "$current_agent" "$prompt_file" "$output_file" "$stderr_file" "$retry_stderr_file" "$PLAN_PATH" "bounce" || return $? @@ -733,6 +814,17 @@ run_execute_phase() { PRE_EXECUTE_SHA=$(git -C "$WORKDIR" rev-parse HEAD 2>/dev/null || true) fi + # v1.5 Phase 3: record the workdir HEAD just before the executor runs so a + # reviewer can pin the exact pre-change commit. Non-git / detached / failed + # rev-parse yields an empty PRE_EXECUTE_SHA → write null (never die). + if [[ -n "${STATE_JSON:-}" ]]; then + if [[ -n "$PRE_EXECUTE_SHA" ]]; then + write_state_field "$STATE_JSON" ".pre_execute_sha" "string" "$PRE_EXECUTE_SHA" + else + write_state_field "$STATE_JSON" ".pre_execute_sha" "null" + fi + fi + # RNPT-03: Capture pre-execute baseline hash of every workdir file. # Delta computed post-execute and written into state.json. if [[ -n "${STATE_JSON:-}" && -n "${BASELINE_HASHES_JSON:-}" ]]; then @@ -751,6 +843,10 @@ run_execute_phase() { # RNPT-05: timeout-wrapped. abort_on_timeout uses _execute_phase_start from main flow. # RTUX-01: Live-tail window for execute phase (no-op unless --live). maybe_launch_live_window "execute" "$execute_stderr_file" + # v1.5: layer the executor seat's model/effort. The in-phase empty-output retry + # below inherits these exported values (no second apply needed). No-op when no + # per-seat env is set. + apply_seat_env executor "$EXECUTOR" invoke_agent_with_timeout "$EXECUTOR" "$execute_prompt_file" "$execute_output_file" "$execute_stderr_file" "$(phase_is_writable execute)" abort_on_timeout "execute" "$phase_start" @@ -777,6 +873,17 @@ run_execute_phase() { POST_EXECUTE_SHA=$(git -C "$WORKDIR" rev-parse HEAD 2>/dev/null || true) status_output=$(git -C "$WORKDIR" status --short) + # v1.5 Phase 3: record the workdir HEAD just after change detection so a + # reviewer can see whether the executor committed (pre != post) or left the + # tree dirty (pre == post). Empty rev-parse → null (never die). + if [[ -n "${STATE_JSON:-}" ]]; then + if [[ -n "$POST_EXECUTE_SHA" ]]; then + write_state_field "$STATE_JSON" ".post_execute_sha" "string" "$POST_EXECUTE_SHA" + else + write_state_field "$STATE_JSON" ".post_execute_sha" "null" + fi + fi + # RNPT-03: Post-execute delta (runs regardless of "no changes" branch below # so Phase 8 scorer sees an empty delta rather than a missing field). if [[ -n "${STATE_JSON:-}" && -n "${CURRENT_HASHES_JSON:-}" && -n "${EXECUTE_DELTA_JSON:-}" ]]; then @@ -857,6 +964,10 @@ run_verify_phase() { fi verifier=$(select_verifier) + # v1.5: layer the verifier seat's model/effort, AFTER the verifier agent is + # resolved (the seat env keys off the agent type). Covers both the codex-schema + # and opus branches below. No-op when no per-seat env is set. + apply_seat_env verifier "$verifier" plan_content=$(cat "$PLAN_PATH") diff_content=$(cat "$diff_file") @@ -905,16 +1016,31 @@ run_verify_phase() { return 2 fi - verdict_data=$(normalize_json_artifact "$verdict_file" "$normalized_verdict_file") || { - log "WARNING: verifier output was unusable: ${verdict_data}. Review manually." - return 2 - } + if ! verdict_data=$(normalize_json_artifact "$verdict_file" "$normalized_verdict_file"); then + # v1.5: claude verifiers (no --output-schema) sometimes wrap the verdict in + # prose, which normalize_json_artifact rejects. Try a brace-block extraction + # (same idiom as evals/judge-bounce.sh:149) before giving up. codex verdicts + # come from --output-schema and never hit this branch (they normalize clean). + sed -n '/^{/,/^}/p' "$verdict_file" > "$normalized_verdict_file" + if [[ ! -s "$normalized_verdict_file" ]]; then + log "WARNING: verifier output was unusable: ${verdict_data}. Review manually." + return 2 + fi + fi verdict_data=$(validate_review_verdict "$normalized_verdict_file") || { log "WARNING: verifier output was unusable: ${verdict_data}. Review manually." return 2 } + # v1.5: persist the normalized verdict back to the contract path so downstream + # consumers (evals/score-run.sh jq-parses verdict.json raw; mapping unparseable + # -> FAIL) read clean JSON. Byte-no-op for codex --output-schema verdicts, which + # are already clean. The dot-prefixed .verdict-normalized.json copy stays for the + # revise-loop read; cleanup_runtime_artifacts sweeps it but verdict.json (plain + # path, no leading dot) survives as the contract artifact. + cp "$normalized_verdict_file" "$verdict_file" + eval "$verdict_data" review_status="$VERDICT" @@ -937,8 +1063,33 @@ cleanup_runtime_artifacts() { find "$RUN_DIR" -maxdepth 1 -type f -name '.*' -delete 2>/dev/null || true } +# v1.5 Phase 4: token-usage aggregation, gated on CO_EVOLVE_TOKEN_CAPTURE=1. +# When the flag is unset/off this is a no-op and state.json never grows a +# `tokens` key (byte-parity). Called just BEFORE cleanup_runtime_artifacts in +# both terminal paths (plan-only early exit and normal EOF). collect_token_usage +# reads the *.usage.json sidecars + per-phase stderr logs and writes one +# state.json.tokens block; we then remove the transient sidecars/envelopes since +# state.json is the durable record (some sidecars — execute-output.md.usage.json, +# verdict.json.usage.json — have no leading dot, so cleanup_runtime_artifacts +# would not sweep them). +maybe_collect_token_usage() { + [[ "${CO_EVOLVE_TOKEN_CAPTURE:-}" == "1" ]] || return 0 + collect_token_usage "$RUN_DIR" "$STATE_JSON" + find "$RUN_DIR" -maxdepth 1 -type f \ + \( -name '*.usage.json' -o -name '*.envelope.json' \) -delete 2>/dev/null || true +} + while [[ $# -gt 0 ]]; do case "$1" in + --preset) + # v1.5: expand a named seat preset. Applied in-parser so flags placed + # AFTER --preset override it (last-wins) and pre-set model/effort env vars + # win over the preset's fill-if-empty defaults. See apply_preset(). + [[ $# -gt 1 ]] || die "--preset requires a value" + PRESET="$2" + apply_preset "$2" + shift 2 + ;; --composer) [[ $# -gt 1 ]] || die "--composer requires a value" COMPOSER=$(normalize_agent "$2") || die "Unsupported composer: $2" @@ -973,7 +1124,23 @@ while [[ $# -gt 0 ]]; do ;; --model) [[ $# -gt 1 ]] || die "--model requires a value" - CODEX_MODEL="$2" + # v1.5 (fixes B1): export so the `bash -c` child inside + # invoke_agent_with_timeout inherits it (mirrors the --timeout arm). Without + # export, CODEX_MODEL never reached invoke_codex in the dispatched child. + export CODEX_MODEL="$2" + shift 2 + ;; + --verifier) + [[ $# -gt 1 ]] || die "--verifier requires a value" + # v1.5: normalize so `claude` maps to the opus seat name the rest of the + # script uses; reject anything normalize_agent doesn't accept. + VERIFIER_OVERRIDE=$(normalize_agent "$2") || die "Unsupported verifier: $2" + shift 2 + ;; + --claude-model) + [[ $# -gt 1 ]] || die "--claude-model requires a value" + # v1.5: resolve friendly alias (fable -> claude-fable-5) else passthrough. + CLAUDE_MODEL=$(resolve_claude_model_alias "$2") shift 2 ;; --workdir) @@ -1020,6 +1187,19 @@ while [[ $# -gt 0 ]]; do WORKTREE_SPEC="$2" shift 2 ;; + --parent-run) + # v1.5 Phase 3: orchestration lineage. Validate as a safe filesystem-ish + # token (alnum, _, -, .) so a malicious id can't traverse paths or inject + # shell-meta when echoed downstream (mirrors validate_lab_mode's posture, + # plus '.' since run ids carry dotted timestamps). Lineage only — no + # behavior change beyond the state.orchestration.parent_run_id write. + [[ $# -gt 1 ]] || die "--parent-run requires a value" + if ! [[ "$2" =~ ^[A-Za-z0-9._-]+$ ]] || [[ "${#2}" -gt 128 ]]; then + die "--parent-run must be a safe token [A-Za-z0-9._-]{1,128} (got: $2)" + fi + PARENT_RUN_ID="$2" + shift 2 + ;; --lab) # Phase 3 LAB-01: opt-in routing to lab//entry.sh. # Arm sits BEFORE the `--` argv-terminator so args after `--` remain @@ -1147,6 +1327,12 @@ if [[ -n "$PLAN_SOURCE" ]]; then fi WORKDIR="$(cd "$WORKDIR" && pwd)" +# v1.5 (fixes B3): export WORKDIR so the `bash -c` child inside +# invoke_agent_with_timeout (and the verify-codex `bash -c` block) inherits it; +# otherwise ${WORKDIR:-$PWD} in invoke_claude/invoke_codex falls back to the +# child's launch cwd. The export attribute persists across the later worktree-mode +# reassignment of WORKDIR (~line 1300), so the worktree path is exported too. +export WORKDIR if [[ "$SKIP_PLAN" == "true" && -z "$PLAN_SOURCE" ]]; then die "--skip-plan requires --plan FILE" @@ -1192,6 +1378,91 @@ ensure_codex_compatible_workdir require_selected_agent_clis +# v1.5 per-seat env layer. Snapshot the post-parse base model/effort values (set +# by globals / --model / --claude-model / CLAUDE_EFFORT / CODEX_REASONING_EFFORT), +# then apply_seat_env layers an optional per-seat override before each invocation. +# Byte-parity: with no COMPOSER_/EXECUTOR_/VERIFIER_ envs set, every seat resolves +# to its base, so exported CLAUDE_MODEL/CLAUDE_EFFORT/CODEX_MODEL/CODEX_REASONING_EFFORT +# match what the unlayered globals already were — argv is unchanged. +CLAUDE_MODEL_BASE="$CLAUDE_MODEL"; CODEX_MODEL_BASE="${CODEX_MODEL:-}" +CLAUDE_EFFORT_BASE="${CLAUDE_EFFORT:-}"; CODEX_EFFORT_BASE="${CODEX_REASONING_EFFORT:-}" + +# Export is load-bearing: invoke_agent_with_timeout crosses a `timeout bash -c` +# process boundary; only exported values survive it. +apply_seat_env() { + local seat="$1" agent="$2" model="" effort="" + case "$seat" in + composer) model="${COMPOSER_MODEL:-}"; effort="${COMPOSER_EFFORT:-}" ;; + executor) model="${EXECUTOR_MODEL:-}"; effort="${EXECUTOR_EFFORT:-}" ;; + verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;; + *) : ;; # bounce counterparty: globals only + esac + # v1.5 cross-agent leak guard. A seat's model+effort is ONE override pair + # configured for a specific agent kind (e.g. the codex-build preset fills + # VERIFIER_MODEL=best / VERIFIER_EFFORT=max for its DEFAULT claude verifier). + # When --verifier codex flips that same seat to codex, the pair is now wrong: + # exporting CODEX_MODEL=best (a Claude alias) makes codex/ChatGPT reject the run + # with HTTP 400: e.g. "The 'fable' model is not supported when using Codex with + # a ChatGPT account" — any Claude alias/id is off codex's menu. + # and CODEX_REASONING_EFFORT=max is off codex's scale (it ends at xhigh). + # The two halves are coupled: a model picked for the wrong kind means the + # effort was too — so drop the WHOLE pair (not just the model) and fall back + # to that agent's base values (CODEX_MODEL_BASE/CODEX_EFFORT_BASE = the user's + # codex config defaults, which is exactly right for the degrade path). Symmetric + # on the claude arm so a codex-shaped override never reaches a claude seat. + if [[ "$agent" == "codex" ]]; then + case "$model" in + fable|best|opus|claude-*) model=""; effort="" ;; + esac + else + case "$model" in + gpt-*|codex*) model=""; effort="" ;; + esac + fi + if [[ "$agent" == "codex" ]]; then + export CODEX_MODEL="${model:-$CODEX_MODEL_BASE}" + export CODEX_REASONING_EFFORT="${effort:-$CODEX_EFFORT_BASE}" + else + export CLAUDE_MODEL="$(resolve_claude_model_alias "${model:-$CLAUDE_MODEL_BASE}")" + export CLAUDE_EFFORT="${effort:-$CLAUDE_EFFORT_BASE}" + fi +} + +# v1.5: resolve a seat's "agent:model@effort" descriptor using the SAME model/ +# effort precedence apply_seat_env uses, but without mutating the exported env. +# Feeds the startup banner and the optional state.json seat_models fields. +# model defaults to "(default)" when unpinned (codex's executor seat under the +# codex-build preset); effort defaults to "(default)" when no effort is set. +resolve_seat_model_string() { + local seat="$1" agent="$2" model="" effort="" model_str="" effort_str="" + case "$seat" in + composer) model="${COMPOSER_MODEL:-}"; effort="${COMPOSER_EFFORT:-}" ;; + executor) model="${EXECUTOR_MODEL:-}"; effort="${EXECUTOR_EFFORT:-}" ;; + verifier) model="${VERIFIER_MODEL:-}"; effort="${VERIFIER_EFFORT:-}" ;; + *) : ;; + esac + # v1.5: mirror apply_seat_env's cross-agent leak guard (drop the wrong-kind + # model+effort pair as a unit) so the banner / state.json seat_models report + # what actually runs — e.g. codex:(default)@(default), not codex:fable@max. + if [[ "$agent" == "codex" ]]; then + case "$model" in + fable|best|opus|claude-*) model=""; effort="" ;; + esac + else + case "$model" in + gpt-*|codex*) model=""; effort="" ;; + esac + fi + if [[ "$agent" == "codex" ]]; then + model_str="${model:-$CODEX_MODEL_BASE}" + effort_str="${effort:-$CODEX_EFFORT_BASE}" + else + model_str="$(resolve_claude_model_alias "${model:-$CLAUDE_MODEL_BASE}")" + effort_str="${effort:-$CLAUDE_EFFORT_BASE}" + fi + printf '%s:%s@%s' "$agent" "${model_str:-(default)}" "${effort_str:-(default)}" +} + # Phase 8.1 WR-04: honor --run-dir when set; otherwise preserve v1.2 default (byte-parity). if [[ -n "$RUN_DIR_OVERRIDE" ]]; then RUN_DIR="$RUN_DIR_OVERRIDE" @@ -1214,12 +1485,41 @@ EXECUTE_DELTA_JSON="${RUN_DIR}/.execute-delta.json" RUN_ID="dev-review-${TIMESTAMP}" init_state_json "$STATE_JSON" "$RUN_ID" "$TASK" "$COMPOSER" "$EXECUTOR" "$REVIEWER" +# v1.5 Phase 3: observability — record the runner's PID once so a status reader +# can probe liveness via `kill -0`. Written as a number (jq accepts $$ verbatim). +write_state_field "$STATE_JSON" ".runner_pid" "number" "$$" + +# v1.5 Phase 3: lineage — when this run was re-kicked by an orchestrator under a +# parent run, record the parent's id (validated at parse time). Lineage only; +# no other behavior. Omitted (left absent) when --parent-run was not passed. +if [[ -n "$PARENT_RUN_ID" ]]; then + write_state_field "$STATE_JSON" ".orchestration.parent_run_id" "string" "$PARENT_RUN_ID" +fi + +# v1.5: resolve the verifier seat and per-seat model/effort descriptors for the +# banner + observability fields. _VERIFIER_SEAT reflects --verifier / executor +# derivation (select_verifier); the seat strings reflect whatever the seats +# resolve to under the per-seat env layer (apply_seat_env precedence). +_VERIFIER_SEAT=$(select_verifier) +_SEAT_COMPOSER=$(resolve_seat_model_string composer "$COMPOSER") +_SEAT_EXECUTOR=$(resolve_seat_model_string executor "$EXECUTOR") +_SEAT_VERIFIER=$(resolve_seat_model_string verifier "$_VERIFIER_SEAT") + +# v1.5: optional seat_models observability — records what each seat resolves to. +# Unconditional + additive (no existing sim asserts an exact state.json key set, +# so this does not break shape); cheap, and reflects the actual resolved seats. +write_state_field "$STATE_JSON" ".seat_models.composer" "string" "$_SEAT_COMPOSER" +write_state_field "$STATE_JSON" ".seat_models.executor" "string" "$_SEAT_EXECUTOR" +write_state_field "$STATE_JSON" ".seat_models.verifier" "string" "$_SEAT_VERIFIER" + log "============================================" log " DEV-REVIEW SESSION" log "============================================" log " Task: $TASK" +log " Preset: ${PRESET:-}" log " Composer: $COMPOSER" log " Executor: $EXECUTOR" +log " Verifier: $_VERIFIER_SEAT" log " Bounces: $BOUNCES" log " Verify: $VERIFY" log " Workdir: $WORKDIR" @@ -1228,6 +1528,9 @@ log " Timeout: ${PHASE_TIMEOUT}s per phase" log " Live mode: $LIVE_MODE" log " Branch: ${BRANCH_SPEC:-}" log " Worktree: ${WORKTREE_SPEC:-}" +log " Composer model: $_SEAT_COMPOSER" +log " Executor model: $_SEAT_EXECUTOR" +log " Verifier model: $_SEAT_VERIFIER" log "============================================" log "" @@ -1272,6 +1575,12 @@ if [[ "$PLAN_ONLY" == "true" ]]; then # RNPT-04: record completion even on plan-only exit so Phase 8 eval sees a # terminated run rather than one that looks crashed mid-phase. write_state_field "$STATE_JSON" ".completed_at" "string" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" + # v1.5 Phase 3: plan-only runs also reach a clean terminal — clear the in-flight + # phase so a status reader does not see a stale compose/bounce as "still running". + write_state_field "$STATE_JSON" ".current_phase" "null" + # v1.5 Phase 4: aggregate token usage (no-op unless CO_EVOLVE_TOKEN_CAPTURE=1) + # BEFORE cleanup sweeps the dot-prefixed sidecars. + maybe_collect_token_usage cleanup_runtime_artifacts if [[ "${PLAN_EXIT:-0}" -eq 2 ]]; then exit 2 @@ -1363,6 +1672,9 @@ _run_revise_loop() { _execute_phase_start=$(date -u +%Y-%m-%dT%H:%M:%SZ) EXECUTE_EXIT=0 + # v1.5 Phase 3: mark execute (or execute-N on a revise pass) as STARTING, + # using the same name write_state_phase records on completion below. + begin_state_phase "$STATE_JSON" "$exec_name" run_execute_phase "$_execute_phase_start" || EXECUTE_EXIT=$? _phase_end=$(date -u +%Y-%m-%dT%H:%M:%SZ) _phase_status=$([[ "$EXECUTE_EXIT" -eq 0 ]] && echo "ok" || echo "error") @@ -1377,6 +1689,8 @@ _run_revise_loop() { VERIFY_EXIT=0 # Clear prior verdict globals so a hung/empty verify pass can't leak them. unset VERDICT CONFIDENCE SUMMARY + # v1.5 Phase 3: mark verify (or verify-N on a revise pass) as STARTING. + begin_state_phase "$STATE_JSON" "$verify_name" run_verify_phase "$_verify_phase_start" || VERIFY_EXIT=$? _phase_end=$(date -u +%Y-%m-%dT%H:%M:%SZ) _phase_status=$([[ "$VERIFY_EXIT" -eq 0 ]] && echo "ok" || echo "error") @@ -1427,6 +1741,10 @@ write_state_field "$STATE_JSON" ".status" "string" "$_run_status" write_state_field "$STATE_JSON" ".completed_at" "string" "$_run_end_ts" # Phase 8.1 WR-02 supporting: .updated_at feeds Cost dimension (score-run.sh:261). write_state_field "$STATE_JSON" ".updated_at" "string" "$_run_end_ts" +# v1.5 Phase 3: the run reached EOF — no phase is in flight. The status reader +# treats a non-null .current_phase on a non-terminal run as "still in "; +# clearing it here is the in-band "phases done" signal. +write_state_field "$STATE_JSON" ".current_phase" "null" # Phase 8.1 / D-01: mirror phases[] into history[] (canonical contract name). # Field-rename transition posture — phases[] stays as legacy alias for one @@ -1444,6 +1762,10 @@ if command -v jq >/dev/null 2>&1; then fi unset _run_status _run_end_ts +# v1.5 Phase 4: aggregate token usage (no-op unless CO_EVOLVE_TOKEN_CAPTURE=1) +# BEFORE cleanup sweeps the dot-prefixed sidecars. +maybe_collect_token_usage + cleanup_runtime_artifacts log "" diff --git a/dev-review/codex/instructions.md b/dev-review/codex/instructions.md index f30eb6b..d08c8be 100644 --- a/dev-review/codex/instructions.md +++ b/dev-review/codex/instructions.md @@ -18,6 +18,9 @@ Pick the lowest-ceremony path that still protects correctness. **Start with `co- | Existing approved plan should be executed as-is | `bash dev-review/codex/dev-review.sh --skip-plan --plan ` | Reuses the approved plan | | User wants a reviewed plan before any code changes | `bash co-evolve-bouncer.sh --vanilla "task"` or `dev-review.sh --plan-only` | Plan artifact only | | User wants a final review against the plan and diff | `bash dev-review/codex/dev-review.sh --verify ` | Runs verifier after execution | +| Orchestrated / background build — plan and review in-session, Codex executes detached | `CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh --preset codex-build --skip-plan --plan --branch auto -- ""` (kicked in the background; see `skills/codex-build/`) | Model-ladder ladder: best Claude (Opus) plans/reviews, Codex executes; session is not babysat | +| Orchestrated build on Claude — Codex plans/reviews, Claude executes (synchronous) | `bash dev-review/codex/dev-review.sh --preset claude-build --skip-plan --plan --worktree auto -- ""` (see `dev-review/codex/claude-build.md`) | Mirror ladder: Codex plans/reviews, Claude executes; hard-requires claude auth | +| Ad-hoc interactive delegation to Codex mid-session | OpenAI plugin `/codex:rescue --background` (see `skills/codex-build/` Transport B) | Quick delegation; review with the same verdict schema by hand, not via CI | ## Distinguish Coding From Everything Else @@ -101,5 +104,6 @@ bash dev-review/codex/dev-review.sh --skip-plan --plan .planning/phases/04-docs- - Legacy document bouncer: `agent-bouncer/agent-bouncer.sh` - Default Claude Code skill surface: `skills/co-evolution/` - Code-focused Claude Code skill surface: `skills/dev-review/` +- Detached Codex orchestration skill surface: `skills/codex-build/` (kicks `--preset codex-build` in the background) - Shared library: `lib/co-evolution.sh` - Co-evolve templates: `templates/co-evolve/` diff --git a/evals/RUNNER-CONTRACT.md b/evals/RUNNER-CONTRACT.md index f8fe194..a331f5f 100644 --- a/evals/RUNNER-CONTRACT.md +++ b/evals/RUNNER-CONTRACT.md @@ -1,12 +1,12 @@ --- title: Runner ↔ scorer contract -version: 1.0 +version: 1.1 status: locked owners: - dev-review/codex/dev-review.sh (runner) - evals/score-run.sh (scorer) - evals/tests/fake-runner.sh (reference implementation) -updated: 2026-04-19 +updated: 2026-06-12 --- # Runner ↔ scorer contract @@ -43,6 +43,13 @@ historical fixture corpus under `runners/codex-ps/evals/tests/fixtures/**`. | `verify_verdict` | string \| null | nullable | runner (post-verify) | `score-run.sh:498` | Verify-phase verdict — `APPROVED` / `REVISE` / null | | `history` | array\<`{phase,status,detail,timestamp}`\> | yes | runner (per phase) | `score-run.sh:338` | Canonical bounce-trace array. Convergence needs ≥1 entry with `phase` matching `^bounce-[0-9]+$` when bounces were expected. | | `mode` | string | no | runner (optional tag) | — | Fake-runner sets `"fake-runner"`; real runner may set `"real"` or omit. Runner-owned metadata — NOT written by the shared `init_state_json` library. | +| `seat_models` | object `{composer, executor, verifier}` (each string `agent:model@effort`) | no | `dev-review.sh` (post-init) | — | v1.5 observability: what each seat resolved to under the per-seat env layer (e.g. `"claude:claude-fable-5@high"`, `"codex:(default)@xhigh"`). Runner-owned metadata — NOT written by the shared `init_state_json` library; scorer ignores it. | +| `current_phase` | object `{name, started_at}` \| null | no | `dev-review.sh` (`begin_state_phase` at each phase start; null at EOF) | — | v1.5 additions — observability. The phase the runner is currently in; `name` matches the `phases[]`/`history[]` phase name (`compose`, `bounce-NN`, `execute[-N]`, `verify[-N]`). Set at phase **start**, cleared to null at EOF (incl. plan-only). A non-null value on a non-terminal run means the runner is in that phase or died there. Read by `dev-review-status.sh`; scorer ignores it. | +| `runner_pid` | number | no | `dev-review.sh` (post-init, once) | — | v1.5 additions — observability. The runner process's PID (`$$`). A status reader probes liveness via `kill -0`. Runner-owned; scorer ignores it. | +| `pre_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, before executor) | — | v1.5 additions — observability. Workdir `git rev-parse HEAD` captured just before the executor runs; null when non-git/detached. Scorer ignores it. | +| `post_execute_sha` | string (40-hex) \| null | no | `dev-review.sh` (execute phase, after change detection) | — | v1.5 additions — observability. Workdir HEAD just after change detection; `pre != post` means the executor committed. Null when non-git/detached. Scorer ignores it. | +| `orchestration` | object `{parent_run_id}` | no | `dev-review.sh` (post-init, only when `--parent-run` passed) | — | v1.5 additions — lineage. Records the orchestrator's parent run id for re-kicks (re-kicks always get a fresh run dir; no `--resume`). Omitted entirely for standalone runs. Runner-owned; scorer ignores it. | +| `tokens` | object `{phases:{:{…}}, totals:{claude_input, claude_output, claude_cache_read, claude_cost_usd, codex_total_tokens}}` | no | `dev-review.sh` (`collect_token_usage` at EOF, **only when `CO_EVOLVE_TOKEN_CAPTURE=1`**) | `score-run.sh` — `.tokens.totals` copied into `details.cost.tokens`, informational only | v1.5 addition — token economics. Per-phase usage: claude phases carry `{input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens, total_cost_usd, source:"claude-json"}` from the gated `--output-format json` envelope; codex phases carry `{source:"codex-stderr", total_tokens}` harvested from the per-phase stderr token line. **Omitted entirely unless capture is enabled** (default off → byte-parity with prior state.json). Scorer surfaces totals for evidence; never affects any PASS/FAIL dimension. | ### Legacy alias: `phases` → `history` diff --git a/evals/score-run.sh b/evals/score-run.sh index 843f603..b9a6380 100644 --- a/evals/score-run.sh +++ b/evals/score-run.sh @@ -644,6 +644,14 @@ composite=$(jq -n --argjson scores "$scores_json" ' # Details (sparse for Task 1; Task 2 expands per-dimension metadata) # --------------------------------------------------------------------------- +# v1.5 Phase 4: token economics (informational only). When the runner ran with +# CO_EVOLVE_TOKEN_CAPTURE=1, state.json carries a .tokens.totals block; surface +# it under details.cost.tokens so the report can speak to the ~50% Claude-token +# savings claim. The `// empty` guard means: when there is no tokens block (every +# existing fixture, and any run with capture off), the key is simply absent and +# details stays byte-identical — this MUST NOT touch any PASS/FAIL dimension. +tokens_totals_json=$(jq -c '.tokens.totals // empty' "$state_json_path" 2>/dev/null) + details_json=$(jq -n \ --arg status "$status" \ --argjson had_exception "$had_exception" \ @@ -651,9 +659,11 @@ details_json=$(jq -n \ --argjson max_wall "$max_wall" \ --arg composer "$composer" \ --arg reviewer "$reviewer" \ + --argjson tokens "${tokens_totals_json:-null}" \ '{ robustness: {status: $status, had_exception: $had_exception}, - cost: {wall_clock_seconds: $wall_secs, max: $max_wall}, + cost: ({wall_clock_seconds: $wall_secs, max: $max_wall} + + (if $tokens == null then {} else {tokens: $tokens} end)), cross_ai_diversity: {composer: $composer, reviewer: $reviewer} }') diff --git a/lib/co-evolution.sh b/lib/co-evolution.sh index 65aa7e0..8478cf7 100644 --- a/lib/co-evolution.sh +++ b/lib/co-evolution.sh @@ -58,6 +58,15 @@ LIVE_MODE_WARNING_LOGGED=false # appends -c model= when it is set.) : "${CLAUDE_MODEL:=claude-opus-4-6}" +# v1.5: Per-seat reasoning-effort knobs. Both default empty = OFF, so invoke_* +# omit the effort flag entirely and argv stays byte-identical to today (parity +# invariant). When non-empty, invoke_claude appends `--effort "$CLAUDE_EFFORT"` +# and invoke_codex appends `-c model_reasoning_effort=...`. `--effort high` is +# proven headless in -p mode at evals/judge-bounce.sh:58 (built into JUDGE_ARGS, +# consumed by the `"$JUDGE_CMD" -p ...` call at evals/judge-bounce.sh:139). +: "${CLAUDE_EFFORT:=}" +: "${CODEX_REASONING_EFFORT:=}" + # RNPT-02: Authoritative list of phases that require write access to the workdir. # Phase code MUST NOT pass a hard-coded "true"/"false" to invoke_agent; it must # call `phase_is_writable ""` instead. To add a new writable phase @@ -408,11 +417,69 @@ invoke_claude() { tool_flags=(--disallowedTools "Edit,Write,Bash,Glob,Grep,WebSearch,WebFetch") fi + # v1.5: model + optional reasoning effort. CLAUDE_EFFORT defaults empty (OFF) + # so model_flags is exactly `--model "$CLAUDE_MODEL"` and argv is byte-identical + # to the prior single-flag form. The --effort element is appended only when set. + local -a model_flags=(--model "${CLAUDE_MODEL}") + if [[ -n "${CLAUDE_EFFORT:-}" ]]; then + model_flags+=(--effort "${CLAUDE_EFFORT}") + fi + + # v1.5 Phase 4: per-phase token capture, default OFF. Gated behind + # CO_EVOLVE_TOKEN_CAPTURE=1 AND jq present so the default path stays + # byte-identical (argv + artifacts). When ON we swap `--output-format text` + # for `--output-format json`: the JSON envelope (R1) carries `.usage` token + # counts, and `.result` holds the same text the text path would have emitted. + # PRTP-03 reminder: this is `--output-format json`, NOT `--output-schema` + # (the latter hangs claude -p on Windows). The gate keeps it default-off so + # the Windows precedent is respected unless an operator opts in. + local capture_json=false + if [[ "${CO_EVOLVE_TOKEN_CAPTURE:-}" == "1" ]] && command -v jq >/dev/null 2>&1; then + capture_json=true + fi + + local output_format=text + [[ "$capture_json" == "true" ]] && output_format=json + if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1; then # Under WSL, reuse the Windows Claude session because WSL and Windows keep separate auth state. - cmd=(cmd.exe /c claude -p --output-format text --model "${CLAUDE_MODEL}" "${tool_flags[@]}") + cmd=(cmd.exe /c claude -p --output-format "$output_format" "${model_flags[@]}" "${tool_flags[@]}") else - cmd=(claude -p --output-format text --model "${CLAUDE_MODEL}" "${tool_flags[@]}") + cmd=(claude -p --output-format "$output_format" "${model_flags[@]}" "${tool_flags[@]}") + fi + + if [[ "$capture_json" == "true" ]]; then + local envelope_file="${output_file}.envelope.json" + local usage_file="${output_file}.usage.json" + # The envelope is emitted on stdout even on failure (R1: is_error=true still + # carries a body, exit nonzero) — `|| true` matches the text path and we + # parse regardless. Capture stdout to the envelope, NOT the output file. + "${cmd[@]}" < "$prompt_file" > "$envelope_file" 2>"$stderr_file" || true + # Extract `.result` into the output file so the external contract is + # preserved: callers (validate_agent_artifact, file_contains_auth_failure) + # only ever read $output_file. On auth failure `.result` is the "Not logged + # in" banner, so the existing auth gate keeps working. If `.result` is empty + # or the envelope is unparseable, fall back to copying the raw envelope so + # validate_agent_artifact sees a non-empty file rather than a silent blank. + if jq -e 'has("result") and (.result | type == "string") and (.result | length > 0)' \ + "$envelope_file" >/dev/null 2>&1; then + jq -r '.result // empty' "$envelope_file" > "$output_file" + else + cp "$envelope_file" "$output_file" + fi + # Usage sidecar (R1 fields). `// null` guards a malformed/partial envelope so + # the file is always valid JSON for collect_token_usage to slurp. Never + # fatal — token capture must not break a run. + jq '{ + input_tokens: (.usage.input_tokens // null), + output_tokens: (.usage.output_tokens // null), + cache_read_input_tokens: (.usage.cache_read_input_tokens // null), + cache_creation_input_tokens: (.usage.cache_creation_input_tokens // null), + total_cost_usd: (.total_cost_usd // null), + source: "claude-json" + }' "$envelope_file" > "$usage_file" 2>/dev/null \ + || printf '{"input_tokens":null,"output_tokens":null,"cache_read_input_tokens":null,"cache_creation_input_tokens":null,"total_cost_usd":null,"source":"claude-json"}\n' > "$usage_file" + return 0 fi "${cmd[@]}" < "$prompt_file" > "$output_file" 2>"$stderr_file" || true @@ -439,6 +506,12 @@ invoke_codex() { cmd+=(-c "model=${CODEX_MODEL}") fi + # v1.5: optional reasoning effort. Defaults empty (OFF) → no -c append, so argv + # is byte-identical to today. Appended only when CODEX_REASONING_EFFORT is set. + if [[ -n "${CODEX_REASONING_EFFORT:-}" ]]; then + cmd+=(-c "model_reasoning_effort=${CODEX_REASONING_EFFORT}") + fi + if [[ -n "${windows_output:-}" ]]; then cmd+=(-o "$windows_output") else @@ -448,6 +521,51 @@ invoke_codex() { "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true } +# v1.5 (fixes B2): relocated verbatim from dev-review/codex/dev-review.sh so the +# verify-phase `bash -c 'source lib/co-evolution.sh; invoke_codex_schema ...'` +# child can actually find it (it was defined only in dev-review.sh → exit 127 +# whenever a timeout binary exists and verifier=codex). Behavior is identical to +# the prior local copy (--output-schema / -o / --skip-git-repo-check / WSL path +# handling), plus the same conditional model + reasoning-effort appends that +# invoke_codex grew above — both default-off so argv is unchanged when unset. +invoke_codex_schema() { + local prompt_file="$1" + local output_file="$2" + local stderr_file="$3" + local schema_file="$4" + local workdir="${WORKDIR:-$PWD}" + local -a cmd + local windows_workdir="" + local windows_output="" + local windows_schema="" + + if [[ -n "${WSL_DISTRO_NAME:-}" ]] && command -v cmd.exe >/dev/null 2>&1 && command -v wslpath >/dev/null 2>&1; then + windows_workdir=$(wslpath -w "$workdir") + windows_output=$(wslpath -w "$output_file") + windows_schema=$(wslpath -w "$schema_file") + cmd=(cmd.exe /c codex exec --full-auto --skip-git-repo-check -C "$windows_workdir") + else + cmd=(codex exec --full-auto --skip-git-repo-check -C "$workdir") + fi + + if [[ -n "${CODEX_MODEL:-}" ]]; then + cmd+=(-c "model=${CODEX_MODEL}") + fi + + # v1.5: optional reasoning effort. Defaults empty (OFF) → no -c append. + if [[ -n "${CODEX_REASONING_EFFORT:-}" ]]; then + cmd+=(-c "model_reasoning_effort=${CODEX_REASONING_EFFORT}") + fi + + if [[ -n "$windows_schema" ]]; then + cmd+=(--output-schema "$windows_schema" -o "$windows_output") + else + cmd+=(--output-schema "$schema_file" -o "$output_file") + fi + + "${cmd[@]}" < "$prompt_file" > /dev/null 2>"$stderr_file" || true +} + file_contains_auth_failure() { local file_path="$1" @@ -1209,6 +1327,165 @@ write_state_phase() { fi } +# v1.5: record the phase now STARTING (write_state_phase records completions). +# Sets .current_phase = {name, started_at} and bumps .updated_at so a status +# reader knows which phase the runner is in mid-run. The EOF block clears it +# back to null. Single-writer discipline: called from the runner main flow only +# — there is NO concurrent heartbeat loop, which would race write_state_field's +# read-modify-write. Degrades silently (return 0) when jq is absent, matching +# write_state_phase / write_state_field. +begin_state_phase() { + local state_path="${1:?state path required}" + local phase_name="${2:?phase name required}" + + if command -v jq >/dev/null 2>&1; then + local tmp now + now=$(date -u +%Y-%m-%dT%H:%M:%SZ) + tmp=$(mktemp) + # FIX-WR-02 idiom: clean up $tmp on jq failure so we don't leak files. + if jq --arg name "$phase_name" --arg now "$now" \ + '.current_phase = {name: $name, started_at: $now} | .updated_at = $now' \ + "$state_path" > "$tmp"; then + mv "$tmp" "$state_path" + else + rm -f "$tmp" + log "WARNING: jq failed in begin_state_phase ($phase_name) — state.json unchanged" + fi + else + log "WARNING: jq unavailable — begin_state_phase skipping ($phase_name)" + fi +} + +# v1.5 Phase 4: aggregate per-phase token usage into state.json. Called ONLY +# when CO_EVOLVE_TOKEN_CAPTURE=1 (the runner gates the call), so when the flag +# is off there is no `tokens` key at all — state.json stays byte-identical. +# +# Two evidence sources, both already on disk by the time this runs: +# 1. Claude usage sidecars (`*.usage.json`, written by invoke_claude's gated +# JSON path). Mapped to a phase via the sidecar's base output filename: +# .compose-output.md.usage.json -> compose +# .bounce-output-N.md.usage.json -> bounce-NN (zero-padded) +# execute-output.md.usage.json -> execute +# verdict.json.usage.json -> verify +# 2. Codex token line (R2) harvested read-only from per-phase stderr logs: +# compose-stderr.log -> compose +# pass-N-stderr.log -> bounce-NN +# execute-stderr.log -> execute +# review-stderr.log -> verify +# Format: the literal line `tokens used` immediately followed by a +# comma-formatted integer. awk grabs the next line and strips commas. +# +# Writes ONE `tokens` block: {phases:{:{…}}, totals:{…}} and bumps +# .updated_at. Degrades silently when jq is unavailable. The caller removes the +# transient sidecars afterward (state.json is the durable record). +collect_token_usage() { + local run_dir="${1:?run_dir required}" + local state_json="${2:?state_json required}" + + command -v jq >/dev/null 2>&1 || { + log "WARNING: jq unavailable — collect_token_usage skipping" + return 0 + } + [[ -f "$state_json" ]] || return 0 + + # Build the phases object incrementally as a JSON string. + local phases_json='{}' + local tmp + + # --- 1. Claude usage sidecars ------------------------------------------- + # Two globs: the leading-dot one matches compose/bounce sidecars whose base + # output file is itself dot-prefixed (.compose-output.md.usage.json, + # .bounce-output-N.md.usage.json) — a plain *.usage.json glob skips dotfiles. + local sidecar base phase pass + for sidecar in "$run_dir"/*.usage.json "$run_dir"/.*.usage.json; do + [[ -e "$sidecar" ]] || continue # no-glob guard + base=$(basename "$sidecar") + base="${base%.usage.json}" # strip sidecar suffix -> output filename + case "$base" in + .compose-output.md) phase="compose" ;; + execute-output.md) phase="execute" ;; + verdict.json) phase="verify" ;; + .bounce-output-*.md) + pass="${base#.bounce-output-}" + pass="${pass%.md}" + if [[ "$pass" =~ ^[0-9]+$ ]]; then + printf -v phase 'bounce-%02d' "$pass" + else + phase="bounce-${pass}" + fi + ;; + *) phase="$base" ;; # unknown shape: key by raw base + esac + # Merge this sidecar's object under .. Tolerate a malformed sidecar. + if tmp=$(jq --arg p "$phase" --slurpfile s "$sidecar" \ + '. + {($p): $s[0]}' <<<"$phases_json" 2>/dev/null); then + phases_json="$tmp" + fi + done + + # --- 2. Codex stderr token lines ---------------------------------------- + local stderr_log log_base codex_phase tokens_val + for stderr_log in "$run_dir"/compose-stderr.log "$run_dir"/execute-stderr.log \ + "$run_dir"/review-stderr.log "$run_dir"/pass-*-stderr.log; do + [[ -f "$stderr_log" ]] || continue + log_base=$(basename "$stderr_log") + case "$log_base" in + compose-stderr.log) codex_phase="compose" ;; + execute-stderr.log) codex_phase="execute" ;; + review-stderr.log) codex_phase="verify" ;; + pass-*-stderr.log) + pass="${log_base#pass-}" + pass="${pass%-stderr.log}" + if [[ "$pass" =~ ^[0-9]+$ ]]; then + printf -v codex_phase 'bounce-%02d' "$pass" + else + codex_phase="bounce-${pass}" + fi + ;; + *) continue ;; + esac + # Skip the retry-stderr variants (pass-N-stderr-retry.log won't match the + # globs above; the primary logs are authoritative for the phase total). + tokens_val=$(awk '/^tokens used$/{getline; gsub(",",""); print; exit}' "$stderr_log" 2>/dev/null) + [[ "$tokens_val" =~ ^[0-9]+$ ]] || continue + # A claude sidecar already claimed this phase only if claude ran it; codex + # phases never have a sidecar, so this is collision-free in practice. If both + # somehow exist, the codex entry wins last (rare/diagnostic). + if tmp=$(jq --arg p "$codex_phase" --argjson t "$tokens_val" \ + '. + {($p): {source: "codex-stderr", total_tokens: $t}}' \ + <<<"$phases_json" 2>/dev/null); then + phases_json="$tmp" + fi + done + + # --- 3. Totals + single write into state.json ---------------------------- + # Totals sum across phases: claude fields from claude-json sidecars, codex + # tokens from codex-stderr entries. Nulls are treated as 0 in the sums. + local now + now=$(date -u +%Y-%m-%dT%H:%M:%SZ) + tmp=$(mktemp) + if jq --argjson phases "$phases_json" --arg now "$now" ' + ($phases | to_entries) as $entries | + def num($x): ($x // 0); + { + phases: $phases, + totals: { + claude_input: ([$entries[] | select(.value.source=="claude-json") | num(.value.input_tokens)] | add // 0), + claude_output: ([$entries[] | select(.value.source=="claude-json") | num(.value.output_tokens)] | add // 0), + claude_cache_read: ([$entries[] | select(.value.source=="claude-json") | num(.value.cache_read_input_tokens)] | add // 0), + claude_cost_usd: ([$entries[] | select(.value.source=="claude-json") | num(.value.total_cost_usd)] | add // 0), + codex_total_tokens:([$entries[] | select(.value.source=="codex-stderr") | num(.value.total_tokens)] | add // 0) + } + } as $tokens | + .tokens = $tokens | .updated_at = $now + ' "$state_json" > "$tmp" 2>/dev/null; then + mv "$tmp" "$state_json" + else + rm -f "$tmp" + log "WARNING: jq failed in collect_token_usage — state.json unchanged (no tokens block)" + fi +} + # RNPT-04: Generic jq-path field setter. Supports string|number|bool|null|rawfile. # Usage: write_state_field state.json '.verify_verdict' string APPROVED # write_state_field state.json '.execute_delta' rawfile path/to/delta.json diff --git a/runners/codex-ps/schemas/review-verdict.json b/runners/codex-ps/schemas/review-verdict.json index 3d4835a..3e4193e 100644 --- a/runners/codex-ps/schemas/review-verdict.json +++ b/runners/codex-ps/schemas/review-verdict.json @@ -3,7 +3,7 @@ "title": "ReviewVerdict", "description": "Structured code review verdict for the dev-review loop", "type": "object", - "required": ["verdict", "confidence", "summary", "issues"], + "required": ["verdict", "confidence", "summary", "issues", "scope_creep_detected", "iteration_notes"], "properties": { "verdict": { "type": "string", @@ -24,7 +24,8 @@ "type": "array", "items": { "type": "object", - "required": ["severity", "description"], + "additionalProperties": false, + "required": ["severity", "file", "line_range", "description", "suggestion"], "properties": { "severity": { "type": "string", diff --git a/schemas/review-verdict.json b/schemas/review-verdict.json index 3d4835a..3e4193e 100644 --- a/schemas/review-verdict.json +++ b/schemas/review-verdict.json @@ -3,7 +3,7 @@ "title": "ReviewVerdict", "description": "Structured code review verdict for the dev-review loop", "type": "object", - "required": ["verdict", "confidence", "summary", "issues"], + "required": ["verdict", "confidence", "summary", "issues", "scope_creep_detected", "iteration_notes"], "properties": { "verdict": { "type": "string", @@ -24,7 +24,8 @@ "type": "array", "items": { "type": "object", - "required": ["severity", "description"], + "additionalProperties": false, + "required": ["severity", "file", "line_range", "description", "suggestion"], "properties": { "severity": { "type": "string", diff --git a/skills/codex-build/SKILL.md b/skills/codex-build/SKILL.md new file mode 100644 index 0000000..921150d --- /dev/null +++ b/skills/codex-build/SKILL.md @@ -0,0 +1,398 @@ +--- +name: codex-build +description: > + Orchestrate Codex to BUILD code in the background while this Claude Code + session (typically Opus) plans and reviews — never babysitting. The session + composes the implementation plan, kicks the dev-review runner detached via a + background Bash task with --preset codex-build, ENDS ITS TURN, and is woken on + exit to run a schema-bound review gate (ACCEPT / REVISE / ESCALATE, max 2 + rounds). Use when the user wants to "build with codex", "have codex build", + "have codex implement", "orchestrate codex", "kick off codex in the + background", "let codex grind", or "codex builds while I review". For an + interactive single-session compose-bounce-execute loop with no detach, use + /dev-review. For non-code document refinement, use /co-evolution. +allowed-tools: Bash, Read, Write, Edit, Grep, Glob, Agent +--- + +# /codex-build - Detached Codex Execution with Gate-Based Review + +This is the **orchestration protocol** behind the "best Claude plans, Codex +executes, best Claude reviews" model ladder. The session is the composer and the +reviewer; Codex +is the executor. The defining rule: **the session is NEVER kept busy while Codex +grinds.** You kick the runner as a background task, end your turn, and the +harness wakes you when it exits. + +Use `/dev-review` instead when you want the classic interactive +compose-bounce-execute loop in one session (you watch every pass). Use +`/co-evolution` for questions, drafts, plans, and document refinement with no +code execution. + +The protocol has five steps. Steps 1-3 happen in the FIRST turn (preflight, +plan, kick-and-stop). Steps 4-5 happen on WAKE, after the background task exits. + +--- + +## Step 1: PREFLIGHT (one turn, before planning) + +Run these checks. Each gate either passes, degrades with a stated fallback, or +dies with a fix hint. + +### Codex present + +```bash +command -v codex +``` + +If missing, die with the install hint: + +```text +codex CLI not found. On this Mac with Codex.app installed: + ln -s "/Applications/Codex.app/Contents/Resources/codex" ~/.local/bin/codex +Otherwise install the CLI: + npm i -g @openai/codex +Then re-run /codex-build. +``` + +### Claude CLI auth probe (decides the verifier seat) + +The runner's `claude` verifier seat runs in a headless shell that does NOT carry +the interactive app's session token. Probe it: + +```bash +claude -p --output-format text --model claude-haiku-4-5-20251001 "ping" 2>&1 | head -2 +``` + +- If the output contains `Not logged in` (or `/login`): the runner's claude + verifier seat cannot run. **DEGRADE** — kick with `--verifier codex` so the + verify phase uses Codex's schema-bound review instead. Tell the user plainly: + the in-session review gate (Step 4, which YOU run) is then the only Claude-side + review of the diff, and to run `claude /login` in a terminal to restore the + full ladder (the claude verifier seat inside the runner). +- Otherwise: the full ladder is available. Kick with the preset's default claude + verifier (best/Opus, max effort) — no `--verifier` override needed. + +### No other active run + +```bash +bash dev-review/codex/dev-review-status.sh --list +``` + +If the `ACTIVE` section shows a non-terminal run (`status=pending` or +`status=null`), do NOT kick a second one — one orchestrated run per workdir. +Either wait for it (check its status on wake) or escalate to the user. + +### Isolation — clean vs dirty tree + +```bash +git -C "$(pwd)" status --short +``` + +- **Clean tree** → kick with `--branch auto` (the runner cuts + `dev-review/auto--` off HEAD before execute). +- **Dirty tree** → kick with `--worktree auto` (REQUIRED). A dirty tree + otherwise makes the runner's verify phase silently skip: it returns early with + `verification skipped - workdir had pre-existing uncommitted changes` because + it cannot isolate this run's diff. `--worktree auto` gives Codex a clean + sibling checkout so verify runs against an isolatable diff. + +### Local clone, never an SMB mount + +The workdir must be a local git clone. Never run from a network/SMB mount +(`/Volumes/...`, `\\server\...`) — it causes line-ending churn and the runner's +diff isolation is unreliable. If the cwd is a mount, stop and ask the user to +point at a local clone. + +--- + +## Step 2: PLAN (in-session — the session is the composer) + +This is the cheap, interactive part. You explore and write the plan; you do NOT +hand planning to Codex. + +1. Explore the codebase with Read / Grep / Glob (or an Explore Agent for + fan-out). Understand the task, the files involved, and the conventions. +2. Write the implementation plan to a temp file **outside the workdir**: + + ```bash + PLAN_FILE="${TMPDIR:-/tmp}/codex-build-plan-$(date +%Y%m%d-%H%M%S).md" + ``` + + **NEVER** write the plan inside the workdir. An untracked plan file there + makes the runner's verify phase skip (`run left untracked files that cannot + be diffed automatically`). **NEVER** pass the plan path inside a prompt to + Codex — per repo convention the runner embeds plan content inline; the + orchestrator is the sole owner of the canonical plan file. + +3. Shape the plan the way `--skip-plan --plan FILE` expects. The runner's plan + validator wants a real plan artifact: a top-level heading, >= ~60 words, >= 5 + non-empty lines, and >= 2 structural lines (headings / bullets). Write the + plan with these sections: + + ```text + # + + ## Approach + + + ## Files to Change + - `path/to/file` — + + ## Steps + 1. + + ## Risks / Out of scope + - + ``` + + A thin or malformed plan makes the runner's plan-quality gate fail before + execute. + +### Optional plan hardening (≤ 2 synchronous bounce passes) + +For higher-stakes work, harden the plan against Codex BEFORE kicking, using the +bounce protocol markers. This is the only synchronous Codex use in this skill — +keep it short (a plan bounce is seconds-to-low-minutes, well under the 5-minute +synchronous ceiling): + +- Send the plan to Codex asking it to critique with `[CONTESTED]` (disagreement + + counter-argument) and `[CLARIFY]` (ambiguity + two interpretations) markers. +- Resolve EVERY `[CONTESTED]` / `[CLARIFY]` before kicking. Apply the 2-pass + expiry rule: any marker still open after 2 passes — decide it yourself or ask + the user. Do not kick with open markers. + +If you skip hardening, that is fine for routine tasks — go straight to Step 3. + +--- + +## Step 3: KICK + END TURN (the load-bearing step) + +Kick the runner as a **background** Bash task, report the run id, and **end your +turn**. This is what makes the model ladder pay off — the session pays only for +plan + review gates, not for watching Codex grind. + +Run this with the Bash tool, **`run_in_background: true`**: + +```bash +CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \ + --preset codex-build --skip-plan --plan "$PLAN_FILE" \ + [--verifier codex] \ + [--branch auto | --worktree auto] \ + [--parent-run ] \ + [--timeout 1800] \ + -- "" +``` + +Fill the brackets from Step 1: +- `--verifier codex` ONLY when the auth probe degraded (else omit — the preset's + best/max claude verifier is the default). +- `--branch auto` for a clean tree; `--worktree auto` for a dirty tree. +- `--parent-run ` only on a REVISE re-kick (Step 4), to tag lineage. +- `--timeout` defaults to 1800s; raise it for large tasks. + +The preset expands to: composer = Opus (high), executor = Codex (xhigh, model +left to the CLI's config), verifier = Opus (max), `--verify` on, bounces 2, +revise-loop 1. The two Claude seats default through the `best` alias (currently +`claude-opus-4-8`), so a future model bump is a one-line edit in the runner's +`resolve_claude_model_alias`. `CO_EVOLVE_TOKEN_CAPTURE=1` records per-phase tokens +into `state.json` so you can report spend at the gate. + +**Optional — make the seats follow THIS session's model.** By default the plan and +review seats run `best` (a strong model) regardless of what you're driving the +session with. To instead pin them to a specific model — e.g. match the session +you're in — prepend the seat env vars to the kick (fill-if-empty means they win +over the preset): + +```bash +COMPOSER_MODEL=claude-opus-4-8 VERIFIER_MODEL=claude-opus-4-8 \ +CO_EVOLVE_TOKEN_CAPTURE=1 bash dev-review/codex/dev-review.sh \ + --preset codex-build --skip-plan --plan "$PLAN_FILE" -- "" +``` + +Only the Claude seats follow this; Codex always executes. A weak model here means +weak planning/review — keep these on a strong model. + +After kicking: +1. Capture the run id. It is `dev-review-` and the run dir is + `runs/dev-review-/`. Read it from the runner's early stdout, or + resolve the newest run with `ls -dt runs/dev-review-* | head -1`. +2. Report to the user: the run id, and how to check it manually — + `bash dev-review/codex/dev-review-status.sh ` (human) or `--json` + (machine). +3. **END THE TURN.** + +### Hard rules for Step 3 + +- Do **NOT** poll. Do **NOT** sleep. Do **NOT** loop waiting on status. +- Do **NOT** run Codex synchronously for anything expected to take longer than + ~5 minutes — that defeats the entire purpose (the session burns cache reads + every turn while idle, the exact anti-pattern this skill exists to avoid). +- The harness re-invokes this session when the background task exits. **That + notification IS the wake signal.** You do not build the wake mechanism. +- Optional safety net for very long runs (reference, do not build inline): a + one-shot scheduled reminder that re-checks status after the expected duration, + in case a machine-sleep drops the wake notification. + +--- + +## Step 4: WAKE → REVIEW GATE (1-2 turns, after the background task exits) + +When the background task exits, you are re-invoked. Reconstruct state from disk — +read the status JSON FIRST, then read narrowly. Do not re-read the whole tree. + +```bash +bash dev-review/codex/dev-review-status.sh --json +``` + +This emits one object with: `status`, `verdict` (APPROVED / REVISE / null), +`verdict_json` (path to `verdict.json`), `verdict_present`, `diffstat_tail`, +`current_phase`, `marker_counts`, `assess`, and `exit_code` (the status reader's +liveness code: 0 done / 2 partial / 4 presumed-dead / 5 running / 3 no-run). + +Then, BEFORE reading any source files: +1. Read `verdict.json` (the path is in `.verdict_json`) — the schema-bound + verdict: `verdict`, `confidence`, `summary`, `issues[]`, + `scope_creep_detected`. +2. Read the diffstat (`.diffstat_tail`, or + `runs//execute-diffstat.txt`). + +Only THEN read targeted diff hunks and the specific files named in `issues[]`. +Never re-read the whole tree to form an opinion. + +Decide exactly ONE of three outcomes. + +### ACCEPT + +Conditions: verdict is `APPROVED` AND your own spot-check of the named diff +hunks agrees. Report and finish: + +- Branch or worktree path (where the code landed). +- Diffstat (files changed, insertions/deletions). +- Verdict confidence and one-line summary. +- Token totals: + + ```bash + jq '.tokens.totals' runs//state.json + ``` + +- Your own turn count for this orchestration (plan + kick + this gate). +- Suggested next step: merge / PR commands, e.g. + `git checkout master && git merge ` or `gh pr create`. + +Then you are done. **Never auto-merge** — surface the commands for the user. + +### REVISE (max 2 orchestrator rounds total) + +Conditions: verdict is `REVISE`, or your spot-check finds a real, fixable +problem, AND you have not already used 2 revise rounds. + +1. Copy the plan file to a new path (keep the original; lineage is by file): + + ```bash + REVISE_PLAN="${TMPDIR:-/tmp}/codex-build-plan-$(date +%Y%m%d-%H%M%S)-rN.md" + cp "$PLAN_FILE" "$REVISE_PLAN" + ``` + +2. Append a corrections section with **file-specific, actionable** fixes drawn + from `issues[]` and your spot-check: + + ```text + ## Reviewer Corrections (Round N) + - `path/to/file`: + ``` + +3. Re-kick Step 3 with the new plan and `--parent-run ` (tags the + re-kick's lineage). Then END THE TURN again. + +Never re-kick on identical feedback twice — if the same issue survives a round, +that is an ESCALATE, not another REVISE. + +### ESCALATE + +Conditions (any): verdict missing (`verdict_present: false` / +`verdict: null`) · runner exited non-zero (status `failed`) · status reader exit +4 (presumed-dead runner) · 2 revise rounds exhausted · `scope_creep_detected: +true`. + +Write a concise human summary — do NOT auto-merge, do NOT silently retry: +- What was attempted (task, rounds used). +- Artifact paths (run dir, `verdict.json`, diffstat, the plan file). +- Open issues (from `issues[]` and/or the dead-runner assessment). +- Recommended manual step (inspect the worktree, re-kick with `--skip-plan`, + fix by hand, etc.). + +--- + +## Step 5: BUDGET + TROUBLESHOOTING + +### Budget rules + +- **Max 2 orchestrator revise rounds.** Round 3 is an ESCALATE. +- The runner's internal `--revise-loop 1` (set by the preset) adds at most one + cheap automatic retry per kick, INSIDE the runner — that is separate from and + cheaper than your orchestrator rounds. You still cap your own re-kicks at 2. +- **One orchestrated run per workdir.** The Step 1 `--list` check enforces this. + +### Troubleshooting + +| Symptom | Cause | Action | +|---|---|---| +| status reader exits 4 (presumed-dead) | runner process gone mid-phase | `pkill -f 'codex exec'` to clear orphans, then re-kick `--skip-plan --plan "$PLAN_FILE"` (add `--parent-run `) | +| Machine slept during the run | wake notification may have been dropped | Run the status script manually; do not assume the run failed | +| `tokens` block missing from state.json | `CO_EVOLVE_TOKEN_CAPTURE=1` was not set, or `jq` is missing | Re-kick with the flag set; install `jq` for token capture | +| Verify silently skipped | dirty tree kicked without `--worktree auto`, or run left untracked files | Re-kick on a clean tree with `--branch auto`, or `--worktree auto` | +| `claude` verifier seat errors mid-run | headless shell not logged in | Re-kick with `--verifier codex`; run `claude /login` to restore the claude verifier seat | + +--- + +## Transport B: the OpenAI plugin (secondary, interactive flavor) + +When you are already mid-session on small ad-hoc work and want to delegate a +quick build/fix to Codex without the full runner, the official OpenAI plugin is +a supported alternative. + +Install (one-time, non-interactive): + +```bash +claude plugin marketplace add openai/codex-plugin-cc +claude plugin install codex@openai-codex +``` + +It provides `/codex:rescue --background` (delegate a task to Codex detached), +`/codex:status` (check the job), `/codex:result` (fetch the result), and +`/codex:cancel`. + +Protocol when using the plugin: +1. Delegate execution to `codex-rescue` (or `/codex:rescue --background`). +2. Apply the **same review contract** in-session: read the diff and fill the + review-verdict JSON yourself (`skills/dev-review/schemas/review-verdict.json` + — `verdict`, `confidence`, `summary`, `issues[]`, `scope_creep_detected`). +3. Same `≤ 2`-round revise cap. Same **never auto-merge** rule. + +State plainly to the user: the plugin path shares our templates/schema **by +convention** but is **NOT** exercised by evals/CI — there is no `--output-schema` +contract enforced through it (you fill the verdict by hand), and it is pinned to +the plugin's major version (`v1.x`). Format drift in the plugin's output is the +plugin's risk, not ours. The runner path (Transport A above) is the one that +carries CI. + +--- + +## Notes + +- **Why detached, not babysat:** the post that inspired this (strong model plans, + Codex executes, strong model reviews) keeps a session supervising inline, + burning cache reads every turn. This skill kicks the runner via background Bash, + ends the turn, and gets woken on exit — the session pays only for plan + review + gates. +- **The preset is the single source of truth for the seats.** `codex-build` + expands to best/high · Codex/xhigh · best/max · verify on · bounces 2 · + revise-loop 1, inside `dev-review/codex/dev-review.sh` (`apply_preset`). The + `best` alias resolves to the current Opus line. The + preset triple is pinned by `tests/docs-sync-simulation.sh` so this doc and the + runner cannot drift. +- **Live-mode caveat:** `--live` codex windows ignore model/effort overrides + (documented v1 limitation in /dev-review). This skill does not use `--live`. +- **Plan ownership:** the orchestrator owns the canonical plan file; Codex only + ever receives plan content inline (embedded by the runner) and writes to a + separate output. Never let Codex see the canonical plan path. diff --git a/skills/dev-review/SKILL.md b/skills/dev-review/SKILL.md index 36dd9ad..e6bda2f 100644 --- a/skills/dev-review/SKILL.md +++ b/skills/dev-review/SKILL.md @@ -26,6 +26,14 @@ Use `/co-evolution` for general questions, ideas, arguments, prompt refinement, or document bounces. Use `/dev-review` only when the intended deliverable is a repo change or code-focused verification workflow. +For the **detached** flavor — the session plans and reviews while Codex executes +in the background and the session is woken at gates instead of babysitting — use +`/codex-build` (`skills/codex-build/`). It kicks the standalone runner with +`--preset codex-build` (Opus plans at `high`, Codex executes at `xhigh`, Opus +reviews at `max`; `--verify` on, bounces `2`, revise-loop `1`) and applies a +schema-bound ACCEPT / REVISE / ESCALATE gate. `/dev-review` is the interactive +single-session loop; `/codex-build` is the same pipeline run detached. + When `--live` is enabled on Windows, each Codex pass runs in a visible PowerShell window so the user can watch progress in real time. The handoff contract stays file-based: prompts are written to disk, Codex writes the final message to disk, @@ -170,6 +178,28 @@ the warning from Step 2 is enough; still render the banner as `Live: no`. ### Initialize the plan document: Create `/tmp/dev-review-plan-{timestamp}.md` - this is the living document that bounces. +### Build the shared Codex argument block: `CODEX_ARGS` + +`--model` and the per-seat effort knob are parsed but must actually reach +`codex exec`. Build a `CODEX_ARGS` array ONCE here and splice it into every +headless `codex exec` invocation (compose, bounce, execute, verify). Codex takes +model and reasoning-effort as `-c key=value` overrides, not as named flags: + +```bash +CODEX_ARGS=() +[ -n "$CODEX_MODEL" ] && CODEX_ARGS+=( -c "model=$CODEX_MODEL" ) +[ -n "${CODEX_EFFORT:-}" ] && CODEX_ARGS+=( -c "model_reasoning_effort=$CODEX_EFFORT" ) +``` + +`$CODEX_MODEL` comes from `--model`; `$CODEX_EFFORT` (if your invocation sets it) +maps to the executor's reasoning effort, mirroring the standalone runner's +`-c model_reasoning_effort=` wiring. When both are empty, `CODEX_ARGS` is empty +and the codex commands are byte-identical to today's — Codex falls back to its +own `~/.codex/config.toml` defaults. + +Splice it into each headless codex command with `"${CODEX_ARGS[@]}"`, e.g. +`codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o `. + ### Shared helper: `launch_codex_live` Use one helper for every live Codex pass: compose, bounce, execute, and verify. @@ -329,7 +359,7 @@ launch_codex_live "Codex - Compose" "exec" \ If `$LIVE_ACTIVE=false`, keep the existing headless behavior: ```bash -printf '%s' "$COMPOSE_PROMPT" | codex exec --full-auto -C "$(pwd)" -o /tmp/dev-review-plan-{timestamp}.md +printf '%s' "$COMPOSE_PROMPT" | codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o /tmp/dev-review-plan-{timestamp}.md ``` **Important:** Always pipe plan content through stdin or embed it in the prompt file. @@ -421,7 +451,7 @@ cp /tmp/dev-review-bounce-output-{timestamp}-{N}.md /tmp/dev-review-plan-{timest If `$LIVE_ACTIVE=false`, keep the existing headless behavior: ```bash -printf '%s' "$(cat /tmp/dev-review-bounce-prompt.md)" | codex exec --full-auto -C "$(pwd)" -o /tmp/dev-review-bounce-output-{timestamp}-{N}.md +printf '%s' "$(cat /tmp/dev-review-bounce-prompt.md)" | codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" -o /tmp/dev-review-bounce-output-{timestamp}-{N}.md # Orchestrator overwrites canonical plan with Codex's output cp /tmp/dev-review-bounce-output-{timestamp}-{N}.md /tmp/dev-review-plan-{timestamp}.md ``` @@ -582,7 +612,7 @@ launch_codex_live "Codex - Execute" "exec" \ If `$LIVE_ACTIVE=false`, keep the existing headless behavior: ```bash -codex exec --full-auto -C "$(pwd)" < /tmp/dev-review-exec-prompt.md > /tmp/dev-review-exec-output.md 2>&1 +codex exec --full-auto "${CODEX_ARGS[@]}" -C "$(pwd)" < /tmp/dev-review-exec-prompt.md > /tmp/dev-review-exec-output.md 2>&1 ``` After execution, verify changes: @@ -631,6 +661,8 @@ Pass 1: `codex review --uncommitted` (if available, skip if not) - If `$LIVE_ACTIVE=false`, keep the current headless `codex review` behavior. Pass 2: `codex exec --output-schema` with the review prompt +- Headless: splice `"${CODEX_ARGS[@]}"` in too, e.g. + `codex exec --full-auto "${CODEX_ARGS[@]}" --output-schema -C "$(pwd)" -o /tmp/dev-review-verdict.json < /tmp/dev-review-review-prompt.md` - If `$LIVE_ACTIVE=true`, use `launch_codex_live` with title `Codex - Verify`, mode `exec-schema`, prompt `/tmp/dev-review-review-prompt.md`, output `/tmp/dev-review-verdict.json`, and schema path @@ -745,6 +777,11 @@ Dev-review is integrated into GSD workflows: - **`--live` is launch-mode only**: It does not change the artifact contract. Prompts still come from temp files, and the final Codex message still lands in the same output file before Claude Code continues. +- **`--live` ignores model/effort overrides** (documented v1 limitation): the + visible-window launcher (`launch_codex_live`) does not splice `CODEX_ARGS`, so + `--model` / the reasoning-effort knob have no effect on live Codex passes. + Those overrides apply only on the headless path. Use headless mode when a + pinned Codex model or effort matters. - **`--live` is Windows-only in the first release**: On non-Windows platforms, warn once and continue headless. - **Live windows auto-close after a short read delay**: The wrapper waits 5 seconds diff --git a/skills/dev-review/schemas/review-verdict.json b/skills/dev-review/schemas/review-verdict.json index 3d4835a..3e4193e 100644 --- a/skills/dev-review/schemas/review-verdict.json +++ b/skills/dev-review/schemas/review-verdict.json @@ -3,7 +3,7 @@ "title": "ReviewVerdict", "description": "Structured code review verdict for the dev-review loop", "type": "object", - "required": ["verdict", "confidence", "summary", "issues"], + "required": ["verdict", "confidence", "summary", "issues", "scope_creep_detected", "iteration_notes"], "properties": { "verdict": { "type": "string", @@ -24,7 +24,8 @@ "type": "array", "items": { "type": "object", - "required": ["severity", "description"], + "additionalProperties": false, + "required": ["severity", "file", "line_range", "description", "suggestion"], "properties": { "severity": { "type": "string", diff --git a/tests/docs-sync-simulation.sh b/tests/docs-sync-simulation.sh new file mode 100644 index 0000000..53e8d2c --- /dev/null +++ b/tests/docs-sync-simulation.sh @@ -0,0 +1,204 @@ +#!/usr/bin/env bash +# tests/docs-sync-simulation.sh +# Hermetic sync-token gate for v1.5 Phase 5/Feature 6: pin the build preset +# triples so the runner code and the user-facing docs cannot drift. +# +# The `codex-build` preset is defined in ONE place (dev-review/codex/dev-review.sh +# apply_preset) but DESCRIBED in two skill docs (skills/codex-build/SKILL.md and +# skills/dev-review/SKILL.md). The `claude-build` preset is defined beside it and +# described in dev-review/codex/claude-build.md. If someone edits the seats in one +# and forgets the others, the docs lie. This gate greps the load-bearing seat +# tokens out of each file and asserts they all agree on the same triples: +# +# composer = best (currently Opus), effort high +# executor = Codex, effort xhigh (model unpinned — CLI config rules) +# verifier = best (currently Opus), effort max +# bounces = 2 +# +# claude-build: +# composer = Codex, effort xhigh (model unpinned — CLI config rules) +# executor = best (currently Opus), effort high +# verifier = Codex, effort xhigh (model unpinned — CLI config rules) +# bounces = 2 +# +# Pattern: pure-grep assertions over the real files (no agents, no stubs). Each +# scenario is a (file, required-token) pair. This is a structural invariant test, +# the same flavour as the frozen-surface grep checks in classifier-simulation.sh. + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" + +RUNNER="$REPO_ROOT/dev-review/codex/dev-review.sh" +CODEX_BUILD_SKILL="$REPO_ROOT/skills/codex-build/SKILL.md" +CLAUDE_BUILD_DOC="$REPO_ROOT/dev-review/codex/claude-build.md" +DEV_REVIEW_SKILL="$REPO_ROOT/skills/dev-review/SKILL.md" + +TOTAL=0 +FAILURES=0 +pass() { printf "PASS: %s\n" "$1"; } +fail() { printf "FAIL: %s\n" "$1" >&2; FAILURES=$((FAILURES + 1)); } + +# assert_grep