Merged
14 changes: 13 additions & 1 deletion .planning/ROADMAP.md
@@ -10,7 +10,19 @@ Co-Evolution is a tooling repo for structured iterative refinement between AI ag
- [x] **v1.1 Polish & Ergonomics** (shipped 2026-04-17) — v1.0 code review fixes (WR-01/02/03) + runtime ergonomics (REVISE auto-loop, visible live mode, branch/worktree management). 4 phases, 6 requirements closed. PR [#2](https://github.com/alanshurafa/co-evolution/pull/2) · See [`milestones/v1.1-ROADMAP.md`](milestones/v1.1-ROADMAP.md) · [`milestones/v1.1-SUMMARY.md`](milestones/v1.1-SUMMARY.md) · [`milestones/v1.1-REQUIREMENTS.md`](milestones/v1.1-REQUIREMENTS.md)
- [x] **v1.3 Reliability, Measurement & Cross-Platform** (shipped 2026-06-11) — stranded-fix landing, macOS/bash-3.2+5.2 portability with 3-OS CI, silent-failure hardening, and the bounce measurement stack (state.json, deterministic scorer + marker-fate ledger, blind judge, human report). Headline: 17.6% deletion-convergence measured; Fable-5 judge 7/7 improved. 9 phases. See [`milestones/v1.3-SUMMARY.md`](milestones/v1.3-SUMMARY.md) · audit at `docs/audits/2026-06-10-v13-audit.md`

## Active Milestone: v1.4 Distribution — npm + MCP (2026-06-11)
## Active Milestone: v1.5 Build with Codex — model ladder + orchestrated execution (2026-06-12)

**Goal:** Adopt the Codex-execution / Fable-orchestration split (per @cjzafir's pattern) in the dev-review runner: fix 3 latent env-export bugs, add per-seat model/effort config, a `--preset codex-build` shortcut, detached background execution with harness-exit-wakeup, a status-reader script, token capture to measure the 50% cost claim, and a `/codex-build` orchestration skill for both the runner and plugin transports. Design basis: `.planning/v1.5-DESIGN.md` (approved 2026-06-12).

- [ ] **Phase 0: Environment + research** (2026-06-12, in progress) — codex symlink + smoke; plugin install; R1 pin `claude -p --output-format json` envelope; R2 pin codex end-of-run token line; register milestone in .planning/; research notes filed.
- [ ] **Phase 1: Seat plumbing + env-export correctness** — `lib/co-evolution.sh` effort knobs + `invoke_codex_schema` move (B2); `dev-review.sh` `export CODEX_MODEL` (B1) + `export WORKDIR` (B3) + `--verifier/--claude-model` flags + per-seat env via `apply_seat_env`. Gate: `tests/run-all.sh` green; byte-parity with knobs off.
- [ ] **Phase 2: Claude-verifier hardening + `--preset codex-build`** — fenced-JSON verdict fallback; preset expansion (fable/high → codex/xhigh → fable/max, bounces=2, revise-loop=1); banner; `tests/preset-expansion-simulation.sh`.
- [ ] **Phase 3: Runner observability + status reader** — `state.json` additions (`current_phase`, `runner_pid`, `pre/post_execute_sha`, `orchestration.parent_run_id`); new `dev-review-status.sh` (~120 lines, exit codes 0/2/3/4/5); `tests/status-reader-simulation.sh`.
- [ ] **Phase 4: Token capture** — `CO_EVOLVE_TOKEN_CAPTURE=1` (default off); `invoke_claude` gated JSON mode; codex stderr harvest; `collect_token_usage` → `state.json.tokens`; `tests/token-capture-simulation.sh`.
- [ ] **Phase 5: `/codex-build` skill + docs** — new `skills/codex-build/SKILL.md` (preflight → plan → kick → wake/gate loop, both runner and plugin transports); CLAUDE.md Default Rule update; routing doc updates.
- [ ] **Phase 6: Dogfood + evidence** — 2–3 real `/codex-build` tasks (ACCEPT / REVISE→ACCEPT / ESCALATE); token evidence note; MCP parity (`vendor.sh` + `npm test`); memory update.

## Previous Milestone: v1.4 Distribution — npm + MCP (2026-06-11)

**Goal:** Make the bounce protocol invocable without `git clone`: a Node/TS
MCP server (`@alanshurafa/co-evolution-mcp`, one `co_evolve` tool) published
33 changes: 22 additions & 11 deletions .planning/STATE.md
@@ -1,17 +1,17 @@
---
gsd_state_version: 1.0
milestone: v1.4
milestone_name: Distribution — npm + MCP
milestone: v1.5
milestone_name: Build with Codex — model ladder + orchestrated execution
status: executing
stopped_at: v1.4 Phases 0-4 EXECUTED (2026-06-11). mcp/ package built + 4/4 hermetic smoke tests + stdio handshake verified; CI gains 3-OS mcp job; publish-mcp.yml ready. Remaining = Phase 5 (HUMAN: verify npm scope @alanshurafa, add NPM_TOKEN secret, git tag -> auto-publish, Claude Desktop round-trip) + Phase 6 post-ship registry/awesome-list.
last_updated: "2026-06-10T23:45:00.000Z"
last_activity: 2026-06-11 -- v1.4 milestone registered; v1.3 archived
stopped_at: v1.5 Phases 0-5 EXECUTED (b672145..30a9a00). Phase 6 PARTIAL (2026-06-12) — MCP vendor parity green; codex-verifier degrade path now FULLY GREEN end-to-end: model-leak fix + review-verdict.json schema 400 fix both landed, first real /codex-build ACCEPT produced (subtract-helper task, exit 0, APPROVED conf 96, verify tokens captured). Remaining Phase 6: claude /login (human, for full-ladder + non-zero claude_* tokens), REVISE→ACCEPT row, interactive baseline. v1.4 Phase 5 BLOCKED ON HUMAN (npm scope + NPM_TOKEN + git tag) — running in parallel, untouched.
last_updated: "2026-06-12T17:43:00.000Z"
last_activity: 2026-06-12 -- v1.5 Phase 6: review-verdict.json schema 400 fixed; codex-verifier degrade path E2E green (first ACCEPT)
progress:
total_phases: 8
completed_phases: 8
total_plans: 17
completed_plans: 18
percent: 100
total_phases: 7
completed_phases: 6
total_plans: 0
completed_plans: 0
percent: 86
---

# Project State
@@ -30,7 +30,18 @@ Milestone: v1.4 Distribution — npm + MCP
Phase: 0-4 complete; Phase 5 (publish) blocked on human items
Status: npm scope verification + NPM_TOKEN secret + git tag are Alan's; everything else built and CI-gated. v1.2 SC-4 gate still open (VERIFY-SC4.md).
Last activity: 2026-06-10 -- Phase 0 merges + LF policy + audit report
Working directories: `~/co-evolution-v13/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC

Milestone: v1.5 Build with Codex — model ladder + orchestrated execution
Phase: 0-5 EXECUTED; Phase 6 (dogfood + evidence) PARTIAL
Status: Phases 0-5 shipped on feat/v1.5-codex-build (b672145 Phase 0 env + research · fb6862a Phase 1 seats + B1/B2/B3 fixes · ffe765f Phase 2 verifier hardening + preset · 13c2bee Phase 3 observability + status reader · fb965ad Phase 4 token capture · 30a9a00 Phase 5 /codex-build skill + docs). Phase 6 partial — see below.
Phase 6 progress (2026-06-12):
- MCP vendor parity GREEN: `bash mcp/scripts/vendor.sh` clean; `(cd mcp && npm test)` 4/4 pass. `mcp/vendor/` is gitignored (generated-at-publish via `npm run build:vendor`, NOT checked in) — Phase 1/3/4 lib changes were additive and broke nothing.
- First real `/codex-build` dogfood (slugify task, scratch repo under $TMPDIR, `--verifier codex` degrade since headless claude is logged out): execute phase SUCCEEDED (slugify landed, all 4 scratch tests pass), but verify phase ERRORED → runner exit 2, `verdict_present: false`, ESCALATE. codex_total_tokens=21497 (token capture works); claude_* totals=0 (ladder not exercised on the degrade path). One re-kick (`--parent-run`, lineage recorded) hit the same error. NO ACCEPT data point yet.
- Real bug found AND FIXED (2026-06-12): the documented `--verifier codex` degrade leaked the preset's `VERIFIER_MODEL=fable` into the codex seat (`apply_seat_env`, dev-review.sh:1370-1372) → codex on a ChatGPT account returned HTTP 400 "The 'fable' model is not supported". Fix = cross-agent leak guard in `apply_seat_env` + `resolve_seat_model_string` (drop a wrong-kind model+effort pair as a unit; codex seat falls back to `codex:(default)@(default)`). Sim scenario (h) added (preset-expansion-simulation.sh now 8/8; run-all 25/25 green). Re-run proof: fable 400 GONE, execute SUCCEEDED, scratch run-tests ALL PASS, codex_total_tokens=17237, wall 54s — BUT verdict still null: the verify seat now hit a SEPARATE, pre-existing schema 400 (`invalid_json_schema`: nested `issues.items` missing `additionalProperties:false` in skills/dev-review/schemas/review-verdict.json). Seat fix proven; degrade path was blocked one layer deeper by the schema bug. Detail in .planning/research/2026-06-12-token-evidence.md.
- Schema 400 FIXED (2026-06-12): OpenAI strict structured-output requires `additionalProperties:false` + a `required` list covering EVERY property on EVERY object node. `issues.items` was missing both; top-level `required` omitted `scope_creep_detected`/`iteration_notes`. Tightened all THREE canonical copies identically (schemas/, runners/codex-ps/schemas/, skills/dev-review/schemas/) so the drift guard stays green; shell `validate_review_verdict` is unaffected (stays loose, independent of this file). DEGRADE-PATH E2E NOW GREEN: real /codex-build (subtract-helper task, scratch repo under $TMPDIR, `--verifier codex`, --branch auto) → exit 0, verify OK, **verdict APPROVED conf 96**, verdict.json complete (all 6 strict fields), tokens execute=30606 verify=14485 codex_total=45091, wall 59s, scratch run-tests ALL 4 PASS. First full ACCEPT-path evidence row (degrade path). run-all 25/25 green. First ACCEPT data point logged in .planning/research/2026-06-12-token-evidence.md.
- Remaining Phase 6 (HUMAN + follow-up): `claude /login` on this Mac (unblocks full ladder + non-zero claude_* tokens; degrade-path claude_* are 0 by design); REVISE→ACCEPT row still owed (ESCALATE + ACCEPT now evidenced); an interactive `/dev-review` baseline for the 50%-claim denominator.
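
The cross-agent leak guard described in the notes above (drop a wrong-kind model+effort pair as a unit, then fall back to `codex:(default)@(default)`) can be illustrated with a minimal Python sketch. This is not the shell implementation in `dev-review.sh`. The `MODEL_OWNER` table and the way a model's "kind" is detected are assumptions made for illustration, as is treating `fable` as a claude-side model.

```python
# Hypothetical lookup of which agent each model name belongs to.
# The real runner may detect a model's kind differently; this table is an assumption.
MODEL_OWNER = {"fable": "claude"}


def resolve_seat_model_string(agent, model=None, effort=None):
    """Build an `agent:model@effort` label, dropping a leaked model+effort pair.

    If the model is known to belong to a different agent, the model and its
    effort are discarded together so the seat falls back to its defaults.
    """
    owner = MODEL_OWNER.get(model) if model else None
    if owner is not None and owner != agent:
        # Wrong-kind pair: drop model and effort as a unit.
        model, effort = None, None
    return f"{agent}:{model or '(default)'}@{effort or '(default)'}"


# The leak from the notes: a preset's fable/max pair reaching the codex seat.
print(resolve_seat_model_string("codex", "fable", "max"))
# The same pair on the seat it was meant for is left alone.
print(resolve_seat_model_string("claude", "fable", "max"))
```

Dropping the effort together with the model avoids sending an effort level that was chosen for a different agent's model.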
Last activity: 2026-06-12 -- Phase 6: schema 400 fixed (3 copies); degrade-path E2E green, first ACCEPT (.planning/research/2026-06-12-token-evidence.md)
Working directories: `~/Project/co-evolution/` on the Mac (per-machine clone; SMB checkout `/Volumes/Project/co-evolution` is sync-only), `C:/Users/alan/Project/co-evolution-*` on the PC
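
The strict structured-output rule behind the schema 400 fix above (every object node needs `additionalProperties:false` and a `required` list covering every property) lends itself to a mechanical check. The following is a minimal Python sketch of such a check, not the repo's drift guard or `validate_review_verdict`. The sample schema is a trimmed, hypothetical stand-in for `review-verdict.json`: the field names `verdict`, `severity`, and `description` are assumptions, while `issues`, `scope_creep_detected`, and `iteration_notes` come from the notes.

```python
def is_object_node(schema):
    """True when the node's type is "object" or a type list containing it."""
    node_type = schema.get("type")
    return node_type == "object" or (isinstance(node_type, list) and "object" in node_type)


def find_strict_violations(schema, path="$"):
    """Return (path, problem) pairs for object nodes that break the strict rule."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    if is_object_node(schema):
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append((path, "additionalProperties is not false"))
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append((path, "required omits: " + ", ".join(missing)))
        for name, sub_schema in props.items():
            problems.extend(find_strict_violations(sub_schema, f"{path}.properties.{name}"))
    if "items" in schema:
        # Array element schemas are object nodes too and must be checked.
        problems.extend(find_strict_violations(schema["items"], f"{path}.items"))
    return problems


# Hypothetical schema reproducing the two defects described in the notes.
SAMPLE_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["verdict", "issues"],  # omits two properties
    "properties": {
        "verdict": {"type": "string"},
        "issues": {
            "type": "array",
            "items": {  # missing additionalProperties and required
                "type": "object",
                "properties": {
                    "severity": {"type": "string"},
                    "description": {"type": "string"},
                },
            },
        },
        "scope_creep_detected": {"type": "boolean"},
        "iteration_notes": {"type": "string"},
    },
}

for node_path, problem in find_strict_violations(SAMPLE_SCHEMA):
    print(node_path, "->", problem)
```

Run against each of the three canonical schema copies, a check along these lines would flag the nested `issues.items` node before the API rejects it.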

macOS baseline before v1.3 fixes: scorer-verification 11/14; code-proposer sim 1/16; pr-emitter sim 4/12; template-proposer sim 1/8; revise-loop sim aborts. Root causes: bash 3.2 (mapfile, source <(…)), BSD sed GNU-isms. Target after Phase 0.5: all green on macOS.

71 changes: 71 additions & 0 deletions .planning/research/2026-06-12-claude-json-envelope.md
@@ -0,0 +1,71 @@
# R1: claude -p JSON Envelope — Field Reference

**Date:** 2026-06-12
**Phase:** v1.5 Phase 0
**Purpose:** Pin the exact JSON output structure of `claude -p --output-format json` for Phase 4 token-capture parsing.

## Command run

```
claude -p --output-format json --model claude-haiku-4-5-20251001 "Reply with exactly: PING"
```

## Verbatim output (not-logged-in state; all usage fields present and zero)

```json
{"type":"result","subtype":"success","is_error":true,"api_error_status":null,"duration_ms":1103,"duration_api_ms":0,"num_turns":1,"result":"Not logged in · Please run /login","stop_reason":"stop_sequence","session_id":"4642f382-c299-4d72-bcaf-3e7bca396c7d","total_cost_usd":0,"usage":{"input_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":0,"server_tool_use":{"web_search_requests":0,"web_fetch_requests":0},"service_tier":"standard","cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"inference_geo":"","iterations":[],"speed":"standard"},"modelUsage":{},"permission_denials":[],"terminal_reason":"completed","fast_mode_state":"off","uuid":"def45115-84e3-4e92-a601-20e0dae05dda"}
```

Exit code: 1 (error because not logged in; envelope still emitted on stdout)

## Top-level fields

| Field | Type | Notes |
|---|---|---|
| `type` | string | Always `"result"` |
| `subtype` | string | `"success"` even on error |
| `is_error` | bool | `true` when auth fails or the API returns an error |
| `api_error_status` | null/int | HTTP status code on API errors; null here |
| `duration_ms` | int | Wall time in ms |
| `duration_api_ms` | int | API time in ms |
| `num_turns` | int | Number of conversation turns |
| `result` | string | The model's text output (or error message) |
| `stop_reason` | string | e.g. `"stop_sequence"`, `"end_turn"` |
| `session_id` | string | UUID |
| `total_cost_usd` | float | Total cost in USD (0 when not logged in) |
| `usage` | object | Token usage breakdown — see below |
| `modelUsage` | object | Per-model usage breakdown (empty when not logged in) |
| `permission_denials` | array | Tool permission denial events |
| `terminal_reason` | string | `"completed"` |
| `fast_mode_state` | string | `"off"` |
| `uuid` | string | Run UUID |

## usage subfields

| Field | Type | Notes |
|---|---|---|
| `input_tokens` | int | Prompt input tokens |
| `cache_creation_input_tokens` | int | Cache write tokens |
| `cache_read_input_tokens` | int | Cache hit tokens |
| `output_tokens` | int | Response tokens |
| `server_tool_use.web_search_requests` | int | Web search count |
| `server_tool_use.web_fetch_requests` | int | Web fetch count |
| `service_tier` | string | `"standard"` or `"priority"` |
| `cache_creation.ephemeral_1h_input_tokens` | int | 1-hour ephemeral cache write tokens |
| `cache_creation.ephemeral_5m_input_tokens` | int | 5-min ephemeral cache write tokens |
| `inference_geo` | string | Inference geography code |
| `iterations` | array | Per-iteration usage (for multi-turn) |
| `speed` | string | `"standard"` |

## Notes for Phase 4

- All token fields live under `.usage`. Phase 4's `invoke_claude` gated JSON mode should extract: `.usage.input_tokens`, `.usage.output_tokens`, `.usage.cache_creation_input_tokens`, `.usage.cache_read_input_tokens`.
- `.total_cost_usd` is a direct top-level field, not nested.
- The envelope is always emitted to **stdout** even when `is_error=true`.
- The exit code is 1 on auth error, so Phase 4 must treat a non-zero exit as still carrying a usable envelope: capture stdout regardless of exit code (e.g. with `|| true`).
- **Limitation:** This capture is from a not-logged-in shell. When logged in, `modelUsage` will be populated and `iterations` may have per-turn breakdown. Field names are stable across auth state.
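
The extraction described in these notes can be sketched in Python as below. This is an illustration under stated assumptions, not the Phase 4 `invoke_claude` shell code: the helper names `run_claude_json` and `extract_token_usage` are hypothetical, and the sample string is the captured envelope trimmed to the fields the sketch reads.

```python
import json
import subprocess

# Captured not-logged-in envelope, trimmed to the fields read below.
SAMPLE_ENVELOPE = (
    '{"type":"result","subtype":"success","is_error":true,"total_cost_usd":0,'
    '"usage":{"input_tokens":0,"cache_creation_input_tokens":0,'
    '"cache_read_input_tokens":0,"output_tokens":0}}'
)

TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def run_claude_json(prompt):
    """Run `claude -p --output-format json` and return stdout even on failure.

    Hypothetical helper: `check=True` is deliberately not passed, so a
    non-zero exit (e.g. 1 on auth error) does not raise and stdout is kept.
    """
    completed = subprocess.run(
        ["claude", "-p", "--output-format", "json", prompt],
        capture_output=True,
        text=True,
    )
    return completed.stdout


def extract_token_usage(stdout_text):
    """Return token counts, cost, and error flag, or None if stdout is not an envelope."""
    try:
        envelope = json.loads(stdout_text)
    except (TypeError, ValueError):
        return None
    if not isinstance(envelope, dict):
        return None
    usage = envelope.get("usage")
    if not isinstance(usage, dict):
        usage = {}
    record = {name: int(usage.get(name) or 0) for name in TOKEN_FIELDS}
    record["total_cost_usd"] = float(envelope.get("total_cost_usd") or 0)  # top-level field
    record["is_error"] = bool(envelope.get("is_error"))
    return record


print(extract_token_usage(SAMPLE_ENVELOPE))
```

Returning `None` for unparseable stdout lets a caller record "no token data" rather than fail the run, which fits token capture being off by default.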

## Auth status at capture time

`claude whoami` returns: `Not logged in · Please run /login`
(The Mac's interactive Claude Code sessions authenticate through the Electron app, not this shell. The sub-agent shell does not carry the session token.)