Adversarial AI-Collaborative Workflow for Statistical Software Development — Codex Edition
Adapted from StatsClaw for Claude Code by Yiqing Xu (Stanford) & Tianzhu Qin (Cambridge). Redesigned from the ground up for OpenAI Codex. See Acknowledgments.
StatsClaw for Codex is a multi-agent workflow framework for OpenAI Codex that helps researchers build, test, and document statistical software packages with AI agent teams. It implements the adversarial verification methodology introduced in:
Qin, Tianzhu and Yiqing Xu. 2026. "StatsClaw: An AI-Collaborative Workflow for Statistical Software Development."
The core principle:
The process that generates code must never be the same process that validates it.
You describe what you need — implement an estimator from a paper, fix a bug, run a Monte Carlo study, translate an R package to Python — and StatsClaw coordinates eight specialized AI roles through an explicit state machine with enforced information barriers:
- The builder writes code without seeing the test spec
- The tester validates without seeing the source code
- The simulator generates data without knowing the algorithm
When all pipelines converge independently, confidence in correctness is high — analogous to independent replication in experimental science.
- Xuanyu Cai, City University of Macau — xuanyuCAI@outlook.com
- Wenli Xu, City University of Macau — wlxu@cityu.edu.mo
- Clone the repository: `git clone https://github.com/gorgeousfish/statsclaw-for-codex.git`
- Copy the `statsclawforcodex/` directory into your Codex workspace
- Open Codex and describe what you want in natural language
- The orchestrator reads `AGENTS.md`, routes your request, and runs the full workflow automatically
> Build an R package from this paper. Three probit estimation methods in C++
> via Rcpp/Armadillo: MLE, Gibbs sampler, and Metropolis-Hastings. After
> building, run a Monte Carlo simulation comparing all three.
StatsClaw auto-detects the language, selects the workflow type, and proceeds autonomously — raising HOLD signals when your domain expertise is needed.
StatsClaw orchestrates eight specialized AI roles, each operating under strict information isolation:
| Role | Purpose |
|---|---|
| Orchestrator | Coordinates the workflow, dispatches roles, enforces the state machine |
| Planner | Reads your paper/formulas, executes deep comprehension protocol, produces isolated specifications |
| Builder | Writes source code from spec.md — never sees the test spec |
| Tester | Validates independently from test-spec.md — never sees the code spec or builder's implementation notes |
| Simulator | Runs Monte Carlo studies from sim-spec.md — never sees either spec |
| Scriber | Documents architecture, generates tutorials, maintains the audit trail |
| Reviewer | Cross-checks all pipelines, audits tolerance integrity, issues ship / no-ship verdict |
| Shipper | Commits, pushes, opens PRs — only after explicit human approval |
An optional Distiller role extracts reusable knowledge to a local .brain/ repository (opt-in).
The architecture's value lies in what each role cannot see:
| Role | Sees | Never Sees |
|---|---|---|
| Builder | `spec.md` | `test-spec.md`, `sim-spec.md` |
| Tester | `test-spec.md` | `spec.md`, `sim-spec.md`, source code |
| Simulator | `sim-spec.md` | `spec.md`, `test-spec.md`, source code |
This prevents each role from teaching to the test. A bug that survives must simultaneously satisfy two or three independently derived behavioral contracts — analogous to independent replication in experimental science.
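As a toy illustration of these barriers, the visibility table above could be enforced with a simple access map. This is a hypothetical sketch, not the framework's actual `io-manifest.md` mechanism; every function and variable name here is invented.

```python
# Hypothetical sketch of the information barriers; the framework's real
# enforcement (io-manifest.md evidence recording) is not reproduced here.
ROLE_VISIBILITY = {
    "builder":   {"spec.md"},
    "tester":    {"test-spec.md"},
    "simulator": {"sim-spec.md"},
}
GUARDED = {"spec.md", "test-spec.md", "sim-spec.md"}  # barrier-protected specs

def can_read(role: str, artifact: str) -> bool:
    """A pipeline role may read only its own spec; other specs are off-limits."""
    if artifact in GUARDED:
        return artifact in ROLE_VISIBILITY.get(role, set())
    return True  # shared artifacts (e.g. status files) are not barrier-guarded

print(can_read("builder", "spec.md"))       # True
print(can_read("builder", "test-spec.md"))  # False
```

In a real run, every read would also be logged as evidence, so an auditor can verify after the fact that no role crossed its barrier.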
```text
                 planner (bridge)
               /        |         \
       spec.md     test-spec.md    sim-spec.md
             /          |           \
      builder ─ ─ (parallel) ─ ─ simulator
 (code pipeline)        |     (simulation pipeline)
             \          |           /
   implementation.md    |     simulation.md
              \         |          /
               \        v         /
                tester   <-- sequential, after merge-back
           (test pipeline)
                  |
               audit.md
                  |
    scriber → [distiller]? → reviewer → shipper
```
| Workflow | Role Sequence |
|---|---|
| Code | orchestrator → planner → builder → tester → scriber → reviewer → shipper? |
| Docs-only | orchestrator → planner → scriber → reviewer → shipper? |
| Simulation + Code | orchestrator → planner → [builder ∥ simulator] → tester → scriber → reviewer → shipper? |
| Simulation-only | orchestrator → planner → simulator → tester → scriber → reviewer → shipper? |
| Validation-only | orchestrator → planner → tester → scriber → reviewer |
| Review-only | orchestrator → reviewer → shipper? |
| You say | What happens |
|---|---|
| "Build an R package from this paper" | Full Workflow: comprehension → spec → build → test → document → review |
| "Fix the failing test in this repo" | Single Fix: focused spec → build → validate → review → ship |
| "Run a Monte Carlo validation" | Monte Carlo: sim-spec → simulate → validate → review |
| "Resume the previous run" | Resume: restore state from last handoff and continue |
| "Review this before shipping" | Review Only: reviewer audits existing artifacts → verdict |
| "Set up weekly regression checks" | Automation: configure recurring validation patrol |
| "Update the README" | Docs Workflow: spec → scriber implements → review |
| "Just bump the version number" | Simplified: builder → tester → ship (user confirms) |
Short prompts work. Routing is semantic — you never need to learn StatsClaw terminology.
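To illustrate the request-to-workflow mapping in the table above, here is a deliberately simplified keyword router. The real engine (`helpers/workflow_router.py`) is semantic rather than keyword-based; the keyword list and role sequences below are a toy approximation, not its actual behavior.

```python
# Toy stand-in for semantic routing: map a natural-language request to a
# role sequence. Keyword matching is an assumption for illustration only.
WORKFLOWS = {
    "fix":         ["orchestrator", "planner", "builder", "tester",
                    "reviewer", "shipper"],
    "monte carlo": ["orchestrator", "planner", "simulator", "tester",
                    "reviewer"],
    "readme":      ["orchestrator", "planner", "scriber", "reviewer"],
}
DEFAULT = ["orchestrator", "planner", "builder", "tester",
           "scriber", "reviewer", "shipper"]

def route(request: str) -> list[str]:
    """Pick the first workflow whose keyword appears in the request."""
    text = request.lower()
    for keyword, sequence in WORKFLOWS.items():
        if keyword in text:
            return sequence
    return DEFAULT  # full workflow when nothing more specific matches

print(route("Run a Monte Carlo validation"))
# ['orchestrator', 'planner', 'simulator', 'tester', 'reviewer']
```

The point is the shape of the mapping, not the matching strategy: whatever routing decides, the output is always an explicit role sequence that the state machine then enforces.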
StatsClaw for Codex is not a drop-in port of the Claude Code version. The original StatsClaw was designed around Claude Code's built-in Agent tool, GitHub workspace repositories for persistent state, and /loop scheduling for recurring tasks. These primitives have no direct equivalents in OpenAI Codex. Rather than emulating Claude Code's execution model, the Codex edition rearchitects every subsystem around Codex-native capabilities while preserving the adversarial verification methodology.
| Aspect | Claude Code Edition | Codex Edition |
|---|---|---|
| Entry point | `CLAUDE.md` | `AGENTS.md` |
| Agent definitions | `agents/*.md` (9 agent files) | `skills/*.md` (five-section skill format with isolation contracts) |
| Orchestration | `agents/leader.md` (prompt-driven) | `skills/orchestrate.md` + `helpers/workflow_router.py` (explicit routing engine) |
| Role dispatch | Claude Code Agent tool (in-session sub-agents) | Codex subagents or serial role execution with fresh context capsules |
| Canonical state | Remote GitHub workspace repository | Local .statsclaw/ run store (no remote dependency) |
| User interaction | Claude chat window | Codex ask_user tool; automations degrade to inbox-style items |
| Recurring tasks | `/loop` command | Codex Desktop automations with presets in `.statsclaw/state/automation-presets/` |
| Isolation enforcement | Prompt discipline ("never read X") | io-manifest.md evidence recording; default grade audited-soft-isolated |
| State executors | Agent prompts manage state directly | Python helpers as authoritative executors (helpers/*.py) |
| Ship in automations | Allowed with user approval | Never — automations can observe, plan, and report but never push, PR, or release |
| Lock model | Implicit (single-session) | Explicit tri-level locks (repo / run / write-surface) with heartbeat and expiry |
- Artifact-first execution — All decisions and evidence live in versioned `.md` files with unified frontmatter, not conversation history. Runs are resumable and auditable across sessions. This directly addresses session volatility: Codex tasks start with a fresh context, so every piece of state must be persisted to disk.
- Explicit state machine with hard gates — Every transition has preconditions verified by Python helpers. No state can be skipped or bypassed. The Claude Code edition relies on prompt discipline; the Codex edition makes this deterministic through code.
- Testable helper layer — State mutations go through Python scripts (`helpers/`) that can be independently tested with pytest, separating declarative protocols (skill files) from imperative execution.
- Tri-level lock model — Repo, run, and write-surface locks prevent concurrent conflicts, enabling future multi-agent parallel execution. Claude Code's single-session model provides implicit mutual exclusion; Codex's task-based model requires explicit coordination.
- Automation safety — Codex Desktop automations are restricted by design: they can monitor, validate, and report, but never push code or open PRs without human gating. This is stricter than the Claude Code edition.
- Local-first state — No remote workspace repository required. All run state lives in `.statsclaw/`, reducing setup friction and eliminating the GitHub-as-runtime dependency. Remote sync is available on demand.
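As an illustration of the tri-level lock model described above, the sketch below shows how a file-based lock with heartbeat and expiry might work. The file layout, field names, and TTL value are assumptions for illustration; the authoritative executors live in `helpers/` and may differ.

```python
# Hypothetical file-based lock with heartbeat and expiry. One lock file per
# level (repo / run / write-surface); a stale heartbeat makes a lock stealable.
import json
import tempfile
import time
from pathlib import Path

LOCK_TTL_SECONDS = 300  # assumed expiry window, not the framework's actual value

def acquire(lock_dir: Path, level: str, holder: str) -> bool:
    """Try to take the lock at the given level; fail if a live lock exists."""
    lock_file = lock_dir / f"{level}.lock.json"
    if lock_file.exists():
        data = json.loads(lock_file.read_text())
        if time.time() - data["heartbeat"] < LOCK_TTL_SECONDS:
            return False  # live lock held by another task
    lock_file.write_text(json.dumps({"holder": holder, "heartbeat": time.time()}))
    return True

def heartbeat(lock_dir: Path, level: str) -> None:
    """Refresh the heartbeat so a long-running task keeps its lock."""
    lock_file = lock_dir / f"{level}.lock.json"
    data = json.loads(lock_file.read_text())
    data["heartbeat"] = time.time()
    lock_file.write_text(json.dumps(data))

locks = Path(tempfile.mkdtemp())
print(acquire(locks, "run", "builder-task"))  # True: the lock was free
print(acquire(locks, "run", "tester-task"))   # False: live lock already held
```

Because each level is an independent file, a task can hold the run lock while a different task holds an unrelated write-surface lock, which is what makes parallel pipelines coordinate safely.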
Both editions share the same core methodology from the original paper:
- Deep comprehension protocol — mandatory understanding check before any code is generated
- Three-pipeline isolation — builder, tester, and simulator never see each other's specifications
- Adversarial verification — independent convergence across pipelines proves correctness
- HOLD / BLOCK / STOP signal system — structured interrupt handling for human–AI coordination
- Eight specialized roles with defined responsibilities and information access rules
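The HOLD / BLOCK / STOP signal system can be pictured as a severity ladder. In the sketch below only the three signal names come from the framework; the handler responses paraphrase the descriptions in this document, and the function names are invented.

```python
# Hypothetical rendering of the HOLD / BLOCK / STOP interrupt ladder.
# Signal names are the framework's; handler behavior is paraphrased.
from enum import Enum

class Signal(Enum):
    HOLD = 1   # pause: human domain expertise is needed before continuing
    BLOCK = 2  # a hard-gate precondition failed; the run cannot advance
    STOP = 3   # abort the run and preserve all artifacts for audit

def handle(signal: Signal) -> str:
    if signal is Signal.HOLD:
        return "await human input, then resume from the same state"
    if signal is Signal.BLOCK:
        return "record the failed precondition and halt the transition"
    return "freeze the run store and surface the audit trail"

print(handle(Signal.HOLD))
# await human input, then resume from the same state
```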
| Language | Priority | Supported Workflows |
|---|---|---|
| R package | P0 | Full Workflow, Single Fix, Validation, Monte Carlo, Resume, Patrol |
| Python package | P0 | Full Workflow, Single Fix, Validation, Monte Carlo, Resume, Patrol |
| Stata project | P1 | Single Fix, Validation, Resume |
| C / C++ backends | Supporting | R/Python ecosystem backends |
```text
statsclawforcodex/
├── AGENTS.md        # Codex entry point — orchestration policy
├── skills/          # Role execution protocols (orchestrate, builder, tester, ...)
├── helpers/         # Python authoritative executors (status, routing, signals, locks)
├── templates/       # Artifact templates (status.md, review.md, spec.md, ...)
├── profiles/        # Language profiles (r-package, python-package, stata-project, ...)
├── schemas/         # Artifact schema definitions
├── automation/      # Automation contracts and signal handlers
├── examples/        # Benchmark packs and workflow demos
├── docs/            # Framework documentation
├── .statsclaw/      # Runtime state (runs, locks, archive) — populated on first use
└── .brain/          # Local knowledge repository (opt-in)
```
- Credentials first, work second. Verify access before creating a run.
- Orchestrator dispatches, never does. The orchestrator plans and coordinates; roles do the work.
- Multi-pipeline, fully isolated. Code, test, and simulation pipelines never see each other's specs.
- Planner first, always. Every non-trivial request starts with deep comprehension and dual-spec production.
- Adversarial verification by design. Independent convergence proves correctness.
- Hard gates, not soft advice. State transitions have preconditions; artifacts are verified before advancing.
- Artifact-first execution. Decisions live in versioned files, not conversation history.
- Explicit ship actions. Nothing is pushed without user instruction or active patrol skill.
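The "hard gates, not soft advice" principle can be sketched as a precondition check over artifacts on disk: a transition is allowed only when every required artifact exists. This is an illustrative stand-in for the real helpers, which are not reproduced here; the precondition lists below are assumptions based on the artifact names used in this document.

```python
# Hypothetical hard-gate check: advancing to a role requires its input
# artifacts to exist on disk. Precondition lists are illustrative.
import tempfile
from pathlib import Path

PRECONDITIONS = {
    "builder":  ["spec.md"],
    "tester":   ["test-spec.md", "implementation.md"],
    "reviewer": ["audit.md"],
}

def gate(next_role: str, run_dir: Path) -> tuple[bool, list[str]]:
    """Return (ok, missing_artifacts) for advancing to next_role."""
    missing = [name for name in PRECONDITIONS.get(next_role, [])
               if not (run_dir / name).exists()]
    return (not missing, missing)

run = Path(tempfile.mkdtemp())
(run / "spec.md").write_text("estimator specification")
print(gate("builder", run))  # (True, [])
print(gate("tester", run))   # (False, ['test-spec.md', 'implementation.md'])
```

Because the check is computed from files rather than from conversation history, a resumed session reaches exactly the same verdict as the session that wrote the artifacts.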
StatsClaw for Codex is adapted from StatsClaw (https://github.com/statsclaw/statsclaw), a multi-agent architecture for Anthropic Claude Code created by:
- Yiqing Xu (徐轶青), Assistant Professor, Department of Political Science, Stanford University
- Tianzhu Qin (秦天柱), PhD Candidate, Centre for Human-Inspired AI, University of Cambridge
Their paper — StatsClaw: An AI-Collaborative Workflow for Statistical Software Development (Qin and Xu, 2026) — introduces the adversarial verification methodology that this project builds upon: enforcing information barriers between code generation and validation, requiring mandatory deep-comprehension checks before any code is written, and governing workflow progression through a state machine with hard gates at every transition. The paper demonstrates the approach across three real applications — paper-to-feature development (panelView), cross-language translation (interflex R → Python), and sustained multi-day refactoring (fect) — providing strong evidence that structured AI-assisted workflows can absorb engineering overhead while preserving researcher control over substantive methodological decisions.
The Codex edition takes the original Claude Code architecture and redesigns it from the ground up for OpenAI Codex, adapting every component — from agent dispatch and state management to isolation enforcement and automation — to Codex-native primitives. The core methodology is preserved: comprehension → isolation → verification → audit. The execution model is entirely new. See Codex Edition vs Claude Code Edition for details.
We are deeply grateful to Yiqing Xu and Tianzhu Qin for creating StatsClaw, making it open-source, and laying the foundation for AI-collaborative statistical software development. StatsClaw for Codex would not exist without their pioneering work.
If you use StatsClaw for Codex in your research or software development, please cite both the original StatsClaw paper and the Codex adaptation:
Original StatsClaw paper (methodology and design):
Qin, Tianzhu and Yiqing Xu. 2026. "StatsClaw: An AI-Collaborative Workflow for Statistical Software Development."
```bibtex
@misc{qinxu2026statsclaw,
  title={StatsClaw: An AI-Collaborative Workflow for Statistical Software Development},
  author={Qin, Tianzhu and Xu, Yiqing},
  year={2026},
  howpublished={Mimeo, Stanford University},
  url={https://bit.ly/statsclaw}
}
```

StatsClaw for Codex (Codex-native implementation):
Cai, Xuanyu and Wenli Xu. 2026. "StatsClaw for Codex: An AI-Collaborative Workflow for Statistical Software Development (Codex Edition)." GitHub repository, https://github.com/gorgeousfish/statsclaw-for-codex.
```bibtex
@misc{caixu2026statsclawcodex,
  title={StatsClaw for Codex: An AI-Collaborative Workflow for Statistical Software Development (Codex Edition)},
  author={Cai, Xuanyu and Xu, Wenli},
  year={2026},
  howpublished={GitHub},
  url={https://github.com/gorgeousfish/statsclaw-for-codex}
}
```

StatsClaw for Codex is released under the MIT License.
We are building StatsClaw for Codex in the open. Everyone is welcome.
- Report a bug — Open an issue
- Share an idea — Discussions
- Contribute code — Contributing guide
- See what is planned — Roadmap
Roadmap · Contributing · Architecture · Migration Guide
A framework for statisticians and econometricians. Works best with an expert in the loop.
