Adversarial AI-Collaborative Workflow for Statistical Software Development — Codex Edition
Adapted from StatsClaw for Claude Code by Yiqing Xu (Stanford) & Tianzhu Qin (Cambridge). Redesigned from the ground up for OpenAI Codex. See Acknowledgments.
StatsClaw for Codex is a multi-agent workflow framework for OpenAI Codex that helps researchers build, test, and document statistical software packages with AI agent teams. It implements the adversarial verification methodology introduced in:
Qin, Tianzhu and Yiqing Xu. 2026. "StatsClaw: An AI-Collaborative Workflow for Statistical Software Development."
The core principle:
The process that generates code must never be the same process that validates it.
You describe what you need — implement an estimator from a paper, fix a bug, run a Monte Carlo study, translate an R package to Python — and StatsClaw coordinates eight specialized AI roles through an explicit state machine with enforced information barriers:
- The builder writes code without seeing the test spec
- The tester validates without seeing the source code
- The simulator generates data without knowing the algorithm
When all pipelines converge independently, confidence in correctness is high — analogous to independent replication in experimental science.
- Xuanyu Cai, City University of Macau — xuanyuCAI@outlook.com
- Wenli Xu, City University of Macau — wlxu@cityu.edu.mo
- Clone the repository: `git clone https://github.com/gorgeousfish/statsclaw-for-codex.git`
- Copy the `statsclawforcodex/` directory into your Codex workspace
- Open Codex and describe what you want in natural language
- The orchestrator reads `AGENTS.md`, routes your request, and runs the full workflow automatically
> Build an R package from this paper. Three probit estimation methods in C++
> via Rcpp/Armadillo: MLE, Gibbs sampler, and Metropolis-Hastings. After
> building, run a Monte Carlo simulation comparing all three.
StatsClaw auto-detects the language, selects the workflow type, and proceeds autonomously — raising HOLD signals when your domain expertise is needed.
StatsClaw orchestrates eight specialized AI roles, each operating under strict information isolation:
| Role | Purpose |
|---|---|
| Orchestrator | Coordinates the workflow, dispatches roles, enforces the state machine |
| Planner | Reads your paper/formulas, executes deep comprehension protocol, produces isolated specifications |
| Builder | Writes source code from spec.md — never sees the test spec |
| Tester | Validates independently from test-spec.md — never sees the code spec or builder's implementation notes |
| Simulator | Runs Monte Carlo studies from sim-spec.md — never sees either spec |
| Scriber | Documents architecture, generates tutorials, maintains the audit trail |
| Reviewer | Cross-checks all pipelines, audits tolerance integrity, issues ship / no-ship verdict |
| Shipper | Commits, pushes, opens PRs — only after explicit human approval |
An optional Distiller role extracts reusable knowledge to a local .brain/ repository (opt-in).
The architecture's value lies in what each role cannot see:
| Role | Sees | Never Sees |
|---|---|---|
| Builder | `spec.md` | `test-spec.md`, `sim-spec.md` |
| Tester | `test-spec.md` | `spec.md`, `sim-spec.md`, source code |
| Simulator | `sim-spec.md` | `spec.md`, `test-spec.md`, source code |
This prevents each role from teaching to the test. A bug that survives must simultaneously satisfy two or three independently derived behavioral contracts — analogous to independent replication in experimental science.
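As a toy illustration of these barriers, the visibility table above could be enforced with a simple access map. This is a hypothetical sketch, not the framework's actual `io-manifest.md` mechanism; every function and variable name here is invented.

```python
# Hypothetical sketch of the information barriers; the framework's real
# enforcement (io-manifest.md evidence recording) is not reproduced here.
ROLE_VISIBILITY = {
    "builder":   {"spec.md"},
    "tester":    {"test-spec.md"},
    "simulator": {"sim-spec.md"},
}
GUARDED = {"spec.md", "test-spec.md", "sim-spec.md"}  # barrier-protected specs

def can_read(role: str, artifact: str) -> bool:
    """A pipeline role may read only its own spec; other specs are off-limits."""
    if artifact in GUARDED:
        return artifact in ROLE_VISIBILITY.get(role, set())
    return True  # shared artifacts (e.g. status files) are not barrier-guarded

print(can_read("builder", "spec.md"))       # True
print(can_read("builder", "test-spec.md"))  # False
```

In a real run, every read would also be logged as evidence, so an auditor can verify after the fact that no role crossed its barrier.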
```text
                 planner (bridge)
               /        |         \
       spec.md     test-spec.md    sim-spec.md
             /          |           \
      builder ─ ─ (parallel) ─ ─ simulator
 (code pipeline)        |     (simulation pipeline)
             \          |           /
   implementation.md    |     simulation.md
              \         |          /
               \        v         /
                tester   <-- sequential, after merge-back
           (test pipeline)
                  |
               audit.md
                  |
    scriber → [distiller]? → reviewer → shipper
```
| Workflow | Role Sequence |
|---|---|
| Code | orchestrator → planner → builder → tester → scriber → reviewer → shipper? |
| Docs-only | orchestrator → planner → scriber → reviewer → shipper? |
| Simulation + Code | orchestrator → planner → [builder ∥ simulator] → tester → scriber → reviewer → shipper? |
| Simulation-only | orchestrator → planner → simulator → tester → scriber → reviewer → shipper? |
| Validation-only | orchestrator → planner → tester → scriber → reviewer |
| Review-only | orchestrator → reviewer → shipper? |
| You say | What happens |
|---|---|
| "Build an R package from this paper" | Full Workflow: comprehension → spec → build → test → document → review |
| "Fix the failing test in this repo" | Single Fix: focused spec → build → validate → review → ship |
| "Run a Monte Carlo validation" | Monte Carlo: sim-spec → simulate → validate → review |
| "Resume the previous run" | Resume: restore state from last handoff and continue |
| "Review this before shipping" | Review Only: reviewer audits existing artifacts → verdict |
| "Set up weekly regression checks" | Automation: configure recurring validation patrol |
| "Update the README" | Docs Workflow: spec → scriber implements → review |
| "Just bump the version number" | Simplified: builder → tester → ship (user confirms) |
Short prompts work. Routing is semantic — you never need to learn StatsClaw terminology.
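To illustrate the request-to-workflow mapping in the table above, here is a deliberately simplified keyword router. The real engine (`helpers/workflow_router.py`) is semantic rather than keyword-based; the keyword list and role sequences below are a toy approximation, not its actual behavior.

```python
# Toy stand-in for semantic routing: map a natural-language request to a
# role sequence. Keyword matching is an assumption for illustration only.
WORKFLOWS = {
    "fix":         ["orchestrator", "planner", "builder", "tester",
                    "reviewer", "shipper"],
    "monte carlo": ["orchestrator", "planner", "simulator", "tester",
                    "reviewer"],
    "readme":      ["orchestrator", "planner", "scriber", "reviewer"],
}
DEFAULT = ["orchestrator", "planner", "builder", "tester",
           "scriber", "reviewer", "shipper"]

def route(request: str) -> list[str]:
    """Pick the first workflow whose keyword appears in the request."""
    text = request.lower()
    for keyword, sequence in WORKFLOWS.items():
        if keyword in text:
            return sequence
    return DEFAULT  # full workflow when nothing more specific matches

print(route("Run a Monte Carlo validation"))
# ['orchestrator', 'planner', 'simulator', 'tester', 'reviewer']
```

The point is the shape of the mapping, not the matching strategy: whatever routing decides, the output is always an explicit role sequence that the state machine then enforces.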
StatsClaw for Codex is not a drop-in port of the Claude Code version. The original StatsClaw was designed around Claude Code's built-in Agent tool, GitHub workspace repositories for persistent state, and /loop scheduling for recurring tasks. These primitives have no direct equivalents in OpenAI Codex. Rather than emulating Claude Code's execution model, the Codex edition rearchitects every subsystem around Codex-native capabilities while preserving the adversarial verification methodology.
| Aspect | Claude Code Edition | Codex Edition |
|---|---|---|
| Entry point | `CLAUDE.md` | `AGENTS.md` |
| Agent definitions | `agents/*.md` (9 agent files) | `skills/*.md` (five-section skill format with isolation contracts) |
| Orchestration | `agents/leader.md` (prompt-driven) | `skills/orchestrate.md` + `helpers/workflow_router.py` (explicit routing engine) |
| Role dispatch | Claude Code Agent tool (in-session sub-agents) | Codex subagents or serial role execution with fresh context capsules |
| Canonical state | Remote GitHub workspace repository | Local .statsclaw/ run store (no remote dependency) |
| User interaction | Claude chat window | Codex ask_user tool; automations degrade to inbox-style items |
| Recurring tasks | `/loop` command | Codex Desktop automations with presets in `.statsclaw/state/automation-presets/` |
| Isolation enforcement | Prompt discipline ("never read X") | io-manifest.md evidence recording; default grade audited-soft-isolated |
| State executors | Agent prompts manage state directly | Python helpers as authoritative executors (helpers/*.py) |
| Ship in automations | Allowed with user approval | Never — automations can observe, plan, and report but never push, PR, or release |
| Lock model | Implicit (single-session) | Explicit tri-level locks (repo / run / write-surface) with heartbeat and expiry |
- Artifact-first execution — All decisions and evidence live in versioned `.md` files with unified frontmatter, not conversation history. Runs are resumable and auditable across sessions. This directly addresses session volatility: Codex tasks start with a fresh context, so every piece of state must be persisted to disk.
- Explicit state machine with hard gates — Every transition has preconditions verified by Python helpers. No state can be skipped or bypassed. The Claude Code edition relies on prompt discipline; the Codex edition makes this deterministic through code.
- Testable helper layer — State mutations go through Python scripts (`helpers/`) that can be independently tested with pytest, separating declarative protocols (skill files) from imperative execution.
- Tri-level lock model — Repo, run, and write-surface locks prevent concurrent conflicts, enabling future multi-agent parallel execution. Claude Code's single-session model provides implicit mutual exclusion; Codex's task-based model requires explicit coordination.
- Automation safety — Codex Desktop automations are restricted by design: they can monitor, validate, and report, but never push code or open PRs without human gating. This is stricter than the Claude Code edition.
- Local-first state — No remote workspace repository required. All run state lives in `.statsclaw/`, reducing setup friction and eliminating the GitHub-as-runtime dependency. Remote sync is available on demand.
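As an illustration of the tri-level lock model described above, the sketch below shows how a file-based lock with heartbeat and expiry might work. The file layout, field names, and TTL value are assumptions for illustration; the authoritative executors live in `helpers/` and may differ.

```python
# Hypothetical file-based lock with heartbeat and expiry. One lock file per
# level (repo / run / write-surface); a stale heartbeat makes a lock stealable.
import json
import tempfile
import time
from pathlib import Path

LOCK_TTL_SECONDS = 300  # assumed expiry window, not the framework's actual value

def acquire(lock_dir: Path, level: str, holder: str) -> bool:
    """Try to take the lock at the given level; fail if a live lock exists."""
    lock_file = lock_dir / f"{level}.lock.json"
    if lock_file.exists():
        data = json.loads(lock_file.read_text())
        if time.time() - data["heartbeat"] < LOCK_TTL_SECONDS:
            return False  # live lock held by another task
    lock_file.write_text(json.dumps({"holder": holder, "heartbeat": time.time()}))
    return True

def heartbeat(lock_dir: Path, level: str) -> None:
    """Refresh the heartbeat so a long-running task keeps its lock."""
    lock_file = lock_dir / f"{level}.lock.json"
    data = json.loads(lock_file.read_text())
    data["heartbeat"] = time.time()
    lock_file.write_text(json.dumps(data))

locks = Path(tempfile.mkdtemp())
print(acquire(locks, "run", "builder-task"))  # True: the lock was free
print(acquire(locks, "run", "tester-task"))   # False: live lock already held
```

Because each level is an independent file, a task can hold the run lock while a different task holds an unrelated write-surface lock, which is what makes parallel pipelines coordinate safely.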
Both editions share the same core methodology from the original paper:
- Deep comprehension protocol — mandatory understanding check before any code is generated
- Three-pipeline isolation — builder, tester, and simulator never see each other's specifications
- Adversarial verification — independent convergence across pipelines proves correctness
- HOLD / BLOCK / STOP signal system — structured interrupt handling for human–AI coordination
- Eight specialized roles with defined responsibilities and information access rules
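The HOLD / BLOCK / STOP signal system can be pictured as a severity ladder. In the sketch below only the three signal names come from the framework; the handler responses paraphrase the descriptions in this document, and the function names are invented.

```python
# Hypothetical rendering of the HOLD / BLOCK / STOP interrupt ladder.
# Signal names are the framework's; handler behavior is paraphrased.
from enum import Enum

class Signal(Enum):
    HOLD = 1   # pause: human domain expertise is needed before continuing
    BLOCK = 2  # a hard-gate precondition failed; the run cannot advance
    STOP = 3   # abort the run and preserve all artifacts for audit

def handle(signal: Signal) -> str:
    if signal is Signal.HOLD:
        return "await human input, then resume from the same state"
    if signal is Signal.BLOCK:
        return "record the failed precondition and halt the transition"
    return "freeze the run store and surface the audit trail"

print(handle(Signal.HOLD))
# await human input, then resume from the same state
```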
| Language | Priority | Supported Workflows |
|---|---|---|
| R package | P0 | Full Workflow, Single Fix, Validation, Monte Carlo, Resume, Patrol |
| Python package | P0 | Full Workflow, Single Fix, Validation, Monte Carlo, Resume, Patrol |
| Stata project | P1 | Single Fix, Validation, Resume |
| C / C++ backends | Supporting | R/Python ecosystem backends |
```text
statsclawforcodex/
├── AGENTS.md        # Codex entry point — orchestration policy
├── skills/          # Role execution protocols (orchestrate, builder, tester, ...)
├── helpers/         # Python authoritative executors (status, routing, signals, locks)
├── templates/       # Artifact templates (status.md, review.md, spec.md, ...)
├── profiles/        # Language profiles (r-package, python-package, stata-project, ...)
├── schemas/         # Artifact schema definitions
├── automation/      # Automation contracts and signal handlers
├── examples/        # Benchmark packs and workflow demos
├── docs/            # Framework documentation
├── .statsclaw/      # Runtime state (runs, locks, archive) — populated on first use
└── .brain/          # Local knowledge repository (opt-in)
```
- Credentials first, work second. Verify access before creating a run.
- Orchestrator dispatches, never does. The orchestrator plans and coordinates; roles do the work.
- Multi-pipeline, fully isolated. Code, test, and simulation pipelines never see each other's specs.
- Planner first, always. Every non-trivial request starts with deep comprehension and dual-spec production.
- Adversarial verification by design. Independent convergence proves correctness.
- Hard gates, not soft advice. State transitions have preconditions; artifacts are verified before advancing.
- Artifact-first execution. Decisions live in versioned files, not conversation history.
- Explicit ship actions. Nothing is pushed without user instruction or active patrol skill.
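The "hard gates, not soft advice" principle can be sketched as a precondition check over artifacts on disk: a transition is allowed only when every required artifact exists. This is an illustrative stand-in for the real helpers, which are not reproduced here; the precondition lists below are assumptions based on the artifact names used in this document.

```python
# Hypothetical hard-gate check: advancing to a role requires its input
# artifacts to exist on disk. Precondition lists are illustrative.
import tempfile
from pathlib import Path

PRECONDITIONS = {
    "builder":  ["spec.md"],
    "tester":   ["test-spec.md", "implementation.md"],
    "reviewer": ["audit.md"],
}

def gate(next_role: str, run_dir: Path) -> tuple[bool, list[str]]:
    """Return (ok, missing_artifacts) for advancing to next_role."""
    missing = [name for name in PRECONDITIONS.get(next_role, [])
               if not (run_dir / name).exists()]
    return (not missing, missing)

run = Path(tempfile.mkdtemp())
(run / "spec.md").write_text("estimator specification")
print(gate("builder", run))  # (True, [])
print(gate("tester", run))   # (False, ['test-spec.md', 'implementation.md'])
```

Because the check is computed from files rather than from conversation history, a resumed session reaches exactly the same verdict as the session that wrote the artifacts.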
StatsClaw for Codex is adapted from StatsClaw (https://github.com/statsclaw/statsclaw), a multi-agent architecture for Anthropic Claude Code created by:
- Yiqing Xu (徐轶青), Assistant Professor, Department of Political Science, Stanford University
- Tianzhu Qin (秦天柱), PhD Candidate, Centre for Human-Inspired AI, University of Cambridge
Their paper — StatsClaw: An AI-Collaborative Workflow for Statistical Software Development (Qin and Xu, 2026) — introduces the adversarial verification methodology that this project builds upon: enforcing information barriers between code generation and validation, requiring mandatory deep-comprehension checks before any code is written, and governing workflow progression through a state machine with hard gates at every transition. The paper demonstrates the approach across three real applications — paper-to-feature development (panelView), cross-language translation (interflex R → Python), and sustained multi-day refactoring (fect) — providing strong evidence that structured AI-assisted workflows can absorb engineering overhead while preserving researcher control over substantive methodological decisions.
The Codex edition takes the original Claude Code architecture and redesigns it from the ground up for OpenAI Codex, adapting every component — from agent dispatch and state management to isolation enforcement and automation — to Codex-native primitives. The core methodology is preserved: comprehension → isolation → verification → audit. The execution model is entirely new. See Codex Edition vs Claude Code Edition for details.
We are deeply grateful to Yiqing Xu and Tianzhu Qin for creating StatsClaw, making it open-source, and laying the foundation for AI-collaborative statistical software development. StatsClaw for Codex would not exist without their pioneering work.
If you use StatsClaw for Codex in your research or software development, please cite both the original StatsClaw paper and the Codex adaptation:
Original StatsClaw paper (methodology and design):
Qin, Tianzhu and Yiqing Xu. 2026. "StatsClaw: An AI-Collaborative Workflow for Statistical Software Development."
```bibtex
@misc{qinxu2026statsclaw,
  title={StatsClaw: An AI-Collaborative Workflow for Statistical Software Development},
  author={Qin, Tianzhu and Xu, Yiqing},
  year={2026},
  howpublished={Mimeo, Stanford University},
  url={https://bit.ly/statsclaw}
}
```

StatsClaw for Codex (Codex-native implementation):
Cai, Xuanyu and Wenli Xu. 2026. "StatsClaw for Codex: An AI-Collaborative Workflow for Statistical Software Development (Codex Edition)." GitHub repository, https://github.com/gorgeousfish/statsclaw-for-codex.
```bibtex
@misc{caixu2026statsclawcodex,
  title={StatsClaw for Codex: An AI-Collaborative Workflow for Statistical Software Development (Codex Edition)},
  author={Cai, Xuanyu and Xu, Wenli},
  year={2026},
  howpublished={GitHub},
  url={https://github.com/gorgeousfish/statsclaw-for-codex}
}
```

StatsClaw for Codex is released under the MIT License.
We are building StatsClaw for Codex in the open. Everyone is welcome.
- Report a bug — Open an issue
- Share an idea — Discussions
- Contribute code — Contributing guide
- See what is planned — Roadmap
Roadmap · Contributing · Architecture · Migration Guide
A framework for statisticians and econometricians. Works best with an expert in the loop.
