StatsClaw for Codex

Adversarial AI-Collaborative Workflow for Statistical Software Development — Codex Edition

License: MIT · Platform: OpenAI Codex · Version: 0.1.0

StatsClaw for Codex — The Process That Generates Must Never Validate

Adapted from StatsClaw for Claude Code by Yiqing Xu (Stanford) & Tianzhu Qin (Cambridge). Redesigned from the ground up for OpenAI Codex. See Acknowledgments.


Overview

StatsClaw for Codex is a multi-agent workflow framework for OpenAI Codex that helps researchers build, test, and document statistical software packages with AI agent teams. It implements the adversarial verification methodology introduced in:

Qin, Tianzhu and Yiqing Xu. 2026. "StatsClaw: An AI-Collaborative Workflow for Statistical Software Development."

The core principle:

The process that generates code must never be the same process that validates it.

You describe what you need — implement an estimator from a paper, fix a bug, run a Monte Carlo study, translate an R package to Python — and StatsClaw coordinates eight specialized AI roles through an explicit state machine with enforced information barriers:

  • The builder writes code without seeing the test spec
  • The tester validates without seeing the source code
  • The simulator generates data without knowing the algorithm

When all pipelines converge independently, confidence in correctness is high — analogous to independent replication in experimental science.



Quick Start

  1. Clone the repository:
     git clone https://github.com/gorgeousfish/statsclaw-for-codex.git
  2. Copy the statsclawforcodex/ directory into your Codex workspace.
  3. Open Codex and describe what you want in natural language.
  4. The orchestrator reads AGENTS.md, routes your request, and runs the full workflow automatically. For example:

> Build an R package from this paper. Three probit estimation methods in C++
> via Rcpp/Armadillo: MLE, Gibbs sampler, and Metropolis-Hastings. After
> building, run a Monte Carlo simulation comparing all three.

StatsClaw auto-detects the language, selects the workflow type, and proceeds autonomously — raising HOLD signals when your domain expertise is needed.


How It Works

StatsClaw orchestrates eight specialized AI roles, each operating under strict information isolation:

| Role | Purpose |
|------|---------|
| Orchestrator | Coordinates the workflow, dispatches roles, enforces the state machine |
| Planner | Reads your paper/formulas, executes the deep comprehension protocol, produces isolated specifications |
| Builder | Writes source code from spec.md — never sees the test spec |
| Tester | Validates independently from test-spec.md — never sees the code spec or the builder's implementation notes |
| Simulator | Runs Monte Carlo studies from sim-spec.md — never sees either spec |
| Scriber | Documents architecture, generates tutorials, maintains the audit trail |
| Reviewer | Cross-checks all pipelines, audits tolerance integrity, issues ship / no-ship verdict |
| Shipper | Commits, pushes, opens PRs — only after explicit human approval |

An optional Distiller role extracts reusable knowledge to a local .brain/ repository (opt-in).

Information Barriers

The architecture's value lies in what each role cannot see:

| Role | Sees | Never Sees |
|------|------|------------|
| Builder | spec.md | test-spec.md, sim-spec.md |
| Tester | test-spec.md | spec.md, sim-spec.md, source code |
| Simulator | sim-spec.md | spec.md, test-spec.md, source code |

This prevents each role from teaching to the test. A bug that survives must simultaneously satisfy two or three independently derived behavioral contracts — analogous to independent replication in experimental science.
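The barrier check above can be expressed mechanically. The sketch below is illustrative only — the `ALLOWED` map and `audit_reads()` helper are hypothetical names, not part of the framework's actual `helpers/` API; only the spec file names come from the table above:

```python
# Hypothetical information-barrier audit: which spec files may each role read?
ALLOWED = {
    "builder":   {"spec.md"},
    "tester":    {"test-spec.md"},
    "simulator": {"sim-spec.md"},
}
SPECS = {"spec.md", "test-spec.md", "sim-spec.md"}

def audit_reads(role: str, files_read: set[str]) -> list[str]:
    """Return barrier violations: spec files this role read but must never see."""
    forbidden = SPECS - ALLOWED[role]
    return sorted(forbidden & files_read)

# Reading its own spec and source files is fine; reading the test spec is not.
audit_reads("builder", {"spec.md", "src/estimator.cpp", "test-spec.md"})
# -> ["test-spec.md"]
```

A non-empty result means the barrier was breached and the run should be flagged rather than advanced.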

Pipeline Architecture

                      planner (bridge)
                     /    |          \
          spec.md   / test-spec.md    \  sim-spec.md
                   /      |            \
            builder ─ ─(parallel)─ ─ simulator
       (code pipeline)    |    (simulation pipeline)
                   \      |            /
     implementation.md    |   simulation.md
                    \     |          /
                     \    v         /
                       tester           <-- sequential, after merge-back
                    (test pipeline)
                         |
                      audit.md
                         |
                    scriber → [distiller]? → reviewer → shipper

Workflow Types

| Workflow | Role Sequence |
|----------|---------------|
| Code | orchestrator → planner → builder → tester → scriber → reviewer → shipper? |
| Docs-only | orchestrator → planner → scriber → reviewer → shipper? |
| Simulation + Code | orchestrator → planner → [builder ∥ simulator] → tester → scriber → reviewer → shipper? |
| Simulation-only | orchestrator → planner → simulator → tester → scriber → reviewer → shipper? |
| Validation-only | orchestrator → planner → tester → scriber → reviewer |
| Review-only | orchestrator → reviewer → shipper? |

Trigger Examples

| You say | What happens |
|---------|--------------|
| "Build an R package from this paper" | Full Workflow: comprehension → spec → build → test → document → review |
| "Fix the failing test in this repo" | Single Fix: focused spec → build → validate → review → ship |
| "Run a Monte Carlo validation" | Monte Carlo: sim-spec → simulate → validate → review |
| "Resume the previous run" | Resume: restore state from last handoff and continue |
| "Review this before shipping" | Review Only: reviewer audits existing artifacts → verdict |
| "Set up weekly regression checks" | Automation: configure recurring validation patrol |
| "Update the README" | Docs Workflow: spec → scriber implements → review |
| "Just bump the version number" | Simplified: builder → tester → ship (user confirms) |

Short prompts work. Routing is semantic — you never need to learn StatsClaw terminology.
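To make the routing idea concrete, here is a deliberately simplified keyword-based stand-in. The real engine lives in helpers/workflow_router.py and is semantic; the `ROUTES` rules and `route()` function below are toy illustrations, not the actual implementation:

```python
# Toy request router: first matching keyword set wins, else full code workflow.
ROUTES = [
    ({"monte carlo", "simulation"},  "simulation"),
    ({"fix", "failing", "bug"},      "single-fix"),
    ({"readme", "docs", "tutorial"}, "docs-only"),
    ({"review", "audit"},            "review-only"),
]

def route(request: str) -> str:
    """Pick the first workflow whose keywords appear in the request."""
    text = request.lower()
    for keywords, workflow in ROUTES:
        if any(k in text for k in keywords):
            return workflow
    return "code"  # default: full code workflow

route("Run a Monte Carlo validation")  # -> "simulation"
route("Update the README")             # -> "docs-only"
```

A semantic router generalizes this pattern: instead of literal keyword matches, the orchestrator interprets intent, which is why short natural-language prompts are enough.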


Codex Edition vs Claude Code Edition

StatsClaw for Codex is not a drop-in port of the Claude Code version. The original StatsClaw was designed around Claude Code's built-in Agent tool, GitHub workspace repositories for persistent state, and /loop scheduling for recurring tasks. These primitives have no direct equivalents in OpenAI Codex. Rather than emulating Claude Code's execution model, the Codex edition rearchitects every subsystem around Codex-native capabilities while preserving the adversarial verification methodology.

Architectural Differences

| Aspect | Claude Code Edition | Codex Edition |
|--------|---------------------|---------------|
| Entry point | CLAUDE.md | AGENTS.md |
| Agent definitions | agents/*.md (9 agent files) | skills/*.md (five-section skill format with isolation contracts) |
| Orchestration | agents/leader.md (prompt-driven) | skills/orchestrate.md + helpers/workflow_router.py (explicit routing engine) |
| Role dispatch | Claude Code Agent tool (in-session sub-agents) | Codex subagents or serial role execution with fresh context capsules |
| Canonical state | Remote GitHub workspace repository | Local .statsclaw/ run store (no remote dependency) |
| User interaction | Claude chat window | Codex ask_user tool; automations degrade to inbox-style items |
| Recurring tasks | /loop command | Codex Desktop automations with presets in .statsclaw/state/automation-presets/ |
| Isolation enforcement | Prompt discipline ("never read X") | io-manifest.md evidence recording; default grade audited-soft-isolated |
| State executors | Agent prompts manage state directly | Python helpers as authoritative executors (helpers/*.py) |
| Ship in automations | Allowed with user approval | Never — automations can observe, plan, and report but never push, PR, or release |
| Lock model | Implicit (single-session) | Explicit tri-level locks (repo / run / write-surface) with heartbeat and expiry |
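The lock model in the last row can be sketched in a few lines. The `Lock` class, field names, and `acquire()` function below are hypothetical illustrations of the heartbeat-and-expiry idea, not the framework's actual lock executor:

```python
# Sketch of an expiring lock: a silent holder loses the lock after its TTL.
import time
from dataclasses import dataclass, field

@dataclass
class Lock:
    level: str            # "repo", "run", or "write-surface"
    holder: str           # role or task id holding the lock
    ttl: float = 300.0    # seconds before the lock expires without a heartbeat
    last_beat: float = field(default_factory=time.monotonic)

    def heartbeat(self) -> None:
        """Refresh the lock; a live holder calls this periodically."""
        self.last_beat = time.monotonic()

    def expired(self) -> bool:
        """True once the holder has gone silent for longer than ttl."""
        return time.monotonic() - self.last_beat > self.ttl

def acquire(locks: dict[str, Lock], level: str, holder: str) -> bool:
    """Grant the lock if free or expired; re-entrant for the same holder."""
    current = locks.get(level)
    if current is not None and not current.expired():
        return current.holder == holder
    locks[level] = Lock(level=level, holder=holder)
    return True

locks: dict[str, Lock] = {}
acquire(locks, "run", "builder")  # -> True (lock was free)
acquire(locks, "run", "tester")   # -> False (builder still holds it)
```

Expiry is what makes this safe for Codex's task-based model: a crashed task stops heartbeating, and the next task can reclaim the lock instead of deadlocking.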

Design Improvements in the Codex Edition

  • Artifact-first execution — All decisions and evidence live in versioned .md files with unified frontmatter, not conversation history. Runs are resumable and auditable across sessions. This directly addresses session volatility: Codex tasks start with a fresh context, so every piece of state must be persisted to disk.
  • Explicit state machine with hard gates — Every transition has preconditions verified by Python helpers. No state can be skipped or bypassed. The Claude Code edition relies on prompt discipline; the Codex edition makes this deterministic through code.
  • Testable helper layer — State mutations go through Python scripts (helpers/) that can be independently tested with pytest, separating declarative protocols (skill files) from imperative execution.
  • Tri-level lock model — Repo, run, and write-surface locks prevent concurrent conflicts, enabling future multi-agent parallel execution. Claude Code's single-session model provides implicit mutual exclusion; Codex's task-based model requires explicit coordination.
  • Automation safety — Codex Desktop automations are restricted by design: they can monitor, validate, and report, but never push code or open PRs without human gating. This is stricter than the Claude Code edition.
  • Local-first state — No remote workspace repository required. All run state lives in .statsclaw/, reducing setup friction and eliminating GitHub-as-runtime dependency. Remote sync is available on demand.
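A hard gate of the kind described above reduces to a simple artifact check. In this sketch, the `GATES` table and `can_advance()` helper are hypothetical; only spec.md, implementation.md, and audit.md are artifact names taken from the pipeline diagram earlier:

```python
# Sketch of a hard gate: a transition is allowed only when its required
# precondition artifacts exist on disk in the run directory.
from pathlib import Path

GATES = {
    ("planner", "builder"): ["spec.md"],
    ("builder", "tester"):  ["spec.md", "implementation.md"],
    ("tester",  "scriber"): ["audit.md"],
}

def can_advance(run_dir: Path, src: str, dst: str) -> bool:
    """Allow src -> dst only if every precondition artifact is present."""
    required = GATES.get((src, dst))
    if required is None:
        return False  # transition not in the state machine at all
    return all((run_dir / name).is_file() for name in required)
```

Because the check runs in code rather than in a prompt, a role cannot talk its way past a missing artifact — the transition simply does not fire.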

What Stays the Same

Both editions share the same core methodology from the original paper:

  • Deep comprehension protocol — mandatory understanding check before any code is generated
  • Three-pipeline isolation — builder, tester, and simulator never see each other's specifications
  • Adversarial verification — independent convergence across pipelines proves correctness
  • HOLD / BLOCK / STOP signal system — structured interrupt handling for human–AI coordination
  • Eight specialized roles with defined responsibilities and information access rules

Supported Languages

| Language | Priority | Supported Workflows |
|----------|----------|---------------------|
| R package | P0 | Full Workflow, Single Fix, Validation, Monte Carlo, Resume, Patrol |
| Python package | P0 | Full Workflow, Single Fix, Validation, Monte Carlo, Resume, Patrol |
| Stata project | P1 | Single Fix, Validation, Resume |
| C / C++ backends | Supporting | R/Python ecosystem backends |

Directory Structure

statsclawforcodex/
├── AGENTS.md          # Codex entry point — orchestration policy
├── skills/            # Role execution protocols (orchestrate, builder, tester, ...)
├── helpers/           # Python authoritative executors (status, routing, signals, locks)
├── templates/         # Artifact templates (status.md, review.md, spec.md, ...)
├── profiles/          # Language profiles (r-package, python-package, stata-project, ...)
├── schemas/           # Artifact schema definitions
├── automation/        # Automation contracts and signal handlers
├── examples/          # Benchmark packs and workflow demos
├── docs/              # Framework documentation
├── .statsclaw/        # Runtime state (runs, locks, archive) — populated on first use
└── .brain/            # Local knowledge repository (opt-in)

Design Principles

  • Credentials first, work second. Verify access before creating a run.
  • Orchestrator dispatches, never does. The orchestrator plans and coordinates; roles do the work.
  • Multi-pipeline, fully isolated. Code, test, and simulation pipelines never see each other's specs.
  • Planner first, always. Every non-trivial request starts with deep comprehension and dual-spec production.
  • Adversarial verification by design. Independent convergence proves correctness.
  • Hard gates, not soft advice. State transitions have preconditions; artifacts are verified before advancing.
  • Artifact-first execution. Decisions live in versioned files, not conversation history.
  • Explicit ship actions. Nothing is pushed without user instruction or active patrol skill.

Acknowledgments

StatsClaw for Codex is adapted from StatsClaw (https://github.com/statsclaw/statsclaw), a multi-agent architecture for Anthropic Claude Code created by:

  • Yiqing Xu (徐轶青), Assistant Professor, Department of Political Science, Stanford University
  • Tianzhu Qin (秦天柱), PhD Candidate, Centre for Human-Inspired AI, University of Cambridge

Their paper — StatsClaw: An AI-Collaborative Workflow for Statistical Software Development (Qin and Xu, 2026) — introduces the adversarial verification methodology that this project builds upon: enforcing information barriers between code generation and validation, requiring mandatory deep-comprehension checks before any code is written, and governing workflow progression through a state machine with hard gates at every transition. The paper demonstrates the approach across three real applications — paper-to-feature development (panelView), cross-language translation (interflex R → Python), and sustained multi-day refactoring (fect) — providing strong evidence that structured AI-assisted workflows can absorb engineering overhead while preserving researcher control over substantive methodological decisions.

The Codex edition takes the original Claude Code architecture and redesigns it from the ground up for OpenAI Codex, adapting every component — from agent dispatch and state management to isolation enforcement and automation — to Codex-native primitives. The core methodology is preserved: comprehension → isolation → verification → audit. The execution model is entirely new. See Codex Edition vs Claude Code Edition for details.

We are deeply grateful to Yiqing Xu and Tianzhu Qin for creating StatsClaw, making it open-source, and laying the foundation for AI-collaborative statistical software development. StatsClaw for Codex would not exist without their pioneering work.


Citation

If you use StatsClaw for Codex in your research or software development, please cite both the original StatsClaw paper and the Codex adaptation:

Original StatsClaw paper (methodology and design):

Qin, Tianzhu and Yiqing Xu. 2026. "StatsClaw: An AI-Collaborative Workflow for Statistical Software Development."

@misc{qinxu2026statsclaw,
  title={StatsClaw: An AI-Collaborative Workflow for Statistical Software Development},
  author={Qin, Tianzhu and Xu, Yiqing},
  year={2026},
  howpublished={Mimeo, Stanford University},
  url={https://bit.ly/statsclaw}
}

StatsClaw for Codex (Codex-native implementation):

Cai, Xuanyu and Wenli Xu. 2026. "StatsClaw for Codex: An AI-Collaborative Workflow for Statistical Software Development (Codex Edition)." GitHub repository, https://github.com/gorgeousfish/statsclaw-for-codex.

@misc{caixu2026statsclawcodex,
  title={StatsClaw for Codex: An AI-Collaborative Workflow for Statistical Software Development (Codex Edition)},
  author={Cai, Xuanyu and Xu, Wenli},
  year={2026},
  howpublished={GitHub},
  url={https://github.com/gorgeousfish/statsclaw-for-codex}
}

License

StatsClaw for Codex is released under the MIT License.


Get Involved

We are building StatsClaw for Codex in the open. Everyone is welcome.


Roadmap · Contributing · Architecture · Migration Guide

A framework for statisticians and econometricians. Works best with an expert in the loop.
