This guide is the fastest path from clone to a verified local result.
- Rust toolchain pinned by `rust-toolchain.toml`
- Python 3.10+
- Docker (required for SWE-bench Verified and SWE-bench-Live evaluation surfaces)
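
The prerequisites above can be checked before building. The following is a minimal sketch, not part of the repository: the helper names `missing_tools` and `python_ok` are hypothetical, and it assumes the Rust toolchain and Docker are exposed on `PATH` as `cargo` and `docker`.

```python
import shutil
import sys

# Executable names are assumptions; Docker is only needed for the
# SWE-bench Verified and SWE-bench-Live evaluation surfaces.
REQUIRED_TOOLS = ["cargo", "docker"]
MIN_PYTHON = (3, 10)


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version) >= tuple(minimum)


# Report what was found in the current environment.
print("missing tools:", missing_tools(REQUIRED_TOOLS))
print("python ok:", python_ok())
```

An empty list and `True` suggest the environment is ready. This sketch only checks that the tools are present, not that their versions match the pinned toolchain.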
cargo build --workspace
cargo run --bin eval-ladder -- schema validate
cargo run --bin eval-ladder -- demo run --out runs/demo --tasks 2
cargo run --bin eval-ladder -- verify run-dir --run-dir runs/demo/bundles
Expected outcome:
- Demo run completes quickly and writes frozen reproducibility outputs under `runs/demo/`.
- `verify run-dir` reports no invalid bundles.
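
To run the four commands above in sequence and stop at the first failure, a small wrapper can help. This is a sketch under stated assumptions: `run_steps` and `QUICKSTART_COMMANDS` are hypothetical names, and it assumes each command signals failure with a non-zero exit code.

```python
import shlex
import subprocess

# The command strings are copied verbatim from this guide.
QUICKSTART_COMMANDS = [
    "cargo build --workspace",
    "cargo run --bin eval-ladder -- schema validate",
    "cargo run --bin eval-ladder -- demo run --out runs/demo --tasks 2",
    "cargo run --bin eval-ladder -- verify run-dir --run-dir runs/demo/bundles",
]


def run_steps(commands, runner=subprocess.run):
    """Run commands in order; return the first failing command, or None.

    `runner` takes an argv list and returns an object with a
    `returncode` attribute, so it can be replaced in tests.
    """
    for command in commands:
        result = runner(shlex.split(command))
        if result.returncode != 0:
            return command  # Later steps are skipped after a failure.
    return None
```

Calling `run_steps(QUICKSTART_COMMANDS)` from the repository root returns `None` when every step exits cleanly, or the command string that failed.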
just reproduce-paper-tables
Primary export locations:
- `paper/exports/live_panel_v2_postbatch/`
- `paper/exports/l2_verified_flagship_v1/`
- `paper/exports/strict_feasibility_report.json`
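
After reproducing the tables, it can be useful to confirm that the export locations listed above exist. The sketch below is illustrative only: `missing_exports` is a hypothetical helper, and it checks for existence, not for the contents or validity of the exports.

```python
from pathlib import Path

# Export locations as listed in this guide, relative to the repository root.
EXPORT_PATHS = [
    "paper/exports/live_panel_v2_postbatch",
    "paper/exports/l2_verified_flagship_v1",
    "paper/exports/strict_feasibility_report.json",
]


def missing_exports(repo_root=".", paths=EXPORT_PATHS):
    """Return the export paths that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [path for path in paths if not (root / path).exists()]
```

Running `missing_exports()` from the repository root returns an empty list when all three locations are present.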
- Use `docs/cli_reference.md` for command-level workflows.
- Use `docs/evidence_manual.md` for longer batch execution and optimization.
- Use `docs/troubleshooting.md` if setup or runtime checks fail.