A deterministic evaluator for repository-level coding-agent benchmarks.
eval-ladder evaluates existing candidate patches; it does not generate patches.
It is built to make benchmark claims auditable, reproducible, and explicitly evaluator-conditioned.
This repository accompanies the research paper Eval-Ladder: Evaluator-Conditioned Measurement for Repository-Level Coding-Agent Benchmarks and ships frozen evidence exports for independent verification.
Headline empirical surfaces
- Live v2 static-vs-live diagnostic:
runs/released/live_panel_v2/results_opt/paper/exports/live_panel_v2_postbatch/
- L2 flagship diagnostic:
runs/released/l2_verified_flagship_v1/results/paper/exports/l2_verified_flagship_v1/
The regression stress-control arm is a negative-control protocol surface. Its reversals demonstrate that evaluator-induced score changes are surfaced and labeled. They must not be interpreted as natural product regressions.
Evidence-frontier surfaces
- Verified strict comparison (inventory bound):
paper/exports/strict_feasibility_report.json
- Rust proof subset:
runs/released/rust_proof_subset_v1/results_seal/
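For example, the strict feasibility report above is plain JSON and can be pretty-printed for a quick look (python -m json.tool is part of the Python standard library):

```bash
# Pretty-print the first part of the strict feasibility report
python -m json.tool paper/exports/strict_feasibility_report.json | head -n 40
```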
Minimal reproduction
```bash
cargo build --workspace
cargo run --bin eval-ladder -- schema validate
cargo run --bin eval-ladder -- demo run --out runs/demo --tasks 2
```

Claim discipline: docs/paper_claim_sources.json (machine-readable map; YAML mirror alongside), paper/exports/CLAIM_SOURCE_MAP.md when present, docs/scientific_scope.md, and ci/scripts/check_paper_claim_sources.py.
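The exact layout of docs/paper_claim_sources.json is defined by the file itself and checked by ci/scripts/check_paper_claim_sources.py; as a purely hypothetical sketch of the kind of claim-to-source mapping it encodes (keys and field names below are illustrative, not the real schema):

```json
{
  "claims": [
    {
      "id": "live_v2_headline",
      "statement": "Illustrative claim text, not taken from the paper.",
      "sources": [
        "runs/released/live_panel_v2/results_opt/",
        "paper/exports/live_panel_v2_postbatch/"
      ]
    }
  ]
}
```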
Engineering closure: paper/exports/release/final_validation_matrix.md (gate log) and paper/exports/release/MANUSCRIPT_READY_SIGNOFF.md (manuscript-ready sign-off). Confirm that release-tag.yml runs are green via gh run list --workflow=release-tag.yml or the GitHub Actions UI.
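For instance, using the GitHub CLI (an authenticated gh session against this repository is assumed; --limit is optional):

```bash
# Show the most recent runs of the release-tag workflow
gh run list --workflow=release-tag.yml --limit 5
```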
Out of scope:
- It does not generate coding-agent patches.
- It does not replace SWE-bench.
- It does not estimate population-level bug rates from the L2 diagnostic slice.
- It does not prove full semantic correctness of candidate patches.
- It does not use synthetic L4 counterexamples as headline empirical evidence.
Benchmark pass rates can change when evaluator assumptions change. eval-ladder
makes those assumptions explicit through a levelled evaluation model and
evidence-first outputs.
| Level | Name | What it asks |
|---|---|---|
| L0 | Official | Does the benchmark's native scorer mark success? |
| L1 | Trusted rerun | Does success survive deterministic replay? |
| L2 | Strengthened | Does success hold under stronger validators? |
| L3 | Policy-conformant | Was success achieved through an allowed process? |
| L4 | Semantic | Does the patch satisfy a machine-checkable obligation? |
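As a purely illustrative sketch (the field names below are hypothetical and not the repository's persisted schema; see schemas/ for the real artifact definitions), a single task's evaluator-conditioned outcome could be summarized as:

```json
{
  "task_id": "example-task-0001",
  "levels": {
    "L0": "pass",
    "L1": "pass",
    "L2": "fail",
    "L3": "skipped",
    "L4": "skipped"
  }
}
```

Records of this kind are the point of the ladder: the same patch can pass under the native scorer and a trusted rerun yet fail once stronger validators are applied.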
More detail: docs/evaluation_ladder.md
Prerequisites:
- Rust toolchain pinned by rust-toolchain.toml
- Python 3.10+
- Docker (for SWE-bench Verified / SWE-bench-Live runs)
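A quick local sanity check of those prerequisites (assuming rustup manages the pinned toolchain):

```bash
rustup show active-toolchain   # should match rust-toolchain.toml
python3 --version              # 3.10 or newer
docker --version               # only needed for Verified / Live runs
```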
```bash
# Build
cargo build --workspace

# Inspect CLI
cargo run --bin eval-ladder -- --help

# Validate schemas
cargo run --bin eval-ladder -- schema validate

# Run reproducibility demo (fast, local, no upstream data)
cargo run --bin eval-ladder -- demo run --out runs/demo --tasks 2
```

Common task wrappers:
```bash
just ci-tier1
just ci-tier2
```

Ingest upstream benchmark data via the per-benchmark manifests:

```bash
eval-ladder ingest verified --manifest configs/evaluator/verified.toml
eval-ladder ingest live --manifest configs/evaluator/live.toml
eval-ladder ingest rust --manifest configs/evaluator/rust.toml
```
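The manifests referenced above are TOML files under configs/evaluator/, and the checked-in files are authoritative. A purely hypothetical sketch of the shape such a manifest might take:

```toml
# Hypothetical keys for illustration only -- see configs/evaluator/verified.toml
# for the real manifest format.
benchmark = "swe-bench-verified"
dataset_dir = "datasets/verified"   # local dataset location (illustrative)
docker_required = true              # Verified / Live ingestion uses Docker
```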
Example using the frozen Live v2 panel (headline comparative surface in the paper):

```bash
eval-ladder evaluate batch \
  --input runs/released/live_panel_v2/panel.jsonl \
  --config configs/evaluator/default.toml \
  --levels L0,L1 \
  --resume \
  --jobs 2 \
  --out runs/released/live_panel_v2/results_opt/
```

Verified-style and Rust panels use different --input paths (for example
runs/released/agent_panel_v3_r1/); see docs/evidence_manual.md.
Export paper-facing tables from a run directory, then check the run directory with the verify subcommand:

```bash
eval-ladder analyze paper-export \
  --run-dir runs/released/live_panel_v2/results_opt \
  --out-dir paper/exports/live_panel_v2_postbatch

eval-ladder verify run-dir --run-dir runs/released/live_panel_v2/results_opt
```

Smaller frozen panels (for example runs/released/agent_panel_v1/) remain in-tree for
regression tests and long-form examples in docs/evidence_manual.md. For Verified
flagship, batch optimization, and Rust proof recipes, follow Milestone H there rather
than assuming agent_panel_v1 paths.
```
packages/rust/    evaluator core, runner, policy, analysis, CLI
packages/python/  benchmark compatibility scripts and pipelines
packages/lean/    L4 semantic obligations
benchmarks/       benchmark adapters + manifests
configs/          evaluator, policy, and strengthening configs
schemas/          JSON schemas for persisted artifacts
datasets/         source links + derived proof subset metadata
runs/             released and local run artifacts
docs/             architecture, runbook, scope, and evidence docs
paper/            paper-facing exports and tables
ci/               CI scripts and release hygiene tooling
tests/            Rust + Python + integration tests
```
Release-track outputs live under runs/released/ and are accompanied by
paper-facing exports in paper/exports/.
Start here:
- runs/released/agent_panel_v3_r1/
- runs/released/l2_verified_flagship_v1/
- runs/released/live_panel_v2/
- runs/released/rust_proof_subset_v1/
- Docs index: docs/readme.md
- Claim source map: paper/exports/CLAIM_SOURCE_MAP.md
- Getting started: docs/getting_started.md
- Evidence manual (protocols + operations): docs/evidence_manual.md
- Artifact model: docs/architecture.md (bundles, manifests, analysis)
- Public terminology: docs/public_terminology.md
- Submission/release checklist: docs/submission_checklist.md
Main workflows in .github/workflows/:
- ci-tier1-fast.yml
- ci-tier2-medium.yml
- ci-tier3-heavy.yml
- release-tag.yml
Tag releases with semantic version tags (v*.*.*) to trigger release checks.
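For example (the version number is illustrative; an upstream remote named origin is assumed):

```bash
git tag v1.2.3           # use the actual release version
git push origin v1.2.3   # pushing the tag triggers release-tag.yml
```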
- Contribution guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Code of conduct: CODE_OF_CONDUCT.md
Dual-licensed under Apache-2.0 or MIT:
Additional attribution: