agent-eval

Reusable recipes for evaluating a locally-served (vLLM) model on two agentic coding benchmarks:

Benchmark	What	Metric	Folder
SWE-bench Verified	500 real GitHub issues, fix-and-test	% Resolved	`swebench/`
ClawEval	300 agentic tasks (general / multimodal / multi-turn)	Pass^3	`claweval/`

Both drive the model through an OpenAI-compatible endpoint, so anything you can serve with vLLM (or vLLM-compatible) works. Each folder has a README.md (runbook), a SKILL.md (a Claude Code skill capturing the full playbook + gotchas), and ready -to-run scripts / config templates.

Layout

agent-eval/
├── swebench/
│   ├── README.md                     # runbook
│   ├── SKILL.md                      # Claude Code skill (full playbook)
│   ├── scripts/
│   │   ├── serve_vllm.sh             # start the model server (tool-calling enabled)
│   │   ├── run_stage_a.sh            # inference  -> preds.json   (resumable)
│   │   ├── run_stage_b.sh            # scoring    -> report.json  (% Resolved)
│   │   ├── prepull_images.py         # warm the per-task image cache
│   │   └── triage_failures.py        # split infra vs model failures
│   └── patches/
│       └── abort_on_dead_container.md  # mini-swe-agent zombie-container fix
└── claweval/
    ├── README.md                     # runbook + how to wire a custom judge gateway
    ├── SKILL.md                      # Claude Code skill
    ├── config_vllm.example.yaml      # text-task config template
    ├── config_vllm_mm.example.yaml   # multimodal config template
    └── providers/
        └── amd_anthropic.py          # example judge provider (Anthropic-gateway)

TL;DR

# SWE-bench Verified
MODEL_PATH=/path/to/Model TP=8 TOOL_PARSER=kimi_k2 swebench/scripts/serve_vllm.sh   # serve
MODEL_ID=/path/to/Model swebench/scripts/run_stage_a.sh runs/r1                      # inference
swebench/scripts/run_stage_b.sh runs/r1 r1_eval                                      # score

# ClawEval (after cloning claw-eval + building its sandbox image)
claw-eval batch --config config_vllm.yaml --sandbox --trials 3 --parallel 8

Key lessons baked into the scripts

Tool-calling must be enabled on the server (--enable-auto-tool-choice --tool-call-parser <X>) or the agent's calls 400.
Local models aren't in litellm's price map → MSWEA_COST_TRACKING=ignore_errors.
Long runs spawn zombie containers when a task container self-destructs; the docker.py patch turns that into a clean abort. Raise container_timeout (10h).
Run one inference process per output dir — two writing the same preds.json silently corrupts results.
Self-hosted throughput varies → prefer avg@k over a single run.

Full, battle-tested detail (every failure mode + fix) lives in each SKILL.md.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
claweval		claweval
swebench		swebench
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

agent-eval

Layout

TL;DR

Key lessons baked into the scripts

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

agent-eval

Layout

TL;DR

Key lessons baked into the scripts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages