Reusable recipes for evaluating a locally-served (vLLM) model on two agentic coding benchmarks:
| Benchmark | What | Metric | Folder |
|---|---|---|---|
| SWE-bench Verified | 500 real GitHub issues, fix-and-test | % Resolved | swebench/ |
| ClawEval | 300 agentic tasks (general / multimodal / multi-turn) | Pass^3 | claweval/ |
Both drive the model through an OpenAI-compatible endpoint, so anything you can
serve with vLLM (or vLLM-compatible) works. Each folder has a README.md (runbook),
a SKILL.md (a Claude Code skill capturing the full playbook + gotchas), and ready
-to-run scripts / config templates.
agent-eval/
├── swebench/
│ ├── README.md # runbook
│ ├── SKILL.md # Claude Code skill (full playbook)
│ ├── scripts/
│ │ ├── serve_vllm.sh # start the model server (tool-calling enabled)
│ │ ├── run_stage_a.sh # inference -> preds.json (resumable)
│ │ ├── run_stage_b.sh # scoring -> report.json (% Resolved)
│ │ ├── prepull_images.py # warm the per-task image cache
│ │ └── triage_failures.py # split infra vs model failures
│ └── patches/
│ └── abort_on_dead_container.md # mini-swe-agent zombie-container fix
└── claweval/
├── README.md # runbook + how to wire a custom judge gateway
├── SKILL.md # Claude Code skill
├── config_vllm.example.yaml # text-task config template
├── config_vllm_mm.example.yaml # multimodal config template
└── providers/
└── amd_anthropic.py # example judge provider (Anthropic-gateway)
# SWE-bench Verified
MODEL_PATH=/path/to/Model TP=8 TOOL_PARSER=kimi_k2 swebench/scripts/serve_vllm.sh # serve
MODEL_ID=/path/to/Model swebench/scripts/run_stage_a.sh runs/r1 # inference
swebench/scripts/run_stage_b.sh runs/r1 r1_eval # score
# ClawEval (after cloning claw-eval + building its sandbox image)
claw-eval batch --config config_vllm.yaml --sandbox --trials 3 --parallel 8- Tool-calling must be enabled on the server (
--enable-auto-tool-choice --tool-call-parser <X>) or the agent's calls 400. - Local models aren't in litellm's price map →
MSWEA_COST_TRACKING=ignore_errors. - Long runs spawn zombie containers when a task container self-destructs; the
docker.pypatch turns that into a clean abort. Raisecontainer_timeout(10h). - Run one inference process per output dir — two writing the same
preds.jsonsilently corrupts results. - Self-hosted throughput varies → prefer avg@k over a single run.
Full, battle-tested detail (every failure mode + fix) lives in each SKILL.md.