Skip to content

wangjiaxin99/agent-eval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

agent-eval

Reusable recipes for evaluating a locally-served (vLLM) model on two agentic coding benchmarks:

Benchmark What Metric Folder
SWE-bench Verified 500 real GitHub issues, fix-and-test % Resolved swebench/
ClawEval 300 agentic tasks (general / multimodal / multi-turn) Pass^3 claweval/

Both drive the model through an OpenAI-compatible endpoint, so anything you can serve with vLLM (or vLLM-compatible) works. Each folder has a README.md (runbook), a SKILL.md (a Claude Code skill capturing the full playbook + gotchas), and ready -to-run scripts / config templates.

Layout

agent-eval/
├── swebench/
│   ├── README.md                     # runbook
│   ├── SKILL.md                      # Claude Code skill (full playbook)
│   ├── scripts/
│   │   ├── serve_vllm.sh             # start the model server (tool-calling enabled)
│   │   ├── run_stage_a.sh            # inference  -> preds.json   (resumable)
│   │   ├── run_stage_b.sh            # scoring    -> report.json  (% Resolved)
│   │   ├── prepull_images.py         # warm the per-task image cache
│   │   └── triage_failures.py        # split infra vs model failures
│   └── patches/
│       └── abort_on_dead_container.md  # mini-swe-agent zombie-container fix
└── claweval/
    ├── README.md                     # runbook + how to wire a custom judge gateway
    ├── SKILL.md                      # Claude Code skill
    ├── config_vllm.example.yaml      # text-task config template
    ├── config_vllm_mm.example.yaml   # multimodal config template
    └── providers/
        └── amd_anthropic.py          # example judge provider (Anthropic-gateway)

TL;DR

# SWE-bench Verified
MODEL_PATH=/path/to/Model TP=8 TOOL_PARSER=kimi_k2 swebench/scripts/serve_vllm.sh   # serve
MODEL_ID=/path/to/Model swebench/scripts/run_stage_a.sh runs/r1                      # inference
swebench/scripts/run_stage_b.sh runs/r1 r1_eval                                      # score

# ClawEval (after cloning claw-eval + building its sandbox image)
claw-eval batch --config config_vllm.yaml --sandbox --trials 3 --parallel 8

Key lessons baked into the scripts

  • Tool-calling must be enabled on the server (--enable-auto-tool-choice --tool-call-parser <X>) or the agent's calls 400.
  • Local models aren't in litellm's price mapMSWEA_COST_TRACKING=ignore_errors.
  • Long runs spawn zombie containers when a task container self-destructs; the docker.py patch turns that into a clean abort. Raise container_timeout (10h).
  • Run one inference process per output dir — two writing the same preds.json silently corrupts results.
  • Self-hosted throughput varies → prefer avg@k over a single run.

Full, battle-tested detail (every failure mode + fix) lives in each SKILL.md.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors