Select a dataset by passing --dataset <name> to python -m agent_cap.agents. The loader
lives in agent_cap/runner/unified_runner.py:_load_dataset_tasks.
SWE-Bench Lite / Pro are covered separately in
RUN_NO_DOCKER.md §3.

| Name | Source | Eval | Extra setup |
|---|---|---|---|
| imo-answerbench | HF Hwilner/imo-answerbench | imo (math_verify + llm-judge fallback) | none |
| mcp-atlas | HF ScaleAI/mcp-atlas (filtered to free-tier servers) | gtfa (Gemini judge) | MCP server running on :1984 |
| tau2-banking | local clone | tau2 (built-in scorer) | clone tau2-bench |
| medagentbench | local clone | exact or llm-judge | clone MedAgentBench |
| financebench | local clone | llm-judge | clone financebench |

All HuggingFace datasets are downloaded to ~/.cache/huggingface/ on
first run. Local clones go under third_party/.
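
Because three of the datasets expect a clone under third_party/, a small pre-flight check can catch a missing clone before a run starts. The sketch below is not part of agent_cap: the helper name missing_clone and the LOCAL_CLONES mapping are hypothetical, with the directory names taken from the clone commands in this document.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping; directory names follow the git clone commands in this document.
LOCAL_CLONES = {
    "tau2-banking": "third_party/tau2-bench",
    "medagentbench": "third_party/MedAgentBench",
    "financebench": "third_party/financebench",
}


def missing_clone(dataset: str, repo_root: Path = Path(".")) -> Optional[Path]:
    """Return the expected clone directory if it is absent, else None."""
    rel = LOCAL_CLONES.get(dataset)
    if rel is None:
        # Not a local-clone dataset (for example one fetched from Hugging Face).
        return None
    path = repo_root / rel
    return None if path.is_dir() else path
```

A caller could print the returned path together with the matching git clone command and exit early instead of failing midway through loading.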

imo-answerbench: math problems with short numerical answers; tools optional.

    python -m agent_cap.agents \
      --strategy single \
      --model openai/gpt-oss-120b \
      --base-url http://localhost:30000/v1 \
      --dataset imo-answerbench \
      --evaluator imo \
      --no-tools \
      --num-tasks 20

For LLM-judge fallback when math_verify fails, set
OPENROUTER_API_KEY. Set --tool-backend math-python to let the agent
run Python during reasoning.
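
The verify-then-judge fallback can be pictured as follows. This is a minimal sketch, not the project's imo evaluator: strict_match is a stand-in that compares exact rationals instead of calling math_verify, the judge is any callable you supply, and the sketch assumes "fails" means the strict check cannot reach a verdict.

```python
from fractions import Fraction
from typing import Callable, Optional


def strict_match(predicted: str, expected: str) -> Optional[bool]:
    """Stand-in for a symbolic checker: compare both answers as exact rationals.

    Returns None when either string cannot be parsed, meaning "undecided".
    """
    try:
        return Fraction(predicted.strip()) == Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        return None


def score(predicted: str, expected: str, judge: Callable[[str, str], bool]) -> bool:
    """Use the strict check first; consult the judge only when it is undecided."""
    verdict = strict_match(predicted, expected)
    if verdict is None:
        return judge(predicted, expected)
    return verdict
```

Keeping the judge as a last resort means answers the strict check can decide never trigger an API call.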

mcp-atlas: tool use across 20+ MCP servers. The MCP server must be running first; see mcp-server.md.

    # Terminal 1
    bash mcp-server/start.sh

    # Terminal 2
    python -m agent_cap.agents \
      --strategy single \
      --model openai/gpt-oss-120b \
      --base-url http://localhost:30000/v1 \
      --dataset mcp-atlas \
      --evaluator gtfa \
      --tool-backend mcp \
      --mcp-server-url http://localhost:1984 \
      --num-tasks 60

The loader filters the public dataset to tasks whose ENABLED_TOOLS
fit a 22-server free subset (deterministic, no shuffle). The gtfa
evaluator uses Gemini via OpenRouter, so set OPENROUTER_API_KEY.
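
The free-tier filter described above amounts to a subset test that preserves dataset order. The sketch below is illustrative only. It assumes ENABLED_TOOLS is already a list of tool names and that each name has the form <server>_<tool>. Neither assumption is confirmed by this document, and the server names are placeholders rather than the real 22-server list.

```python
from typing import Dict, Iterable, List, Set

# Placeholder names; the real free subset has 22 servers.
FREE_SERVERS: Set[str] = {"alpha", "beta"}


def server_of(tool_name: str) -> str:
    # Assumption: tool names look like "<server>_<tool>".
    return tool_name.split("_", 1)[0]


def filter_free_tier(tasks: Iterable[Dict], free_servers: Set[str]) -> List[Dict]:
    """Keep tasks whose tools all belong to free servers, in original order."""
    kept = []
    for task in tasks:
        needed = {server_of(name) for name in task["ENABLED_TOOLS"]}
        if needed <= free_servers:  # subset test
            kept.append(task)
    return kept
```

Because the filter is a plain ordered scan with no shuffling, taking the first N tasks afterwards yields the same tasks on every run.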

tau2-banking: multi-turn customer-service simulation. Built-in tau2 scorer.

    git clone https://github.com/sierra-research/tau2-bench third_party/tau2-bench

    python -m agent_cap.agents \
      --strategy single \
      --model openai/gpt-oss-120b \
      --base-url http://localhost:30000/v1 \
      --dataset tau2-banking \
      --evaluator tau2 \
      --num-tasks 25

Reads third_party/tau2-bench/data/tau2/domains/banking_knowledge/tasks/task_*.json.
The evaluator uses tau2's own retrieval + reward-basis scoring.
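
Reading a directory of task_*.json files in a stable order can be sketched as below. The function name load_task_files is hypothetical, and the actual loader may order or parse the files differently. The cap mirrors the --num-tasks convention in which 0 means all tasks.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_task_files(tasks_dir: Path, num_tasks: int = 0) -> List[Dict]:
    """Load task_*.json files in sorted filename order, optionally capped."""
    # sorted() gives a deterministic lexicographic order, so "task_10" sorts
    # before "task_2" unless the filenames are zero-padded.
    paths = sorted(tasks_dir.glob("task_*.json"))
    if num_tasks > 0:
        paths = paths[:num_tasks]
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]
```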

medagentbench: EHR / clinical reasoning tasks with structured function calls.

    git clone https://github.com/stanfordnlp/MedAgentBench third_party/MedAgentBench

    python -m agent_cap.agents \
      --strategy single \
      --model openai/gpt-oss-120b \
      --base-url http://localhost:30000/v1 \
      --dataset medagentbench \
      --evaluator llm-judge \
      --judge model=openai/gpt-4o-mini,api_key=$OPENROUTER_API_KEY,base_url=https://openrouter.ai/api/v1 \
      --num-tasks 30

Reads third_party/MedAgentBench/data/medagentbench/{test_data_v1,funcs_v1}.json.
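
The --judge value in the command above is a comma-separated list of key=value pairs. How agent_cap parses it is not shown here; the hypothetical parse_judge_spec below is one plausible reading, handy for validating a spec before launching a long run.

```python
from typing import Dict


def parse_judge_spec(spec: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict, rejecting malformed parts."""
    result: Dict[str, str] = {}
    for part in spec.split(","):
        # partition splits at the first "=", so values containing "=" survive.
        key, sep, value = part.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected key=value, got {part!r}")
        result[key] = value.strip()
    return result
```

A value that itself contains a comma would be split incorrectly by this sketch.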

financebench: SEC-filing QA. Agent gets search_document + calculate tools.

    git clone https://github.com/patronus-ai/financebench third_party/financebench

    python -m agent_cap.agents \
      --strategy single \
      --model openai/gpt-oss-120b \
      --base-url http://localhost:30000/v1 \
      --dataset financebench \
      --evaluator llm-judge \
      --judge model=openai/gpt-4o-mini,api_key=$OPENROUTER_API_KEY,base_url=https://openrouter.ai/api/v1 \
      --num-tasks 50

Reads third_party/financebench/data/financebench_open_source.jsonl.

| Flag | Default | Notes |
|---|---|---|
| --num-tasks N | 0 (all) | Cap at first N tasks (deterministic, no shuffle) |
| --output-dir DIR | none | Persist per-task results + results.jsonl |
| --resume | off | Skip tasks already in <output-dir>/results.jsonl |
| --max-turns N | 10 | Agent turn cap per task |
| --strategy | single | Also: plan-execute, supervisor, sequential |
| --mock | off | Use MockLLMClient for offline smoke tests |

See cli.md for the full flag matrix and strategies.md for when to pick which strategy.
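
The --resume behaviour, skipping tasks already recorded in results.jsonl, can be sketched as below. This is an illustration under assumptions, not the runner's code: the record key task_id and both function names are hypothetical, and the real results.jsonl schema may differ.

```python
import json
from typing import Dict, Iterable, List, Set


def completed_ids(lines: Iterable[str], id_key: str = "task_id") -> Set[str]:
    """Collect task ids from JSONL lines, ignoring blank or corrupt lines."""
    done: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # For example, a final line truncated by an interrupted run.
            continue
        if isinstance(record, dict) and id_key in record:
            done.add(str(record[id_key]))
    return done


def pending(tasks: Iterable[Dict], done: Set[str], id_key: str = "task_id") -> List[Dict]:
    """Return tasks not yet recorded as done, preserving their order."""
    return [t for t in tasks if str(t[id_key]) not in done]


# Usage idea: open <output-dir>/results.jsonl and pass the file handle
# straight to completed_ids(), since a text file iterates line by line.
```

Skipping unparseable lines means a task whose record was cut off mid-write is treated as not done and is run again.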