Datasets

Select a dataset by passing --dataset <name> to python -m agent_cap.agents. The loader is _load_dataset_tasks in agent_cap/runner/unified_runner.py. SWE-Bench Lite and Pro are covered separately in RUN_NO_DOCKER.md §3.

| Name | Source | Evaluator | Extra setup |
| --- | --- | --- | --- |
| `imo-answerbench` | HF `Hwilner/imo-answerbench` | `imo` (math_verify + llm-judge fallback) | none |
| `mcp-atlas` | HF `ScaleAI/mcp-atlas` (filtered to free-tier servers) | `gtfa` (Gemini judge) | MCP server running on port 1984 |
| `tau2-banking` | local clone | `tau2` (built-in scorer) | clone tau2-bench |
| `medagentbench` | local clone | `exact` or `llm-judge` | clone MedAgentBench |
| `financebench` | local clone | `llm-judge` | clone financebench |

All HuggingFace datasets are downloaded to ~/.cache/huggingface/ on first run. Local clones go under third_party/.
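
Before launching a run it can help to confirm that the local clones are in place. The following is a minimal sketch, assuming the clone directories match the git clone targets used in the dataset sections of this document; missing_clones is a hypothetical helper and not part of agent_cap.

```python
from pathlib import Path

# Clone targets copied from the git clone commands in this document.
LOCAL_CLONES = {
    "tau2-banking": "third_party/tau2-bench",
    "medagentbench": "third_party/MedAgentBench",
    "financebench": "third_party/financebench",
}


def missing_clones(repo_root: str = ".") -> list[str]:
    """Return the dataset names whose local clone directory is absent."""
    root = Path(repo_root)
    return [
        name
        for name, rel_path in LOCAL_CLONES.items()
        if not (root / rel_path).is_dir()
    ]


if __name__ == "__main__":
    for name in missing_clones():
        print(f"{name}: clone missing, expected {LOCAL_CLONES[name]}")
```

Run it from the repository root; it prints nothing when all three clones exist.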

imo-answerbench

Math problems with short numerical answers; tools optional.

python -m agent_cap.agents \
  --strategy single \
  --model openai/gpt-oss-120b \
  --base-url http://localhost:30000/v1 \
  --dataset imo-answerbench \
  --evaluator imo \
  --no-tools \
  --num-tasks 20

To enable the LLM-judge fallback used when math_verify fails, set OPENROUTER_API_KEY. To let the agent run Python during reasoning, pass --tool-backend math-python.
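
The "symbolic check first, judge second" pattern can be sketched as below. This is an illustration only, not the imo evaluator's code: grade_answer and both callables are hypothetical, and it assumes "fails" means the symbolic checker either raises or cannot reach a verdict (returns None).

```python
from typing import Callable, Optional


def grade_answer(
    predicted: str,
    reference: str,
    symbolic_check: Callable[[str, str], Optional[bool]],
    judge_check: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Grade with the symbolic checker first, then fall back to a judge."""
    try:
        verdict = symbolic_check(predicted, reference)
    except Exception:
        verdict = None  # treat a checker crash as "could not decide"
    if verdict is not None:
        return verdict
    if judge_check is None:
        return False  # no judge configured: count the answer as wrong
    return judge_check(predicted, reference)


def numeric_or_unknown(predicted: str, reference: str) -> Optional[bool]:
    """Toy stand-in for a symbolic checker: decides only on plain numbers."""
    try:
        return float(predicted) == float(reference)
    except ValueError:
        return None
```

In this sketch the judge is consulted only when the first checker is undecided, so a confident symbolic mismatch is never overridden.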

mcp-atlas

Tool use across 20+ MCP servers. Requires a running MCP server; see mcp-server.md.

# Terminal 1
bash mcp-server/start.sh

# Terminal 2
python -m agent_cap.agents \
  --strategy single \
  --model openai/gpt-oss-120b \
  --base-url http://localhost:30000/v1 \
  --dataset mcp-atlas \
  --evaluator gtfa \
  --tool-backend mcp \
  --mcp-server-url http://localhost:1984 \
  --num-tasks 60

The loader filters the public dataset to tasks whose ENABLED_TOOLS fit a 22-server free subset; the filter is deterministic and does not shuffle. The gtfa evaluator uses Gemini via OpenRouter, so set OPENROUTER_API_KEY.
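
A filter of this kind can be sketched as follows. The task shape (a dict with an ENABLED_TOOLS list of tool names) and the tool_to_server mapping are assumptions made for illustration; the real loader may resolve servers differently.

```python
def filter_free_tier(
    tasks: list[dict],
    tool_to_server: dict[str, str],
    free_servers: set[str],
) -> list[dict]:
    """Keep tasks whose every enabled tool maps to a free server.

    Input order is preserved, so the result is deterministic.
    """
    kept = []
    for task in tasks:
        # Unknown tools map to None, which is never in free_servers.
        servers = {tool_to_server.get(tool) for tool in task["ENABLED_TOOLS"]}
        if servers <= free_servers:
            kept.append(task)
    return kept
```

Because the filter only drops tasks and never reorders them, combining it with --num-tasks N yields the same N tasks on every run.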

tau2-banking

Multi-turn customer-service simulation, scored with the built-in tau2 scorer.

git clone https://github.com/sierra-research/tau2-bench third_party/tau2-bench

python -m agent_cap.agents \
  --strategy single \
  --model openai/gpt-oss-120b \
  --base-url http://localhost:30000/v1 \
  --dataset tau2-banking \
  --evaluator tau2 \
  --num-tasks 25

Reads third_party/tau2-bench/data/tau2/domains/banking_knowledge/tasks/task_*.json. The evaluator uses tau2's own retrieval + reward-basis scoring.
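
Reading a directory of task_*.json files can be sketched like this. It assumes each file holds one JSON document; load_task_files is a hypothetical helper, not the loader's actual code.

```python
import json
from pathlib import Path
from typing import Any


def load_task_files(tasks_dir: str) -> list[Any]:
    """Load every task_*.json file in tasks_dir, sorted by file name."""
    paths = sorted(Path(tasks_dir).glob("task_*.json"))
    return [json.loads(path.read_text(encoding="utf-8")) for path in paths]
```

Sorting gives a stable order across runs, but it is lexicographic: task_10.json sorts before task_2.json unless the numbers are zero-padded.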

medagentbench

EHR / clinical reasoning tasks with structured function calls.

git clone https://github.com/stanfordnlp/MedAgentBench third_party/MedAgentBench

python -m agent_cap.agents \
  --strategy single \
  --model openai/gpt-oss-120b \
  --base-url http://localhost:30000/v1 \
  --dataset medagentbench \
  --evaluator llm-judge \
  --judge model=openai/gpt-4o-mini,api_key=$OPENROUTER_API_KEY,base_url=https://openrouter.ai/api/v1 \
  --num-tasks 30

Reads third_party/MedAgentBench/data/medagentbench/{test_data_v1,funcs_v1}.json.
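
The --judge value in the command above is a comma-separated list of key=value pairs. A minimal sketch of splitting such a string is shown below; parse_judge_spec is hypothetical and the CLI's own parsing may differ, for example in how it treats values that contain commas.

```python
def parse_judge_spec(spec: str) -> dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict; values must not contain commas."""
    result: dict[str, str] = {}
    for part in spec.split(","):
        # partition splits on the first '=' only, so URLs survive intact.
        key, sep, value = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {part!r}")
        result[key.strip()] = value.strip()
    return result
```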

financebench

SEC-filing QA. The agent gets search_document and calculate tools.

git clone https://github.com/patronus-ai/financebench third_party/financebench

python -m agent_cap.agents \
  --strategy single \
  --model openai/gpt-oss-120b \
  --base-url http://localhost:30000/v1 \
  --dataset financebench \
  --evaluator llm-judge \
  --judge model=openai/gpt-4o-mini,api_key=$OPENROUTER_API_KEY,base_url=https://openrouter.ai/api/v1 \
  --num-tasks 50

Reads third_party/financebench/data/financebench_open_source.jsonl.
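
A .jsonl file stores one JSON object per line. A small sketch for inspecting it, assuming nothing about the record fields (iter_jsonl is a hypothetical helper):

```python
import json
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```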

Common flags

| Flag | Default | Notes |
| --- | --- | --- |
| `--num-tasks N` | 0 (all) | Cap at first N tasks (deterministic, no shuffle) |
| `--output-dir DIR` | none | Persist per-task results + `results.jsonl` |
| `--resume` | off | Skip tasks already in `<output-dir>/results.jsonl` |
| `--max-turns N` | 10 | Agent turn cap per task |
| `--strategy` | `single` | Also: `plan-execute`, `supervisor`, `sequential` |
| `--mock` | off | Use `MockLLMClient` for offline smoke tests |

See cli.md for the full flag matrix and strategies.md for when to pick which strategy.
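
How --num-tasks and --resume might interact can be sketched as follows. The task_id field name, the helper select_tasks, and the choice to apply the cap before the resume filter are all assumptions for illustration; the runner's actual behavior may differ.

```python
import json
from pathlib import Path
from typing import Optional


def select_tasks(
    tasks: list[dict],
    num_tasks: int = 0,
    output_dir: Optional[str] = None,
    resume: bool = False,
    id_key: str = "task_id",  # hypothetical field name
) -> list[dict]:
    """Cap to the first num_tasks tasks, then drop already-recorded ones."""
    # 0 means "all"; slicing keeps the original order, so no shuffle.
    selected = tasks if num_tasks <= 0 else tasks[:num_tasks]
    if not (resume and output_dir):
        return list(selected)

    results_path = Path(output_dir) / "results.jsonl"
    done = set()
    if results_path.exists():
        for line in results_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                done.add(json.loads(line)[id_key])
    return [task for task in selected if task[id_key] not in done]
```

Under these assumptions, rerunning with the same --num-tasks and --resume finishes the remainder of the same N tasks rather than picking up new ones.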
