Pluggable agent benchmarks for BIRD-Interact mini (300 SQLite tasks). Evaluates query generation through the SLayer semantic layer and through raw SQL, across multiple agent frameworks.
See ROADMAP.md for the multi-framework plan.
- Clone BIRD-Interact as a sibling of this checkout:

      # cd into the parent directory of where you cloned bird-interact-agents
      git clone https://github.com/bird-bench/BIRD-Interact.git

  The `original` and `all` extras (used by `scripts/run_three_way.sh`) wire up the upstream `mini-interact-agent` package via a relative path (`../BIRD-Interact/mini_interact/knowledge_based/mini_interact_agent`), so the directory layout matters. If you work from a `git worktree` directory the relative path won't resolve; either symlink `BIRD-Interact` next to the worktree or `uv pip install -e <absolute-path>/mini_interact/knowledge_based/mini_interact_agent` into the worktree's venv after `uv sync`.
- Get the mini-interact dataset (SQLite DBs + metadata) from HuggingFace or use a local copy.
- Set environment variables:

      export BIRD_BIRD_INTERACT_ROOT=/path/to/BIRD-Interact
      export ANTHROPIC_API_KEY=sk-ant-...
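
Because the extras depend on the sibling layout described above, a small preflight check can catch a misplaced clone or a worktree before `uv sync` fails. The following is only a sketch: `check_layout` is a hypothetical helper, not part of this repository, and it assumes the sibling directory is named `BIRD-Interact` as in the clone command.

```python
from pathlib import Path

# Relative location of the upstream package inside BIRD-Interact,
# copied from the path quoted in the setup notes.
UPSTREAM_SUBPATH = Path("mini_interact/knowledge_based/mini_interact_agent")


def check_layout(checkout: Path, env: dict) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks fine.

    Hypothetical helper for illustration only.
    """
    problems = []
    # The extras resolve ../BIRD-Interact relative to the checkout directory.
    sibling = checkout.resolve().parent / "BIRD-Interact"
    if not (sibling / UPSTREAM_SUBPATH).is_dir():
        problems.append(f"missing upstream package at {sibling / UPSTREAM_SUBPATH}")
    root = env.get("BIRD_BIRD_INTERACT_ROOT")
    if not root:
        problems.append("BIRD_BIRD_INTERACT_ROOT is not set")
    elif not Path(root).is_dir():
        problems.append(f"BIRD_BIRD_INTERACT_ROOT is not a directory: {root}")
    return problems
```

Calling `check_layout(Path.cwd(), dict(os.environ))` from the checkout root (after `import os`) returns an empty list when both the sibling clone and the environment variable are in place.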
pip install -e ".[claude-sdk,dev]"

# Validate eval pipeline (submits ground-truth SQL, no LLM)
bird-interact --mode oracle \
--data /path/to/mini_interact.jsonl \
--db-path /path/to/mini-interact/
# Run with Claude Agent SDK, raw SQL mode
bird-interact --framework claude_sdk --query-mode raw --mode a-interact \
--data /path/to/mini_interact.jsonl \
--db-path /path/to/mini-interact/ \
--limit 10 --concurrency 3
# Run with SLayer mode (requires SLayer models to be ingested)
bird-interact --framework claude_sdk --query-mode slayer --mode a-interact \
--data /path/to/mini_interact.jsonl \
--db-path /path/to/mini-interact/

`scripts/run_three_way.sh` runs the upstream BIRD-Interact harness, our raw-SQL flavour, and our SLayer flavour on the same `instance_id` slice and emits a side-by-side `comparison.json`.
Prerequisites:
export BIRD_BIRD_INTERACT_ROOT=/path/to/BIRD-Interact
export BIRD_DATA_PATH=/path/to/mini_interact.jsonl
export BIRD_DB_PATH=/path/to/mini-interact
export ANTHROPIC_API_KEY=sk-ant-...
uv sync --extra all --extra dev  # brings in the upstream harness via tool.uv.sources

Run:
bash scripts/run_three_way.sh --mode a-interact --limit 30 --concurrency 4

The script defaults to `--framework pydantic_ai` because `claude_sdk` cannot run from inside an active Claude Code session (stdio collision with the spawned `claude` subprocess). `--parallel` runs the three versions concurrently. The output directory contains:
- `original/results.jsonl`, `raw/eval.json`, `slayer/eval.json`: raw per-version outputs
- `comparison.json`: `{summary: {<version>: {n, phase1_rate, phase2_rate, avg_reward, errors}}, per_task: {<id>: {<version>: ...}}}` for direct row-by-row comparison
- A Markdown table is also printed to stdout
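
For post-processing, `comparison.json` can be read with a few lines of Python. The sketch below relies only on the `summary` and `per_task` keys and the five summary fields listed above; the helper names are hypothetical, and whether a version is ever absent from a `per_task` entry is an assumption made here for illustration.

```python
import json
from pathlib import Path

# Summary fields named in the comparison.json description.
FIELDS = ("n", "phase1_rate", "phase2_rate", "avg_reward", "errors")


def load_comparison(path: str) -> dict:
    """Read comparison.json from the run's output directory."""
    return json.loads(Path(path).read_text())


def summary_rows(comparison: dict) -> list[tuple]:
    """Flatten `summary` into (version, n, phase1_rate, ...) rows, sorted by version."""
    rows = []
    for version, stats in sorted(comparison["summary"].items()):
        rows.append((version, *(stats.get(field) for field in FIELDS)))
    return rows


def tasks_missing_a_version(comparison: dict) -> list[str]:
    """List task ids whose `per_task` entry lacks a version present in `summary`.

    Assumption: a version may be absent for a task; the script may never do this.
    """
    versions = set(comparison["summary"])
    return sorted(
        task_id
        for task_id, by_version in comparison["per_task"].items()
        if versions - set(by_version)
    )
```

Passing the result of `load_comparison("<output-dir>/comparison.json")` to `summary_rows` gives one tuple per version, which is convenient for feeding into a spreadsheet or a plotting script.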
Pass any LiteLLM-style provider/model string via --agent-model (and optionally --user-sim-model). LiteLLM auto-resolves the base URL and reads the matching API-key env var:
--agent-model cerebras/zai-glm-4.7 # GLM-4.7 on Cerebras (preview, fast tool calling)
--agent-model anthropic/claude-sonnet-4-5 # Default; required for claude_sdk framework
--agent-model openrouter/z-ai/glm-4.7-flash # GLM-4.7 Flash via OpenRouter
--agent-model fireworks_ai/glm-4p7 # GLM-4.7 on Fireworks
--agent-model cerebras/llama3.1-8b # Llama 3.1 8B on Cerebras

Set the corresponding env var: `CEREBRAS_API_KEY`, `ANTHROPIC_API_KEY`, `OPENROUTER_API_KEY`, `FIREWORKS_API_KEY`, `ZHIPU_API_KEY`.
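
A missing key usually surfaces only once the first request fails, so it can help to check the environment before launching a long run. This is a minimal sketch, not LiteLLM's own lookup logic: the mapping covers only the provider prefixes shown in the examples above, and `required_env_var` and `missing_keys` are hypothetical names.

```python
import os

# Provider prefix -> API-key variable, for the providers in the examples above.
# ZHIPU_API_KEY is left out because the text does not show which prefix it pairs with.
PROVIDER_ENV = {
    "cerebras": "CEREBRAS_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "fireworks_ai": "FIREWORKS_API_KEY",
}


def required_env_var(model: str):
    """Return the env var for a provider/model string, or None for an unknown prefix."""
    provider = model.split("/", 1)[0]
    return PROVIDER_ENV.get(provider)


def missing_keys(models, env=None):
    """Return the env vars that the given model strings need but `env` lacks."""
    env = os.environ if env is None else env
    missing = []
    for model in models:
        var = required_env_var(model)
        if var and not env.get(var) and var not in missing:
            missing.append(var)
    return missing
```

Checking both the agent model and the user-sim model in one call, for example `missing_keys([agent_model, user_sim_model])`, covers runs that mix providers.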
Caveats:
- `claude_sdk` is locked to Anthropic by SDK design: passing a non-Anthropic `--agent-model` causes that single framework to skip with a warning (other frameworks run normally).
- `mcp_agent` ships only Anthropic + OpenAI augmented LLMs, so non-Anthropic models route through OpenAI-compatible endpoints (configured via `_build_settings`).
- The user-sim model defaults to `anthropic/claude-haiku-4-5-20251001`. Swap with `--user-sim-model cerebras/llama3.1-8b` for fully non-Anthropic runs.
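
The first caveat amounts to a simple filter over the requested frameworks. The sketch below illustrates that behaviour under the assumption that "Anthropic model" means a model string with the `anthropic/` prefix; `frameworks_to_run` is a hypothetical function and not the repository's actual implementation.

```python
import warnings


def is_anthropic_model(model: str) -> bool:
    """Assumption: Anthropic models are identified by the `anthropic/` prefix."""
    return model.split("/", 1)[0] == "anthropic"


def frameworks_to_run(frameworks, agent_model):
    """Drop claude_sdk (with a warning) when the agent model is not an Anthropic one.

    All other frameworks are kept unchanged, in their original order.
    """
    selected = []
    for name in frameworks:
        if name == "claude_sdk" and not is_anthropic_model(agent_model):
            warnings.warn(f"skipping {name}: {agent_model} is not an Anthropic model")
            continue
        selected.append(name)
    return selected
```

With `["claude_sdk", "pydantic_ai"]` and `cerebras/llama3.1-8b`, only `pydantic_ai` remains and a single warning is emitted.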
- `raw`: Agent gets direct DB tools (`execute_sql`, `get_schema`, `get_column_meaning`, etc.) and writes SQL.
- `slayer`: Agent uses SLayer MCP tools (`models_summary`, `inspect_model`, `query`). It doesn't know about SQL/SQLite.
| Framework | Status | Install extra |
|---|---|---|
| Claude Agent SDK | Active | claude-sdk |
| PydanticAI | Planned | — |
| smolagents | Planned | — |
| Agno | Planned | — |
| mcp-agent | Planned | — |
MIT