[Leaderboard] LabRat (Claude Sonnet 4.6) — Stratified Pass@1 0.5801 #54

Open
esagduyu wants to merge 1 commit into ucbepic:main from esagduyu:labrat-leaderboard-submission



esagduyu commented Jun 1, 2026

LabRat — Leaderboard Submission

Agent name: LabRat
Project page: github.com/esagduyu/labrat
Backbone LLM: Claude Sonnet 4.6 (claude-sonnet-4-6) via the claude CLI (Anthropic Max plan / OAuth — not metered API; see methodology disclosure below)
Hints: No (used db_description.txt, not db_description_withhint.txt)
Trials: 5 per query (270 trials total across 12 datasets, 54 queries)
Validator version: DAB main at commit 634cd61a (includes the relaxation commits referenced in PR #44)

Result

Stratified Pass@1: 0.5801 (58.01%)

Independent re-score of submission.json against the official validate.py files reproduces the headline number exactly.

Per-dataset

| Dataset | DB stack | Pass@1 |
|---|---|---|
| agnews | Mongo + SQLite | 0.9500 |
| bookreview | Postgres + SQLite | 0.9333 |
| crmarenapro | SQLite × 3 + DuckDB × 2 + Postgres | 0.8154 |
| stockindex | DuckDB + SQLite | 1.0000 |
| stockmarket | DuckDB + SQLite | 0.8000 |
| PANCANCER_ATLAS | Postgres + DuckDB | 0.6667 |
| yelp | Mongo + DuckDB | 0.6286 |
| GITHUB_REPOS | DuckDB + SQLite | 0.5000 |
| googlelocal | Postgres + SQLite | 0.5000 |
| DEPS_DEV_V1 | DuckDB + SQLite | 0.1000 |
| music_brainz_20k | DuckDB + SQLite | 0.0667 |
| PATENTS | Postgres + SQLite | 0.0000 |
| Stratified mean | | 0.5801 |

25 of 54 queries scored 5/5 (perfect).
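
As a quick arithmetic check on the headline number, the sketch below recomputes the stratified score from the per-dataset table. It assumes "stratified" means the unweighted mean of the twelve per-dataset Pass@1 values (each dataset counts equally, regardless of how many queries it has), which is consistent with the table's "Stratified mean" row; the function name is illustrative.

```python
# Per-dataset Pass@1 values copied from the table above.
PASS_AT_1 = {
    "agnews": 0.9500,
    "bookreview": 0.9333,
    "crmarenapro": 0.8154,
    "stockindex": 1.0000,
    "stockmarket": 0.8000,
    "PANCANCER_ATLAS": 0.6667,
    "yelp": 0.6286,
    "GITHUB_REPOS": 0.5000,
    "googlelocal": 0.5000,
    "DEPS_DEV_V1": 0.1000,
    "music_brainz_20k": 0.0667,
    "PATENTS": 0.0000,
}


def stratified_pass_at_1(per_dataset: dict[str, float]) -> float:
    """Unweighted mean of per-dataset Pass@1 (assumed definition)."""
    return sum(per_dataset.values()) / len(per_dataset)


score = stratified_pass_at_1(PASS_AT_1)
print(f"{score:.4f}")  # prints 0.5801
```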

Architecture

LabRat is an open-source terminal-native AI data agent (Python + Textual, AGPL-3.0). Its substrate is composed of four layers, each separately testable:

  1. AgentLoop — provider-agnostic, tool-agnostic round-trip driver with optional max_turns / max_tool_calls caps.
  2. ToolRegistry + multi-DB ToolContext — typed Pydantic tools, dispatched via a registry. ToolContext.connections: dict[str, Connection] lets tools route to any named database.
  3. ModelProvider ABC with three implementations (Anthropic SDK, Claude CLI, OpenAI-compatible).
  4. Connection ABC with seven warehouse adapters (DuckDB, Postgres, Snowflake, BigQuery, Redshift, Trino, MySQL). DuckDB gets a public attach() method that wraps the postgres / sqlite / mysql extensions for cross-DB JOINs.
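
To make the registry and multi-database context idea concrete, here is a minimal sketch. It is not LabRat's code: plain dataclasses and `sqlite3` stand in for the Pydantic tool models and the Connection ABC, and every class and function name below is an assumption chosen for illustration.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolContext:
    # Named connections; sqlite3 is a stand-in for a Connection ABC.
    connections: dict[str, sqlite3.Connection] = field(default_factory=dict)


ToolFn = Callable[[ToolContext, dict], object]


class ToolRegistry:
    """Maps tool names to callables and dispatches calls by name."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolFn] = {}

    def register(self, name: str) -> Callable[[ToolFn], ToolFn]:
        def decorator(fn: ToolFn) -> ToolFn:
            self._tools[name] = fn
            return fn
        return decorator

    def dispatch(self, name: str, ctx: ToolContext, args: dict) -> object:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](ctx, args)


registry = ToolRegistry()


@registry.register("list_tables")
def list_tables(ctx: ToolContext, args: dict) -> list[str]:
    # The "database" argument selects which named connection to use.
    conn = ctx.connections[args["database"]]
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


ctx = ToolContext()
ctx.connections["reviews"] = sqlite3.connect(":memory:")
ctx.connections["reviews"].execute("CREATE TABLE review (id INTEGER, rating REAL)")
tables = registry.dispatch("list_tables", ctx, {"database": "reviews"})
```

Keeping the connection lookup inside the context, rather than inside each tool, is what lets one tool implementation route to any named database.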

For this submission we mounted the LabRat tool registry as an MCP stdio server (labrat.mcp.server) inside claude --print --strict-mcp-config. The agent has access to:

  • Schema discovery: list_tables, describe_table, search_columns, sample_rows, column_stats
  • SQL execution: run_sql (mutation refusal + auto-LIMIT + history), explain_sql
  • Cross-database: attach_database (SQLite / Postgres / MySQL into the primary DuckDB session); load_mongo_collection (materializes a Mongo collection into a DuckDB TEMP table — nested fields become STRUCTs queryable via dot notation)
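
The "mutation refusal + auto-LIMIT" behavior of run_sql could look roughly like the sketch below. This is an assumed, deliberately naive regex version for illustration only; the keyword list, the default limit, and the function name are hypothetical, and a production guard would more likely parse the SQL than pattern-match it.

```python
import re

# Statement prefixes treated as mutating (illustrative, not exhaustive).
MUTATING = re.compile(
    r"^\s*(insert|update|delete|drop|alter|create|truncate|copy)\b",
    re.IGNORECASE,
)
# Detects a trailing LIMIT clause such as "LIMIT 50".
HAS_LIMIT = re.compile(r"\blimit\s+\d+\s*$", re.IGNORECASE)


def guard_sql(sql: str, default_limit: int = 1000) -> str:
    """Refuse mutating statements and append a LIMIT when none is present."""
    if MUTATING.match(sql):
        raise ValueError("refusing mutating statement")
    stripped = sql.strip().rstrip(";").rstrip()
    if not HAS_LIMIT.search(stripped):
        stripped = f"{stripped} LIMIT {default_limit}"
    return stripped


guarded = guard_sql("SELECT title FROM articles")
```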

Per-task setup:

  • env.py parses each dataset's db_config.yaml and builds a DabTaskEnv with primary DuckDB connections plus an attachable: list[AttachSpec] (SQLite + Postgres) and mongo: list[MongoSpec].
  • A per-trial mcp-config.json is generated and passed to claude --print --mcp-config. The agent calls attach_database / load_mongo_collection as needed during its turn budget.
  • System prompt is generic and identical across all 54 queries: it lists the available tools and surfaces the attachable databases / mongo collections for the current trial. No per-query or per-dataset hints, no failure-analysis prompt engineering.
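
For readers unfamiliar with the per-trial config step, the following sketch writes an mcp-config.json in the `mcpServers` layout that Claude Code MCP config files use. The server name `labrat`, the `LABRAT_ENV_SPEC` environment variable, and the directory layout are assumptions for illustration; only the `labrat.mcp.server` module name comes from the description above.

```python
import json
import tempfile
from pathlib import Path


def write_mcp_config(trial_dir: Path, env_spec_path: Path) -> Path:
    """Write a per-trial MCP config that launches one stdio server."""
    config = {
        "mcpServers": {
            "labrat": {  # server name is an assumption
                "command": "python",
                "args": ["-m", "labrat.mcp.server"],
                # Hypothetical variable telling the server which
                # connections this trial is allowed to open.
                "env": {"LABRAT_ENV_SPEC": str(env_spec_path)},
            }
        }
    }
    trial_dir.mkdir(parents=True, exist_ok=True)
    path = trial_dir / "mcp-config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = write_mcp_config(Path(tmp) / "trial-0", Path(tmp) / "env.json")
    loaded = json.loads(cfg_path.read_text())
```

Because the file lists only the connection spec, it can later serve as an audit artifact showing what the MCP server was given, though it says nothing about what other tools the agent process could reach.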

Configuration:

  • Max turns: 200 per trial (sent to claude --max-turns)
  • Per-trial wall-clock timeout: 600s
  • Concurrency: 1 trial at a time (sequential)
  • Validator and ground-truth files are inaccessible to the agent (the scorer runs validate.py(llm_output) after the trial completes; the agent only sees db_description.txt + the question)

Methodology disclosure: Max-plan session-limit retries

The run used Anthropic Max-plan billing via the claude CLI subprocess, not a metered API key. The Max plan has rolling session usage windows (~5 hours each). When a window's budget is exhausted, every subsequent claude --print invocation returns "You've hit your session limit · resets HH:MM (timezone)" as the response text in ~1.5 seconds; the model never actually runs.

LabRat's DAB harness recognizes this pattern (and the analogous timeout pattern) at the run_trial seam, marks affected trials with reason="infra:session_limit" (or infra:timeout), and excludes them from aggregate scoring. The harness's resume logic (scripts/eval_dab.py --output-dir <id>) treats infra-marked trials as not-completed and re-attempts them on the next invocation. We waited for each Max-plan session window to reset, then resumed.
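
A minimal sketch of that classification step, under assumptions: the function name and signature are hypothetical, and the match is on a fragment of the message quoted above rather than on whatever pattern the harness actually uses.

```python
import re
from typing import Optional

# Fragment of the session-limit message quoted above; matching on a
# fragment avoids depending on the apostrophe style in "You've".
SESSION_LIMIT = re.compile(r"hit your session limit", re.IGNORECASE)


def classify_infra(response_text: str, timed_out: bool) -> Optional[str]:
    """Return an infra reason for a trial, or None for a real attempt."""
    if timed_out:
        return "infra:timeout"
    if SESSION_LIMIT.search(response_text):
        return "infra:session_limit"
    return None


reason = classify_infra(
    "You've hit your session limit · resets 14:00 (UTC)", timed_out=False
)
```

A text match alone could misfire if a genuine answer happened to contain the phrase, so pairing it with the very short latency described above would be a reasonable extra check.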

The 270-trial run required five eval_dab.py invocations across ~30 hours of wall-clock time to give every (query, run) pair at least one non-infra attempt. 217 of 270 trials (80%) hit at least one session-limit window before producing a real attempt. Once every pair had a real attempt, the run was complete.

The submission.json contains, for each (query, run) pair, the agent's answer from the latest non-infra attempt. The raw trials.jsonl (preserved at runs/dab/dab-1780210698/trials.jsonl in our repo) contains every attempt including the infra-retry chain, for full transparency.
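
The "latest non-infra attempt per (query, run) pair" rule can be sketched as follows. The field names `query_id`, `run`, and `answer`, and the `"ok"` reason value, are assumptions about the trials.jsonl layout; only `reason` and its `infra:` prefix are described in this post. The sketch also assumes lines appear in chronological order.

```python
import json
from typing import Iterable


def latest_real_attempts(jsonl_lines: Iterable[str]) -> dict:
    """Keep the last non-infra record for each (query_id, run) pair."""
    latest = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        reason = record.get("reason") or ""
        if reason.startswith("infra:"):
            continue  # session-limit or timeout: not a real attempt
        # Later lines overwrite earlier ones for the same pair.
        latest[(record["query_id"], record["run"])] = record
    return latest


demo_lines = [
    '{"query_id": "demo/q1", "run": 0, "reason": "infra:session_limit", "answer": null}',
    '{"query_id": "demo/q1", "run": 0, "reason": "ok", "answer": "42"}',
]
selected = latest_real_attempts(demo_lines)
```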

We believe this matches the spirit of "5 trials per query" — we're measuring agent capability, not the resilience of our billing tier — but we want to be explicit about it. If the leaderboard standard requires a single contiguous run with no infra retries, the right next step is to re-run on a metered Anthropic API key (no session windows, just rate limits) so the same answers come from a single uninterrupted invocation. Please let us know if a metered re-run is required before this can be added to the leaderboard. We're happy to fund that run if so.

Note on execution traces

The DAB built-in agent produces a rich per-run audit artifact set (final_agent.json, llm_calls.jsonl, tool_calls.jsonl). Our traces are summary-only for this submission. Per (query, run) we have:

  • passed / reason from the validator
  • tool_calls (count, derived from claude --print's num_turns - 1)
  • latency_seconds
  • cost_usd (always 0 — Max-plan OAuth, no per-call billing)
  • The final answer (the answer field of submission.json)
  • The per-trial mcp-config.json (showing exactly which connections the agent was given access to — confirms no hint-file injection)

What we do not have per-trial:

  • The intermediate tool-call inputs and outputs (the actual SQL the agent ran, the results it saw)
  • The LLM message history (assistant turns, thinking, tool_use / tool_result blocks)

Why: we invoked claude --print --output-format json (single bundled result), not --output-format stream-json (which would have streamed every message block). Our MCP server also doesn't currently log incoming tool calls server-side — it just dispatches them. So the per-call trajectory happened in memory and wasn't persisted.

What we offer instead for leakage / methodology audit:

  1. Fully open-source code at every layer — MCP server, harness driver, system-prompt builder, tool implementations, env builder. Available on the LabRat repo linked above; the relevant commit pins are in the run config.
  2. Per-trial mcp-config.json — proves only db_description.txt (not the _withhint.txt variant) was reachable to the agent. The MCP server reads its connection spec from this file; ground-truth and validator paths are not present.
  3. Per-trial summary metrics in trials.jsonl: the tool_calls count and latency_seconds give some signal (e.g., a deps_dev_v1 trial with 16 tool calls and 175s clearly ran real queries; a music_brainz trial with 3 tool calls and 13s is the answer-from-context pattern we flag in the failure notes below).
  4. submission.json re-scores independently to 0.5801 against the official validate.py files — the headline number reproduces from the file alone, no harness needed.

If full per-call traces are required for this submission to be considered, we'd need to re-run with --output-format stream-json capture plus server-side MCP tool-call logging. That's a ~1 hour code change plus roughly the same compute cost as the original run. Doing that re-run on a metered API key would also resolve the session-limit retry methodology question. We're happy to do this if the maintainers consider it a blocker.
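
Server-side tool-call logging of the kind mentioned here could be as small as a wrapper around the dispatch function. The sketch below is an assumption about how that might look, not the planned implementation; `with_call_log` and the record fields are hypothetical names.

```python
import io
import json
import time
from typing import Any, Callable, TextIO

Dispatch = Callable[[str, dict], Any]


def with_call_log(dispatch: Dispatch, log: TextIO) -> Dispatch:
    """Wrap a tool dispatcher so each call is appended to a JSONL log."""

    def logged(name: str, args: dict) -> Any:
        started = time.time()
        record = {"tool": name, "input": args, "started_at": started}
        try:
            result = dispatch(name, args)
            record["output"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Written on success and on failure, so errors are audited too.
            record["elapsed_s"] = time.time() - started
            log.write(json.dumps(record, default=str) + "\n")
            log.flush()

    return logged


def fake_dispatch(name: str, args: dict) -> dict:
    # Stand-in for the real registry dispatch.
    return {"rows": [[1]]}


buffer = io.StringIO()
dispatch = with_call_log(fake_dispatch, buffer)
dispatch("run_sql", {"sql": "SELECT 1"})
logged_calls = [json.loads(line) for line in buffer.getvalue().splitlines()]
```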

Notes on specific datasets

  • PATENTS (0%): Consistent with the analysis in PR #44 (Altimate). The agent produces well-formed analysis output across all 15 trials (3–22 tool calls per trial, 127–408s latency), but reaches a different CPC code interpretation than the reference. This appears to be query-interpretation ambiguity (EMA initialization convention + CPC hierarchy level) rather than a tool or harness failure.
  • music_brainz_20k (6.67%): The agent has the LabRat tools (including attach_database for the SQLite secondary), the prompt surfaces the SQLite attachable, and Sonnet still returns the same wrong answers ($601.44 instead of $1059.46; Systemisch bled instead of Zo gaat het leven aan je voor). Low tool-call counts (avg 3.1) suggest the agent often answers from context rather than running the cross-DB JOIN. We have a roadmap item for a force-query prompt rule that should recover several queries here.
  • DEPS_DEV_V1 (10%): This dataset shows visible run-to-run stochasticity. An earlier 17-query / 5-trial run against the same prompts scored 40% on this dataset; the full-DAB run scored 10%. Both runs used the same model, same driver, same code. n=5 is noisy.
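
To put a rough number on "n=5 is noisy", the sketch below computes a 95% Wilson score interval for a pass rate estimated from five trials. The Wilson interval is a standard textbook choice brought in here for illustration; it is not something the submission or the benchmark is stated to use, and the 2-of-5 example is hypothetical.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical query that passes 2 of its 5 trials.
low, high = wilson_interval(2, 5)
print(f"observed 0.40, interval {low:.2f} to {high:.2f}")
```

With so few trials the interval spans most of the unit range, so a swing between 10% and 40% on a small dataset is not by itself evidence that anything changed.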

Reproducibility

# clone + setup (LabRat repo)
git clone https://github.com/esagduyu/labrat
cd labrat
uv sync

# point at a local DAB checkout (commit 634cd61a or later)
export DAB_DIR=~/repos/DataAgentBench

# kick off the run (uses local Postgres + MongoDB per docs/dab_local_setup.md)
uv run python scripts/eval_dab.py --driver claude-mcp --n-trials 5 \
  --datasets agnews,bookreview,crmarenapro,deps_dev_v1,github_repos,googlelocal,music_brainz_20k,pancancer_atlas,patents,stockindex,stockmarket,yelp

# resume any time after a session-limit window expires
uv run python scripts/eval_dab.py --output-dir runs/dab/dab-<id>

Full per-trial JSONL (trials.jsonl), auto-generated report.md, and submission JSON are preserved in the run directory.

Files in this PR

  • submissions/labrat_claude-sonnet-4-6_n5.json — the 270-record submission JSON
  • README leaderboard entry update

Submitted with the intent of full transparency. Happy to re-run on metered API if methodology requires it.

Collaborator

Ruiying-Ma commented Jun 2, 2026

Hi @esagduyu — thank you!

We would greatly appreciate it if you could provide the complete traces for all trials, including intermediate tool-call inputs and outputs, as well as the full LLM interaction history.

The reported accuracy on the agnews dataset is particularly surprising to us, since solving these tasks generally requires the agent to read each news article and perform semantic classification. In our experience, unusually high performance on this dataset is often caused by data leakage—for example, an agent directly retrieving the ground-truth labels from Hugging Face via load_dataset("ag_news") rather than classifying the articles themselves.

Having access to the full traces would help us better understand the agent's behavior and verify the results. Thank you again for your contribution!

Author

esagduyu commented Jun 3, 2026

Hi @Ruiying-Ma, thank you for the careful review, and for flagging agnews specifically. You were right to be suspicious. I pulled the full per-trial traces and double-checked every trial against its validation.

There is leakage, and the root cause is a flaw in my harness, not the benchmark. My claude-mcp runner invoked the agent with --permission-mode bypassPermissions but without restricting the tool set (--allowedTools/--disallowedTools). --strict-mcp-config only constrains MCP configuration, so the agent kept the full Claude Code native toolset (Bash, WebFetch, subagents) alongside my MCP data tools, and my DataAgentBench checkout, including every validate.py and ground_truth.csv, was on the same filesystem and readable. I intended the MCP server to be the only data interface; it was in fact one tool among many.
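
One possible shape for a more restrictive invocation is sketched below. It only builds the argument list; it does not run the CLI. The flag names are the ones mentioned in this thread, but their exact semantics and argument format should be verified against the Claude Code CLI documentation. The server name `labrat`, the `mcp__<server>__<tool>` naming, and the list of native tools to block are assumptions, and the sketch deliberately omits --permission-mode bypassPermissions. CLI flags alone would not remove the benchmark files from the filesystem or block network egress.

```python
from pathlib import Path

MCP_SERVER = "labrat"  # assumed server name from the MCP config
MCP_TOOLS = [
    "list_tables", "describe_table", "search_columns", "sample_rows",
    "column_stats", "run_sql", "explain_sql", "attach_database",
    "load_mongo_collection",
]
# Native tools to deny explicitly (illustrative list, may be incomplete).
NATIVE_TOOLS_TO_BLOCK = [
    "Bash", "Read", "Write", "Edit", "Glob", "Grep",
    "WebFetch", "WebSearch", "Task",
]


def build_claude_argv(prompt: str, mcp_config: Path, max_turns: int = 200) -> list[str]:
    """Build an argv that allows only the MCP data tools."""
    allowed = [f"mcp__{MCP_SERVER}__{tool}" for tool in MCP_TOOLS]
    return [
        "claude", "--print", prompt,
        "--output-format", "json",
        "--max-turns", str(max_turns),
        "--mcp-config", str(mcp_config),
        "--strict-mcp-config",
        "--allowedTools", ",".join(allowed),
        "--disallowedTools", ",".join(NATIVE_TOOLS_TO_BLOCK),
    ]


argv = build_claude_argv("How many rows are in table t?", Path("mcp-config.json"))
```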

The traces show exactly what you suspected:

  • Reading the answer key directly, e.g. cat .../query_agnews/query3/validate.py, with a subagent reporting "The benchmark ground truth from validate.py is GROUND_TRUTH = 336.6363636363636."
  • Loading external labels, e.g. load_dataset("fancyzhx/ag_news") and mapping article_id → label; one trial states "I solved this by mapping article_ids to categories using the HuggingFace AG News labeled dataset."
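
The two access patterns above lend themselves to a simple first-pass scan over tool-call inputs. The sketch below is an illustration of that idea and not the audit procedure actually used; the pattern labels and the function name are hypothetical, and a string scan like this would miss indirect access (for example, a glob that never names the file).

```python
import re

# Patterns drawn from the two examples above.
SUSPICIOUS = {
    "validator_file": re.compile(r"validate\.py"),
    "ground_truth_file": re.compile(r"ground_truth\.csv"),
    "external_labels": re.compile(r"load_dataset\s*\("),
}


def contamination_flags(tool_inputs: list[str]) -> set[str]:
    """Return the labels of every suspicious pattern seen in a trial."""
    flags = set()
    for text in tool_inputs:
        for label, pattern in SUSPICIOUS.items():
            if pattern.search(text):
                flags.add(label)
    return flags


# Hypothetical tool-call inputs from one trial.
flags = contamination_flags([
    "cat /bench/query_demo/query1/validate.py",
    "SELECT count(*) FROM articles",
])
```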

What I checked and what I'm withdrawing. Going trial by trial across all 270 (54 queries × 5), I found 18 that accessed answer-key/validator files or external labels: 16 of 20 agnews trials, plus one isolated bookreview trial and one yelp trial. The other nine datasets show no such access. I'm withdrawing those 18 contaminated trials. Counting them as non-passes and leaving every other trial untouched, the corrected stratified score is:

| Dataset | As submitted | Corrected |
|---|---|---|
| agnews | 95.0% | 15.0% |
| bookreview | 93.3% | 86.7% |
| yelp | 62.9% | 60.0% |
| overall | 58.0% | 50.5% |
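
The corrected overall figure can be cross-checked from the numbers already in this thread. The sketch assumes the overall score is the unweighted mean of the twelve per-dataset Pass@1 values from the original submission table, with the three corrected values substituted in.

```python
# Per-dataset Pass@1 as originally submitted.
SUBMITTED = {
    "agnews": 0.9500, "bookreview": 0.9333, "crmarenapro": 0.8154,
    "stockindex": 1.0000, "stockmarket": 0.8000, "PANCANCER_ATLAS": 0.6667,
    "yelp": 0.6286, "GITHUB_REPOS": 0.5000, "googlelocal": 0.5000,
    "DEPS_DEV_V1": 0.1000, "music_brainz_20k": 0.0667, "PATENTS": 0.0000,
}
# Corrected values from the table above.
CORRECTED_OVERRIDES = {"agnews": 0.150, "bookreview": 0.867, "yelp": 0.600}

corrected = {**SUBMITTED, **CORRECTED_OVERRIDES}
corrected_score = sum(corrected.values()) / len(corrected)
print(f"{corrected_score:.3f}")  # prints 0.505
```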

So I'd ask that the entry be treated as 50.5%, with agnews effectively unscored. I've attached the complete trace bundle for all 270 trials (full LLM message history and every tool-call input/output, local paths scrubbed), plus a manifest.json with per-trial contamination flags and a CONTAMINATION_AUDIT.md write-up. Please do double-check and verify my numbers: this is my first time submitting to a benchmark like this, it's a solo passion project, and I'd genuinely rather you catch anything I missed.

I completely understand that leakage like this is disqualifying for the affected results, and I take it seriously. I'll do a second, fully clean run with the agent properly sandboxed (tools restricted to the MCP server, the benchmark repo off the agent's filesystem, no network egress) so this is impossible by construction, and post the corrected results here. That will take a little time since I'm running everything on subscription-based plans rather than metered API, but it's coming.

Thanks again for catching this. It was exactly the right thing to flag, and I appreciate the patience.

trace_bundle.zip
