[Leaderboard] LabRat (Claude Sonnet 4.6) — Stratified Pass@1 0.5801#54
esagduyu wants to merge 1 commit into
Hi @esagduyu, thank you! We would greatly appreciate it if you could provide the complete traces for all trials, including intermediate tool-call inputs and outputs, as well as the full LLM interaction history. The reported accuracy on the agnews dataset is particularly surprising to us, since solving these tasks generally requires the agent to read each news article and perform semantic classification. In our experience, unusually high performance on this dataset is often caused by data leakage, for example, an agent directly retrieving the ground-truth labels from Hugging Face via [...]. Having access to the full traces would help us better understand the agent's behavior and verify the results. Thank you again for your contribution!
Hi @Ruiying-Ma, thank you for the careful review, and for flagging agnews [...]
There is leakage, and the root cause is a flaw in my harness, not the [...]
The traces show exactly what you suspected: [...]
What I checked and what I'm withdrawing. Going trial by trial across all 270 [...]
So I'd ask that the entry be treated as 50.5%, with agnews effectively [...]
I completely understand that leakage like this is disqualifying for the affected [...]
Thanks again for catching this. It was exactly the right thing to flag, and I [...]
LabRat — Leaderboard Submission
Agent name: LabRat
Project page: github.com/esagduyu/labrat
Backbone LLM: Claude Sonnet 4.6 (claude-sonnet-4-6) via the claude CLI (Anthropic Max plan / OAuth, not metered API; see methodology disclosure below)
Hints: No (used db_description.txt, not db_description_withhint.txt)
Trials: 5 per query (270 trials total across 12 datasets, 54 queries)
Validator version: DAB main at commit 634cd61a (includes the relaxation commits referenced in PR #44)
Result
Stratified Pass@1: 0.5801 (58.01%)
Independent re-score of submission.json against the official validate.py files reproduces the headline number exactly.
Per-dataset
25 of 54 queries scored 5/5 (perfect).
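As an illustration of how figures like these could be aggregated from per-trial records, here is a minimal sketch. It assumes that "stratified" means: take each query's pass rate over its trials, average those rates within each dataset, then average the dataset means. That definition is an assumption and the official DAB scorer may differ. The record keys (dataset, query, passed) and the toy data are hypothetical.

```python
from collections import defaultdict


def _pass_rates_by_query(records):
    """Group trial outcomes by (dataset, query) and return each query's pass rate."""
    outcomes = defaultdict(list)
    for r in records:
        outcomes[(r["dataset"], r["query"])].append(1 if r["passed"] else 0)
    return {key: sum(vals) / len(vals) for key, vals in outcomes.items()}


def stratified_pass_at_1(records):
    """Assumed definition: mean over datasets of the mean per-query pass rate."""
    by_dataset = defaultdict(list)
    for (dataset, _query), rate in _pass_rates_by_query(records).items():
        by_dataset[dataset].append(rate)
    dataset_means = [sum(rates) / len(rates) for rates in by_dataset.values()]
    return sum(dataset_means) / len(dataset_means)


def count_perfect_queries(records):
    """Number of queries whose every recorded trial passed."""
    return sum(1 for rate in _pass_rates_by_query(records).values() if rate == 1.0)


# Toy data (made up): dataset A has one perfect and one failing query,
# dataset B has a single perfect query; two trials per query.
toy_records = [
    {"dataset": "A", "query": "q1", "passed": True},
    {"dataset": "A", "query": "q1", "passed": True},
    {"dataset": "A", "query": "q2", "passed": False},
    {"dataset": "A", "query": "q2", "passed": False},
    {"dataset": "B", "query": "q3", "passed": True},
    {"dataset": "B", "query": "q3", "passed": True},
]

print(stratified_pass_at_1(toy_records), count_perfect_queries(toy_records))
```

Under this assumed definition every dataset carries equal weight regardless of how many queries it has, which is why the toy result differs from the plain fraction of passing trials.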
Architecture
LabRat is an open-source terminal-native AI data agent (Python + Textual, AGPL-3.0). Its substrate is composed of four layers, each separately testable:
- AgentLoop: provider-agnostic, tool-agnostic round-trip driver with optional max_turns / max_tool_calls caps.
- ToolRegistry + multi-DB ToolContext: typed Pydantic tools, dispatched via a registry. ToolContext.connections: dict[str, Connection] lets tools route to any named database.
- ModelProvider ABC with three implementations (Anthropic SDK, Claude CLI, OpenAI-compatible).
- Connection ABC with seven warehouse adapters (DuckDB, Postgres, Snowflake, BigQuery, Redshift, Trino, MySQL). DuckDB gets a public attach() method that wraps the postgres / sqlite / mysql extensions for cross-DB JOINs.
For this submission we mounted the LabRat tool registry as an MCP stdio server (labrat.mcp.server) inside claude --print --strict-mcp-config. The agent has access to:
- list_tables, describe_table, search_columns, sample_rows, column_stats
- run_sql (mutation refusal + auto-LIMIT + history), explain_sql
- attach_database (SQLite / Postgres / MySQL into the primary DuckDB session); load_mongo_collection (materializes a Mongo collection into a DuckDB TEMP table; nested fields become STRUCTs queryable via dot notation)
Per-task setup: env.py parses each dataset's db_config.yaml and builds a DabTaskEnv with primary DuckDB connections plus an attachable: list[AttachSpec] (SQLite + Postgres) and mongo: list[MongoSpec]. mcp-config.json is generated and passed to claude --print --mcp-config. The agent calls attach_database / load_mongo_collection as needed during its turn budget.
Configuration:
- [...] claude --max-turns)
- [...] validate.py(llm_output) after the trial completes; the agent only sees db_description.txt + the question)
Methodology disclosure: Max-plan session-limit retries
The run used Anthropic Max-plan billing via the claude CLI subprocess, not a metered API key. Max plan has rolling session usage windows (~5 hours each). When a window's budget is exhausted, every subsequent claude --print invocation returns "You've hit your session limit · resets HH:MM (timezone)" as the response text in ~1.5 seconds; the model never actually runs.
LabRat's DAB harness recognizes this pattern (and the analogous timeout pattern) at the run_trial seam, marks affected trials with reason="infra:session_limit" (or infra:timeout), and excludes them from aggregate scoring. The harness's resume logic (scripts/eval_dab.py --output-dir <id>) treats infra-marked trials as not-completed and re-attempts them on the next invocation. We waited for each Max-plan session window to reset, then resumed.
The 270-trial run required five eval_dab.py invocations across ~30 hours of wall-clock time to give every (query, run) pair at least one non-infra attempt. 217 of 270 (80%) trials hit at least one session-limit window before producing a real attempt. Once every pair had a real attempt, the run was complete.
The submission.json contains, for each (query, run) pair, the agent's answer from the latest non-infra attempt. The raw trials.jsonl (preserved at runs/dab/dab-1780210698/trials.jsonl in our repo) contains every attempt including the infra-retry chain, for full transparency.
We believe this matches the spirit of "5 trials per query" (we're measuring agent capability, not the resilience of our billing tier), but we want to be explicit about it. If the leaderboard standard requires a single contiguous run with no infra retries, the right next step is to re-run on a metered Anthropic API key (no session windows, just rate limits) so the same answers come from a single uninterrupted invocation. Please let us know if a metered re-run is required before this can be added to the leaderboard. We're happy to fund that run if so.
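The retry bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not LabRat's actual harness code. The reason values infra:session_limit and infra:timeout and the session-limit message come from the description above. The function names, the record keys (query, run, answer), and the prefix match on the message are hypothetical.

```python
import json

# Assumption: matching on the leading phrase is enough to spot the message.
SESSION_LIMIT_PREFIX = "You've hit your session limit"


def classify_response(text, timed_out=False):
    """Return an infra reason for a raw CLI response, or None for a real attempt."""
    if timed_out:
        return "infra:timeout"
    if text.strip().startswith(SESSION_LIMIT_PREFIX):
        return "infra:session_limit"
    return None


def is_infra(trial):
    """True if the trial was marked as an infrastructure failure."""
    return str(trial.get("reason") or "").startswith("infra:")


def latest_real_attempts(jsonl_lines):
    """Map (query, run) to the last non-infra trial, taking lines in file order."""
    latest = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        trial = json.loads(line)
        if is_infra(trial):
            continue  # excluded from scoring; the pair stays pending
        latest[(trial["query"], trial["run"])] = trial
    return latest


def pending_pairs(all_pairs, latest):
    """Pairs that still need a real attempt on the next resume."""
    return [pair for pair in all_pairs if pair not in latest]


# Toy log (made up): one pair recovers after a session-limit hit, one does not.
demo_lines = [
    json.dumps({"query": "q1", "run": 0, "reason": "infra:session_limit"}),
    json.dumps({"query": "q1", "run": 0, "reason": "", "answer": "42"}),
    json.dumps({"query": "q1", "run": 1, "reason": "infra:timeout"}),
]
demo_latest = latest_real_attempts(demo_lines)
demo_pending = pending_pairs([("q1", 0), ("q1", 1)], demo_latest)
print(sorted(demo_latest), demo_pending)
```

In this sketch a run counts as complete once pending_pairs returns an empty list, which mirrors the stopping rule stated above: every (query, run) pair has at least one non-infra attempt.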
Note on execution traces
The DAB built-in agent produces a rich per-run audit artifact set (final_agent.json, llm_calls.jsonl, tool_calls.jsonl). Our traces are summary-only for this submission. Per (query, run) we have:
- passed / reason from the validator
- tool_calls (count, derived from claude --print's num_turns - 1)
- latency_seconds
- cost_usd (always 0: Max-plan OAuth, no per-call billing)
- [...] answer field of submission.json)
- mcp-config.json (showing exactly which connections the agent was given access to; confirms no hint-file injection)
What we do not have per-trial:
[...]
Why: we invoked claude --print --output-format json (single bundled result), not --output-format stream-json (which would have streamed every message block). Our MCP server also doesn't currently log incoming tool calls server-side; it just dispatches them. So the per-call trajectory happened in memory and wasn't persisted.
What we offer instead for leakage / methodology audit:
- mcp-config.json: proves only db_description.txt (not the _withhint.txt variant) was reachable to the agent. The MCP server reads its connection spec from this file; ground-truth and validator paths are not present.
- trials.jsonl: tool_calls count + latency_seconds give some signal (e.g., a deps_dev_v1 trial with 16 tool calls and 175s clearly ran real queries; a music_brainz trial with 3 tool calls and 13s is the answer-from-context pattern we flag in the failure notes below).
- submission.json re-scores independently to 0.5801 against the official validate.py files; the headline number reproduces from the file alone, no harness needed.
If full per-call traces are required for this submission to be considered, we'd need to re-run with --output-format stream-json capture plus server-side MCP tool-call logging. That's a ~1 hour code change and roughly the same compute cost as the original run (which on a metered API key would naturally also resolve the session-limit retry methodology question). We're happy to do this if the maintainers consider it a blocker.
Notes on specific datasets
[...] attach_database for the SQLite secondary), the prompt surfaces the SQLite attachable, and Sonnet still returns the same wrong answers ($601.44 instead of $1059.46; Systemisch bled instead of Zo gaat het leven aan je voor). Low tool-call counts (avg 3.1) suggest the agent often answers from context rather than running the cross-DB JOIN. We have a roadmap item for a force-query prompt rule that should recover several queries here.
Reproducibility
Full per-trial JSONL (trials.jsonl), auto-generated report.md, and submission JSON are preserved in the run directory.
Files in this PR
- submissions/labrat_claude-sonnet-4-6_n5.json: the 270-record submission JSON
Submitted with the intent of full transparency. Happy to re-run on metered API if methodology requires it.