chore: benchmark routing + tool-call validation #98

Open
saschabuehrle wants to merge 9 commits into main from feat/benchmark-truthfulqa

Conversation

@saschabuehrle
Collaborator

Summary

  • unify benchmark model overrides + provider/cost resolution
  • enrich benchmark routing metrics + add real-world tool-call benchmark
  • fix full benchmark dataset handling and baseline cost calculations
  • add mmlu benchmark concurrency and fix bfcl parallel tool-call evaluation
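
For reviewers unfamiliar with the concurrency change, here is a minimal sketch of how bounded concurrency over benchmark items could look. It is not taken from this PR: `run_bounded`, `fake_mmlu_query`, and the `max_concurrency` parameter are hypothetical names used only for illustration, and the model call is replaced by a short sleep.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency=4):
    """Run worker(item) for every item, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # The semaphore caps how many workers run at the same time.
        async with semaphore:
            return await worker(item)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_mmlu_query(question):
    # Hypothetical stand-in for a real model call.
    await asyncio.sleep(0.01)
    return {"question": question, "correct": len(question) % 2 == 0}


if __name__ == "__main__":
    questions = ["a", "bb", "ccc"]
    results = asyncio.run(run_bounded(questions, fake_mmlu_query, max_concurrency=2))
    accuracy = sum(r["correct"] for r in results) / len(results)
    print(f"accuracy: {accuracy:.2f}")
```

A semaphore keeps the number of in-flight requests fixed regardless of dataset size, which tends to matter for provider rate limits; the actual mechanism used in the branch may differ.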

Testing

  • python3 -m pytest
  • benchmarks: longbench_full, gsm8k_full, mmlu_full, mtbench_full, bfcl_full, ruler_full, truthfulqa, tool_calls(+agentic/realworld), bfcl agentic, basic_usage (py/ts)

@github-actions github-actions bot added the documentation, lang: python, tests, core, and size/xl labels Feb 6, 2026
@saschabuehrle saschabuehrle force-pushed the feat/benchmark-truthfulqa branch from 8a96eb9 to bac1d7d March 14, 2026 20:03
@saschabuehrle
Collaborator Author

Rebased onto main, resolved all merge conflicts.

Changes during rebase:

  • Skipped the 'fix: improve direct routing cost tracking' commit, because main now has a more complete pricing_resolver implementation that supersedes it
  • Added cascadeflow_latency_ms metric to benchmark results alongside existing diagnostic fields
  • Integrated resolve_model_provider/resolve_model_cost utilities with main's benchmark structure
  • Combined real-world tool-calling benchmark with main's conditional provider comparison logic

Ready for review.

Greetings, saschabuehrle


Labels

core, documentation, lang: python, size/xl, tests


1 participant