chore: benchmark routing + tool-call validation by saschabuehrle · Pull Request #98 · lemony-ai/cascadeflow

saschabuehrle · 2026-02-06T09:21:31Z

Summary

Unify benchmark model overrides and provider/cost resolution
Enrich benchmark routing metrics and add a real-world tool-call benchmark
Fix full-benchmark dataset handling and baseline cost calculations
Add concurrency to the MMLU benchmark and fix BFCL parallel tool-call evaluation
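
For context on the parallel tool-call item: when a model emits several tool calls in one turn, their order is usually not meaningful, so a positional comparison can report false failures. The snippet below is only a minimal sketch of an order-insensitive comparison. The helper names (`_canonical`, `parallel_calls_match`) and the `{"name": ..., "arguments": ...}` call shape are assumptions for illustration, not the code in this PR.

```python
import json
from collections import Counter


def _canonical(call: dict) -> str:
    """Serialize one tool call to a stable string (hypothetical helper)."""
    return json.dumps(
        {"name": call["name"], "arguments": call.get("arguments", {})},
        sort_keys=True,  # argument key order must not affect equality
    )


def parallel_calls_match(predicted: list[dict], expected: list[dict]) -> bool:
    """Compare two groups of parallel tool calls as multisets, ignoring order."""
    return Counter(map(_canonical, predicted)) == Counter(map(_canonical, expected))
```

Using a `Counter` instead of a `set` keeps duplicate calls significant, so two identical expected calls are not satisfied by a single predicted one.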

Testing

python3 -m pytest
benchmarks: longbench_full, gsm8k_full, mmlu_full, mtbench_full, bfcl_full, ruler_full, truthfulqa, tool_calls(+agentic/realworld), bfcl agentic, basic_usage (py/ts)

saschabuehrle · 2026-03-14T20:03:36Z

Rebased onto main, resolved all merge conflicts.

Changes during rebase:

Skipped the 'fix: improve direct routing cost tracking' commit, because main now has a more complete pricing_resolver implementation that supersedes it
Added cascadeflow_latency_ms metric to benchmark results alongside existing diagnostic fields
Integrated resolve_model_provider/resolve_model_cost utilities with main's benchmark structure
Combined real-world tool-calling benchmark with main's conditional provider comparison logic
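
As a rough illustration of the latency field mentioned above, a per-query wall-clock measurement could be recorded along these lines. This is a sketch under assumptions: only the `cascadeflow_latency_ms` key name comes from this PR, while `run_with_latency` and the result-dict layout are hypothetical.

```python
import time
from typing import Any, Callable


def run_with_latency(fn: Callable[[], Any]) -> dict[str, Any]:
    """Call fn once and return its output with the elapsed time in milliseconds."""
    start = time.perf_counter()  # monotonic clock, suitable for short intervals
    output = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Hypothetical result layout; only the metric key name is taken from the PR.
    return {"output": output, "cascadeflow_latency_ms": elapsed_ms}
```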

Ready for review.

Greetings, saschabuehrle

github-actions bot added the documentation, lang: python, tests, core, and size/xl labels on Feb 6, 2026

saschabuehrle added 9 commits March 14, 2026 21:01

chore: enrich benchmark routing metrics

f0a4492

feat: add real-world tool calls benchmark

2e9ac71

fix: download full HumanEval dataset

05b2bfc

chore: unify benchmark model overrides

825c366

chore: show resolved models in gsm8k full benchmark

56ad907

fix: correct baseline cost estimation in full benchmarks

61ed3bd

chore: add concurrency to mmlu full benchmark

dbb4eb3

fix: handle parallel tool calls in bfcl benchmark

b3bef0e

style(benchmarks): apply black formatting for truthfulqa PR

bac1d7d

saschabuehrle force-pushed the feat/benchmark-truthfulqa branch from 8a96eb9 to bac1d7d on March 14, 2026 20:03

saschabuehrle wants to merge 9 commits into main from feat/benchmark-truthfulqa
