Add benchflow evaluation framework #2139
End-to-end demo that the new framework key threads through every consumer surface.
1. Built artefacts carry the entry
2. Downstream
Hey @xdotli! Thanks for the submission. The bench looks super interesting; measuring how well models can use skills is fundamental right now.
Summary
Adds `benchflow` to the `EVALUATION_FRAMEWORKS` constant in `packages/tasks/src/eval.ts`.
BenchFlow is the evaluation framework backing SkillsBench (arXiv:2602.12670). It runs containerized agent trials with paired with-skills / without-skills configurations, exposing the lift skills give a fixed agent + model as the headline metric.
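For readers unfamiliar with the registry, here is a rough sketch of what a registration like this could look like. The field names (`name`, `description`, `url`) are taken from the PR overview; the exact type in `packages/tasks/src/eval.ts` may differ. The URL is a placeholder, and the `EvaluationFrameworkInfo` interface and `isEvaluationFramework` helper are hypothetical names used only for illustration.

```typescript
// Hypothetical sketch: the real constant in packages/tasks/src/eval.ts may be shaped differently.
interface EvaluationFrameworkInfo {
  name: string;
  description: string;
  url: string;
}

const EVALUATION_FRAMEWORKS = {
  // ...existing entries (e.g. claw-eval, parsebench, exgentic) elided...
  benchflow: {
    name: "benchflow",
    description:
      "Evaluation framework backing SkillsBench; runs containerized agent trials " +
      "with paired with-skills / without-skills configurations.",
    // Placeholder only, NOT the real repository URL.
    url: "https://example.invalid/benchflow",
  },
} as const;

// Compile-time check that every entry carries the expected metadata fields.
const _shapeCheck: Record<string, EvaluationFrameworkInfo> = EVALUATION_FRAMEWORKS;
void _shapeCheck;

type EvaluationFrameworkKey = keyof typeof EVALUATION_FRAMEWORKS;

// Hypothetical helper: checks that a key read from an eval.yaml file is registered.
// Uses hasOwnProperty so inherited keys such as "toString" are rejected.
function isEvaluationFramework(key: string): key is EvaluationFrameworkKey {
  return Object.prototype.hasOwnProperty.call(EVALUATION_FRAMEWORKS, key);
}
```

A consumer that parses `eval.yaml` could call `isEvaluationFramework(value)` before looking up the metadata, so an unregistered key fails early instead of producing an `undefined` lookup.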
This entry follows the same shape as the recent `claw-eval` (#2129), `parsebench` (#2096), and `exgentic` (#2079) additions: 6 lines added and no behaviour or data-flow changes. It just registers the framework key so `eval.yaml` files in benchmark datasets can reference it.
Traction
SkillsBench is an active benchmark with material adoption:
Tested locally
Followup
After this lands, `benchflow/skillsbench` will publish its `eval.yaml` and request allow-list inclusion via `huggingface/hub-docs` (per the registering-a-benchmark doc).
Note
Low Risk
Low risk: this only adds a new entry to the `EVALUATION_FRAMEWORKS` registry with metadata and should not affect execution flow beyond allowing `eval.yaml` to reference the new key.
Overview
Adds `benchflow` to the `EVALUATION_FRAMEWORKS` constant in `packages/tasks/src/eval.ts`, including its name, description, and GitHub URL so benchmark `eval.yaml` files can reference the new framework.
Reviewed by Cursor Bugbot for commit b7523dd. Bugbot is set up for automated code reviews on this repo.