Add imaginary hello-world sample category #694
Co-authored-by: haoranpb <27280733+haoranpb@users.noreply.github.com>
Workflow run failed due to
Fixed in 7c7d9ac.
@copilot now the workflow can't run because the commit SHA is empty. Can you simply take some from the code review category? The goal is that this can run end to end.
Done. Both
Adds a deliberately tiny, imaginary, self-contained `hello-world` evaluation category as a worked example for demonstrating how to add a new category to BC-Bench. Also refreshes the stale `CATEGORIES.md`.

What `hello-world` is

The smallest possible category: the agent writes a tiny AL `Greeting` codeunit returning a "Hello, World!" message in a given language. No BC container, no symbols; judge-based scoring, so it reuses the existing `JudgeBased*` result/summary/aggregate classes like `nl2al` does. Intended for demonstrating the plumbing, not real scores (dataset entries use placeholder gold patches).

Changes
- `types.py`: `HELLO_WORLD` enum + every `match` arm (dataset path, entry/result/summary/aggregate classes, pipeline, evaluators, core_score, requires_container, runner).
- `HelloWorldEntry` model + exports; `dataset/hello_world.jsonl` with two sample tasks. The entries point at the real public `microsoft/BCApps` repo and a valid `base_commit` (borrowed from the code-review category) so the setup clone step succeeds and the category runs end to end; the pipeline still rebuilds its own self-contained scaffold, so the cloned repo is only there to satisfy setup.
- `scripts/BCBenchUtils.psm1`: registered `hello-world` (→ `hello_world.jsonl`) in the `Get-BCBenchDatasetPath` `ValidateSet`/`switch` so the setup scripts accept the category.
- `evaluate/hello_world.py`: reset repo → git init → run agent → capture diff as raw judge-based output.
- `hello-world-template` in `config.yaml`.
- `commands/evaluate.py`: handles the new category so the CI mock-evaluation job stays green.
- `hello-world` added to copilot/claude category choice lists.
- `test_hello_world_pipeline.py`.
- `CATEGORIES.md`: corrected the "two categories" claim (there are now several), documented the previously-omitted touch points (the `Get-BCBenchDatasetPath` PowerShell `ValidateSet`, `aggregate_class`, `evaluators`/`core_score`, `requires_container`/`runner`, mock pipeline, test wiring), and added a runnable `hello-world` example.

Run it