Add imaginary hello-world sample category #694

Draft
Copilot wants to merge 4 commits into main from copilot/add-new-category-code-review

Copilot AI commented Jun 26, 2026

Contributor

Adds a deliberately tiny, imaginary, self-contained hello-world evaluation category as a worked example for demonstrating how to add a new category to BC-Bench. Also refreshes the stale CATEGORIES.md.

What hello-world is

The smallest possible category: the agent writes a tiny AL Greeting codeunit that returns a "Hello, World!" message in a given language. It needs no BC container and no symbols. Scoring is judge-based, so it reuses the existing JudgeBased* result/summary/aggregate classes, as nl2al does. It is intended to demonstrate the plumbing, not to produce real scores (dataset entries use placeholder gold patches).

Changes

  • Registry (types.py): HELLO_WORLD enum + every match arm (dataset path, entry/result/summary/aggregate classes, pipeline, evaluators, core_score, requires_container, runner).
  • Dataset: HelloWorldEntry model + exports; dataset/hello_world.jsonl with two sample tasks. The entries point at the real public microsoft/BCApps repo and a valid base_commit (borrowed from the code-review category) so the setup clone step succeeds and the category runs end to end; the pipeline still rebuilds its own self-contained scaffold, so the cloned repo is only there to satisfy setup.
  • PowerShell (scripts/BCBenchUtils.psm1): registered hello-world (→ hello_world.jsonl) in the Get-BCBenchDatasetPath ValidateSet/switch so the setup scripts accept the category.
  • Pipeline (evaluate/hello_world.py): reset repo → git init → run agent → capture diff as raw judge-based output.
  • Prompt: hello-world-template in config.yaml.
  • Mock pipeline (commands/evaluate.py): handles the new category so the CI mock-evaluation job stays green.
  • Workflows: hello-world added to copilot/claude category choice lists.
  • Tests: conftest fixtures + exhaustiveness/mock wiring, plus test_hello_world_pipeline.py.
  • CATEGORIES.md: corrected the "two categories" claim (there are now several), documented the previously-omitted touch points (the Get-BCBenchDatasetPath PowerShell ValidateSet, aggregate_class, evaluators/core_score, requires_container/runner, mock pipeline, test wiring), and added a runnable hello-world example.

Run it

uv run bcbench run copilot helloworld__greeting-english-1 --category hello-world --repo-path /tmp/hello-world-repo

Copilot AI and others added 2 commits June 25, 2026 19:53
Co-authored-by: haoranpb <27280733+haoranpb@users.noreply.github.com>
Co-authored-by: haoranpb <27280733+haoranpb@users.noreply.github.com>
@haoranpb

Collaborator

@copilot

Workflow run failed due to

Get-BCBenchDatasetPath: /home/runner/work/BC-Bench/BC-Bench/scripts/Setup-ContainerAndRepository.ps1:23
Line |
23 | … [string]$DatasetPath = (Get-BCBenchDatasetPath -Category $Category),
| ~~~~~~~~~
| Cannot validate argument on parameter 'Category'. The argument
| "hello-world" does not belong to the set
| "bug-fix,test-generation,code-review,nl2al" specified by the ValidateSet
| attribute. Supply an argument that is in the set and then try the
| command again.
Error: Process completed with exit code 1.

Co-authored-by: haoranpb <27280733+haoranpb@users.noreply.github.com>

Copilot AI commented Jun 26, 2026

Contributor Author

@copilot

Workflow run failed due to
...

Fixed in 7c7d9ac. Get-BCBenchDatasetPath in scripts/BCBenchUtils.psm1 had its own ValidateSet that I missed — I added hello-world (mapping to hello_world.jsonl) to both the ValidateSet and the switch, and documented this PowerShell touch point in the CATEGORIES.md checklist.
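Because the PowerShell ValidateSet is maintained separately from the Python registry, a small drift check could catch this kind of omission before a workflow run does. The sketch below is an assumption-laden illustration, not code from this PR: the embedded PowerShell snippet is a simplified stand-in for scripts/BCBenchUtils.psm1, the helper names are hypothetical, and the regex assumes double-quoted values in a single ValidateSet attribute.

```python
import re

# Simplified stand-in for the relevant part of BCBenchUtils.psm1 (assumption).
PSM1_SNIPPET = '''
function Get-BCBenchDatasetPath {
    param(
        [ValidateSet("bug-fix", "test-generation", "code-review", "nl2al", "hello-world")]
        [string]$Category
    )
}
'''

# Category values as the Python side would know them.
PYTHON_CATEGORIES = {"bug-fix", "test-generation", "code-review", "nl2al", "hello-world"}


def validate_set_values(powershell_source: str) -> set[str]:
    """Extract the quoted values of the first ValidateSet attribute."""
    found = re.search(r"\[ValidateSet\(([^)]*)\)\]", powershell_source)
    if found is None:
        raise ValueError("no ValidateSet attribute found")
    return set(re.findall(r'"([^"]+)"', found.group(1)))


def missing_from_powershell(python_categories: set[str], powershell_source: str) -> set[str]:
    """Return categories known to Python but absent from the ValidateSet."""
    return python_categories - validate_set_values(powershell_source)


# Simulate the state before the fix, when hello-world was not yet registered.
stale_snippet = PSM1_SNIPPET.replace(', "hello-world"', "")
print(missing_from_powershell(PYTHON_CATEGORIES, stale_snippet))
```

In a real test the PowerShell source would be read from the module file instead of an inline string.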

Copilot AI requested a review from haoranpb June 26, 2026 07:18
@haoranpb

Collaborator

@copilot now the workflow can't run because the commit sha is empty, can you simply take some from code review category?

The goal is that this can run end to end

Co-authored-by: haoranpb <27280733+haoranpb@users.noreply.github.com>

Copilot AI commented Jun 26, 2026

Contributor Author

@copilot now the workflow can't run because the commit sha is empty, can you simply take some from code review category?

The goal is that this can run end to end

Done. Both hello_world.jsonl entries now use microsoft/BCApps with base_commit 70fd0246a0a4dbc72cb183ca719106722c03be4d (borrowed from the code-review category), so the setup clone step succeeds. Since hello-world has requires_container = False, the workflow runs with skip-container, and the pipeline rebuilds a self-contained scaffold — so the clone is just there to satisfy setup and let it run end to end.
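The empty-SHA failure could also be guarded against with a check over the dataset file. The following is a rough sketch under stated assumptions: only `base_commit`, the repo name, the SHA, and the first instance id come from this conversation; the field names `instance_id` and `repo`, the second sample entry, and the helper name are hypothetical.

```python
import json
import re

# A full git SHA-1 is 40 lowercase hexadecimal characters.
SHA1_RE = re.compile(r"[0-9a-f]{40}")

# Two illustrative JSONL lines; the second has an empty base_commit on purpose.
SAMPLE_JSONL = "\n".join(
    [
        json.dumps(
            {
                "instance_id": "helloworld__greeting-english-1",
                "repo": "microsoft/BCApps",
                "base_commit": "70fd0246a0a4dbc72cb183ca719106722c03be4d",
            }
        ),
        json.dumps(
            {
                "instance_id": "helloworld__greeting-empty-sha",  # hypothetical entry
                "repo": "microsoft/BCApps",
                "base_commit": "",
            }
        ),
    ]
)


def entries_with_bad_commit(jsonl_text: str) -> list[str]:
    """Return the ids of entries whose base_commit is not a full SHA-1."""
    bad = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the file
        entry = json.loads(line)
        if not SHA1_RE.fullmatch(entry.get("base_commit", "")):
            bad.append(entry["instance_id"])
    return bad


print(entries_with_bad_commit(SAMPLE_JSONL))
```

This only validates the shape of the SHA; whether the commit exists in the repository is still determined by the setup clone step.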


2 participants