Experiment: composed BCQuality super-skill/sub-skill code review by WaelAbuSeada · Pull Request #661 · microsoft/BC-Bench

WaelAbuSeada · 2026-06-13T05:57:31Z

Adds a new code-review experiment that implements BCQuality's composed super-skill / sub-skill review pattern.

What this adds

- Vendors the BCQuality composed-review framework into src/bcbench/agent/shared/instructions/microsoft-BCApps/skills/al-code-review/ as committed static copies (no runtime fetch):
  - Meta-skill contracts (read.md, do.md)
  - Super-skill al-code-review.md + 5 domain leaf sub-skills (performance, security, privacy, upgrade, style)
  - 123 knowledge articles + AL samples across the 5 domains, plus a generated knowledge-index.json
  - UI domain dropped to match BC-Bench's 5 domains
- Authored Copilot SKILL.md entry point that orchestrates the composition and maps BCQuality findings to BC-Bench's review.json schema (blocker->critical, major->high, minor->medium, info->low; from-sub-skill->domain).
- config.yaml: code-review-template invokes the skill directly via /al-code-review (matching the experiment/code-review-al-skill pattern); skills.enabled: true.
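
The severity and domain mapping described above can be sketched roughly as follows. This is only an illustration, not the logic in SKILL.md: the mapping values and the from-sub-skill and domain names come from the description, while the shape of the input finding, the `message` field, and the function name `to_review_comment` are assumptions made for the example.

```python
# Severity mapping as stated in the PR description.
SEVERITY_MAP = {
    "blocker": "critical",
    "major": "high",
    "minor": "medium",
    "info": "low",
}


def to_review_comment(finding: dict) -> dict:
    """Convert one BCQuality-style finding into a review.json-style comment.

    The input keys "severity" and "message" are hypothetical; only the
    from-sub-skill -> domain mapping is taken from the PR description.
    """
    severity = finding["severity"].lower()
    if severity not in SEVERITY_MAP:
        # Fail loudly rather than silently guessing a severity.
        raise ValueError(f"unknown BCQuality severity: {severity!r}")
    return {
        "severity": SEVERITY_MAP[severity],
        "domain": finding["from-sub-skill"],
        "message": finding.get("message", ""),
    }
```

Raising on an unknown severity is a design choice for the sketch; an actual skill might instead fall back to a default level.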

Notes

- CI's existing setup_agent_skills copies the committed skill folder into repo/.github/skills/, so no BCQuality clone is needed.
- Relevant tests pass (agent-skills, experiment-config, copilot-prompt, dataset-integrity).
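
For readers unfamiliar with the copy step mentioned in the notes, a minimal sketch of copying a committed skill folder into a checkout's .github/skills/ directory might look like this. It is not the actual setup_agent_skills implementation, which is not shown in this PR page; the function name `copy_skill_folder` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def copy_skill_folder(skill_src: Path, repo_root: Path) -> Path:
    """Copy a committed skill folder into <repo_root>/.github/skills/<name>.

    Hypothetical helper: the destination layout follows the PR notes,
    everything else is an assumption made for illustration.
    """
    dest = repo_root / ".github" / "skills" / skill_src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets repeated runs overwrite an earlier copy (Python 3.8+).
    shutil.copytree(skill_src, dest, dirs_exist_ok=True)
    return dest
```

Because the source is a static copy inside the repository, this step needs no network access, which matches the "no runtime fetch" note above.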

…t branch

Snapshots all non-experiment files from experiment/code-review-al-skill onto category/code-review. Experiment-specific assets (al-code-review skill and custom instructions under microsoft-BCApps/instructions/*.md) remain only on the experiment branch.

Highlights:
- Dataset: enriched 28 zero-expected entries (security/privacy/style/upgrade) with in-domain expected_comments; cleaned up OOD bait across pre-existing entries; renumbered performance and privacy entries to be contiguous.
- Eval: domain-aware code-review evaluation, codereview_judge for LLM-confirmed matches, improved review parsing, grouped per-domain summary layout.
- Results: domain-split metrics, leaderboard refresh, severity_mae, macro_f1.
- Tooling: probe_codereview_case/batch harness for local skill testing, apply_enrichment + unindent_bait_files + fix_enrichment_iteration_{1,2} scripts used to produce the dataset enrichment, dump_entries, ood_worklist, run_entry helpers.
- Hooks: Python log_tool_usage hook (Linux-compatible) with process log capture and unit tests.
- Workflow: copilot-evaluation.yml updates for category routing and metrics.
- Lint: ruff/ty cleanups across tools/, tests/, and shared hooks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
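
The commit message above names severity_mae and macro_f1 among the result metrics without defining them. Under the usual textbook definitions they could be computed roughly as below. How BC-Bench actually computes them is not shown here, so the severity ranking, the per-domain (tp, fp, fn) input shape, and the handling of empty inputs are all assumptions.

```python
# Assumed ordinal ranking of review.json severities (not taken from the PR).
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_domain_counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of per-domain F1 scores, one (tp, fp, fn) per domain."""
    scores = [f1(*counts) for counts in per_domain_counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


def severity_mae(pairs: list[tuple[str, str]]) -> float:
    """Mean absolute rank distance over (predicted, expected) severity pairs."""
    if not pairs:
        return 0.0
    total = sum(abs(SEVERITY_RANK[p] - SEVERITY_RANK[e]) for p, e in pairs)
    return total / len(pairs)
```

Macro averaging weights every domain equally regardless of how many entries it has, which is one plausible reason to report it alongside domain-split metrics.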

The /al-code-review skill prompt, custom instructions, and skills are experiment-specific. Revert config.yaml to the defaults (/review prompt, instructions/skills disabled). Experiment branch keeps its own version. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

haoranpb · 2026-06-24T06:04:13Z

Marked experiment PRs as draft

haoranpb and others added 30 commits April 8, 2026 13:57

few more updates for new categories

db388bd

Refactor evaluation and dataset operations for improved workspace setup

57c004e

enable skipping container setup in action

8e2f216

fix missing implementation for MockEvaluationPipeline

69a8db8

Refactor evaluation result classes to be more generic

7549d92

Merge branch 'main' into fix/more-ready-for-categories

f32dd00

Improve readability of GitHub Action summary

a4089b9

fix failing tests

99af6b2

Code Review POC

e1b0b93

Merge branch 'main' into category/code-review

1a68d78

fix merge conflict resolution mistake

3ec10a0

Merge branch 'main' into category/code-review

4e52832

Make container parameters optional in evaluate and run commands

a9f59d9

Merge branch 'category/code-review' of https://github.com/microsoft/B…

065e1aa

…C-Bench into category/code-review

Enhance code review functionality by adding expected review comments …

4ad4bd9

…display and refactoring comment parsing logic

better handling of containers for categories that do not require them

92951c4

Merge branch 'main' into category/code-review

7902610

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

dad9289

…egory/code-review

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

f1c4894

…egory/code-review

prefer copilot.exe executable

aa48a29

Normalize code-review dataset and preserve eval outputs

a244503

Fix code-review branch setup and workflow wiring

9f6c353

Require review.json and add log-based recovery fallback

1a58e44

Harden code-review prompt for Windows copilot.cmd parsing

d0e8076

Expand code-review detailed table metrics

7411246

Update config and container setup action

c7131a4

Remove unused apply_patch import from code-review evaluate

213ce7f

Refactor code-review metrics into pipeline and split comment display …

2e1ced0

…columns

Normalize code-review test-run instance IDs to valid pattern

3ff6876

Use plain code-review IDs (security_001 style) and relax ID pattern

6e68751

haoranpb and others added 27 commits May 29, 2026 11:57

make run step OS independent

0c58e8c

fix score mismatch

b076b98

extract github action related commands

df11718

test should not test runner name

541f6e4

make code review patches proper git diff

c9193e5

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

4408974

…egory/code-review

refactor to separate the logic

859ec99

make more steps OS independent

0feba63

skip leaderboard update and stricter field for codereview result

db12ed4

simplify import/export

820b767

move CodeReviewResultSummary into codereview result file

7848f4b

strongly type CodeReviewResultSummary and reuse metrics util

64f37c0

separate leaderboard from summary and make it generic

d216e42

fix failing tests

49a5cef

Potential fix for pull request finding 'Module imports itself'

35c5045

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

eaa1a2c

…egory/code-review

add CodeReview to mock tests

e00e939

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

13b568c

…egory/code-review

Remove skills and instructions from category branch

d34742b

Restore al-test-generation assets on category branch

9eab86e

Potential fix for pull request finding 'Variable defined multiple times'

f5825f2

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Include domain and suggestedCode in code-review finding schema

e77c93b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add composed BCQuality super-skill/sub-skill code-review experiment

124474f

Remove BCQuality vendoring script; skill assets are committed as stat…

3cd0831

…ic copies

Invoke composed review skill directly via /al-code-review slash command

b91cd02

Base automatically changed from category/code-review to main June 23, 2026 13:15

haoranpb marked this pull request as draft June 24, 2026 06:04

WaelAbuSeada wants to merge 69 commits into main from experiment/code-review-composed-skills

2 participants
