[Code-review]: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm by gggdttt · Pull Request #696 · microsoft/BC-Bench

gggdttt · 2026-06-26T07:23:25Z

Goal

Enable a fair, faithful before/after comparison for the code-review category so we can measure whether consuming microsoft/BCQuality actually improves AL code review over what BCApps shipped previously:

| Arm | Config toggle | Role |
| --- | --- | --- |
| vanilla | all false | anchor (no BC review knowledge) |
| old inline | `instructions.enabled: true` | "before": BCApps pre-#8700 inline knowledge |
| live BCQuality | `bcquality.enabled: true` | "after": clone + filter + `entry.md` routing |

Does not touch bug-fix / test-generation behavior. All arms default off (vanilla baseline).
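
As a rough illustration of how the two toggles select an arm, here is a minimal sketch. `ArmToggles` and `resolve_arm` are hypothetical names, not part of this PR, and treating the two toggles as mutually exclusive is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmToggles:
    """Hypothetical stand-in for the two config toggles in the table above."""

    instructions_enabled: bool = False  # mirrors instructions.enabled
    bcquality_enabled: bool = False  # mirrors bcquality.enabled


def resolve_arm(toggles: ArmToggles) -> str:
    """Map the toggles to an arm name; everything off means vanilla."""
    if toggles.instructions_enabled and toggles.bcquality_enabled:
        # Assumption: an experiment run measures exactly one arm at a time.
        raise ValueError("enable at most one of instructions / bcquality")
    if toggles.bcquality_enabled:
        return "live BCQuality"
    if toggles.instructions_enabled:
        return "old inline"
    return "vanilla"
```

With no toggles set, `resolve_arm(ArmToggles())` returns `"vanilla"`, which matches the default-off behaviour described above.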

What's in this PR

1. Live BCQuality consumption (code-review only)

- `config.yaml`: new `bcquality:` section (default `enabled: false`). Pins repo + ref (40-char SHA), enabled-layers, disabled-skills, knowledge allow/deny globs, and task-context dimensions.
- `evaluate/codereview_bcquality.py`: pure, type-safe building blocks that faithfully replicate BCApps' current flow: `clone_bcquality` (pinned SHA, shallow), `filter_clone` (mirrors `Invoke-BCQualityFilter.ps1`, writes `_filter-report.json`), a task-context writer, and a bootstrap prompt that routes the agent through `skills/entry.md` and emits the BC-Bench `review.json` schema.
- `copilot/agent.py`: live branch. The filtered BCQuality clone becomes the Copilot CWD (so knowledge is read before the diff), the repo under review is granted via `--add-dir`, static injection is skipped, and hooks are installed into the clone.
- `types.py`: `ExperimentConfiguration.bcquality` flag.
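
The knowledge allow/deny globs can be pictured with the sketch below. It is not the PR's `filter_clone`: the function names, the example paths, and the "deny wins over allow" precedence are all assumptions, and the real semantics are whatever `Invoke-BCQualityFilter.ps1` does.

```python
from fnmatch import fnmatchcase


def is_kept(rel_path: str, allow: list[str], deny: list[str]) -> bool:
    """Keep a file if it matches any allow glob and no deny glob.

    Assumption: deny takes precedence over allow. Note that fnmatch-style
    `*` also matches `/`, so `knowledge/*` covers nested folders.
    """
    allowed = any(fnmatchcase(rel_path, pattern) for pattern in allow)
    denied = any(fnmatchcase(rel_path, pattern) for pattern in deny)
    return allowed and not denied


def filter_paths(paths: list[str], allow: list[str], deny: list[str]) -> dict:
    """Split POSIX-style relative paths into kept/removed, like a filter report."""
    kept = sorted(p for p in paths if is_kept(p, allow, deny))
    removed = sorted(p for p in paths if not is_kept(p, allow, deny))
    return {"kept": kept, "removed": removed}


# Hypothetical file layout, for illustration only.
report = filter_paths(
    ["knowledge/security/secrets.md", "knowledge/style/naming.md", "notes/todo.md"],
    allow=["knowledge/*"],
    deny=["knowledge/style/*"],
)
```

Here `report["kept"]` contains only `knowledge/security/secrets.md`; the style file is denied and `notes/todo.md` never matched an allow glob.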

2. Faithful "old inline" baseline arm

- Extracts the 6 domain checklists (accessibility / performance / privacy / security / style / upgrade) verbatim from BCApps `30e2b18ca3^`, the version BCApps actually shipped before adopting BCQuality (deleted in #8700). Line counts match the original exactly.
- Deliberately does not use the `experiment/code-review-al-skill` snapshot, which was tuned on the benchmark (privacy heavily rewritten, 3 others nudged) and would contaminate the comparison.
- `AGENTS.md`: review section routing `/review` through the 6 domain checklists.

3. Tests: 23 unit tests for the BCQuality module (config parsing, glob matching, clone filtering, task-context, bootstrap prompt, shipped-config alignment).

Security

The BCQuality clone becomes the agent's working directory, so it is treated as a prompt-injection surface: ref must be a pinned, reviewed 40-char SHA and repo must be a trusted HTTPS source (validated in BCQualityConfig.validate()).
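
A minimal sketch of the kind of check described above, assuming a lowercase 40-hex-character SHA and a small allowlist of HTTPS hosts. `BCQualitySource`, `TRUSTED_HOSTS`, and the example URL are illustrative; the actual `BCQualityConfig.validate()` may differ.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

_FULL_SHA = re.compile(r"[0-9a-f]{40}")
# Assumption: "trusted" means an explicit host allowlist.
TRUSTED_HOSTS = frozenset({"github.com"})


@dataclass(frozen=True)
class BCQualitySource:
    """Hypothetical holder for the pinned repo + ref pair."""

    repo: str
    ref: str

    def validate(self) -> None:
        # Branch names, tags, and abbreviated SHAs are rejected: they can
        # move or collide, so the reviewed content would not be pinned.
        if not _FULL_SHA.fullmatch(self.ref):
            raise ValueError(f"ref must be a 40-char commit SHA, got {self.ref!r}")
        parsed = urlparse(self.repo)
        # parsed.hostname ignores any userinfo before '@', so a URL such as
        # https://github.com@evil.example/... is judged by its real host.
        if parsed.scheme != "https" or parsed.hostname not in TRUSTED_HOSTS:
            raise ValueError(f"repo must be HTTPS on a trusted host, got {self.repo!r}")
```

Validating before any clone happens keeps an unreviewed ref or untrusted remote from ever reaching the agent's working directory.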

Verified locally

ruff format + ruff check + ty check clean.
Full suite: 571 passed, 1 skipped (no regressions).
Old-inline arm smoke test (synthetic__security-001): Copilot read the injected .github/instructions/security.md and produced a valid review.json (critical hardcoded key + high plain-text secret), both grounded in the checklist.

How to run online

This PR is meant to drive the code-review evaluation workflow once per arm. Flip the relevant config toggle (or workflow input) for vanilla / old-inline / live-BCQuality, run with repeat=5, then compare micro/macro F1.
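
For readers comparing the two numbers, the sketch below shows the standard difference between micro and macro F1. The `Counts` type and the choice of averaging per task are assumptions; how BC-Bench matches findings against expected issues and which unit it averages over is not shown in this PR description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """True positives, false positives, false negatives for one task."""

    tp: int
    fp: int
    fn: int


def f1(c: Counts) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def micro_f1(per_task: list[Counts]) -> float:
    """Pool counts across tasks first, then compute a single F1."""
    pooled = Counts(
        tp=sum(c.tp for c in per_task),
        fp=sum(c.fp for c in per_task),
        fn=sum(c.fn for c in per_task),
    )
    return f1(pooled)


def macro_f1(per_task: list[Counts]) -> float:
    """Compute F1 per task, then take the unweighted mean."""
    return sum(f1(c) for c in per_task) / len(per_task) if per_task else 0.0


# Made-up counts: one task fully solved, one fully missed.
example = [Counts(tp=2, fp=0, fn=0), Counts(tp=0, fp=1, fn=1)]
```

On the made-up `example`, micro F1 is 4/6 (about 0.667) while macro F1 is 0.5, because macro weights the failed task as heavily as the larger successful one.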

Draft — opening so we can run the online evaluation and review the before/after numbers.

Adds a bcquality config section (default disabled) and a Python module that clones BCQuality at a pinned SHA, filters it per enabled-layers/knowledge globs, builds task-context, and a skills/entry.md bootstrap prompt -- replicating how microsoft/BCApps consumes microsoft/BCQuality today. Not yet wired into the agent; no effect on existing categories.

…line arm
- Extract the 6 faithful domain checklists (accessibility/performance/privacy/security/style/upgrade) verbatim from BCApps 30e2b18ca3^ (the version BCApps shipped before adopting BCQuality), NOT the benchmark-tuned experiment snapshot
- AGENTS.md: add review section routing /review through the 6 domain checklists
- Enables a faithful before/after comparison: vanilla < old inline < live BCQuality
- Inert by default (instructions.enabled=false); arm activated via config toggle

haoranpb

Good work on the SHA-pinning. I'd like to push on the design here a bit, though:

My concern: we are making BCQuality a first-class experimentation flag, when it is really just a lot of `*.md` files, and we already have the mechanisms to handle those with instructions and skills in config.yaml.

My first instinct is to pull BCQuality at a pinned SHA and just distribute those md files into our existing infrastructure (folder structure), then let the existing config handle it. The most manual way to think of it: just copy-paste those md files into BC-Bench under the instructions folder.

What do you think? Also, try to make it generic; folks will probably want to experiment with it in other categories as well.

wenjiefan added 13 commits June 25, 2026 14:36

code-review: wire live BCQuality path into copilot agent + tests

15c3feb

- ExperimentConfiguration: add bcquality flag
- copilot agent: live BCQuality branch (clone CWD, --add-dir repo, skip static injection)
- add 23 unit tests for codereview_bcquality module

code-review: markdown formatting in BCApps AGENTS.md (blank line befo…

058a5e1

…re list)

Merge main into code-review-live-bcquality (sync leaderboard + main u…

2fa1ccd

…pdates)

code-review docs: add Experiment Leaderboard table (vanilla / inline …

e8713f7

…/ BCQuality arms)

code-review docs: add Agent column, drop Vanilla reference from Exper…

b19f7e5

…iment Leaderboard

Fix pre-commit whitespace in instruction files; rename F1 column to M…

fd275cd

…icro F1

code-review: address self-review (reuse review.json constant, determi…

7da69c5

…nistic severity mapping, relocate bcquality module to agent/shared)

code-review: reuse review.json constant + deterministic BCQuality sev…

5bf1745

…erity mapping

code-review: cache BCQuality clone per-SHA (clone once, copy+filter p…

edd6dbd

…er entry); surface git stderr on failure

code-review: drop BCQuality clone cache (clone is cheap); keep git st…

b07213b

…derr surfacing

code-review: externalize BCQuality bootstrap prompt to config.yaml Ji…

72f7c51

…nja2 template

gggdttt mentioned this pull request Jun 26, 2026

Provision Copilot CLI for code-review judge in Claude workflow #701

Merged

code-review: add super-skill execution-discipline / progress markers …

0dc121c

…to BCQuality bootstrap prompt

gggdttt marked this pull request as ready for review June 26, 2026 13:16

Merge branch 'main' into private/wenjiefan/code-review-live-bcquality

5a3b5a7

gggdttt changed the title ~~code-review: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm~~ [Code-review]: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm Jun 26, 2026

haoranpb requested changes Jun 26, 2026

code-review: make BCQuality task-context goal/inputs-available config…

4c6c104

…-driven

gggdttt marked this pull request as draft June 26, 2026 21:33

gggdttt mentioned this pull request Jun 26, 2026

docs(code-review): add experiment leaderboard table #705

Open

gggdttt wants to merge 16 commits into main from private/wenjiefan/code-review-live-bcquality

2 participants
