[Code-review]: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm#696
Draft
gggdttt wants to merge 16 commits into
added 13 commits
June 25, 2026 14:36
Adds a bcquality config section (default disabled) and a Python module that clones BCQuality at a pinned SHA, filters it per enabled-layers/knowledge globs, builds task-context, and a skills/entry.md bootstrap prompt -- replicating how microsoft/BCApps consumes microsoft/BCQuality today. Not yet wired into the agent; no effect on existing categories.
- ExperimentConfiguration: add bcquality flag
- copilot agent: live BCQuality branch (clone CWD, --add-dir repo, skip static injection)
- add 23 unit tests for codereview_bcquality module
…line arm
- Extract the 6 faithful domain checklists (accessibility/performance/privacy/security/style/upgrade) verbatim from BCApps 30e2b18ca3^ (the version BCApps shipped before adopting BCQuality), NOT the benchmark-tuned experiment snapshot
- AGENTS.md: add review section routing /review through the 6 domain checklists
- Enables a faithful before/after comparison: vanilla < old inline < live BCQuality
- Inert by default (instructions.enabled=false); arm activated via config toggle
…/ BCQuality arms)
…iment Leaderboard
…nistic severity mapping, relocate bcquality module to agent/shared)
…er entry); surface git stderr on failure
…to BCQuality bootstrap prompt
haoranpb
requested changes
Jun 26, 2026
haoranpb
left a comment
Collaborator
Good work on the SHA-pinning. I'd like to push on the design here a bit, though:
My concern: we are making BCQuality a first-class experimentation flag when it is really just a lot of *.md files, and we already have mechanisms for that: instructions and skills in config.yaml.
My first instinct is to pull BCQuality at a pinned SHA and distribute those md files into our existing infrastructure (folder structure), then let the existing config handle it. The most manual way to think of it: copy-paste those md files into BC-Bench under the instructions folder.
What do you think? Also, try to make it generic; folks will probably want to experiment with it in other categories as well.
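To make the suggestion above concrete, here is a minimal sketch of the "copy the md files into the instructions folder" idea. It assumes BCQuality has already been checked out at the pinned SHA; `distribute_markdown`, `clone_dir`, and `instructions_dir` are hypothetical names for illustration, not existing BC-Bench APIs.

```python
import shutil
from pathlib import Path


def distribute_markdown(clone_dir: Path, instructions_dir: Path) -> list[Path]:
    """Copy every *.md file under clone_dir into instructions_dir.

    Relative paths are preserved; anything inside .git is skipped.
    Hypothetical helper: the real folder layout is an assumption.
    """
    copied: list[Path] = []
    for src in sorted(clone_dir.rglob("*.md")):
        rel = src.relative_to(clone_dir)
        if ".git" in rel.parts:
            continue  # never copy repository metadata
        dest = instructions_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

With this shape, the existing instructions/skills toggles in config.yaml would decide whether the copied files are used, and nothing in the helper is specific to the code-review category.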
Goal
Enable a fair, faithful before/after comparison for the `code-review` category so we can measure whether consuming `microsoft/BCQuality` actually improves AL code review over what BCApps shipped previously:
- Old inline: `instructions.enabled: true`
- Live BCQuality: `bcquality.enabled: true`, `entry.md` routing
Does not touch `bug-fix`/`test-generation` behavior. All arms default off (vanilla baseline).
What's in this PR
1. Live BCQuality consumption (code-review only)
- `config.yaml`: new `bcquality:` section (default `enabled: false`). Pins `repo` + `ref` (40-char SHA), `enabled-layers`, `disabled-skills`, knowledge allow/deny globs, and task-context dimensions.
- `evaluate/codereview_bcquality.py`: pure, type-safe building blocks that faithfully replicate BCApps' current flow: `clone_bcquality` (pinned SHA, shallow), `filter_clone` (mirrors `Invoke-BCQualityFilter.ps1`, writes `_filter-report.json`), a task-context writer, and a bootstrap prompt that routes the agent through `skills/entry.md` and emits the BC-Bench `review.json` schema.
- `copilot/agent.py`: live branch. The filtered BCQuality clone becomes the Copilot CWD (knowledge read before the diff), the repo under review is granted via `--add-dir`, static injection is skipped, and hooks are installed into the clone.
- `types.py`: `ExperimentConfiguration.bcquality` flag.
2. Faithful "old inline" baseline arm
- The 6 domain checklists are extracted verbatim from BCApps `30e2b18ca3^`, the version BCApps actually shipped before adopting BCQuality (deleted in #8700). Line counts match the original exactly.
- Not the `experiment/code-review-al-skill` snapshot, which was tuned on the benchmark (privacy heavily rewritten, 3 others nudged) and would contaminate the comparison.
- `AGENTS.md`: review section routing `/review` through the 6 domain checklists.
3. Tests: 23 unit tests for the BCQuality module (config parsing, glob matching, clone filtering, task-context, bootstrap prompt, shipped-config alignment).
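As a rough illustration of two pieces from item 1, the pinned-ref check and the knowledge allow/deny globs, the sketch below shows one way they could work. It is not the module's actual code: the function names and example paths are hypothetical, the glob semantics are Python's `fnmatchcase` (where `*` also matches `/`) rather than whatever `Invoke-BCQualityFilter.ps1` does, and the HTTPS check only looks at the URL scheme, not at whether the host is trusted.

```python
import re
from fnmatch import fnmatchcase

# A full Git object name for SHA-1 repositories: exactly 40 lowercase hex digits.
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def is_pinned_sha(ref: str) -> bool:
    """True only for a full 40-char SHA; branches, tags and short SHAs are rejected."""
    return _SHA_RE.fullmatch(ref) is not None


def is_https_repo(repo: str) -> bool:
    """Scheme check only; deciding which hosts are trusted is out of scope here."""
    return repo.startswith("https://")


def select_knowledge(paths: list[str], allow: list[str], deny: list[str]) -> list[str]:
    """Keep paths that match at least one allow glob and no deny glob."""
    return [
        p
        for p in paths
        if any(fnmatchcase(p, g) for g in allow)
        and not any(fnmatchcase(p, g) for g in deny)
    ]


# Hypothetical file names, for illustration only.
example = select_knowledge(
    ["knowledge/security/secrets.md", "knowledge/style/naming.md", "skills/entry.md"],
    allow=["knowledge/*"],
    deny=["knowledge/style/*"],
)
```

Deny taking precedence over allow is a design assumption in this sketch; it keeps an over-broad allow glob from silently re-enabling a layer that was meant to be excluded.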
Security
The BCQuality clone becomes the agent's working directory, so it is treated as a prompt-injection surface:
- `ref` must be a pinned, reviewed 40-char SHA and `repo` must be a trusted HTTPS source (validated in `BCQualityConfig.validate()`).
Verified locally
- `ty check` clean.
- `synthetic__security-001`: Copilot read the injected `.github/instructions/security.md` and produced a valid `review.json` (critical hardcoded key + high plain-text secret), both grounded in the checklist.
How to run online
This PR is meant to drive the `code-review` evaluation workflow per arm. Flip the relevant config toggle (or workflow input) for vanilla / old-inline / live-BCQuality and run with `repeat=5`, then compare micro/macro F1.
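For the final comparison step, the difference between micro and macro F1 can be sketched as follows. This is a generic illustration, not BC-Bench's scoring code: it assumes each evaluated unit (for example a task or a category) is summarized as a hypothetical `(tp, fp, fn)` tuple of matched, spurious, and missed findings.

```python
Counts = tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts: list[Counts]) -> float:
    """Pool all counts first, then compute one F1 (large units dominate)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


def macro_f1(counts: list[Counts]) -> float:
    """Compute F1 per unit, then average (every unit weighs the same)."""
    return sum(f1(*c) for c in counts) / len(counts) if counts else 0.0


# One unit with two correct findings, one with a false positive and a miss.
sample = [(2, 0, 0), (0, 1, 1)]
```

On `sample`, micro F1 pools to 4/6 while macro F1 averages 1.0 and 0.0 to 0.5, which is why reporting both per arm helps show whether a gain comes from a few finding-heavy units or is spread evenly.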