Skip to content

[Code-review]: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm#696

Draft
gggdttt wants to merge 16 commits into
mainfrom
private/wenjiefan/code-review-live-bcquality
Draft

[Code-review]: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm#696
gggdttt wants to merge 16 commits into
mainfrom
private/wenjiefan/code-review-live-bcquality

Conversation

@gggdttt

@gggdttt gggdttt commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Goal

Enable a fair, faithful before/after comparison for the code-review category so we can measure whether consuming microsoft/BCQuality actually improves AL code review over what BCApps shipped previously:

Arm Config toggle Role
vanilla all false anchor (no BC review knowledge)
old inline instructions.enabled: true "before" — BCApps pre-#8700 inline knowledge
live BCQuality bcquality.enabled: true "after" — clone + filter + entry.md routing

Does not touch bug-fix / test-generation behavior. All arms default off (vanilla baseline).

What's in this PR

1. Live BCQuality consumption (code-review only)

  • config.yaml: new bcquality: section (default enabled: false). Pins repo + ref (40-char SHA), enabled-layers, disabled-skills, knowledge allow/deny globs, and task-context dimensions.
  • evaluate/codereview_bcquality.py: pure, type-safe building blocks that faithfully replicate BCApps' today flow — clone_bcquality (pinned SHA, shallow), filter_clone (mirrors Invoke-BCQualityFilter.ps1, writes _filter-report.json), task-context writer, and a bootstrap prompt that routes the agent through skills/entry.md and emits the BC-Bench review.json schema.
  • copilot/agent.py: live branch — the filtered BCQuality clone becomes the Copilot CWD (knowledge read before the diff), the repo under review is granted via --add-dir, static injection skipped, hooks installed into the clone.
  • types.py: ExperimentConfiguration.bcquality flag.

2. Faithful "old inline" baseline arm

  • Extracts the 6 domain checklists (accessibility / performance / privacy / security / style / upgrade) verbatim from BCApps 30e2b18ca3^ — the version BCApps actually shipped before adopting BCQuality (deleted in #8700). Line counts match the original exactly.
  • Deliberately not the experiment/code-review-al-skill snapshot, which was tuned on the benchmark (privacy heavily rewritten, 3 others nudged) and would contaminate the comparison.
  • AGENTS.md: review section routing /review through the 6 domain checklists.

3. Tests: 23 unit tests for the BCQuality module (config parsing, glob matching, clone filtering, task-context, bootstrap prompt, shipped-config alignment).

Security

The BCQuality clone becomes the agent's working directory, so it is treated as a prompt-injection surface: ref must be a pinned, reviewed 40-char SHA and repo must be a trusted HTTPS source (validated in BCQualityConfig.validate()).

Verified locally

  • ruff format + ruff check + ty check clean.
  • Full suite: 571 passed, 1 skipped (no regressions).
  • Old-inline arm smoke test (synthetic__security-001): Copilot read the injected .github/instructions/security.md and produced a valid review.json (critical hardcoded key + high plain-text secret), both grounded in the checklist.

How to run online

This PR is to drive the code-review evaluation workflow per arm. Flip the relevant config toggle (or workflow input) for vanilla / old-inline / live-BCQuality and run with repeat=5, then compare micro/macro F1.

Draft — opening so we can run the online evaluation and review the before/after numbers.

wenjiefan added 13 commits June 25, 2026 14:36
Adds a bcquality config section (default disabled) and a Python module that clones BCQuality at a pinned SHA, filters it per enabled-layers/knowledge globs, builds task-context, and a skills/entry.md bootstrap prompt -- replicating how microsoft/BCApps consumes microsoft/BCQuality today. Not yet wired into the agent; no effect on existing categories.
- ExperimentConfiguration: add bcquality flag
- copilot agent: live BCQuality branch (clone CWD, --add-dir repo, skip static injection)
- add 23 unit tests for codereview_bcquality module
…line arm

- Extract the 6 faithful domain checklists (accessibility/performance/privacy/
  security/style/upgrade) verbatim from BCApps 30e2b18ca3^ (the version BCApps
  shipped before adopting BCQuality), NOT the benchmark-tuned experiment snapshot
- AGENTS.md: add review section routing /review through the 6 domain checklists
- Enables a faithful before/after comparison: vanilla < old inline < live BCQuality
- Inert by default (instructions.enabled=false); arm activated via config toggle
…nistic severity mapping, relocate bcquality module to agent/shared)
@gggdttt gggdttt marked this pull request as ready for review June 26, 2026 13:16
@gggdttt gggdttt changed the title code-review: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm [Code-review]: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm Jun 26, 2026

@haoranpb haoranpb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work on the SHA-pinning, I'd like to push on the design here a bit though:

My concern: we are making BCQuality a first-class experimentation flag, when it is really just lots of *.md files. Where we already have the mechanisms to do that with instructions and skills in the config.yaml.

My first instinct is, pull BCQuality at a pinned SHA and just distribute those md files into our existing infrastructure (folder structure), then let the existing config handle it. The most manual way to think of it, just copy paste those md files into BC-Bench under the instructions folder.

What do you think? Also, try to make it generic as well, folks probably want to experiment with it in other categories as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants