migrate-xunit-to-mstest: switch file-state graders from output to file (no false negatives)#761
Evangelink wants to merge 1 commit into
…e (no false negatives)

The eval was tripping false-negative graders: the skill activates and the agent EXPLAINS what it converted (e.g., "I removed using Xunit;", "Converting [Theory] to [TestMethod]"), so output-not-matches on patterns like `using Xunit;`, `[Theory]`, `Trait\(`, `IClassFixture`, `TestContext.Current.CancellationToken`, etc. fired on the agent's prose even when the migrated file was correct. Conversely, output-matches on `Microsoft\.VisualStudio\.TestTools\.UnitTesting`, `[TestClass]`, `[DataRow(`, etc. could match the agent's explanation without verifying the file.

Convert every grader that is really checking FILE state into file-contains / file-not-contains (with the relevant glob: **/*.cs or **/*.csproj). Keep the prose-checking graders where the rubric genuinely wants the agent to talk about something (unsupported-TFM explanation, scope-decision narrative, already-on-MSTest "no migration needed" message).

The two CancellationToken positive graders stay on prose because they use regex alternation that file-contains (literal only) can't express; the file is still indirectly checked by the dotnet test run-command.

Mirrored across both eval.vally.yaml (vally runner) and eval.yaml (skill-validator).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/evaluate
Pull request overview
This PR hardens the dotnet-test/migrate-xunit-to-mstest evaluation rubric by moving graders that are intended to verify migrated file contents from response-output matching to file-state checks, preventing false negatives/positives caused by the agent quoting old/new tokens in its explanation.
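The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the vally or skill-validator implementation: the helper names `output_not_matches` and `file_not_contains`, the agent reply, and the migrated file are all hypothetical and chosen only to mirror the grader names used in this PR.

```python
import re

# Hypothetical agent reply: it correctly describes what it removed.
agent_output = (
    "I removed `using Xunit;` and converted [Theory] to [TestMethod] "
    "with [DataRow] attributes."
)

# Hypothetical migrated file: a correct MSTest result with no xUnit left.
migrated_file = """using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CalculatorTests
{
    [TestMethod]
    [DataRow(1, 2, 3)]
    public void Add_ReturnsSum(int a, int b, int expected)
    {
        Assert.AreEqual(expected, a + b);
    }
}
"""


def output_not_matches(pattern: str, output: str) -> bool:
    """Assumed semantics: pass only if the regex is absent from the reply."""
    return re.search(pattern, output) is None


def file_not_contains(literal: str, content: str) -> bool:
    """Assumed semantics: pass only if the literal is absent from the file."""
    return literal not in content


# The prose grader fails because the agent quoted the old token...
prose_check = output_not_matches(r"using Xunit;", agent_output)
# ...while the file grader passes because the migration is actually correct.
file_check = file_not_contains("using Xunit;", migrated_file)

print(f"output-not-matches passed: {prose_check}")
print(f"file-not-contains passed:  {file_check}")
```

The same reply would also satisfy an output-based positive check for `[TestMethod]` regardless of what is on disk, which is the opposite failure the overview mentions.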
Changes:
- Replaced `output_matches`/`output_not_matches` checks that were validating code/project state with `file_contains`/`file_not_contains` (skill-validator) and `file-contains`/`file-not-contains` (vally).
- Kept a small set of response/prose graders where the rubric explicitly requires narrative (unsupported TFM explanation, ICollectionFixture scope decision explanation, and "already on MSTest" messaging).
- Added inline comments documenting why certain graders remain output-based (notably the CancellationToken OR-regex cases).
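To illustrate why the OR-regex cases cannot be ported one-to-one to a literal check, here is a small sketch. The two helper functions and the C# snippets are hypothetical; only the alternation pattern itself is taken from the PR description.

```python
import re

# Alternation used by the two CancellationToken positive graders.
PATTERN = r"TestContext\.CancellationToken|_testContext\.CancellationToken"


def regex_matches(pattern: str, text: str) -> bool:
    """Regex grader: either branch of the alternation may match."""
    return re.search(pattern, text) is not None


def literal_contains(needle: str, text: str) -> bool:
    """Literal grader: the needle must appear character for character."""
    return needle in text


# Two hypothetical, equally valid migrated call sites.
property_style = "await Task.Delay(10, TestContext.CancellationToken);"
field_style = "await Task.Delay(10, _testContext.CancellationToken);"

# The regex accepts both shapes.
regex_results = [regex_matches(PATTERN, s) for s in (property_style, field_style)]

# A single literal accepts only one of them (note the lowercase 't' in the field).
literal_results = [
    literal_contains("TestContext.CancellationToken", s)
    for s in (property_style, field_style)
]

# Passing the regex source as a literal matches neither.
pattern_as_literal = [
    literal_contains(PATTERN, s) for s in (property_style, field_style)
]

print(regex_results, literal_results, pattern_as_literal)
```

Under these assumptions a single literal grader would reject one of the two acceptable shapes, which is consistent with leaving these two graders output-based.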
| File | Description |
|---|---|
| tests/dotnet-test/migrate-xunit-to-mstest/eval.yaml | Converts file-state assertions from output regex matching to literal file content checks across **/*.cs / **/*.csproj, retaining only rubric-required prose assertions. |
| tests/dotnet-test/migrate-xunit-to-mstest/eval.vally.yaml | Applies the same conversion for vally graders using file-contains/file-not-contains, with matching rationale comments and preserved prose-only graders. |
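A glob-scoped literal check of the kind summarized in the table could look roughly like the sketch below. It assumes "contains" means at least one matching file contains the literal, which may differ from how either runner actually aggregates files; the helper names, directory layout, and file contents are hypothetical.

```python
import tempfile
from pathlib import Path


def file_contains(root: str, glob: str, literal: str) -> bool:
    """Assumed semantics: True if any file matching the glob contains the literal."""
    return any(
        literal in path.read_text(encoding="utf-8")
        for path in Path(root).glob(glob)
        if path.is_file()
    )


def file_not_contains(root: str, glob: str, literal: str) -> bool:
    """Assumed semantics: True if no file matching the glob contains the literal."""
    return not file_contains(root, glob, literal)


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical migrated project layout.
    tests_dir = Path(workdir) / "tests"
    tests_dir.mkdir()
    (tests_dir / "CalculatorTests.cs").write_text(
        "using Microsoft.VisualStudio.TestTools.UnitTesting;\n"
        "[TestClass]\npublic class CalculatorTests { }\n",
        encoding="utf-8",
    )
    (tests_dir / "Calculator.Tests.csproj").write_text(
        '<Project><ItemGroup><PackageReference Include="MSTest" />'
        "</ItemGroup></Project>\n",
        encoding="utf-8",
    )

    has_test_class = file_contains(workdir, "**/*.cs", "[TestClass]")
    no_xunit_using = file_not_contains(workdir, "**/*.cs", "using Xunit;")
    no_xunit_package = file_not_contains(workdir, "**/*.csproj", "xunit")
    # The *.cs glob does not pick up the .csproj file.
    cs_glob_sees_package = file_contains(workdir, "**/*.cs", "PackageReference")

print(has_test_class, no_xunit_using, no_xunit_package, cs_glob_sees_package)
```

Because the check reads files rather than the reply, whatever the agent says about the tokens it removed or added has no effect on the result.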
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 0
Skill Validation Results
[1] Model: claude-opus-4.6 | Judge: claude-opus-4.6 🔍 Full Results - additional metrics and failure investigation steps
✅ Evaluation passed for
Problem
The eval was tripping false-negative graders: the skill activates and the agent EXPLAINS what it converted (e.g., "I removed `using Xunit;`", "Converting `[Theory]` to `[TestMethod]`"), so `output-not-matches` on patterns like `using Xunit;`, `[Theory]`, `Trait(`, `IClassFixture`, `TestContext.Current.CancellationToken`, etc. fired on the agent's prose even when the migrated file was correct. Conversely, `output-matches` on `Microsoft.VisualStudio.TestTools.UnitTesting`, `[TestClass]`, `[DataRow(`, etc. could match the agent's explanation without verifying the file.

This dragged down the per-scenario assertion score AND propagated into the run-command / judge scores. Latest eval regressions traceable to this:
Change
Convert every grader that is really checking FILE state into `file-contains` / `file-not-contains` (with the relevant glob: `**/*.cs` or `**/*.csproj`). Keep the prose-checking graders where the rubric genuinely wants the agent to talk about something (unsupported-TFM explanation, scope-decision narrative, already-on-MSTest "no migration needed" message). For the scope decision, the `[DoNotParallelize]` check moves to file-state but the scope/sharing/parallelization prose check stays.

The two CancellationToken positive graders stay on prose because they use regex alternation (`TestContext.CancellationToken|_testContext.CancellationToken`) that `file-contains` (literal only) can't express; the file shape is still indirectly checked by the `dotnet test` run-command.

Skill content (`SKILL.md`, `mapping-cheatsheet.md`) is left unchanged; those are tightened only if the grader fix alone isn't enough.

Files
- `tests/dotnet-test/migrate-xunit-to-mstest/eval.vally.yaml`: vally runner.
- `tests/dotnet-test/migrate-xunit-to-mstest/eval.yaml`: skill-validator runner.

Validation
`dotnet run --project eng/skill-validator/src -- check --plugin ./plugins/dotnet-test` passes.

Note on triggering /evaluate
The `evaluation.yml` workflow on main is currently broken: the `run-name` expression contains an unescaped `#`, which YAML parses as a comment, causing the `${{` block to be unclosed. That is being fixed by #759; once #759 merges, `/evaluate` on this PR will trigger a fresh eval comparing before/after this change.
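The `#` problem can be modeled without a YAML library. The sketch below is a deliberately simplified stand-in for YAML's comment rule (in an unquoted scalar, a `#` preceded by whitespace starts a comment), not a real parser. The `run-name` value is hypothetical, since the actual expression is not shown here, and quoting is only one common way to avoid the issue, not necessarily what #759 does.

```python
import re


def scalar_value(raw: str) -> str:
    """Simplified model of a single-line YAML value.

    Quoted scalars are returned whole (without the quotes). In a plain
    scalar, a '#' preceded by whitespace starts a comment and the rest
    of the line is dropped.
    """
    raw = raw.strip()
    if raw[:1] in ("'", '"'):
        return raw[1:-1]
    return re.split(r"\s#", raw, maxsplit=1)[0].rstrip()


def expression_is_closed(value: str) -> bool:
    """Every '${{' opener needs a matching '}}' closer."""
    return value.count("${{") == value.count("}}")


# Hypothetical run-name with an unescaped '#' inside the expression.
unquoted = "${{ format('Evaluate PR #{0}', github.event.issue.number) }}"
quoted = '"' + unquoted + '"'

unquoted_value = scalar_value(unquoted)
quoted_value = scalar_value(quoted)

print(repr(unquoted_value), expression_is_closed(unquoted_value))
print(repr(quoted_value), expression_is_closed(quoted_value))
```

In this model the unquoted value is cut off at the `#`, so the closing `}}` never reaches the expression evaluator, while the quoted value survives intact.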