Skip to content

migrate-xunit-to-mstest: switch file-state graders from output to file (no false negatives)#761

Open
Evangelink wants to merge 1 commit into
mainfrom
dev/amauryleve/migrate-xunit-eval-grader-fix
Open

migrate-xunit-to-mstest: switch file-state graders from output to file (no false negatives)#761
Evangelink wants to merge 1 commit into
mainfrom
dev/amauryleve/migrate-xunit-eval-grader-fix

Conversation

@Evangelink

@Evangelink Evangelink commented Jun 12, 2026

Copy link
Copy Markdown
Member

Problem

The eval was tripping false-negative graders: the skill activates and the agent EXPLAINS what it converted (e.g., "I removed using Xunit;", "Converting [Theory] to [TestMethod]"), so output-not-matches on patterns like using Xunit;, [Theory], Trait(, IClassFixture, TestContext.Current.CancellationToken, etc. fired on the agent's prose even when the migrated file was correct. Conversely output-matches on Microsoft.VisualStudio.TestTools.UnitTesting, [TestClass], [DataRow(, etc. could match the agent's explanation without verifying the file.

This dragged down the per-scenario assertion score AND propagated into the run-command / judge scores. Latest eval regressions traceable to this:

  • Migrate basic xUnit v2: 4.0/5 → 3.7/5
  • Migrate basic xUnit v3: 5.0/5 → 3.3/5
  • Convert IClassFixture: 4.0/5 → 3.3/5
  • Convert Skip/Trait/Timeout: 3.7/5 → 3.0/5
  • Preserve Parallelize: 3.0/5 → 2.7/5

Change

Convert every grader that is really checking FILE state into file-contains / file-not-contains (with the relevant glob: **/*.cs or **/*.csproj). Keep the prose-checking graders where the rubric genuinely wants the agent to talk about something:

  • Unsupported-TFM scenario: the rubric requires the agent to identify and explain — kept on prose.
  • ICollectionFixture scope decision: the rubric requires explicit narrative — the [DoNotParallelize] check moves to file-state but the scope/sharing/parallelization prose check stays.
  • Already-on-MSTest: the rubric requires the agent to TELL the user no migration is needed — kept on prose.

The two CancellationToken positive graders stay on prose because they use regex alternation (TestContext.CancellationToken|_testContext.CancellationToken) that file-contains (literal only) can't express; the file shape is still indirectly checked by the dotnet test run-command.

Skill content (SKILL.md, mapping-cheatsheet.md) is left unchanged — those are tightened only if the grader fix alone isn't enough.

Files

  • tests/dotnet-test/migrate-xunit-to-mstest/eval.vally.yaml — vally runner.
  • tests/dotnet-test/migrate-xunit-to-mstest/eval.yaml — skill-validator runner.

Validation

  • dotnet run --project eng/skill-validator/src -- check --plugin ./plugins/dotnet-test passes.
  • YAML parses cleanly with all 12 scenarios preserved on both files.

Note on triggering /evaluate

The evaluation.yml workflow on main is currently broken — the run-name expression contains an unescaped # which YAML parses as a comment, causing the ${{ block to be unclosed. That is being fixed by #759 — once #759 merges, /evaluate on this PR will trigger a fresh eval comparing before/after this change.

…e (no false negatives)

The eval was tripping false-negative graders: the skill activates and the agent
EXPLAINS what it converted (e.g., "I removed using Xunit;", "Converting [Theory]
to [TestMethod]"), so output-not-matches on patterns like `using Xunit;`, `[Theory]`,
`Trait\(`, `IClassFixture`, `TestContext.Current.CancellationToken`, etc. fired on
the agent's prose even when the migrated file was correct. Conversely output-matches
on `Microsoft\.VisualStudio\.TestTools\.UnitTesting`, `[TestClass]`, `[DataRow(`,
etc. could match the agent's explanation without verifying the file.

Convert every grader that is really checking FILE state into file-contains /
file-not-contains (with the relevant glob: **/*.cs or **/*.csproj). Keep the
prose-checking graders where the rubric genuinely wants the agent to talk about
something (unsupported-TFM explanation, scope-decision narrative, already-on-MSTest
"no migration needed" message). The two CancellationToken positive graders stay on
prose because they use regex alternation that file-contains (literal only) can't
express; the file is still indirectly checked by the dotnet test run-command.

Mirrored across both eval.vally.yaml (vally runner) and eval.yaml (skill-validator).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 16:37
@Evangelink

Copy link
Copy Markdown
Member Author

/evaluate

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the dotnet-test/migrate-xunit-to-mstest evaluation rubric by moving graders that are intended to verify migrated file contents from response-output matching to file-state checks, preventing false negatives/positives caused by the agent quoting old/new tokens in its explanation.

Changes:

  • Replaced output_matches / output_not_matches checks that were validating code/project state with file_contains / file_not_contains (skill-validator) and file-contains / file-not-contains (vally).
  • Kept a small set of response/prose graders where the rubric explicitly requires narrative (unsupported TFM explanation, ICollectionFixture scope decision explanation, and “already on MSTest” messaging).
  • Added inline comments documenting why certain graders remain output-based (notably the CancellationToken OR-regex cases).
Show a summary per file
File Description
tests/dotnet-test/migrate-xunit-to-mstest/eval.yaml Converts file-state assertions from output regex matching to literal file content checks across **/*.cs / **/*.csproj, retaining only rubric-required prose assertions.
tests/dotnet-test/migrate-xunit-to-mstest/eval.vally.yaml Applies the same conversion for vally graders using file-contains/file-not-contains, with matching rationale comments and preserved prose-only graders.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 0

@github-actions github-actions Bot added the pr-state/ready-for-eval PR is mergeable and awaiting evaluation label Jun 12, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
migrate-xunit-to-mstest Migrate basic xUnit v2 project to MSTest v4 preserving VSTest 4.7/5 → 4.0/5 🔴 ✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, read_bash ✅ 0.10
migrate-xunit-to-mstest Migrate basic xUnit v3 project to MSTest v4 preserving VSTest 3.7/5 → 3.0/5 🔴 ✅ migrate-xunit-to-mstest; tools: skill ✅ 0.10 [1]
migrate-xunit-to-mstest Stop when target framework is unsupported by MSTest v4 1.0/5 → 2.0/5 🟢 ✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, read_bash ✅ 0.10
migrate-xunit-to-mstest Convert IClassFixture to ClassInitialize 3.0/5 → 3.3/5 🟢 ✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, create ✅ 0.10 [2]
migrate-xunit-to-mstest Handle ICollectionFixture explicitly (do not silently widen scope) 2.0/5 → 3.0/5 🟢 ✅ migrate-xunit-to-mstest; tools: skill ✅ 0.10
migrate-xunit-to-mstest Convert ITestOutputHelper to TestContext 4.7/5 → 5.0/5 🟢 ✅ migrate-xunit-to-mstest; tools: skill, create / ✅ migrate-xunit-to-mstest; tools: skill ✅ 0.10 [3]
migrate-xunit-to-mstest Map exception assertions correctly (xUnit Throws -> MSTest ThrowsExactly) 4.3/5 → 5.0/5 🟢 ✅ migrate-xunit-to-mstest; tools: skill, create / ✅ migrate-xunit-to-mstest; tools: skill ✅ 0.10 [4]
migrate-xunit-to-mstest Convert MemberData and TheoryData to DynamicData 5.0/5 → 3.0/5 🔴 ✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, read_bash ✅ 0.10 [5]
migrate-xunit-to-mstest Convert Skip, Trait, and Timeout 3.3/5 → 3.3/5 ✅ migrate-xunit-to-mstest; tools: skill, create / ✅ migrate-xunit-to-mstest; tools: skill, create, read_bash ✅ 0.10 [6]
migrate-xunit-to-mstest Preserve xUnit parallelization default with [assembly: Parallelize] 3.3/5 → 3.0/5 🔴 ✅ migrate-xunit-to-mstest; tools: skill ✅ 0.10 [7]
migrate-xunit-to-mstest Convert xUnit v3 TestContext.Current.CancellationToken 1.7/5 → 2.3/5 🟢 ✅ migrate-xunit-to-mstest; tools: skill, edit, bash / ✅ migrate-xunit-to-mstest; tools: skill, edit, bash, read_bash ✅ 0.10 [8]
migrate-xunit-to-mstest Recognize project already on MSTest 5.0/5 → 5.0/5 ✅ migrate-xunit-to-mstest; tools: skill ✅ 0.10 [9]

[1] ⚠️ High run-to-run variance (CV=202%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -2.1% due to: tokens (55778 → 84172)
[3] ⚠️ High run-to-run variance (CV=183%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.5% due to: tokens (69269 → 199115), time (36.7s → 52.1s), tool calls (8 → 10)
[4] ⚠️ High run-to-run variance (CV=332%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=58%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=89%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -13.3% due to: judgment, tokens (60594 → 83299)
[7] ⚠️ High run-to-run variance (CV=11333%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +14.9% due to: efficiency metrics
[8] ⚠️ High run-to-run variance (CV=358%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -3.6% due to: judgment, tokens (17444 → 59324), time (38.5s → 70.0s)
[9] ⚠️ High run-to-run variance (CV=151%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.0% due to: quality, tokens (43574 → 50441)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 761 in dotnet/skills, download eval artifacts with gh run download 27439844362 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/6a73a76b062affcb0fe1e955eb376dace4f8f4b2/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@github-actions github-actions Bot added waiting-on-review PR state label and removed pr-state/ready-for-eval PR is mergeable and awaiting evaluation labels Jun 12, 2026
@github-actions

Copy link
Copy Markdown
Contributor

✅ Evaluation passed for 6a73a76. cc @dotnet/dotnet-testing — please review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting-on-review PR state label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants