migrate-xunit-to-mstest: switch file-state graders from output to file (no false negatives) by Evangelink · Pull Request #761 · dotnet/skills

Evangelink · 2026-06-12T16:37:40Z

Problem

The eval was tripping false-negative graders: the skill activates and the agent EXPLAINS what it converted (e.g., "I removed using Xunit;", "Converting [Theory] to [TestMethod]"), so output-not-matches on patterns like using Xunit;, [Theory], Trait(, IClassFixture, TestContext.Current.CancellationToken, etc. fired on the agent's prose even when the migrated file was correct. Conversely output-matches on Microsoft.VisualStudio.TestTools.UnitTesting, [TestClass], [DataRow(, etc. could match the agent's explanation without verifying the file.

This dragged down the per-scenario assertion score AND propagated into the run-command / judge scores. Latest eval regressions traceable to this:

Migrate basic xUnit v2: 4.0/5 → 3.7/5
Migrate basic xUnit v3: 5.0/5 → 3.3/5
Convert IClassFixture: 4.0/5 → 3.3/5
Convert Skip/Trait/Timeout: 3.7/5 → 3.0/5
Preserve Parallelize: 3.0/5 → 2.7/5

Change

Convert every grader that is really checking FILE state into file-contains / file-not-contains (with the relevant glob: **/*.cs or **/*.csproj). Keep the prose-checking graders where the rubric genuinely wants the agent to talk about something:

Unsupported-TFM scenario: the rubric requires the agent to identify and explain — kept on prose.
ICollectionFixture scope decision: the rubric requires explicit narrative — the [DoNotParallelize] check moves to file-state but the scope/sharing/parallelization prose check stays.
Already-on-MSTest: the rubric requires the agent to TELL the user no migration is needed — kept on prose.

The two CancellationToken positive graders stay on prose because they use regex alternation (TestContext.CancellationToken|_testContext.CancellationToken) that file-contains (literal only) can't express; the file shape is still indirectly checked by the dotnet test run-command.

Skill content (SKILL.md, mapping-cheatsheet.md) is left unchanged — those are tightened only if the grader fix alone isn't enough.

Files

tests/dotnet-test/migrate-xunit-to-mstest/eval.vally.yaml — vally runner.
tests/dotnet-test/migrate-xunit-to-mstest/eval.yaml — skill-validator runner.

Validation

dotnet run --project eng/skill-validator/src -- check --plugin ./plugins/dotnet-test passes.
YAML parses cleanly with all 12 scenarios preserved on both files.

Note on triggering /evaluate

The evaluation.yml workflow on main is currently broken — the run-name expression contains an unescaped # which YAML parses as a comment, causing the ${{ block to be unclosed. That is being fixed by #759 — once #759 merges, /evaluate on this PR will trigger a fresh eval comparing before/after this change.

…e (no false negatives) The eval was tripping false-negative graders: the skill activates and the agent EXPLAINS what it converted (e.g., "I removed using Xunit;", "Converting [Theory] to [TestMethod]"), so output-not-matches on patterns like `using Xunit;`, `[Theory]`, `Trait\(`, `IClassFixture`, `TestContext.Current.CancellationToken`, etc. fired on the agent's prose even when the migrated file was correct. Conversely output-matches on `Microsoft\.VisualStudio\.TestTools\.UnitTesting`, `[TestClass]`, `[DataRow(`, etc. could match the agent's explanation without verifying the file. Convert every grader that is really checking FILE state into file-contains / file-not-contains (with the relevant glob: **/*.cs or **/*.csproj). Keep the prose-checking graders where the rubric genuinely wants the agent to talk about something (unsupported-TFM explanation, scope-decision narrative, already-on-MSTest "no migration needed" message). The two CancellationToken positive graders stay on prose because they use regex alternation that file-contains (literal only) can't express; the file is still indirectly checked by the dotnet test run-command. Mirrored across both eval.vally.yaml (vally runner) and eval.yaml (skill-validator). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-12T16:37:49Z

/evaluate

Copilot

Pull request overview

This PR hardens the dotnet-test/migrate-xunit-to-mstest evaluation rubric by moving graders that are intended to verify migrated file contents from response-output matching to file-state checks, preventing false negatives/positives caused by the agent quoting old/new tokens in its explanation.

Changes:

Replaced output_matches / output_not_matches checks that were validating code/project state with file_contains / file_not_contains (skill-validator) and file-contains / file-not-contains (vally).
Kept a small set of response/prose graders where the rubric explicitly requires narrative (unsupported TFM explanation, ICollectionFixture scope decision explanation, and “already on MSTest” messaging).
Added inline comments documenting why certain graders remain output-based (notably the CancellationToken OR-regex cases).

Show a summary per file

File	Description
tests/dotnet-test/migrate-xunit-to-mstest/eval.yaml	Converts file-state assertions from output regex matching to literal file content checks across `*/.cs` / `*/.csproj`, retaining only rubric-required prose assertions.
tests/dotnet-test/migrate-xunit-to-mstest/eval.vally.yaml	Applies the same conversion for vally graders using `file-contains`/`file-not-contains`, with matching rationale comments and preserved prose-only graders.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Files reviewed: 2/2 changed files
Comments generated: 0

github-actions · 2026-06-12T20:20:18Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
migrate-xunit-to-mstest	Migrate basic xUnit v2 project to MSTest v4 preserving VSTest	4.7/5 → 4.0/5 🔴	✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, read_bash	✅ 0.10	❌
migrate-xunit-to-mstest	Migrate basic xUnit v3 project to MSTest v4 preserving VSTest	3.7/5 → 3.0/5 🔴	✅ migrate-xunit-to-mstest; tools: skill	✅ 0.10	✅ [1]
migrate-xunit-to-mstest	Stop when target framework is unsupported by MSTest v4	1.0/5 → 2.0/5 🟢	✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, read_bash	✅ 0.10	✅
migrate-xunit-to-mstest	Convert IClassFixture to ClassInitialize	3.0/5 → 3.3/5 🟢	✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, create	✅ 0.10	❌ [2]
migrate-xunit-to-mstest	Handle ICollectionFixture explicitly (do not silently widen scope)	2.0/5 → 3.0/5 🟢	✅ migrate-xunit-to-mstest; tools: skill	✅ 0.10	✅
migrate-xunit-to-mstest	Convert ITestOutputHelper to TestContext	4.7/5 → 5.0/5 🟢	✅ migrate-xunit-to-mstest; tools: skill, create / ✅ migrate-xunit-to-mstest; tools: skill	✅ 0.10	❌ [3]
migrate-xunit-to-mstest	Map exception assertions correctly (xUnit Throws -> MSTest ThrowsExactly)	4.3/5 → 5.0/5 🟢	✅ migrate-xunit-to-mstest; tools: skill, create / ✅ migrate-xunit-to-mstest; tools: skill	✅ 0.10	✅ [4]
migrate-xunit-to-mstest	Convert MemberData and TheoryData to DynamicData	5.0/5 → 3.0/5 🔴	✅ migrate-xunit-to-mstest; tools: skill / ✅ migrate-xunit-to-mstest; tools: skill, read_bash	✅ 0.10	❌ [5]
migrate-xunit-to-mstest	Convert Skip, Trait, and Timeout	3.3/5 → 3.3/5	✅ migrate-xunit-to-mstest; tools: skill, create / ✅ migrate-xunit-to-mstest; tools: skill, create, read_bash	✅ 0.10	❌ [6]
migrate-xunit-to-mstest	Preserve xUnit parallelization default with [assembly: Parallelize]	3.3/5 → 3.0/5 🔴	✅ migrate-xunit-to-mstest; tools: skill	✅ 0.10	✅ [7]
migrate-xunit-to-mstest	Convert xUnit v3 TestContext.Current.CancellationToken	1.7/5 → 2.3/5 🟢	✅ migrate-xunit-to-mstest; tools: skill, edit, bash / ✅ migrate-xunit-to-mstest; tools: skill, edit, bash, read_bash	✅ 0.10	❌ [8]
migrate-xunit-to-mstest	Recognize project already on MSTest	5.0/5 → 5.0/5	✅ migrate-xunit-to-mstest; tools: skill	✅ 0.10	❌ [9]

[1] ⚠️ High run-to-run variance (CV=202%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -2.1% due to: tokens (55778 → 84172)
[3] ⚠️ High run-to-run variance (CV=183%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -1.5% due to: tokens (69269 → 199115), time (36.7s → 52.1s), tool calls (8 → 10)
[4] ⚠️ High run-to-run variance (CV=332%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=58%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=89%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -13.3% due to: judgment, tokens (60594 → 83299)
[7] ⚠️ High run-to-run variance (CV=11333%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +14.9% due to: efficiency metrics
[8] ⚠️ High run-to-run variance (CV=358%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -3.6% due to: judgment, tokens (17444 → 59324), time (38.5s → 70.0s)
[9] ⚠️ High run-to-run variance (CV=151%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.0% due to: quality, tokens (43574 → 50441)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 761 in dotnet/skills, download eval artifacts with gh run download 27439844362 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/6a73a76b062affcb0fe1e955eb376dace4f8f4b2/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions · 2026-06-12T21:00:45Z

✅ Evaluation passed for 6a73a76. cc @dotnet/dotnet-testing — please review.

Copilot AI review requested due to automatic review settings June 12, 2026 16:37

Copilot started reviewing on behalf of Evangelink June 12, 2026 16:38 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

github-actions Bot added the pr-state/ready-for-eval PR is mergeable and awaiting evaluation label Jun 12, 2026

github-actions Bot added waiting-on-review PR state label and removed pr-state/ready-for-eval PR is mergeable and awaiting evaluation labels Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

migrate-xunit-to-mstest: switch file-state graders from output to file (no false negatives)#761

migrate-xunit-to-mstest: switch file-state graders from output to file (no false negatives)#761
Evangelink wants to merge 1 commit into
mainfrom
dev/amauryleve/migrate-xunit-eval-grader-fix

Evangelink commented Jun 12, 2026 •

edited

Loading

Uh oh!

Evangelink commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Evangelink commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Change

Files

Validation

Note on triggering /evaluate

Uh oh!

Evangelink commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

github-actions Bot commented Jun 12, 2026

Skill Validation Results

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Evangelink commented Jun 12, 2026 •

edited

Loading