Add MCP servers to dotnet-blazor plugin#703

Open
javiercn wants to merge 8 commits into main from javiercn/add-mcp-servers-to-blazor-plugin

@javiercn

Member

Summary

  • add the Microsoft Learn MCP server to plugins/dotnet-blazor/plugin.json
  • add the Playwright MCP server to plugins/dotnet-blazor/plugin.json
  • allowlist both MCP server declarations in eng/allowed-external-deps.txt
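As a rough illustration of the two changes summarized above, the sketch below builds a manifest fragment with an `mcpServers` section and checks it against an allowlist. Everything structural here is an assumption made for illustration: the server keys, the `type`/`url`/`command`/`args` fields, the Microsoft Learn endpoint URL, and the one-entry-per-line `<plugin>/<server>` allowlist format are not taken from the PR diff, and the real `plugin.json` and `eng/allowed-external-deps.txt` may look different. The Playwright version is the one pinned later in this thread.

```python
import json

# Hypothetical manifest fragment; the real plugin.json schema may differ.
manifest = {
    "name": "dotnet-blazor",
    "mcpServers": {
        # Assumed remote (HTTP) server entry for Microsoft Learn.
        "microsoft-learn": {"type": "http", "url": "https://learn.microsoft.com/api/mcp"},
        # Assumed local (stdio) server entry for Playwright, started through npx.
        "playwright": {"command": "npx", "args": ["@playwright/mcp@0.0.75"]},
    },
}

# Hypothetical allowlist format: one "<plugin>/<server>" entry per line.
allowlist_text = """\
# MCP servers declared by plugins
dotnet-blazor/microsoft-learn
dotnet-blazor/playwright
"""


def unlisted_servers(manifest, allowlist_text):
    """Return declared MCP servers that have no matching allowlist entry."""
    allowed = {
        line.strip()
        for line in allowlist_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    }
    declared = {f"{manifest['name']}/{server}" for server in manifest.get("mcpServers", {})}
    return sorted(declared - allowed)


print(json.dumps(manifest["mcpServers"], indent=2))
print(unlisted_servers(manifest, allowlist_text))
```

With both entries allowlisted the check returns an empty list; dropping either line from the allowlist makes the corresponding server show up as unlisted.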

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 15:19

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the dotnet-blazor plugin manifest to declare two MCP server dependencies (Microsoft Learn + Playwright) and adds the corresponding allowlist entries so the skill-validator’s external dependency checks don’t flag them.

Changes:

  • Add mcpServers entries to plugins/dotnet-blazor/plugin.json.
  • Allowlist the new MCP server declarations in eng/allowed-external-deps.txt.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| plugins/dotnet-blazor/plugin.json | Declares two MCP servers for the dotnet-blazor plugin. |
| eng/allowed-external-deps.txt | Adds allowlist entries for the newly declared MCP servers. |

Comment thread plugins/dotnet-blazor/plugin.json
Comment thread plugins/dotnet-blazor/plugin.json Outdated
Comment thread plugins/dotnet-blazor/plugin.json
Comment thread plugins/dotnet-blazor/plugin.json Outdated
@javiercn javiercn requested review from danroth27 and lewing May 29, 2026 15:23
@github-actions

github-actions Bot commented May 29, 2026

Contributor

Skill Coverage Report

| Plugin | Skill | Covered | Coverage |
| --- | --- | --- | --- |
| dotnet-blazor | author-component | 0/1 | 0% |
| dotnet-blazor | collect-user-input | 0/4 | 0% |
| dotnet-blazor | configure-auth | 0/3 | 0% |
| dotnet-blazor | coordinate-components | 0/2 | 0% |
| dotnet-blazor | fetch-and-send-data | 1/5 | 20% |
| dotnet-blazor | plan-ui-change | 0/1 | 0% |
| dotnet-blazor | support-prerendering | 0/3 | 0% |
| dotnet-blazor | use-js-interop | 0/2 | 0% |
Uncovered: dotnet-blazor/author-component
  • [CodePattern] [Parameter] (line 31)
Uncovered: dotnet-blazor/collect-user-input
  • [CodePattern] [CascadingParameter] (line 162)
  • [CodePattern] [Range] (line 135)
  • [CodePattern] [SupplyParameterFromForm] (line 31)
  • [CodePattern] [Required] (line 135)
Uncovered: dotnet-blazor/configure-auth
  • [CodePattern] [CascadingParameter] (line 51)
  • [CodePattern] [ExcludeFromInteractiveRouting] (line 152)
  • [CodePattern] [Authorize] (line 101)
Uncovered: dotnet-blazor/coordinate-components
  • [CodePattern] [CascadingParameter] (line 63)
  • [CodePattern] readonly (line 137)
Uncovered: dotnet-blazor/fetch-and-send-data
  • [CodePattern] [StreamRendering] (line 74)
  • [CodePattern] [Parameter] (line 159)
  • [CodePattern] [SupplyParameterFromQuery] (line 159)
  • [CodePattern] [PersistentState] (line 84)
Uncovered: dotnet-blazor/plan-ui-change
  • [CodePattern] [Parameter] (line 64)
Uncovered: dotnet-blazor/support-prerendering
  • [CodePattern] [CascadingParameter] (line 159)
  • [CodePattern] [ExcludeFromInteractiveRouting] (line 148)
  • [CodePattern] [PersistentState] (line 39)
Uncovered: dotnet-blazor/use-js-interop
  • [CodePattern] readonly (line 118)
  • [CodePattern] sealed (line 118)
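The coverage column in the report above is consistent with a plain covered/total ratio per skill. The sketch below reproduces that arithmetic from the reported counts; the rounding rule and the overall roll-up are assumptions, since the report itself only shows per-skill rows.

```python
# (covered, total) pairs copied from the Skill Coverage Report.
coverage_rows = {
    "author-component": (0, 1),
    "collect-user-input": (0, 4),
    "configure-auth": (0, 3),
    "coordinate-components": (0, 2),
    "fetch-and-send-data": (1, 5),
    "plan-ui-change": (0, 1),
    "support-prerendering": (0, 3),
    "use-js-interop": (0, 2),
}


def coverage_percent(covered, total):
    """Covered patterns as a whole-number percentage (assumed rounding)."""
    return round(100 * covered / total) if total else 0


for skill, (covered, total) in coverage_rows.items():
    print(f"{skill}: {covered}/{total} = {coverage_percent(covered, total)}%")

# Assumed roll-up across the plugin; not part of the original report.
total_covered = sum(covered for covered, _ in coverage_rows.values())
total_patterns = sum(total for _, total in coverage_rows.values())
print(f"overall: {total_covered}/{total_patterns}")
```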

javiercn and others added 2 commits May 29, 2026 17:27
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 15:51

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread plugins/dotnet-blazor/plugin.json
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 16:02

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread plugins/dotnet-blazor/plugin.json Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 29, 2026 16:12
Comment thread plugins/dotnet-blazor/plugin.json Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread plugins/dotnet-blazor/plugin.json Outdated
@AbhitejJohn

Contributor

/evaluate

github-actions Bot added a commit that referenced this pull request May 29, 2026
github-actions Bot added a commit that referenced this pull request May 29, 2026
@github-actions

Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
mcp-csharp-create Implement MCP tools with proper attributes and DI 4.0/5 → 5.0/5 🟢 ✅ mcp-csharp-create; tools: skill ✅ 0.16
mcp-csharp-create Create an HTTP MCP server with tools and resources 4.0/5 → 5.0/5 🟢 ✅ mcp-csharp-create; tools: skill ✅ 0.16
mcp-csharp-create Create an MCP server with tools, prompts, and proper logging 4.0/5 → 4.0/5 ✅ mcp-csharp-create; tools: skill ✅ 0.16
mcp-csharp-test Write unit and integration tests for an MCP server 3.0/5 → 5.0/5 🟢 ✅ mcp-csharp-test; tools: skill, report_intent, view / ✅ mcp-csharp-test; tools: report_intent, skill, view 🟡 0.23
mcp-csharp-test Test an HTTP MCP server with WebApplicationFactory 4.0/5 → 5.0/5 🟢 ✅ mcp-csharp-test; tools: skill, report_intent, view 🟡 0.23 [1]
mcp-csharp-test Create evaluations for an MCP server 2.0/5 → 5.0/5 🟢 ✅ mcp-csharp-test; tools: skill, view 🟡 0.23
technology-selection ML.NET classification on tabular data 3.0/5 → 4.0/5 🟢 ✅ technology-selection; tools: skill, read_bash, stop_bash / ✅ technology-selection; tools: skill 🟡 0.34
technology-selection LLM integration with MEAI abstraction 1.0/5 → 1.0/5 ⚠️ NOT ACTIVATED 🟡 0.34 [2]
technology-selection Reject LLM for tabular classification 3.0/5 → 5.0/5 🟢 ✅ technology-selection; tools: skill 🟡 0.34
technology-selection Agentic workflow with guardrails 3.0/5 → 3.0/5 ✅ technology-selection; tools: skill / ✅ technology-selection; tools: skill, create 🟡 0.34 [3]
technology-selection Natural-language scenario decomposition — RAG chatbot 4.0/5 → 5.0/5 🟢 ✅ technology-selection; tools: skill 🟡 0.34 [4]
technology-selection RAG pipeline with vector search 4.0/5 → 5.0/5 🟢 ✅ technology-selection; tools: skill 🟡 0.34
mcp-csharp-publish Publish an MCP server as a NuGet tool package 3.0/5 → 4.0/5 🟢 ✅ mcp-csharp-publish; tools: skill ✅ 0.19
mcp-csharp-publish Deploy an HTTP MCP server to Azure Container Apps 4.0/5 → 5.0/5 🟢 ✅ mcp-csharp-publish; tools: skill, report_intent, view ✅ 0.19
mcp-csharp-publish Publish to the MCP Registry 1.0/5 → 3.0/5 🟢 ✅ mcp-csharp-publish; tools: skill ✅ 0.19
mcp-csharp-debug Debug an MCP server with MCP Inspector 4.0/5 → 4.0/5 ✅ mcp-csharp-debug; tools: skill ✅ 0.10 [5]
mcp-csharp-debug Configure VS Code to use an MCP server 4.0/5 → 4.0/5 ✅ mcp-csharp-debug; tools: skill, report_intent, grep, glob / ✅ mcp-csharp-debug; tools: skill ✅ 0.10
mcp-csharp-debug Debug a failing MCP server tool 5.0/5 → 4.0/5 🔴 ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: skill ✅ 0.10
template-authoring Validate a template.json file 3.0/5 → 5.0/5 🟢 ✅ template-authoring; tools: glob, skill / ⚠️ NOT ACTIVATED ✅ 0.18
template-authoring Create template from existing project 3.0/5 → 4.0/5 🟢 ✅ template-authoring; tools: skill ✅ 0.18
template-validation Validate template with multiple errors 3.0/5 → 4.0/5 🟢 ✅ template-validation; tools: skill ✅ 0.10
template-validation Validate correct template and suggest improvements 1.0/5 → 3.0/5 🟢 ✅ template-validation; tools: glob, skill / ✅ template-validation; tools: skill ✅ 0.10
template-discovery Find template for web API project 2.0/5 → 4.0/5 🟢 ✅ template-discovery; tools: report_intent, skill, bash 🟡 0.23
template-discovery Inspect template parameters and compare choices 3.0/5 → 4.0/5 🟢 ✅ template-discovery; tools: skill 🟡 0.23
template-discovery Search NuGet for specialized template 4.0/5 → 4.0/5 ✅ template-discovery; tools: skill 🟡 0.23
template-discovery Resolve ambiguous project intent to multiple candidates 4.0/5 → 4.0/5 ✅ template-discovery; tools: skill 🟡 0.23
template-discovery Preview project creation with dry run 3.0/5 → 3.0/5 ✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.23
template-instantiation Create a console application 4.0/5 → 5.0/5 🟢 ✅ template-instantiation; tools: skill ✅ 0.19

[1] (Isolated) Quality improved but weighted score is -8.2% due to: tokens (13230 → 45021), tool calls (0 → 3), time (17.2s → 22.2s)
[2] (Isolated) Quality unchanged but weighted score is -4.9% due to: tokens (85496 → 137447), time (42.5s → 64.1s), tool calls (13 → 16)
[3] (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (190806 → 1858992), tool calls (12 → 49), time (109.6s → 281.1s)
[4] (Plugin) Quality dropped but weighted score is +11.9% due to: efficiency metrics
[5] (Plugin) Quality unchanged but weighted score is -8.0% due to: tokens (12823 → 30428), tool calls (0 → 1), time (11.5s → 13.8s)
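The footnotes above attribute the weighted-score drops to growth in tokens, tool calls, and time. The validator's weighting formula is not shown in this thread, so the sketch below only computes the raw relative change per metric, using the numbers from footnote [1]; the function and dictionary names are illustrative.

```python
def relative_change(before, after):
    """Fractional change from before to after; before must be non-zero."""
    return (after - before) / before


# Values copied from footnote [1]: (baseline, with skill).
footnote_1 = {
    "tokens": (13230, 45021),
    "tool_calls": (0, 3),
    "time_s": (17.2, 22.2),
}

for name, (before, after) in footnote_1.items():
    if before == 0:
        # A zero baseline has no meaningful percentage change.
        print(f"{name}: {before} -> {after} (no baseline to compare against)")
    else:
        print(f"{name}: {before} -> {after} ({relative_change(before, after):+.0%})")
```

For footnote [1] this shows tokens roughly tripling while time grows by under a third, which is the kind of cost increase the later review comment asks about.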

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 703 in dotnet/skills, download eval artifacts with gh run download 26653053060 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/579fcb7269f9d16cd40d4fe9521d1fb1f747fbc0/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@github-actions github-actions Bot added the waiting-on-review (PR state label) label Jun 3, 2026
@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

✅ Evaluation passed for 579fcb7. cc @ViktorHofer @JanKrivanek @dotnet/aspnet — please review.

@danroth27 danroth27 left a comment

Member

I think we should move the MS Learn MCP server to the core dotnet plugin, but otherwise this looks fine to me.

Comment thread plugins/dotnet-blazor/plugin.json Outdated
Comment thread plugins/dotnet-blazor/plugin.json

Copilot AI commented Jun 4, 2026

Contributor

@javiercn I've opened a new pull request, #723, to work on those changes. Once the pull request is ready, I'll request review from you.

@github-actions

github-actions Bot commented Jun 4, 2026

Contributor

👋 @javiercn — this PR has 2 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

…ation (#723)

* Initial plan

* feat(dotnet-blazor): pin playwright MCP to 0.0.75 and add weekly update workflow

* fix(update-playwright-mcp-version): use env vars in node scripts and descending sort

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
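The commit messages above mention pinning the Playwright MCP version and fixing the update script to read environment variables and sort in descending order. The actual workflow uses Node scripts that are not shown here; the Python sketch below only illustrates the idea under stated assumptions. The variable name `PLAYWRIGHT_MCP_VERSIONS`, the space-separated input format, and the choice to skip pre-release versions are all hypothetical.

```python
import os
import re


def parse_version(text):
    """Turn '0.0.75' into (0, 0, 75); return None for anything else (e.g. pre-releases)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    return tuple(int(part) for part in match.groups()) if match else None


def latest_version(candidates):
    """Return the highest stable version string, or None if none parse."""
    stable = [v for v in candidates if parse_version(v) is not None]
    # Sort numerically, newest first; a plain string sort would rank 0.0.9 above 0.0.75.
    stable.sort(key=parse_version, reverse=True)
    return stable[0] if stable else None


# Hypothetical variable name; the input arrives through the environment
# instead of being interpolated into the script text.
raw = os.environ.get("PLAYWRIGHT_MCP_VERSIONS", "0.0.9 0.0.75 0.0.70 0.0.76-alpha")
print(latest_version(raw.split()))
```

Comparing parsed integer tuples rather than raw strings is what makes the descending sort pick the numerically newest release.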
Copilot AI review requested due to automatic review settings June 4, 2026 08:01

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

@AbhitejJohn

Contributor

/evaluate

@github-actions

github-actions Bot commented Jun 5, 2026

Contributor

Skill Validation Results

❌ Skill validation errors

  • assertion-quality: Eval scenario 'Identify self-referential assertions in identity and round-trip tests' prompt mentions target name 'assertion-quality' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs.
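The validation error above flags an eval prompt that names the skill it is meant to exercise, which could nudge a baseline run toward that skill. A minimal sketch of such a check follows, assuming a case-insensitive substring match; the validator's real matching rule is not shown in this thread, and the function name and sample prompts are hypothetical.

```python
def prompt_mentions_target(prompt: str, target_name: str) -> bool:
    """True if the eval prompt contains the skill or agent name (case-insensitive)."""
    return target_name.casefold() in prompt.casefold()


target = "assertion-quality"

# Hypothetical prompts, written only to exercise the check.
biased_prompt = "Use assertion-quality to review these round-trip tests."
neutral_prompt = "Review these round-trip tests for self-referential assertions."

for prompt in (biased_prompt, neutral_prompt):
    verdict = "mentions target" if prompt_mentions_target(prompt, target) else "ok"
    print(f"{verdict}: {prompt}")
```

Rewording the prompt to describe the task without the skill name, as in the second sample, is the fix the error message asks for.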
Skill Scenario Quality Skills Loaded Overfit Verdict
test-gap-analysis Find boundary mutation gaps in tiered discount and shipping logic 4.7/5 → 5.0/5 🟢 ✅ test-gap-analysis; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.08 [1]
test-gap-analysis Find logic and null-check mutation gaps in access control code 4.3/5 → 5.0/5 🟢 ✅ test-gap-analysis; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.08 [2]
test-gap-analysis Acknowledge well-tested code with few surviving mutations 4.7/5 → 4.0/5 🔴 ✅ test-gap-analysis; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.08 [3]
test-gap-analysis Decline request to write new tests from scratch 4.0/5 → 4.0/5 ℹ️ not activated (expected) ✅ 0.08 [4]
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 4.0/5 → 4.0/5 ✅ code-testing-agent; tools: skill, task, read_agent / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, read_agent ✅ 0.04 [5]
writing-mstest-tests Write unit tests for a service class 4.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill, glob, bash, edit / ✅ writing-mstest-tests; tools: skill, glob 🟡 0.34 [6]
writing-mstest-tests Write data-driven tests for a calculator 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent, view, create / ✅ writing-mstest-tests; tools: view, skill, bash, edit, report_intent, create 🟡 0.34
writing-mstest-tests Write async tests with cancellation 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill / ✅ writing-mstest-tests; tools: report_intent, skill 🟡 0.34
writing-mstest-tests Fix swapped Assert.AreEqual arguments 5.0/5 → 5.0/5 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [7]
writing-mstest-tests Modernize legacy test patterns 4.3/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [8]
writing-mstest-tests Replace ExpectedException with Assert.Throws 3.0/5 → 3.7/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED 🟡 0.34 [9]
writing-mstest-tests Use proper collection assertions 3.0/5 → 2.7/5 🔴 ⚠️ NOT ACTIVATED / ✅ writing-mstest-tests; tools: report_intent, skill 🟡 0.34
writing-mstest-tests Use proper type assertions instead of casts 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED / ✅ writing-mstest-tests; tools: report_intent, skill 🟡 0.34 [10]
writing-mstest-tests Set up test lifecycle correctly 2.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill, report_intent / ✅ writing-mstest-tests; tools: report_intent, skill, view 🟡 0.34
writing-mstest-tests Use DynamicData with ValueTuples over object arrays 3.0/5 → 3.0/5 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.34 [11]
writing-mstest-tests Use string assertions for format validation 3.7/5 → 4.7/5 ⏰ 🟢 ✅ writing-mstest-tests; tools: skill, bash, edit, view 🟡 0.34 [12]
writing-mstest-tests Use comparison assertions for boundary testing 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.34
writing-mstest-tests Write tests with collection, null, and reference assertions 4.0/5 → 4.7/5 🟢 ✅ writing-mstest-tests; tools: skill, glob / ⚠️ NOT ACTIVATED 🟡 0.34 [13]
writing-mstest-tests Configure conditional execution, retry, and cleanup 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ✅ writing-mstest-tests; tools: skill, report_intent 🟡 0.34
writing-mstest-tests Configure test parallelization and MSTest.Sdk project 3.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.34
test-smell-detection Detect multiple test smells in order processing test suite 3.0/5 → 5.0/5 🟢 ✅ test-smell-detection; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.44 [14]
test-smell-detection Recognize well-written tests with no significant smells 4.3/5 → 5.0/5 🟢 ✅ test-smell-detection; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.44 [15]
test-smell-detection Recognize integration tests and avoid false positives for external resources 5.0/5 → 5.0/5 ✅ test-smell-detection; tools: skill / ✅ test-anti-patterns; test-smell-detection; tools: skill 🟡 0.44 [16]
test-smell-detection Decline request to write new tests from scratch 4.3/5 → 4.7/5 🟢 ℹ️ not activated (expected) 🟡 0.44 [17]
test-tagging Tag an untagged MSTest test suite 2.3/5 → 2.7/5 🟢 ✅ test-tagging; tools: skill / ✅ test-tagging; tools: glob, skill 🟡 0.29 [18]
test-tagging Tag an untagged xUnit test suite 2.7/5 → 2.7/5 ✅ test-tagging; tools: skill, bash / ⚠️ NOT ACTIVATED 🟡 0.29 [19]
test-tagging Tag an untagged NUnit test suite 2.7/5 → 2.3/5 🔴 ✅ test-tagging; tools: glob, skill, bash / ⚠️ NOT ACTIVATED 🟡 0.29 [20]
test-tagging Audit test distribution without modifying files 5.0/5 → 5.0/5 ✅ test-tagging; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [21]
test-tagging Decline request to write new tests 4.0/5 → 3.3/5 🔴 ℹ️ not activated (expected) 🟡 0.29 [22]
test-tagging Tag a partially-tagged MSTest suite without duplicating existing traits 4.0/5 → 4.7/5 🟢 ✅ test-tagging; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [23]
test-tagging Accurately classify NUnit tests with misleading method names 4.0/5 → 5.0/5 🟢 ✅ test-tagging; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.29 [24]
test-tagging Tag MSTest tests and verify the project still builds 5.0/5 → 4.7/5 🔴 ✅ test-tagging; tools: skill 🟡 0.29 [25]

[1] ⚠️ High run-to-run variance (CV=339%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=116%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=86%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=162%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -21.4% due to: judgment, quality, time (44.6s → 56.7s), tool calls (5 → 6)
[5] ⚠️ High run-to-run variance (CV=363%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -20.4% due to: judgment, quality
[6] ⚠️ High run-to-run variance (CV=412%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=88%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -20.3% due to: judgment, tokens (12870 → 30170), tool calls (0 → 1), time (16.5s → 21.5s)
[8] ⚠️ High run-to-run variance (CV=115%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=435%) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=844%) — consider re-running with --runs 5
[11] (Isolated) Quality unchanged but weighted score is -8.6% due to: tokens (12899 → 29929), tool calls (0 → 1), time (7.2s → 10.4s)
[12] ⚠️ High run-to-run variance (CV=175%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -18.8% due to: judgment, tokens (112499 → 603106), tool calls (6 → 25), time (66.1s → 159.8s)
[13] ⚠️ High run-to-run variance (CV=150%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.6% due to: tokens (180148 → 246970)
[14] ⚠️ High run-to-run variance (CV=332%) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -2.2% due to: tokens (41133 → 55233), time (23.8s → 28.9s)
[15] ⚠️ High run-to-run variance (CV=139%) — consider re-running with --runs 5
[16] (Plugin) Quality unchanged but weighted score is -8.3% due to: tokens (41106 → 110745), tool calls (4 → 7), time (35.8s → 55.8s)
[17] ⚠️ High run-to-run variance (CV=272%) — consider re-running with --runs 5
[18] ⚠️ High run-to-run variance (CV=117%) — consider re-running with --runs 5
[19] ⚠️ High run-to-run variance (CV=303%) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -17.2% due to: judgment, quality, tokens (87033 → 129397)
[20] ⚠️ High run-to-run variance (CV=96%) — consider re-running with --runs 5
[21] ⚠️ High run-to-run variance (CV=65%) — consider re-running with --runs 5
[22] ⚠️ High run-to-run variance (CV=73%) — consider re-running with --runs 5
[23] ⚠️ High run-to-run variance (CV=69%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -4.4% due to: tokens (155131 → 282226), time (76.4s → 96.2s)
[24] ⚠️ High run-to-run variance (CV=239%) — consider re-running with --runs 5
[25] ⚠️ High run-to-run variance (CV=231%) — consider re-running with --runs 5
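The CV values in the footnotes above are coefficients of variation. The validator's exact definition is not shown here, so the sketch below assumes the common one, sample standard deviation divided by the absolute mean, and uses made-up per-run score deltas to show how a CV can exceed 100% when runs disagree and the mean sits near zero.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the absolute mean (assumed definition)."""
    avg = mean(samples)
    if avg == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / abs(avg)


# Hypothetical per-run score improvements, not taken from the eval artifacts.
steady_runs = [0.30, 0.32, 0.28]   # runs agree with each other
noisy_runs = [0.40, -0.35, 0.10]   # runs disagree and the mean is close to zero

print(f"steady: CV={coefficient_of_variation(steady_runs):.0%}")
print(f"noisy:  CV={coefficient_of_variation(noisy_runs):.0%}")
```

More runs per scenario, as the footnotes suggest with `--runs 5`, give a steadier mean and make a large CV easier to interpret.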

timeout — run(s) hit the (120s, 180s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 703 in dotnet/skills, download eval artifacts with gh run download 26987777071 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/8a00c7e65386eb3b4ea68dacd5f923a766682b70/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

github-actions Bot added a commit that referenced this pull request Jun 5, 2026
@github-actions github-actions Bot added the pr-state/ready-for-eval (PR is mergeable and awaiting evaluation) label and removed the waiting-on-author (PR state label) label Jun 5, 2026
@JanKrivanek JanKrivanek added the evaluate-now (Trigger evaluation.yml for current PR head, transient) label Jun 5, 2026
@github-actions github-actions Bot removed the evaluate-now (Trigger evaluation.yml for current PR head, transient) label Jun 5, 2026
github-actions Bot added a commit that referenced this pull request Jun 5, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
binlog-generation Build project with /bl flag 1.0/5 → 2.0/5 🟢 ⚠️ NOT ACTIVATED 🟡 0.35
binlog-generation Build with /bl in PowerShell 3.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill, glob / ⚠️ NOT ACTIVATED 🟡 0.35 [1]
binlog-generation Build multiple configurations with unique binlogs 4.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.35
build-perf-baseline Establish build performance baseline and recommend optimizations 3.0/5 → 4.0/5 🟢 ✅ build-perf-baseline; tools: skill, binlog-binlog_overview, binlog-binlog_diagnose, binlog-binlog_expensive_projects, binlog-binlog_expensive_tasks, binlog-binlog_expensive_analyzers, binlog-binlog_double_writes / ⚠️ NOT ACTIVATED 🟡 0.26 [2]
eval-performance Analyze MSBuild evaluation performance issues 5.0/5 → 5.0/5 ✅ eval-performance; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.20
including-generated-files Diagnose generated file inclusion failure 3.0/5 → 5.0/5 🟢 ✅ including-generated-files; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.23
incremental-build Analyze incremental build issues 3.0/5 → 4.0/5 🟢 ⚠️ NOT ACTIVATED ✅ 0.13 [3]
msbuild-modernization Modernize legacy project to SDK-style 5.0/5 → 5.0/5 ✅ msbuild-modernization; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.06 [4]
msbuild-server Recommend MSBuild Server for slow CLI incremental builds 3.0/5 → 5.0/5 🟢 ✅ msbuild-server; tools: skill ✅ 0.15
resolve-project-references Explain misleading ResolveProjectReferences time 4.0/5 → 5.0/5 🟢 ✅ resolve-project-references; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.16 [5]
build-parallelism Analyze build parallelism bottlenecks 4.0/5 → 4.0/5 ✅ build-parallelism; tools: skill, binlog-binlog_overview, binlog-binlog_expensive_projects, binlog-binlog_projects, glob, binlog-binlog_search, binlog-binlog_expensive_targets, binlog-binlog_project_target_times / ⚠️ NOT ACTIVATED ✅ 0.20 [6]
build-perf-diagnostics Diagnose slow build for a small project 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED 🟡 0.21 [7]
check-bin-obj-clash Diagnose bin/obj output path clashes 4.0/5 → 5.0/5 🟢 ✅ check-bin-obj-clash; tools: skill, binlog-binlog_overview, binlog-binlog_double_writes, binlog-binlog_evaluations, binlog-binlog_properties, binlog-binlog_evaluation_properties / ✅ check-bin-obj-clash; tools: skill, binlog-binlog_overview, binlog-binlog_errors, binlog-binlog_evaluations, binlog-binlog_properties, binlog-binlog_evaluation_global_properties, binlog-binlog_evaluation_properties ✅ 0.16 [8]
directory-build-organization Organize build infrastructure for a multi-project repo 3.0/5 → 5.0/5 🟢 ✅ directory-build-organization; tools: skill / ✅ directory-build-organization; tools: skill, create, edit, bash ✅ 0.15
extension-points Diagnose build extension point failures 3.0/5 → 5.0/5 🟢 ✅ extension-points; tools: skill ✅ 0.08
extension-points Diagnose NuGet package and repo extension conflicts 3.0/5 → 3.0/5 ✅ extension-points; tools: skill / ✅ extension-points; tools: skill, edit ✅ 0.08 [9]
extension-points Fix extension point anti-patterns 5.0/5 → 5.0/5 ✅ extension-points; tools: skill ✅ 0.08
item-management Diagnose item group and batching issues 4.0/5 → 5.0/5 🟢 ✅ item-management; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.24
item-management Diagnose cascading item and batching bugs in code generation pipeline 4.0/5 → 4.0/5 ✅ item-management; tools: skill, edit, bash / ⚠️ NOT ACTIVATED 🟡 0.24 [10]
item-management Fix item management anti-patterns 4.0/5 → 4.0/5 ✅ item-management; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.24
binlog-failure-analysis Diagnose build failures from binlog only (no source files) 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED ✅ 0.09
msbuild-antipatterns Review MSBuild files for anti-patterns and style issues 5.0/5 → 5.0/5 ✅ msbuild-antipatterns; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.09 [11]
msbuild-antipatterns Add a module to an F# project 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.09 [12]
msbuild-antipatterns Fix broken file order causing FS0039 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED ✅ 0.09 [13]
msbuild-antipatterns Add a signature file to define public API 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.09 [14]
property-patterns Diagnose shared build property issues 5.0/5 → 5.0/5 ✅ property-patterns; tools: skill ✅ 0.16 [15]
property-patterns Diagnose multi-level property hierarchy bugs 4.0/5 → 5.0/5 🟢 ✅ property-patterns; tools: skill ✅ 0.16
property-patterns Fix shared property configuration 5.0/5 → 5.0/5 ✅ property-patterns; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.16 [16]
target-authoring Diagnose custom target build regression 3.0/5 → 5.0/5 🟢 ✅ target-authoring; tools: skill, bash / ✅ target-authoring; tools: skill 🟡 0.21
target-authoring Diagnose broken SDK target chain across files 3.0/5 → 3.0/5 ✅ target-authoring; tools: skill 🟡 0.21 [17]
target-authoring Fix custom target anti-patterns 4.0/5 → 5.0/5 🟢 ✅ target-authoring; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.21

[1] (Plugin) Quality unchanged but weighted score is -3.7% due to: tokens (25703 → 43968)
[2] (Plugin) Quality unchanged but weighted score is -0.2% due to: tokens (138107 → 527176), quality, time (64.8s → 138.7s), tool calls (18 → 29)
[3] (Plugin) Quality unchanged but weighted score is -4.2% due to: tokens (26372 → 44627), time (16.9s → 22.1s)
[4] (Plugin) Quality unchanged but weighted score is -2.9% due to: tokens (71862 → 117654)
[5] (Plugin) Quality unchanged but weighted score is -7.1% due to: quality, tokens (54412 → 90374)
[6] (Plugin) Quality unchanged but weighted score is -4.7% due to: tokens (85882 → 264225), time (43.5s → 72.9s), tool calls (10 → 15)
[7] (Isolated) Quality unchanged but weighted score is -5.5% due to: tokens (27429 → 56800), time (22.2s → 26.6s)
[8] (Isolated) Quality improved but weighted score is -6.9% due to: quality, tokens (124272 → 186252), tool calls (11 → 18)
[9] (Plugin) Quality unchanged but weighted score is -0.3% due to: tokens (57731 → 217025), time (65.7s → 112.6s), tool calls (10 → 16)
[10] (Plugin) Quality unchanged but weighted score is -8.7% due to: tokens (42945 → 199607), tool calls (5 → 17), time (56.0s → 82.0s)
[11] (Plugin) Quality unchanged but weighted score is -11.4% due to: tokens (59956 → 190372), quality, time (44.2s → 111.0s), tool calls (15 → 19)
[12] (Plugin) Quality unchanged but weighted score is -3.4% due to: tokens (98559 → 162241)
[13] (Plugin) Quality unchanged but weighted score is -3.8% due to: tokens (68233 → 113622)
[14] (Plugin) Quality unchanged but weighted score is -4.2% due to: tokens (67584 → 114194)
[15] (Plugin) Quality unchanged but weighted score is -4.9% due to: tokens (158177 → 355582)
[16] (Plugin) Quality unchanged but weighted score is -4.0% due to: tokens (132842 → 252543)
[17] (Isolated) Quality unchanged but weighted score is -7.2% due to: tokens (72813 → 137912), time (37.3s → 111.3s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 703 in dotnet/skills, download eval artifacts with gh run download 27010466421 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/a8bb6fc544b5456f92c984c96db75c91dbd35437/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@AbhitejJohn

Contributor

@javiercn: Based on the evals, it looks like token consumption increased without much change in quality. Would you mind taking a deeper look, please?

Labels

pr-state/ready-for-eval PR is mergeable and awaiting evaluation

6 participants