Improve dotnet-template-engine plugin: accuracy, dedup, and two new skills by YuliiaKovalova · Pull Request #745 · dotnet/skills

YuliiaKovalova · 2026-06-10T10:17:17Z

Summary

Improves the dotnet-template-engine plugin: corrects factually wrong guidance, removes duplicated/divergent validation rules, expands the intent→template mappings, makes post-creation steps explicit, and adds two new skills. All changes are prompt/markdown-only — no code, scripts, or MCP/server concepts. The source of truth for the new content is the dotnet-template-mcp intent dictionary and tool reference; the skills continue to drive the dotnet new CLI.

Changes

Correctness

Fixed the inaccurate reserved-shortName rule in both template-validation and template-authoring. The old list claimed top-level dotnet verbs (build, run, test, publish, restore, clean, pack) conflict — they do not (dotnet new test does not collide with dotnet test). Replaced with the real colliding dotnet new subcommands: install, uninstall, update, list, search, details, create, plus a one-line explanation and corrected pitfall rows/examples.
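
A minimal sketch of the corrected rule, assuming the reserved set is just the seven subcommands named above. The helper name `reserved_conflicts` is hypothetical, and the case-insensitive comparison is a conservative assumption rather than documented CLI behavior.

```python
# Illustrative set taken from the PR description; not an authoritative list.
EXAMPLE_RESERVED = {"install", "uninstall", "update", "list", "search", "details", "create"}


def reserved_conflicts(short_names, reserved=EXAMPLE_RESERVED):
    """Return the shortNames that collide with a `dotnet new` subcommand."""
    if isinstance(short_names, str):  # accept a single name or a list of names
        short_names = [short_names]
    # Assumption: compare case-insensitively to stay on the safe side.
    return [name for name in short_names if name.lower() in reserved]


# `test` is a top-level dotnet verb, not a `dotnet new` subcommand, so it is fine.
print(reserved_conflicts(["test", "list"]))  # ['list']
```

Passing the reserved set as a parameter keeps the check usable if the set is later sourced from somewhere other than this hardcoded example.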

Single source of truth

Consolidated validation rules into template-validation. template-authoring Step 2 no longer re-implements a shorter, divergent rule list — it now points to template-validation with a brief summary. The valid-datatype list (string, bool, choice, int, float, hex, text) now appears in exactly one place.
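
As an illustration of keeping the datatype list in one place, the sketch below checks parameter symbols against that single set. It assumes the usual template.json layout (a `symbols` object whose entries carry `type` and `datatype`); the function name is hypothetical and this is not the skill's actual rule text.

```python
import json

# The one list of valid datatypes, as stated in the PR description.
VALID_DATATYPES = {"string", "bool", "choice", "int", "float", "hex", "text"}


def invalid_datatypes(template_json_text):
    """Return {symbol_name: datatype} for parameter symbols with an unlisted datatype."""
    template = json.loads(template_json_text)
    bad = {}
    for name, symbol in template.get("symbols", {}).items():
        datatype = symbol.get("datatype")
        if symbol.get("type") == "parameter" and datatype is not None:
            if datatype not in VALID_DATATYPES:
                bad[name] = datatype
    return bad


sample = (
    '{"symbols": {"port": {"type": "parameter", "datatype": "integer"},'
    ' "useAuth": {"type": "parameter", "datatype": "bool"}}}'
)
print(invalid_datatypes(sample))  # {'port': 'integer'}
```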

Discoverability & routing

Expanded template-discovery Step 1 from a ~9-row table to full Intent → template and Keyword → parameter tables sourced from the MCP IntentSynonymDictionary, with a note to always confirm real parameter names via dotnet new <template> --help.
Made template-validation reachable from the agent. Updated the Triage/Routing table so validation requests route to template-validation (previously orphaned) and added a Skills Inventory listing all six skills.

Project creation guidance

Added an explicit, ordered CPM + latest-version procedure to template-instantiation (detect Directory.Packages.props, strip inline versions, centralize as <PackageVersion>, optionally refresh stale versions via the NuGet V3 flat-container index.json with an opt-out, then build).
Added a 6-item "preserve from source" checklist to template-authoring (SDK type, analyzer/package reference metadata, key properties, CPM participation, custom build props/targets, repo conventions).
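
The "strip inline versions, centralize as <PackageVersion>" steps of the CPM procedure can be sketched roughly as follows. This is an assumption-laden illustration, not the skill's procedure: it only handles `Version` given as an attribute on `PackageReference`, it skips the optional version refresh, and the package name in the usage example is hypothetical.

```python
import xml.etree.ElementTree as ET


def centralize_versions(csproj_xml, props_xml):
    """Move inline PackageReference versions into PackageVersion items.

    Returns (new_csproj_xml, new_props_xml). Versions already present in
    Directory.Packages.props are left untouched.
    """
    project = ET.fromstring(csproj_xml)
    props = ET.fromstring(props_xml)

    central = {pv.get("Include") for pv in props.iter("PackageVersion")}
    item_group = props.find("ItemGroup")
    if item_group is None:
        item_group = ET.SubElement(props, "ItemGroup")

    for ref in project.iter("PackageReference"):
        version = ref.attrib.pop("Version", None)  # strip the inline version
        package = ref.get("Include")
        if version and package not in central:
            ET.SubElement(item_group, "PackageVersion", Include=package, Version=version)
            central.add(package)

    return ET.tostring(project, encoding="unicode"), ET.tostring(props, encoding="unicode")


csproj = (
    '<Project Sdk="Microsoft.NET.Sdk"><ItemGroup>'
    '<PackageReference Include="Example.Package" Version="1.2.3" />'
    "</ItemGroup></Project>"
)
props = "<Project><PropertyGroup /></Project>"
new_csproj, new_props = centralize_versions(csproj, props)
```

After this transform the project file keeps only the `Include` attribute, and the build step at the end of the procedure confirms the centralized versions resolve.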

New skills

template-smart-defaults — applies cross-parameter default rules (AOT → latest compatible framework; auth ≠ None → don't disable HTTPS; controllers → no minimal-API flag; never override explicit user values), with a rules table, validation checklist, and heuristics pitfall.
template-comparison — compares 2+ templates side by side (parameters, feature support, frameworks, classifications). Cross-linked from template-discovery.
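
To make the smart-defaults rules concrete, here is a small sketch of how they could be applied. All parameter keys (`aot`, `auth`, `no_https`, `use_controllers`, `use_minimal_apis`, `framework`) are hypothetical stand-ins rather than real `dotnet new` flag names, and `aot_frameworks` is an assumed caller-supplied list ordered oldest to newest.

```python
def apply_smart_defaults(explicit, aot_frameworks):
    """Fill in unset parameters from cross-parameter rules.

    `explicit` holds the values the user actually supplied; none of them
    is ever overwritten.
    """
    params = dict(explicit)

    # AOT requested: default to the newest framework assumed to be compatible.
    if params.get("aot") and "framework" not in explicit:
        params["framework"] = aot_frameworks[-1]

    # Auth other than None: do not default to disabling HTTPS.
    if params.get("auth", "None") != "None":
        params.setdefault("no_https", False)

    # Controllers requested: do not default to the minimal-API flag.
    if params.get("use_controllers"):
        params.setdefault("use_minimal_apis", False)

    return params


print(apply_smart_defaults({"aot": True, "framework": "net8.0"}, ["net8.0", "net9.0"]))
# {'aot': True, 'framework': 'net8.0'}
```

Using `setdefault` and the `not in explicit` check is what enforces the "never override explicit user values" rule: a default is only written when the user left that parameter unset.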

Both new skills are auto-discovered via the existing "skills": ["./skills/"] glob (consistent with all sibling plugins), and the agent routing table now references them.

Verification

All plugin.json files parse as valid JSON.
All SKILL.md files and the agent frontmatter parse as valid YAML with the required keys (name, description, license: MIT).
Every skill name matches its folder name.
No remaining inaccurate build/run/test/publish reserved-name claims (only the corrected "do NOT conflict" notes).
No cross-reference points to a non-existent skill.
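
Two of the checks above (manifests parse as JSON, skill name matches folder name) could be scripted along these lines. This is only a sketch: it assumes skills live at `skills/<folder>/SKILL.md` under the plugin directory, reads the `name` key with a regular expression instead of a real YAML parser, and the function name is hypothetical.

```python
import json
import re
from pathlib import Path


def check_plugin(plugin_dir):
    """Return a list of problems found under one plugin directory."""
    plugin_dir = Path(plugin_dir)
    problems = []

    # Every plugin.json (including ones in subfolders) must be valid JSON.
    for manifest in plugin_dir.rglob("plugin.json"):
        try:
            json.loads(manifest.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{manifest}: invalid JSON ({exc.msg})")

    # Every skill's front matter name must equal its folder name.
    for skill in plugin_dir.glob("skills/*/SKILL.md"):
        text = skill.read_text(encoding="utf-8")
        # Front matter is the block between the first two '---' lines.
        match = re.match(r"---\s*\n(.*?)\n---", text, re.DOTALL)
        front = match.group(1) if match else ""
        name = re.search(r"^name:\s*(\S+)", front, re.MULTILINE)
        if name is None:
            problems.append(f"{skill}: missing 'name' in front matter")
        elif name.group(1).strip("\"'") != skill.parent.name:
            problems.append(f"{skill}: name '{name.group(1)}' != folder '{skill.parent.name}'")

    return problems
```

An empty list means both checks passed; the YAML validity and required-key checks would need a proper YAML parser and are left out here.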

Notes

The repo has no .codex-plugin/plugin.json for this plugin (no sibling plugin uses that convention), so the requested manifest-formatting change was a no-op.

…kills Fix inaccurate reserved-shortName guidance, consolidate validation rules into a single skill, expand discovery mappings, add explicit CPM/version steps, and introduce template-comparison and template-smart-defaults skills. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the dotnet-template-engine agent/skills documentation to better route user intents, add new skills for comparison and smart defaults, and refine guidance around template validation and Central Package Management (CPM).

Changes:

Added new skills: template-comparison (side-by-side template comparison) and template-smart-defaults (cross-parameter defaulting rules).
Refined template discovery mappings and agent routing to include comparison/smart-default intents.
Updated template validation/authoring guidance around reserved dotnet new subcommand shortNames and expanded CPM post-create steps.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

File	Description
plugins/dotnet-template-engine/skills/template-validation/SKILL.md	Updates reserved shortName guidance to focus on `dotnet new` subcommands and clarifies conflict rationale.
plugins/dotnet-template-engine/skills/template-smart-defaults/SKILL.md	Adds a new skill describing cross-parameter default heuristics during project creation.
plugins/dotnet-template-engine/skills/template-instantiation/SKILL.md	Expands post-creation CPM steps and adds optional “refresh stale versions” guidance.
plugins/dotnet-template-engine/skills/template-discovery/SKILL.md	Enhances intent→template/parameter mappings and routes detailed comparisons to a new skill.
plugins/dotnet-template-engine/skills/template-comparison/SKILL.md	Adds a new skill to compare templates side-by-side using `dotnet new <template> --help`.
plugins/dotnet-template-engine/skills/template-authoring/SKILL.md	Improves preservation checklist and delegates validation rules to template-validation.
plugins/dotnet-template-engine/agents/template-engine.agent.md	Updates intent routing and lists the skills inventory including the new skills.

Make .codex-plugin/plugin.json byte-consistent with plugin.json (2-space indent on the agents line). Add eval.yaml + eval.vally.yaml capability evals for the new template-comparison and template-smart-defaults skills. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

YuliiaKovalova · 2026-06-10T10:24:54Z

Follow-up after rubber-duck review:

Fixed .codex-plugin/plugin.json indentation so it is byte-identical to plugin.json (the byte-consistency item that was previously missed).
Added capability evals (eval.yaml + eval.vally.yaml) for the two new skills under tests/dotnet-template-engine/template-comparison/ and tests/dotnet-template-engine/template-smart-defaults/, matching the existing eval schema used by the other template-engine skills.

…workload/package availability, tighten version-refresh

- Clarify the reserved shortName set is the current dotnet new subcommands (authoritative source: dotnet new --help); create is verified as a real subcommand (alias behind dotnet new <template>).
- template-discovery: note that some mapped short names (maui, winui3, aspire, func, orleans) need workloads/template packages, with fallback to dotnet new list/search.
- template-instantiation: keep template versions by default; if refreshing, use dotnet list package --outdated + user confirmation and constrain to same major/minor rather than always latest stable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 5 comments.

…rtion, reframe reserved list

- smart-defaults evals: enforce --no-https absence (auth scenario), absence of minimal-API flag (controllers scenario), and no newer --framework TFM when net8.0 is explicitly required, using output_not_contains/output_not_matches.
- comparison eval: split the combined (auth|aot|docker|controllers) check into four separate output_matches assertions so partial comparisons fail.
- template-validation/authoring: reframe the reserved shortName list as non-exhaustive examples and source the authoritative set from dotnet new --help; drop the specific create-alias assertion in favor of parsing-ambiguity wording.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Evangelink · 2026-06-10T10:51:34Z

/evaluate

github-actions · 2026-06-10T11:01:31Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
template-validation	Validate template with multiple errors	4.0/5 → 4.7/5 🟢	✅ template-validation; tools: skill	✅ 0.14	✅ [1]
template-validation	Validate correct template and suggest improvements	1.7/5 → 4.3/5 🟢	✅ template-validation; tools: glob, skill	✅ 0.14	✅
template-authoring	Validate a template.json file	3.7/5 → 4.0/5 🟢	✅ template-authoring; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.18	✅ [2]
template-authoring	Create template from existing project	3.3/5 → 5.0/5 🟢	✅ template-authoring; tools: skill, read_bash	✅ 0.18	✅
template-instantiation	Create a console application	4.0/5 → 4.7/5 🟢	✅ template-instantiation; tools: skill	🟡 0.22	✅ [3]
template-comparison	Compare webapi vs webapp side by side	3.0/5 → 4.3/5 🟢	✅ template-comparison; tools: skill	🟡 0.28	✅
template-comparison	Choose between blazorserver and blazorwasm	2.0/5 → 1.0/5 🔴	✅ template-comparison; tools: report_intent, skill, bash	🟡 0.28	❌
template-comparison	Decide which template fits a background processing scenario	1.7/5 → 2.0/5 🟢	✅ template-comparison; tools: skill, bash	🟡 0.28	❌ [4]
template-smart-defaults	AOT implies a compatible framework	2.3/5 → 4.3/5 🟢	✅ template-smart-defaults; tools: report_intent, skill, bash / ⚠️ NOT ACTIVATED	🟡 0.36	✅
template-smart-defaults	Auth implies HTTPS stays enabled	1.3/5 → 3.3/5 🟢	✅ template-smart-defaults; tools: report_intent, skill, bash / ⚠️ NOT ACTIVATED	🟡 0.36	✅
template-smart-defaults	Controllers exclude the minimal-API flag	1.0/5 → 5.0/5 🟢	✅ template-smart-defaults; tools: skill, report_intent, bash / ⚠️ NOT ACTIVATED	🟡 0.36	✅
template-smart-defaults	Never override an explicit user value	2.0/5 → 5.0/5 🟢	✅ template-smart-defaults; tools: report_intent, skill, bash / ⚠️ NOT ACTIVATED	🟡 0.36	✅
template-discovery	Find template for web API project	2.3/5 → 2.0/5 🔴	✅ template-discovery; tools: skill, report_intent, bash / ✅ template-discovery; tools: report_intent, skill, bash	🟡 0.21	❌ [5]
template-discovery	Inspect template parameters and compare choices	1.0/5 → 2.0/5 🟢	✅ template-discovery; tools: skill, bash / ⚠️ NOT ACTIVATED	🟡 0.21	✅ [6]
template-discovery	Search NuGet for specialized template	1.0/5 → 2.3/5 🟢	✅ template-discovery; tools: skill, stop_bash / ✅ template-discovery; tools: skill	🟡 0.21	✅ [7]
template-discovery	Resolve ambiguous project intent to multiple candidates	2.0/5 → 1.0/5 🔴	✅ template-discovery; tools: skill	🟡 0.21	❌ [8]
template-discovery	Preview project creation with dry run	1.7/5 → 2.7/5 🟢	✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.21	✅ [9]

[1] ⚠️ High run-to-run variance (CV=142%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=126%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=136%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=417%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -33.4% due to: judgment, quality, tokens (21335 → 32035)
[5] ⚠️ High run-to-run variance (CV=162%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=89%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=53%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=216%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=473%) — consider re-running with --runs 5
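
The footnotes do not say how CV is computed. Assuming it is the conventional coefficient of variation (sample standard deviation divided by the absolute mean), the sketch below shows why values far above 100% are possible when the mean of a handful of runs sits close to zero. The sample numbers are made up for illustration and are not taken from this eval run.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by |mean|, as a percentage."""
    m = mean(samples)
    if m == 0:
        return float("inf")  # undefined when the mean is exactly zero
    return stdev(samples) / abs(m) * 100


# Hypothetical per-run score changes: large swings around a near-zero mean.
noisy = [2.0, -1.5, 0.1]
# Hypothetical stable scores: small spread around a clearly non-zero mean.
stable = [4.0, 4.2, 4.1]

print(f"noisy:  {coefficient_of_variation(noisy):.0f}%")
print(f"stable: {coefficient_of_variation(stable):.1f}%")
```

Under that assumption, a very high CV says more about the mean being small relative to the spread than about any single run, which is consistent with the suggestion to re-run with more samples.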

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 745 in dotnet/skills, download eval artifacts with gh run download 27271170865 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/bdaaf9795cba19775d6fc84842591dcd11a6d3a4/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

Switch the Blazor comparison scenario from blazorserver (absent in the CI SDK) to blazor (Blazor Web App) vs blazorwasm, both reliably present in .NET 8+, and instruct the agent to inspect each via --help. Update the SKILL.md example reference for currency. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

YuliiaKovalova · 2026-06-10T11:08:48Z

Iterating on the skill-validation eval results

I dug into the artifacts from the eval run. Summary of what the verdicts actually show:

The skill content is sound. Across every scenario, the With-Skill Plugin judge scored 4–4.7/5 — the agent loads the right skill, runs dotnet new <template> --help per template, and produces a correct side-by-side comparison and recommendation.

The ❌ verdicts come from the Isolated runner crashing, not from the skill. In the failing isolated transcripts the judge prose says the agent ran 'dotnet new ... --help' ... but the command failed with a **mutex error** and no output was produced → (no output) → low score. This is the .NET template engine's global persistence mutex colliding in the sandboxed isolated environment, i.e. infra flakiness, not a content issue. It's consistent with the enormous run-to-run variance reported (CV 417–473%); the footnotes' own recommendation of --runs 5 would smooth this out.

What I changed to make my evals more robust (688afff):

The Blazor comparison scenario used blazorserver, which is not present in the CI SDK (the plugin run had to pivot to blazor). Switched it to blazor (Blazor Web App) vs blazorwasm — both reliably present in .NET 8+ — and instructed the agent to inspect each via --help. Updated the template-comparison SKILL.md example for currency.

Not changed: the two template-discovery ❌ scenarios are pre-existing evals I didn't author; their isolated runs failed for the same mutex/no-output reason (the baseline judges already scored ~2.3/5 there), so they're unrelated to this PR's content.

Copilot

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.

The isolated eval runs failed because the agent ran 'dotnet new <t> --help', hit the template engine's global-mutex/persistence error (common when the command runs concurrently in a sandbox), and then returned no answer at all. Instruct both skills to run 'dotnet new' calls sequentially, retry once on a transient mutex/persistence error, and fall back to the intent/parameter mapping so a concrete answer is always produced instead of empty output. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

YuliiaKovalova · 2026-06-10T11:12:51Z

Follow-up: fixing the content root cause behind the discovery failures

You asked me to actually fix the two template-discovery failures rather than write them off as infra. I re-read the isolated transcripts: in both, the agent ran dotnet new <template> --help, the call returned a template-engine "mutex"/"persistence" error with no output, and the agent then ended the turn with no answer to the user — that empty result is what scored ~2/5.

That second half is a content gap the skill can close, so I fixed it (3ecd536):

template-discovery SKILL.md — added a Resilience note + a pitfall row: run dotnet new calls sequentially (the template engine holds a global mutex, so concurrent --help/--dry-run calls collide and error), retry once, and if it still fails fall back to the intent/parameter mapping and give a concrete recommendation — never end with empty output because a CLI call errored.
template-comparison SKILL.md — same sequential-execution + retry + don't-return-empty guidance, since it also fans out multiple --help calls.
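
The sequential-execution, retry-once, and never-return-empty guidance can be sketched as below. The `run_help` callable is a hypothetical stand-in for whatever invokes dotnet new <template> --help and returns its output (an empty string on failure); the sketch retries on any empty result rather than detecting a mutex error specifically.

```python
def describe_template(short_name, run_help, fallback_mapping, retries=1):
    """Try the CLI, retry once on failure, then fall back to the static mapping."""
    for _ in range(retries + 1):
        output = run_help(short_name)
        if output:  # non-empty output is treated as success
            return output
    # The CLI failed every time: still return something concrete.
    return fallback_mapping.get(
        short_name,
        f"No live or mapped details for '{short_name}'; try `dotnet new list`.",
    )


def compare_templates(short_names, run_help, fallback_mapping):
    """Inspect templates one at a time so the calls never run concurrently."""
    results = {}
    for name in short_names:
        results[name] = describe_template(name, run_help, fallback_mapping)
    return results


# Simulate a sandbox where every CLI call fails.
always_fails = lambda name: ""
mapping = {"worker": "Worker Service template for background processing"}
print(compare_templates(["worker"], always_fails, mapping))
```

Injecting `run_help` keeps the retry and fallback logic independent of the real CLI, so the failure path that produced empty output in the isolated runs can be exercised without a .NET SDK.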

This makes the agent produce a correct answer from the (authoritative-in-this-skill) intent mapping even when the sandbox's dotnet new invocation flakes, which is exactly the path that was ending in empty output and a low score. The same fix applies to the pre-existing discovery scenarios too, even though those evals predate this PR.

github-actions · 2026-06-10T11:31:53Z

👋 @YuliiaKovalova — this PR has 7 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

- template-smart-defaults SKILL.md: drop the non-existent --publish-aot flag. Clarify --aot is a dotnet new flag only on templates that expose it (console/worker/grpc, not webapi) and that publish-time AOT is the MSBuild PublishAot=true property, not a dotnet new flag.
- template-discovery SKILL.md: replace the hardcoded --enable-docker mapping (not a real flag on common templates) with generic 'confirm with --help'.
- smart-defaults evals: tighten the negative-assertion prompts to output only the command line and not mention unused flags, so a negated explanation can't trip output_not_contains/output_not_matches. Switch the AOT scenario from webapi to worker (which actually supports --aot).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.

…lists

- smart-defaults evals: anchor the negative assertions to the 'dotnet new' command line (same-line regex) instead of whole-output substring/regex, so a flag mentioned only in prose can't fail the test.
- template-validation / template-authoring: mark the dotnet new subcommand examples as illustrative/version-dependent and tell readers not to hardcode them; the live 'dotnet new --help' output is canonical.
- template-comparison: fix the example table's AOT row: webapi/webapp do not expose a --aot template flag; native AOT is publish-time via PublishAot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
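
The difference between a whole-output check and a check anchored to the dotnet new command line can be illustrated with a small sketch. The pattern below is an assumption about what such a same-line regex might look like, not the expression used in the eval.yaml files, and the helper name is hypothetical.

```python
import re


def flag_on_command_line(output, flag):
    """True if `flag` appears after `dotnet new` on the same line of `output`."""
    pattern = re.compile(
        r"^.*\bdotnet new\b.*" + re.escape(flag) + r"(?![\w-])",
        re.MULTILINE,  # '^' matches at each line start; '.' never crosses a newline
    )
    return pattern.search(output) is not None


answer = (
    "dotnet new webapp --auth Individual\n"
    "Note: --no-https was intentionally omitted because auth is enabled."
)

print("--no-https" in answer)                      # True: a substring check trips on the prose
print(flag_on_command_line(answer, "--no-https"))  # False: the flag is not on the command line
```

Because `.` does not match newlines, the flag only counts when it shares a line with the command, which is what lets a negated explanation in prose pass the assertion.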

JanKrivanek · 2026-06-10T15:39:19Z

/evaluate

github-actions · 2026-06-10T15:49:17Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
template-validation	Validate template with multiple errors	4.0/5 → 4.7/5 🟢	✅ template-validation; tools: skill	✅ 0.12	✅ [1]
template-validation	Validate correct template and suggest improvements	1.3/5 → 4.3/5 🟢	✅ template-validation; tools: glob, skill	✅ 0.12	✅
template-discovery	Find template for web API project	2.0/5 → 3.0/5 🟢	✅ template-discovery; tools: report_intent, skill, bash	✅ 0.19	❌ [2]
template-discovery	Inspect template parameters and compare choices	2.0/5 → 1.0/5 🔴	✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.19	✅ [3]
template-discovery	Search NuGet for specialized template	3.0/5 → 1.0/5 🔴	✅ template-discovery; tools: skill	✅ 0.19	❌ [4]
template-discovery	Resolve ambiguous project intent to multiple candidates	1.0/5 → 1.0/5	✅ template-discovery; tools: skill	✅ 0.19	✅ [5]
template-discovery	Preview project creation with dry run	1.7/5 → 2.0/5 🟢	✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.19	✅ [6]
template-instantiation	Create a console application	4.0/5 → 4.7/5 🟢	✅ template-instantiation; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.22	❌ [7]
template-authoring	Validate a template.json file	4.3/5 → 3.7/5 🔴	✅ template-authoring; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.14	✅ [8]
template-authoring	Create template from existing project	3.7/5 → 4.7/5 🟢	✅ template-authoring; tools: skill, stop_bash / ✅ template-authoring; tools: skill, read_bash	✅ 0.14	✅

[1] ⚠️ High run-to-run variance (CV=277%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=752%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -56.0% due to: judgment, quality, tokens (12709 → 38975), errors (0 → 1), tool calls (0 → 3), time (13.6s → 34.3s)
[3] ⚠️ High run-to-run variance (CV=525%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +29.2% due to: tokens (49195 → 33636), tool calls (6 → 4)
[4] ⚠️ High run-to-run variance (CV=126%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=56%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=180%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.4% due to: efficiency metrics
[8] ⚠️ High run-to-run variance (CV=494%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +8.6% due to: completion (✗ → ✓), tool calls (4 → 3)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 745 in dotnet/skills, download eval artifacts with gh run download 27287650326 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/706bcd7c10a50b31471a7e167feab56e8a5d7908/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

Evangelink · 2026-06-11T13:25:49Z

/evaluate

github-actions · 2026-06-11T13:39:06Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
template-validation	Validate template with multiple errors	4.7/5 → 5.0/5 🟢	✅ template-validation; tools: skill	✅ 0.14	✅ [1]
template-validation	Validate correct template and suggest improvements	1.7/5 → 4.0/5 🟢	✅ template-validation; tools: glob, skill	✅ 0.14	✅
template-comparison	Compare webapi vs webapp side by side	4.0/5 → 4.7/5 🟢	✅ template-comparison; tools: skill	🟡 0.42	✅
template-comparison	Choose between blazor and blazorwasm	4.0/5 → 4.0/5	✅ template-comparison; tools: skill	🟡 0.42	✅ [2]
template-comparison	Decide which template fits a background processing scenario	2.7/5 → 4.0/5 🟢	✅ template-comparison; tools: skill, bash	🟡 0.42	✅
template-discovery	Find template for web API project	2.7/5 → 2.0/5 🔴	✅ template-discovery; tools: report_intent, skill, bash	🟡 0.21	❌ [3]
template-discovery	Inspect template parameters and compare choices	1.3/5 → 1.0/5 🔴	✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.21	✅ [4]
template-discovery	Search NuGet for specialized template	1.0/5 → 1.3/5 🟢	✅ template-discovery; tools: skill	🟡 0.21	✅ [5]
template-discovery	Resolve ambiguous project intent to multiple candidates	2.0/5 → 1.0/5 🔴	✅ template-discovery; tools: skill	🟡 0.21	❌ [6]
template-discovery	Preview project creation with dry run	2.3/5 → 2.0/5 🔴	✅ template-discovery; tools: skill / ✅ template-instantiation; template-discovery; tools: skill	🟡 0.21	✅ [7]
template-instantiation	Create a console application	4.0/5 → 5.0/5 🟢	✅ template-instantiation; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.22	❌ [8]
template-authoring	Validate a template.json file	4.3/5 → 3.7/5 🔴	✅ template-authoring; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.19	✅ [9]
template-authoring	Create template from existing project	3.7/5 → 4.0/5 🟢	✅ template-authoring; tools: skill	✅ 0.19	✅
template-smart-defaults	AOT implies a compatible framework	2.0/5 → 4.3/5 🟢	✅ template-smart-defaults; tools: skill, report_intent, bash / ⚠️ NOT ACTIVATED	🟡 0.40	✅
template-smart-defaults	Auth implies HTTPS stays enabled	1.7/5 → 4.0/5 🟢	✅ template-smart-defaults; tools: report_intent, skill / ⚠️ NOT ACTIVATED	🟡 0.40	✅ [10]
template-smart-defaults	Controllers exclude the minimal-API flag	3.0/5 → 5.0/5 🟢	✅ template-smart-defaults; tools: skill, report_intent / ⚠️ NOT ACTIVATED	🟡 0.40	❌ [11]
template-smart-defaults	Never override an explicit user value	2.0/5 → 5.0/5 🟢	✅ template-smart-defaults; tools: report_intent, skill / ⚠️ NOT ACTIVATED	🟡 0.40	✅

[1] ⚠️ High run-to-run variance (CV=770%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=54%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=148%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=3387%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +48.8% due to: tool calls (4 → 3), tokens (38288 → 33628)
[5] ⚠️ High run-to-run variance (CV=86%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=705%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=1123%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +3.1% due to: completion (✗ → ✓)
[8] ⚠️ High run-to-run variance (CV=1279%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -5.7% due to: quality
[9] ⚠️ High run-to-run variance (CV=425%) — consider re-running with --runs 5. (Plugin) Quality dropped but weighted score is +5.1% due to: completion (✗ → ✓)
[10] ⚠️ High run-to-run variance (CV=59%) — consider re-running with --runs 5
[11] (Plugin) Quality unchanged but weighted score is -1.6% due to: time (9.9s → 14.1s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 745 in dotnet/skills, download eval artifacts with gh run download 27350088874 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/706bcd7c10a50b31471a7e167feab56e8a5d7908/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

github-actions · 2026-06-11T13:44:04Z

✅ Approved by @Evangelink. cc @dotnet/skills-merge-approvers — ready to merge.

Copilot AI review requested due to automatic review settings June 10, 2026 10:17

YuliiaKovalova requested a review from JanKrivanek as a code owner June 10, 2026 10:17

Copilot AI reviewed Jun 10, 2026

Copilot AI review requested due to automatic review settings June 10, 2026 10:29

Copilot AI reviewed Jun 10, 2026

github-actions Bot added a commit that referenced this pull request Jun 10, 2026

Update PR token usage data (PR #745)

b731455

Copilot AI review requested due to automatic review settings June 10, 2026 11:08

Copilot AI reviewed Jun 10, 2026

github-actions Bot added the waiting-on-author PR state label label Jun 10, 2026

Copilot AI review requested due to automatic review settings June 10, 2026 11:50

Copilot AI reviewed Jun 10, 2026

github-actions Bot added pr-state/ready-for-eval PR is mergeable and awaiting evaluation evaluate-now Trigger evaluation.yml for current PR head (transient) and removed waiting-on-author PR state label labels Jun 10, 2026

github-actions Bot added a commit that referenced this pull request Jun 10, 2026

Update PR token usage data (PR #745)

3e9c098

JanKrivanek mentioned this pull request Jun 10, 2026

Fix PR-triage eval trigger: dispatch evaluation.yml instead of bot-applied label #746

Open

Evangelink approved these changes Jun 11, 2026

github-actions Bot added a commit that referenced this pull request Jun 11, 2026

Update PR token usage data (PR #745)

80285f8

github-actions Bot added ready-to-merge PR state label and removed pr-state/ready-for-eval PR is mergeable and awaiting evaluation labels Jun 11, 2026
