
Improve dotnet-template-engine plugin: accuracy, dedup, and two new skills#745

Open
YuliiaKovalova wants to merge 8 commits into
dotnet:mainfrom
YuliiaKovalova:ykovalova/template-engine-skill-improvements

Conversation

@YuliiaKovalova

Member

Summary

Improves the dotnet-template-engine plugin: corrects factually wrong guidance, removes duplicated/divergent validation rules, expands the intent→template mappings, makes post-creation steps explicit, and adds two new skills. All changes are prompt/markdown-only — no code, scripts, or MCP/server concepts. The source of truth for the new content is the dotnet-template-mcp intent dictionary and tool reference; the skills continue to drive the dotnet new CLI.

Changes

Correctness

  • Fixed the inaccurate reserved-shortName rule in both template-validation and template-authoring. The old list claimed top-level dotnet verbs (build, run, test, publish, restore, clean, pack) conflict — they do not (dotnet new test does not collide with dotnet test). Replaced with the real colliding dotnet new subcommands: install, uninstall, update, list, search, details, create, plus a one-line explanation and corrected pitfall rows/examples.
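For illustration, a minimal sketch of the corrected rule, assuming the reserved set is exactly the dotnet new subcommands named in the bullet above. The function name and the constant are hypothetical, and the case-insensitive comparison is a conservative assumption; the live dotnet new --help output remains the authoritative source.

```python
# Illustrative only: names copied from the PR description above.
# The real set is version-dependent; confirm with `dotnet new --help`.
DOTNET_NEW_SUBCOMMANDS = frozenset(
    {"install", "uninstall", "update", "list", "search", "details", "create"}
)


def find_short_name_conflicts(short_names):
    """Return the shortNames that collide with a `dotnet new` subcommand."""
    if isinstance(short_names, str):  # shortName may be a string or an array
        short_names = [short_names]
    # Assumption: compare case-insensitively to stay on the safe side.
    return [n for n in short_names if n.lower() in DOTNET_NEW_SUBCOMMANDS]


if __name__ == "__main__":
    # "test" is a top-level dotnet verb, not a `dotnet new` subcommand.
    print(find_short_name_conflicts(["test", "list"]))  # prints ['list']
```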

Single source of truth

  • Consolidated validation rules into template-validation. template-authoring Step 2 no longer re-implements a shorter, divergent rule list — it now points to template-validation with a brief summary. The valid-datatype list (string, bool, choice, int, float, hex, text) now appears in exactly one place.
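A rough sketch of what the single datatype rule checks, assuming a template.json already parsed into a dict. The helper name and the sample symbols are hypothetical; only the seven datatype names come from the list above.

```python
import json

# The valid-datatype list as stated in template-validation.
VALID_DATATYPES = {"string", "bool", "choice", "int", "float", "hex", "text"}


def invalid_datatypes(template):
    """Map parameter-symbol name -> datatype for datatypes outside the list."""
    problems = {}
    for name, symbol in template.get("symbols", {}).items():
        if symbol.get("type") != "parameter":
            continue  # only parameter symbols carry a user-facing datatype
        datatype = symbol.get("datatype")
        # A missing datatype is left alone here; this sketch flags only
        # values that are present and not in the list.
        if datatype is not None and datatype not in VALID_DATATYPES:
            problems[name] = datatype
    return problems


# Hypothetical sample: "integer" is not one of the seven valid names.
SAMPLE = json.loads("""
{
  "symbols": {
    "UseHttps": {"type": "parameter", "datatype": "bool"},
    "Port":     {"type": "parameter", "datatype": "integer"},
    "Year":     {"type": "generated", "generator": "now"}
  }
}
""")

if __name__ == "__main__":
    print(invalid_datatypes(SAMPLE))  # prints {'Port': 'integer'}
```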

Discoverability & routing

  • Expanded template-discovery Step 1 from a ~9-row table to full Intent → template and Keyword → parameter tables sourced from the MCP IntentSynonymDictionary, with a note to always confirm real parameter names via dotnet new <template> --help.
  • Made template-validation reachable from the agent. Updated the Triage/Routing table so validation requests route to template-validation (previously orphaned) and added a Skills Inventory listing all six skills.

Project creation guidance

  • Added an explicit, ordered CPM + latest-version procedure to template-instantiation (detect Directory.Packages.props, strip inline versions, centralize as <PackageVersion>, optionally refresh stale versions via the NuGet V3 flat-container index.json with an opt-out, then build).
  • Added a 6-item "preserve from source" checklist to template-authoring (SDK type, analyzer/package reference metadata, key properties, CPM participation, custom build props/targets, repo conventions).
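The first three CPM steps (detect, strip, centralize) could be sketched roughly as below. This is an assumption-laden illustration, not the skill's procedure verbatim: the function names are hypothetical, only Version attributes on PackageReference are handled (not child Version elements), and the version-refresh and build steps are left out.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def find_cpm_file(start_dir):
    """Walk up from start_dir looking for Directory.Packages.props."""
    start = Path(start_dir).resolve()
    for folder in (start, *start.parents):
        candidate = folder / "Directory.Packages.props"
        if candidate.is_file():
            return candidate
    return None


def strip_inline_versions(csproj_xml):
    """Remove Version attributes; return (new_xml, {package: version})."""
    root = ET.fromstring(csproj_xml)
    versions = {}
    for ref in root.iter("PackageReference"):
        name = ref.get("Include")
        if name is None or "Version" not in ref.attrib:
            continue
        versions[name] = ref.attrib.pop("Version")
    return ET.tostring(root, encoding="unicode"), versions


def package_version_items(versions):
    """Render <PackageVersion> items for Directory.Packages.props."""
    group = ET.Element("ItemGroup")
    for name in sorted(versions):
        ET.SubElement(group, "PackageVersion", Include=name, Version=versions[name])
    return ET.tostring(group, encoding="unicode")


# Hypothetical project file used only for this example.
SAMPLE_CSPROJ = (
    '<Project Sdk="Microsoft.NET.Sdk"><ItemGroup>'
    '<PackageReference Include="Serilog" Version="3.1.1" />'
    "</ItemGroup></Project>"
)

if __name__ == "__main__":
    new_xml, found = strip_inline_versions(SAMPLE_CSPROJ)
    print(new_xml)
    print(package_version_items(found))
```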

New skills

  • template-smart-defaults — applies cross-parameter default rules (AOT → latest compatible framework; auth ≠ None → don't disable HTTPS; controllers → no minimal-API flag; never override explicit user values), with a rules table, validation checklist, and heuristics pitfall.
  • template-comparison — compares 2+ templates side by side (parameters, feature support, frameworks, classifications). Cross-linked from template-discovery.

Both new skills are auto-discovered via the existing "skills": ["./skills/"] glob (consistent with all sibling plugins), and the agent routing table now references them.
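As an illustration of how the four template-smart-defaults rules interact, here is a small sketch. The dictionary keys ("aot", "auth", "no_https", "controllers", "minimal_api") are hypothetical labels rather than real dotnet new flag names, and the AOT-compatible framework is passed in by the caller because the skill does not fix a value for it.

```python
def apply_smart_defaults(explicit, aot_framework):
    """Return (params, notes) with defaults filled in around explicit values.

    `explicit` holds only what the user set. All key names are hypothetical
    labels for this sketch; confirm real names via `dotnet new <template> --help`.
    """
    params = dict(explicit)
    notes = []

    def default(key, value, reason):
        # Rule: never override a value the user set explicitly.
        if key in explicit:
            if explicit[key] != value:
                notes.append(
                    f"kept explicit {key}={explicit[key]!r} "
                    f"({reason} would suggest {value!r})"
                )
            return
        params[key] = value
        notes.append(f"defaulted {key}={value!r}: {reason}")

    if params.get("aot"):
        default("framework", aot_framework, "AOT requested")
    if params.get("auth", "None") != "None":
        default("no_https", False, "auth is enabled")
    if params.get("controllers"):
        default("minimal_api", False, "controllers requested")
    return params, notes


if __name__ == "__main__":
    result, why = apply_smart_defaults({"aot": True, "framework": "net8.0"}, "net9.0")
    print(result["framework"])  # prints net8.0 (the explicit value wins)
    print(why)
```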

Verification

  • All plugin.json files parse as valid JSON.
  • All SKILL.md and the agent frontmatter parse as valid YAML with required keys (name, description, license: MIT).
  • Every skill name matches its folder name.
  • No remaining inaccurate build/run/test/publish reserved-name claims (only the corrected "do NOT conflict" notes).
  • No cross-reference points to a non-existent skill.
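The checks above could be scripted along these lines. This is a hedged sketch: the helper names are hypothetical, and the frontmatter reader is a deliberately simple top-level key scan rather than a real YAML parser, so it is no substitute for the actual YAML validation mentioned in the list.

```python
import json

REQUIRED_KEYS = ("name", "description", "license")


def plugin_json_is_valid(text):
    """True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def read_frontmatter(text):
    """Collect top-level `key: value` pairs between the leading --- fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep and not key.startswith((" ", "\t")):
            fields[key.strip()] = value.strip()
    return fields


def check_skill(folder_name, skill_md_text):
    """Return a list of problems for one SKILL.md (empty list means OK)."""
    fields = read_frontmatter(skill_md_text)
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if not fields.get(k)]
    name = fields.get("name")
    if name and name != folder_name:
        problems.append(f"name '{name}' does not match folder '{folder_name}'")
    return problems


# Hypothetical SKILL.md content for the example.
SAMPLE_SKILL = """---
name: template-comparison
description: Compare two or more templates side by side.
license: MIT
---
# Body
"""

if __name__ == "__main__":
    print(check_skill("template-comparison", SAMPLE_SKILL))  # prints []
    print(check_skill("template-compare", SAMPLE_SKILL))
```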

Notes

  • The repo has no .codex-plugin/plugin.json for this plugin (no sibling plugin uses that convention), so the requested manifest-formatting change was a no-op.

…kills

Fix inaccurate reserved-shortName guidance, consolidate validation rules into
a single skill, expand discovery mappings, add explicit CPM/version steps, and
introduce template-comparison and template-smart-defaults skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 10:17

Copilot AI left a comment

Contributor


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the dotnet-template-engine agent/skills documentation to better route user intents, add new skills for comparison and smart defaults, and refine guidance around template validation and Central Package Management (CPM).

Changes:

  • Added new skills: template-comparison (side-by-side template comparison) and template-smart-defaults (cross-parameter defaulting rules).
  • Refined template discovery mappings and agent routing to include comparison/smart-default intents.
  • Updated template validation/authoring guidance around reserved dotnet new subcommand shortNames and expanded CPM post-create steps.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
| File | Description |
|---|---|
| plugins/dotnet-template-engine/skills/template-validation/SKILL.md | Updates reserved shortName guidance to focus on dotnet new subcommands and clarifies conflict rationale. |
| plugins/dotnet-template-engine/skills/template-smart-defaults/SKILL.md | Adds a new skill describing cross-parameter default heuristics during project creation. |
| plugins/dotnet-template-engine/skills/template-instantiation/SKILL.md | Expands post-creation CPM steps and adds optional “refresh stale versions” guidance. |
| plugins/dotnet-template-engine/skills/template-discovery/SKILL.md | Enhances intent→template/parameter mappings and routes detailed comparisons to a new skill. |
| plugins/dotnet-template-engine/skills/template-comparison/SKILL.md | Adds a new skill to compare templates side-by-side using dotnet new <template> --help. |
| plugins/dotnet-template-engine/skills/template-authoring/SKILL.md | Improves preservation checklist and delegates validation rules to template-validation. |
| plugins/dotnet-template-engine/agents/template-engine.agent.md | Updates intent routing and lists the skills inventory including the new skills. |


Comment thread plugins/dotnet-template-engine/skills/template-validation/SKILL.md Outdated
Comment thread plugins/dotnet-template-engine/skills/template-authoring/SKILL.md Outdated
Comment thread plugins/dotnet-template-engine/skills/template-authoring/SKILL.md Outdated
Comment thread plugins/dotnet-template-engine/skills/template-discovery/SKILL.md
Comment thread plugins/dotnet-template-engine/skills/template-discovery/SKILL.md
Comment thread plugins/dotnet-template-engine/skills/template-discovery/SKILL.md
Comment thread plugins/dotnet-template-engine/skills/template-discovery/SKILL.md
Comment thread plugins/dotnet-template-engine/skills/template-discovery/SKILL.md
Make .codex-plugin/plugin.json byte-consistent with plugin.json (2-space
indent on the agents line). Add eval.yaml + eval.vally.yaml capability
evals for the new template-comparison and template-smart-defaults skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@YuliiaKovalova

Member Author

Follow-up after rubber-duck review:

  • Fixed .codex-plugin/plugin.json indentation so it is byte-identical to plugin.json (the byte-consistency item that was previously missed).
  • Added capability evals (eval.yaml + eval.vally.yaml) for the two new skills under tests/dotnet-template-engine/template-comparison/ and tests/dotnet-template-engine/template-smart-defaults/, matching the existing eval schema used by the other template-engine skills.

…workload/package availability, tighten version-refresh

- Clarify the reserved shortName set is the current dotnet new subcommands
  (authoritative source: dotnet new --help); create is verified as a real
  subcommand (alias behind dotnet new <template>).
- template-discovery: note that some mapped short names (maui, winui3, aspire,
  func, orleans) need workloads/template packages, with fallback to
  dotnet new list/search.
- template-instantiation: keep template versions by default; if refreshing,
  use dotnet list package --outdated + user confirmation and constrain to
  same major/minor rather than always latest stable.
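
A small sketch of the "same major/minor" constraint only, assuming plain X.Y.Z version strings and a candidate list obtained elsewhere (for example from dotnet list package --outdated output, which this sketch does not parse). Function names are hypothetical, and user confirmation is out of scope here.

```python
def parse_stable(version):
    """Return (major, minor, patch) for a plain 'X.Y.Z' version, else None."""
    if "-" in version:  # prerelease such as 8.0.12-preview.1
        return None
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def pick_refresh(current, available):
    """Highest stable version sharing current's major.minor, else current."""
    base = parse_stable(current)
    if base is None:
        return current  # unknown format: keep the template's version
    best, best_key = current, base
    for candidate in available:
        key = parse_stable(candidate)
        if key is not None and key[:2] == base[:2] and key > best_key:
            best, best_key = candidate, key
    return best


if __name__ == "__main__":
    candidates = ["8.0.1", "8.0.11", "8.1.0", "9.0.0", "8.0.12-preview.1"]
    print(pick_refresh("8.0.1", candidates))  # prints 8.0.11
```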

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 10:29

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 5 comments.

Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.yaml
Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.yaml
Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.yaml
Comment thread tests/dotnet-template-engine/template-comparison/eval.yaml Outdated
Comment thread plugins/dotnet-template-engine/skills/template-validation/SKILL.md Outdated
…rtion, reframe reserved list

- smart-defaults evals: enforce --no-https absence (auth scenario), absence of
  minimal-API flag (controllers scenario), and no newer --framework TFM when
  net8.0 is explicitly required, using output_not_contains/output_not_matches.
- comparison eval: split the combined (auth|aot|docker|controllers) check into
  four separate output_matches assertions so partial comparisons fail.
- template-validation/authoring: reframe the reserved shortName list as
  non-exhaustive examples and source the authoritative set from dotnet new --help;
  drop the specific create-alias assertion in favor of parsing-ambiguity wording.
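
To show why the combined check was split, here is a sketch using Python's re module as a stand-in for the eval harness (an assumption; the harness's own matcher is not shown in this PR). The alternation is the one quoted above; the sample outputs are made up.

```python
import re

# The combined pattern quoted in the commit message above.
COMBINED = r"(auth|aot|docker|controllers)"
# The same four terms as separate assertions.
SEPARATE = [r"auth", r"aot", r"docker", r"controllers"]


def passes_combined(output):
    """One alternation: a single matching term is enough to pass."""
    return re.search(COMBINED, output) is not None


def passes_separate(output):
    """Four assertions: every term must appear somewhere in the output."""
    return all(re.search(pattern, output) for pattern in SEPARATE)


if __name__ == "__main__":
    partial = "The webapi template supports auth and docker."
    print(passes_combined(partial))  # prints True
    print(passes_separate(partial))  # prints False
```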

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink

Member

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 10, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|
| template-validation | Validate template with multiple errors | 4.0/5 → 4.7/5 🟢 | ✅ template-validation; tools: skill | ✅ 0.14 | [1] |
| template-validation | Validate correct template and suggest improvements | 1.7/5 → 4.3/5 🟢 | ✅ template-validation; tools: glob, skill | ✅ 0.14 | |
| template-authoring | Validate a template.json file | 3.7/5 → 4.0/5 🟢 | ✅ template-authoring; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.18 | [2] |
| template-authoring | Create template from existing project | 3.3/5 → 5.0/5 🟢 | ✅ template-authoring; tools: skill, read_bash | ✅ 0.18 | |
| template-instantiation | Create a console application | 4.0/5 → 4.7/5 🟢 | ✅ template-instantiation; tools: skill | 🟡 0.22 | [3] |
| template-comparison | Compare webapi vs webapp side by side | 3.0/5 → 4.3/5 🟢 | ✅ template-comparison; tools: skill | 🟡 0.28 | |
| template-comparison | Choose between blazorserver and blazorwasm | 2.0/5 → 1.0/5 🔴 | ✅ template-comparison; tools: report_intent, skill, bash | 🟡 0.28 | |
| template-comparison | Decide which template fits a background processing scenario | 1.7/5 → 2.0/5 🟢 | ✅ template-comparison; tools: skill, bash | 🟡 0.28 | [4] |
| template-smart-defaults | AOT implies a compatible framework | 2.3/5 → 4.3/5 🟢 | ✅ template-smart-defaults; tools: report_intent, skill, bash / ⚠️ NOT ACTIVATED | 🟡 0.36 | |
| template-smart-defaults | Auth implies HTTPS stays enabled | 1.3/5 → 3.3/5 🟢 | ✅ template-smart-defaults; tools: report_intent, skill, bash / ⚠️ NOT ACTIVATED | 🟡 0.36 | |
| template-smart-defaults | Controllers exclude the minimal-API flag | 1.0/5 → 5.0/5 🟢 | ✅ template-smart-defaults; tools: skill, report_intent, bash / ⚠️ NOT ACTIVATED | 🟡 0.36 | |
| template-smart-defaults | Never override an explicit user value | 2.0/5 → 5.0/5 🟢 | ✅ template-smart-defaults; tools: report_intent, skill, bash / ⚠️ NOT ACTIVATED | 🟡 0.36 | |
| template-discovery | Find template for web API project | 2.3/5 → 2.0/5 🔴 | ✅ template-discovery; tools: skill, report_intent, bash / ✅ template-discovery; tools: report_intent, skill, bash | 🟡 0.21 | [5] |
| template-discovery | Inspect template parameters and compare choices | 1.0/5 → 2.0/5 🟢 | ✅ template-discovery; tools: skill, bash / ⚠️ NOT ACTIVATED | 🟡 0.21 | [6] |
| template-discovery | Search NuGet for specialized template | 1.0/5 → 2.3/5 🟢 | ✅ template-discovery; tools: skill, stop_bash / ✅ template-discovery; tools: skill | 🟡 0.21 | [7] |
| template-discovery | Resolve ambiguous project intent to multiple candidates | 2.0/5 → 1.0/5 🔴 | ✅ template-discovery; tools: skill | 🟡 0.21 | [8] |
| template-discovery | Preview project creation with dry run | 1.7/5 → 2.7/5 🟢 | ✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.21 | [9] |

[1] ⚠️ High run-to-run variance (CV=142%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=126%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=136%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=417%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -33.4% due to: judgment, quality, tokens (21335 → 32035)
[5] ⚠️ High run-to-run variance (CV=162%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=89%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=53%) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=216%) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=473%) — consider re-running with --runs 5
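
For readers unfamiliar with the CV figure in these footnotes: a coefficient of variation is a standard deviation divided by a mean. The sketch below assumes the sample standard deviation and made-up per-run values; the validator's exact inputs and formula are not stated in this thread. It mainly shows why CV can far exceed 100% when the mean is close to zero.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation over |mean|, as a percentage."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean) * 100


if __name__ == "__main__":
    # Hypothetical per-run values, not taken from the eval artifacts.
    steady = [2.0, 4.0, 6.0]
    near_zero_mean = [-0.5, 0.1, 0.5]
    print(round(coefficient_of_variation(steady), 1))  # prints 50.0
    print(coefficient_of_variation(near_zero_mean) > 100)  # prints True
```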

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 745 in dotnet/skills, download eval artifacts with gh run download 27271170865 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/bdaaf9795cba19775d6fc84842591dcd11a6d3a4/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

Switch the Blazor comparison scenario from blazorserver (absent in the CI
SDK) to blazor (Blazor Web App) vs blazorwasm, both reliably present in
.NET 8+, and instruct the agent to inspect each via --help. Update the
SKILL.md example reference for currency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 11:08
@YuliiaKovalova

Member Author

Iterating on the skill-validation eval results

I dug into the artifacts from the eval run. Summary of what the verdicts actually show:

The skill content is sound. Across every scenario, the With-Skill Plugin judge scored 4–4.7/5 — the agent loads the right skill, runs dotnet new <template> --help per template, and produces a correct side-by-side comparison and recommendation.

The ❌ verdicts come from the Isolated runner crashing, not from the skill. In the failing isolated transcripts, the judge prose says the agent ran 'dotnet new ... --help' ... but the command failed with a mutex error and produced no output, which led to the low score. This is the .NET template engine's global persistence mutex colliding in the sandboxed isolated environment: infra flakiness, not a content issue. It is consistent with the enormous run-to-run variance reported (CV 417–473%); the footnotes' own recommendation of --runs 5 would smooth this out.

What I changed to make my evals more robust (688afff):

  • The Blazor comparison scenario used blazorserver, which is not present in the CI SDK (the plugin run had to pivot to blazor). Switched it to blazor (Blazor Web App) vs blazorwasm — both reliably present in .NET 8+ — and instructed the agent to inspect each via --help. Updated the template-comparison SKILL.md example for currency.

Not changed: the two template-discovery ❌ scenarios are pre-existing evals I didn't author; their isolated runs failed for the same mutex/no-output reason (the baseline judges already scored ~2.3/5 there), so they're unrelated to this PR's content.

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 7 comments.

Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.yaml Outdated
Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.yaml Outdated
Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.vally.yaml Outdated
Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.vally.yaml Outdated
Comment thread plugins/dotnet-template-engine/skills/template-smart-defaults/SKILL.md Outdated
Comment thread plugins/dotnet-template-engine/skills/template-discovery/SKILL.md Outdated
Comment thread plugins/dotnet-template-engine/skills/template-discovery/SKILL.md
The isolated eval runs failed because the agent ran 'dotnet new <t> --help',
hit the template engine's global-mutex/persistence error (common when the
command runs concurrently in a sandbox), and then returned no answer at all.

Instruct both skills to run 'dotnet new' calls sequentially, retry once on a
transient mutex/persistence error, and fall back to the intent/parameter
mapping so a concrete answer is always produced instead of empty output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@YuliiaKovalova

Member Author

Follow-up: fixing the content root cause behind the discovery failures

You asked me to actually fix the two template-discovery failures rather than write them off as infra. I re-read the isolated transcripts: in both, the agent ran dotnet new <template> --help, the call returned a template-engine "mutex"/"persistence" error with no output, and the agent then ended the turn with no answer to the user — that empty result is what scored ~2/5.

That second half is a content gap the skill can close, so I fixed it (3ecd536):

  • template-discovery SKILL.md — added a Resilience note + a pitfall row: run dotnet new calls sequentially (the template engine holds a global mutex, so concurrent --help/--dry-run calls collide and error), retry once, and if it still fails fall back to the intent/parameter mapping and give a concrete recommendation — never end with empty output because a CLI call errored.
  • template-comparison SKILL.md — same sequential-execution + retry + don't-return-empty guidance, since it also fans out multiple --help calls.

This makes the agent produce a correct answer from the intent mapping (authoritative within this skill) even when the sandbox's dotnet new invocation flakes, which is exactly the path that was scoring zero. The fix applies to the pre-existing discovery scenarios too, even though those evals predate this PR.
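
The sequential + retry-once + fallback behaviour could be sketched as follows. Everything here is illustrative: the skills give this guidance to the agent in prose, the function and exception names are hypothetical, and matching on the words "mutex" or "persistence" in the CLI output is an assumption about the error text. The lock only serializes calls inside one process.

```python
import subprocess
import threading
import time

_DOTNET_NEW_LOCK = threading.Lock()  # run `dotnet new` calls one at a time


class TransientTemplateEngineError(RuntimeError):
    """Raised when the CLI output looks like a mutex/persistence failure."""


def dotnet_new_help(template):
    """Run `dotnet new <template> --help` and return its stdout."""
    result = subprocess.run(
        ["dotnet", "new", template, "--help"], capture_output=True, text=True
    )
    if result.returncode != 0:
        combined = (result.stdout + result.stderr).lower()
        # Assumption: transient failures mention one of these words.
        if "mutex" in combined or "persistence" in combined:
            raise TransientTemplateEngineError(combined.strip())
        raise RuntimeError(result.stderr.strip())
    return result.stdout


def describe_template(template, fallback, run_help=dotnet_new_help, retry_delay=1.0):
    """Return --help text, retrying once, else the mapped fallback text."""
    for attempt in range(2):  # first try plus exactly one retry
        try:
            with _DOTNET_NEW_LOCK:
                return run_help(template)
        except TransientTemplateEngineError:
            if attempt == 0:
                time.sleep(retry_delay)
    # Never return empty output just because the CLI call errored.
    return fallback.get(template, f"No --help output available for '{template}'.")
```

A caller would pass the intent/parameter mapping as fallback, for example describe_template("webapi", {"webapi": "ASP.NET Core Web API"}).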

@github-actions github-actions Bot added the waiting-on-author PR state label label Jun 10, 2026
@github-actions

Contributor

👋 @YuliiaKovalova — this PR has 7 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

- template-smart-defaults SKILL.md: drop the non-existent --publish-aot
  flag. Clarify --aot is a dotnet new flag only on templates that expose it
  (console/worker/grpc, not webapi) and that publish-time AOT is the MSBuild
  PublishAot=true property, not a dotnet new flag.
- template-discovery SKILL.md: replace the hardcoded --enable-docker mapping
  (not a real flag on common templates) with generic 'confirm with --help'.
- smart-defaults evals: tighten the negative-assertion prompts to output only
  the command line and not mention unused flags, so a negated explanation
  can't trip output_not_contains/output_not_matches. Switch the AOT scenario
  from webapi to worker (which actually supports --aot).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 11:50

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.

Comment thread tests/dotnet-template-engine/template-smart-defaults/eval.yaml Outdated
Comment thread plugins/dotnet-template-engine/skills/template-validation/SKILL.md
Comment thread plugins/dotnet-template-engine/skills/template-authoring/SKILL.md Outdated
Comment thread plugins/dotnet-template-engine/skills/template-comparison/SKILL.md
…lists

- smart-defaults evals: anchor the negative assertions to the 'dotnet new'
  command line (same-line regex) instead of whole-output substring/regex, so a
  flag mentioned only in prose can't fail the test.
- template-validation / template-authoring: mark the dotnet new subcommand
  examples as illustrative/version-dependent and tell readers not to hardcode
  them; the live 'dotnet new --help' output is canonical.
- template-comparison: fix the example table's AOT row — webapi/webapp do not
  expose a --aot template flag; native AOT is publish-time via PublishAot.
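
A sketch of the difference between a whole-output negative check and one anchored to the dotnet new command line, again using Python's re module as a stand-in for the eval matcher (an assumption). The pattern and sample outputs are illustrative, not copied from eval.yaml.

```python
import re

# Matches only when --no-https sits on the same line as `dotnet new`.
# `.` does not cross newlines, and MULTILINE anchors ^/$ per line.
SAME_LINE = re.compile(r"^.*\bdotnet new\b.*--no-https\b.*$", re.MULTILINE)


def violates_whole_output(output):
    """Old style: any mention of the flag anywhere fails the scenario."""
    return "--no-https" in output


def violates_same_line(output):
    """New style: only a flag on the command line itself fails."""
    return SAME_LINE.search(output) is not None


if __name__ == "__main__":
    explained = (
        "dotnet new webapp --auth Individual\n"
        "Note: --no-https was not used because auth is enabled."
    )
    print(violates_whole_output(explained))  # prints True
    print(violates_same_line(explained))  # prints False
```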

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added pr-state/ready-for-eval PR is mergeable and awaiting evaluation evaluate-now Trigger evaluation.yml for current PR head (transient) and removed waiting-on-author PR state label labels Jun 10, 2026
@JanKrivanek

Member

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 10, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|
| template-validation | Validate template with multiple errors | 4.0/5 → 4.7/5 🟢 | ✅ template-validation; tools: skill | ✅ 0.12 | [1] |
| template-validation | Validate correct template and suggest improvements | 1.3/5 → 4.3/5 🟢 | ✅ template-validation; tools: glob, skill | ✅ 0.12 | |
| template-discovery | Find template for web API project | 2.0/5 → 3.0/5 🟢 | ✅ template-discovery; tools: report_intent, skill, bash | ✅ 0.19 | [2] |
| template-discovery | Inspect template parameters and compare choices | 2.0/5 → 1.0/5 🔴 | ✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.19 | [3] |
| template-discovery | Search NuGet for specialized template | 3.0/5 → 1.0/5 🔴 | ✅ template-discovery; tools: skill | ✅ 0.19 | [4] |
| template-discovery | Resolve ambiguous project intent to multiple candidates | 1.0/5 → 1.0/5 | ✅ template-discovery; tools: skill | ✅ 0.19 | [5] |
| template-discovery | Preview project creation with dry run | 1.7/5 → 2.0/5 🟢 | ✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.19 | [6] |
| template-instantiation | Create a console application | 4.0/5 → 4.7/5 🟢 | ✅ template-instantiation; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.22 | [7] |
| template-authoring | Validate a template.json file | 4.3/5 → 3.7/5 🔴 | ✅ template-authoring; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.14 | [8] |
| template-authoring | Create template from existing project | 3.7/5 → 4.7/5 🟢 | ✅ template-authoring; tools: skill, stop_bash / ✅ template-authoring; tools: skill, read_bash | ✅ 0.14 | |

[1] ⚠️ High run-to-run variance (CV=277%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=752%) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -56.0% due to: judgment, quality, tokens (12709 → 38975), errors (0 → 1), tool calls (0 → 3), time (13.6s → 34.3s)
[3] ⚠️ High run-to-run variance (CV=525%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +29.2% due to: tokens (49195 → 33636), tool calls (6 → 4)
[4] ⚠️ High run-to-run variance (CV=126%) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=56%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=103%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=180%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.4% due to: efficiency metrics
[8] ⚠️ High run-to-run variance (CV=494%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +8.6% due to: completion (✗ → ✓), tool calls (4 → 3)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 745 in dotnet/skills, download eval artifacts with gh run download 27287650326 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/706bcd7c10a50b31471a7e167feab56e8a5d7908/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@Evangelink

Member

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 11, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|
| template-validation | Validate template with multiple errors | 4.7/5 → 5.0/5 🟢 | ✅ template-validation; tools: skill | ✅ 0.14 | [1] |
| template-validation | Validate correct template and suggest improvements | 1.7/5 → 4.0/5 🟢 | ✅ template-validation; tools: glob, skill | ✅ 0.14 | |
| template-comparison | Compare webapi vs webapp side by side | 4.0/5 → 4.7/5 🟢 | ✅ template-comparison; tools: skill | 🟡 0.42 | |
| template-comparison | Choose between blazor and blazorwasm | 4.0/5 → 4.0/5 | ✅ template-comparison; tools: skill | 🟡 0.42 | [2] |
| template-comparison | Decide which template fits a background processing scenario | 2.7/5 → 4.0/5 🟢 | ✅ template-comparison; tools: skill, bash | 🟡 0.42 | |
| template-discovery | Find template for web API project | 2.7/5 → 2.0/5 🔴 | ✅ template-discovery; tools: report_intent, skill, bash | 🟡 0.21 | [3] |
| template-discovery | Inspect template parameters and compare choices | 1.3/5 → 1.0/5 🔴 | ✅ template-discovery; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.21 | [4] |
| template-discovery | Search NuGet for specialized template | 1.0/5 → 1.3/5 🟢 | ✅ template-discovery; tools: skill | 🟡 0.21 | [5] |
| template-discovery | Resolve ambiguous project intent to multiple candidates | 2.0/5 → 1.0/5 🔴 | ✅ template-discovery; tools: skill | 🟡 0.21 | [6] |
| template-discovery | Preview project creation with dry run | 2.3/5 → 2.0/5 🔴 | ✅ template-discovery; tools: skill / ✅ template-instantiation; template-discovery; tools: skill | 🟡 0.21 | [7] |
| template-instantiation | Create a console application | 4.0/5 → 5.0/5 🟢 | ✅ template-instantiation; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.22 | [8] |
| template-authoring | Validate a template.json file | 4.3/5 → 3.7/5 🔴 | ✅ template-authoring; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.19 | [9] |
| template-authoring | Create template from existing project | 3.7/5 → 4.0/5 🟢 | ✅ template-authoring; tools: skill | ✅ 0.19 | |
| template-smart-defaults | AOT implies a compatible framework | 2.0/5 → 4.3/5 🟢 | ✅ template-smart-defaults; tools: skill, report_intent, bash / ⚠️ NOT ACTIVATED | 🟡 0.40 | |
| template-smart-defaults | Auth implies HTTPS stays enabled | 1.7/5 → 4.0/5 🟢 | ✅ template-smart-defaults; tools: report_intent, skill / ⚠️ NOT ACTIVATED | 🟡 0.40 | [10] |
| template-smart-defaults | Controllers exclude the minimal-API flag | 3.0/5 → 5.0/5 🟢 | ✅ template-smart-defaults; tools: skill, report_intent / ⚠️ NOT ACTIVATED | 🟡 0.40 | [11] |
| template-smart-defaults | Never override an explicit user value | 2.0/5 → 5.0/5 🟢 | ✅ template-smart-defaults; tools: report_intent, skill / ⚠️ NOT ACTIVATED | 🟡 0.40 | |

[1] ⚠️ High run-to-run variance (CV=770%) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=54%) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=148%) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=3387%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +48.8% due to: tool calls (4 → 3), tokens (38288 → 33628)
[5] ⚠️ High run-to-run variance (CV=86%) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=705%) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=1123%) — consider re-running with --runs 5. (Isolated) Quality dropped but weighted score is +3.1% due to: completion (✗ → ✓)
[8] ⚠️ High run-to-run variance (CV=1279%) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -5.7% due to: quality
[9] ⚠️ High run-to-run variance (CV=425%) — consider re-running with --runs 5. (Plugin) Quality dropped but weighted score is +5.1% due to: completion (✗ → ✓)
[10] ⚠️ High run-to-run variance (CV=59%) — consider re-running with --runs 5
[11] (Plugin) Quality unchanged but weighted score is -1.6% due to: time (9.9s → 14.1s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 745 in dotnet/skills, download eval artifacts with gh run download 27350088874 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/706bcd7c10a50b31471a7e167feab56e8a5d7908/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@github-actions github-actions Bot added ready-to-merge PR state label and removed pr-state/ready-for-eval PR is mergeable and awaiting evaluation labels Jun 11, 2026
@github-actions

Contributor

✅ Approved by @Evangelink. cc @dotnet/skills-merge-approvers — ready to merge.


Labels

evaluate-now Trigger evaluation.yml for current PR head (transient) ready-to-merge PR state label


4 participants