Add reusable shared baseline to skill-validator evaluate (--baseline-out / --baseline-from) by YuliiaKovalova · Pull Request #754 · dotnet/skills

YuliiaKovalova · 2026-06-11T15:48:26Z

Implements #751 — adds a reusable, shared no-skill/no-agent baseline ("shared control group") to skill-validator evaluate so the baseline arm can be computed once and reused across many invocations, eliminating redundant baseline runs and removing baseline run-to-run variance from cross-config comparisons.

What's new

--baseline-out <path> — after the run, persist each scenario's averaged baseline (honoring --runs) for later reuse.
--baseline-from <path> — reuse a precomputed baseline instead of re-running the baseline arm. The two options are mutually exclusive.
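The mutual exclusivity of the two flags can be illustrated with a small stand-alone parser. This is a hedged sketch, not the validator's actual CLI wiring: only the flag names come from this PR, while `build_parser`, `parse_baseline_mode`, and the mode labels `write`/`reuse`/`fresh` are hypothetical.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that models only the two baseline flags."""
    parser = argparse.ArgumentParser(prog="evaluate-sketch")
    # argparse rejects the command line if both flags of a group are given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--baseline-out", metavar="PATH",
                       help="persist each scenario's averaged baseline")
    group.add_argument("--baseline-from", metavar="PATH",
                       help="reuse a precomputed baseline")
    return parser


def parse_baseline_mode(argv: list[str]) -> tuple[str, str | None]:
    """Return (mode, path); the mode labels are illustrative only."""
    args = build_parser().parse_args(argv)
    if args.baseline_out:
        return ("write", args.baseline_out)
    if args.baseline_from:
        return ("reuse", args.baseline_from)
    return ("fresh", None)
```

Passing both flags makes `argparse` exit with a usage error, which corresponds to the "mutually exclusive" rule stated above.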

Identity & safety

The baseline file is keyed per scenario on (promptSha, targetSha) and carries a header with the schema version and --model, following the (prompt, model, targetSha) contract:

promptSha — SHA-256 of the scenario prompt.
targetSha — SHA-256 of the scenario's materialized inputs (files copied via copy_test_files, explicit setup files, and the setup-command recipe). This prevents two cases that share a prompt but feed different fixtures (e.g. a different build.binlog) from reusing each other's baseline.
Reuse fails fast on model mismatch, unsupported schema version, or any scenario whose prompt+fixture identity is missing from the file.
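The per-scenario identity described above can be sketched as follows. This is a minimal illustration under assumptions, not the BaselineStore implementation: the byte layout fed into the hash, the function names, and the in-memory `fixtures` mapping are all hypothetical. Only the idea of a (promptSha, targetSha) key built from SHA-256 digests comes from the PR.

```python
from __future__ import annotations

import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def prompt_sha(prompt: str) -> str:
    """SHA-256 of the scenario prompt (UTF-8 encoding is an assumption)."""
    return sha256_hex(prompt.encode("utf-8"))


def target_sha(fixtures: dict[str, bytes], setup_commands: list[str]) -> str:
    """Hash the materialized inputs: fixture files plus the setup recipe.

    Files are visited in sorted path order so the result does not depend
    on enumeration order. The framing bytes are illustrative only.
    """
    h = hashlib.sha256()
    for rel_path in sorted(fixtures):
        h.update(b"file\0" + rel_path.encode("utf-8") + b"\0")
        h.update(sha256_hex(fixtures[rel_path]).encode("ascii") + b"\n")
    for command in setup_commands:
        h.update(b"cmd\0" + command.encode("utf-8") + b"\n")
    return h.hexdigest()


def scenario_key(prompt: str, fixtures: dict[str, bytes],
                 setup_commands: list[str]) -> tuple[str, str]:
    """The reuse key: both components must match for a baseline to be reused."""
    return (prompt_sha(prompt), target_sha(fixtures, setup_commands))
```

Two scenarios with the same prompt but a different build.binlog produce the same first component and a different second component, so they do not share a baseline.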

Behavior on reuse

When a cached baseline matches, the baseline agent run (and its assertions/constraints/task-completion/judging) is skipped; the cached metrics + judge result are used for deltas and pairwise/independent judging. Such scenarios are reported with the baseline-reused session phase and a reused baseline status. Pairwise judging runs in the skilled run's work dir (the cached baseline's work dir no longer exists) and does not re-attribute tokens to the baseline.
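The fail-fast checks and the reuse path can be pictured with the sketch below. It assumes a simplified in-memory stand-in for the baseline file; `BaselineFile`, `check_reusable`, `baseline_metrics`, and the `fresh` status label are hypothetical, and the supported version of 2 reflects the schema bump mentioned in a later commit on this PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

SUPPORTED_VERSION = 2  # assumption: matches the schema bump in this PR


class BaselineReuseError(Exception):
    """Raised before any agent run starts when reuse is not safe."""


@dataclass
class BaselineFile:
    version: int
    model: str
    # Maps (prompt_sha, target_sha) to cached metrics.
    scenarios: dict = field(default_factory=dict)


def check_reusable(store: BaselineFile, model: str, needed: list) -> None:
    """Fail fast on version mismatch, model mismatch, or missing scenarios."""
    if store.version != SUPPORTED_VERSION:
        raise BaselineReuseError(f"unsupported schema version {store.version}")
    if store.model != model:
        raise BaselineReuseError(f"model mismatch: {store.model} != {model}")
    missing = [key for key in needed if key not in store.scenarios]
    if missing:
        raise BaselineReuseError(f"missing scenarios: {missing}")


def baseline_metrics(store: BaselineFile | None, key: tuple,
                     run_baseline: Callable[[], dict]) -> tuple[dict, str]:
    """Return (metrics, status); the baseline arm runs only without a store."""
    if store is not None:
        return store.scenarios[key], "reused"
    return run_baseline(), "fresh"
```

Validating every needed key up front means a missing scenario is reported before any skilled run has consumed tokens.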

Tests & docs

New BaselineStoreTests (9 facts): prompt-SHA determinism, save/load round-trip, model-mismatch / unsupported-version / missing-file failures, FindMissingScenarios, write-store-not-reuse, target-SHA stability + content sensitivity, and same-prompt/different-fixture non-reuse.
Full suite green: 560 passed, 0 failed. Build: 0 warnings, 0 errors.
Updated eng/skill-validator/src/README.md (examples, flags table, "Shared baseline reuse" section) and docs/InvestigatingResults.md.

Closes #751.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tnet#751)

Add --baseline-out and --baseline-from options to the evaluate command so the no-skill/no-agent baseline arm can be computed once and reused as a shared control group across multiple skill/agent evaluations. This eliminates redundant baseline runs and removes baseline run-to-run variance from cross-config comparisons.

- New BaselineStore + BaselineFile/BaselineScenarioEntry models, keyed per scenario on SHA-256(prompt) with a header recording version, model, validator version and runs. Load validates version + model and fails fast on mismatch or missing scenarios.
- Register the new serializable types in SkillValidatorJsonContext (AOT source-gen).
- Wire two mutually-exclusive CLI options into ValidatorConfig; thread an optional BaselineStore through both execution paths.
- On reuse, skip the baseline agent run, its assertions/constraints/task-completion/judging, and attribute no extra pairwise tokens to the baseline; report the scenario with the baseline-reused session phase and a reused status. In write mode, record each scenario's averaged baseline and persist it after the run.
- Add unit tests for BaselineStore and document the feature in README and InvestigatingResults.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Align baseline reuse with the (prompt, model, targetSha) identity contract from the upstream eval-harness design: previously the per-scenario reuse key was the prompt SHA + model only, so two scenarios that share an identical prompt but feed the agent different input artifacts (e.g. a different build.binlog) would collide and silently reuse the wrong baseline.

- Add BaselineScenarioEntry.TargetSha: a SHA-256 over the scenario's materialized inputs, i.e. files auto-copied via copy_test_files, explicit setup files (inline content or copied sources), and the setup command recipe. The reuse key is now (promptSha, targetSha); both must match. Bump the on-disk schema to version 2.
- Memoize target hashing per process via a cheap, file-I/O-free setup signature to avoid re-hashing large fixtures across the N runs.
- Thread the originating eval.yaml path into Record/TryGetBaseline/FindMissingScenarios so inputs can be fingerprinted.
- Tests: target SHA is stable and content-sensitive; same-prompt/different-fixture scenarios do not reuse each other's baseline and are surfaced by FindMissingScenarios. Update README and InvestigatingResults.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-11T15:48:38Z

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds shared-baseline persistence and reuse to skill-validator evaluate, enabling faster and less noisy comparisons across multiple skills/agents by skipping redundant baseline runs when a matching baseline file is provided.

Changes:

Introduces --baseline-out (write) / --baseline-from (reuse) options and threads a BaselineStore through evaluation execution.
Implements on-disk baseline schema + prompt/fixture identity hashing to prevent cross-scenario contamination.
Adds unit tests and updates docs to explain baseline reuse behavior and reporting.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

File	Description
eng/skill-validator/src/Evaluate/EvaluateCommand.cs	Adds CLI options, validation, baseline preflight checks, and execution-path changes for baseline reuse/write.
eng/skill-validator/src/Evaluate/BaselineStore.cs	New baseline file format + keyed storage and hashing for prompt + setup/fixtures.
eng/skill-validator/tests/Evaluate/BaselineStoreTests.cs	Adds tests for hashing determinism, save/load, model/version validation, and fixture-sensitive reuse.
eng/skill-validator/src/docs/InvestigatingResults.md	Documents how reused baselines appear in results/phases and how identity is determined.
eng/skill-validator/src/README.md	Documents new flags and adds a “Shared baseline reuse” section.
eng/skill-validator/src/SkillValidatorJsonContext.cs	Registers baseline types for source-generated `System.Text.Json` serialization.
eng/skill-validator/src/Evaluate/Models.cs	Adds config fields for baseline out/from options.


Address rubber-duck review findings on the shared-baseline feature:

- Fix Sha256Hex 32-byte bug: a 32-byte input was treated as an already-computed digest and not hashed. Split into Sha256Hex (always hash) + HexDigest (encode existing digest).
- Broaden reuse identity: the cached baseline RunResult depends on the judge model and on per-scenario evaluation criteria (rubric, assertions, expect/reject tools, turn/token/timeout limits). Add JudgeModel to the baseline header (validated on load) and fold the criteria into the per-scenario targetSha so changing them invalidates reuse instead of silently serving a stale result.
- Mirror AgentRunner.SetupWorkDir exactly when hashing copied fixtures: exclude only the top-level eval.yaml (nested eval.yaml files are copied, so they must be hashed).
- Make the target-SHA cache instance-scoped (memoizing only the expensive fixture-input hashing) so it can't serve stale hashes or leak across evaluations/tests; hash inline file Content in the cache key instead of embedding it.
- Deterministic Save ordering (Name, PromptSha, TargetSha); guard Load against null Scenarios; enrich FindMissingScenarios output with the eval path.
- Document that setup commands are fingerprinted by recipe, so reuse assumes they are deterministic/hermetic.
- Tests + docs updated; add judge-model-mismatch and criteria-identity tests (562 pass).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
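The Sha256Hex bug described in this commit can be reproduced in miniature. The sketch below is a reconstruction from the commit message only, not the original C# code; the function names are Python-style stand-ins for Sha256Hex and HexDigest.

```python
import hashlib


def sha256_hex_buggy(data: bytes) -> str:
    """Reconstruction of the bug: 32-byte input is mistaken for a digest."""
    if len(data) == 32:
        return data.hex()  # wrong: the payload is encoded, never hashed
    return hashlib.sha256(data).hexdigest()


def sha256_hex(data: bytes) -> str:
    """Fixed variant: always hash, regardless of input length."""
    return hashlib.sha256(data).hexdigest()


def hex_digest(digest: bytes) -> str:
    """Separate helper for encoding a digest that was already computed."""
    return digest.hex()
```

With the buggy variant, any fixture or prompt that happens to be exactly 32 bytes long gets an identity that is just its own bytes in hex. Splitting the two responsibilities removes the length-based guess.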

- Mirror AgentRunner.SetupWorkDir exactly when hashing copied fixtures: enumerate only the files actually copied (top-level siblings except eval.yaml, recursing into directories) and skip reparse points and out-of-root junctions, instead of blindly hashing every file under the eval directory. This keeps the fixture identity restricted to the intentionally-copied set so stray output/log files can't poison reuse.
- Stream baseline JSON to/from disk (File.OpenRead/File.Create with JsonSerializer) so large baselines never materialize as one giant in-memory string.
- Enrich the fail-fast 'missing scenario' output with the eval path and short prompt/target SHA prefixes so it is actionable when scenario names collide across eval files.
- Add a test locking in recursive (nested-directory) fixture hashing. 563 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
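The fixture enumeration rule from this commit (top-level siblings except eval.yaml, recursing into directories, skipping links) might look roughly like the sketch below. It is an approximation under assumptions: symlinks stand in for Windows reparse points and junctions, links are skipped at every level, and `enumerate_copied_fixtures` is a hypothetical name rather than the C# helper.

```python
from __future__ import annotations

import os
from pathlib import Path


def enumerate_copied_fixtures(eval_dir: Path) -> list[str]:
    """Return sorted relative POSIX paths of the files a copy step would take."""
    root = Path(eval_dir)
    found: list[str] = []

    def walk(directory: Path, top_level: bool) -> None:
        with os.scandir(directory) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_symlink():
                    continue  # stand-in for "skip reparse points"
                if entry.is_dir(follow_symlinks=False):
                    walk(Path(entry.path), top_level=False)
                elif entry.is_file(follow_symlinks=False):
                    # Only the top-level eval.yaml is excluded; nested
                    # eval.yaml files are copied and therefore listed.
                    if top_level and entry.name == "eval.yaml":
                        continue
                    found.append(Path(entry.path).relative_to(root).as_posix())

    walk(root, top_level=True)
    return sorted(found)
```

Hashing only this list, rather than every file under the eval directory, is what keeps stray output or log files from changing the fixture identity.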

YuliiaKovalova · 2026-06-11T16:10:56Z

Addressed the review comments in 82f487f and cc7dd46:

--baseline-from option description & README flag table: updated to state reuse must match --model, --judge-model, and each scenario's prompt + setup/fixture inputs + evaluation criteria (not prompt alone).
BuildTargetCacheKey inlining f.Content: the cache key now stores a SHA-256 of inline content rather than the raw string, and the cache is now instance-scoped (no longer static), so large inline content no longer bloats or is retained process-wide.
copy_test_files hashing all files under the eval dir: hashing now mirrors AgentRunner.SetupWorkDir/CopyDirectory exactly — it enumerates only the files actually copied (top-level siblings except eval.yaml, recursing into directories) and skips reparse points / out-of-root junctions, so stray output/log files under the eval directory can't poison the fixture identity.
Load/Save reading/writing the whole JSON as a string: both now stream to/from disk via File.OpenRead/File.Create with JsonSerializer, so large baselines never materialize as one giant in-memory string.
Fail-fast message listing names only: FindMissingScenarios output now includes the eval path and short prompt/target SHA prefixes, making it actionable when scenario names collide across eval files.

Also broadened the reuse identity to cover the judge model and per-scenario criteria (rubric/assertions/expect+reject tools/turn-token-timeout limits), and fixed a Sha256Hex 32-byte-input bug. 563 tests pass.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

JanKrivanek · 2026-06-11T16:19:46Z

/evaluate

github-actions · 2026-06-11T16:56:32Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET 8+)	5.0/5 → 5.0/5	✅ dotnet-pinvoke; tools: skill	✅ 0.06	✅
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET Framework)	4.0/5 → 5.0/5 🟢	✅ dotnet-pinvoke; tools: skill	✅ 0.06	✅
nuget-trusted-publishing	Set up trusted publishing for a new NuGet library	3.0/5 → 4.0/5 🟢	✅ nuget-trusted-publishing; tools: skill, view	✅ 0.10	✅
nuget-trusted-publishing	Set up NuGet publishing without mentioning trusted publishing	3.0/5 → 5.0/5 🟢	✅ nuget-trusted-publishing; tools: skill, glob, view / ✅ nuget-trusted-publishing; tools: skill, view, glob	✅ 0.10	✅
nuget-trusted-publishing	Migrate existing workflow from API key to trusted publishing	3.0/5 → 4.0/5 🟢	✅ nuget-trusted-publishing; tools: skill, view / ✅ nuget-trusted-publishing; tools: skill	✅ 0.10	✅
csharp-scripts	Avoid activating for language-agnostic calendar script	5.0/5 → 3.0/5 🔴	✅ csharp-scripts; tools: skill, create / ✅ csharp-scripts; tools: skill, create, edit	🟡 0.31	❌
csharp-scripts	Test a C# language feature with a file-based app	3.0/5 → 4.0/5 🟢	✅ csharp-scripts; tools: skill, create, edit	🟡 0.31	✅
csharp-scripts	Compose a file-based app from helper files	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	🟡 0.31	❌ [1]
coordinate-components	Warehouse dashboard with site selector and live stock alerts	4.0/5 → 5.0/5 🟢	✅ coordinate-components; tools: skill	🟡 0.24	✅
coordinate-components	Multi-tenant notification hub with cross-component fan-out	2.0/5 → 5.0/5 🟢	✅ coordinate-components; tools: skill	🟡 0.24	✅
use-js-interop	Auto-saving notepad that survives page reloads	4.0/5 → 4.0/5	✅ use-js-interop; tools: skill	✅ 0.19	✅
use-js-interop	User activity tracker that detects idle timeout	4.0/5 → 4.0/5	✅ use-js-interop; tools: skill	✅ 0.19	✅
use-js-interop	Responsive layout that adapts to screen size	4.0/5 → 4.0/5	✅ use-js-interop; tools: skill	✅ 0.19	✅
use-js-interop	Infinite scroll list using IntersectionObserver	4.0/5 → 4.0/5	✅ use-js-interop; tools: skill	✅ 0.19	✅
create-blazor-project	University course catalog with enrollment form	4.0/5 → 2.0/5 🔴	✅ create-blazor-project; tools: skill	🟡 0.27	❌
create-blazor-project	Recipe community with interactive ratings on static pages	4.0/5 → 5.0/5 🟢	✅ create-blazor-project; tools: skill, glob	🟡 0.27	✅
create-blazor-project	Global logistics tracking for worldwide users	2.0/5 → 4.0/5 🟢	✅ create-blazor-project; tools: skill	🟡 0.27	✅
support-prerendering	Equipment inventory loaded once	4.0/5 → 4.0/5	✅ support-prerendering; tools: skill	🟡 0.24	✅
support-prerendering	Notifications page with live polling	4.0/5 → 3.0/5 🔴	✅ support-prerendering; tools: skill	🟡 0.24	✅
fetch-and-send-data	Recipe browser with resilient data loading	3.0/5 → 4.0/5 🟢	✅ fetch-and-send-data; tools: skill	✅ 0.19	✅
fetch-and-send-data	Real-time shipment tracker with Auto interactivity	4.0/5 → 4.0/5	⚠️ NOT ACTIVATED / ✅ fetch-and-send-data; tools: skill	✅ 0.19	✅
author-component	Author a data-loading search component	3.0/5 → 4.0/5 🟢	✅ author-component; tools: skill, view	✅ 0.18	✅
author-component	Author a multi-step wizard with async validation and shared state	3.0/5 → 4.0/5 🟢	✅ author-component; tools: skill, view	✅ 0.18	✅
author-component	Author a generic data table component	4.0/5 → 5.0/5 🟢	✅ author-component; tools: skill, bash	✅ 0.18	❌ [2]
author-component	Author a real-time notification badge component	2.0/5 → 5.0/5 🟢	✅ author-component; tools: skill, view	✅ 0.18	✅
author-component	Author a sortable list with code-behind pattern	4.0/5 → 4.0/5	✅ author-component; tools: skill, bash / ✅ author-component; tools: skill	✅ 0.18	❌ [3]
configure-auth	Login and account management in a globally interactive app	5.0/5 → 5.0/5	✅ configure-auth; tools: skill	✅ 0.08	✅
configure-auth	Multi-tier app with WebAssembly auth	3.0/5 → 5.0/5 🟢	✅ configure-auth; tools: skill	✅ 0.08	✅
plan-ui-change	Project management Kanban board	2.0/5 → 4.0/5 🟢	✅ plan-ui-change; tools: skill, edit	🟡 0.26	✅
plan-ui-change	E-commerce product catalog with filters and pagination	3.0/5 → 5.0/5 🟢	✅ plan-ui-change; tools: skill, edit / ✅ plan-ui-change; tools: skill, grep, edit	🟡 0.26	✅
plan-ui-change	Multi-step job application wizard	2.0/5 → 4.0/5 🟢	✅ plan-ui-change; tools: skill	🟡 0.26	✅
plan-ui-change	Application settings page with nested tab panels	3.0/5 → 4.0/5 🟢	✅ plan-ui-change; tools: skill	🟡 0.26	✅
plan-ui-change	Team chat interface with message threads	2.0/5 → 5.0/5 🟢	✅ plan-ui-change; tools: skill / ✅ plan-ui-change; tools: skill, edit	🟡 0.26	✅
collect-user-input	Event registration with custom validation	4.0/5 → 5.0/5 🟢	✅ collect-user-input; tools: skill	✅ 0.11	✅
collect-user-input	Multi-step booking form with cross-field validation	4.0/5 → 4.0/5	✅ collect-user-input; tools: skill	✅ 0.11	✅

[1] (Plugin) Quality unchanged but weighted score is -1.0% due to: tokens (97378 → 115389)
[2] (Plugin) Quality unchanged but weighted score is -5.1% due to: tokens (26523 → 62538), tool calls (2 → 4)
[3] (Isolated) Quality unchanged but weighted score is -4.5% due to: tokens (26981 → 57359), tool calls (3 → 5), time (16.9s → 23.6s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 754 in dotnet/skills, download eval artifacts with gh run download 27361292234 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/cc7dd466f10567e954b966daad6b55af2610b4cf/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

- Skip top-level reparse points in EnumerateCopiedFixtures (not just nested ones) so a top-level symlink/junction can't cause hashing of data outside the eval directory; code now matches the docstring.
- Record uses first-writer-wins (TryAdd) instead of overwriting, so a scenario identity recorded by multiple parallel targets yields a deterministic --baseline-out regardless of completion order.
- Persist the baseline judge result to the session DB even when the baseline is reused, so the registered 'baseline-reused' session record is complete for downstream investigation tooling (pairwise was already saved; the judge result was incorrectly gated on a fresh run).
- Add first-writer-wins test (564 pass). Note: BaselineStore stays internal; the test project already has InternalsVisibleTo, so it compiles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

YuliiaKovalova · 2026-06-11T18:08:58Z

Addressed the second round of review comments in 2b3cd2a:

EnumerateCopiedFixtures not skipping top-level reparse points: now skips reparse points for both files and directories at the top level (not just nested ones), so a top-level symlink/junction can't cause hashing of data outside the eval directory. The code now matches the docstring.
Record unconditionally overwriting: switched to first-writer-wins (TryAdd). When several targets evaluated in parallel share the same scenario identity, the persisted --baseline-out is now deterministic regardless of completion order (later identical-key records differing only by run-to-run noise are ignored). Added a test.
Reused baseline judge result not saved to the session DB: the baseline judge result is now persisted even when reused, so the registered baseline-reused session record is complete for downstream investigation tooling. (The pairwise result was already saved; only the judge-result save was incorrectly gated on a fresh run.)
BaselineStore internal vs. test access: no change needed — the test project already has InternalsVisibleTo in SkillValidator.csproj, so it compiles (564 tests pass).

The earlier seven comments were addressed in 82f487f / cc7dd46 (see prior summary).
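The first-writer-wins Record behavior described in this round can be sketched as a small thread-safe recorder. This is an illustration under assumptions, not the BaselineStore code: `BaselineRecorder` and its method names are hypothetical, and the sort order follows the (Name, PromptSha, TargetSha) Save ordering mentioned in an earlier commit.

```python
from __future__ import annotations

import threading


class BaselineRecorder:
    """Hypothetical in-memory recorder keyed on (prompt_sha, target_sha)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[tuple[str, str], dict] = {}

    def record(self, name: str, prompt_sha: str, target_sha: str,
               metrics: dict) -> bool:
        """Store the entry only if the key is new; return True when stored."""
        key = (prompt_sha, target_sha)
        with self._lock:
            if key in self._entries:
                return False  # a later writer never overwrites the first
            self._entries[key] = {"name": name, "metrics": metrics}
            return True

    def entries_for_save(self) -> list[tuple[str, str, str, dict]]:
        """Entries sorted by (name, prompt_sha, target_sha) for stable output."""
        with self._lock:
            items = [(e["name"], k[0], k[1], e["metrics"])
                     for k, e in self._entries.items()]
        return sorted(items, key=lambda item: item[:3])
```

Once a key is recorded, later records for the same identity are ignored, so the stored value cannot flip during the rest of the run.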

github-actions · 2026-06-11T19:17:14Z

👋 @YuliiaKovalova — this PR has 11 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

The transitive MessagePack 2.5.198 (via GitHub.Copilot.SDK -> StreamJsonRpc) has a high-severity vulnerability (GHSA-hv8m-jj95-wg3x) that fails the build under TreatWarningsAsErrors. Pin a direct reference to the patched 2.5.301.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

… consistent token attribution

- ComputeInputsSha: normalize evalPath via Path.GetFullPath so a bare filename still hashes sibling fixtures (avoids TargetSha collisions / unsafe reuse).
- RunMetrics.Clone(): per-run copy with fresh collections; reuse paths now clone the cached baseline so concurrent evaluations never share a mutable instance.
- Pairwise judge tokens attributed to both compared runs in every mode (the baseline clone makes this safe), keeping token deltas comparable across --baseline-from modes.
- Reword Record first-writer-wins doc to describe the within-run stabilization guarantee rather than order-independence.
- Add tests for bare-filename fixture hashing and clone isolation (566 pass).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
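The clone-isolation idea behind RunMetrics.Clone() can be shown with a small data class. The field names below are hypothetical and chosen only for illustration; the point is that the copy gets fresh collections, so adding pairwise judge tokens to a reused baseline does not mutate the cached instance.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    # Hypothetical fields; the real RunMetrics shape is not shown in this PR.
    token_count: int = 0
    tool_calls: list[str] = field(default_factory=list)

    def clone(self) -> "RunMetrics":
        """Per-run copy with a fresh list, so callers never share state."""
        return RunMetrics(token_count=self.token_count,
                          tool_calls=list(self.tool_calls))


def attribute_judge_tokens(cached: RunMetrics, judge_tokens: int) -> RunMetrics:
    """Add judge tokens to a private copy, leaving the cached baseline intact."""
    per_run = cached.clone()
    per_run.token_count += judge_tokens
    return per_run
```

Because each evaluation works on its own copy, several targets can reuse the same cached baseline concurrently without seeing each other's token adjustments.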

JanKrivanek · 2026-06-12T13:17:45Z

/evaluate

github-actions · 2026-06-12T13:45:58Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
microbenchmarking	Investigate runtime upgrade performance impact	4.0/5 → 4.0/5	✅ microbenchmarking; tools: skill, create / ⚠️ NOT ACTIVATED	✅ 0.11	✅
clr-activation-debugging	Diagnose unexpected FOD dialog from native build tool	5.0/5 → 5.0/5	✅ clr-activation-debugging; tools: skill / ✅ clr-activation-debugging; tools: skill, glob	✅ 0.10	❌ [1]
clr-activation-debugging	Diagnose FOD suppressed but activation still failing	3.0/5 → 5.0/5 🟢	✅ clr-activation-debugging; tools: skill / ✅ clr-activation-debugging; tools: skill, glob	✅ 0.10	✅
clr-activation-debugging	Explain why same binary behaves differently under different launch methods	5.0/5 → 5.0/5	✅ clr-activation-debugging; tools: skill / ✅ clr-activation-debugging; tools: skill, glob	✅ 0.10	❌ [2]
clr-activation-debugging	Analyze healthy managed EXE activation	4.0/5 → 5.0/5 🟢	✅ clr-activation-debugging; tools: skill, glob	✅ 0.10	✅
clr-activation-debugging	Identify multiple activation sequences in a single log	4.0/5 → 5.0/5 🟢	✅ clr-activation-debugging; tools: skill, glob	✅ 0.10	❌ [3]
clr-activation-debugging	Explain useLegacyV2RuntimeActivationPolicy in activation log	3.0/5 → 4.0/5 🟢	✅ clr-activation-debugging; tools: skill	✅ 0.10	✅
clr-activation-debugging	Decline non-CLR-activation issue	1.0/5 → 4.0/5 🟢	✅ clr-activation-debugging; tools: skill, glob / ℹ️ not activated (expected)	✅ 0.10	✅
android-tombstone-symbolication	Symbolicate .NET frames in an Android tombstone	2.0/5 → 3.0/5 🟢	✅ android-tombstone-symbolication; tools: skill, bash, read_bash, stop_bash / ✅ android-tombstone-symbolication; tools: skill, bash, read_bash	✅ 0.19	❌
android-tombstone-symbolication	Recognize tombstone with no .NET frames	5.0/5 → 5.0/5	✅ android-tombstone-symbolication; tools: skill / ⚠️ NOT ACTIVATED	✅ 0.19	❌ [4]
android-tombstone-symbolication	Symbolicate CoreCLR frames in an Android tombstone	4.0/5 → 4.0/5	✅ android-tombstone-symbolication; tools: skill	✅ 0.19	❌ [5]
android-tombstone-symbolication	Recognize NativeAOT tombstone with app binary and libSystem.Native.so	3.0/5 → 4.0/5 🟢	✅ android-tombstone-symbolication; tools: skill, bash	✅ 0.19	✅
android-tombstone-symbolication	Symbolicate multi-thread tombstone	4.0/5 → 5.0/5 🟢	✅ android-tombstone-symbolication; tools: skill	✅ 0.19	✅
android-tombstone-symbolication	Handle .NET frames with no BuildId metadata	4.0/5 → 5.0/5 🟢	✅ android-tombstone-symbolication; tools: skill	✅ 0.19	✅
android-tombstone-symbolication	Symbolicate tombstone with multiple .NET libraries and different BuildIds	3.0/5 → 2.0/5 ⏰ 🔴	✅ android-tombstone-symbolication; tools: skill, read_bash / ✅ android-tombstone-symbolication; tools: skill	✅ 0.19	❌
android-tombstone-symbolication	Reject iOS crash log as wrong format	4.0/5 → 5.0/5 🟢	ℹ️ not activated (expected)	✅ 0.19	❌
apple-crash-symbolication	Parse .NET frames and locate dSYMs from an iOS crash log	3.0/5 → 4.0/5 🟢	✅ apple-crash-symbolication; tools: skill, bash	✅ 0.17	✅
apple-crash-symbolication	Investigate root cause of a .NET MAUI iOS crash	2.0/5 → 3.0/5 🟢	✅ apple-crash-symbolication; tools: skill, bash	✅ 0.17	✅
apple-crash-symbolication	Reject Android tombstone passed as iOS crash log	4.0/5 → 5.0/5 🟢	ℹ️ not activated (expected)	✅ 0.17	❌ [6]
dump-collect	Configure automatic crash dumps for CoreCLR app on Linux	5.0/5 → 5.0/5	✅ dump-collect; tools: report_intent, skill, view / ✅ dump-collect; tools: skill, report_intent, view	🟡 0.30	❌ [7]
dump-collect	Set up NativeAOT crash dumps with createdump in Kubernetes	3.0/5 → 5.0/5 🟢	✅ dump-collect; tools: skill, view	🟡 0.30	✅
dump-collect	Recover crash dump from macOS NativeAOT without createdump	4.0/5 → 4.0/5	⚠️ NOT ACTIVATED	🟡 0.30	❌
dump-collect	Configure CoreCLR dump collection in Alpine Docker as non-root	3.0/5 → 5.0/5 🟢	✅ dump-collect; tools: skill, report_intent, view	🟡 0.30	✅
dump-collect	Advisory: macOS NativeAOT crash dump recovery steps	4.0/5 → 5.0/5 🟢	✅ dump-collect; tools: skill	🟡 0.30	✅
dump-collect	Advisory: CoreCLR Alpine Docker non-root configuration	4.0/5 → 5.0/5 🟢	✅ dump-collect; tools: skill, report_intent, view	🟡 0.30	✅
dump-collect	Advisory: NativeAOT Kubernetes dump collection setup	3.0/5 → 5.0/5 🟢	✅ dump-collect; tools: skill, report_intent, view	🟡 0.30	✅
dump-collect	Detect runtime and configure crash dumps for unknown .NET app on Linux	4.0/5 → 4.0/5	✅ dump-collect; tools: skill / ✅ dump-collect; tools: skill, view	🟡 0.30	✅
dump-collect	Decline dump analysis request	2.0/5 → 2.0/5	ℹ️ not activated (expected)	🟡 0.30	❌ [8]
dotnet-trace-collect	High CPU in Kubernetes on Linux (.NET 8)	4.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view	🟡 0.21	❌ [9]
dotnet-trace-collect	.NET Framework on Windows without admin privileges	3.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: report_intent, skill / ✅ dotnet-trace-collect; tools: skill	🟡 0.21	✅
dotnet-trace-collect	.NET 10 on Linux with root access and native call stacks	2.0/5 → 3.0/5 🟢	✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view	🟡 0.21	✅
dotnet-trace-collect	Memory leak on Linux (.NET 8)	3.0/5 → 3.0/5	✅ dotnet-trace-collect; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED	🟡 0.21	✅
dotnet-trace-collect	Slow requests on Windows with PerfView	5.0/5 → 5.0/5	✅ dotnet-trace-collect; tools: skill / ✅ dotnet-trace-collect; tools: report_intent, skill	🟡 0.21	❌ [10]
dotnet-trace-collect	Excessive GC on Linux (.NET 8)	4.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: report_intent, skill	🟡 0.21	✅
dotnet-trace-collect	Hang or deadlock diagnosis on Linux	3.0/5 → 4.0/5 🟢	✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: skill	🟡 0.21	✅
dotnet-trace-collect	Windows container high CPU with PerfView	2.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view, grep	🟡 0.21	✅
dotnet-trace-collect	Long-running intermittent issue with PerfView triggers	3.0/5 → 4.0/5 🟢	✅ dotnet-trace-collect; tools: skill, view	🟡 0.21	✅
dotnet-trace-collect	Linux pre-.NET 10 needing native call stacks	3.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view	🟡 0.21	✅
dotnet-trace-collect	Windows modern .NET with admin high CPU	3.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: skill, view / ⚠️ NOT ACTIVATED	🟡 0.21	❌ [11]
dotnet-trace-collect	Memory leak on .NET Framework Windows	4.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: skill	🟡 0.21	✅
dotnet-trace-collect	Kubernetes with console access prefers console tools	4.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: skill, view / ✅ dotnet-trace-collect; tools: skill	🟡 0.21	✅
dotnet-trace-collect	Container installation without .NET SDK	4.0/5 → 4.0/5	✅ dotnet-trace-collect; tools: report_intent, skill, view	🟡 0.21	❌ [12]
dotnet-trace-collect	HTTP 500s from downstream service on Linux (.NET 8)	4.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: skill	🟡 0.21	✅
dotnet-trace-collect	Networking timeouts on Windows with admin (.NET 8)	2.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: skill, report_intent, view	🟡 0.21	✅
dotnet-trace-collect	Assembly loading failure on Linux (.NET 8)	4.0/5 → 5.0/5 🟢	✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: skill	🟡 0.21	❌ [13]
analyzing-dotnet-performance	Detects compiled regex startup budget and regex chain allocations	4.0/5 → 4.0/5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	✅
analyzing-dotnet-performance	Detects CurrentCulture comparer and compiled regex budget in inflection rules	5.0/5 → 5.0/5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	❌ [14]
analyzing-dotnet-performance	Finds per-call Dictionary allocation not hoisted to static	5.0/5 → 5.0/5	✅ analyzing-dotnet-performance; tools: skill, bash / ✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	❌
analyzing-dotnet-performance	Catches compound allocations in recursive number converter with ToLower	5.0/5 → 4.0/5 🔴	✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	❌
analyzing-dotnet-performance	Finds StringComparison.Ordinal missing and FrozenDictionary opportunities	5.0/5 → 5.0/5	✅ analyzing-dotnet-performance; tools: skill, grep / ✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	❌ [15]
analyzing-dotnet-performance	Detects Aggregate+Replace chain and struct missing IEquatable	3.0/5 → 5.0/5 🟢	✅ analyzing-dotnet-performance; tools: skill, bash / ✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	✅
analyzing-dotnet-performance	Finds branched Replace chain in format string manipulation	3.0/5 → 4.0/5 🟢	✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	✅
analyzing-dotnet-performance	Catches LINQ on hot-path string processing and All(char.IsUpper)	4.0/5 → 5.0/5 🟢	✅ analyzing-dotnet-performance; tools: skill, bash / ✅ analyzing-dotnet-performance; tools: skill, grep	✅ 0.15	✅
analyzing-dotnet-performance	Detects LINQ pipeline in TimeSpan formatting and collection processing	4.0/5 → 3.0/5 🔴	✅ analyzing-dotnet-performance; tools: skill / ✅ analyzing-dotnet-performance; tools: skill, bash	✅ 0.15	❌ [16]
analyzing-dotnet-performance	Flags Span inconsistencies and compound method chains in truncation library	4.0/5 → 4.0/5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	❌ [17]
analyzing-dotnet-performance	Identifies unsealed leaf classes and locale hierarchy patterns	3.0/5 → 5.0/5 🟢	✅ analyzing-dotnet-performance; tools: skill, grep / ✅ analyzing-dotnet-performance; tools: skill	✅ 0.15	✅
exp-mock-usage-analysis	Detect unused and unreachable mock setups	4.0/5 → 4.0/5	✅ exp-mock-usage-analysis; tools: skill	✅ 0.08	✅
exp-mock-usage-analysis	Detect redundant mock configurations duplicated across tests	3.0/5 → 3.0/5	✅ exp-mock-usage-analysis; tools: skill, edit	✅ 0.08	✅
exp-mock-usage-analysis	Detect mocking of stable framework types	3.0/5 → 5.0/5 🟢	✅ exp-mock-usage-analysis; tools: skill	✅ 0.08	✅
exp-mock-usage-analysis	Analyze mock usage in NSubstitute tests	3.0/5 → 3.0/5	✅ exp-mock-usage-analysis; tools: skill	✅ 0.08	✅
exp-mock-usage-analysis	Analyze mock usage in FakeItEasy tests	4.0/5 → 5.0/5 🟢	✅ exp-mock-usage-analysis; tools: skill	✅ 0.08	❌ [18]
exp-mock-usage-analysis	Detect excessive mock configuration sprawl	3.0/5 → 4.0/5 🟢	✅ exp-mock-usage-analysis; tools: skill	✅ 0.08	✅
exp-test-maintainability	Recommend data-driven patterns with display names for unclear parameters	4.0/5 → 4.0/5	⚠️ NOT ACTIVATED	✅ 0.14	❌ [19]
exp-test-maintainability	Recognize well-maintained tests that need minimal changes	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED / ✅ exp-test-maintainability; tools: report_intent, skill	✅ 0.14	❌ [20]
exp-test-maintainability	Detect repeated object construction and setup across test methods	2.0/5 → 4.0/5 🟢	✅ exp-test-maintainability; tools: skill	✅ 0.14	✅
exp-test-maintainability	Recognize tests with minimal boilerplate that need no refactoring	4.0/5 → 5.0/5 🟢	✅ exp-test-maintainability; tools: skill	✅ 0.14	❌ [21]
exp-simd-vectorization	Optimize manual min/max with TensorPrimitives	1.0/5 → 5.0/5 🟢	✅ exp-simd-vectorization; tools: skill, glob, create, bash	🟡 0.20	✅
exp-simd-vectorization	Optimize manual product with TensorPrimitives	1.0/5 → 5.0/5 🟢	✅ exp-simd-vectorization; tools: skill, glob / ✅ exp-simd-vectorization; tools: skill, glob, create, bash	🟡 0.20	✅
exp-simd-vectorization	No optimization opportunity — dictionary-based lookup service	5.0/5 → 5.0/5	⚠️ NOT ACTIVATED	🟡 0.20	✅
exp-simd-vectorization	Optimize int array conditional increment with SIMD	4.0/5 → 4.0/5	✅ exp-simd-vectorization; tools: skill	🟡 0.20	❌ [22]
exp-simd-vectorization	Optimize byte buffer bit reversal with SIMD	4.0/5 → 4.0/5	✅ exp-simd-vectorization; tools: skill	🟡 0.20	✅

[1] (Plugin) Quality unchanged but weighted score is -6.5% due to: tokens (42962 → 80074), time (26.7s → 44.6s), tool calls (5 → 6)
[2] (Isolated) Quality unchanged but weighted score is -1.4% due to: tokens (41077 → 72664), tool calls (4 → 5)
[3] (Isolated) Quality improved but weighted score is -2.3% due to: tokens (40979 → 72460), time (21.1s → 28.4s), tool calls (3 → 4)
[4] (Isolated) Quality unchanged but weighted score is -5.6% due to: tokens (26407 → 45326), tool calls (2 → 3), time (14.4s → 18.7s)
[5] (Plugin) Quality unchanged but weighted score is -4.2% due to: tokens (60204 → 149446), tool calls (5 → 9)
[6] (Plugin) Quality improved but weighted score is -12.1% due to: completion (✓ → ✗), tokens (27601 → 88815), tool calls (2 → 6), time (22.6s → 36.9s)
[7] (Plugin) Quality unchanged but weighted score is -8.7% due to: tokens (12863 → 46395), tool calls (0 → 3), time (11.1s → 16.6s)
[8] (Plugin) Quality unchanged but weighted score is -1.1% due to: tokens (12649 → 14261)
[9] (Isolated) Quality improved but weighted score is -5.2% due to: tokens (13219 → 53318), tool calls (0 → 3), time (19.0s → 25.8s)
[10] (Plugin) Quality unchanged but weighted score is -7.8% due to: tokens (12930 → 34975), tool calls (0 → 2)
[11] (Plugin) Quality unchanged but weighted score is -0.1% due to: quality
[12] (Plugin) Quality unchanged but weighted score is -13.2% due to: tokens (12932 → 56799), quality, tool calls (0 → 3), time (12.0s → 20.0s)
[13] (Isolated) Quality improved but weighted score is -8.3% due to: tokens (13241 → 74264), tool calls (0 → 4), time (19.6s → 26.0s)
[14] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (29403 → 84200), tool calls (2 → 7), time (21.7s → 66.2s)
[15] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (29625 → 79738), tool calls (2 → 6), time (17.8s → 52.5s)
[16] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (29874 → 99946), tool calls (2 → 5), time (23.0s → 54.4s)
[17] (Plugin) Quality unchanged but weighted score is -4.7% due to: tokens (45201 → 98830), tool calls (3 → 7), time (27.6s → 61.3s)
[18] (Isolated) Quality improved but weighted score is -1.6% due to: tokens (27669 → 44598), time (16.5s → 27.9s), tool calls (3 → 4)
[19] (Isolated) Quality unchanged but weighted score is -15.2% due to: judgment, quality
[20] (Plugin) Quality unchanged but weighted score is -3.1% due to: tokens (13534 → 30720), tool calls (0 → 2), time (14.7s → 26.5s)
[21] (Plugin) Quality unchanged but weighted score is -6.3% due to: tokens (39993 → 60020), time (16.7s → 37.2s), tool calls (4 → 6)
[22] (Isolated) Quality unchanged but weighted score is -16.0% due to: judgment, quality
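
The footnotes above share a regular "metric (before → after)" shape, so the overhead they report can be tabulated mechanically. The following is a minimal Python sketch of that idea. `parse_footnote` and `growth` are hypothetical helper names, not part of skill-validator, and the sketch does not reproduce the weighted-score formula, which this report does not state.

```python
from __future__ import annotations

import re

# One "name (before → after)" metric, e.g. "tokens (42962 → 80074)".
# A trailing "s" on time values (26.7s) is optional and discarded.
# Non-numeric entries such as "completion (✓ → ✗)" are deliberately skipped.
METRIC = re.compile(r"(\w[\w ]*?) \(([\d.]+)s? → ([\d.]+)s?\)")


def parse_footnote(line: str) -> dict[str, tuple[float, float]]:
    """Return {metric name: (baseline value, skilled value)} for one footnote."""
    return {name: (float(a), float(b)) for name, a, b in METRIC.findall(line)}


def growth(before: float, after: float) -> float | None:
    """Relative change; None when the baseline is zero (e.g. tool calls 0 → 3)."""
    return None if before == 0 else (after - before) / before


if __name__ == "__main__":
    sample = (
        "[1] (Plugin) Quality unchanged but weighted score is -6.5% due to: "
        "tokens (42962 → 80074), time (26.7s → 44.6s), tool calls (5 → 6)"
    )
    for name, (before, after) in parse_footnote(sample).items():
        print(name, before, after, growth(before, after))
```

Returning `None` for a zero baseline avoids reporting an infinite or misleading ratio for scenarios where the baseline made no tool calls.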

⏰ timeout — run(s) hit the (120s, 300s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 754 in dotnet/skills, download eval artifacts with gh run download 27418155760 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/79ffa9c33a8b48b07e9a9bd7a11a932112c6092e/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.
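
Once the artifacts are downloaded, a first pass can simply locate the results.json files before reading InvestigatingResults.md. This is a minimal Python sketch, assuming the `./eval-results` directory named in the command above. `find_results` and `top_level_shape` are hypothetical helpers, and the sketch makes no assumption about the results.json schema: it only lists top-level keys.

```python
import json
from pathlib import Path


def find_results(root: str = "./eval-results") -> list[Path]:
    """Return every results.json under root, sorted for stable output."""
    return sorted(Path(root).rglob("results.json"))


def top_level_shape(path: Path) -> list[str]:
    """List top-level keys (or the container type) without assuming a schema."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return sorted(data) if isinstance(data, dict) else [type(data).__name__]


if __name__ == "__main__":
    for path in find_results():
        print(path, top_level_shape(path))
```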

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

YuliiaKovalova and others added 2 commits June 11, 2026 17:07

Copilot AI review requested due to automatic review settings June 11, 2026 15:48

YuliiaKovalova requested review from JanKrivanek and ViktorHofer as code owners June 11, 2026 15:48

Copilot AI reviewed Jun 11, 2026


YuliiaKovalova and others added 2 commits June 11, 2026 18:05

Copilot AI review requested due to automatic review settings June 11, 2026 16:10

Copilot AI reviewed Jun 11, 2026


Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs

Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs

Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs

Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs

JanKrivanek approved these changes Jun 11, 2026


github-actions Bot added a commit that referenced this pull request Jun 11, 2026

Update PR token usage data (PR #754)

62cd908

JanKrivanek enabled auto-merge (squash) June 11, 2026 18:02

auto-merge was automatically disabled June 11, 2026 18:08
Head branch was pushed to by a user without write access

github-actions Bot added the waiting-on-author (PR state label) label Jun 11, 2026

github-actions Bot added the pr-state/ready-for-eval (PR is mergeable and awaiting evaluation) and evaluate-now (Trigger evaluation.yml for current PR head (transient)) labels and removed the waiting-on-author (PR state label) label Jun 11, 2026

YuliiaKovalova force-pushed the dev/ykovalova/baseline-reuse branch from 2b3cd2a to c9f0f44 Compare June 12, 2026 09:39

Copilot AI review requested due to automatic review settings June 12, 2026 10:11

Copilot AI reviewed Jun 12, 2026


Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs

Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs

Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs Outdated

Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs

github-actions Bot added a commit that referenced this pull request Jun 12, 2026

Update PR token usage data (PR #754)

6782d95

YuliiaKovalova merged commit bcaa918 into dotnet:main Jun 12, 2026
39 checks passed

YuliiaKovalova mentioned this pull request Jun 12, 2026

Fix evaluation.yml run-name: quote expression so '#' isn't a YAML comment #759

Merged

3 participants