Add reusable shared baseline to skill-validator evaluate (--baseline-out / --baseline-from) #754

Merged
YuliiaKovalova merged 7 commits into dotnet:main from YuliiaKovalova:dev/ykovalova/baseline-reuse
Jun 12, 2026

Conversation

@YuliiaKovalova

Member

Implements #751 — adds a reusable, shared no-skill/no-agent baseline ("shared control group") to skill-validator evaluate so the baseline arm can be computed once and reused across many invocations, eliminating redundant baseline runs and removing baseline run-to-run variance from cross-config comparisons.

What's new

  • --baseline-out <path> — after the run, persist each scenario's averaged baseline (honoring --runs) for later reuse.
  • --baseline-from <path> — reuse a precomputed baseline instead of re-running the baseline arm. The two options are mutually exclusive.
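
As a rough illustration of the mutual-exclusion rule above, here is a minimal Python sketch. It is not the validator's own code (the tool is a .NET CLI); only the two flag names come from this PR, while `build_parser` and `baseline_mode` are hypothetical helper names chosen for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with the two baseline flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="evaluate-sketch")
    # A mutually exclusive group rejects command lines that pass both flags.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--baseline-out", metavar="PATH",
                       help="persist the averaged baseline after the run")
    group.add_argument("--baseline-from", metavar="PATH",
                       help="reuse a precomputed baseline")
    return parser


def baseline_mode(argv):
    """Return (mode, path): 'write', 'reuse', or 'fresh' when neither flag is set."""
    args = build_parser().parse_args(argv)
    if args.baseline_out:
        return ("write", args.baseline_out)
    if args.baseline_from:
        return ("reuse", args.baseline_from)
    return ("fresh", None)
```

Passing both flags makes `argparse` exit with a usage error, which mirrors the "mutually exclusive" behavior described above.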

Identity & safety

The baseline file is keyed per scenario on (promptSha, targetSha) and carries a header with the schema version and --model. Following the (prompt, model, targetSha) contract:

  • promptSha — SHA-256 of the scenario prompt.
  • targetSha — SHA-256 of the scenario's materialized inputs (files copied via copy_test_files, explicit setup files, and the setup-command recipe). This prevents two cases that share a prompt but feed different fixtures (e.g. a different build.binlog) from reusing each other's baseline.
  • Reuse fails fast on model mismatch, unsupported schema version, or any scenario whose prompt+fixture identity is missing from the file.
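
The identity checks above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the BaselineStore implementation: the dictionary field names (`version`, `model`, `scenarios`, `promptSha`, `targetSha`), the default version number, and the function names are all hypothetical.

```python
import hashlib


def prompt_sha(prompt: str) -> str:
    """SHA-256 of the scenario prompt, hex encoded (UTF-8 is an assumption)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def find_missing_scenarios(baseline: dict, model: str, scenarios: list,
                           supported_version: int = 2) -> list:
    """Validate the header, then list scenarios with no matching baseline entry."""
    # Fail fast on header mismatches before looking at any scenario.
    if baseline["version"] != supported_version:
        raise ValueError(f"unsupported baseline version {baseline['version']}")
    if baseline["model"] != model:
        raise ValueError(f"baseline was recorded for model {baseline['model']!r}")
    # Both halves of the key must match for an entry to be reusable.
    known = {(e["promptSha"], e["targetSha"]) for e in baseline["scenarios"]}
    return [s["name"] for s in scenarios
            if (prompt_sha(s["prompt"]), s["targetSha"]) not in known]
```

A caller would treat a non-empty result as an error, which matches the fail-fast behavior for scenarios whose prompt+fixture identity is missing from the file.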

Behavior on reuse

When a cached baseline matches, the baseline agent run (and its assertions/constraints/task-completion/judging) is skipped; the cached metrics + judge result are used for deltas and pairwise/independent judging. Such scenarios are reported with the baseline-reused session phase and a reused baseline status. Pairwise judging runs in the skilled run's work dir (the cached baseline's work dir no longer exists) and does not re-attribute tokens to the baseline.

Tests & docs

  • New BaselineStoreTests (9 facts): prompt-SHA determinism, save/load round-trip, model-mismatch / unsupported-version / missing-file failures, FindMissingScenarios, write-store-not-reuse, target-SHA stability + content sensitivity, and same-prompt/different-fixture non-reuse.
  • Full suite green: 560 passed, 0 failed. Build: 0 warnings, 0 errors.
  • Updated eng/skill-validator/src/README.md (examples, flags table, "Shared baseline reuse" section) and docs/InvestigatingResults.md.

Closes #751.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

YuliiaKovalova and others added 2 commits June 11, 2026 17:07
…tnet#751)

Add --baseline-out and --baseline-from options to the evaluate command so the
no-skill/no-agent baseline arm can be computed once and reused as a shared
control group across multiple skill/agent evaluations. This eliminates redundant
baseline runs and removes baseline run-to-run variance from cross-config
comparisons.

- New BaselineStore + BaselineFile/BaselineScenarioEntry models, keyed per
  scenario on SHA-256(prompt) with a header recording version, model,
  validator version and runs. Load validates version + model and fails fast on
  mismatch or missing scenarios.
- Register the new serializable types in SkillValidatorJsonContext (AOT
  source-gen).
- Wire two mutually-exclusive CLI options into ValidatorConfig; thread an
  optional BaselineStore through both execution paths.
- On reuse, skip the baseline agent run, its assertions/constraints/
  task-completion/judging, and attribute no extra pairwise tokens to the
  baseline; report the scenario with the baseline-reused session phase and a
  reused status. In write mode, record each scenario's averaged baseline and
  persist it after the run.
- Add unit tests for BaselineStore and document the feature in README and
  InvestigatingResults.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Align baseline reuse with the (prompt, model, targetSha) identity contract from
the upstream eval-harness design: previously the per-scenario reuse key was the
prompt SHA + model only, so two scenarios that share an identical prompt but
feed the agent different input artifacts (e.g. a different build.binlog) would
collide and silently reuse the wrong baseline.

- Add BaselineScenarioEntry.TargetSha: a SHA-256 over the scenario's materialized
  inputs — files auto-copied via copy_test_files, explicit setup files (inline
  content or copied sources), and the setup command recipe. The reuse key is now
  (promptSha, targetSha); both must match. Bump the on-disk schema to version 2.
- Memoize target hashing per process via a cheap, file-I/O-free setup signature
  to avoid re-hashing large fixtures across the N runs.
- Thread the originating eval.yaml path into Record/TryGetBaseline/
  FindMissingScenarios so inputs can be fingerprinted.
- Tests: target SHA is stable and content-sensitive; same-prompt/different-fixture
  scenarios do not reuse each other's baseline and are surfaced by
  FindMissingScenarios. Update README and InvestigatingResults.
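
One way to picture the target fingerprint described in this commit is the Python sketch below. It is an assumption-laden illustration, not the actual TargetSha algorithm: the in-memory `files` mapping, the tagging and length-prefix scheme, and the name `target_sha` are all hypothetical.

```python
import hashlib


def target_sha(files: dict, setup_commands: list) -> str:
    """Hash materialized inputs: relative path -> bytes, plus the setup recipe."""
    h = hashlib.sha256()

    def feed(tag: bytes, data: bytes) -> None:
        # Tag and length-prefix each field so adjacent fields cannot run together.
        h.update(tag)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)

    # Sort by path so the result does not depend on enumeration order.
    for rel_path in sorted(files):
        feed(b"P", rel_path.encode("utf-8"))
        feed(b"F", files[rel_path])
    # Commands are hashed as text (the recipe), not by their effects.
    for command in setup_commands:
        feed(b"C", command.encode("utf-8"))
    return h.hexdigest()
```

With a scheme like this, two scenarios that share a prompt but ship a different build.binlog get different fingerprints, so neither can reuse the other's baseline.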

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 11, 2026 15:48
@github-actions

Contributor

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds shared-baseline persistence and reuse to skill-validator evaluate, enabling faster and less noisy comparisons across multiple skills/agents by skipping redundant baseline runs when a matching baseline file is provided.

Changes:

  • Introduces --baseline-out (write) / --baseline-from (reuse) options and threads a BaselineStore through evaluation execution.
  • Implements on-disk baseline schema + prompt/fixture identity hashing to prevent cross-scenario contamination.
  • Adds unit tests and updates docs to explain baseline reuse behavior and reporting.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| eng/skill-validator/src/Evaluate/EvaluateCommand.cs | Adds CLI options, validation, baseline preflight checks, and execution-path changes for baseline reuse/write. |
| eng/skill-validator/src/Evaluate/BaselineStore.cs | New baseline file format + keyed storage and hashing for prompt + setup/fixtures. |
| eng/skill-validator/tests/Evaluate/BaselineStoreTests.cs | Adds tests for hashing determinism, save/load, model/version validation, and fixture-sensitive reuse. |
| eng/skill-validator/src/docs/InvestigatingResults.md | Documents how reused baselines appear in results/phases and how identity is determined. |
| eng/skill-validator/src/README.md | Documents new flags and adds a “Shared baseline reuse” section. |
| eng/skill-validator/src/SkillValidatorJsonContext.cs | Registers baseline types for source-generated System.Text.Json serialization. |
| eng/skill-validator/src/Evaluate/Models.cs | Adds config fields for baseline out/from options. |

Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs Outdated
Comment thread eng/skill-validator/src/README.md Outdated
Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs Outdated
Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs Outdated
Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs Outdated
Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs Outdated
Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs
YuliiaKovalova and others added 2 commits June 11, 2026 18:05
Address rubber-duck review findings on the shared-baseline feature:

- Fix Sha256Hex 32-byte bug: a 32-byte input was treated as an
  already-computed digest and not hashed. Split into Sha256Hex (always
  hash) + HexDigest (encode existing digest).
- Broaden reuse identity: the cached baseline RunResult depends on the
  judge model and on per-scenario evaluation criteria (rubric,
  assertions, expect/reject tools, turn/token/timeout limits). Add
  JudgeModel to the baseline header (validated on load) and fold the
  criteria into the per-scenario targetSha so changing them invalidates
  reuse instead of silently serving a stale result.
- Mirror AgentRunner.SetupWorkDir exactly when hashing copied fixtures:
  exclude only the top-level eval.yaml (nested eval.yaml files are
  copied, so they must be hashed).
- Make the target-SHA cache instance-scoped (memoizing only the
  expensive fixture-input hashing) so it can't serve stale hashes or
  leak across evaluations/tests; hash inline file Content in the cache
  key instead of embedding it.
- Deterministic Save ordering (Name, PromptSha, TargetSha); guard Load
  against null Scenarios; enrich FindMissingScenarios output with the
  eval path.
- Document that setup commands are fingerprinted by recipe, so reuse
  assumes they are deterministic/hermetic.
- Tests + docs updated; add judge-model-mismatch and criteria-identity
  tests (562 pass).
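
The Sha256Hex fix in the first bullet can be illustrated with a small Python sketch. The function bodies are a guess at the shape of the bug based only on the description above; the real C# code is not shown in this PR conversation.

```python
import hashlib


def sha256_hex_buggy(data: bytes) -> str:
    """Sketch of the bug: a 32-byte input is mistaken for a finished digest."""
    if len(data) == 32:
        return data.hex()  # wrong: the payload is never hashed
    return hashlib.sha256(data).hexdigest()


def sha256_hex(data: bytes) -> str:
    """Always hash the input, whatever its length."""
    return hashlib.sha256(data).hexdigest()


def hex_digest(digest: bytes) -> str:
    """Encode an already-computed digest without hashing it again."""
    return digest.hex()
```

Splitting the two responsibilities means the caller states its intent explicitly, so an input that happens to be 32 bytes long can no longer be misclassified.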

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Mirror AgentRunner.SetupWorkDir exactly when hashing copied fixtures:
  enumerate only the files actually copied (top-level siblings except
  eval.yaml, recursing into directories) and skip reparse points and
  out-of-root junctions, instead of blindly hashing every file under the
  eval directory. This keeps the fixture identity restricted to the
  intentionally-copied set so stray output/log files can't poison reuse.
- Stream baseline JSON to/from disk (File.OpenRead/File.Create with
  JsonSerializer) so large baselines never materialize as one giant
  in-memory string.
- Enrich the fail-fast 'missing scenario' output with the eval path and
  short prompt/target SHA prefixes so it is actionable when scenario
  names collide across eval files.
- Add a test locking in recursive (nested-directory) fixture hashing.
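
The enumeration rule in the first bullet might look roughly like this in Python. It is a sketch under assumptions: symlinks stand in for Windows reparse points, links are skipped at every level, and `enumerate_copied_fixtures` is a hypothetical name that does not claim to match AgentRunner.SetupWorkDir.

```python
import os


def enumerate_copied_fixtures(eval_dir: str) -> list:
    """List relative paths of fixture files that would be copied and hashed."""
    root = os.path.abspath(eval_dir)
    found = []

    def walk(directory: str, top_level: bool) -> None:
        with os.scandir(directory) as it:
            entries = sorted(it, key=lambda e: e.name)  # stable ordering
        for entry in entries:
            if entry.is_symlink():
                continue  # never follow links out of the eval directory
            if top_level and entry.name == "eval.yaml":
                continue  # only the top-level eval.yaml is excluded
            if entry.is_dir(follow_symlinks=False):
                walk(entry.path, False)
            elif entry.is_file(follow_symlinks=False):
                rel = os.path.relpath(entry.path, root)
                found.append(rel.replace(os.sep, "/"))

    walk(root, True)
    return found
```

Note that a nested eval.yaml is still listed, in line with the earlier commit's point that nested eval.yaml files are copied and therefore must be hashed.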

563 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 11, 2026 16:10
@YuliiaKovalova

Member Author

Addressed the review comments in 82f487f and cc7dd46:

  • --baseline-from option description & README flag table: updated to state reuse must match --model, --judge-model, and each scenario's prompt + setup/fixture inputs + evaluation criteria (not prompt alone).
  • BuildTargetCacheKey inlining f.Content: the cache key now stores a SHA-256 of inline content rather than the raw string, and the cache is now instance-scoped (no longer static), so large inline content no longer bloats or is retained process-wide.
  • copy_test_files hashing all files under the eval dir: hashing now mirrors AgentRunner.SetupWorkDir/CopyDirectory exactly — it enumerates only the files actually copied (top-level siblings except eval.yaml, recursing into directories) and skips reparse points / out-of-root junctions, so stray output/log files under the eval directory can't poison the fixture identity.
  • Load/Save reading/writing the whole JSON as a string: both now stream to/from disk via File.OpenRead/File.Create with JsonSerializer, so large baselines never materialize as one giant in-memory string.
  • Fail-fast message listing names only: FindMissingScenarios output now includes the eval path and short prompt/target SHA prefixes, making it actionable when scenario names collide across eval files.

Also broadened the reuse identity to cover the judge model and per-scenario criteria (rubric/assertions/expect+reject tools/turn-token-timeout limits), and fixed a Sha256Hex 32-byte-input bug. 563 tests pass.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs
Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs
Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs
Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs
@JanKrivanek

Member

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 11, 2026
@github-actions

Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET 8+) 5.0/5 → 5.0/5 ✅ dotnet-pinvoke; tools: skill ✅ 0.06
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET Framework) 4.0/5 → 5.0/5 🟢 ✅ dotnet-pinvoke; tools: skill ✅ 0.06
nuget-trusted-publishing Set up trusted publishing for a new NuGet library 3.0/5 → 4.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, view ✅ 0.10
nuget-trusted-publishing Set up NuGet publishing without mentioning trusted publishing 3.0/5 → 5.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, glob, view / ✅ nuget-trusted-publishing; tools: skill, view, glob ✅ 0.10
nuget-trusted-publishing Migrate existing workflow from API key to trusted publishing 3.0/5 → 4.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, view / ✅ nuget-trusted-publishing; tools: skill ✅ 0.10
csharp-scripts Avoid activating for language-agnostic calendar script 5.0/5 → 3.0/5 🔴 ✅ csharp-scripts; tools: skill, create / ✅ csharp-scripts; tools: skill, create, edit 🟡 0.31
csharp-scripts Test a C# language feature with a file-based app 3.0/5 → 4.0/5 🟢 ✅ csharp-scripts; tools: skill, create, edit 🟡 0.31
csharp-scripts Compose a file-based app from helper files 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED 🟡 0.31 [1]
coordinate-components Warehouse dashboard with site selector and live stock alerts 4.0/5 → 5.0/5 🟢 ✅ coordinate-components; tools: skill 🟡 0.24
coordinate-components Multi-tenant notification hub with cross-component fan-out 2.0/5 → 5.0/5 🟢 ✅ coordinate-components; tools: skill 🟡 0.24
use-js-interop Auto-saving notepad that survives page reloads 4.0/5 → 4.0/5 ✅ use-js-interop; tools: skill ✅ 0.19
use-js-interop User activity tracker that detects idle timeout 4.0/5 → 4.0/5 ✅ use-js-interop; tools: skill ✅ 0.19
use-js-interop Responsive layout that adapts to screen size 4.0/5 → 4.0/5 ✅ use-js-interop; tools: skill ✅ 0.19
use-js-interop Infinite scroll list using IntersectionObserver 4.0/5 → 4.0/5 ✅ use-js-interop; tools: skill ✅ 0.19
create-blazor-project University course catalog with enrollment form 4.0/5 → 2.0/5 🔴 ✅ create-blazor-project; tools: skill 🟡 0.27
create-blazor-project Recipe community with interactive ratings on static pages 4.0/5 → 5.0/5 🟢 ✅ create-blazor-project; tools: skill, glob 🟡 0.27
create-blazor-project Global logistics tracking for worldwide users 2.0/5 → 4.0/5 🟢 ✅ create-blazor-project; tools: skill 🟡 0.27
support-prerendering Equipment inventory loaded once 4.0/5 → 4.0/5 ✅ support-prerendering; tools: skill 🟡 0.24
support-prerendering Notifications page with live polling 4.0/5 → 3.0/5 🔴 ✅ support-prerendering; tools: skill 🟡 0.24
fetch-and-send-data Recipe browser with resilient data loading 3.0/5 → 4.0/5 🟢 ✅ fetch-and-send-data; tools: skill ✅ 0.19
fetch-and-send-data Real-time shipment tracker with Auto interactivity 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED / ✅ fetch-and-send-data; tools: skill ✅ 0.19
author-component Author a data-loading search component 3.0/5 → 4.0/5 🟢 ✅ author-component; tools: skill, view ✅ 0.18
author-component Author a multi-step wizard with async validation and shared state 3.0/5 → 4.0/5 🟢 ✅ author-component; tools: skill, view ✅ 0.18
author-component Author a generic data table component 4.0/5 → 5.0/5 🟢 ✅ author-component; tools: skill, bash ✅ 0.18 [2]
author-component Author a real-time notification badge component 2.0/5 → 5.0/5 🟢 ✅ author-component; tools: skill, view ✅ 0.18
author-component Author a sortable list with code-behind pattern 4.0/5 → 4.0/5 ✅ author-component; tools: skill, bash / ✅ author-component; tools: skill ✅ 0.18 [3]
configure-auth Login and account management in a globally interactive app 5.0/5 → 5.0/5 ✅ configure-auth; tools: skill ✅ 0.08
configure-auth Multi-tier app with WebAssembly auth 3.0/5 → 5.0/5 🟢 ✅ configure-auth; tools: skill ✅ 0.08
plan-ui-change Project management Kanban board 2.0/5 → 4.0/5 🟢 ✅ plan-ui-change; tools: skill, edit 🟡 0.26
plan-ui-change E-commerce product catalog with filters and pagination 3.0/5 → 5.0/5 🟢 ✅ plan-ui-change; tools: skill, edit / ✅ plan-ui-change; tools: skill, grep, edit 🟡 0.26
plan-ui-change Multi-step job application wizard 2.0/5 → 4.0/5 🟢 ✅ plan-ui-change; tools: skill 🟡 0.26
plan-ui-change Application settings page with nested tab panels 3.0/5 → 4.0/5 🟢 ✅ plan-ui-change; tools: skill 🟡 0.26
plan-ui-change Team chat interface with message threads 2.0/5 → 5.0/5 🟢 ✅ plan-ui-change; tools: skill / ✅ plan-ui-change; tools: skill, edit 🟡 0.26
collect-user-input Event registration with custom validation 4.0/5 → 5.0/5 🟢 ✅ collect-user-input; tools: skill ✅ 0.11
collect-user-input Multi-step booking form with cross-field validation 4.0/5 → 4.0/5 ✅ collect-user-input; tools: skill ✅ 0.11

[1] (Plugin) Quality unchanged but weighted score is -1.0% due to: tokens (97378 → 115389)
[2] (Plugin) Quality unchanged but weighted score is -5.1% due to: tokens (26523 → 62538), tool calls (2 → 4)
[3] (Isolated) Quality unchanged but weighted score is -4.5% due to: tokens (26981 → 57359), tool calls (3 → 5), time (16.9s → 23.6s)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 754 in dotnet/skills, download eval artifacts with gh run download 27361292234 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/cc7dd466f10567e954b966daad6b55af2610b4cf/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@JanKrivanek JanKrivanek enabled auto-merge (squash) June 11, 2026 18:02
- Skip top-level reparse points in EnumerateCopiedFixtures (not just
  nested ones) so a top-level symlink/junction can't cause hashing of
  data outside the eval directory; code now matches the docstring.
- Record uses first-writer-wins (TryAdd) instead of overwriting, so a
  scenario identity recorded by multiple parallel targets yields a
  deterministic --baseline-out regardless of completion order.
- Persist the baseline judge result to the session DB even when the
  baseline is reused, so the registered 'baseline-reused' session record
  is complete for downstream investigation tooling (pairwise was already
  saved; the judge result was incorrectly gated on a fresh run).
- Add first-writer-wins test (564 pass).
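
The first-writer-wins rule for Record can be sketched in Python as below. The class and method names are hypothetical, and a lock stands in for whatever concurrent collection the C# code uses with TryAdd.

```python
import threading


class BaselineRecorder:
    """Toy store where the first record for a scenario identity is kept."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries = {}

    def record(self, prompt_sha: str, target_sha: str, metrics: dict) -> bool:
        """Store metrics unless the identity is already present; True if stored."""
        key = (prompt_sha, target_sha)
        with self._lock:
            if key in self._entries:
                return False  # a later writer never overwrites the first one
            self._entries[key] = metrics
            return True

    def snapshot(self) -> dict:
        """Return a copy of the recorded entries for persisting."""
        with self._lock:
            return dict(self._entries)
```

Once an identity has been recorded, later records for the same key are ignored, so the stored entry does not change for the rest of the run.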

Note: BaselineStore stays internal — the test project already has
InternalsVisibleTo, so it compiles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
auto-merge was automatically disabled June 11, 2026 18:08

Head branch was pushed to by a user without write access

@YuliiaKovalova

Member Author

Addressed the second round of review comments in 2b3cd2a:

  • EnumerateCopiedFixtures not skipping top-level reparse points: now skips reparse points for both files and directories at the top level (not just nested ones), so a top-level symlink/junction can't cause hashing of data outside the eval directory. The code now matches the docstring.
  • Record unconditionally overwriting: switched to first-writer-wins (TryAdd). When several targets evaluated in parallel share the same scenario identity, the persisted --baseline-out is now deterministic regardless of completion order (later identical-key records differing only by run-to-run noise are ignored). Added a test.
  • Reused baseline judge result not saved to the session DB: the baseline judge result is now persisted even when reused, so the registered baseline-reused session record is complete for downstream investigation tooling. (The pairwise result was already saved; only the judge-result save was incorrectly gated on a fresh run.)
  • BaselineStore internal vs. test access: no change needed — the test project already has InternalsVisibleTo in SkillValidator.csproj, so it compiles (564 tests pass).

The earlier seven comments were addressed in 82f487f / cc7dd46 (see prior summary).

@github-actions github-actions Bot added the waiting-on-author (PR state label) label Jun 11, 2026
@github-actions

Contributor

👋 @YuliiaKovalova — this PR has 11 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

@github-actions github-actions Bot added the pr-state/ready-for-eval (PR is mergeable and awaiting evaluation) and evaluate-now (Trigger evaluation.yml for current PR head (transient)) labels and removed the waiting-on-author (PR state label) label Jun 11, 2026
@YuliiaKovalova YuliiaKovalova force-pushed the dev/ykovalova/baseline-reuse branch from 2b3cd2a to c9f0f44 Compare June 12, 2026 09:39
The transitive MessagePack 2.5.198 (via GitHub.Copilot.SDK -> StreamJsonRpc)
has a high-severity vulnerability (GHSA-hv8m-jj95-wg3x) that fails the build
under TreatWarningsAsErrors. Pin a direct reference to the patched 2.5.301.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 10:11

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs
Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs
Comment thread eng/skill-validator/src/Evaluate/EvaluateCommand.cs Outdated
Comment thread eng/skill-validator/src/Evaluate/BaselineStore.cs
… consistent token attribution

- ComputeInputsSha: normalize evalPath via Path.GetFullPath so a bare filename
  still hashes sibling fixtures (avoids TargetSha collisions / unsafe reuse).
- RunMetrics.Clone(): per-run copy with fresh collections; reuse paths now clone
  the cached baseline so concurrent evaluations never share a mutable instance.
- Pairwise judge tokens attributed to both compared runs in every mode (the
  baseline clone makes this safe), keeping token deltas comparable across
  --baseline-from modes.
- Reword Record first-writer-wins doc to describe the within-run stabilization
  guarantee rather than order-independence.
- Add tests for bare-filename fixture hashing and clone isolation (566 pass).
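
The clone-on-reuse idea in the second bullet can be pictured with this Python sketch. The fields shown are hypothetical stand-ins; the real RunMetrics type has its own members that this PR conversation does not list.

```python
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """Toy metrics record with one scalar and one mutable collection."""
    tokens: int = 0
    tool_calls: list = field(default_factory=list)

    def clone(self) -> "RunMetrics":
        """Return a per-run copy whose collections are fresh objects."""
        return RunMetrics(tokens=self.tokens, tool_calls=list(self.tool_calls))


def reuse_baseline(cached: RunMetrics, pairwise_judge_tokens: int) -> RunMetrics:
    """Attribute judge tokens to a clone so the cached instance stays untouched."""
    run = cached.clone()
    run.tokens += pairwise_judge_tokens
    return run
```

Because each evaluation mutates only its own clone, attributing pairwise judge tokens to the baseline no longer risks corrupting the shared cached entry.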

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JanKrivanek

Member

/evaluate

github-actions Bot added a commit that referenced this pull request Jun 12, 2026
@github-actions

Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
microbenchmarking Investigate runtime upgrade performance impact 4.0/5 → 4.0/5 ✅ microbenchmarking; tools: skill, create / ⚠️ NOT ACTIVATED ✅ 0.11
clr-activation-debugging Diagnose unexpected FOD dialog from native build tool 5.0/5 → 5.0/5 ✅ clr-activation-debugging; tools: skill / ✅ clr-activation-debugging; tools: skill, glob ✅ 0.10 [1]
clr-activation-debugging Diagnose FOD suppressed but activation still failing 3.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill / ✅ clr-activation-debugging; tools: skill, glob ✅ 0.10
clr-activation-debugging Explain why same binary behaves differently under different launch methods 5.0/5 → 5.0/5 ✅ clr-activation-debugging; tools: skill / ✅ clr-activation-debugging; tools: skill, glob ✅ 0.10 [2]
clr-activation-debugging Analyze healthy managed EXE activation 4.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill, glob ✅ 0.10
clr-activation-debugging Identify multiple activation sequences in a single log 4.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill, glob ✅ 0.10 [3]
clr-activation-debugging Explain useLegacyV2RuntimeActivationPolicy in activation log 3.0/5 → 4.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Decline non-CLR-activation issue 1.0/5 → 4.0/5 🟢 ✅ clr-activation-debugging; tools: skill, glob / ℹ️ not activated (expected) ✅ 0.10
android-tombstone-symbolication Symbolicate .NET frames in an Android tombstone 2.0/5 → 3.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, bash, read_bash, stop_bash / ✅ android-tombstone-symbolication; tools: skill, bash, read_bash ✅ 0.19
android-tombstone-symbolication Recognize tombstone with no .NET frames 5.0/5 → 5.0/5 ✅ android-tombstone-symbolication; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.19 [4]
android-tombstone-symbolication Symbolicate CoreCLR frames in an Android tombstone 4.0/5 → 4.0/5 ✅ android-tombstone-symbolication; tools: skill ✅ 0.19 [5]
android-tombstone-symbolication Recognize NativeAOT tombstone with app binary and libSystem.Native.so 3.0/5 → 4.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, bash ✅ 0.19
android-tombstone-symbolication Symbolicate multi-thread tombstone 4.0/5 → 5.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill ✅ 0.19
android-tombstone-symbolication Handle .NET frames with no BuildId metadata 4.0/5 → 5.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill ✅ 0.19
android-tombstone-symbolication Symbolicate tombstone with multiple .NET libraries and different BuildIds 3.0/5 → 2.0/5 ⏰ 🔴 ✅ android-tombstone-symbolication; tools: skill, read_bash / ✅ android-tombstone-symbolication; tools: skill ✅ 0.19
android-tombstone-symbolication Reject iOS crash log as wrong format 4.0/5 → 5.0/5 🟢 ℹ️ not activated (expected) ✅ 0.19
apple-crash-symbolication Parse .NET frames and locate dSYMs from an iOS crash log 3.0/5 → 4.0/5 🟢 ✅ apple-crash-symbolication; tools: skill, bash ✅ 0.17
apple-crash-symbolication Investigate root cause of a .NET MAUI iOS crash 2.0/5 → 3.0/5 🟢 ✅ apple-crash-symbolication; tools: skill, bash ✅ 0.17
apple-crash-symbolication Reject Android tombstone passed as iOS crash log 4.0/5 → 5.0/5 🟢 ℹ️ not activated (expected) ✅ 0.17 [6]
dump-collect Configure automatic crash dumps for CoreCLR app on Linux 5.0/5 → 5.0/5 ✅ dump-collect; tools: report_intent, skill, view / ✅ dump-collect; tools: skill, report_intent, view 🟡 0.30 [7]
dump-collect Set up NativeAOT crash dumps with createdump in Kubernetes 3.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill, view 🟡 0.30
dump-collect Recover crash dump from macOS NativeAOT without createdump 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED 🟡 0.30
dump-collect Configure CoreCLR dump collection in Alpine Docker as non-root 3.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill, report_intent, view 🟡 0.30
dump-collect Advisory: macOS NativeAOT crash dump recovery steps 4.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.30
dump-collect Advisory: CoreCLR Alpine Docker non-root configuration 4.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill, report_intent, view 🟡 0.30
dump-collect Advisory: NativeAOT Kubernetes dump collection setup 3.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill, report_intent, view 🟡 0.30
dump-collect Detect runtime and configure crash dumps for unknown .NET app on Linux 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill / ✅ dump-collect; tools: skill, view 🟡 0.30
dump-collect Decline dump analysis request 2.0/5 → 2.0/5 ℹ️ not activated (expected) 🟡 0.30 [8]
dotnet-trace-collect High CPU in Kubernetes on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view 🟡 0.21 [9]
dotnet-trace-collect .NET Framework on Windows without admin privileges 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill / ✅ dotnet-trace-collect; tools: skill 🟡 0.21
dotnet-trace-collect .NET 10 on Linux with root access and native call stacks 2.0/5 → 3.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view 🟡 0.21
dotnet-trace-collect Memory leak on Linux (.NET 8) 3.0/5 → 3.0/5 ✅ dotnet-trace-collect; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED 🟡 0.21
dotnet-trace-collect Slow requests on Windows with PerfView 5.0/5 → 5.0/5 ✅ dotnet-trace-collect; tools: skill / ✅ dotnet-trace-collect; tools: report_intent, skill 🟡 0.21 [10]
dotnet-trace-collect Excessive GC on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: report_intent, skill 🟡 0.21
dotnet-trace-collect Hang or deadlock diagnosis on Linux 3.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: skill 🟡 0.21
dotnet-trace-collect Windows container high CPU with PerfView 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view, grep 🟡 0.21
dotnet-trace-collect Long-running intermittent issue with PerfView triggers 3.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, view 🟡 0.21
dotnet-trace-collect Linux pre-.NET 10 needing native call stacks 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: report_intent, skill, view 🟡 0.21
dotnet-trace-collect Windows modern .NET with admin high CPU 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, view / ⚠️ NOT ACTIVATED 🟡 0.21 [11]
dotnet-trace-collect Memory leak on .NET Framework Windows 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: skill 🟡 0.21
dotnet-trace-collect Kubernetes with console access prefers console tools 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, view / ✅ dotnet-trace-collect; tools: skill 🟡 0.21
dotnet-trace-collect Container installation without .NET SDK 4.0/5 → 4.0/5 ✅ dotnet-trace-collect; tools: report_intent, skill, view 🟡 0.21 [12]
dotnet-trace-collect HTTP 500s from downstream service on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill, view / ✅ dotnet-trace-collect; tools: skill 🟡 0.21
dotnet-trace-collect Networking timeouts on Windows with admin (.NET 8) 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view 🟡 0.21
dotnet-trace-collect Assembly loading failure on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view / ✅ dotnet-trace-collect; tools: skill 🟡 0.21 [13]
analyzing-dotnet-performance Detects compiled regex startup budget and regex chain allocations 4.0/5 → 4.0/5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15
analyzing-dotnet-performance Detects CurrentCulture comparer and compiled regex budget in inflection rules 5.0/5 → 5.0/5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15 [14]
analyzing-dotnet-performance Finds per-call Dictionary allocation not hoisted to static 5.0/5 → 5.0/5 ✅ analyzing-dotnet-performance; tools: skill, bash / ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15
analyzing-dotnet-performance Catches compound allocations in recursive number converter with ToLower 5.0/5 → 4.0/5 🔴 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15
analyzing-dotnet-performance Finds StringComparison.Ordinal missing and FrozenDictionary opportunities 5.0/5 → 5.0/5 ✅ analyzing-dotnet-performance; tools: skill, grep / ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15 [15]
analyzing-dotnet-performance Detects Aggregate+Replace chain and struct missing IEquatable 3.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill, bash / ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15
analyzing-dotnet-performance Finds branched Replace chain in format string manipulation 3.0/5 → 4.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15
analyzing-dotnet-performance Catches LINQ on hot-path string processing and All(char.IsUpper) 4.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill, bash / ✅ analyzing-dotnet-performance; tools: skill, grep ✅ 0.15
analyzing-dotnet-performance Detects LINQ pipeline in TimeSpan formatting and collection processing 4.0/5 → 3.0/5 🔴 ✅ analyzing-dotnet-performance; tools: skill / ✅ analyzing-dotnet-performance; tools: skill, bash ✅ 0.15 [16]
analyzing-dotnet-performance Flags Span inconsistencies and compound method chains in truncation library 4.0/5 → 4.0/5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15 [17]
analyzing-dotnet-performance Identifies unsealed leaf classes and locale hierarchy patterns 3.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill, grep / ✅ analyzing-dotnet-performance; tools: skill ✅ 0.15
exp-mock-usage-analysis Detect unused and unreachable mock setups 4.0/5 → 4.0/5 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.08
exp-mock-usage-analysis Detect redundant mock configurations duplicated across tests 3.0/5 → 3.0/5 ✅ exp-mock-usage-analysis; tools: skill, edit ✅ 0.08
exp-mock-usage-analysis Detect mocking of stable framework types 3.0/5 → 5.0/5 🟢 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.08
exp-mock-usage-analysis Analyze mock usage in NSubstitute tests 3.0/5 → 3.0/5 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.08
exp-mock-usage-analysis Analyze mock usage in FakeItEasy tests 4.0/5 → 5.0/5 🟢 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.08 [18]
exp-mock-usage-analysis Detect excessive mock configuration sprawl 3.0/5 → 4.0/5 🟢 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.08
exp-test-maintainability Recommend data-driven patterns with display names for unclear parameters 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED ✅ 0.14 [19]
exp-test-maintainability Recognize well-maintained tests that need minimal changes 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED / ✅ exp-test-maintainability; tools: report_intent, skill ✅ 0.14 [20]
exp-test-maintainability Detect repeated object construction and setup across test methods 2.0/5 → 4.0/5 🟢 ✅ exp-test-maintainability; tools: skill ✅ 0.14
exp-test-maintainability Recognize tests with minimal boilerplate that need no refactoring 4.0/5 → 5.0/5 🟢 ✅ exp-test-maintainability; tools: skill ✅ 0.14 [21]
exp-simd-vectorization Optimize manual min/max with TensorPrimitives 1.0/5 → 5.0/5 🟢 ✅ exp-simd-vectorization; tools: skill, glob, create, bash 🟡 0.20
exp-simd-vectorization Optimize manual product with TensorPrimitives 1.0/5 → 5.0/5 🟢 ✅ exp-simd-vectorization; tools: skill, glob / ✅ exp-simd-vectorization; tools: skill, glob, create, bash 🟡 0.20
exp-simd-vectorization No optimization opportunity — dictionary-based lookup service 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED 🟡 0.20
exp-simd-vectorization Optimize int array conditional increment with SIMD 4.0/5 → 4.0/5 ✅ exp-simd-vectorization; tools: skill 🟡 0.20 [22]
exp-simd-vectorization Optimize byte buffer bit reversal with SIMD 4.0/5 → 4.0/5 ✅ exp-simd-vectorization; tools: skill 🟡 0.20

[1] (Plugin) Quality unchanged but weighted score is -6.5% due to: tokens (42962 → 80074), time (26.7s → 44.6s), tool calls (5 → 6)
[2] (Isolated) Quality unchanged but weighted score is -1.4% due to: tokens (41077 → 72664), tool calls (4 → 5)
[3] (Isolated) Quality improved but weighted score is -2.3% due to: tokens (40979 → 72460), time (21.1s → 28.4s), tool calls (3 → 4)
[4] (Isolated) Quality unchanged but weighted score is -5.6% due to: tokens (26407 → 45326), tool calls (2 → 3), time (14.4s → 18.7s)
[5] (Plugin) Quality unchanged but weighted score is -4.2% due to: tokens (60204 → 149446), tool calls (5 → 9)
[6] (Plugin) Quality improved but weighted score is -12.1% due to: completion (✓ → ✗), tokens (27601 → 88815), tool calls (2 → 6), time (22.6s → 36.9s)
[7] (Plugin) Quality unchanged but weighted score is -8.7% due to: tokens (12863 → 46395), tool calls (0 → 3), time (11.1s → 16.6s)
[8] (Plugin) Quality unchanged but weighted score is -1.1% due to: tokens (12649 → 14261)
[9] (Isolated) Quality improved but weighted score is -5.2% due to: tokens (13219 → 53318), tool calls (0 → 3), time (19.0s → 25.8s)
[10] (Plugin) Quality unchanged but weighted score is -7.8% due to: tokens (12930 → 34975), tool calls (0 → 2)
[11] (Plugin) Quality unchanged but weighted score is -0.1% due to: quality
[12] (Plugin) Quality unchanged but weighted score is -13.2% due to: tokens (12932 → 56799), quality, tool calls (0 → 3), time (12.0s → 20.0s)
[13] (Isolated) Quality improved but weighted score is -8.3% due to: tokens (13241 → 74264), tool calls (0 → 4), time (19.6s → 26.0s)
[14] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (29403 → 84200), tool calls (2 → 7), time (21.7s → 66.2s)
[15] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (29625 → 79738), tool calls (2 → 6), time (17.8s → 52.5s)
[16] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (29874 → 99946), tool calls (2 → 5), time (23.0s → 54.4s)
[17] (Plugin) Quality unchanged but weighted score is -4.7% due to: tokens (45201 → 98830), tool calls (3 → 7), time (27.6s → 61.3s)
[18] (Isolated) Quality improved but weighted score is -1.6% due to: tokens (27669 → 44598), time (16.5s → 27.9s), tool calls (3 → 4)
[19] (Isolated) Quality unchanged but weighted score is -15.2% due to: judgment, quality
[20] (Plugin) Quality unchanged but weighted score is -3.1% due to: tokens (13534 → 30720), tool calls (0 → 2), time (14.7s → 26.5s)
[21] (Plugin) Quality unchanged but weighted score is -6.3% due to: tokens (39993 → 60020), time (16.7s → 37.2s), tool calls (4 → 6)
[22] (Isolated) Quality unchanged but weighted score is -16.0% due to: judgment, quality
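The footnotes above report resource changes as `metric (before → after)` pairs. As a rough aid for reading them, the sketch below extracts those pairs from one footnote line and computes a growth ratio. It assumes the left value is the baseline run and the right value is the skilled run. It does not reproduce the weighted score, whose formula is not stated here. The names `parse_footnote` and `growth` are hypothetical helpers, not part of skill-validator.

```python
from __future__ import annotations

import re

# Matches numeric changes such as "tokens (42962 → 80074)" or "time (26.7s → 44.6s)".
# Non-numeric entries like "completion (✓ → ✗)" or bare "quality" are ignored.
CHANGE = re.compile(r"(tokens|time|tool calls) \(([\d.]+)s? → ([\d.]+)s?\)")


def parse_footnote(line: str) -> dict[str, tuple[float, float]]:
    """Return {metric: (before, after)} for every numeric change in a footnote."""
    return {
        m.group(1): (float(m.group(2)), float(m.group(3)))
        for m in CHANGE.finditer(line)
    }


def growth(before: float, after: float) -> float | None:
    """Ratio after/before, or None when 'before' is zero (e.g. tool calls 0 → 3)."""
    return after / before if before else None


note = (
    "[1] (Plugin) Quality unchanged but weighted score is -6.5% due to: "
    "tokens (42962 → 80074), time (26.7s → 44.6s), tool calls (5 → 6)"
)
changes = parse_footnote(note)
for metric, (before, after) in changes.items():
    print(metric, before, after, growth(before, after))
```

Footnotes that list only `judgment` or `quality`, such as [19] and [22], yield an empty mapping because they carry no numeric pairs.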

timeout — run(s) hit the (120s, 300s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 754 in dotnet/skills, download eval artifacts with gh run download 27418155760 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/79ffa9c33a8b48b07e9a9bd7a11a932112c6092e/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.
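Once the `gh run download` command above has populated `./eval-results`, a quick inventory of the downloaded files can help before handing them to an agent. The sketch below only assumes that the artifacts contain files named `results.json`, as the prompt says. It makes no assumption about their schema and reports just the top-level shape of each file. The function names `load_results` and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def load_results(root="./eval-results"):
    """Map each results.json path under root to its parsed JSON content."""
    loaded = {}
    base = Path(root)
    if not base.is_dir():
        # Nothing downloaded yet: return an empty mapping instead of failing.
        return loaded
    for path in sorted(base.rglob("results.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[str(path)] = json.load(handle)
    return loaded


def summarize(loaded):
    """One line per file: the path plus its top-level keys or list length."""
    lines = []
    for path, data in loaded.items():
        if isinstance(data, dict):
            shape = "keys: " + ", ".join(sorted(data))
        elif isinstance(data, list):
            shape = f"list of {len(data)} items"
        else:
            shape = type(data).__name__
        lines.append(f"{path}: {shape}")
    return lines


for line in summarize(load_results()):
    print(line)
```

The schema itself is described in the `InvestigatingResults.md` document referenced in the prompt, so any deeper analysis should follow that file and not this sketch.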

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions

@YuliiaKovalova YuliiaKovalova merged commit bcaa918 into dotnet:main Jun 12, 2026
39 checks passed

Labels

evaluate-now Trigger evaluation.yml for current PR head (transient) pr-state/ready-for-eval PR is mergeable and awaiting evaluation

Development

Successfully merging this pull request may close these issues.

skill-validator: support precomputed/shared baseline reuse (--baseline-from / --baseline-out)
