Add reusable shared baseline to skill-validator evaluate (--baseline-out / --baseline-from)#754
Conversation
…tnet#751) Add --baseline-out and --baseline-from options to the evaluate command so the no-skill/no-agent baseline arm can be computed once and reused as a shared control group across multiple skill/agent evaluations. This eliminates redundant baseline runs and removes baseline run-to-run variance from cross-config comparisons. - New BaselineStore + BaselineFile/BaselineScenarioEntry models, keyed per scenario on SHA-256(prompt) with a header recording version, model, validator version and runs. Load validates version + model and fails fast on mismatch or missing scenarios. - Register the new serializable types in SkillValidatorJsonContext (AOT source-gen). - Wire two mutually-exclusive CLI options into ValidatorConfig; thread an optional BaselineStore through both execution paths. - On reuse, skip the baseline agent run, its assertions/constraints/ task-completion/judging, and attribute no extra pairwise tokens to the baseline; report the scenario with the baseline-reused session phase and a reused status. In write mode, record each scenario's averaged baseline and persist it after the run. - Add unit tests for BaselineStore and document the feature in README and InvestigatingResults. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Align baseline reuse with the (prompt, model, targetSha) identity contract from the upstream eval-harness design: previously the per-scenario reuse key was the prompt SHA + model only, so two scenarios that share an identical prompt but feed the agent different input artifacts (e.g. a different build.binlog) would collide and silently reuse the wrong baseline. - Add BaselineScenarioEntry.TargetSha: a SHA-256 over the scenario's materialized inputs — files auto-copied via copy_test_files, explicit setup files (inline content or copied sources), and the setup command recipe. The reuse key is now (promptSha, targetSha); both must match. Bump the on-disk schema to version 2. - Memoize target hashing per process via a cheap, file-I/O-free setup signature to avoid re-hashing large fixtures across the N runs. - Thread the originating eval.yaml path into Record/TryGetBaseline/ FindMissingScenarios so inputs can be fingerprinted. - Tests: target SHA is stable and content-sensitive; same-prompt/different-fixture scenarios do not reuse each other's baseline and are surfaced by FindMissingScenarios. Update README and InvestigatingResults. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Note This PR is from a fork and modifies infrastructure files ( Changes to infrastructure typically need to be submitted from a branch in Please consider recreating this PR from an upstream branch. If you don't have push access to |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds shared-baseline persistence and reuse to skill-validator evaluate, enabling faster and less noisy comparisons across multiple skills/agents by skipping redundant baseline runs when a matching baseline file is provided.
Changes:
- Introduces
--baseline-out(write) /--baseline-from(reuse) options and threads aBaselineStorethrough evaluation execution. - Implements on-disk baseline schema + prompt/fixture identity hashing to prevent cross-scenario contamination.
- Adds unit tests and updates docs to explain baseline reuse behavior and reporting.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| eng/skill-validator/src/Evaluate/EvaluateCommand.cs | Adds CLI options, validation, baseline preflight checks, and execution-path changes for baseline reuse/write. |
| eng/skill-validator/src/Evaluate/BaselineStore.cs | New baseline file format + keyed storage and hashing for prompt + setup/fixtures. |
| eng/skill-validator/tests/Evaluate/BaselineStoreTests.cs | Adds tests for hashing determinism, save/load, model/version validation, and fixture-sensitive reuse. |
| eng/skill-validator/src/docs/InvestigatingResults.md | Documents how reused baselines appear in results/phases and how identity is determined. |
| eng/skill-validator/src/README.md | Documents new flags and adds a “Shared baseline reuse” section. |
| eng/skill-validator/src/SkillValidatorJsonContext.cs | Registers baseline types for source-generated System.Text.Json serialization. |
| eng/skill-validator/src/Evaluate/Models.cs | Adds config fields for baseline out/from options. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address rubber-duck review findings on the shared-baseline feature: - Fix Sha256Hex 32-byte bug: a 32-byte input was treated as an already-computed digest and not hashed. Split into Sha256Hex (always hash) + HexDigest (encode existing digest). - Broaden reuse identity: the cached baseline RunResult depends on the judge model and on per-scenario evaluation criteria (rubric, assertions, expect/reject tools, turn/token/timeout limits). Add JudgeModel to the baseline header (validated on load) and fold the criteria into the per-scenario targetSha so changing them invalidates reuse instead of silently serving a stale result. - Mirror AgentRunner.SetupWorkDir exactly when hashing copied fixtures: exclude only the top-level eval.yaml (nested eval.yaml files are copied, so they must be hashed). - Make the target-SHA cache instance-scoped (memoizing only the expensive fixture-input hashing) so it can't serve stale hashes or leak across evaluations/tests; hash inline file Content in the cache key instead of embedding it. - Deterministic Save ordering (Name, PromptSha, TargetSha); guard Load against null Scenarios; enrich FindMissingScenarios output with the eval path. - Document that setup commands are fingerprinted by recipe, so reuse assumes they are deterministic/hermetic. - Tests + docs updated; add judge-model-mismatch and criteria-identity tests (562 pass). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Mirror AgentRunner.SetupWorkDir exactly when hashing copied fixtures: enumerate only the files actually copied (top-level siblings except eval.yaml, recursing into directories) and skip reparse points and out-of-root junctions, instead of blindly hashing every file under the eval directory. This keeps the fixture identity restricted to the intentionally-copied set so stray output/log files can't poison reuse. - Stream baseline JSON to/from disk (File.OpenRead/File.Create with JsonSerializer) so large baselines never materialize as one giant in-memory string. - Enrich the fail-fast 'missing scenario' output with the eval path and short prompt/target SHA prefixes so it is actionable when scenario names collide across eval files. - Add a test locking in recursive (nested-directory) fixture hashing. 563 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed the review comments in 82f487f and cc7dd46:
Also broadened the reuse identity to cover the judge model and per-scenario criteria (rubric/assertions/expect+reject tools/turn-token-timeout limits), and fixed a |
|
/evaluate |
Skill Validation Results
[1] (Plugin) Quality unchanged but weighted score is -1.0% due to: tokens (97378 → 115389) Model: claude-opus-4.6 | Judge: claude-opus-4.6 🔍 Full Results - additional metrics and failure investigation steps
▶ Sessions Visualisation -- interactive replay of all evaluation sessions |
- Skip top-level reparse points in EnumerateCopiedFixtures (not just nested ones) so a top-level symlink/junction can't cause hashing of data outside the eval directory; code now matches the docstring. - Record uses first-writer-wins (TryAdd) instead of overwriting, so a scenario identity recorded by multiple parallel targets yields a deterministic --baseline-out regardless of completion order. - Persist the baseline judge result to the session DB even when the baseline is reused, so the registered 'baseline-reused' session record is complete for downstream investigation tooling (pairwise was already saved; the judge result was incorrectly gated on a fresh run). - Add first-writer-wins test (564 pass). Note: BaselineStore stays internal — the test project already has InternalsVisibleTo, so it compiles. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Head branch was pushed to by a user without write access
|
Addressed the second round of review comments in 2b3cd2a:
The earlier seven comments were addressed in 82f487f / cc7dd46 (see prior summary). |
|
👋 @YuliiaKovalova — this PR has 11 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the |
2b3cd2a to
c9f0f44
Compare
The transitive MessagePack 2.5.198 (via GitHub.Copilot.SDK -> StreamJsonRpc) has a high-severity vulnerability (GHSA-hv8m-jj95-wg3x) that fails the build under TreatWarningsAsErrors. Pin a direct reference to the patched 2.5.301. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… consistent token attribution - ComputeInputsSha: normalize evalPath via Path.GetFullPath so a bare filename still hashes sibling fixtures (avoids TargetSha collisions / unsafe reuse). - RunMetrics.Clone(): per-run copy with fresh collections; reuse paths now clone the cached baseline so concurrent evaluations never share a mutable instance. - Pairwise judge tokens attributed to both compared runs in every mode (the baseline clone makes this safe), keeping token deltas comparable across --baseline-from modes. - Reword Record first-writer-wins doc to describe the within-run stabilization guarantee rather than order-independence. - Add tests for bare-filename fixture hashing and clone isolation (566 pass). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/evaluate |
Skill Validation Results
[1] (Plugin) Quality unchanged but weighted score is -6.5% due to: tokens (42962 → 80074), time (26.7s → 44.6s), tool calls (5 → 6)
Model: claude-opus-4.6 | Judge: claude-opus-4.6 🔍 Full Results - additional metrics and failure investigation steps
▶ Sessions Visualisation -- interactive replay of all evaluation sessions |
Implements #751 — adds a reusable, shared no-skill/no-agent baseline ("shared control group") to
skill-validator evaluateso the baseline arm can be computed once and reused across many invocations, eliminating redundant baseline runs and removing baseline run-to-run variance from cross-config comparisons.What's new
--baseline-out <path>— after the run, persist each scenario's averaged baseline (honoring--runs) for later reuse.--baseline-from <path>— reuse a precomputed baseline instead of re-running the baseline arm. The two options are mutually exclusive.Identity & safety
The baseline file is keyed per scenario on
(promptSha, targetSha)and carries a header with the schema version and--model. Following the(prompt, model, targetSha)contract:copy_test_files, explicit setup files, and the setup-command recipe). This prevents two cases that share a prompt but feed different fixtures (e.g. a differentbuild.binlog) from reusing each other's baseline.Behavior on reuse
When a cached baseline matches, the baseline agent run (and its assertions/constraints/task-completion/judging) is skipped; the cached metrics + judge result are used for deltas and pairwise/independent judging. Such scenarios are reported with the
baseline-reusedsession phase and areusedbaseline status. Pairwise judging runs in the skilled run's work dir (the cached baseline's work dir no longer exists) and does not re-attribute tokens to the baseline.Tests & docs
BaselineStoreTests(9 facts): prompt-SHA determinism, save/load round-trip, model-mismatch / unsupported-version / missing-file failures,FindMissingScenarios, write-store-not-reuse, target-SHA stability + content sensitivity, and same-prompt/different-fixture non-reuse.eng/skill-validator/src/README.md(examples, flags table, "Shared baseline reuse" section) anddocs/InvestigatingResults.md.Closes #751.
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com