Releases: Anyesh/skillprobe
Releases · Anyesh/skillprobe
v0.5.0
[0.5.0] - 2026-04-12
Skill combination testing. Multi-skill scenarios, a matrix expansion block for sweeping one skill against many, and a baseline diff mode that separates real combination regressions from natural model variance. A measure subcommand for characterizing scenario variance before picking thresholds. A local content-keyed run cache so repeated iterations do not burn the same API budget twice.
Added
- Multi-skill workspace support:
skills:list at suite level loads more than one skill per scenario. Backward compatible with the existing singularskill:field. matrix:expansion block for sweeping one base skill against apair_withlist. Each scenario runs once per pairing; reporter groups results by pairing.--baselinemode that runs every matrix pairing three times (solo A, solo B, combined) and classifies each assertion into one ofregression,shared_failure,flaky, orok. Accompanied by--regression-margin(default 0.15) and--baseline-runs(default 5, with a warning when below 10). Per-pairing and grand-total cost lines at the end of baseline runs.skillprobe measure <test.yaml> [--runs N]subcommand that runs each scenario N times and reports per-assertion pass rate, 95 percent Wilson confidence interval, and a variance classification ofdeterministic,probabilistic,noisy, orunreliable. Human-readable output by default,--jsonfor machine-readable.- Content-keyed local run cache at
~/.cache/skillprobe/runs/(or$XDG_CACHE_HOME/skillprobe/runs/). Key includes SHA256 of skill file contents, prompt, model, harness, and skillprobe version. TTL 24 hours, configurable viaSKILLPROBE_CACHE_TTL_HOURS. Flags--no-cache,--force-refresh,--cache-dir. Cached scenarios are marked[cache hit]in the reporter output. - Loader validates assertion type names against the real handler registry at parse time. Unknown types raise
ValueErrorwith the list of valid types, rather than being silently skipped at runtime. - Loader discriminates scenario files from activation files: a file that contains an
activation:block is rejected byload_scenario_suite, and vice versa forload_activation_suite. Files missing both blocks raise a clear error. - Loader detects skill name collisions in
skills:lists at parse time: two skill paths that map to the same target directory inside.claude/skills/or.cursor/skills/raise before any workspace is created. - Bundled
write-testsskill teaches the new YAML surface: combination tests, matrix expansion, baseline mode, and the measure command. - README sections for combinations, matrix expansion, baseline mode, measure, caching behavior, and PyPI/CI badges at the top.
CHANGELOG.mdand a GitHub Release creation step in the publish workflow.
Changed
WorkspaceManager.createsignature changed fromskill: Path | Nonetoskills: list[Path] | None. The orchestrator and activation runner are updated to pass the list form. Existing scenario YAMLs using the singularskill:field continue to work without modification because the loader normalizes into the list form.- Cursor adapter now captures stderr from the subprocess and raises a clear
RuntimeErrorwhen the process exited non-zero with stderr content, or when the parsed stdout has no assistant text and no tool calls and stderr is non-empty. Previously, cursor usage-limit and invalid-model errors were silently written to stderr while the adapter returned an emptyresponse_textthat caused every downstream assertion to fail as "not found in response." --max-costnow emits a warning and clears the value when used with a harness other than claude-code. It was previously a silent no-op on cursor.- Reporter adds a
[cache hit]marker next to scenario lines where every step was fully served from the run cache, so replayed cost numbers on cached lines are not mistaken for new spend.
Fixed
WorkspaceManager.createno longer silently skips missing skill paths. It collects every missing path and raisesFileNotFoundErrorwith all of them listed, so typos inskill:orskills:surface immediately rather than producing workspaces with fewer skills than declared.examples/tests/test-combo-sample.yamlscenario 2 threshold tuned tomin_pass_rate: 0.6based on a real 10-run variance probe against claude-haiku-4-5 that measured the observed pass rate at 70 percent with a CI spanning 0.40 to 0.89. The previous 0.8 threshold was above observed variance and produced flaky shipped examples.examples/tests/combo-exemplars/test-activation-formatters.yamlwas using a non-existentskill_loadedassertion type; renamed toskill_activated. Later deleted along with the rest of thecombo-exemplars/directory because the exemplar was fundamentally broken by design (identical-description fixture skills made activation a coin flip).
Full Changelog: v0.4.0...v0.5.0