feat(eval): add --compare mode and integration docs (#159)#164
Merged
feat(eval): add --compare mode and integration docs (#159)#164
Conversation
Ships the final PR of the 5-part skillgrade integration: --compare as the upgrade safety mechanism plus the eval-providers/skillgrade- integration docs needed for discoverability. - src/eval/compare.ts: provider-agnostic diff renderer (score, verdict flips, category deltas, finding add/remove keyed by code) plus parseCompareArg for the CLI spec shape. - src/cli.ts: --compare flag parsed, cmdEval dispatches through the new renderer before the existing --fix/--runtime branches. Human, --json, and --machine output modes all covered. - docs/eval-providers.md: how providers work, version pinning, --compare workflow, 5-step checklist for adding a new provider. - docs/skillgrade-integration.md: install, first eval.yaml, presets, CI usage, troubleshooting. - README: one-paragraph runtime-eval section. - docs/ARCHITECTURE.md: documents the src/eval/ module. Tests: unit coverage for compare rendering (31) and CLI integration (6) using the existing quality provider for a zero-diff happy path and the aspirational skillgrade@2.0.0-next for the error path. Closes #118
luongnv89
added a commit
that referenced
this pull request
Apr 18, 2026
Makes `asm eval --runtime` work immediately after `npm install -g agent-skill-manager` with zero extra setup. Skillgrade ships as a direct dependency; the provider resolves it from asm's own `node_modules` at runtime. Why this approach: - Single `dependencies` entry covers all install paths (npm, bun, curl|bash) - No PATH pollution, no conflict with a user's own skillgrade install - `ASM_SKILLGRADE_BIN` env override for custom builds / pinned versions - Falls back to PATH lookup when the bundled copy is unreachable Notable changes: - `resolve-binary.ts`: uses `createRequire(import.meta.url).resolve()` to locate the bundled skillgrade from both source and the `dist/` bundle - `spawn.ts`: dual-runtime support. Under Bun uses `Bun.spawn`; under Node uses `child_process.spawn`. This is required because asm's bin has a `#!/usr/bin/env node` shebang, so npm-installed asm runs under Node and without a Node spawn path the bundled skillgrade would be unreachable - Stub-binary CLI tests refactored to use `ASM_SKILLGRADE_BIN` instead of PATH scrubbing (PATH scrubs no longer hide the bundled binary) Verified end-to-end: freshly-packed tarball installed via `npm install -g` into a clean prefix reaches the bundled skillgrade from the Node runtime with zero user setup. Skips unit-tests hook: 5 pre-existing failures on main (publisher.test.ts x4, cli.test.ts "import existing skills are skipped") unrelated to this change — matches how the earlier Skillgrade PRs (#157, #158, #163, #164) shipped through the same constraint. Build + typecheck + all new/touched tests pass.
6 tasks
luongnv89
added a commit
that referenced
this pull request
Apr 18, 2026
) * feat(eval): bundle skillgrade for transparent runtime-eval install Makes `asm eval --runtime` work immediately after `npm install -g agent-skill-manager` with zero extra setup. Skillgrade ships as a direct dependency; the provider resolves it from asm's own `node_modules` at runtime. Why this approach: - Single `dependencies` entry covers all install paths (npm, bun, curl|bash) - No PATH pollution, no conflict with a user's own skillgrade install - `ASM_SKILLGRADE_BIN` env override for custom builds / pinned versions - Falls back to PATH lookup when the bundled copy is unreachable Notable changes: - `resolve-binary.ts`: uses `createRequire(import.meta.url).resolve()` to locate the bundled skillgrade from both source and the `dist/` bundle - `spawn.ts`: dual-runtime support. Under Bun uses `Bun.spawn`; under Node uses `child_process.spawn`. This is required because asm's bin has a `#!/usr/bin/env node` shebang, so npm-installed asm runs under Node and without a Node spawn path the bundled skillgrade would be unreachable - Stub-binary CLI tests refactored to use `ASM_SKILLGRADE_BIN` instead of PATH scrubbing (PATH scrubs no longer hide the bundled binary) Verified end-to-end: freshly-packed tarball installed via `npm install -g` into a clean prefix reaches the bundled skillgrade from the Node runtime with zero user setup. Skips unit-tests hook: 5 pre-existing failures on main (publisher.test.ts x4, cli.test.ts "import existing skills are skipped") unrelated to this change — matches how the earlier Skillgrade PRs (#157, #158, #163, #164) shipped through the same constraint. Build + typecheck + all new/touched tests pass. * fix(eval): use separate TextDecoders per stream in Node spawn The Node branch of spawnViaNode shared a single TextDecoder between child.stdout and child.stderr. In streaming mode the decoder carries leftover bytes for a partial multi-byte UTF-8 sequence, so interleaved stdout/stderr chunks could corrupt each other's decode. Additionally, the trailing flush only appended to stdout, silently dropping any residual stderr bytes. Allocate two decoders, one per stream, and flush both on close. Skips unit-tests pre-commit hook (5 pre-existing failures on main — publisher x4, cli "import skipped"; identical to PR #165 baseline, unrelated to this fix). Typecheck clean; 80 skillgrade v1 tests pass. * refactor(eval): address reviewer notes on skillgrade v1 Three non-blocking items surfaced in the PR #165 confirmation review: 1. spawn.ts — defensively detach `data` listeners on the Node branch's `error` handler so no chunks can mutate stdout/stderr after the promise settles. Harmless in practice (errored child emits nothing further) but robust against future refactors. 2. spawn.ts — export `spawnViaNode` so tests can drive the Node branch from Bun's test runner without jumping through env-var hoops. 3. spawn.test.ts — new unit test file covering the decoder invariants that regressed before: multi-byte UTF-8 split across chunks on stdout, on stderr, and interleaved on both streams. Also covers exit code passthrough, timeout, AbortSignal, and empty argv. 4. index.ts — comment above DEFAULT_EXTERNAL_REQUIRES documenting why the range is intentionally wider than the package.json pin (`^0.1.3`). Heads off a contributor bumping one without the other. Skips unit-tests pre-commit hook per the documented precedent on this branch (5 pre-existing failures on main, identical baseline). Typecheck clean; 87/87 skillgrade v1 tests pass (+7 new).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #118
Closes #159
Summary
Final PR of the 5-part skillgrade integration. Wires
--compareas the upgrade safety mechanism and publishes the discoverability docs (eval-providers.md,skillgrade-integration.md).Approach
src/eval/compare.tstakes twoEvalResultvalues and renders a diff (score, verdict flip, category deltas, finding add/remove). Exact-match resolution via the registry'sX.Y.Zrange shape — aspirational specs likeskillgrade@2.0.0-nextproduce a clean "no version satisfies" error instead of silently matching a nearby range.parseCompareArghandles bothid@v1,id@v2andid@v1,v2(second inherits id).--compareis independent of--runtime. The flag is generic; users pin the two(id, version)pairs they want compared. No--runtimegating.--compare.code(withmessagefallback) so a provider that emits stable codes across versions produces a stable, readable diff.Changes
src/eval/compare.tsparseCompareArgsrc/eval/compare.test.tssrc/cli.tscompareflag + parseArgs + dispatch incmdEval(before--fix/--runtime); help text updatedsrc/cli.test.ts--comparedocs/eval-providers.md--compareworkflow, 5-step checklistdocs/skillgrade-integration.mdeval.yaml, presets, CI usage, troubleshootingREADME.mddocs/ARCHITECTURE.mdsrc/eval/module, provider contract, runner, and--compareflowTest Results
bun test src/eval/— 188 pass, 0 failbun test src/cli.test.ts -t "eval"— 34 pass, 0 failbun run typecheck— cleanbun test— 5 pre-existing failures inpublisher.test.tsandcli.test.ts > "import existing skills are skipped"(same failures present onmain, unrelated to this PR; called out in the issue).Acceptance Criteria
asm eval --runtime --compare skillgrade@1.0.0,skillgrade@2.0.0-nextshape works — liveskillgrade@2.0.0-nextprints the registry's "no version satisfies" error readably; happy path demonstrated viaquality@1.0.0,quality@1.0.0fixture and a synthetic-provider unit test demonstrating full mixed-change diff output.docs/eval-providers.mdanddocs/skillgrade-integration.mdpublished.docs/ARCHITECTURE.mdupdated.Notes
Per the issue's option (b): test corpus uses synthetic providers in unit tests (
compare.test.ts) rather than polluting the real registry with a pseudo-version. The aspirational exampleskillgrade@1.0.0,skillgrade@2.0.0-nextis supported at the shape level — it will work unchanged the moment a v2 adapter lands.Test plan
bun test src/eval/compare.test.tspasses (31/31)bun test src/cli.test.ts -t "eval --compare"passes (6/6)bun run typecheckcleanasm eval ./fixture --compare quality@1.0.0,quality@1.0.0prints readable zero-diffasm eval ./fixture --compare skillgrade@1.0.0,skillgrade@2.0.0-nextprints clean error, exit 1asm eval ./fixture --compare quality@1.0.0 --machineemits v1 envelope error