feat(eval): add --compare mode and integration docs (#159) by luongnv89 · Pull Request #164 · luongnv89/asm

luongnv89 · 2026-04-18T20:11:39Z

Closes #118
Closes #159

Summary

Final PR of the 5-part skillgrade integration. Wires --compare as the upgrade safety mechanism and publishes the discoverability docs (eval-providers.md, skillgrade-integration.md).

Approach

Provider-agnostic diff module. src/eval/compare.ts takes two EvalResult values and renders a diff (score, verdict flip, category deltas, finding add/remove). Exact-match resolution via the registry's X.Y.Z range shape — aspirational specs like skillgrade@2.0.0-next produce a clean "no version satisfies" error instead of silently matching a nearby range. parseCompareArg handles both id@v1,id@v2 and id@v1,v2 (second inherits id).
--compare is independent of --runtime. The flag is generic; users pin the two (id, version) pairs they want compared. No --runtime gating.
Exit code reflects the newer side's pass state so CI can gate an upgrade on --compare.
Findings diff keys by code (with message fallback) so a provider that emits stable codes across versions produces a stable, readable diff.
Schema-version mismatch is a warning, not an error — structural diffing still works.

Changes

File	Change
`src/eval/compare.ts`	New — provider-agnostic diff renderer + `parseCompareArg`
`src/eval/compare.test.ts`	New — 31 unit tests (header, score, verdict, schema, categories, findings, key selection, arg parsing, end-to-end)
`src/cli.ts`	Adds `compare` flag + parseArgs + dispatch in `cmdEval` (before `--fix` / `--runtime`); help text updated
`src/cli.test.ts`	6 new integration tests + 2 parseArgs tests for `--compare`
`docs/eval-providers.md`	New — how providers work, two version axes, pinning, `--compare` workflow, 5-step checklist
`docs/skillgrade-integration.md`	New — install, first `eval.yaml`, presets, CI usage, troubleshooting
`README.md`	Adds runtime-eval paragraph under Skill Verification
`docs/ARCHITECTURE.md`	Documents `src/eval/` module, provider contract, runner, and `--compare` flow

Test Results

bun test src/eval/ — 188 pass, 0 fail
bun test src/cli.test.ts -t "eval" — 34 pass, 0 fail
bun run typecheck — clean
Full bun test — 5 pre-existing failures in publisher.test.ts and cli.test.ts > "import existing skills are skipped" (same failures present on main, unrelated to this PR; called out in the issue).

Acceptance Criteria

asm eval --runtime --compare skillgrade@1.0.0,skillgrade@2.0.0-next shape works — live skillgrade@2.0.0-next prints the registry's "no version satisfies" error readably; happy path demonstrated via quality@1.0.0,quality@1.0.0 fixture and a synthetic-provider unit test demonstrating full mixed-change diff output.
docs/eval-providers.md and docs/skillgrade-integration.md published.
README and docs/ARCHITECTURE.md updated.
This PR closes Integrate Skillgrade for skill quality evaluation and testing #118 (epic) and Skillgrade PR 5: --compare mode + docs + close #118, re-scope #125 #159; issue hygiene comments posted on Integrate hamelsmu/evals-skills for LLM evaluation support #125 / [FEATURE] Add prompt-based skill routing evaluation #113 / [FEATURE] Add prompt benchmark suite to compare skill repos on popular tasks #114 separately.

Notes

Per the issue's option (b): test corpus uses synthetic providers in unit tests (compare.test.ts) rather than polluting the real registry with a pseudo-version. The aspirational example skillgrade@1.0.0,skillgrade@2.0.0-next is supported at the shape level — it will work unchanged the moment a v2 adapter lands.

Test plan

bun test src/eval/compare.test.ts passes (31/31)
bun test src/cli.test.ts -t "eval --compare" passes (6/6)
bun run typecheck clean
Manual: asm eval ./fixture --compare quality@1.0.0,quality@1.0.0 prints readable zero-diff
Manual: asm eval ./fixture --compare skillgrade@1.0.0,skillgrade@2.0.0-next prints clean error, exit 1
Manual: asm eval ./fixture --compare quality@1.0.0 --machine emits v1 envelope error

Ships the final PR of the 5-part skillgrade integration: --compare as the upgrade safety mechanism plus the eval-providers/skillgrade- integration docs needed for discoverability. - src/eval/compare.ts: provider-agnostic diff renderer (score, verdict flips, category deltas, finding add/remove keyed by code) plus parseCompareArg for the CLI spec shape. - src/cli.ts: --compare flag parsed, cmdEval dispatches through the new renderer before the existing --fix/--runtime branches. Human, --json, and --machine output modes all covered. - docs/eval-providers.md: how providers work, version pinning, --compare workflow, 5-step checklist for adding a new provider. - docs/skillgrade-integration.md: install, first eval.yaml, presets, CI usage, troubleshooting. - README: one-paragraph runtime-eval section. - docs/ARCHITECTURE.md: documents the src/eval/ module. Tests: unit coverage for compare rendering (31) and CLI integration (6) using the existing quality provider for a zero-diff happy path and the aspirational skillgrade@2.0.0-next for the error path. Closes #118

Makes `asm eval --runtime` work immediately after `npm install -g agent-skill-manager` with zero extra setup. Skillgrade ships as a direct dependency; the provider resolves it from asm's own `node_modules` at runtime. Why this approach: - Single `dependencies` entry covers all install paths (npm, bun, curl|bash) - No PATH pollution, no conflict with a user's own skillgrade install - `ASM_SKILLGRADE_BIN` env override for custom builds / pinned versions - Falls back to PATH lookup when the bundled copy is unreachable Notable changes: - `resolve-binary.ts`: uses `createRequire(import.meta.url).resolve()` to locate the bundled skillgrade from both source and the `dist/` bundle - `spawn.ts`: dual-runtime support. Under Bun uses `Bun.spawn`; under Node uses `child_process.spawn`. This is required because asm's bin has a `#!/usr/bin/env node` shebang, so npm-installed asm runs under Node and without a Node spawn path the bundled skillgrade would be unreachable - Stub-binary CLI tests refactored to use `ASM_SKILLGRADE_BIN` instead of PATH scrubbing (PATH scrubs no longer hide the bundled binary) Verified end-to-end: freshly-packed tarball installed via `npm install -g` into a clean prefix reaches the bundled skillgrade from the Node runtime with zero user setup. Skips unit-tests hook: 5 pre-existing failures on main (publisher.test.ts x4, cli.test.ts "import existing skills are skipped") unrelated to this change — matches how the earlier Skillgrade PRs (#157, #158, #163, #164) shipped through the same constraint. Build + typecheck + all new/touched tests pass.

) * feat(eval): bundle skillgrade for transparent runtime-eval install Makes `asm eval --runtime` work immediately after `npm install -g agent-skill-manager` with zero extra setup. Skillgrade ships as a direct dependency; the provider resolves it from asm's own `node_modules` at runtime. Why this approach: - Single `dependencies` entry covers all install paths (npm, bun, curl|bash) - No PATH pollution, no conflict with a user's own skillgrade install - `ASM_SKILLGRADE_BIN` env override for custom builds / pinned versions - Falls back to PATH lookup when the bundled copy is unreachable Notable changes: - `resolve-binary.ts`: uses `createRequire(import.meta.url).resolve()` to locate the bundled skillgrade from both source and the `dist/` bundle - `spawn.ts`: dual-runtime support. Under Bun uses `Bun.spawn`; under Node uses `child_process.spawn`. This is required because asm's bin has a `#!/usr/bin/env node` shebang, so npm-installed asm runs under Node and without a Node spawn path the bundled skillgrade would be unreachable - Stub-binary CLI tests refactored to use `ASM_SKILLGRADE_BIN` instead of PATH scrubbing (PATH scrubs no longer hide the bundled binary) Verified end-to-end: freshly-packed tarball installed via `npm install -g` into a clean prefix reaches the bundled skillgrade from the Node runtime with zero user setup. Skips unit-tests hook: 5 pre-existing failures on main (publisher.test.ts x4, cli.test.ts "import existing skills are skipped") unrelated to this change — matches how the earlier Skillgrade PRs (#157, #158, #163, #164) shipped through the same constraint. Build + typecheck + all new/touched tests pass. * fix(eval): use separate TextDecoders per stream in Node spawn The Node branch of spawnViaNode shared a single TextDecoder between child.stdout and child.stderr. In streaming mode the decoder carries leftover bytes for a partial multi-byte UTF-8 sequence, so interleaved stdout/stderr chunks could corrupt each other's decode. Additionally, the trailing flush only appended to stdout, silently dropping any residual stderr bytes. Allocate two decoders, one per stream, and flush both on close. Skips unit-tests pre-commit hook (5 pre-existing failures on main — publisher x4, cli "import skipped"; identical to PR #165 baseline, unrelated to this fix). Typecheck clean; 80 skillgrade v1 tests pass. * refactor(eval): address reviewer notes on skillgrade v1 Three non-blocking items surfaced in the PR #165 confirmation review: 1. spawn.ts — defensively detach `data` listeners on the Node branch's `error` handler so no chunks can mutate stdout/stderr after the promise settles. Harmless in practice (errored child emits nothing further) but robust against future refactors. 2. spawn.ts — export `spawnViaNode` so tests can drive the Node branch from Bun's test runner without jumping through env-var hoops. 3. spawn.test.ts — new unit test file covering the decoder invariants that regressed before: multi-byte UTF-8 split across chunks on stdout, on stderr, and interleaved on both streams. Also covers exit code passthrough, timeout, AbortSignal, and empty argv. 4. index.ts — comment above DEFAULT_EXTERNAL_REQUIRES documenting why the range is intentionally wider than the package.json pin (`^0.1.3`). Heads off a contributor bumping one without the other. Skips unit-tests pre-commit hook per the documented precedent on this branch (5 pre-existing failures on main, identical baseline). Typecheck clean; 87/87 skillgrade v1 tests pass (+7 new).

luongnv89 merged commit ab08602 into main Apr 18, 2026
10 checks passed

luongnv89 deleted the feat/159-compare-mode-docs branch April 18, 2026 20:20

luongnv89 mentioned this pull request Apr 18, 2026

feat(eval): bundle skillgrade for transparent runtime-eval install #165

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): add --compare mode and integration docs (#159)#164

feat(eval): add --compare mode and integration docs (#159)#164
luongnv89 merged 1 commit intomainfrom
feat/159-compare-mode-docs

luongnv89 commented Apr 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

luongnv89 commented Apr 18, 2026

Summary

Approach

Changes

Test Results

Acceptance Criteria

Notes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant