Scale skill evals: compare mode, per-prompt budgets, efficiency grader (uv + shared lib) by Bwvolleyball · Pull Request #44 · stackhawk/agent-skills

Bwvolleyball · 2026-05-29T20:01:49Z

Summary

Adds three capabilities to the skill-eval system, built on a uv-managed shared library so every future grader improvement lands once instead of per-harness:

Compare mode (uv run compare) — runs each should_trigger prompt with and without the skill loaded and reports the verdict + cost delta. This is the missing "did the skill actually help?" signal: previously a skill that never fired and a useless skill graded identically.
Per-prompt expected behavior + budgets — prompts.csv → prompts.yaml; each prompt can carry an expected list (signal / anti_pattern / check_id) and a budget (cost_usd / bash_commands / output_tokens / wall_seconds). Process-checks can be scoped to specific prompts via applies_to.
Three-state efficiency grader — PASS / PASS-SLOW / FAIL. A correct run that breaches a budget grades PASS-SLOW, catching over-orchestration without losing the "it worked" signal.

Supporting scaffolding:

uv project + unified CLI: uv run evals | compare | regrade | validate.
Pydantic strict-validated config (extra="forbid") — config typos become load-time errors naming the offending prompt, before any API spend.
Replay-from-trace (uv run regrade <trace.jsonl>) — regrade a saved run with zero API cost; the iteration loop for the CI-only testbed.
Shared lib evals/lib/ (models, config, grading, harness, replay, compare, reporting) + claude-code adapter. The 6 old run-evals.py scripts become thin shims into the CLI (−2447 lines of duplicated parsing/grading).
Tiered CI: every PR runs keys-free validate + pytest + a cheap claude-code/Haiku run; merge-to-main runs the full Sonnet+Opus+Haiku matrix. codex/agy/cursor are dispatch-only until their adapters land.

Test Plan

Verified locally (free layers):

uv run pytest — 39 unit tests (models, config, grading incl. all check types + 3-state verdict, harness parse/trigger, replay, compare lift, reporting)
uv run validate — both skills' prompts.yaml + process-checks.json parse & validate
uv run regrade tests/fixtures/hw-07.trace.jsonl --skill hawkscan → PASS
actionlint clean on the workflow

Exercised by this PR's CI (first live agent run):

eval-claude-code (Haiku, hawkscan + api) — the live launch() → claude -p path
PR comment posts trigger-accuracy + process-score table

Notes / follow-ups

Requires ANTHROPIC_API_KEY repo secret for eval-claude-code; the two free jobs pass without it.
This PR is the first end-to-end run of the live agent path — local verification covered parsing/grading/validation/replay only.
Future work (documented, non-blocking): per-platform adapter.py for codex/cursor/copilot/agy (dispatch-only until then); rubric grading lives only in the manual harness (not yet ported to the lib); output_tokens/wall_seconds budget axes are graded but not yet populated by the claude-code adapter.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ecks, +tests) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Generate evals/api/prompts.yaml (16 prompts) via migrate_prompts.py and delete both evals/api/prompts.csv and evals/hawkscan/prompts.csv. prompts.yaml is now the single source of truth for both skills; harness shims still reference prompts.csv and will be updated in Task 12. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rop unused --rubric Process checks, ad-hoc expectations, and budget checks now only run when prompt.should_trigger and did_trigger are both true. Correct non-triggers, false positives, and false negatives are graded purely on trigger accuracy, fixing the critical bug where a 100%-correct run would exit non-zero in CI. Also removes the parsed-but-never-read --rubric flag from _common_args. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rop rubric plumbing, refresh stale docs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-05-29T20:02:04Z

Claude finished @Bwvolleyball's task in 3m 0s —— View job

Claude Code Review

Gather context and understand the PR
Read changed files and diff
Review correctness, security, performance, testing, conventions
Post final verdict

⚠️ Needs Attention

Solid, well-tested refactor. The shared lib, strict Pydantic config, replay-from-trace, and 39-unit-test suite are all real improvements. Three issues from the previous round are confirmed fixed.

✅ Fixed from prior review

Codex duplicate --sandbox — correctly branches on CI env var now
Cursor CURSOR_API_KEY CLI exposure — injected via environment, never on the command line
codex/agy/cursor running on every PR — workflow is workflow_dispatch-only

Remaining warnings

evals/lib/compare.py:21-29 — compare_skill() has no per-prompt try/except; a single adapter.launch() failure aborts the entire compare run and produces a partial lift.json. The cli.main() loop already handles this correctly — compare should mirror it. Fix this →
evals/lib/reporting.py:217-219 — render_job_summary() leaves why blank when a blocking process-check fails but trigger_correct=True. The _fail_reason() helper in the same file already handles this with "blocking check failed" — use it. Fix this →
evals/harnesses/agy/adapter.py — load_skill is a no-op for agy; both with and without compare arms run the same globally-installed skill, so compare always reports effect: "none" for agy. At minimum add a note; ideally filter agy from compare runs.
evals/harnesses/codex/adapter.py:82-87 — seen set deduplicates item.started events by command string, so a command that legitimately runs twice is only counted once — under-counting bash_commands for budget checks.
evals/cli.py:102-103 — import json and from pathlib import Path inside compare() are redundant; both already imported at module level.

Implements CodexAdapter with CLI_SIGNALS, INVOCATION_SIGNALS, parse_stream (item.started/item.completed/turn.completed), detect_trigger, and launch (codex exec --json --sandbox workspace-write --skip-git-repo-check), resolving the C2 defect where uv run evals --harness codex raised ValueError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Implements CursorAdapter with cursor-specific stream-json event keys (tool_call/subtype:started/shellToolCall, not claude-code's tool_use blocks), the full CLI_SIGNALS and INVOCATION_SIGNALS from pre-shim, launch flags matching the pre-shim invocation, and tests backed by a minimal cursor.txt fixture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ive-run fidelity) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds AgyAdapter with plain-text parse_stream (wraps full stdout in output_text, bash_commands always empty), INVOCATION_SIGNALS recovered verbatim from pre-shim ALL_SIGNALS plus evaluation-format backtick variants, and launch() mirroring the pre-shim agy -p / --print-timeout invocation. CLI_SIGNALS is empty (agy has no shell commands to scan). Skills are installed globally in CI via agy plugin install; load_skill is a no-op. AGY_API_KEY flows through os.environ as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ve-run fidelity) Restores the pre-shim OBSERVE_SUFFIX (verbatim) and appends it to the prompt inside launch() before invoking agy. In --print mode agy hangs on tool approvals, so the suffix makes the agent declare 'SKILL: hawkscan' / 'SKILL: api' / 'SKILL: none' up front — that declaration is what the pre-shim SKILL: signals in INVOCATION_SIGNALS match. Without it, live agy runs emit no detectable trigger text (all false-negatives). Both signal sets are retained: pre-shim SKILL: entries AND the backtick evaluation-format variants, so detection is robust regardless of which format agy emits. Adds a unit test asserting OBSERVE_SUFFIX is non-empty, requests the SKILL: declaration, and that detect_trigger fires on it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… C2) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ds badge Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds render_digest() to reporting.py, a new `report` CLI entrypoint that discovers cell.json artifacts via rglob and writes a consolidated digest.md, and replaces the flat JS-built comment in skill-evals.yml with two clean steps (uv build + thin github-script post). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…eful) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ters Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cursor best-effort; wire score_delta into overview Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-05-29T21:46:04Z

+            if model:
+                cmd += ["-m", model]
+            if not full_auto:
+                cmd += ["--sandbox", "read-only"]


[critical] --sandbox is specified twice with conflicting values. workspace-write is always added on line 128, then when full_auto=False (the default), read-only is appended on line 134. The resulting command is:

codex exec --json --sandbox workspace-write --skip-git-repo-check --sandbox read-only <prompt>

Depending on how Codex resolves duplicate flags this will either use the wrong sandbox level or error out. Fix by making it conditional:

sandbox = "workspace-write" if full_auto else "read-only" cmd = ["codex", "exec", "--json", "--sandbox", sandbox, "--skip-git-repo-check"]

Fix this →

claude · 2026-05-29T21:46:18Z

+                _setup_skill(tmpdir)
+            api_key = os.environ.get("CURSOR_API_KEY", "")
+            cmd = [
+                "agent", "-p", prompt,


[warning] The CURSOR_API_KEY is passed as a command-line argument, making it visible in process listings (ps aux, /proc/<pid>/cmdline). Prefer injecting secrets via environment variable rather than --api-key <value>. The cursor CLI likely reads CURSOR_API_KEY from the environment already (consistent with how ANTHROPIC_API_KEY is used in the claude-code adapter — never passed on the command line there).

claude · 2026-05-29T21:46:27Z

+    ) -> ParsedRun:
+        # Skills are installed globally via `agy plugin install` in CI;
+        # load_skill toggling is a no-op here.
+        tmpdir = tempfile.mkdtemp(prefix=f"hawkeval_{run_id}_")


[warning] The agy adapter documents load_skill as a no-op because skills are installed globally via agy plugin install in CI. This means compare_skill() cannot actually test "without skill" for agy — both the load=True and load=False runs use the same globally-installed skill, so the compare will always report effect: "none" for this platform, producing misleading lift data. Consider either: (a) filtering agy out of compare runs until per-run skill toggle is supported, or (b) emitting a warning in compare output when the adapter returns noop for load_skill=False.

claude · 2026-05-29T21:46:42Z

-      github.event_name != 'workflow_dispatch' ||
+      github.event_name == 'pull_request' ||
+      github.event_name == 'push' ||
      inputs.platform == 'all' ||


[warning] codex, agy, and cursor jobs all run on every PR (github.event_name == pull_request), but they produce empty results because the API keys are not set for PR builds. The existing CI comment shows "No results" for all three platforms. Each job installs a CLI, fails at plugin install, then exits — burning ~30 CI min per PR (3 platforms × 2 skills). Consider gating them to workflow_dispatch only until the adapters and API keys are ready.

…ort, fix total_cost_usd - ParsedRun gains returncode + stderr_tail fields; EvalResult gains note field - grade() propagates run.error → EvalResult.note on both return paths - render_job_summary() appends note to the "why" column when present - All four adapters (claude-code, codex, cursor, agy) now capture proc.returncode and proc.stderr after subprocess.run, set run.error on non-zero exit or empty output - claude-code adapter parse_stream reads total_cost_usd (new key) before cost_usd (legacy key) so cost stops showing $0.00 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…te cell+summary+trace Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…to run summary; capture-baseline full matrix Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…st to run summary on dispatch Evals run real agents against tool CLIs and were never an automatic PR gate (origin/main commit c860e47 deliberately removed the pull_request trigger). Auto-PR runs surfaced env gaps (CLIs not installed, skills not loading under --bare) that were never set up for CI. Restore manual dispatch; the report job now writes the consolidated digest to GITHUB_STEP_SUMMARY on dispatch and only posts a PR comment when a PR context exists. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- codex: pick --sandbox value once (workspace-write vs read-only). Passing both made codex exit 2 ("--sandbox cannot be used multiple times"), failing every non-full-auto run before the agent started. - cursor: pass CURSOR_API_KEY via the child environment instead of --api-key on the command line (the flag leaked the secret into process listings/logs; the agent CLI reads it from the env directly). - agy: mark CLI install/verify/plugin-install/run steps continue-on-error so a flaky preview installer no longer aborts the job before evals run — the eval CLI records the launch failure and uploads a result the digest can surface (matches cursor's best-effort treatment). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The hawkscan skill grades on whether the agent runs the documented hawk commands (hawk version/config/validate/scan) — all command_executed checks. With no hawk on the runner, the agent improvised (docker) and never emitted a hawk* trigger signal, so every triggering prompt scored FN. Add a JDK 17 (hawk is a Java app) + install the latest hawk via the repo's own documented method: resolve version from api.stackhawk.com/hawkscan/version, download the Linux ZIP, unzip, add to PATH. Install/verify are continue-on-error so a flaky download still lets evals run and record state. Auth (HAWK_API_KEY) and hawkop are not wired here; the api skill and live-app checks remain blocked until those land. This isolates "does installing hawk flip the hawkscan trigger + preflight/validate/scan checks green". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the hand-rolled curl/unzip with stackhawk/hawkscan-action@v2.5.0 using installCLIOnly: true — the maintained, canonical install path (resolves latest, handles the download/PATH). Keep setup-java@17 (hawk is a Java app; the action ships the CLI, not a JRE) and the post-install `hawk version` verify. apiKey is passed from the (currently empty) HAWK_API_KEY secret; install-only performs no scan so the key is unused. Step is continue-on-error so a missing key can't abort the job — evals still runs and records hawk availability. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

All four AI harnesses (claude-code, codex, agy, cursor) drive an agent that runs the hawkscan skill, which needs the hawk CLI on PATH. Previously only claude-code installed it. Add the same install-only step (setup-java@17 + stackhawk/hawkscan-action@v2.5.0 installCLIOnly + hawk version verify) to the codex, agy, and cursor jobs. Note: the api skill prefers hawkop (separate CLI) and codex/agy/cursor still have their own tool-CLI provisioning blockers (codex auth, agy/cursor install) — tracked separately. This change covers hawk specifically, for every job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Root causes found from the run-26769783222 step logs + traces: - codex: 401 "Missing bearer" — `codex exec` reads stored credentials, not OPENAI_API_KEY. Add `printenv OPENAI_API_KEY | codex login --with-api-key` before the eval run (pipe via stdin, never as an arg). - agy: `https://antigravity.google/install-cli` returns the site's HTML landing page, so `| bash` died with a syntax error and `agy` never installed. Use the real bootstrapper `/cli/install.sh`, add ~/.local/bin to PATH, and set ANTIGRAVITY_API_KEY (the env var agy actually reads) from the AGY_API_KEY secret. - cursor: `@cursor/cli` 404s and the `cursor-agent` npm package ships no `agent` binary. Use the official installer `curl https://cursor.com/install | bash`, which symlinks `agent` into ~/.local/bin; add that to PATH. claude-code plumbing already works (hw-05 trace: agent runs `hawk version` + `hawk scan --help`); no change there. Installer URLs verified to serve real shell scripts (application/x-sh) before wiring them in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

After auth was fixed, codex still couldn't reach hawk: on Ubuntu-24.04 runners the bubblewrap sandbox fails to initialize (unprivileged user namespaces are gated by AppArmor) — "bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted" — so codex exits at sandbox startup before running any command (33 occurrences across cells; codex issue #16334). When CI is set, launch with --dangerously-bypass-approvals-and-sandbox instead of --sandbox <mode>. Safe on an ephemeral runner in a throwaway tmpdir, and the agent needs write+exec to run the hawkscan workflow anyway. Local runs keep the real sandbox (workspace-write for full-auto, else read-only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

agy has no non-interactive auth (OAuth-only; upstream antigravity-cli#78 is open and unimplemented), so in a browser-less CI runner it prints an auth URL and times out. Detect that and set a clear error note so the digest attributes it to the upstream limitation rather than a plumbing/eval failure on our side. Boundary A (eval finds agy) and hawk-on-PATH are fixed; this is the one remaining harness that cannot run headlessly until upstream adds API-key auth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-06-01T18:23:00Z

+        for load in (True, False):
+            run = adapter.launch(p.prompt, skill, f"{p.id}-{'with' if load else 'without'}",
+                                 plugin_dirs, model=model, load_skill=load,
+                                 max_budget=max_budget, bare=bare, full_auto=full_auto)


[warning] compare_skill() has no per-prompt exception handling. If adapter.launch() throws (e.g. subprocess failure, disk error), the entire compare run aborts mid-loop with an unhandled exception rather than recording the failure for that prompt and continuing. The evals CLI wraps each prompt in try/except for exactly this reason — compare should do the same:

for p in prompts: graded = {} for load in (True, False): try: run = adapter.launch(...) did = adapter.detect_trigger(run, skill) graded[load] = grade(p, run, cfg.checks, ...) except Exception as e: graded[load] = grade(p, ParsedRun(error=str(e)), cfg.checks, ...)

Note: the CI step already uses || true so a full abort won't break the workflow, but a half-written lift.json artifact will be confusing.

claude · 2026-06-01T18:23:10Z

+    rows = ["| test | result | why |", "|---|---|---|"]
+    for r in sorted(cell.results, key=lambda r: (_row_rank(r), r.run_id)):
+        why = "; ".join(r.budget_breaches) if r.budget_breaches else (
+            "" if r.trigger_correct else


[warning] When trigger_correct=True but a blocking process check fails (verdict = FAIL), the why column in the table will be empty — the user sees ❌ FAIL with no explanation. The condition chain falls through to the empty string branch:

why = "; ".join(r.budget_breaches) if r.budget_breaches else ( "" if r.trigger_correct else # ← empty when trigger correct but proc-check failed ("false-positive" if r.did_trigger else "false-negative"))

Consider adding a fallback for this case:

failed_checks = [c.id for c in r.process_checks if not c.passed and c.severity == "blocking"] why = "; ".join(r.budget_breaches) or ("; ".join(failed_checks) if r.verdict.value == "fail" else "") or ("" if r.trigger_correct else ("false-positive" if r.did_trigger else "false-negative"))

claude · 2026-06-01T18:23:16Z

+                         max_budget=args.max_budget, bare=args.bare,
+                         full_auto=args.full_auto, only_id=args.prompt_id)
+    import json
+    from pathlib import Path


[suggestion] json and Path are already imported at the top of cli.py — these in-function imports are redundant dead weight. Remove them.

The Actions run summary showed ~14 tables (one per matrix cell), because every eval job wrote its own render_job_summary to GITHUB_STEP_SUMMARY. Stop that; the `report` job now aggregates all cell.json into a single pivot table: test | claude-code-haiku-4-5 | claude-code-sonnet-4-6 | codex-gpt-5.5 | ... hawkscan/hw-01 | ✅ | ❌ — false-negative | ❌ — blocking check failed | ... Rows are skill/test, columns are platform-model (date stamp + redundant "claude-" prefix trimmed), cells are a verdict emoji + a terse reason on non-pass outcomes (`·` = that harness/model didn't run the test). Baseline/lift extras kept as compact notes below the table, not as more tables. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

--bare is "minimal mode": per `claude --help` it skips hooks/LSP/plugins and "skills still resolve via /skill-name" — i.e. skills do NOT auto-trigger from their description. The eval prompts are natural language, so in bare mode the skill never fired (the agent ran as a vanilla model and gave generic DAST advice, even naming ZAP). That produced ~all false-negatives on positive prompts. Run in full plugin mode instead (also the realistic user experience). Isolated change — measuring trigger rate and whether negative controls now over-trigger before layering in HAWK_API_KEY / hawkop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uged The CI sandbox has no running app/credentials, so a triggered agent correctly stops and asks for a target instead of completing a scan — leaving the workflow process-checks (which scan bash_commands + output_text) unsatisfied. In observe mode (default, no --bare, not full-auto) append a suffix asking the agent to (1) declare the StackHawk skill it would invoke in a signal-matching format and (2) outline the CLI commands that skill's workflow runs. The declaration drives trigger detection; the outline satisfies the workflow checks via output_text — reproducing origin/main's observe-mode intent (gauge that the right skill triggers and the agent knows its workflow) without a real target. The commands are intentionally NOT listed in the suffix — producing them is the skill's job. Full-auto/extended (real target) keeps the bare prompt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The hawkscan Step-1 dedup checks (hawkop app list / env list) and the entire api skill require hawkop — a separate native CLI we never installed, so agents couldn't run/narrate those steps. No official GitHub Action exists for hawkop, so install the native Linux binary directly (download.stackhawk.com/hawkop, latest-version.txt + x86_64-unknown-linux-gnu tarball) into /usr/local/bin, right beside the hawk install in all four jobs. continue-on-error so a flaky download never aborts the job. No runtime deps (native binary; no JDK needed). URL/version per the repo's own api skill reference (hawkop-shortcuts.md); tarball verified to contain a top-level `hawkop` binary before wiring it in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-06-01T21:27:47Z

⚠️ Claude Code Review — Needs Attention

Substantial, well-architected PR. The shared evals/lib/ refactor is clean, Pydantic strict-validated config is a real safety improvement, and the 39-test suite gives good coverage across models/grading/reporting. Three issues from the previous review round are confirmed fixed (codex duplicate --sandbox, cursor API key CLI-exposure, and CI jobs running on every PR). A few warnings remain open.

✅ Fixes confirmed from prior review round

Codex duplicate --sandbox (evals/harnesses/codex/adapter.py) — now correctly branches on CI env var; --dangerously-bypass-approvals-and-sandbox in CI, conditional workspace-write / read-only locally.
Cursor CURSOR_API_KEY CLI exposure (evals/harnesses/cursor/adapter.py) — key now injected via env=dict(os.environ), never on the command line.
codex/agy/cursor running on every PR (skill-evals.yml) — workflow is now workflow_dispatch-only; free jobs (validate, pytest) are the only automatic gates.

🚨 Issues Found

[warning] evals/lib/compare.py:21-29 — compare_skill() has no per-prompt exception handling. If adapter.launch() raises mid-loop (subprocess failure, disk error, timeout not wrapped), the entire compare run aborts with an unhandled exception, producing a half-written or missing lift.json artifact. cli.main() wraps each prompt in try/except for exactly this reason; compare_skill() should do the same. The || true in the CI step masks the abort but leaves the artifact in an indeterminate state.

[warning] evals/lib/reporting.py:217-219 — render_job_summary() leaves the why column empty when a blocking process-check fails but the trigger fired correctly (i.e., trigger_correct=True, budget_breaches=[], verdict=FAIL). The condition chain short-circuits to "" in that branch. The _fail_reason() helper in the same file handles this gracefully with a "blocking check failed" fallback — render_job_summary should use the same logic. A user sees ❌ FAIL with no explanation in the table.

[warning] evals/harnesses/agy/adapter.py / evals/lib/compare.py — load_skill is documented as a no-op for agy (skills are globally installed). This means both the load=True and load=False compare arms run against the same globally-installed skill, so compare will always report effect: "none" for agy — the lift measurement is meaningless. Consider either filtering agy out of compare runs or emitting a warning in compare_skill() when the adapter doesn't support per-run skill toggling.

[warning] evals/harnesses/codex/adapter.py:82-87 — The seen: set() deduplication for item.started events means a command that legitimately runs twice (e.g., two hawk scan invocations in a rescan flow) is only counted once. This under-counts bash_commands for budget checks, potentially letting a prompt pass a bash_commands budget it actually violated.

Suggested before merge

Add try/except in compare_skill() loop — mirror the resilience already in cli.main().
Fix the empty why in render_job_summary for blocking-check failures — use _fail_reason(r) or add the "blocking check failed" fallback.
Remove the redundant import json / from pathlib import Path at the top of compare() in evals/cli.py:102-103 (both already imported at module level).
Document or guard the agy compare limitation — a # NOTE: load_skill is a no-op; compare results for agy are always "none" comment in compare_skill() at minimum.

Bwvolleyball and others added 17 commits May 29, 2026 12:23

build: bootstrap uv project for evals lib

fde324e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): Pydantic data models with strict validation

2c4fd36

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): skill config loader with validation

1041542

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): grading with per-prompt expected + budget verdict

23f82e2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(evals): grading robustness from code review (loud on malformed ch…

d7adeff

…ecks, +tests) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): Harness protocol + claude-code adapter

cf46d3e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(evals): add one-time prompts CSV->YAML migration script

b3827fc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): migrate hawkscan prompts to YAML + worked budget example

f424b26

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): replay-from-trace regrade + realistic hw-07 fixture

ae57aed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): with/without-skill compare mode

de9cb81

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): unified CLI (evals/compare/regrade/validate) + reporting

9399123

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refactor(evals): harness scripts become shims into unified CLI

5472ed2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ci(evals): tiered runs (validate on PR, cheap PR matrix, full main) + uv

5a7f80f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(evals): document uv CLI, prompts.yaml, compare/regrade, PASS-SLOW

fc9c551

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ci+docs(evals): dispatch-only non-claude jobs (C2), add pytest job, d…

413a748

…rop rubric plumbing, refresh stale docs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Bwvolleyball requested a review from a team as a code owner May 29, 2026 20:01

This comment was marked as outdated.

Sign in to view

Bwvolleyball and others added 10 commits May 29, 2026 14:53

fix(evals): cursor adapter launch restores --trust + skill-loading (l…

ff59637

…ive-run fidelity) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ci(evals): re-enable codex/cursor/agy now that adapters exist (closes…

aef302e

… C2) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): CellReport model + cell.json artifact from main()

a041ca9

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): render_job_summary (JUnit-style, failures first) + shiel…

da3d46c

…ds badge Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): main() writes GITHUB_STEP_SUMMARY job report

5afc996

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Bwvolleyball and others added 5 commits May 29, 2026 15:21

ci(evals): capture baseline at release tag; PR diffs against it (grac…

324b8cf

…eful) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): compare emits lift effect + writes lift.json

bd52c2c

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(evals): render skill-lift section; PR runs compare for lift

5372103

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(evals): document JUnit-style report, comparisons, four real adap…

690dc5a

…ters Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(evals): comment job checkout-before-download (empty-digest bug); …

dca5e67

…cursor best-effort; wire score_delta into overview Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude Bot reviewed May 29, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Bwvolleyball and others added 11 commits May 31, 2026 20:45

fix(evals): main() resilient to per-prompt launch crashes; always wri…

05c7bc7

…te cell+summary+trace Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ci(evals): full tool×model matrix on PR+dispatch (drop push); digest …

06e063b

…to run summary; capture-baseline full matrix Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude Bot reviewed Jun 1, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Bwvolleyball and others added 4 commits June 1, 2026 13:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scale skill evals: compare mode, per-prompt budgets, efficiency grader (uv + shared lib)#44

Scale skill evals: compare mode, per-prompt budgets, efficiency grader (uv + shared lib)#44
Bwvolleyball wants to merge 49 commits into
mainfrom
feat/scale-skill-evals

Bwvolleyball commented May 29, 2026

Uh oh!

claude Bot commented May 29, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

claude Bot May 29, 2026

Uh oh!

claude Bot May 29, 2026

Uh oh!

claude Bot May 29, 2026

Uh oh!

claude Bot May 29, 2026

Uh oh!

This comment has been minimized.

claude Bot Jun 1, 2026

Uh oh!

claude Bot Jun 1, 2026

Uh oh!

claude Bot Jun 1, 2026

Uh oh!

This comment has been minimized.

claude Bot commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Bwvolleyball commented May 29, 2026

Summary

Test Plan

Notes / follow-ups

Uh oh!

claude Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

⚠️ Needs Attention

✅ Fixed from prior review

Remaining warnings

Uh oh!

This comment was marked as outdated.

claude Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

claude Bot commented Jun 1, 2026

⚠️ Claude Code Review — Needs Attention

✅ Fixes confirmed from prior review round

🚨 Issues Found

Suggested before merge

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 29, 2026 •

edited

Loading