Scale skill evals: compare mode, per-prompt budgets, efficiency grader (uv + shared lib)#44
Scale skill evals: compare mode, per-prompt budgets, efficiency grader (uv + shared lib)#44Bwvolleyball wants to merge 49 commits into
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ecks, +tests) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Generate evals/api/prompts.yaml (16 prompts) via migrate_prompts.py and delete both evals/api/prompts.csv and evals/hawkscan/prompts.csv. prompts.yaml is now the single source of truth for both skills; harness shims still reference prompts.csv and will be updated in Task 12. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rop unused --rubric Process checks, ad-hoc expectations, and budget checks now only run when prompt.should_trigger and did_trigger are both true. Correct non-triggers, false positives, and false negatives are graded purely on trigger accuracy, fixing the critical bug where a 100%-correct run would exit non-zero in CI. Also removes the parsed-but-never-read --rubric flag from _common_args. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rop rubric plumbing, refresh stale docs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Claude finished @Bwvolleyball's task in 3m 0s —— View job Claude Code Review
|
This comment was marked as outdated.
This comment was marked as outdated.
Implements CodexAdapter with CLI_SIGNALS, INVOCATION_SIGNALS, parse_stream (item.started/item.completed/turn.completed), detect_trigger, and launch (codex exec --json --sandbox workspace-write --skip-git-repo-check), resolving the C2 defect where uv run evals --harness codex raised ValueError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements CursorAdapter with cursor-specific stream-json event keys (tool_call/subtype:started/shellToolCall, not claude-code's tool_use blocks), the full CLI_SIGNALS and INVOCATION_SIGNALS from pre-shim, launch flags matching the pre-shim invocation, and tests backed by a minimal cursor.txt fixture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ive-run fidelity) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds AgyAdapter with plain-text parse_stream (wraps full stdout in output_text, bash_commands always empty), INVOCATION_SIGNALS recovered verbatim from pre-shim ALL_SIGNALS plus evaluation-format backtick variants, and launch() mirroring the pre-shim agy -p / --print-timeout invocation. CLI_SIGNALS is empty (agy has no shell commands to scan). Skills are installed globally in CI via agy plugin install; load_skill is a no-op. AGY_API_KEY flows through os.environ as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ve-run fidelity) Restores the pre-shim OBSERVE_SUFFIX (verbatim) and appends it to the prompt inside launch() before invoking agy. In --print mode agy hangs on tool approvals, so the suffix makes the agent declare 'SKILL: hawkscan' / 'SKILL: api' / 'SKILL: none' up front — that declaration is what the pre-shim SKILL: signals in INVOCATION_SIGNALS match. Without it, live agy runs emit no detectable trigger text (all false-negatives). Both signal sets are retained: pre-shim SKILL: entries AND the backtick evaluation-format variants, so detection is robust regardless of which format agy emits. Adds a unit test asserting OBSERVE_SUFFIX is non-empty, requests the SKILL: declaration, and that detect_trigger fires on it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… C2) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ds badge Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds render_digest() to reporting.py, a new `report` CLI entrypoint that discovers cell.json artifacts via rglob and writes a consolidated digest.md, and replaces the flat JS-built comment in skill-evals.yml with two clean steps (uv build + thin github-script post). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eful) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ters Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cursor best-effort; wire score_delta into overview Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| if model: | ||
| cmd += ["-m", model] | ||
| if not full_auto: | ||
| cmd += ["--sandbox", "read-only"] |
There was a problem hiding this comment.
[critical] --sandbox is specified twice with conflicting values. workspace-write is always added on line 128, then when full_auto=False (the default), read-only is appended on line 134. The resulting command is:
codex exec --json --sandbox workspace-write --skip-git-repo-check --sandbox read-only <prompt>
Depending on how Codex resolves duplicate flags this will either use the wrong sandbox level or error out. Fix by making it conditional:
sandbox = "workspace-write" if full_auto else "read-only"
cmd = ["codex", "exec", "--json", "--sandbox", sandbox, "--skip-git-repo-check"]| _setup_skill(tmpdir) | ||
| api_key = os.environ.get("CURSOR_API_KEY", "") | ||
| cmd = [ | ||
| "agent", "-p", prompt, |
There was a problem hiding this comment.
[warning] The CURSOR_API_KEY is passed as a command-line argument, making it visible in process listings (ps aux, /proc/<pid>/cmdline). Prefer injecting secrets via environment variable rather than --api-key <value>. The cursor CLI likely reads CURSOR_API_KEY from the environment already (consistent with how ANTHROPIC_API_KEY is used in the claude-code adapter — never passed on the command line there).
| ) -> ParsedRun: | ||
| # Skills are installed globally via `agy plugin install` in CI; | ||
| # load_skill toggling is a no-op here. | ||
| tmpdir = tempfile.mkdtemp(prefix=f"hawkeval_{run_id}_") |
There was a problem hiding this comment.
[warning] The agy adapter documents load_skill as a no-op because skills are installed globally via agy plugin install in CI. This means compare_skill() cannot actually test "without skill" for agy — both the load=True and load=False runs use the same globally-installed skill, so the compare will always report effect: "none" for this platform, producing misleading lift data. Consider either: (a) filtering agy out of compare runs until per-run skill toggle is supported, or (b) emitting a warning in compare output when the adapter returns noop for load_skill=False.
| github.event_name != 'workflow_dispatch' || | ||
| github.event_name == 'pull_request' || | ||
| github.event_name == 'push' || | ||
| inputs.platform == 'all' || |
There was a problem hiding this comment.
[warning] codex, agy, and cursor jobs all run on every PR (github.event_name == pull_request), but they produce empty results because the API keys are not set for PR builds. The existing CI comment shows "No results" for all three platforms. Each job installs a CLI, fails at plugin install, then exits — burning ~30 CI min per PR (3 platforms × 2 skills). Consider gating them to workflow_dispatch only until the adapters and API keys are ready.
This comment has been minimized.
This comment has been minimized.
…ort, fix total_cost_usd - ParsedRun gains returncode + stderr_tail fields; EvalResult gains note field - grade() propagates run.error → EvalResult.note on both return paths - render_job_summary() appends note to the "why" column when present - All four adapters (claude-code, codex, cursor, agy) now capture proc.returncode and proc.stderr after subprocess.run, set run.error on non-zero exit or empty output - claude-code adapter parse_stream reads total_cost_usd (new key) before cost_usd (legacy key) so cost stops showing $0.00 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te cell+summary+trace Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to run summary; capture-baseline full matrix Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st to run summary on dispatch Evals run real agents against tool CLIs and were never an automatic PR gate (origin/main commit c860e47 deliberately removed the pull_request trigger). Auto-PR runs surfaced env gaps (CLIs not installed, skills not loading under --bare) that were never set up for CI. Restore manual dispatch; the report job now writes the consolidated digest to GITHUB_STEP_SUMMARY on dispatch and only posts a PR comment when a PR context exists. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- codex: pick --sandbox value once (workspace-write vs read-only). Passing
both made codex exit 2 ("--sandbox cannot be used multiple times"), failing
every non-full-auto run before the agent started.
- cursor: pass CURSOR_API_KEY via the child environment instead of --api-key
on the command line (the flag leaked the secret into process listings/logs;
the agent CLI reads it from the env directly).
- agy: mark CLI install/verify/plugin-install/run steps continue-on-error so a
flaky preview installer no longer aborts the job before evals run — the eval
CLI records the launch failure and uploads a result the digest can surface
(matches cursor's best-effort treatment).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hawkscan skill grades on whether the agent runs the documented hawk commands (hawk version/config/validate/scan) — all command_executed checks. With no hawk on the runner, the agent improvised (docker) and never emitted a hawk* trigger signal, so every triggering prompt scored FN. Add a JDK 17 (hawk is a Java app) + install the latest hawk via the repo's own documented method: resolve version from api.stackhawk.com/hawkscan/version, download the Linux ZIP, unzip, add to PATH. Install/verify are continue-on-error so a flaky download still lets evals run and record state. Auth (HAWK_API_KEY) and hawkop are not wired here; the api skill and live-app checks remain blocked until those land. This isolates "does installing hawk flip the hawkscan trigger + preflight/validate/scan checks green". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-rolled curl/unzip with stackhawk/hawkscan-action@v2.5.0 using installCLIOnly: true — the maintained, canonical install path (resolves latest, handles the download/PATH). Keep setup-java@17 (hawk is a Java app; the action ships the CLI, not a JRE) and the post-install `hawk version` verify. apiKey is passed from the (currently empty) HAWK_API_KEY secret; install-only performs no scan so the key is unused. Step is continue-on-error so a missing key can't abort the job — evals still runs and records hawk availability. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All four AI harnesses (claude-code, codex, agy, cursor) drive an agent that runs the hawkscan skill, which needs the hawk CLI on PATH. Previously only claude-code installed it. Add the same install-only step (setup-java@17 + stackhawk/hawkscan-action@v2.5.0 installCLIOnly + hawk version verify) to the codex, agy, and cursor jobs. Note: the api skill prefers hawkop (separate CLI) and codex/agy/cursor still have their own tool-CLI provisioning blockers (codex auth, agy/cursor install) — tracked separately. This change covers hawk specifically, for every job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root causes found from the run-26769783222 step logs + traces: - codex: 401 "Missing bearer" — `codex exec` reads stored credentials, not OPENAI_API_KEY. Add `printenv OPENAI_API_KEY | codex login --with-api-key` before the eval run (pipe via stdin, never as an arg). - agy: `https://antigravity.google/install-cli` returns the site's HTML landing page, so `| bash` died with a syntax error and `agy` never installed. Use the real bootstrapper `/cli/install.sh`, add ~/.local/bin to PATH, and set ANTIGRAVITY_API_KEY (the env var agy actually reads) from the AGY_API_KEY secret. - cursor: `@cursor/cli` 404s and the `cursor-agent` npm package ships no `agent` binary. Use the official installer `curl https://cursor.com/install | bash`, which symlinks `agent` into ~/.local/bin; add that to PATH. claude-code plumbing already works (hw-05 trace: agent runs `hawk version` + `hawk scan --help`); no change there. Installer URLs verified to serve real shell scripts (application/x-sh) before wiring them in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After auth was fixed, codex still couldn't reach hawk: on Ubuntu-24.04 runners the bubblewrap sandbox fails to initialize (unprivileged user namespaces are gated by AppArmor) — "bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted" — so codex exits at sandbox startup before running any command (33 occurrences across cells; codex issue #16334). When CI is set, launch with --dangerously-bypass-approvals-and-sandbox instead of --sandbox <mode>. Safe on an ephemeral runner in a throwaway tmpdir, and the agent needs write+exec to run the hawkscan workflow anyway. Local runs keep the real sandbox (workspace-write for full-auto, else read-only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agy has no non-interactive auth (OAuth-only; upstream antigravity-cli#78 is open and unimplemented), so in a browser-less CI runner it prints an auth URL and times out. Detect that and set a clear error note so the digest attributes it to the upstream limitation rather than a plumbing/eval failure on our side. Boundary A (eval finds agy) and hawk-on-PATH are fixed; this is the one remaining harness that cannot run headlessly until upstream adds API-key auth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| for load in (True, False): | ||
| run = adapter.launch(p.prompt, skill, f"{p.id}-{'with' if load else 'without'}", | ||
| plugin_dirs, model=model, load_skill=load, | ||
| max_budget=max_budget, bare=bare, full_auto=full_auto) |
There was a problem hiding this comment.
[warning] compare_skill() has no per-prompt exception handling. If adapter.launch() throws (e.g. subprocess failure, disk error), the entire compare run aborts mid-loop with an unhandled exception rather than recording the failure for that prompt and continuing. The evals CLI wraps each prompt in try/except for exactly this reason — compare should do the same:
for p in prompts:
graded = {}
for load in (True, False):
try:
run = adapter.launch(...)
did = adapter.detect_trigger(run, skill)
graded[load] = grade(p, run, cfg.checks, ...)
except Exception as e:
graded[load] = grade(p, ParsedRun(error=str(e)), cfg.checks, ...)Note: the CI step already uses || true so a full abort won't break the workflow, but a half-written lift.json artifact will be confusing.
| rows = ["| test | result | why |", "|---|---|---|"] | ||
| for r in sorted(cell.results, key=lambda r: (_row_rank(r), r.run_id)): | ||
| why = "; ".join(r.budget_breaches) if r.budget_breaches else ( | ||
| "" if r.trigger_correct else |
There was a problem hiding this comment.
[warning] When trigger_correct=True but a blocking process check fails (verdict = FAIL), the why column in the table will be empty — the user sees ❌ FAIL with no explanation. The condition chain falls through to the empty string branch:
why = "; ".join(r.budget_breaches) if r.budget_breaches else (
"" if r.trigger_correct else # ← empty when trigger correct but proc-check failed
("false-positive" if r.did_trigger else "false-negative"))Consider adding a fallback for this case:
failed_checks = [c.id for c in r.process_checks if not c.passed and c.severity == "blocking"]
why = "; ".join(r.budget_breaches) or ("; ".join(failed_checks) if r.verdict.value == "fail" else "") or ("" if r.trigger_correct else ("false-positive" if r.did_trigger else "false-negative"))| max_budget=args.max_budget, bare=args.bare, | ||
| full_auto=args.full_auto, only_id=args.prompt_id) | ||
| import json | ||
| from pathlib import Path |
There was a problem hiding this comment.
[suggestion] json and Path are already imported at the top of cli.py — these in-function imports are redundant dead weight. Remove them.
This comment has been minimized.
This comment has been minimized.
The Actions run summary showed ~14 tables (one per matrix cell), because every eval job wrote its own render_job_summary to GITHUB_STEP_SUMMARY. Stop that; the `report` job now aggregates all cell.json into a single pivot table: test | claude-code-haiku-4-5 | claude-code-sonnet-4-6 | codex-gpt-5.5 | ... hawkscan/hw-01 | ✅ | ❌ — false-negative | ❌ — blocking check failed | ... Rows are skill/test, columns are platform-model (date stamp + redundant "claude-" prefix trimmed), cells are a verdict emoji + a terse reason on non-pass outcomes (`·` = that harness/model didn't run the test). Baseline/lift extras kept as compact notes below the table, not as more tables. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
--bare is "minimal mode": per `claude --help` it skips hooks/LSP/plugins and "skills still resolve via /skill-name" — i.e. skills do NOT auto-trigger from their description. The eval prompts are natural language, so in bare mode the skill never fired (the agent ran as a vanilla model and gave generic DAST advice, even naming ZAP). That produced ~all false-negatives on positive prompts. Run in full plugin mode instead (also the realistic user experience). Isolated change — measuring trigger rate and whether negative controls now over-trigger before layering in HAWK_API_KEY / hawkop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uged The CI sandbox has no running app/credentials, so a triggered agent correctly stops and asks for a target instead of completing a scan — leaving the workflow process-checks (which scan bash_commands + output_text) unsatisfied. In observe mode (default, no --bare, not full-auto) append a suffix asking the agent to (1) declare the StackHawk skill it would invoke in a signal-matching format and (2) outline the CLI commands that skill's workflow runs. The declaration drives trigger detection; the outline satisfies the workflow checks via output_text — reproducing origin/main's observe-mode intent (gauge that the right skill triggers and the agent knows its workflow) without a real target. The commands are intentionally NOT listed in the suffix — producing them is the skill's job. Full-auto/extended (real target) keeps the bare prompt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hawkscan Step-1 dedup checks (hawkop app list / env list) and the entire api skill require hawkop — a separate native CLI we never installed, so agents couldn't run/narrate those steps. No official GitHub Action exists for hawkop, so install the native Linux binary directly (download.stackhawk.com/hawkop, latest-version.txt + x86_64-unknown-linux-gnu tarball) into /usr/local/bin, right beside the hawk install in all four jobs. continue-on-error so a flaky download never aborts the job. No runtime deps (native binary; no JDK needed). URL/version per the repo's own api skill reference (hawkop-shortcuts.md); tarball verified to contain a top-level `hawkop` binary before wiring it in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Summary
Adds three capabilities to the skill-eval system, built on a uv-managed shared library so every future grader improvement lands once instead of per-harness:
uv run compare) — runs eachshould_triggerprompt with and without the skill loaded and reports the verdict + cost delta. This is the missing "did the skill actually help?" signal: previously a skill that never fired and a useless skill graded identically.prompts.csv→prompts.yaml; each prompt can carry anexpectedlist (signal / anti_pattern / check_id) and abudget(cost_usd / bash_commands / output_tokens / wall_seconds). Process-checks can be scoped to specific prompts viaapplies_to.PASS/PASS-SLOW/FAIL. A correct run that breaches a budget gradesPASS-SLOW, catching over-orchestration without losing the "it worked" signal.Supporting scaffolding:
uv run evals | compare | regrade | validate.extra="forbid") — config typos become load-time errors naming the offending prompt, before any API spend.uv run regrade <trace.jsonl>) — regrade a saved run with zero API cost; the iteration loop for the CI-only testbed.evals/lib/(models, config, grading, harness, replay, compare, reporting) + claude-code adapter. The 6 oldrun-evals.pyscripts become thin shims into the CLI (−2447 lines of duplicated parsing/grading).validate+pytest+ a cheap claude-code/Haiku run; merge-to-main runs the full Sonnet+Opus+Haiku matrix. codex/agy/cursor are dispatch-only until their adapters land.Test Plan
Verified locally (free layers):
uv run pytest— 39 unit tests (models, config, grading incl. all check types + 3-state verdict, harness parse/trigger, replay, compare lift, reporting)uv run validate— both skills' prompts.yaml + process-checks.json parse & validateuv run regrade tests/fixtures/hw-07.trace.jsonl --skill hawkscan→ PASSactionlintclean on the workflowExercised by this PR's CI (first live agent run):
eval-claude-code(Haiku, hawkscan + api) — the livelaunch()→claude -ppathNotes / follow-ups
ANTHROPIC_API_KEYrepo secret foreval-claude-code; the two free jobs pass without it.adapter.pyfor codex/cursor/copilot/agy (dispatch-only until then); rubric grading lives only in the manual harness (not yet ported to the lib);output_tokens/wall_secondsbudget axes are graded but not yet populated by the claude-code adapter.🤖 Generated with Claude Code