Skip to content

Scale skill evals: compare mode, per-prompt budgets, efficiency grader (uv + shared lib)#44

Open
Bwvolleyball wants to merge 49 commits into
mainfrom
feat/scale-skill-evals
Open

Scale skill evals: compare mode, per-prompt budgets, efficiency grader (uv + shared lib)#44
Bwvolleyball wants to merge 49 commits into
mainfrom
feat/scale-skill-evals

Conversation

@Bwvolleyball
Copy link
Copy Markdown
Contributor

it's working

Summary

Adds three capabilities to the skill-eval system, built on a uv-managed shared library so every future grader improvement lands once instead of per-harness:

  • Compare mode (uv run compare) — runs each should_trigger prompt with and without the skill loaded and reports the verdict + cost delta. This is the missing "did the skill actually help?" signal: previously a skill that never fired and a useless skill graded identically.
  • Per-prompt expected behavior + budgetsprompts.csvprompts.yaml; each prompt can carry an expected list (signal / anti_pattern / check_id) and a budget (cost_usd / bash_commands / output_tokens / wall_seconds). Process-checks can be scoped to specific prompts via applies_to.
  • Three-state efficiency graderPASS / PASS-SLOW / FAIL. A correct run that breaches a budget grades PASS-SLOW, catching over-orchestration without losing the "it worked" signal.

Supporting scaffolding:

  • uv project + unified CLI: uv run evals | compare | regrade | validate.
  • Pydantic strict-validated config (extra="forbid") — config typos become load-time errors naming the offending prompt, before any API spend.
  • Replay-from-trace (uv run regrade <trace.jsonl>) — regrade a saved run with zero API cost; the iteration loop for the CI-only testbed.
  • Shared lib evals/lib/ (models, config, grading, harness, replay, compare, reporting) + claude-code adapter. The 6 old run-evals.py scripts become thin shims into the CLI (−2447 lines of duplicated parsing/grading).
  • Tiered CI: every PR runs keys-free validate + pytest + a cheap claude-code/Haiku run; merge-to-main runs the full Sonnet+Opus+Haiku matrix. codex/agy/cursor are dispatch-only until their adapters land.

Test Plan

Verified locally (free layers):

  • uv run pytest — 39 unit tests (models, config, grading incl. all check types + 3-state verdict, harness parse/trigger, replay, compare lift, reporting)
  • uv run validate — both skills' prompts.yaml + process-checks.json parse & validate
  • uv run regrade tests/fixtures/hw-07.trace.jsonl --skill hawkscan → PASS
  • actionlint clean on the workflow

Exercised by this PR's CI (first live agent run):

  • eval-claude-code (Haiku, hawkscan + api) — the live launch()claude -p path
  • PR comment posts trigger-accuracy + process-score table

Notes / follow-ups

  • Requires ANTHROPIC_API_KEY repo secret for eval-claude-code; the two free jobs pass without it.
  • This PR is the first end-to-end run of the live agent path — local verification covered parsing/grading/validation/replay only.
  • Future work (documented, non-blocking): per-platform adapter.py for codex/cursor/copilot/agy (dispatch-only until then); rubric grading lives only in the manual harness (not yet ported to the lib); output_tokens/wall_seconds budget axes are graded but not yet populated by the claude-code adapter.

🤖 Generated with Claude Code

Bwvolleyball and others added 17 commits May 29, 2026 12:23
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ecks, +tests)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Generate evals/api/prompts.yaml (16 prompts) via migrate_prompts.py and
delete both evals/api/prompts.csv and evals/hawkscan/prompts.csv.
prompts.yaml is now the single source of truth for both skills; harness
shims still reference prompts.csv and will be updated in Task 12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rop unused --rubric

Process checks, ad-hoc expectations, and budget checks now only run when
prompt.should_trigger and did_trigger are both true. Correct non-triggers,
false positives, and false negatives are graded purely on trigger accuracy,
fixing the critical bug where a 100%-correct run would exit non-zero in CI.
Also removes the parsed-but-never-read --rubric flag from _common_args.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rop rubric plumbing, refresh stale docs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Bwvolleyball Bwvolleyball requested a review from a team as a code owner May 29, 2026 20:01
@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Claude finished @Bwvolleyball's task in 3m 0s —— View job


Claude Code Review

  • Gather context and understand the PR
  • Read changed files and diff
  • Review correctness, security, performance, testing, conventions
  • Post final verdict

⚠️ Needs Attention

Solid, well-tested refactor. The shared lib, strict Pydantic config, replay-from-trace, and 39-unit-test suite are all real improvements. Three issues from the previous round are confirmed fixed.

✅ Fixed from prior review

  • Codex duplicate --sandbox — correctly branches on CI env var now
  • Cursor CURSOR_API_KEY CLI exposure — injected via environment, never on the command line
  • codex/agy/cursor running on every PR — workflow is workflow_dispatch-only

Remaining warnings

  • evals/lib/compare.py:21-29compare_skill() has no per-prompt try/except; a single adapter.launch() failure aborts the entire compare run and produces a partial lift.json. The cli.main() loop already handles this correctly — compare should mirror it. Fix this →
  • evals/lib/reporting.py:217-219render_job_summary() leaves why blank when a blocking process-check fails but trigger_correct=True. The _fail_reason() helper in the same file already handles this with "blocking check failed" — use it. Fix this →
  • evals/harnesses/agy/adapter.pyload_skill is a no-op for agy; both with and without compare arms run the same globally-installed skill, so compare always reports effect: "none" for agy. At minimum add a note; ideally filter agy from compare runs.
  • evals/harnesses/codex/adapter.py:82-87seen set deduplicates item.started events by command string, so a command that legitimately runs twice is only counted once — under-counting bash_commands for budget checks.
  • evals/cli.py:102-103import json and from pathlib import Path inside compare() are redundant; both already imported at module level.

@github-actions

This comment was marked as outdated.

Bwvolleyball and others added 10 commits May 29, 2026 14:53
Implements CodexAdapter with CLI_SIGNALS, INVOCATION_SIGNALS, parse_stream
(item.started/item.completed/turn.completed), detect_trigger, and launch
(codex exec --json --sandbox workspace-write --skip-git-repo-check), resolving
the C2 defect where uv run evals --harness codex raised ValueError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements CursorAdapter with cursor-specific stream-json event keys
(tool_call/subtype:started/shellToolCall, not claude-code's tool_use
blocks), the full CLI_SIGNALS and INVOCATION_SIGNALS from pre-shim,
launch flags matching the pre-shim invocation, and tests backed by a
minimal cursor.txt fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ive-run fidelity)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds AgyAdapter with plain-text parse_stream (wraps full stdout in
output_text, bash_commands always empty), INVOCATION_SIGNALS recovered
verbatim from pre-shim ALL_SIGNALS plus evaluation-format backtick
variants, and launch() mirroring the pre-shim agy -p / --print-timeout
invocation.  CLI_SIGNALS is empty (agy has no shell commands to scan).
Skills are installed globally in CI via agy plugin install; load_skill
is a no-op.  AGY_API_KEY flows through os.environ as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ve-run fidelity)

Restores the pre-shim OBSERVE_SUFFIX (verbatim) and appends it to the
prompt inside launch() before invoking agy. In --print mode agy hangs on
tool approvals, so the suffix makes the agent declare 'SKILL: hawkscan' /
'SKILL: api' / 'SKILL: none' up front — that declaration is what the
pre-shim SKILL: signals in INVOCATION_SIGNALS match. Without it, live agy
runs emit no detectable trigger text (all false-negatives).

Both signal sets are retained: pre-shim SKILL: entries AND the backtick
evaluation-format variants, so detection is robust regardless of which
format agy emits. Adds a unit test asserting OBSERVE_SUFFIX is non-empty,
requests the SKILL: declaration, and that detect_trigger fires on it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… C2)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ds badge

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds render_digest() to reporting.py, a new `report` CLI entrypoint that
discovers cell.json artifacts via rglob and writes a consolidated digest.md,
and replaces the flat JS-built comment in skill-evals.yml with two clean
steps (uv build + thin github-script post).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bwvolleyball and others added 5 commits May 29, 2026 15:21
…eful)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ters

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cursor best-effort; wire score_delta into overview

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread evals/harnesses/codex/adapter.py Outdated
if model:
cmd += ["-m", model]
if not full_auto:
cmd += ["--sandbox", "read-only"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[critical] --sandbox is specified twice with conflicting values. workspace-write is always added on line 128, then when full_auto=False (the default), read-only is appended on line 134. The resulting command is:

codex exec --json --sandbox workspace-write --skip-git-repo-check --sandbox read-only <prompt>

Depending on how Codex resolves duplicate flags this will either use the wrong sandbox level or error out. Fix by making it conditional:

sandbox = "workspace-write" if full_auto else "read-only"
cmd = ["codex", "exec", "--json", "--sandbox", sandbox, "--skip-git-repo-check"]

Fix this →

_setup_skill(tmpdir)
api_key = os.environ.get("CURSOR_API_KEY", "")
cmd = [
"agent", "-p", prompt,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] The CURSOR_API_KEY is passed as a command-line argument, making it visible in process listings (ps aux, /proc/<pid>/cmdline). Prefer injecting secrets via environment variable rather than --api-key <value>. The cursor CLI likely reads CURSOR_API_KEY from the environment already (consistent with how ANTHROPIC_API_KEY is used in the claude-code adapter — never passed on the command line there).

) -> ParsedRun:
# Skills are installed globally via `agy plugin install` in CI;
# load_skill toggling is a no-op here.
tmpdir = tempfile.mkdtemp(prefix=f"hawkeval_{run_id}_")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] The agy adapter documents load_skill as a no-op because skills are installed globally via agy plugin install in CI. This means compare_skill() cannot actually test "without skill" for agy — both the load=True and load=False runs use the same globally-installed skill, so the compare will always report effect: "none" for this platform, producing misleading lift data. Consider either: (a) filtering agy out of compare runs until per-run skill toggle is supported, or (b) emitting a warning in compare output when the adapter returns noop for load_skill=False.

github.event_name != 'workflow_dispatch' ||
github.event_name == 'pull_request' ||
github.event_name == 'push' ||
inputs.platform == 'all' ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] codex, agy, and cursor jobs all run on every PR (github.event_name == pull_request), but they produce empty results because the API keys are not set for PR builds. The existing CI comment shows "No results" for all three platforms. Each job installs a CLI, fails at plugin install, then exits — burning ~30 CI min per PR (3 platforms × 2 skills). Consider gating them to workflow_dispatch only until the adapters and API keys are ready.

@claude

This comment has been minimized.

Bwvolleyball and others added 11 commits May 31, 2026 20:45
…ort, fix total_cost_usd

- ParsedRun gains returncode + stderr_tail fields; EvalResult gains note field
- grade() propagates run.error → EvalResult.note on both return paths
- render_job_summary() appends note to the "why" column when present
- All four adapters (claude-code, codex, cursor, agy) now capture proc.returncode
  and proc.stderr after subprocess.run, set run.error on non-zero exit or empty output
- claude-code adapter parse_stream reads total_cost_usd (new key) before cost_usd
  (legacy key) so cost stops showing $0.00

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te cell+summary+trace

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to run summary; capture-baseline full matrix

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st to run summary on dispatch

Evals run real agents against tool CLIs and were never an automatic PR gate
(origin/main commit c860e47 deliberately removed the pull_request trigger).
Auto-PR runs surfaced env gaps (CLIs not installed, skills not loading under
--bare) that were never set up for CI. Restore manual dispatch; the report job
now writes the consolidated digest to GITHUB_STEP_SUMMARY on dispatch and only
posts a PR comment when a PR context exists.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- codex: pick --sandbox value once (workspace-write vs read-only). Passing
  both made codex exit 2 ("--sandbox cannot be used multiple times"), failing
  every non-full-auto run before the agent started.
- cursor: pass CURSOR_API_KEY via the child environment instead of --api-key
  on the command line (the flag leaked the secret into process listings/logs;
  the agent CLI reads it from the env directly).
- agy: mark CLI install/verify/plugin-install/run steps continue-on-error so a
  flaky preview installer no longer aborts the job before evals run — the eval
  CLI records the launch failure and uploads a result the digest can surface
  (matches cursor's best-effort treatment).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hawkscan skill grades on whether the agent runs the documented hawk
commands (hawk version/config/validate/scan) — all command_executed checks.
With no hawk on the runner, the agent improvised (docker) and never emitted a
hawk* trigger signal, so every triggering prompt scored FN.

Add a JDK 17 (hawk is a Java app) + install the latest hawk via the repo's
own documented method: resolve version from api.stackhawk.com/hawkscan/version,
download the Linux ZIP, unzip, add to PATH. Install/verify are
continue-on-error so a flaky download still lets evals run and record state.

Auth (HAWK_API_KEY) and hawkop are not wired here; the api skill and live-app
checks remain blocked until those land. This isolates "does installing hawk
flip the hawkscan trigger + preflight/validate/scan checks green".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-rolled curl/unzip with stackhawk/hawkscan-action@v2.5.0 using
installCLIOnly: true — the maintained, canonical install path (resolves latest,
handles the download/PATH). Keep setup-java@17 (hawk is a Java app; the action
ships the CLI, not a JRE) and the post-install `hawk version` verify.

apiKey is passed from the (currently empty) HAWK_API_KEY secret; install-only
performs no scan so the key is unused. Step is continue-on-error so a missing
key can't abort the job — evals still runs and records hawk availability.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All four AI harnesses (claude-code, codex, agy, cursor) drive an agent that
runs the hawkscan skill, which needs the hawk CLI on PATH. Previously only
claude-code installed it. Add the same install-only step (setup-java@17 +
stackhawk/hawkscan-action@v2.5.0 installCLIOnly + hawk version verify) to the
codex, agy, and cursor jobs.

Note: the api skill prefers hawkop (separate CLI) and codex/agy/cursor still
have their own tool-CLI provisioning blockers (codex auth, agy/cursor install)
— tracked separately. This change covers hawk specifically, for every job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root causes found from the run-26769783222 step logs + traces:

- codex: 401 "Missing bearer" — `codex exec` reads stored credentials, not
  OPENAI_API_KEY. Add `printenv OPENAI_API_KEY | codex login --with-api-key`
  before the eval run (pipe via stdin, never as an arg).
- agy: `https://antigravity.google/install-cli` returns the site's HTML landing
  page, so `| bash` died with a syntax error and `agy` never installed. Use the
  real bootstrapper `/cli/install.sh`, add ~/.local/bin to PATH, and set
  ANTIGRAVITY_API_KEY (the env var agy actually reads) from the AGY_API_KEY secret.
- cursor: `@cursor/cli` 404s and the `cursor-agent` npm package ships no `agent`
  binary. Use the official installer `curl https://cursor.com/install | bash`,
  which symlinks `agent` into ~/.local/bin; add that to PATH.

claude-code plumbing already works (hw-05 trace: agent runs `hawk version` +
`hawk scan --help`); no change there. Installer URLs verified to serve real
shell scripts (application/x-sh) before wiring them in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After auth was fixed, codex still couldn't reach hawk: on Ubuntu-24.04 runners
the bubblewrap sandbox fails to initialize (unprivileged user namespaces are
gated by AppArmor) — "bwrap: loopback: Failed RTM_NEWADDR: Operation not
permitted" — so codex exits at sandbox startup before running any command
(33 occurrences across cells; codex issue #16334).

When CI is set, launch with --dangerously-bypass-approvals-and-sandbox instead
of --sandbox <mode>. Safe on an ephemeral runner in a throwaway tmpdir, and the
agent needs write+exec to run the hawkscan workflow anyway. Local runs keep the
real sandbox (workspace-write for full-auto, else read-only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agy has no non-interactive auth (OAuth-only; upstream antigravity-cli#78 is open
and unimplemented), so in a browser-less CI runner it prints an auth URL and
times out. Detect that and set a clear error note so the digest attributes it to
the upstream limitation rather than a plumbing/eval failure on our side.

Boundary A (eval finds agy) and hawk-on-PATH are fixed; this is the one remaining
harness that cannot run headlessly until upstream adds API-key auth.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread evals/lib/compare.py
for load in (True, False):
run = adapter.launch(p.prompt, skill, f"{p.id}-{'with' if load else 'without'}",
plugin_dirs, model=model, load_skill=load,
max_budget=max_budget, bare=bare, full_auto=full_auto)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] compare_skill() has no per-prompt exception handling. If adapter.launch() throws (e.g. subprocess failure, disk error), the entire compare run aborts mid-loop with an unhandled exception rather than recording the failure for that prompt and continuing. The evals CLI wraps each prompt in try/except for exactly this reason — compare should do the same:

for p in prompts:
    graded = {}
    for load in (True, False):
        try:
            run = adapter.launch(...)
            did = adapter.detect_trigger(run, skill)
            graded[load] = grade(p, run, cfg.checks, ...)
        except Exception as e:
            graded[load] = grade(p, ParsedRun(error=str(e)), cfg.checks, ...)

Note: the CI step already uses || true so a full abort won't break the workflow, but a half-written lift.json artifact will be confusing.

Comment thread evals/lib/reporting.py
rows = ["| test | result | why |", "|---|---|---|"]
for r in sorted(cell.results, key=lambda r: (_row_rank(r), r.run_id)):
why = "; ".join(r.budget_breaches) if r.budget_breaches else (
"" if r.trigger_correct else
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] When trigger_correct=True but a blocking process check fails (verdict = FAIL), the why column in the table will be empty — the user sees ❌ FAIL with no explanation. The condition chain falls through to the empty string branch:

why = "; ".join(r.budget_breaches) if r.budget_breaches else (
    "" if r.trigger_correct else          # ← empty when trigger correct but proc-check failed
    ("false-positive" if r.did_trigger else "false-negative"))

Consider adding a fallback for this case:

failed_checks = [c.id for c in r.process_checks if not c.passed and c.severity == "blocking"]
why = "; ".join(r.budget_breaches) or       ("; ".join(failed_checks) if r.verdict.value == "fail" else "") or       ("" if r.trigger_correct else ("false-positive" if r.did_trigger else "false-negative"))

Comment thread evals/cli.py
max_budget=args.max_budget, bare=args.bare,
full_auto=args.full_auto, only_id=args.prompt_id)
import json
from pathlib import Path
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] json and Path are already imported at the top of cli.py — these in-function imports are redundant dead weight. Remove them.

@claude

This comment has been minimized.

Bwvolleyball and others added 4 commits June 1, 2026 13:17
The Actions run summary showed ~14 tables (one per matrix cell), because every
eval job wrote its own render_job_summary to GITHUB_STEP_SUMMARY. Stop that; the
`report` job now aggregates all cell.json into a single pivot table:

  test | claude-code-haiku-4-5 | claude-code-sonnet-4-6 | codex-gpt-5.5 | ...
  hawkscan/hw-01 | ✅ | ❌ — false-negative | ❌ — blocking check failed | ...

Rows are skill/test, columns are platform-model (date stamp + redundant
"claude-" prefix trimmed), cells are a verdict emoji + a terse reason on
non-pass outcomes (`·` = that harness/model didn't run the test). Baseline/lift
extras kept as compact notes below the table, not as more tables.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
--bare is "minimal mode": per `claude --help` it skips hooks/LSP/plugins and
"skills still resolve via /skill-name" — i.e. skills do NOT auto-trigger from
their description. The eval prompts are natural language, so in bare mode the
skill never fired (the agent ran as a vanilla model and gave generic DAST advice,
even naming ZAP). That produced ~all false-negatives on positive prompts.

Run in full plugin mode instead (also the realistic user experience). Isolated
change — measuring trigger rate and whether negative controls now over-trigger
before layering in HAWK_API_KEY / hawkop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uged

The CI sandbox has no running app/credentials, so a triggered agent correctly
stops and asks for a target instead of completing a scan — leaving the
workflow process-checks (which scan bash_commands + output_text) unsatisfied.

In observe mode (default, no --bare, not full-auto) append a suffix asking the
agent to (1) declare the StackHawk skill it would invoke in a signal-matching
format and (2) outline the CLI commands that skill's workflow runs. The
declaration drives trigger detection; the outline satisfies the workflow checks
via output_text — reproducing origin/main's observe-mode intent (gauge that the
right skill triggers and the agent knows its workflow) without a real target.

The commands are intentionally NOT listed in the suffix — producing them is the
skill's job. Full-auto/extended (real target) keeps the bare prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hawkscan Step-1 dedup checks (hawkop app list / env list) and the entire
api skill require hawkop — a separate native CLI we never installed, so agents
couldn't run/narrate those steps. No official GitHub Action exists for hawkop,
so install the native Linux binary directly (download.stackhawk.com/hawkop,
latest-version.txt + x86_64-unknown-linux-gnu tarball) into /usr/local/bin,
right beside the hawk install in all four jobs. continue-on-error so a flaky
download never aborts the job. No runtime deps (native binary; no JDK needed).

URL/version per the repo's own api skill reference (hawkop-shortcuts.md);
tarball verified to contain a top-level `hawkop` binary before wiring it in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

⚠️ Claude Code Review — Needs Attention

Substantial, well-architected PR. The shared evals/lib/ refactor is clean, Pydantic strict-validated config is a real safety improvement, and the 39-test suite gives good coverage across models/grading/reporting. Three issues from the previous review round are confirmed fixed (codex duplicate --sandbox, cursor API key CLI-exposure, and CI jobs running on every PR). A few warnings remain open.

✅ Fixes confirmed from prior review round

  • Codex duplicate --sandbox (evals/harnesses/codex/adapter.py) — now correctly branches on CI env var; --dangerously-bypass-approvals-and-sandbox in CI, conditional workspace-write / read-only locally.
  • Cursor CURSOR_API_KEY CLI exposure (evals/harnesses/cursor/adapter.py) — key now injected via env=dict(os.environ), never on the command line.
  • codex/agy/cursor running on every PR (skill-evals.yml) — workflow is now workflow_dispatch-only; free jobs (validate, pytest) are the only automatic gates.

🚨 Issues Found

[warning] evals/lib/compare.py:21-29compare_skill() has no per-prompt exception handling. If adapter.launch() raises mid-loop (subprocess failure, disk error, timeout not wrapped), the entire compare run aborts with an unhandled exception, producing a half-written or missing lift.json artifact. cli.main() wraps each prompt in try/except for exactly this reason; compare_skill() should do the same. The || true in the CI step masks the abort but leaves the artifact in an indeterminate state.

[warning] evals/lib/reporting.py:217-219render_job_summary() leaves the why column empty when a blocking process-check fails but the trigger fired correctly (i.e., trigger_correct=True, budget_breaches=[], verdict=FAIL). The condition chain short-circuits to "" in that branch. The _fail_reason() helper in the same file handles this gracefully with a "blocking check failed" fallback — render_job_summary should use the same logic. A user sees ❌ FAIL with no explanation in the table.

[warning] evals/harnesses/agy/adapter.py / evals/lib/compare.pyload_skill is documented as a no-op for agy (skills are globally installed). This means both the load=True and load=False compare arms run against the same globally-installed skill, so compare will always report effect: "none" for agy — the lift measurement is meaningless. Consider either filtering agy out of compare runs or emitting a warning in compare_skill() when the adapter doesn't support per-run skill toggling.

[warning] evals/harnesses/codex/adapter.py:82-87 — The seen: set() deduplication for item.started events means a command that legitimately runs twice (e.g., two hawk scan invocations in a rescan flow) is only counted once. This under-counts bash_commands for budget checks, potentially letting a prompt pass a bash_commands budget it actually violated.

Suggested before merge

  • Add try/except in compare_skill() loop — mirror the resilience already in cli.main().
  • Fix the empty why in render_job_summary for blocking-check failures — use _fail_reason(r) or add the "blocking check failed" fallback.
  • Remove the redundant import json / from pathlib import Path at the top of compare() in evals/cli.py:102-103 (both already imported at module level).
  • Document or guard the agy compare limitation — a # NOTE: load_skill is a no-op; compare results for agy are always "none" comment in compare_skill() at minimum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant