Open
62 commits
fde324e
build: bootstrap uv project for evals lib
Bwvolleyball May 29, 2026
2c4fd36
feat(evals): Pydantic data models with strict validation
Bwvolleyball May 29, 2026
1041542
feat(evals): skill config loader with validation
Bwvolleyball May 29, 2026
23f82e2
feat(evals): grading with per-prompt expected + budget verdict
Bwvolleyball May 29, 2026
d7adeff
fix(evals): grading robustness from code review (loud on malformed ch…
Bwvolleyball May 29, 2026
cf46d3e
feat(evals): Harness protocol + claude-code adapter
Bwvolleyball May 29, 2026
b3827fc
chore(evals): add one-time prompts CSV->YAML migration script
Bwvolleyball May 29, 2026
f424b26
feat(evals): migrate hawkscan prompts to YAML + worked budget example
Bwvolleyball May 29, 2026
ae57aed
feat(evals): replay-from-trace regrade + realistic hw-07 fixture
Bwvolleyball May 29, 2026
de9cb81
feat(evals): with/without-skill compare mode
Bwvolleyball May 29, 2026
9399123
feat(evals): unified CLI (evals/compare/regrade/validate) + reporting
Bwvolleyball May 29, 2026
f054e1f
feat(evals): migrate api prompts to YAML; remove CSV sources
Bwvolleyball May 29, 2026
5472ed2
refactor(evals): harness scripts become shims into unified CLI
Bwvolleyball May 29, 2026
5a7f80f
ci(evals): tiered runs (validate on PR, cheap PR matrix, full main) + uv
Bwvolleyball May 29, 2026
fc9c551
docs(evals): document uv CLI, prompts.yaml, compare/regrade, PASS-SLOW
Bwvolleyball May 29, 2026
46ed9e8
fix(evals): gate process checks on should_trigger+did_trigger (C1); d…
Bwvolleyball May 29, 2026
413a748
ci+docs(evals): dispatch-only non-claude jobs (C2), add pytest job, d…
Bwvolleyball May 29, 2026
7cc0f77
feat(evals): real codex adapter (ports pre-shim stream parsing)
Bwvolleyball May 29, 2026
7250ae2
feat(evals): real cursor adapter (ports pre-shim stream parsing)
Bwvolleyball May 29, 2026
ff59637
fix(evals): cursor adapter launch restores --trust + skill-loading (l…
Bwvolleyball May 29, 2026
5610376
feat(evals): real agy adapter (plain-text parsing)
Bwvolleyball May 29, 2026
47d2a3e
fix(evals): agy adapter appends OBSERVE_SUFFIX so triggers detect (li…
Bwvolleyball May 29, 2026
aef302e
ci(evals): re-enable codex/cursor/agy now that adapters exist (closes…
Bwvolleyball May 29, 2026
a041ca9
feat(evals): CellReport model + cell.json artifact from main()
Bwvolleyball May 29, 2026
da3d46c
feat(evals): render_job_summary (JUnit-style, failures first) + shiel…
Bwvolleyball May 29, 2026
5afc996
feat(evals): main() writes GITHUB_STEP_SUMMARY job report
Bwvolleyball May 29, 2026
03707c0
feat(evals): render_digest + report CLI; comment job posts rich digest
Bwvolleyball May 29, 2026
e076846
feat(evals): baseline diff + score_delta (pure threshold math, no AI)
Bwvolleyball May 29, 2026
10b4883
feat(evals): digest shows regression vs released-tag baseline
Bwvolleyball May 29, 2026
324b8cf
ci(evals): capture baseline at release tag; PR diffs against it (grac…
Bwvolleyball May 29, 2026
bd52c2c
feat(evals): compare emits lift effect + writes lift.json
Bwvolleyball May 29, 2026
5372103
feat(evals): render skill-lift section; PR runs compare for lift
Bwvolleyball May 29, 2026
690dc5a
docs(evals): document JUnit-style report, comparisons, four real adap…
Bwvolleyball May 29, 2026
dca5e67
fix(evals): comment job checkout-before-download (empty-digest bug); …
Bwvolleyball May 29, 2026
7947bc3
feat(evals): capture stderr+returncode, surface harness errors in rep…
Bwvolleyball Jun 1, 2026
05c7bc7
fix(evals): main() resilient to per-prompt launch crashes; always wri…
Bwvolleyball Jun 1, 2026
06e063b
ci(evals): full tool×model matrix on PR+dispatch (drop push); digest …
Bwvolleyball Jun 1, 2026
327a770
ci(evals): revert to workflow_dispatch-only (match origin/main); dige…
Bwvolleyball Jun 1, 2026
cbe638f
fix(evals): unblock codex/cursor/agy harness execution
Bwvolleyball Jun 1, 2026
7c38290
ci(evals): install latest hawk CLI in the claude-code job
Bwvolleyball Jun 1, 2026
7b79cf9
ci(evals): install hawk via official hawkscan-action (install-only)
Bwvolleyball Jun 1, 2026
f7d47fc
ci(evals): install hawk in every harness job, not just claude-code
Bwvolleyball Jun 1, 2026
7a789ec
ci(evals): fix agent-CLI plumbing for codex, agy, cursor
Bwvolleyball Jun 1, 2026
f7e0a3e
fix(evals): codex bypasses bwrap sandbox in CI so the agent can run hawk
Bwvolleyball Jun 1, 2026
755dd7b
evals(agy): label the OAuth-only auth blocker distinctly
Bwvolleyball Jun 1, 2026
4b7008c
report(evals): collapse matrix into one pivot table in the run summary
Bwvolleyball Jun 1, 2026
221d47d
ci(evals): drop --bare so claude-code skills auto-trigger
Bwvolleyball Jun 1, 2026
ee2d17c
evals(claude-code): observe-mode suffix so skill triggering can be ga…
Bwvolleyball Jun 1, 2026
482206b
ci(evals): install hawkop CLI in every harness job
Bwvolleyball Jun 1, 2026
171b61c
Merge remote-tracking branch 'origin/main' into feat/scale-skill-evals
Bwvolleyball Jun 2, 2026
9c5719f
evals: integrate origin/main's stackhawk-data-seed suite into the new…
Bwvolleyball Jun 2, 2026
aebb4a3
chore: gitignore __pycache__ (untrack files accidentally added in pre…
Bwvolleyball Jun 2, 2026
30a70e4
evals: restore hawk version signal + wire stackhawk-data-seed triggers
Bwvolleyball Jun 2, 2026
1ffc1f4
evals: port the qualitative rubric grader, woven into the pass/fail t…
Bwvolleyball Jun 2, 2026
84a11b5
ci(evals): run the rubric grader matrix-wide (dispatch toggle, defaul…
Bwvolleyball Jun 2, 2026
68d9e54
ci(evals): wire HAWK_API_KEY into all harness run steps
Bwvolleyball Jun 2, 2026
e97dab8
evals: fix rubric grader + data-seed check-type grading (clean-baseli…
Bwvolleyball Jun 2, 2026
a3ffa7c
fix(evals): rubric grader hit max-budget — bump cap + pin a cheap gra…
Bwvolleyball Jun 2, 2026
a5e2d64
evals: per-skill observe suffix + authoritative decision-line trigger
Bwvolleyball Jun 2, 2026
13ea680
fix(evals): rubric grader chokes on prose-wrapped JSON — extract tole…
Bwvolleyball Jun 3, 2026
ef2793a
fix(evals): observe suffix — weak models refused when skill body absent
Bwvolleyball Jun 3, 2026
4a30b37
evals: repair data-seed regression + sharpen trigger accuracy
Bwvolleyball Jun 3, 2026
3 changes: 3 additions & 0 deletions .github/actionlint.yaml
@@ -0,0 +1,3 @@
self-hosted-runner:
  labels:
    - agent-skills-amd-4cpu
166 changes: 166 additions & 0 deletions .github/workflows/capture-baseline.yml
@@ -0,0 +1,166 @@
name: Capture Eval Baseline
on:
  workflow_dispatch:
    inputs:
      tag:
        description: "Release tag to baseline (e.g. v1.9.0)"
        required: true
        type: string
permissions:
  contents: read

jobs:

  # ── Claude Code — 3 models ─────────────────────────────────────────────────
  capture-claude-code:
    name: baseline / claude-code / ${{ matrix.skill }} / ${{ matrix.model }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        skill: [hawkscan, api]
        model: [claude-sonnet-4-6, claude-opus-4-7, claude-haiku-4-5-20251001]
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.tag }}
      - uses: astral-sh/setup-uv@v5
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Install Claude Code CLI
        run: npm install -g @anthropic-ai/claude-code
      - name: Verify claude CLI
        run: claude --version
      - name: Run baseline eval (${{ matrix.model }})
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          uv run evals --harness claude-code --skill ${{ matrix.skill }} \
            --model ${{ matrix.model }} --bare --max-budget 0.15 || true
      - name: Upload baseline artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: baseline-claude-code-${{ matrix.skill }}-${{ matrix.model }}
          path: evals/harnesses/claude-code/results/${{ matrix.skill }}/cell.json
          retention-days: 90

  # ── Codex — 2 models ──────────────────────────────────────────────────────
  capture-codex:
    name: baseline / codex / ${{ matrix.skill }} / ${{ matrix.model }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        skill: [hawkscan, api]
        model: [gpt-5.5, o3]
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.tag }}
      - uses: astral-sh/setup-uv@v5
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Install Codex CLI
        run: npm install -g @openai/codex
      - name: Verify codex CLI
        run: codex --version
      - name: Install StackHawk skills (hawkscan + api)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          codex plugin marketplace add .
          echo y | codex plugin add hawkscan@stackhawk
          echo y | codex plugin add stackhawk-api@stackhawk
      - name: Run baseline eval (${{ matrix.model }})
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          uv run evals --harness codex --skill ${{ matrix.skill }} --model ${{ matrix.model }} || true
      - name: Upload baseline artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: baseline-codex-${{ matrix.skill }}-${{ matrix.model }}
          path: evals/harnesses/codex/results/${{ matrix.skill }}/cell.json
          retention-days: 90

  # ── Antigravity (agy) — default model ─────────────────────────────────────
  capture-agy:
    name: baseline / agy / ${{ matrix.skill }} / ${{ matrix.model }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        skill: [hawkscan, api]
        model: [default]
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.tag }}
      - uses: astral-sh/setup-uv@v5
      - name: Install agy CLI
        run: curl -fsSL https://antigravity.google/install-cli | bash
      - name: Verify agy CLI
        run: agy --version
      - name: Install StackHawk plugins
        env:
          AGY_API_KEY: ${{ secrets.AGY_API_KEY }}
        run: |
          echo y | agy plugin install plugins/hawkscan
          echo y | agy plugin install plugins/api
      - name: Run baseline eval
        env:
          AGY_API_KEY: ${{ secrets.AGY_API_KEY }}
        run: |
          MODEL_ARGS=()
          if [ "${{ matrix.model }}" != "default" ]; then MODEL_ARGS=(--model "${{ matrix.model }}"); fi
          uv run evals --harness agy --skill ${{ matrix.skill }} "${MODEL_ARGS[@]}" || true
      - name: Upload baseline artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: baseline-agy-${{ matrix.skill }}-${{ matrix.model }}
          path: evals/harnesses/agy/results/${{ matrix.skill }}/cell.json
          retention-days: 90

  # ── Cursor — default model ─────────────────────────────────────────────────
  capture-cursor:
    name: baseline / cursor / ${{ matrix.skill }} / ${{ matrix.model }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        skill: [hawkscan, api]
        model: [default]
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.tag }}
      - uses: astral-sh/setup-uv@v5
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Install Cursor CLI
        run: npm install -g @cursor/cli || npm install -g cursor-agent
        continue-on-error: true  # package name TBD; update when stable
      - name: Verify agent CLI
        run: agent --version
        continue-on-error: true  # CLI package name TBD; skip if unavailable
      - name: Run baseline eval
        env:
          CURSOR_API_KEY: ${{ secrets.CURSOR_API_KEY }}
        run: |
          MODEL_ARGS=()
          if [ "${{ matrix.model }}" != "default" ]; then MODEL_ARGS=(--model "${{ matrix.model }}"); fi
          uv run evals --harness cursor --skill ${{ matrix.skill }} "${MODEL_ARGS[@]}" || true
        continue-on-error: true  # best-effort
      - name: Upload baseline artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: baseline-cursor-${{ matrix.skill }}-${{ matrix.model }}
          path: evals/harnesses/cursor/results/${{ matrix.skill }}/cell.json
          retention-days: 90
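
The agy and cursor jobs above pass `--model` only when the matrix value is not the literal `default`, by collecting the optional flag in a bash array. A minimal standalone sketch of that pattern follows; `build_model_args` is a hypothetical helper introduced here for illustration and does not appear in the workflow.

```shell
#!/usr/bin/env bash
# Sketch of the optional-argument pattern used in the agy/cursor run steps.
# build_model_args is a hypothetical wrapper, not part of the PR.
build_model_args() {
  local model="$1"
  MODEL_ARGS=()                      # start empty: no extra flags
  if [ "$model" != "default" ]; then
    MODEL_ARGS=(--model "$model")    # two separate words: flag and value
  fi
}

build_model_args "default"
echo "default -> ${#MODEL_ARGS[@]} extra args"

build_model_args "gpt-5.5"
echo "gpt-5.5 -> ${#MODEL_ARGS[@]} extra args: ${MODEL_ARGS[*]}"
```

Expanding the array as `"${MODEL_ARGS[@]}"` yields zero words when it is empty and exactly two correctly quoted words otherwise, which a plain string variable cannot do without either word-splitting or passing an empty argument. One caveat: on bash older than 4.4, expanding an empty array under `set -u` is treated as an unbound variable, so the pattern is safest where `nounset` is off.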
22 changes: 20 additions & 2 deletions .github/workflows/release.yml
@@ -127,6 +127,24 @@ jobs:
         if: inputs.dry_run == true
         run: echo "DRY RUN complete — all checks passed for ${{ steps.version.outputs.tag }}"

+  capture-baseline:
+    name: Trigger baseline capture
+    needs: release
+    if: inputs.dry_run != true
+    runs-on: ubuntu-latest
+    permissions:
+      actions: write
+    steps:
+      - uses: actions/checkout@v4
+      - name: Dispatch capture-baseline
+        # GITHUB_TOKEN can dispatch workflows in the same repo for most orgs.
+        # If org policy blocks it, swap to the TF_GITHUB_TOKEN PAT that
+        # update-marketplace pulls from SSM (aws ssm get-parameter --name TF_GITHUB_TOKEN).
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          RELEASE_TAG: ${{ needs.release.outputs.tag }}
+        run: gh workflow run capture-baseline.yml -f tag="$RELEASE_TAG"
+
   update-marketplace:
     name: Update marketplace pin
     needs: release
@@ -139,7 +157,7 @@ jobs:
       - name: Resolve cache
         run: |
           biodome ci restore-cache
-          rm -rf *.tar.lz4
+          rm -rf ./*.tar.lz4

       - name: Pull secrets
         run: biodome ci save-secrets
@@ -158,7 +176,7 @@ jobs:
           echo "::add-mask::${GH_PAT}"
           git clone https://github.com/stackhawk/agent-skills-marketplace.git /tmp/marketplace
           git -C /tmp/marketplace remote set-url origin \
-            https://x-access-token:${GH_PAT}@github.com/stackhawk/agent-skills-marketplace.git
+            "https://x-access-token:${GH_PAT}@github.com/stackhawk/agent-skills-marketplace.git"

       - name: Update marketplace.json
         run: |
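
The release.yml diff changes `rm -rf *.tar.lz4` to `rm -rf ./*.tar.lz4`. The PR does not state why, so the following is an assumption: an unprefixed glob can expand to a filename that starts with a dash, which `rm` then parses as options. The sketch below reproduces that in a throwaway directory; the filename `-oops.tar.lz4` is made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: why `./*.tar.lz4` is safer than `*.tar.lz4` as an rm argument.
# Runs entirely inside a temporary directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Hypothetical archive whose name begins with a dash.
touch -- "-oops.tar.lz4"

# Unprefixed glob expands to `-oops.tar.lz4`, which rm reads as option flags.
if rm -rf *.tar.lz4 2>/dev/null; then
  unprefixed_status="succeeded"
else
  unprefixed_status="rejected"
fi
if [ -e "./-oops.tar.lz4" ]; then survived="yes"; else survived="no"; fi

# Prefixed glob expands to `./-oops.tar.lz4`, which can only be a path.
rm -rf ./*.tar.lz4
if [ -e "./-oops.tar.lz4" ]; then removed="no"; else removed="yes"; fi

echo "unprefixed glob: $unprefixed_status (file survived: $survived)"
echo "prefixed glob removed file: $removed"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

The second change in the same file, wrapping the `https://x-access-token:${GH_PAT}@...` URL in double quotes, follows the same defensive idea: a quoted expansion stays a single argument even if the token ever contains whitespace or glob characters.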