
Add behavioral evals framework #12

Merged: HartBrook merged 2 commits into main from feature/evals on Jan 17, 2026

@HartBrook (Owner) commented Jan 17, 2026

Summary

  • Add stag eval command to run behavioral tests against CLAUDE.md configs
  • Include 25 starter evals covering security, code quality, documentation, and language-specific best practices
  • Integrate with Promptfoo for LLM-based test assertions
  • Support eval syncing from team repos via stag sync

What's Included

New Commands:

  • stag eval - Run evals against your merged config
  • stag eval list - List available evals
  • stag eval init - Install starter evals
  • stag eval info <name> - Show eval details

Features:

  • Filter by tag (--tag security), eval name, or test name pattern (--test uses-*)
  • Test specific config layers (--layer team)
  • Multiple output formats: table, JSON, GitHub Actions annotations
  • Debug mode with full Claude responses (--debug)
  • Dry run (--dry-run) to preview tests without making API calls
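
To illustrate how the tag and test filters listed above might combine, here is a minimal Python sketch. It is not staghorn's implementation: the `Eval` class, the `select` function, the `lang-python` eval, and all test names are hypothetical, and it assumes `--test` takes a shell-style glob, as the `uses-*` example suggests.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Eval:
    """Hypothetical in-memory view of one eval; not staghorn's real schema."""
    name: str
    tags: list[str] = field(default_factory=list)
    tests: list[str] = field(default_factory=list)


def select(evals, tag=None, test_pattern=None):
    """Return (eval name, matching test names) pairs that pass both filters."""
    selected = []
    for ev in evals:
        # Skip evals that do not carry the requested tag.
        if tag is not None and tag not in ev.tags:
            continue
        # Keep only the tests whose names match the glob, if one was given.
        tests = [
            t for t in ev.tests
            if test_pattern is None or fnmatchcase(t, test_pattern)
        ]
        if tests:
            selected.append((ev.name, tests))
    return selected


# Illustrative data only; the test names below are made up.
evals = [
    Eval("security-secrets", ["security"], ["uses-env-vars", "rejects-hardcoded-keys"]),
    Eval("lang-python", ["language"], ["uses-type-hints"]),
]

# Roughly what `--tag security --test uses-*` would select under these assumptions.
print(select(evals, tag="security", test_pattern="uses-*"))
```

In this sketch an eval with no matching tests is dropped entirely, which is one plausible design choice; the real command may instead report it as skipped.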

Starter Evals (25 total):

| Category | Evals |
|----------|-------|
| Security | secrets, injection, auth, OWASP top 10, validation |
| Quality | clarity, simplicity, naming, error handling |
| Review | bugs, tests, performance, maintainability |
| Docs | API documentation, code comments |
| Git | commit messages, sensitive files |
| Language | Python, Go, TypeScript, Rust |
| Baseline | helpful, focused, honest, minimal |

Test plan

  • stag eval init installs starter evals to ~/.config/staghorn/evals/
  • stag eval list shows all available evals grouped by source
  • stag eval --dry-run previews tests without API calls
  • stag eval security-secrets runs a specific eval
  • stag eval --tag security filters by tag
  • stag eval --output json produces valid JSON
  • stag team validate validates evals in team repos
  • stag sync fetches evals from team repo's evals/ directory
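
The "produces valid JSON" item in the test plan could be checked automatically with something like the following Python sketch. The helper names are hypothetical, the structure of the JSON output is not documented here and so is not inspected, and the exit-code behavior of `stag eval` is unknown, so the sketch deliberately ignores it.

```python
import json
import subprocess


def is_valid_json(text: str) -> bool:
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def stag_json_output_is_valid() -> bool:
    """Run `stag eval --output json` and check that stdout is valid JSON.

    Assumes the `stag` binary is on PATH. The return code is not checked,
    because how the command signals failing evals is not known here.
    """
    result = subprocess.run(
        ["stag", "eval", "--output", "json"],
        capture_output=True,
        text=True,
    )
    return is_valid_json(result.stdout)
```

Calling `stag_json_output_is_valid()` makes real API calls unless the command is run in a mode that avoids them, so a check like this probably belongs in a manual or scheduled job rather than in every CI run.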

@HartBrook HartBrook merged commit 77283ed into main Jan 17, 2026
4 checks passed