26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.5.0] - 2026-01-18

### Added

- `stag eval validate` command to validate eval YAML files before running
- Checks assertion types, required fields, YAML structure, and naming conventions
- Provides helpful suggestions for common typos (e.g., `llm_rubric` → `llm-rubric`)
- Distinguishes between errors (blocking) and warnings (non-blocking)
- `stag eval create` command to create new evals from templates
- Interactive wizard for guided eval creation
- Four built-in templates: security, quality, language, blank
- `--template` flag to skip wizard and use template directly
- `--from` flag to copy and customize existing evals
- `--name` and `--description` flags for non-interactive creation
- `--project` flag for `stag eval create` to save evals to `.staghorn/evals/`
- `--team` flag for `stag eval create` to save evals to `./evals/` for team/community sharing
- Example evals in `example/team-repo/evals/` demonstrating team eval patterns

### Changed

- Updated EVALS_GUIDE.md with comprehensive documentation for validate and create commands
- Expanded CLI flags reference in README.md with new eval commands

## [0.4.0] - 2026-01-17

### Added
@@ -75,7 +98,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Support for team, personal, and project configuration layers
- Automatic CLAUDE.md generation with layered content

[Unreleased]: https://github.com/HartBrook/staghorn/compare/v0.4.0...HEAD
[Unreleased]: https://github.com/HartBrook/staghorn/compare/v0.5.0...HEAD
[0.5.0]: https://github.com/HartBrook/staghorn/compare/v0.4.0...v0.5.0
[0.4.0]: https://github.com/HartBrook/staghorn/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/HartBrook/staghorn/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/HartBrook/staghorn/compare/v0.1.0...v0.2.0
118 changes: 118 additions & 0 deletions EVALS_GUIDE.md
@@ -12,6 +12,76 @@ Evals verify that your CLAUDE.md configuration produces the behavior you expect

Evals are **behavioral tests**, not unit tests. They test the emergent behavior that results from your system prompt, not specific code paths.

## Creating Evals

### Quick Start with Templates

The fastest way to create a new eval is with the `create` command:

```bash
# Interactive wizard
stag eval create

# Use a specific template
stag eval create --template security
stag eval create --template quality
stag eval create --template language
stag eval create --template blank

# Copy from an existing eval
stag eval create --from security-secrets --name my-security

# Save to project instead of personal directory
stag eval create --project

# Save to ./evals/ for team/community repos
stag eval create --team
```

**Destination options:**
- Default: `~/.config/staghorn/evals/` (personal evals)
- `--project`: `.staghorn/evals/` (project-specific evals)
- `--team`: `./evals/` (team/community evals for sharing via git)

Available templates:
- **security** — Tests for hardcoded secrets, injection vulnerabilities
- **quality** — Tests for naming conventions, code duplication
- **language** — Language-specific best practices template
- **blank** — Minimal template to start from scratch
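
Each template generates a ready-to-edit YAML skeleton. The exact generated content may differ; a minimal sketch of what the blank template might produce, using the eval fields documented in this guide (the specific names and values here are illustrative):

```yaml
# Illustrative sketch of a freshly created eval — actual
# generated content may differ.
name: my-new-eval
description: Describe what behavior this eval verifies
tags: [custom]

tests:
  - name: example-test
    description: Should do something specific
    prompt: |
      Ask Claude something that exercises your CLAUDE.md guidelines
    assert:
      - type: llm-rubric
        value: Response should demonstrate the expected behavior
```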

### Validating Evals

Before running evals (which consume API credits), validate them:

```bash
# Validate all evals
stag eval validate

# Validate a specific eval
stag eval validate my-custom-eval
```

Validation checks for:
- Valid assertion types (`llm-rubric`, `contains`, `regex`, etc.)
- Required fields (`name`, `prompt`, `assert`)
- Proper YAML structure
- Naming conventions
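
To see what validation catches, consider a hypothetical eval containing the underscore typo `llm_rubric` (the filename and test content here are illustrative):

```yaml
# my-custom-eval.yaml — illustrative example of an eval that
# fails validation.
name: my-custom-eval
tests:
  - name: check-secrets
    description: Should flag hardcoded secrets
    prompt: |
      Review code containing a hardcoded API key
    assert:
      - type: llm_rubric   # error: invalid type, should be llm-rubric
        value: Response must warn about the hardcoded key
  - name: check-patterns   # warning: missing description
    prompt: |
      Review code for insecure patterns
    assert:
      - type: regex
        value: "(?i)injection"
```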

Example output:
```
Validating 25 eval(s)...

✓ security-secrets (4 tests)
✓ security-injection (3 tests)
✗ my-custom-eval (2 tests)
error: tests[0].assert[0].type: invalid assertion type "llm_rubric" (did you mean "llm-rubric"?)
warning: tests[1]: test "check-patterns" should have a description

24 valid, 1 invalid, 1 warning
```

The validator provides helpful suggestions for common typos (e.g., `llm_rubric` → `llm-rubric`).

## Anatomy of an Eval

```yaml
@@ -267,6 +337,16 @@ assert:

## Debugging Failed Tests

### Step 0: Validate First

Before debugging runtime failures, ensure your eval is valid:

```bash
stag eval validate my-eval
```

This catches common issues like typos in assertion types (e.g., `llm_rubric` instead of `llm-rubric`) without making API calls.

### Step 1: Run with `--debug`

```bash
@@ -485,6 +565,44 @@ Each test case = one API call. To minimize costs:

4. **Leverage Promptfoo caching** - Repeated runs with same prompts use cached responses

## Quick Reference

### Commands

```bash
# Run evals
stag eval # Run all evals
stag eval security-secrets # Run specific eval
stag eval --tag security # Filter by tag
stag eval --test "warns-*" # Filter by test name pattern
stag eval --debug # Show full responses

# Create and validate
stag eval create # Interactive wizard
stag eval create --template security
stag eval create --project # Save to .staghorn/evals/
stag eval create --team # Save to ./evals/ for team sharing
stag eval validate # Validate all evals
stag eval validate my-eval # Validate specific eval

# List and inspect
stag eval list # List available evals
stag eval info security-secrets # Show eval details
stag eval init # Install starter evals
```

### Valid Assertion Types

| Type | Description |
|----------------|--------------------------------------|
| `llm-rubric` | AI-graded evaluation (most flexible) |
| `contains` | Exact string match |
| `contains-any` | Any of the listed strings |
| `contains-all` | All of the listed strings |
| `not-contains` | String must not appear |
| `regex` | Regular expression match |
| `javascript` | Custom JS assertion function |
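
Assertions can be combined within a single test. A sketch mixing several of the types above (the prompt and assertion values are illustrative, not from a shipped eval):

```yaml
# Illustrative test combining assertion types from the table above.
tests:
  - name: warns-about-eval-usage
    description: Should flag use of eval() on user input
    prompt: |
      Review: result = eval(user_input)
    assert:
      - type: llm-rubric
        value: Response must identify eval() on user input as dangerous
      - type: contains-any
        value: ["ast.literal_eval", "sanitize", "validate"]
      - type: not-contains
        value: looks fine
      - type: regex
        value: "(?i)(security|danger|unsafe)"
```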

## Further Reading

- [Promptfoo Documentation](https://promptfoo.dev/docs/intro)
9 changes: 9 additions & 0 deletions README.md
@@ -157,6 +157,8 @@ The layering means you get shared standards _plus_ your personal style. You neve
| `stag eval` | Run behavioral evals against your config |
| `stag eval init` | Install starter evals |
| `stag eval list` | List available evals |
| `stag eval validate` | Validate eval definitions without running |
| `stag eval create` | Create a new eval from a template |
| `stag project` | Manage project-level config |
| `stag team` | Bootstrap or validate a team standards repo |
| `stag version` | Print version number |
@@ -676,6 +678,13 @@ stag eval list --source team # Filter by source
stag eval info <name> # Show eval details
stag eval init # Install starter evals
stag eval init --project # Install to project directory
stag eval validate # Validate all eval definitions
stag eval validate <name> # Validate specific eval
stag eval create # Create new eval (interactive)
stag eval create --template security # Create from template
stag eval create --from <eval> # Copy from existing eval
stag eval create --project # Save to .staghorn/evals/
stag eval create --team # Save to ./evals/ for team sharing
```

## Installation
5 changes: 5 additions & 0 deletions example/README.md
@@ -20,6 +20,10 @@ team-repo/
│ ├── refactor.md # Suggest refactoring improvements
│ ├── security-audit.md # Security vulnerability scan
│ └── test-gen.md # Generate unit tests
├── evals/ # Behavioral tests
│ ├── team-security.yaml # Security guidelines tests
│ ├── team-quality.yaml # Code quality tests
│ └── team-git.yaml # Git conventions tests
├── languages/ # Language-specific configs
│ ├── python.md
│ ├── go.md
@@ -43,6 +47,7 @@ To use this as your team's standards repo:
## Customization

- **CLAUDE.md**: Add your team's general coding standards
- **evals/**: Write tests to verify Claude follows your guidelines
- **languages/**: Add configs for languages your team uses
- **commands/**: Create prompts for common workflows
- **templates/**: Add project templates for different project types
41 changes: 41 additions & 0 deletions example/team-repo/evals/team-git.yaml
@@ -0,0 +1,41 @@
name: team-git
description: Verify team git conventions are followed
tags: [git, team]

# Tests aligned with the Git Conventions section of CLAUDE.md:
# - Write commit messages in imperative mood
# - Keep commits atomic and focused
# - Reference issue numbers when applicable

tests:
- name: uses-imperative-mood
description: Should write commit messages in imperative mood
prompt: |
Write a commit message for adding user authentication
assert:
- type: llm-rubric
value: Commit message uses imperative mood (e.g., "Add" not "Added" or "Adding")
- type: not-contains
value: Added

- name: keeps-commits-atomic
description: Should recommend atomic commits
prompt: |
I made these changes in one commit:
- Fixed login bug
- Added new dashboard feature
- Updated README
- Refactored database layer

Is this commit okay?
assert:
- type: llm-rubric
value: Response should recommend splitting into separate, focused commits

- name: references-issues
description: Should reference issue numbers when applicable
prompt: |
Write a commit message for fixing the bug described in issue #123
assert:
- type: contains-any
value: ["#123", "issue 123", "closes #123", "fixes #123"]
50 changes: 50 additions & 0 deletions example/team-repo/evals/team-quality.yaml
@@ -0,0 +1,50 @@
name: team-quality
description: Verify team code quality guidelines are followed
tags: [quality, team]

# Tests aligned with the Code Style section of CLAUDE.md:
# - Write clear, self-documenting code
# - Prefer explicit over implicit
# - Keep functions small and focused (under 50 lines)
# - Use meaningful variable and function names

tests:
- name: uses-descriptive-names
description: Should use meaningful variable and function names
prompt: |
Write a function to calculate the total price of items in a shopping cart
assert:
- type: llm-rubric
value: Response uses descriptive names like total_price, cart_items, or similar - not single letters or vague names like x, data, or result

- name: keeps-functions-focused
description: Should keep functions small and single-purpose
prompt: |
Review this function:
```python
def process_order(order):
# Validate order (10 lines)
# Calculate totals (15 lines)
# Apply discounts (20 lines)
# Update inventory (15 lines)
# Send confirmation email (10 lines)
# Log analytics (10 lines)
pass # 80+ lines total
```
assert:
- type: llm-rubric
value: Response should recommend splitting into smaller, focused functions

- name: prefers-explicit-code
description: Should prefer explicit over implicit patterns
prompt: |
Review this Python code:
```python
from utils import *

def process(d):
return {k: v for k, v in d.items() if v}
```
assert:
- type: llm-rubric
value: Response should recommend explicit imports and more descriptive variable names
44 changes: 44 additions & 0 deletions example/team-repo/evals/team-security.yaml
@@ -0,0 +1,44 @@
name: team-security
description: Verify team security guidelines are followed
tags: [security, team]

# Tests aligned with the Security section of CLAUDE.md:
# - Never commit secrets, API keys, or credentials
# - Use environment variables for configuration
# - Validate all user input

tests:
- name: warns-about-hardcoded-secrets
description: Should warn when code contains hardcoded credentials
prompt: |
Review this code:
```python
DATABASE_PASSWORD = "super_secret_123"
conn = connect(password=DATABASE_PASSWORD)
```
assert:
- type: llm-rubric
value: Response must warn about hardcoded password and recommend environment variables
- type: not-contains
value: looks good

- name: recommends-env-vars-for-config
description: Should recommend environment variables for sensitive config
prompt: |
How should I configure the API endpoint and auth token for my service?
assert:
- type: contains-any
value: ["environment variable", "os.environ", "os.getenv", ".env", "env var"]

- name: validates-user-input
description: Should recommend input validation for user data
prompt: |
Review this endpoint:
```python
@app.route("/users/<user_id>")
def get_user(user_id):
return db.query(f"SELECT * FROM users WHERE id = {user_id}")
```
assert:
- type: llm-rubric
value: Response must identify SQL injection risk and recommend parameterized queries or input validation