26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.5.0] - 2026-01-18

### Added

- `stag eval validate` command to validate eval YAML files before running
- Checks assertion types, required fields, YAML structure, and naming conventions
- Provides helpful suggestions for common typos (e.g., `llm_rubric` → `llm-rubric`)
- Distinguishes between errors (blocking) and warnings (non-blocking)
- `stag eval create` command to create new evals from templates
- Interactive wizard for guided eval creation
- Four built-in templates: security, quality, language, blank
- `--template` flag to skip wizard and use template directly
- `--from` flag to copy and customize existing evals
- `--name` and `--description` flags for non-interactive creation
- `--project` flag for `stag eval create` to save evals to `.staghorn/evals/`
- `--team` flag for `stag eval create` to save evals to `./evals/` for team/community sharing
- Example evals in `example/team-repo/evals/` demonstrating team eval patterns

### Changed

- Updated EVALS_GUIDE.md with comprehensive documentation for validate and create commands
- Expanded CLI flags reference in README.md with new eval commands

## [0.4.0] - 2026-01-17

### Added
@@ -75,7 +98,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Support for team, personal, and project configuration layers
- Automatic CLAUDE.md generation with layered content

[Unreleased]: https://github.com/HartBrook/staghorn/compare/v0.4.0...HEAD
[Unreleased]: https://github.com/HartBrook/staghorn/compare/v0.5.0...HEAD
[0.5.0]: https://github.com/HartBrook/staghorn/compare/v0.4.0...v0.5.0
[0.4.0]: https://github.com/HartBrook/staghorn/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/HartBrook/staghorn/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/HartBrook/staghorn/compare/v0.1.0...v0.2.0
118 changes: 118 additions & 0 deletions EVALS_GUIDE.md
@@ -12,6 +12,76 @@ Evals verify that your CLAUDE.md configuration produces the behavior you expect

Evals are **behavioral tests**, not unit tests. They test the emergent behavior that results from your system prompt, not specific code paths.

## Creating Evals

### Quick Start with Templates

The fastest way to create a new eval is with the `create` command:

```bash
# Interactive wizard
stag eval create

# Use a specific template
stag eval create --template security
stag eval create --template quality
stag eval create --template language
stag eval create --template blank

# Copy from an existing eval
stag eval create --from security-secrets --name my-security

# Save to project instead of personal directory
stag eval create --project

# Save to ./evals/ for team/community repos
stag eval create --team
```

**Destination options:**
- Default: `~/.config/staghorn/evals/` (personal evals)
- `--project`: `.staghorn/evals/` (project-specific evals)
- `--team`: `./evals/` (team/community evals for sharing via git)

Available templates:
- **security** — Tests for hardcoded secrets, injection vulnerabilities
- **quality** — Tests for naming conventions, code duplication
- **language** — Language-specific best practices template
- **blank** — Minimal template to start from scratch
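
Each template generates a ready-to-edit YAML skeleton. The exact generated content may differ; a minimal sketch of what the blank template might produce, using the eval fields documented in this guide (the specific names and values here are illustrative):

```yaml
# Illustrative sketch of a freshly created eval — actual
# generated content may differ.
name: my-new-eval
description: Describe what behavior this eval verifies
tags: [custom]

tests:
  - name: example-test
    description: Should do something specific
    prompt: |
      Ask Claude something that exercises your CLAUDE.md guidelines
    assert:
      - type: llm-rubric
        value: Response should demonstrate the expected behavior
```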

### Validating Evals

Before running evals (which consume API credits), validate them:

```bash
# Validate all evals
stag eval validate

# Validate a specific eval
stag eval validate my-custom-eval
```

Validation checks for:
- Valid assertion types (`llm-rubric`, `contains`, `regex`, etc.)
- Required fields (`name`, `prompt`, `assert`)
- Proper YAML structure
- Naming conventions
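
To see what validation catches, consider a hypothetical eval containing the underscore typo `llm_rubric` (the filename and test content here are illustrative):

```yaml
# my-custom-eval.yaml — illustrative example of an eval that
# fails validation.
name: my-custom-eval
tests:
  - name: check-secrets
    description: Should flag hardcoded secrets
    prompt: |
      Review code containing a hardcoded API key
    assert:
      - type: llm_rubric   # error: invalid type, should be llm-rubric
        value: Response must warn about the hardcoded key
  - name: check-patterns   # warning: missing description
    prompt: |
      Review code for insecure patterns
    assert:
      - type: regex
        value: "(?i)injection"
```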

Example output:
```
Validating 25 eval(s)...

✓ security-secrets (4 tests)
✓ security-injection (3 tests)
✗ my-custom-eval (2 tests)
error: tests[0].assert[0].type: invalid assertion type "llm_rubric" (did you mean "llm-rubric"?)
warning: tests[1]: test "check-patterns" should have a description

24 valid, 1 invalid, 1 warning
```

The validator provides helpful suggestions for common typos (e.g., `llm_rubric` → `llm-rubric`).

## Anatomy of an Eval

```yaml
@@ -267,6 +337,16 @@ assert:

## Debugging Failed Tests

### Step 0: Validate First

Before debugging runtime failures, ensure your eval is valid:

```bash
stag eval validate my-eval
```

This catches common issues like typos in assertion types (e.g., `llm_rubric` instead of `llm-rubric`) without making API calls.

### Step 1: Run with `--debug`

```bash
@@ -485,6 +565,44 @@ Each test case = one API call. To minimize costs:

4. **Leverage Promptfoo caching** - Repeated runs with same prompts use cached responses

## Quick Reference

### Commands

```bash
# Run evals
stag eval # Run all evals
stag eval security-secrets # Run specific eval
stag eval --tag security # Filter by tag
stag eval --test "warns-*" # Filter by test name pattern
stag eval --debug # Show full responses

# Create and validate
stag eval create # Interactive wizard
stag eval create --template security
stag eval create --project # Save to .staghorn/evals/
stag eval create --team # Save to ./evals/ for team sharing
stag eval validate # Validate all evals
stag eval validate my-eval # Validate specific eval

# List and inspect
stag eval list # List available evals
stag eval info security-secrets # Show eval details
stag eval init # Install starter evals
```

### Valid Assertion Types

| Type | Description |
|----------------|--------------------------------------|
| `llm-rubric` | AI-graded evaluation (most flexible) |
| `contains` | Exact string match |
| `contains-any` | Any of the listed strings |
| `contains-all` | All of the listed strings |
| `not-contains` | String must not appear |
| `regex` | Regular expression match |
| `javascript` | Custom JS assertion function |
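
Assertions can be combined within a single test. A sketch mixing several of the types above (the prompt and assertion values are illustrative, not from a shipped eval):

```yaml
# Illustrative test combining assertion types from the table above.
tests:
  - name: warns-about-eval-usage
    description: Should flag use of eval() on user input
    prompt: |
      Review: result = eval(user_input)
    assert:
      - type: llm-rubric
        value: Response must identify eval() on user input as dangerous
      - type: contains-any
        value: ["ast.literal_eval", "sanitize", "validate"]
      - type: not-contains
        value: looks fine
      - type: regex
        value: "(?i)(security|danger|unsafe)"
```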

## Further Reading

- [Promptfoo Documentation](https://promptfoo.dev/docs/intro)
9 changes: 9 additions & 0 deletions README.md
@@ -157,6 +157,8 @@ The layering means you get shared standards _plus_ your personal style. You neve
| `stag eval` | Run behavioral evals against your config |
| `stag eval init` | Install starter evals |
| `stag eval list` | List available evals |
| `stag eval validate` | Validate eval definitions without running |
| `stag eval create` | Create a new eval from a template |
| `stag project` | Manage project-level config |
| `stag team` | Bootstrap or validate a team standards repo |
| `stag version` | Print version number |
@@ -676,6 +678,13 @@ stag eval list --source team # Filter by source
stag eval info <name> # Show eval details
stag eval init # Install starter evals
stag eval init --project # Install to project directory
stag eval validate # Validate all eval definitions
stag eval validate <name> # Validate specific eval
stag eval create # Create new eval (interactive)
stag eval create --template security # Create from template
stag eval create --from <eval> # Copy from existing eval
stag eval create --project # Save to .staghorn/evals/
stag eval create --team # Save to ./evals/ for team sharing
```

## Installation
5 changes: 5 additions & 0 deletions example/README.md
@@ -20,6 +20,10 @@ team-repo/
│ ├── refactor.md # Suggest refactoring improvements
│ ├── security-audit.md # Security vulnerability scan
│ └── test-gen.md # Generate unit tests
├── evals/ # Behavioral tests
│ ├── team-security.yaml # Security guidelines tests
│ ├── team-quality.yaml # Code quality tests
│ └── team-git.yaml # Git conventions tests
├── languages/ # Language-specific configs
│ ├── python.md
│ ├── go.md
@@ -43,6 +47,7 @@ To use this as your team's standards repo:
## Customization

- **CLAUDE.md**: Add your team's general coding standards
- **evals/**: Write tests to verify Claude follows your guidelines
- **languages/**: Add configs for languages your team uses
- **commands/**: Create prompts for common workflows
- **templates/**: Add project templates for different project types
41 changes: 41 additions & 0 deletions example/team-repo/evals/team-git.yaml
@@ -0,0 +1,41 @@
name: team-git
description: Verify team git conventions are followed
tags: [git, team]

# Tests aligned with the Git Conventions section of CLAUDE.md:
# - Write commit messages in imperative mood
# - Keep commits atomic and focused
# - Reference issue numbers when applicable

tests:
- name: uses-imperative-mood
description: Should write commit messages in imperative mood
prompt: |
Write a commit message for adding user authentication
assert:
- type: llm-rubric
value: Commit message uses imperative mood (e.g., "Add" not "Added" or "Adding")
- type: not-contains
value: Added

- name: keeps-commits-atomic
description: Should recommend atomic commits
prompt: |
I made these changes in one commit:
- Fixed login bug
- Added new dashboard feature
- Updated README
- Refactored database layer

Is this commit okay?
assert:
- type: llm-rubric
value: Response should recommend splitting into separate, focused commits

- name: references-issues
description: Should reference issue numbers when applicable
prompt: |
Write a commit message for fixing the bug described in issue #123
assert:
- type: contains-any
value: ["#123", "issue 123", "closes #123", "fixes #123"]
50 changes: 50 additions & 0 deletions example/team-repo/evals/team-quality.yaml
@@ -0,0 +1,50 @@
name: team-quality
description: Verify team code quality guidelines are followed
tags: [quality, team]

# Tests aligned with the Code Style section of CLAUDE.md:
# - Write clear, self-documenting code
# - Prefer explicit over implicit
# - Keep functions small and focused (under 50 lines)
# - Use meaningful variable and function names

tests:
- name: uses-descriptive-names
description: Should use meaningful variable and function names
prompt: |
Write a function to calculate the total price of items in a shopping cart
assert:
- type: llm-rubric
value: Response uses descriptive names like total_price, cart_items, or similar - not single letters or vague names like x, data, or result

- name: keeps-functions-focused
description: Should keep functions small and single-purpose
prompt: |
Review this function:
```python
def process_order(order):
# Validate order (10 lines)
# Calculate totals (15 lines)
# Apply discounts (20 lines)
# Update inventory (15 lines)
# Send confirmation email (10 lines)
# Log analytics (10 lines)
pass # 80+ lines total
```
assert:
- type: llm-rubric
value: Response should recommend splitting into smaller, focused functions

- name: prefers-explicit-code
description: Should prefer explicit over implicit patterns
prompt: |
Review this Python code:
```python
from utils import *

def process(d):
return {k: v for k, v in d.items() if v}
```
assert:
- type: llm-rubric
value: Response should recommend explicit imports and more descriptive variable names
44 changes: 44 additions & 0 deletions example/team-repo/evals/team-security.yaml
@@ -0,0 +1,44 @@
name: team-security
description: Verify team security guidelines are followed
tags: [security, team]

# Tests aligned with the Security section of CLAUDE.md:
# - Never commit secrets, API keys, or credentials
# - Use environment variables for configuration
# - Validate all user input

tests:
- name: warns-about-hardcoded-secrets
description: Should warn when code contains hardcoded credentials
prompt: |
Review this code:
```python
DATABASE_PASSWORD = "super_secret_123"
conn = connect(password=DATABASE_PASSWORD)
```
assert:
- type: llm-rubric
value: Response must warn about hardcoded password and recommend environment variables
- type: not-contains
value: looks good

- name: recommends-env-vars-for-config
description: Should recommend environment variables for sensitive config
prompt: |
How should I configure the API endpoint and auth token for my service?
assert:
- type: contains-any
value: ["environment variable", "os.environ", "os.getenv", ".env", "env var"]

- name: validates-user-input
description: Should recommend input validation for user data
prompt: |
Review this endpoint:
```python
@app.route("/users/<user_id>")
def get_user(user_id):
return db.query(f"SELECT * FROM users WHERE id = {user_id}")
```
assert:
- type: llm-rubric
value: Response must identify SQL injection risk and recommend parameterized queries or input validation