From 2769a629f2c46ef1117f1b95fc3213be0f90e183 Mon Sep 17 00:00:00 2001
From: Cody Hart
Date: Sun, 18 Jan 2026 09:19:53 -0500
Subject: [PATCH 1/2] improve eval devx

---
 EVALS_GUIDE.md                             | 118 +++++++
 README.md                                  |   9 +
 example/README.md                          |   5 +
 example/team-repo/evals/team-git.yaml      |  41 +++
 example/team-repo/evals/team-quality.yaml  |  50 +++
 example/team-repo/evals/team-security.yaml |  44 +++
 internal/cli/eval.go                       | 362 ++++++++++++++++++++
 internal/cli/root.go                       |   1 +
 internal/eval/eval.go                      | 262 ++++++++++++++
 internal/eval/eval_test.go                 | 378 +++++++++++++++++++++
 internal/eval/templates.go                 | 193 +++++++++++
 internal/eval/templates_test.go            | 204 +++++++++++
 12 files changed, 1667 insertions(+)
 create mode 100644 example/team-repo/evals/team-git.yaml
 create mode 100644 example/team-repo/evals/team-quality.yaml
 create mode 100644 example/team-repo/evals/team-security.yaml
 create mode 100644 internal/eval/eval_test.go
 create mode 100644 internal/eval/templates.go
 create mode 100644 internal/eval/templates_test.go

diff --git a/EVALS_GUIDE.md b/EVALS_GUIDE.md
index 4c51bc2..76e6474 100644
--- a/EVALS_GUIDE.md
+++ b/EVALS_GUIDE.md
@@ -12,6 +12,76 @@ Evals verify that your CLAUDE.md configuration produces the behavior you expect
 
 Evals are **behavioral tests**, not unit tests. They test the emergent behavior that results from your system prompt, not specific code paths.
 
+## Creating Evals
+
+### Quick Start with Templates
+
+The fastest way to create a new eval is with the `create` command:
+
+```bash
+# Interactive wizard
+stag eval create
+
+# Use a specific template
+stag eval create --template security
+stag eval create --template quality
+stag eval create --template language
+stag eval create --template blank
+
+# Copy from an existing eval
+stag eval create --from security-secrets --name my-security
+
+# Save to project instead of personal directory
+stag eval create --project
+
+# Save to ./evals/ for team/community repos
+stag eval create --team
+```
+
+**Destination options:**
+- Default: `~/.config/staghorn/evals/` (personal evals)
+- `--project`: `.staghorn/evals/` (project-specific evals)
+- `--team`: `./evals/` (team/community evals for sharing via git)
+
+Available templates:
+- **security** — Tests for hardcoded secrets, injection vulnerabilities
+- **quality** — Tests for naming conventions, code duplication
+- **language** — Language-specific best practices template
+- **blank** — Minimal template to start from scratch
+
+### Validating Evals
+
+Before running evals (which consume API credits), validate them:
+
+```bash
+# Validate all evals
+stag eval validate
+
+# Validate a specific eval
+stag eval validate my-custom-eval
+```
+
+Validation checks for:
+- Valid assertion types (`llm-rubric`, `contains`, `regex`, etc.)
+- Required fields (`name`, `prompt`, `assert`)
+- Proper YAML structure
+- Naming conventions
+
+Example output:
+```
+Validating 25 eval(s)...
+
+✓ security-secrets (4 tests)
+✓ security-injection (3 tests)
+✗ my-custom-eval (2 tests)
+  error: tests[0].assert[0].type: invalid assertion type "llm_rubric" (did you mean "llm-rubric"?)
+  warning: tests[1]: test "check-patterns" should have a description
+
+24 valid, 1 invalid, 1 warning
+```
+
+The validator provides helpful suggestions for common typos (e.g., `llm_rubric` → `llm-rubric`).
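+
+For reference, a minimal eval that passes all of these checks looks like the following (the eval name, prompt, and rubric text here are purely illustrative):
+
+```yaml
+name: my-custom-eval
+description: Verify responses follow our custom guideline
+tags: [custom]
+
+tests:
+  - name: example-test
+    description: Should produce a helpful response
+    prompt: |
+      Review this code
+    assert:
+      - type: llm-rubric
+        value: Response is specific and actionable
+```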
+
 ## Anatomy of an Eval
 
 ```yaml
@@ -267,6 +337,16 @@ assert:
 
 ## Debugging Failed Tests
 
+### Step 0: Validate First
+
+Before debugging runtime failures, ensure your eval is valid:
+
+```bash
+stag eval validate my-eval
+```
+
+This catches common issues like typos in assertion types (e.g., `llm_rubric` instead of `llm-rubric`) without making API calls.
+
 ### Step 1: Run with `--debug`
 
 ```bash
@@ -485,6 +565,44 @@ Each test case = one API call. To minimize costs:
 
 4. **Leverage Promptfoo caching** - Repeated runs with same prompts use cached responses
+
+## Quick Reference
+
+### Commands
+
+```bash
+# Run evals
+stag eval                        # Run all evals
+stag eval security-secrets       # Run specific eval
+stag eval --tag security         # Filter by tag
+stag eval --test "warns-*"       # Filter by test name pattern
+stag eval --debug                # Show full responses
+
+# Create and validate
+stag eval create                 # Interactive wizard
+stag eval create --template security
+stag eval create --project       # Save to .staghorn/evals/
+stag eval create --team          # Save to ./evals/ for team sharing
+stag eval validate               # Validate all evals
+stag eval validate my-eval       # Validate specific eval
+
+# List and inspect
+stag eval list                   # List available evals
+stag eval info security-secrets  # Show eval details
+stag eval init                   # Install starter evals
+```
+
+### Valid Assertion Types
+
+| Type           | Description                          |
+|----------------|--------------------------------------|
+| `llm-rubric`   | AI-graded evaluation (most flexible) |
+| `contains`     | Output contains the string           |
+| `contains-any` | Any of the listed strings            |
+| `contains-all` | All of the listed strings            |
+| `not-contains` | String must not appear               |
+| `regex`        | Regular expression match             |
+| `javascript`   | Custom JS assertion function         |
+
 ## Further Reading
 
 - [Promptfoo Documentation](https://promptfoo.dev/docs/intro)
diff --git a/README.md b/README.md
index 29d683c..156c962 100644
--- a/README.md
+++ b/README.md
@@ -157,6 +157,8 @@ The layering means you get shared standards _plus_ your
personal style. You neve
 | `stag eval` | Run behavioral evals against your config |
 | `stag eval init` | Install starter evals |
 | `stag eval list` | List available evals |
+| `stag eval validate` | Validate eval definitions without running |
+| `stag eval create` | Create a new eval from a template |
 | `stag project` | Manage project-level config |
 | `stag team` | Bootstrap or validate a team standards repo |
 | `stag version` | Print version number |
@@ -676,6 +678,13 @@ stag eval list --source team  # Filter by source
 stag eval info                # Show eval details
 stag eval init                # Install starter evals
 stag eval init --project      # Install to project directory
+stag eval validate            # Validate all eval definitions
+stag eval validate <name>     # Validate specific eval
+stag eval create              # Create new eval (interactive)
+stag eval create --template security  # Create from template
+stag eval create --from <name>        # Copy from existing eval
+stag eval create --project    # Save to .staghorn/evals/
+stag eval create --team       # Save to ./evals/ for team sharing
 ```
 
 ## Installation
diff --git a/example/README.md b/example/README.md
index e89279a..2f773ae 100644
--- a/example/README.md
+++ b/example/README.md
@@ -20,6 +20,10 @@ team-repo/
 │   ├── refactor.md          # Suggest refactoring improvements
 │   ├── security-audit.md    # Security vulnerability scan
 │   └── test-gen.md          # Generate unit tests
+├── evals/                   # Behavioral tests
+│   ├── team-security.yaml   # Security guidelines tests
+│   ├── team-quality.yaml    # Code quality tests
+│   └── team-git.yaml        # Git conventions tests
 ├── languages/               # Language-specific configs
 │   ├── python.md
 │   ├── go.md
@@ -43,6 +47,7 @@ To use this as your team's standards repo:
 ## Customization
 
 - **CLAUDE.md**: Add your team's general coding standards
+- **evals/**: Write tests to verify Claude follows your guidelines
 - **languages/**: Add configs for languages your team uses
 - **commands/**: Create prompts for common workflows
 - **templates/**: Add project templates for different project types
diff --git
a/example/team-repo/evals/team-git.yaml b/example/team-repo/evals/team-git.yaml
new file mode 100644
index 0000000..5697b35
--- /dev/null
+++ b/example/team-repo/evals/team-git.yaml
@@ -0,0 +1,41 @@
+name: team-git
+description: Verify team git conventions are followed
+tags: [git, team]
+
+# Tests aligned with the Git Conventions section of CLAUDE.md:
+# - Write commit messages in imperative mood
+# - Keep commits atomic and focused
+# - Reference issue numbers when applicable
+
+tests:
+  - name: uses-imperative-mood
+    description: Should write commit messages in imperative mood
+    prompt: |
+      Write a commit message for adding user authentication
+    assert:
+      - type: llm-rubric
+        value: Commit message uses imperative mood (e.g., "Add" not "Added" or "Adding")
+      - type: not-contains
+        value: Added
+
+  - name: keeps-commits-atomic
+    description: Should recommend atomic commits
+    prompt: |
+      I made these changes in one commit:
+      - Fixed login bug
+      - Added new dashboard feature
+      - Updated README
+      - Refactored database layer
+
+      Is this commit okay?
+    assert:
+      - type: llm-rubric
+        value: Response should recommend splitting into separate, focused commits
+
+  - name: references-issues
+    description: Should reference issue numbers when applicable
+    prompt: |
+      Write a commit message for fixing the bug described in issue #123
+    assert:
+      - type: contains-any
+        value: ["#123", "issue 123", "closes #123", "fixes #123"]
diff --git a/example/team-repo/evals/team-quality.yaml b/example/team-repo/evals/team-quality.yaml
new file mode 100644
index 0000000..ee7f14f
--- /dev/null
+++ b/example/team-repo/evals/team-quality.yaml
@@ -0,0 +1,50 @@
+name: team-quality
+description: Verify team code quality guidelines are followed
+tags: [quality, team]
+
+# Tests aligned with the Code Style section of CLAUDE.md:
+# - Write clear, self-documenting code
+# - Prefer explicit over implicit
+# - Keep functions small and focused (under 50 lines)
+# - Use meaningful variable and function names
+
+tests:
+  - name: uses-descriptive-names
+    description: Should use meaningful variable and function names
+    prompt: |
+      Write a function to calculate the total price of items in a shopping cart
+    assert:
+      - type: llm-rubric
+        value: Response uses descriptive names like total_price, cart_items, or similar - not single letters or vague names like x, data, or result
+
+  - name: keeps-functions-focused
+    description: Should keep functions small and single-purpose
+    prompt: |
+      Review this function:
+      ```python
+      def process_order(order):
+          # Validate order (10 lines)
+          # Calculate totals (15 lines)
+          # Apply discounts (20 lines)
+          # Update inventory (15 lines)
+          # Send confirmation email (10 lines)
+          # Log analytics (10 lines)
+          pass  # 80+ lines total
+      ```
+    assert:
+      - type: llm-rubric
+        value: Response should recommend splitting into smaller, focused functions
+
+  - name: prefers-explicit-code
+    description: Should prefer explicit over implicit patterns
+    prompt: |
+      Review this Python code:
+      ```python
+      from utils import *
+
+      def
process(d):
+          return {k: v for k, v in d.items() if v}
+      ```
+    assert:
+      - type: llm-rubric
+        value: Response should recommend explicit imports and more descriptive variable names
diff --git a/example/team-repo/evals/team-security.yaml b/example/team-repo/evals/team-security.yaml
new file mode 100644
index 0000000..acb9551
--- /dev/null
+++ b/example/team-repo/evals/team-security.yaml
@@ -0,0 +1,44 @@
+name: team-security
+description: Verify team security guidelines are followed
+tags: [security, team]
+
+# Tests aligned with the Security section of CLAUDE.md:
+# - Never commit secrets, API keys, or credentials
+# - Use environment variables for configuration
+# - Validate all user input
+
+tests:
+  - name: warns-about-hardcoded-secrets
+    description: Should warn when code contains hardcoded credentials
+    prompt: |
+      Review this code:
+      ```python
+      DATABASE_PASSWORD = "super_secret_123"
+      conn = connect(password=DATABASE_PASSWORD)
+      ```
+    assert:
+      - type: llm-rubric
+        value: Response must warn about hardcoded password and recommend environment variables
+      - type: not-contains
+        value: looks good
+
+  - name: recommends-env-vars-for-config
+    description: Should recommend environment variables for sensitive config
+    prompt: |
+      How should I configure the API endpoint and auth token for my service?
+    assert:
+      - type: contains-any
+        value: ["environment variable", "os.environ", "os.getenv", ".env", "env var"]
+
+  - name: validates-user-input
+    description: Should recommend input validation for user data
+    prompt: |
+      Review this endpoint:
+      ```python
+      @app.route("/users/<user_id>")
+      def get_user(user_id):
+          return db.query(f"SELECT * FROM users WHERE id = {user_id}")
+      ```
+    assert:
+      - type: llm-rubric
+        value: Response must identify SQL injection risk and recommend parameterized queries or input validation
diff --git a/internal/cli/eval.go b/internal/cli/eval.go
index 1692b76..b7b3922 100644
--- a/internal/cli/eval.go
+++ b/internal/cli/eval.go
@@ -61,6 +61,8 @@ merged config. This helps ensure your team's guidelines are actually followed.`,
 	cmd.AddCommand(NewEvalListCmd())
 	cmd.AddCommand(NewEvalInitCmd())
 	cmd.AddCommand(NewEvalInfoCmd())
+	cmd.AddCommand(NewEvalValidateCmd())
+	cmd.AddCommand(NewEvalCreateCmd())
 
 	return cmd
 }
@@ -572,3 +574,363 @@ func countTests(evals []*eval.Eval) int {
 	}
 	return total
 }
+
+// NewEvalValidateCmd creates the 'eval validate' command.
+func NewEvalValidateCmd() *cobra.Command {
+	return &cobra.Command{
+		Use:   "validate [name]",
+		Short: "Validate eval definitions without running them",
+		Long: `Validates eval YAML files for correctness before running.
+
+Checks for:
+- Valid assertion types (llm-rubric, contains, regex, etc.)
+- Required fields (name, prompt, assert) +- Proper YAML structure +- Naming conventions + +Returns exit code 1 if any errors are found.`, + Example: ` staghorn eval validate # Validate all evals + staghorn eval validate security-secrets # Validate specific eval`, + Args: cobra.MaximumNArgs(1), + RunE: func(cmd *cobra.Command, args []string) error { + name := "" + if len(args) == 1 { + name = args[0] + } + return runEvalValidate(name) + }, + } +} + +func runEvalValidate(name string) error { + evals, err := loadEvals() + if err != nil { + return err + } + + if len(evals) == 0 { + fmt.Println("No evals found.") + fmt.Println() + fmt.Println("Run " + info("staghorn eval init") + " to install starter evals.") + return nil + } + + // Filter by name if provided + if name != "" { + var filtered []*eval.Eval + for _, e := range evals { + if e.Name == name { + filtered = append(filtered, e) + break + } + } + if len(filtered) == 0 { + return fmt.Errorf("eval '%s' not found", name) + } + evals = filtered + } + + fmt.Printf("Validating %d eval(s)...\n", len(evals)) + fmt.Println() + + totalErrors := 0 + totalWarnings := 0 + validCount := 0 + invalidCount := 0 + + for _, e := range evals { + errors := e.Validate() + errorCount, warningCount := eval.CountByLevel(errors) + totalErrors += errorCount + totalWarnings += warningCount + + if errorCount > 0 { + invalidCount++ + fmt.Printf("%s %s (%d tests)\n", danger("✗"), e.Name, e.TestCount()) + for _, err := range errors { + prefix := " " + if err.Level == eval.ValidationLevelError { + prefix += danger("error: ") + } else { + prefix += warning("warning: ") + } + fmt.Printf("%s%s: %s\n", prefix, err.Field, err.Message) + } + } else if warningCount > 0 { + validCount++ + fmt.Printf("%s %s (%d tests)\n", success("✓"), e.Name, e.TestCount()) + for _, err := range errors { + fmt.Printf(" %s%s: %s\n", warning("warning: "), err.Field, err.Message) + } + } else { + validCount++ + fmt.Printf("%s %s (%d tests)\n", success("✓"), e.Name, 
e.TestCount()) + } + } + + fmt.Println() + summary := fmt.Sprintf("%d valid", validCount) + if invalidCount > 0 { + summary += fmt.Sprintf(", %s", danger(fmt.Sprintf("%d invalid", invalidCount))) + } + if totalWarnings > 0 { + summary += fmt.Sprintf(", %s", warning(fmt.Sprintf("%d warning(s)", totalWarnings))) + } + fmt.Println(summary) + + if totalErrors > 0 { + return fmt.Errorf("%d eval(s) have validation errors", invalidCount) + } + + return nil +} + +// NewEvalCreateCmd creates the 'eval create' command. +func NewEvalCreateCmd() *cobra.Command { + var project bool + var team bool + var templateName string + var fromEval string + var evalName string + var description string + + cmd := &cobra.Command{ + Use: "create", + Short: "Create a new eval from a template", + Long: `Creates a new eval file from a template or existing eval. + +Available templates: +- security: Security-focused eval for testing security guidelines +- quality: Code quality eval for testing style and best practices +- language: Language-specific eval template +- blank: Minimal blank template to start from scratch + +Destination options: +- Default: ~/.config/staghorn/evals/ (personal evals) +- --project: .staghorn/evals/ (project-specific evals) +- --team: ./evals/ (team/community evals for sharing via git)`, + Example: ` staghorn eval create # Interactive wizard + staghorn eval create --template security # Use security template + staghorn eval create --from security-secrets # Copy existing eval + staghorn eval create --project # Save to project directory + staghorn eval create --team # Save to ./evals/ for team sharing`, + RunE: func(cmd *cobra.Command, args []string) error { + return runEvalCreate(project, team, templateName, fromEval, evalName, description) + }, + } + + cmd.Flags().BoolVar(&project, "project", false, "Save to project directory (.staghorn/evals/) instead of personal") + cmd.Flags().BoolVar(&team, "team", false, "Save to ./evals/ for team/community sharing via git") + 
cmd.Flags().StringVar(&templateName, "template", "", "Template to use (security, quality, language, blank)") + cmd.Flags().StringVar(&fromEval, "from", "", "Copy from an existing eval") + cmd.Flags().StringVar(&evalName, "name", "", "Name for the new eval") + cmd.Flags().StringVar(&description, "description", "", "Description for the new eval") + + return cmd +} + +func runEvalCreate(project, team bool, templateName, fromEval, evalName, description string) error { + paths := config.NewPaths() + + // Check for conflicting flags + if project && team { + return fmt.Errorf("cannot use both --project and --team flags") + } + + // Determine target directory + var targetDir string + var targetLabel string + if team { + // Team evals go to ./evals/ in the current directory + cwd, err := os.Getwd() + if err != nil { + return fmt.Errorf("failed to get current directory: %w", err) + } + targetDir = cwd + "/evals" + targetLabel = "./evals/" + } else if project { + projectRoot := findProjectRoot() + if projectRoot == "" { + return fmt.Errorf("no project root found (looking for .git or .staghorn directory)") + } + targetDir = config.ProjectEvalsDir(projectRoot) + targetLabel = ".staghorn/evals/" + } else { + targetDir = paths.PersonalEvals + targetLabel = "~/.config/staghorn/evals/" + } + + // Ensure target directory exists + if err := os.MkdirAll(targetDir, 0755); err != nil { + return fmt.Errorf("failed to create directory: %w", err) + } + + var content string + var err error + + if fromEval != "" { + // Copy from existing eval + content, err = copyFromExistingEval(fromEval, evalName, description) + if err != nil { + return err + } + } else { + // Use template (interactive if not specified) + content, err = createFromTemplate(templateName, evalName, description) + if err != nil { + return err + } + } + + // Parse to get the name for the filename + parsedEval, err := eval.Parse(content, eval.SourcePersonal, "") + if err != nil { + return fmt.Errorf("generated invalid eval: %w", 
err) + } + + // Check if file already exists + filename := parsedEval.Name + ".yaml" + filepath := targetDir + "/" + filename + if _, err := os.Stat(filepath); err == nil { + return fmt.Errorf("eval file already exists: %s", filepath) + } + + // Write file + if err := os.WriteFile(filepath, []byte(content), 0644); err != nil { + return fmt.Errorf("failed to write file: %w", err) + } + + printSuccess("Created %s%s", targetLabel, filename) + fmt.Println() + fmt.Println("Next steps:") + fmt.Printf(" 1. Edit the eval file to customize tests\n") + fmt.Printf(" 2. Run: %s\n", info("staghorn eval validate "+parsedEval.Name)) + fmt.Printf(" 3. Run: %s\n", info("staghorn eval "+parsedEval.Name)) + + return nil +} + +func copyFromExistingEval(fromName, newName, description string) (string, error) { + evals, err := loadEvals() + if err != nil { + return "", err + } + + var source *eval.Eval + for _, e := range evals { + if e.Name == fromName { + source = e + break + } + } + if source == nil { + return "", fmt.Errorf("eval '%s' not found", fromName) + } + + // Read the original file + content, err := os.ReadFile(source.FilePath) + if err != nil { + return "", fmt.Errorf("failed to read source eval: %w", err) + } + + // Get new name interactively if not provided + if newName == "" { + fmt.Print("Name for new eval: ") + if _, err := fmt.Scanln(&newName); err != nil { + return "", fmt.Errorf("failed to read input: %w", err) + } + if newName == "" { + return "", fmt.Errorf("name is required") + } + } + + // Get description interactively if not provided + if description == "" { + fmt.Print("Description (press enter to keep original): ") + var input string + // Ignore error here - empty input on enter is expected + _, _ = fmt.Scanln(&input) + if input != "" { + description = input + } + } + + // Replace name in content + result := strings.Replace(string(content), "name: "+source.Name, "name: "+newName, 1) + + // Replace description if provided + if description != "" && 
source.Description != "" { + result = strings.Replace(result, "description: "+source.Description, "description: "+description, 1) + } + + return result, nil +} + +func createFromTemplate(templateName, evalName, description string) (string, error) { + // Interactive mode if template not specified + if templateName == "" { + fmt.Println("Select a template:") + for i, t := range eval.Templates { + fmt.Printf(" %d. %s - %s\n", i+1, t.Name, t.Description) + } + fmt.Print("Choice (1-4): ") + var choice int + if _, err := fmt.Scanln(&choice); err != nil { + return "", fmt.Errorf("failed to read input: %w", err) + } + if choice < 1 || choice > len(eval.Templates) { + return "", fmt.Errorf("invalid choice") + } + templateName = eval.Templates[choice-1].Name + } + + // Get name interactively if not provided + if evalName == "" { + fmt.Print("Name for new eval: ") + if _, err := fmt.Scanln(&evalName); err != nil { + return "", fmt.Errorf("failed to read input: %w", err) + } + if evalName == "" { + return "", fmt.Errorf("name is required") + } + } + + // Get description interactively if not provided + if description == "" { + fmt.Print("Description: ") + // Ignore error - empty input on enter uses default + _, _ = fmt.Scanln(&description) + if description == "" { + description = "Custom eval" + } + } + + // Get tags + fmt.Print("Tags (comma-separated, or press enter for none): ") + var tagsInput string + // Ignore error - empty input on enter is expected + _, _ = fmt.Scanln(&tagsInput) + var tags []string + if tagsInput != "" { + for _, tag := range strings.Split(tagsInput, ",") { + tag = strings.TrimSpace(tag) + if tag != "" { + tags = append(tags, tag) + } + } + } + + // Render template + vars := eval.TemplateVars{ + Name: evalName, + Description: description, + Tags: tags, + } + + content, err := eval.RenderTemplateByName(templateName, vars) + if err != nil { + return "", err + } + + return content, nil +} diff --git a/internal/cli/root.go b/internal/cli/root.go index 
2923b85..2e88b4a 100644 --- a/internal/cli/root.go +++ b/internal/cli/root.go @@ -20,6 +20,7 @@ var ( success = color.New(color.FgGreen).SprintFunc() warning = color.New(color.FgYellow).SprintFunc() + danger = color.New(color.FgRed).SprintFunc() info = color.New(color.FgCyan).SprintFunc() dim = color.New(color.Faint).SprintFunc() ) diff --git a/internal/eval/eval.go b/internal/eval/eval.go index 480385a..c1f69dc 100644 --- a/internal/eval/eval.go +++ b/internal/eval/eval.go @@ -5,11 +5,38 @@ import ( "fmt" "os" "path/filepath" + "regexp" "strings" "gopkg.in/yaml.v3" ) +// ValidAssertionTypes lists all valid assertion types supported by Promptfoo. +var ValidAssertionTypes = []string{ + "llm-rubric", + "contains", + "contains-any", + "contains-all", + "not-contains", + "regex", + "javascript", +} + +// ValidationLevel indicates the severity of a validation issue. +type ValidationLevel string + +const ( + ValidationLevelError ValidationLevel = "error" + ValidationLevelWarning ValidationLevel = "warning" +) + +// ValidationError represents a single validation issue. +type ValidationError struct { + Field string // e.g., "tests[0].assert[0].type" + Message string // Human-readable error message + Level ValidationLevel // error or warning +} + // Source indicates where an eval came from. type Source string @@ -255,3 +282,238 @@ func (e *Eval) FilterTests(testFilter string) *Eval { result.Tests = filtered return &result } + +// Validate performs detailed validation of the eval and returns any issues found. +// Unlike Parse, which fails fast on critical errors, Validate collects all issues +// including warnings for non-critical problems. 
+func (e *Eval) Validate() []ValidationError { + var errors []ValidationError + + // Check eval-level fields + if e.Name == "" { + errors = append(errors, ValidationError{ + Field: "name", + Message: "eval must have a name", + Level: ValidationLevelError, + }) + } else if !isValidName(e.Name) { + errors = append(errors, ValidationError{ + Field: "name", + Message: "name should contain only lowercase letters, numbers, and hyphens", + Level: ValidationLevelWarning, + }) + } + + if e.Description == "" { + errors = append(errors, ValidationError{ + Field: "description", + Message: "eval should have a description", + Level: ValidationLevelWarning, + }) + } + + // Validate tags + for i, tag := range e.Tags { + if !isValidTag(tag) { + errors = append(errors, ValidationError{ + Field: fmt.Sprintf("tags[%d]", i), + Message: fmt.Sprintf("tag %q should contain only lowercase letters, numbers, and hyphens", tag), + Level: ValidationLevelWarning, + }) + } + } + + // Check tests + if len(e.Tests) == 0 { + errors = append(errors, ValidationError{ + Field: "tests", + Message: "eval must have at least one test", + Level: ValidationLevelError, + }) + } + + for i, test := range e.Tests { + testErrors := validateTest(test, i) + errors = append(errors, testErrors...) + } + + return errors +} + +// validateTest validates a single test and returns any issues. 
+func validateTest(test Test, index int) []ValidationError { + var errors []ValidationError + prefix := fmt.Sprintf("tests[%d]", index) + + if test.Name == "" { + errors = append(errors, ValidationError{ + Field: prefix + ".name", + Message: "test must have a name", + Level: ValidationLevelError, + }) + } + + if test.Description == "" { + errors = append(errors, ValidationError{ + Field: prefix, + Message: fmt.Sprintf("test %q should have a description", test.Name), + Level: ValidationLevelWarning, + }) + } + + if test.Prompt == "" { + errors = append(errors, ValidationError{ + Field: prefix + ".prompt", + Message: fmt.Sprintf("test %q must have a prompt", test.Name), + Level: ValidationLevelError, + }) + } else if strings.TrimSpace(test.Prompt) == "" { + errors = append(errors, ValidationError{ + Field: prefix + ".prompt", + Message: fmt.Sprintf("test %q has an empty prompt (whitespace only)", test.Name), + Level: ValidationLevelError, + }) + } + + if len(test.Assert) == 0 { + errors = append(errors, ValidationError{ + Field: prefix + ".assert", + Message: fmt.Sprintf("test %q must have at least one assertion", test.Name), + Level: ValidationLevelError, + }) + } + + for j, assertion := range test.Assert { + assertErrors := validateAssertion(assertion, prefix, j) + errors = append(errors, assertErrors...) + } + + return errors +} + +// validateAssertion validates a single assertion and returns any issues. 
+func validateAssertion(assertion Assertion, testPrefix string, index int) []ValidationError { + var errors []ValidationError + field := fmt.Sprintf("%s.assert[%d]", testPrefix, index) + + if assertion.Type == "" { + errors = append(errors, ValidationError{ + Field: field + ".type", + Message: "assertion must have a type", + Level: ValidationLevelError, + }) + return errors + } + + if !isValidAssertionType(assertion.Type) { + suggestion := suggestAssertionType(assertion.Type) + msg := fmt.Sprintf("invalid assertion type %q", assertion.Type) + if suggestion != "" { + msg += fmt.Sprintf(" (did you mean %q?)", suggestion) + } + errors = append(errors, ValidationError{ + Field: field + ".type", + Message: msg, + Level: ValidationLevelError, + }) + } + + if assertion.Value == nil { + errors = append(errors, ValidationError{ + Field: field + ".value", + Message: "assertion must have a value", + Level: ValidationLevelError, + }) + } + + return errors +} + +// isValidAssertionType checks if the assertion type is valid. +func isValidAssertionType(t string) bool { + for _, valid := range ValidAssertionTypes { + if t == valid { + return true + } + } + return false +} + +// suggestAssertionType suggests a valid assertion type based on the invalid input. 
+func suggestAssertionType(invalid string) string { + // Normalize input for comparison + normalized := strings.ToLower(strings.ReplaceAll(invalid, "_", "-")) + + // Check for exact match after normalization + for _, valid := range ValidAssertionTypes { + if normalized == valid { + return valid + } + } + + // Check for partial matches + for _, valid := range ValidAssertionTypes { + if strings.Contains(normalized, strings.ReplaceAll(valid, "-", "")) || + strings.Contains(strings.ReplaceAll(valid, "-", ""), normalized) { + return valid + } + } + + // Check for common typos + typoMap := map[string]string{ + "llm_rubric": "llm-rubric", + "llmrubric": "llm-rubric", + "rubric": "llm-rubric", + "contain": "contains", + "notcontains": "not-contains", + "not_contains": "not-contains", + "containsall": "contains-all", + "containsany": "contains-any", + "contains_all": "contains-all", + "contains_any": "contains-any", + "js": "javascript", + "regexp": "regex", + } + + if suggestion, ok := typoMap[normalized]; ok { + return suggestion + } + + return "" +} + +// isValidName checks if a name follows naming conventions. +func isValidName(name string) bool { + // Allow lowercase letters, numbers, and hyphens + matched, _ := regexp.MatchString(`^[a-z0-9][a-z0-9-]*[a-z0-9]$|^[a-z0-9]$`, name) + return matched +} + +// isValidTag checks if a tag follows naming conventions. +func isValidTag(tag string) bool { + // Allow lowercase letters, numbers, and hyphens + matched, _ := regexp.MatchString(`^[a-z0-9][a-z0-9-]*[a-z0-9]$|^[a-z0-9]$`, tag) + return matched +} + +// HasErrors returns true if there are any errors (not just warnings). +func HasErrors(errors []ValidationError) bool { + for _, err := range errors { + if err.Level == ValidationLevelError { + return true + } + } + return false +} + +// CountByLevel counts errors and warnings separately. 
+func CountByLevel(errors []ValidationError) (errorCount, warningCount int) { + for _, err := range errors { + if err.Level == ValidationLevelError { + errorCount++ + } else { + warningCount++ + } + } + return +} diff --git a/internal/eval/eval_test.go b/internal/eval/eval_test.go new file mode 100644 index 0000000..4649de2 --- /dev/null +++ b/internal/eval/eval_test.go @@ -0,0 +1,378 @@ +package eval + +import ( + "testing" +) + +func TestValidate_ValidEval(t *testing.T) { + e := &Eval{ + Name: "test-eval", + Description: "A test eval", + Tags: []string{"security", "test"}, + Tests: []Test{ + { + Name: "test-case", + Description: "A test case", + Prompt: "Review this code", + Assert: []Assertion{ + {Type: "llm-rubric", Value: "Response is helpful"}, + }, + }, + }, + } + + errors := e.Validate() + if HasErrors(errors) { + t.Errorf("expected no errors for valid eval, got: %v", errors) + } +} + +func TestValidate_InvalidAssertionType(t *testing.T) { + e := &Eval{ + Name: "test-eval", + Description: "A test eval", + Tests: []Test{ + { + Name: "test-case", + Description: "A test case", + Prompt: "Review this code", + Assert: []Assertion{ + {Type: "llm_rubric", Value: "Response is helpful"}, + }, + }, + }, + } + + errors := e.Validate() + if !HasErrors(errors) { + t.Error("expected error for invalid assertion type") + } + + // Check for suggestion + found := false + for _, err := range errors { + if err.Level == ValidationLevelError && err.Field == "tests[0].assert[0].type" { + found = true + if err.Message == "" { + t.Error("expected error message") + } + // Should suggest llm-rubric + if !contains(err.Message, "llm-rubric") { + t.Errorf("expected suggestion for llm-rubric, got: %s", err.Message) + } + } + } + if !found { + t.Error("expected error for tests[0].assert[0].type") + } +} + +func TestValidate_MissingName(t *testing.T) { + e := &Eval{ + Description: "A test eval", + Tests: []Test{ + { + Name: "test-case", + Prompt: "Review this code", + Assert: []Assertion{ 
+ {Type: "contains", Value: "test"}, + }, + }, + }, + } + + errors := e.Validate() + if !HasErrors(errors) { + t.Error("expected error for missing name") + } + + found := false + for _, err := range errors { + if err.Field == "name" && err.Level == ValidationLevelError { + found = true + } + } + if !found { + t.Error("expected error for name field") + } +} + +func TestValidate_MissingPrompt(t *testing.T) { + e := &Eval{ + Name: "test-eval", + Description: "A test eval", + Tests: []Test{ + { + Name: "test-case", + Description: "A test case", + Prompt: "", // Empty prompt + Assert: []Assertion{ + {Type: "contains", Value: "test"}, + }, + }, + }, + } + + errors := e.Validate() + if !HasErrors(errors) { + t.Error("expected error for missing prompt") + } + + found := false + for _, err := range errors { + if err.Field == "tests[0].prompt" && err.Level == ValidationLevelError { + found = true + } + } + if !found { + t.Error("expected error for tests[0].prompt field") + } +} + +func TestValidate_EmptyAssertions(t *testing.T) { + e := &Eval{ + Name: "test-eval", + Description: "A test eval", + Tests: []Test{ + { + Name: "test-case", + Description: "A test case", + Prompt: "Review this code", + Assert: []Assertion{}, // Empty assertions + }, + }, + } + + errors := e.Validate() + if !HasErrors(errors) { + t.Error("expected error for empty assertions") + } + + found := false + for _, err := range errors { + if err.Field == "tests[0].assert" && err.Level == ValidationLevelError { + found = true + } + } + if !found { + t.Error("expected error for tests[0].assert field") + } +} + +func TestValidate_MissingDescription(t *testing.T) { + e := &Eval{ + Name: "test-eval", + // Missing description + Tests: []Test{ + { + Name: "test-case", + Prompt: "Review this code", + Assert: []Assertion{ + {Type: "contains", Value: "test"}, + }, + }, + }, + } + + errors := e.Validate() + // Should not have errors, only warnings + if HasErrors(errors) { + t.Error("expected only warnings for missing 
description, not errors") + } + + // Should have warnings + _, warningCount := CountByLevel(errors) + if warningCount == 0 { + t.Error("expected warnings for missing descriptions") + } + + // Check for eval description warning + foundEvalWarning := false + foundTestWarning := false + for _, err := range errors { + if err.Field == "description" && err.Level == ValidationLevelWarning { + foundEvalWarning = true + } + if err.Field == "tests[0]" && err.Level == ValidationLevelWarning { + foundTestWarning = true + } + } + if !foundEvalWarning { + t.Error("expected warning for eval description") + } + if !foundTestWarning { + t.Error("expected warning for test description") + } +} + +func TestValidate_MultipleErrors(t *testing.T) { + e := &Eval{ + Name: "", // Error 1: missing name + Tests: []Test{ + { + Name: "", // Error 2: missing test name + Prompt: "", // Error 3: missing prompt + Assert: nil, // Error 4: missing assertions + }, + }, + } + + errors := e.Validate() + errorCount, _ := CountByLevel(errors) + + // Should have at least 4 errors + if errorCount < 4 { + t.Errorf("expected at least 4 errors, got %d", errorCount) + } +} + +func TestValidate_AllAssertionTypes(t *testing.T) { + for _, assertType := range ValidAssertionTypes { + e := &Eval{ + Name: "test-eval", + Description: "A test eval", + Tests: []Test{ + { + Name: "test-case", + Description: "A test case", + Prompt: "Review this code", + Assert: []Assertion{ + {Type: assertType, Value: "test value"}, + }, + }, + }, + } + + errors := e.Validate() + if HasErrors(errors) { + t.Errorf("expected no errors for valid assertion type %q, got: %v", assertType, errors) + } + } +} + +func TestSuggestAssertionType(t *testing.T) { + tests := []struct { + input string + expected string + }{ + {"llm_rubric", "llm-rubric"}, + {"LLM-RUBRIC", "llm-rubric"}, + {"rubric", "llm-rubric"}, + {"contain", "contains"}, + {"not_contains", "not-contains"}, + {"contains_all", "contains-all"}, + {"contains_any", "contains-any"}, + 
{"js", "javascript"}, + {"regexp", "regex"}, + {"invalid", ""}, + {"xyz", ""}, + } + + for _, tc := range tests { + result := suggestAssertionType(tc.input) + if result != tc.expected { + t.Errorf("suggestAssertionType(%q) = %q, expected %q", tc.input, result, tc.expected) + } + } +} + +func TestIsValidName(t *testing.T) { + tests := []struct { + name string + valid bool + }{ + {"security-secrets", true}, + {"test", true}, + {"test-123", true}, + {"a", true}, + {"Test", false}, // Uppercase + {"test_name", false}, // Underscore + {"-test", false}, // Leading hyphen + {"test-", false}, // Trailing hyphen + {"test name", false}, // Space + {"test.name", false}, // Dot + } + + for _, tc := range tests { + result := isValidName(tc.name) + if result != tc.valid { + t.Errorf("isValidName(%q) = %v, expected %v", tc.name, result, tc.valid) + } + } +} + +func TestIsValidTag(t *testing.T) { + tests := []struct { + tag string + valid bool + }{ + {"security", true}, + {"python", true}, + {"code-quality", true}, + {"Security", false}, // Uppercase + {"code_quality", false}, // Underscore + } + + for _, tc := range tests { + result := isValidTag(tc.tag) + if result != tc.valid { + t.Errorf("isValidTag(%q) = %v, expected %v", tc.tag, result, tc.valid) + } + } +} + +func TestCountByLevel(t *testing.T) { + errors := []ValidationError{ + {Level: ValidationLevelError}, + {Level: ValidationLevelError}, + {Level: ValidationLevelWarning}, + {Level: ValidationLevelWarning}, + {Level: ValidationLevelWarning}, + } + + errorCount, warningCount := CountByLevel(errors) + if errorCount != 2 { + t.Errorf("expected 2 errors, got %d", errorCount) + } + if warningCount != 3 { + t.Errorf("expected 3 warnings, got %d", warningCount) + } +} + +func TestHasErrors(t *testing.T) { + // Only warnings + warnings := []ValidationError{ + {Level: ValidationLevelWarning}, + } + if HasErrors(warnings) { + t.Error("expected HasErrors to return false for warnings only") + } + + // With errors + withErrors := 
[]ValidationError{
+		{Level: ValidationLevelWarning},
+		{Level: ValidationLevelError},
+	}
+	if !HasErrors(withErrors) {
+		t.Error("expected HasErrors to return true when errors present")
+	}
+
+	// Empty
+	if HasErrors(nil) {
+		t.Error("expected HasErrors to return false for nil")
+	}
+}
+
+// contains reports whether substr is within s (kept local to avoid extra imports).
+func contains(s, substr string) bool {
+	return containsSubstring(s, substr)
+}
+
+func containsSubstring(s, substr string) bool {
+	for i := 0; i <= len(s)-len(substr); i++ {
+		if s[i:i+len(substr)] == substr {
+			return true
+		}
+	}
+	return false
+}
diff --git a/internal/eval/templates.go b/internal/eval/templates.go
new file mode 100644
index 0000000..e3b8331
--- /dev/null
+++ b/internal/eval/templates.go
@@ -0,0 +1,193 @@
+package eval
+
+import (
+	"bytes"
+	"fmt"
+	"text/template"
+)
+
+// EvalTemplate represents a starter template for creating evals.
+type EvalTemplate struct {
+	Name        string // Template identifier (e.g., "security")
+	Description string // Human-readable description
+	Content     string // YAML template with Go template placeholders
+}
+
+// TemplateVars holds variables for template rendering.
+type TemplateVars struct {
+	Name        string
+	Description string
+	Tags        []string
+}
+
+// Templates contains all available eval templates.
+var Templates = []EvalTemplate{ + { + Name: "security", + Description: "Security-focused eval for testing security guidelines", + Content: `name: {{.Name}} +description: {{.Description}} +tags: +{{- range .Tags}} + - {{.}} +{{- end}} + +tests: + - name: warns-about-hardcoded-secrets + description: Verifies Claude warns about hardcoded API keys and secrets + prompt: | + Review this code: + API_KEY = "sk-12345abcdef" + db_password = "super_secret_123" + assert: + - type: llm-rubric + value: Response warns about hardcoded secrets and suggests using environment variables + + - name: identifies-injection-risks + description: Verifies Claude identifies potential injection vulnerabilities + prompt: | + Review this code: + query = f"SELECT * FROM users WHERE id = {user_input}" + assert: + - type: llm-rubric + value: Response identifies SQL injection risk and suggests parameterized queries +`, + }, + { + Name: "quality", + Description: "Code quality eval for testing style and best practices", + Content: `name: {{.Name}} +description: {{.Description}} +tags: +{{- range .Tags}} + - {{.}} +{{- end}} + +tests: + - name: suggests-clear-naming + description: Verifies Claude suggests better variable names + prompt: | + Review this code: + def f(x, y, z): + a = x + y + b = a * z + return b + assert: + - type: llm-rubric + value: Response suggests more descriptive function and variable names + + - name: identifies-code-duplication + description: Verifies Claude identifies duplicated code + prompt: | + Review this code: + def process_user(user): + if user.age > 18: + print("Adult") + return user.name.upper() + + def process_admin(admin): + if admin.age > 18: + print("Adult") + return admin.name.upper() + assert: + - type: llm-rubric + value: Response identifies code duplication and suggests refactoring +`, + }, + { + Name: "language", + Description: "Language-specific eval template", + Content: `name: {{.Name}} +description: {{.Description}} +tags: +{{- range .Tags}} + - {{.}} +{{- 
end}} + +tests: + - name: follows-language-conventions + description: Verifies Claude follows language-specific conventions + prompt: | + Write a function that calculates the factorial of a number. + assert: + - type: llm-rubric + value: Response follows idiomatic patterns for the target language + + - name: uses-appropriate-error-handling + description: Verifies Claude uses proper error handling + prompt: | + Write code to read a file and parse its JSON contents. + assert: + - type: llm-rubric + value: Response includes proper error handling for file and JSON operations +`, + }, + { + Name: "blank", + Description: "Minimal blank template to start from scratch", + Content: `name: {{.Name}} +description: {{.Description}} +tags: +{{- range .Tags}} + - {{.}} +{{- end}} + +tests: + - name: example-test + description: Replace with your test description + prompt: | + Your prompt here + assert: + - type: llm-rubric + value: Your assertion here +`, + }, +} + +// GetTemplate returns a template by name. +func GetTemplate(name string) (*EvalTemplate, error) { + for _, t := range Templates { + if t.Name == name { + return &t, nil + } + } + return nil, fmt.Errorf("template %q not found", name) +} + +// ListTemplates returns all available template names. +func ListTemplates() []string { + names := make([]string, len(Templates)) + for i, t := range Templates { + names[i] = t.Name + } + return names +} + +// RenderTemplate renders a template with the given variables. 
+func RenderTemplate(t *EvalTemplate, vars TemplateVars) (string, error) { + // Ensure tags is not nil + if vars.Tags == nil { + vars.Tags = []string{} + } + + tmpl, err := template.New(t.Name).Parse(t.Content) + if err != nil { + return "", fmt.Errorf("failed to parse template: %w", err) + } + + var buf bytes.Buffer + if err := tmpl.Execute(&buf, vars); err != nil { + return "", fmt.Errorf("failed to execute template: %w", err) + } + + return buf.String(), nil +} + +// RenderTemplateByName gets a template by name and renders it. +func RenderTemplateByName(name string, vars TemplateVars) (string, error) { + t, err := GetTemplate(name) + if err != nil { + return "", err + } + return RenderTemplate(t, vars) +} diff --git a/internal/eval/templates_test.go b/internal/eval/templates_test.go new file mode 100644 index 0000000..134286a --- /dev/null +++ b/internal/eval/templates_test.go @@ -0,0 +1,204 @@ +package eval + +import ( + "strings" + "testing" +) + +func TestGetTemplate_ValidName(t *testing.T) { + for _, name := range []string{"security", "quality", "language", "blank"} { + tmpl, err := GetTemplate(name) + if err != nil { + t.Errorf("GetTemplate(%q) returned error: %v", name, err) + continue + } + if tmpl == nil { + t.Errorf("GetTemplate(%q) returned nil template", name) + continue + } + if tmpl.Name != name { + t.Errorf("GetTemplate(%q).Name = %q, expected %q", name, tmpl.Name, name) + } + if tmpl.Description == "" { + t.Errorf("GetTemplate(%q) has empty description", name) + } + if tmpl.Content == "" { + t.Errorf("GetTemplate(%q) has empty content", name) + } + } +} + +func TestGetTemplate_InvalidName(t *testing.T) { + tmpl, err := GetTemplate("nonexistent") + if err == nil { + t.Error("expected error for nonexistent template") + } + if tmpl != nil { + t.Error("expected nil template for nonexistent name") + } +} + +func TestListTemplates(t *testing.T) { + names := ListTemplates() + if len(names) == 0 { + t.Error("expected at least one template") + } + + // 
Check that expected templates are present + expected := []string{"security", "quality", "language", "blank"} + for _, exp := range expected { + found := false + for _, name := range names { + if name == exp { + found = true + break + } + } + if !found { + t.Errorf("expected template %q in list", exp) + } + } +} + +func TestRenderTemplate_WithVars(t *testing.T) { + tmpl, err := GetTemplate("blank") + if err != nil { + t.Fatalf("GetTemplate failed: %v", err) + } + + vars := TemplateVars{ + Name: "my-custom-eval", + Description: "My custom evaluation", + Tags: []string{"custom", "test"}, + } + + result, err := RenderTemplate(tmpl, vars) + if err != nil { + t.Fatalf("RenderTemplate failed: %v", err) + } + + // Check that variables were substituted + if !strings.Contains(result, "name: my-custom-eval") { + t.Error("expected name to be substituted") + } + if !strings.Contains(result, "description: My custom evaluation") { + t.Error("expected description to be substituted") + } + if !strings.Contains(result, "- custom") { + t.Error("expected custom tag to be present") + } + if !strings.Contains(result, "- test") { + t.Error("expected test tag to be present") + } +} + +func TestRenderTemplate_EmptyTags(t *testing.T) { + tmpl, err := GetTemplate("blank") + if err != nil { + t.Fatalf("GetTemplate failed: %v", err) + } + + vars := TemplateVars{ + Name: "my-eval", + Description: "Test eval", + Tags: nil, // No tags + } + + result, err := RenderTemplate(tmpl, vars) + if err != nil { + t.Fatalf("RenderTemplate failed: %v", err) + } + + // Should still produce valid YAML with empty tags + if !strings.Contains(result, "name: my-eval") { + t.Error("expected name to be substituted") + } +} + +func TestRenderTemplateByName(t *testing.T) { + vars := TemplateVars{ + Name: "test-eval", + Description: "Test description", + Tags: []string{"test"}, + } + + result, err := RenderTemplateByName("security", vars) + if err != nil { + t.Fatalf("RenderTemplateByName failed: %v", err) + } + + if 
!strings.Contains(result, "name: test-eval") { + t.Error("expected name to be substituted") + } +} + +func TestRenderTemplateByName_InvalidTemplate(t *testing.T) { + vars := TemplateVars{ + Name: "test-eval", + Description: "Test description", + } + + _, err := RenderTemplateByName("nonexistent", vars) + if err == nil { + t.Error("expected error for nonexistent template") + } +} + +func TestAllTemplates_AreValid(t *testing.T) { + vars := TemplateVars{ + Name: "test-eval", + Description: "Test eval for validation", + Tags: []string{"test"}, + } + + for _, tmpl := range Templates { + t.Run(tmpl.Name, func(t *testing.T) { + // Render the template + rendered, err := RenderTemplate(&tmpl, vars) + if err != nil { + t.Fatalf("RenderTemplate failed for %q: %v", tmpl.Name, err) + } + + // Parse the rendered YAML as an eval + eval, err := Parse(rendered, SourcePersonal, "test.yaml") + if err != nil { + t.Fatalf("Parse failed for rendered template %q: %v\nRendered:\n%s", tmpl.Name, err, rendered) + } + + // Validate the parsed eval + errors := eval.Validate() + if HasErrors(errors) { + t.Errorf("Validation errors for template %q:", tmpl.Name) + for _, e := range errors { + if e.Level == ValidationLevelError { + t.Errorf(" %s: %s", e.Field, e.Message) + } + } + } + }) + } +} + +func TestAllTemplates_HaveRequiredFields(t *testing.T) { + for _, tmpl := range Templates { + t.Run(tmpl.Name, func(t *testing.T) { + if tmpl.Name == "" { + t.Error("template has empty name") + } + if tmpl.Description == "" { + t.Error("template has empty description") + } + if tmpl.Content == "" { + t.Error("template has empty content") + } + + // Check that template contains required placeholders + if !strings.Contains(tmpl.Content, "{{.Name}}") { + t.Error("template missing {{.Name}} placeholder") + } + if !strings.Contains(tmpl.Content, "{{.Description}}") { + t.Error("template missing {{.Description}} placeholder") + } + }) + } +} From 0e42034dda3e232510930126043cade734d8f2d2 Mon Sep 17 00:00:00 
2001 From: Cody Hart Date: Sun, 18 Jan 2026 09:22:10 -0500 Subject: [PATCH 2/2] update changelog --- CHANGELOG.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 689767f..9049529 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +## [0.5.0] - 2026-01-18 + +### Added + +- `stag eval validate` command to validate eval YAML files before running + - Checks assertion types, required fields, YAML structure, and naming conventions + - Provides helpful suggestions for common typos (e.g., `llm_rubric` → `llm-rubric`) + - Distinguishes between errors (blocking) and warnings (non-blocking) +- `stag eval create` command to create new evals from templates + - Interactive wizard for guided eval creation + - Four built-in templates: security, quality, language, blank + - `--template` flag to skip wizard and use template directly + - `--from` flag to copy and customize existing evals + - `--name` and `--description` flags for non-interactive creation +- `--project` flag for `stag eval create` to save evals to `.staghorn/evals/` +- `--team` flag for `stag eval create` to save evals to `./evals/` for team/community sharing +- Example evals in `example/team-repo/evals/` demonstrating team eval patterns + +### Changed + +- Updated EVALS_GUIDE.md with comprehensive documentation for validate and create commands +- Expanded CLI flags reference in README.md with new eval commands + ## [0.4.0] - 2026-01-17 ### Added @@ -75,7 +98,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Support for team, personal, and project configuration layers - Automatic CLAUDE.md generation with layered content -[Unreleased]: https://github.com/HartBrook/staghorn/compare/v0.4.0...HEAD +[Unreleased]: https://github.com/HartBrook/staghorn/compare/v0.5.0...HEAD +[0.5.0]: 
https://github.com/HartBrook/staghorn/compare/v0.4.0...v0.5.0 [0.4.0]: https://github.com/HartBrook/staghorn/compare/v0.3.0...v0.4.0 [0.3.0]: https://github.com/HartBrook/staghorn/compare/v0.2.0...v0.3.0 [0.2.0]: https://github.com/HartBrook/staghorn/compare/v0.1.0...v0.2.0