diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index 7ffc088..b733539 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -11,7 +11,7 @@
       "name": "look",
       "source": "./src",
       "description": "Sequential code review with fresh agent contexts. Runs multiple independent review passes to catch more issues.",
-      "version": "0.2.0",
+      "version": "0.2.1",
       "author": { "name": "HartBrook" },
       "repository": "https://github.com/HartBrook/lookagain",
       "license": "MIT",
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 881056e..4e72da5 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.1] - 2026-01-28
+
+### Fixed
+
+- Arguments like `auto-fix` now use `$ARGUMENTS.<name>` syntax at decision points in command prompts, not just in display sections. Previously, the executing agent could miss interpolated values and fall back to safe defaults (e.g., `auto-fix=false`).
+
+### Added
+
+- Behavioral evals via [promptfoo](https://promptfoo.dev) (`make eval`) that verify models correctly interpret argument values
+- Static test (`test_argument_interpolation`) that enforces every frontmatter argument is referenced as `$ARGUMENTS.<name>` in the instruction body
+- Contributing guide sections for running tests, setting `ANTHROPIC_API_KEY`, and writing command prompts
+
 ## [0.2.0] - 2026-01-28
 
 ### Added
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 3c6541b..11af6d0 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -29,6 +29,9 @@ lookagain/
 ├── scripts/
 │   ├── package.sh           # Build script
 │   └── test.sh              # Plugin validation tests
+├── evals/                   # Behavioral evals (promptfoo)
+│   ├── promptfooconfig.yaml
+│   └── prompt-loader.js
 ├── dist/                    # Build output (git-ignored)
 └── Makefile
 ```
@@ -53,6 +56,33 @@ make help
 ```
 
 `make dev` builds the plugin and starts a new Claude Code session with it loaded. Test with `/look:again`.
 
+### Running Tests
+
+```bash
+# Structural validation (file existence, JSON, frontmatter, cross-refs)
+make test
+
+# Behavioral evals — verifies models interpret prompt arguments correctly
+# Requires ANTHROPIC_API_KEY
+make eval
+```
+
+`make test` runs fast, offline checks that validate plugin structure: file existence, JSON validity, frontmatter fields, cross-references between manifests, and that all frontmatter arguments are referenced as `$ARGUMENTS.<name>` in the instruction body (not just in display sections).
+
+`make eval` runs [promptfoo](https://promptfoo.dev) evals that send the interpolated prompts to Claude and assert on behavioral correctness. For example, it verifies that `auto-fix=false` causes the model to skip fixes, and that `passes=5` results in 5 planned passes.
+
+Evals require an Anthropic API key and cost a small amount per run. Set the key before running:
+
+```bash
+# Option 1: export for the current shell session
+export ANTHROPIC_API_KEY=sk-ant-...
+
+# Option 2: inline for a single run
+ANTHROPIC_API_KEY=sk-ant-... make eval
+```
+
+Get an API key at [console.anthropic.com](https://console.anthropic.com/settings/keys).
+
 ### Testing via Marketplace (local)
 
 You can also test the plugin through the marketplace install flow, which is closer to what end users experience:
@@ -86,6 +116,15 @@ You can also test the plugin through the marketplace install flow, which is closer to what end users experience:
 - **[src/commands/tidy.md](src/commands/tidy.md)**: Tidy command for pruning old review runs.
 - **[.claude-plugin/marketplace.json](.claude-plugin/marketplace.json)**: Marketplace manifest for plugin discovery and installation.
 
+### Writing Command Prompts
+
+When editing or adding command prompts in `src/commands/`:
+
+- Define arguments in the YAML frontmatter with `name`, `description`, and `default`.
+- Reference arguments in the instruction body using `$ARGUMENTS.<name>` — not just in display sections. The executing agent needs to see the interpolated value at the point where it makes decisions. For example, write `If $ARGUMENTS.auto-fix is true` rather than `If auto-fix is enabled`.
+- `make test` enforces that every frontmatter argument appears as `$ARGUMENTS.<name>` somewhere in the body. If you add an argument, the test will fail until you reference it.
+- After changing prompt logic, run `make eval` to verify models still interpret the arguments correctly.
+
 ## Pull Requests
 
 1. Fork the repository
diff --git a/Makefile b/Makefile
index 9bec581..d837424 100644
--- a/Makefile
+++ b/Makefile
@@ -15,6 +15,9 @@ dev: build ## Build and start Claude Code with plugin loaded
 	@echo ""
 	@claude --plugin-dir ./dist/lookagain
 
+eval: ## Run behavioral evals (requires ANTHROPIC_API_KEY)
+	@npx promptfoo@latest eval -c evals/promptfooconfig.yaml
+
 clean: ## Remove build artifacts
 	@rm -rf dist/
 	@echo "Cleaned dist/"
diff --git a/README.md b/README.md
index fb6a1c1..32f9839 100644
--- a/README.md
+++ b/README.md
@@ -140,6 +140,16 @@ Previous runs are preserved. Use `/look:tidy` to prune old results:
 | should_fix | No  | Performance issues, poor error handling |
 | suggestion | No  | Refactoring, documentation, style       |
 
+## Development
+
+```bash
+make test   # Structural validation (offline, fast)
+make eval   # Behavioral evals via promptfoo (requires ANTHROPIC_API_KEY)
+make dev    # Build and start Claude Code with the plugin loaded
+```
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for full development setup and guidelines.
+
 ## License
 
 MIT
diff --git a/evals/.gitignore b/evals/.gitignore
new file mode 100644
index 0000000..db6cc49
--- /dev/null
+++ b/evals/.gitignore
@@ -0,0 +1,3 @@
+node_modules/
+output/
+*.html
diff --git a/evals/prompt-loader.js b/evals/prompt-loader.js
new file mode 100644
index 0000000..84daccc
--- /dev/null
+++ b/evals/prompt-loader.js
@@ -0,0 +1,47 @@
+// Loads a markdown command file, strips frontmatter, interpolates
+// $ARGUMENTS.* tokens with test-case variables, and prepends a
+// meta-instruction so the model describes its plan without executing.
+
+const fs = require("fs");
+const path = require("path");
+
+/**
+ * @param {object} context
+ * @param {Record<string, string>} context.vars
+ * @returns {string}
+ */
+function generatePrompt(context) {
+  const { vars } = context;
+  const filePath = path.resolve(__dirname, "..", vars.prompt_file);
+  const raw = fs.readFileSync(filePath, "utf-8");
+
+  // Strip YAML frontmatter (between opening and closing ---)
+  const stripped = raw.replace(/^---\n[\s\S]*?\n---\n/, "");
+
+  // Replace $ARGUMENTS.<name> with the matching arg_<name> variable.
+  // Argument names may contain hyphens (e.g. auto-fix, max-passes).
+  const interpolated = stripped.replace(
+    /\$ARGUMENTS\.([\w-]+)/g,
+    (_match, name) => {
+      const key = `arg_${name}`;
+      if (key in vars) {
+        return vars[key];
+      }
+      return _match; // leave unresolved tokens as-is
+    },
+  );
+
+  const meta = [
+    "You are analyzing a Claude Code plugin command prompt.",
+    "Describe step-by-step what you would do given this command and its configuration.",
+    "Be specific about how each configuration value affects your behavior.",
+    "Do NOT execute anything — just describe your plan.",
+    "",
+    "---",
+    "",
+  ].join("\n");
+
+  return meta + interpolated;
+}
+
+module.exports = generatePrompt;
diff --git a/evals/promptfooconfig.yaml b/evals/promptfooconfig.yaml
new file mode 100644
index 0000000..46fee81
--- /dev/null
+++ b/evals/promptfooconfig.yaml
@@ -0,0 +1,141 @@
+description: "Behavioral evals for lookagain prompt interpolation"
+
+prompts:
+  - file://prompt-loader.js
+
+providers:
+  - id: anthropic:messages:claude-sonnet-4-20250514
+    config:
+      max_tokens: 2048
+
+tests:
+  # ==================================================================
+  # again.md — auto-fix interpretation
+  # ==================================================================
+  - description: "auto-fix=true → model plans to apply fixes"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must clearly state that it will automatically fix
+          or apply fixes for must_fix issues between review passes.
+          It should NOT say it will skip fixing or leave fixes to the user.
+
+  - description: "auto-fix=false → model skips fixes"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "false"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must clearly state that auto-fix is disabled or false,
+          and that it will NOT automatically apply fixes between passes.
+          It should not describe applying any code fixes.
+
+  # ==================================================================
+  # again.md — passes count
+  # ==================================================================
+  - description: "passes=5 → model plans exactly 5 initial passes"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "5"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: icontains
+        value: "5"
+      - type: llm-rubric
+        value: >
+          The response must indicate it will run 5 review passes
+          (not 3, which is the default). It should plan for exactly
+          5 sequential passes before considering additional passes.
+
+  # ==================================================================
+  # again.md — model resolution
+  # ==================================================================
+  - description: "model=fast → reviewer uses haiku"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: fast
+      arg_max-passes: "7"
+    assert:
+      - type: icontains
+        value: haiku
+      - type: llm-rubric
+        value: >
+          The response must indicate that the reviewer subagent model
+          will be set to haiku, since the model argument is "fast".
+
+  - description: "model=thorough → no explicit model override"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must indicate that for model=thorough, the model
+          parameter is omitted from the Task tool call (it inherits the
+          current model). It should NOT set the model to haiku or sonnet.
+
+  # ==================================================================
+  # again.md — scope resolution
+  # ==================================================================
+  - description: "target=branch → branch-based diff scope"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: branch
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must indicate that the scope is branch-based,
+          reviewing all changes on the current branch versus the base
+          branch. It should reference branch comparison or merge-base.
+
+  # ==================================================================
+  # tidy.md — all flag
+  # ==================================================================
+  - description: "all=true → removes all runs"
+    vars:
+      prompt_file: src/commands/tidy.md
+      arg_keep: "1"
+      arg_all: "true"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must state that ALL run directories will be removed,
+          regardless of date. It should not apply any date-based filtering.
+
+  - description: "all=false, keep=3 → date-based retention"
+    vars:
+      prompt_file: src/commands/tidy.md
+      arg_keep: "3"
+      arg_all: "false"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must describe calculating a cutoff date by subtracting
+          3 days from today, and only removing runs older than that cutoff.
+          It should keep runs from the last 3 days.
diff --git a/scripts/test.sh b/scripts/test.sh
index 9a65e0b..637098f 100755
--- a/scripts/test.sh
+++ b/scripts/test.sh
@@ -129,6 +129,51 @@ test_frontmatter() {
     check_frontmatter "$PROJECT_ROOT/src/skills/lookagain-output-format/SKILL.md" name description
 }
 
+test_argument_interpolation() {
+    # Verify that arguments defined in frontmatter are referenced using
+    # $ARGUMENTS.<name> syntax in the instruction body, not just in the
+    # Configuration display section. This prevents the executing agent
+    # from missing argument values and falling back to safe defaults.
+
+    for file in "$PROJECT_ROOT"/src/commands/*.md; do
+        local relpath="${file#"$PROJECT_ROOT"/}"
+
+        # Extract argument names from frontmatter
+        local args
+        args=$(awk '
+            NR==1 && /^---$/ { in_fm=1; next }
+            in_fm && /^---$/ { exit }
+            in_fm && /^  - name: / { gsub(/^  - name: /, ""); print }
+        ' "$file")
+
+        if [[ -z "$args" ]]; then
+            continue
+        fi
+
+        # Extract the body (everything after the second ---)
+        local body
+        body=$(awk '
+            NR==1 && /^---$/ { in_fm=1; next }
+            in_fm && /^---$/ { in_fm=0; next }
+            !in_fm { print }
+        ' "$file")
+
+        # For each argument, verify $ARGUMENTS.<arg> appears in the body
+        local all_found=1
+        while IFS= read -r arg; do
+            local ref="\$ARGUMENTS.${arg}"
+            if ! echo "$body" | grep -qF "$ref"; then
+                fail "$relpath: argument '$arg' defined but \$ARGUMENTS.$arg never used in body"
+                all_found=0
+            fi
+        done <<< "$args"
+
+        if [[ $all_found -eq 1 ]]; then
+            pass "$relpath: all arguments interpolated in body"
+        fi
+    done
+}
+
 test_cross_references() {
     local pjson="$PROJECT_ROOT/src/dot-claude-plugin/plugin.json"
 
@@ -324,6 +369,10 @@
 echo "--- frontmatter ---"
 test_frontmatter
 echo ""
 
+echo "--- argument interpolation ---"
+test_argument_interpolation
+echo ""
+
 echo "--- cross-references ---"
 test_cross_references
 echo ""
diff --git a/src/commands/again.md b/src/commands/again.md
index 81b684a..7243ae4 100644
--- a/src/commands/again.md
+++ b/src/commands/again.md
@@ -62,17 +62,17 @@ Map the model argument to the Task tool model parameter:
 
 CRITICAL: Passes run in sequence, NOT in parallel. Each pass reviews code after previous fixes.
 
-For each pass (1 through N):
+For each pass (1 through $ARGUMENTS.passes):
 
 **Review**: Spawn a fresh subagent via the Task tool using the `lookagain-reviewer` agent. Include: pass number, scope instruction, and instruction to use the `lookagain-output-format` skill. Set the model parameter based on the resolved model. Do NOT include findings from previous passes.
 
 **Collect**: Parse the JSON response. Store findings and track which pass found each issue.
 
-**Fix**: If auto-fix is enabled, apply fixes for `must_fix` issues only. Minimal changes, no refactoring.
+**Fix**: If `$ARGUMENTS.auto-fix` is `true`, apply fixes for `must_fix` issues only. Minimal changes, no refactoring.
 
 **Log**: "Pass N complete. Found X must_fix, Y should_fix, Z suggestions."
 
-After configured passes, if `must_fix` issues remain and passes < max-passes, run additional passes.
+After completing $ARGUMENTS.passes passes, if `must_fix` issues remain and total passes < $ARGUMENTS.max-passes, run additional passes.
 
 ## Phase 2: Aggregate
 
@@ -122,6 +122,6 @@ Include the count of previous runs (glob `.lookagain/????-??-??T??-??-??/`, subtract 1 for the current run).
 1. **Sequential**: Never launch passes in parallel. Each must complete before the next starts.
 2. **Fresh context**: Always use the Task tool for subagents.
 3. **Independence**: Never tell subagents what previous passes found.
-4. **Minimal fixes**: Only change what's necessary when auto-fixing.
+4. **Minimal fixes**: Only change what's necessary when `$ARGUMENTS.auto-fix` is `true`.
 5. **Valid JSON**: If subagent output fails to parse, log the error and continue.
-6. **Respect max-passes**: Never exceed the limit.
+6. **Respect max-passes**: Never exceed $ARGUMENTS.max-passes.
diff --git a/src/dot-claude-plugin/plugin.json b/src/dot-claude-plugin/plugin.json
index bbc4cc4..082f580 100644
--- a/src/dot-claude-plugin/plugin.json
+++ b/src/dot-claude-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
   "name": "look",
-  "version": "0.2.0",
+  "version": "0.2.1",
   "description": "Sequential code review with fresh agent contexts. Runs multiple independent review passes to catch more issues.",
   "author": { "name": "HartBrook" },
   "repository": "https://github.com/HartBrook/lookagain",