diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index 7ffc088..b733539 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -11,7 +11,7 @@
       "name": "look",
       "source": "./src",
       "description": "Sequential code review with fresh agent contexts. Runs multiple independent review passes to catch more issues.",
-      "version": "0.2.0",
+      "version": "0.2.1",
       "author": { "name": "HartBrook" },
       "repository": "https://github.com/HartBrook/lookagain",
       "license": "MIT",
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 881056e..4e72da5 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.1] - 2026-01-28
+
+### Fixed
+
+- Arguments like `auto-fix` now use `$ARGUMENTS.<name>` syntax at decision points in command prompts, not just in display sections. Previously, the executing agent could miss interpolated values and fall back to safe defaults (e.g., `auto-fix=false`).
+
+### Added
+
+- Behavioral evals via [promptfoo](https://promptfoo.dev) (`make eval`) that verify models correctly interpret argument values
+- Static test (`test_argument_interpolation`) that enforces every frontmatter argument is referenced as `$ARGUMENTS.<name>` in the instruction body
+- Contributing guide sections for running tests, setting `ANTHROPIC_API_KEY`, and writing command prompts
+
 ## [0.2.0] - 2026-01-28
 
 ### Added
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 3c6541b..11af6d0 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -29,6 +29,9 @@ lookagain/
 ├── scripts/
 │   ├── package.sh           # Build script
 │   └── test.sh              # Plugin validation tests
+├── evals/                   # Behavioral evals (promptfoo)
+│   ├── promptfooconfig.yaml
+│   └── prompt-loader.js
 ├── dist/                    # Build output (git-ignored)
 └── Makefile
 ```
@@ -53,6 +56,33 @@ make help
 ```
 
 `make dev` builds the plugin and starts a new Claude Code session with it loaded. Test with `/look:again`.
 
+### Running Tests
+
+```bash
+# Structural validation (file existence, JSON, frontmatter, cross-refs)
+make test
+
+# Behavioral evals — verifies models interpret prompt arguments correctly
+# Requires ANTHROPIC_API_KEY
+make eval
+```
+
+`make test` runs fast, offline checks that validate plugin structure: file existence, JSON validity, frontmatter fields, cross-references between manifests, and that all frontmatter arguments are referenced as `$ARGUMENTS.<name>` in the instruction body (not just in display sections).
+
+`make eval` runs [promptfoo](https://promptfoo.dev) evals that send the interpolated prompts to Claude and assert on behavioral correctness. For example, it verifies that `auto-fix=false` causes the model to skip fixes, and that `passes=5` results in 5 planned passes.
+
+Evals require an Anthropic API key and cost a small amount per run. Set the key before running:
+
+```bash
+# Option 1: export for the current shell session
+export ANTHROPIC_API_KEY=sk-ant-...
+
+# Option 2: inline for a single run
+ANTHROPIC_API_KEY=sk-ant-... make eval
+```
+
+Get an API key at [console.anthropic.com](https://console.anthropic.com/settings/keys).
+
 ### Testing via Marketplace (local)
 
 You can also test the plugin through the marketplace install flow, which is closer to what end users experience:
@@ -86,6 +116,15 @@ You can also test the plugin through the marketplace install flow, which is closer to what end users experience:
 - **[src/commands/tidy.md](src/commands/tidy.md)**: Tidy command for pruning old review runs.
 - **[.claude-plugin/marketplace.json](.claude-plugin/marketplace.json)**: Marketplace manifest for plugin discovery and installation.
 
+### Writing Command Prompts
+
+When editing or adding command prompts in `src/commands/`:
+
+- Define arguments in the YAML frontmatter with `name`, `description`, and `default`.
+- Reference arguments in the instruction body using `$ARGUMENTS.<name>` — not just in display sections. The executing agent needs to see the interpolated value at the point where it makes decisions. For example, write `If $ARGUMENTS.auto-fix is true` rather than `If auto-fix is enabled`.
+- `make test` enforces that every frontmatter argument appears as `$ARGUMENTS.<name>` somewhere in the body. If you add an argument, the test will fail until you reference it.
+- After changing prompt logic, run `make eval` to verify models still interpret the arguments correctly.
+
 ## Pull Requests
 
 1. Fork the repository
diff --git a/Makefile b/Makefile
index 9bec581..d837424 100644
--- a/Makefile
+++ b/Makefile
@@ -15,6 +15,9 @@ dev: build ## Build and start Claude Code with plugin loaded
 	@echo ""
 	@claude --plugin-dir ./dist/lookagain
 
+eval: ## Run behavioral evals (requires ANTHROPIC_API_KEY)
+	@npx promptfoo@latest eval -c evals/promptfooconfig.yaml
+
 clean: ## Remove build artifacts
 	@rm -rf dist/
 	@echo "Cleaned dist/"
diff --git a/README.md b/README.md
index fb6a1c1..32f9839 100644
--- a/README.md
+++ b/README.md
@@ -140,6 +140,16 @@ Previous runs are preserved. Use `/look:tidy` to prune old results:
 | should_fix | No  | Performance issues, poor error handling |
 | suggestion | No  | Refactoring, documentation, style       |
 
+## Development
+
+```bash
+make test   # Structural validation (offline, fast)
+make eval   # Behavioral evals via promptfoo (requires ANTHROPIC_API_KEY)
+make dev    # Build and start Claude Code with the plugin loaded
+```
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for full development setup and guidelines.
+
 ## License
 
 MIT
diff --git a/evals/.gitignore b/evals/.gitignore
new file mode 100644
index 0000000..db6cc49
--- /dev/null
+++ b/evals/.gitignore
@@ -0,0 +1,3 @@
+node_modules/
+output/
+*.html
diff --git a/evals/prompt-loader.js b/evals/prompt-loader.js
new file mode 100644
index 0000000..84daccc
--- /dev/null
+++ b/evals/prompt-loader.js
@@ -0,0 +1,47 @@
+// Loads a markdown command file, strips frontmatter, interpolates
+// $ARGUMENTS.* tokens with test-case variables, and prepends a
+// meta-instruction so the model describes its plan without executing.
+
+const fs = require("fs");
+const path = require("path");
+
+/**
+ * @param {object} context
+ * @param {Record<string, string>} context.vars
+ * @returns {string}
+ */
+function generatePrompt(context) {
+  const { vars } = context;
+  const filePath = path.resolve(__dirname, "..", vars.prompt_file);
+  const raw = fs.readFileSync(filePath, "utf-8");
+
+  // Strip YAML frontmatter (between opening and closing ---)
+  const stripped = raw.replace(/^---\n[\s\S]*?\n---\n/, "");
+
+  // Replace $ARGUMENTS.<name> with the matching arg_<name> variable.
+  // Argument names may contain hyphens (e.g. auto-fix, max-passes).
+  const interpolated = stripped.replace(
+    /\$ARGUMENTS\.([\w-]+)/g,
+    (_match, name) => {
+      const key = `arg_${name}`;
+      if (key in vars) {
+        return vars[key];
+      }
+      return _match; // leave unresolved tokens as-is
+    },
+  );
+
+  const meta = [
+    "You are analyzing a Claude Code plugin command prompt.",
+    "Describe step-by-step what you would do given this command and its configuration.",
+    "Be specific about how each configuration value affects your behavior.",
+    "Do NOT execute anything — just describe your plan.",
+    "",
+    "---",
+    "",
+  ].join("\n");
+
+  return meta + interpolated;
+}
+
+module.exports = generatePrompt;
diff --git a/evals/promptfooconfig.yaml b/evals/promptfooconfig.yaml
new file mode 100644
index 0000000..46fee81
--- /dev/null
+++ b/evals/promptfooconfig.yaml
@@ -0,0 +1,141 @@
+description: "Behavioral evals for lookagain prompt interpolation"
+
+prompts:
+  - file://prompt-loader.js
+
+providers:
+  - id: anthropic:messages:claude-sonnet-4-20250514
+    config:
+      max_tokens: 2048
+
+tests:
+  # ==================================================================
+  # again.md — auto-fix interpretation
+  # ==================================================================
+  - description: "auto-fix=true → model plans to apply fixes"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must clearly state that it will automatically fix
+          or apply fixes for must_fix issues between review passes.
+          It should NOT say it will skip fixing or leave fixes to the user.
+
+  - description: "auto-fix=false → model skips fixes"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "false"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must clearly state that auto-fix is disabled or false,
+          and that it will NOT automatically apply fixes between passes.
+          It should not describe applying any code fixes.
+
+  # ==================================================================
+  # again.md — passes count
+  # ==================================================================
+  - description: "passes=5 → model plans exactly 5 initial passes"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "5"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: icontains
+        value: "5"
+      - type: llm-rubric
+        value: >
+          The response must indicate it will run 5 review passes
+          (not 3, which is the default). It should plan for exactly
+          5 sequential passes before considering additional passes.
+
+  # ==================================================================
+  # again.md — model resolution
+  # ==================================================================
+  - description: "model=fast → reviewer uses haiku"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: fast
+      arg_max-passes: "7"
+    assert:
+      - type: icontains
+        value: haiku
+      - type: llm-rubric
+        value: >
+          The response must indicate that the reviewer subagent model
+          will be set to haiku, since the model argument is "fast".
+
+  - description: "model=thorough → no explicit model override"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: staged
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must indicate that for model=thorough, the model
+          parameter is omitted from the Task tool call (it inherits the
+          current model). It should NOT set the model to haiku or sonnet.
+
+  # ==================================================================
+  # again.md — scope resolution
+  # ==================================================================
+  - description: "target=branch → branch-based diff scope"
+    vars:
+      prompt_file: src/commands/again.md
+      arg_passes: "3"
+      arg_target: branch
+      arg_auto-fix: "true"
+      arg_model: thorough
+      arg_max-passes: "7"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must indicate that the scope is branch-based,
+          reviewing all changes on the current branch versus the base
+          branch. It should reference branch comparison or merge-base.
+
+  # ==================================================================
+  # tidy.md — all flag
+  # ==================================================================
+  - description: "all=true → removes all runs"
+    vars:
+      prompt_file: src/commands/tidy.md
+      arg_keep: "1"
+      arg_all: "true"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must state that ALL run directories will be removed,
+          regardless of date. It should not apply any date-based filtering.
+
+  - description: "all=false, keep=3 → date-based retention"
+    vars:
+      prompt_file: src/commands/tidy.md
+      arg_keep: "3"
+      arg_all: "false"
+    assert:
+      - type: llm-rubric
+        value: >
+          The response must describe calculating a cutoff date by subtracting
+          3 days from today, and only removing runs older than that cutoff.
+          It should keep runs from the last 3 days.
diff --git a/scripts/test.sh b/scripts/test.sh
index 9a65e0b..637098f 100755
--- a/scripts/test.sh
+++ b/scripts/test.sh
@@ -129,6 +129,51 @@ test_frontmatter() {
     check_frontmatter "$PROJECT_ROOT/src/skills/lookagain-output-format/SKILL.md" name description
 }
 
+test_argument_interpolation() {
+    # Verify that arguments defined in frontmatter are referenced using
+    # $ARGUMENTS.<name> syntax in the instruction body, not just in the
+    # Configuration display section. This prevents the executing agent
+    # from missing argument values and falling back to safe defaults.
+
+    for file in "$PROJECT_ROOT"/src/commands/*.md; do
+        local relpath="${file#"$PROJECT_ROOT"/}"
+
+        # Extract argument names from frontmatter
+        local args
+        args=$(awk '
+            NR==1 && /^---$/ { in_fm=1; next }
+            in_fm && /^---$/ { exit }
+            in_fm && /^  - name: / { gsub(/^  - name: /, ""); print }
+        ' "$file")
+
+        if [[ -z "$args" ]]; then
+            continue
+        fi
+
+        # Extract the body (everything after the second ---)
+        local body
+        body=$(awk '
+            NR==1 && /^---$/ { in_fm=1; next }
+            in_fm && /^---$/ { in_fm=0; next }
+            !in_fm { print }
+        ' "$file")
+
+        # For each argument, verify $ARGUMENTS.<arg> appears in the body
+        local all_found=1
+        while IFS= read -r arg; do
+            local ref="\$ARGUMENTS.${arg}"
+            if ! echo "$body" | grep -qF "$ref"; then
+                fail "$relpath: argument '$arg' defined but \$ARGUMENTS.$arg never used in body"
+                all_found=0
+            fi
+        done <<< "$args"
+
+        if [[ $all_found -eq 1 ]]; then
+            pass "$relpath: all arguments interpolated in body"
+        fi
+    done
+}
+
 test_cross_references() {
     local pjson="$PROJECT_ROOT/src/dot-claude-plugin/plugin.json"
 
@@ -324,6 +369,10 @@
 echo "--- frontmatter ---"
 test_frontmatter
 echo ""
 
+echo "--- argument interpolation ---"
+test_argument_interpolation
+echo ""
+
 echo "--- cross-references ---"
 test_cross_references
 echo ""
diff --git a/src/commands/again.md b/src/commands/again.md
index 81b684a..7243ae4 100644
--- a/src/commands/again.md
+++ b/src/commands/again.md
@@ -62,17 +62,17 @@ Map the model argument to the Task tool model parameter:
 
 CRITICAL: Passes run in sequence, NOT in parallel. Each pass reviews code after previous fixes.
 
-For each pass (1 through N):
+For each pass (1 through $ARGUMENTS.passes):
 
 **Review**: Spawn a fresh subagent via the Task tool using the `lookagain-reviewer` agent. Include: pass number, scope instruction, and instruction to use the `lookagain-output-format` skill. Set the model parameter based on the resolved model. Do NOT include findings from previous passes.
 
 **Collect**: Parse the JSON response. Store findings and track which pass found each issue.
 
-**Fix**: If auto-fix is enabled, apply fixes for `must_fix` issues only. Minimal changes, no refactoring.
+**Fix**: If `$ARGUMENTS.auto-fix` is `true`, apply fixes for `must_fix` issues only. Minimal changes, no refactoring.
 
 **Log**: "Pass N complete. Found X must_fix, Y should_fix, Z suggestions."
 
-After configured passes, if `must_fix` issues remain and passes < max-passes, run additional passes.
+After completing $ARGUMENTS.passes passes, if `must_fix` issues remain and total passes < $ARGUMENTS.max-passes, run additional passes.
 
 ## Phase 2: Aggregate
 
@@ -122,6 +122,6 @@ Include the count of previous runs (glob `.lookagain/????-??-??T??-??-??/`, subtract 1 for the current run).
 1. **Sequential**: Never launch passes in parallel. Each must complete before the next starts.
 2. **Fresh context**: Always use the Task tool for subagents.
 3. **Independence**: Never tell subagents what previous passes found.
-4. **Minimal fixes**: Only change what's necessary when auto-fixing.
+4. **Minimal fixes**: Only change what's necessary when `$ARGUMENTS.auto-fix` is `true`.
 5. **Valid JSON**: If subagent output fails to parse, log the error and continue.
-6. **Respect max-passes**: Never exceed the limit.
+6. **Respect max-passes**: Never exceed $ARGUMENTS.max-passes.
diff --git a/src/dot-claude-plugin/plugin.json b/src/dot-claude-plugin/plugin.json
index bbc4cc4..082f580 100644
--- a/src/dot-claude-plugin/plugin.json
+++ b/src/dot-claude-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
   "name": "look",
-  "version": "0.2.0",
+  "version": "0.2.1",
   "description": "Sequential code review with fresh agent contexts. Runs multiple independent review passes to catch more issues.",
   "author": { "name": "HartBrook" },
   "repository": "https://github.com/HartBrook/lookagain",