2 changes: 1 addition & 1 deletion .claude-plugin/marketplace.json
@@ -11,7 +11,7 @@
"name": "look",
"source": "./src",
"description": "Sequential code review with fresh agent contexts. Runs multiple independent review passes to catch more issues.",
"version": "0.2.0",
"version": "0.2.1",
"author": { "name": "HartBrook" },
"repository": "https://github.com/HartBrook/lookagain",
"license": "MIT",
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,18 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.1] - 2026-01-28

### Fixed

- Arguments like `auto-fix` now use `$ARGUMENTS.<name>` syntax at decision points in command prompts, not just in display sections. Previously, the executing agent could miss interpolated values and fall back to safe defaults (e.g., `auto-fix=false`).

### Added

- Behavioral evals via [promptfoo](https://promptfoo.dev) (`make eval`) that verify models correctly interpret argument values
- Static test (`test_argument_interpolation`) that enforces every frontmatter argument is referenced as `$ARGUMENTS.<name>` in the instruction body
- Contributing guide sections for running tests, setting `ANTHROPIC_API_KEY`, and writing command prompts

## [0.2.0] - 2026-01-28

### Added
39 changes: 39 additions & 0 deletions CONTRIBUTING.md
@@ -29,6 +29,9 @@ lookagain/
├── scripts/
│ ├── package.sh # Build script
│ └── test.sh # Plugin validation tests
├── evals/ # Behavioral evals (promptfoo)
│ ├── promptfooconfig.yaml
│ └── prompt-loader.js
├── dist/ # Build output (git-ignored)
└── Makefile
```
@@ -53,6 +56,33 @@ make help

`make dev` builds the plugin and starts a new Claude Code session with it loaded. Test with `/look:again`.

### Running Tests

```bash
# Structural validation (file existence, JSON, frontmatter, cross-refs)
make test

# Behavioral evals — verifies models interpret prompt arguments correctly
# Requires ANTHROPIC_API_KEY
make eval
```

`make test` runs fast, offline checks that validate plugin structure: file existence, JSON validity, frontmatter fields, cross-references between manifests, and that all frontmatter arguments are referenced as `$ARGUMENTS.<name>` in the instruction body (not just in display sections).
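As an illustration only (hypothetical argument names and body text, not the real test code, which lives in `scripts/test.sh`), the core of that interpolation check amounts to:

```javascript
// Sketch of the static interpolation check: every frontmatter
// argument must appear as $ARGUMENTS.<name> somewhere in the body.
const args = ["passes", "auto-fix"]; // hypothetical frontmatter arguments
const body =
  "Run $ARGUMENTS.passes passes. If $ARGUMENTS.auto-fix is true, apply fixes.";

// Collect any argument never referenced in the instruction body.
const missing = args.filter((a) => !body.includes(`$ARGUMENTS.${a}`));

console.log(missing.length === 0 ? "ok" : `missing: ${missing.join(", ")}`);
// → "ok"
```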

`make eval` runs [promptfoo](https://promptfoo.dev) evals that send the interpolated prompts to Claude and assert on behavioral correctness. For example, it verifies that `auto-fix=false` causes the model to skip fixes, and that `passes=5` results in 5 planned passes.

Evals require an Anthropic API key and cost a small amount per run. Set the key before running:

```bash
# Option 1: export for the current shell session
export ANTHROPIC_API_KEY=sk-ant-...

# Option 2: inline for a single run
ANTHROPIC_API_KEY=sk-ant-... make eval
```

Get an API key at [console.anthropic.com](https://console.anthropic.com/settings/keys).

### Testing via Marketplace (local)

You can also test the plugin through the marketplace install flow, which is closer to what end users experience:
@@ -86,6 +116,15 @@ You can also test the plugin through the marketplace install flow, which is closer to what end users experience:
- **[src/commands/tidy.md](src/commands/tidy.md)**: Tidy command for pruning old review runs.
- **[.claude-plugin/marketplace.json](.claude-plugin/marketplace.json)**: Marketplace manifest for plugin discovery and installation.

### Writing Command Prompts

When editing or adding command prompts in `src/commands/`:

- Define arguments in the YAML frontmatter with `name`, `description`, and `default`.
- Reference arguments in the instruction body using `$ARGUMENTS.<name>` — not just in display sections. The executing agent needs to see the interpolated value at the point where it makes decisions. For example, write `If $ARGUMENTS.auto-fix is true` rather than `If auto-fix is enabled`.
- `make test` enforces that every frontmatter argument appears as `$ARGUMENTS.<name>` somewhere in the body. If you add an argument, the test will fail until you reference it.
- After changing prompt logic, run `make eval` to verify models still interpret the arguments correctly.
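A minimal sketch of a prompt that follows these rules (the argument, wording, and exact frontmatter key are illustrative, not copied from a real command file):

```markdown
---
arguments:
  - name: auto-fix
    description: Apply must_fix fixes between passes
    default: "true"
---

If $ARGUMENTS.auto-fix is true, apply must_fix fixes between review
passes; otherwise, report issues without modifying any code.
```

Note that the decision point itself reads the interpolated token, so the executing agent sees the concrete value rather than a prose paraphrase.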

## Pull Requests

1. Fork the repository
3 changes: 3 additions & 0 deletions Makefile
@@ -15,6 +15,9 @@ dev: build ## Build and start Claude Code with plugin loaded
@echo ""
@claude --plugin-dir ./dist/lookagain

eval: ## Run behavioral evals (requires ANTHROPIC_API_KEY)
@npx promptfoo@latest eval -c evals/promptfooconfig.yaml

clean: ## Remove build artifacts
@rm -rf dist/
@echo "Cleaned dist/"
10 changes: 10 additions & 0 deletions README.md
@@ -140,6 +140,16 @@ Previous runs are preserved. Use `/look:tidy` to prune old results:
| should_fix | No | Performance issues, poor error handling |
| suggestion | No | Refactoring, documentation, style |

## Development

```bash
make test # Structural validation (offline, fast)
make eval # Behavioral evals via promptfoo (requires ANTHROPIC_API_KEY)
make dev # Build and start Claude Code with the plugin loaded
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for full development setup and guidelines.

## License

MIT
3 changes: 3 additions & 0 deletions evals/.gitignore
@@ -0,0 +1,3 @@
node_modules/
output/
*.html
47 changes: 47 additions & 0 deletions evals/prompt-loader.js
@@ -0,0 +1,47 @@
// Loads a markdown command file, strips frontmatter, interpolates
// $ARGUMENTS.* tokens with test-case variables, and prepends a
// meta-instruction so the model describes its plan without executing.

const fs = require("fs");
const path = require("path");

/**
* @param {object} context
* @param {Record<string, string>} context.vars
* @returns {string}
*/
function generatePrompt(context) {
const { vars } = context;
const filePath = path.resolve(__dirname, "..", vars.prompt_file);
const raw = fs.readFileSync(filePath, "utf-8");

// Strip YAML frontmatter (between opening and closing ---)
const stripped = raw.replace(/^---\n[\s\S]*?\n---\n/, "");

// Replace $ARGUMENTS.<name> with matching arg_<name> variable.
// Argument names may contain hyphens (e.g. auto-fix, max-passes).
const interpolated = stripped.replace(
/\$ARGUMENTS\.([\w-]+)/g,
(_match, name) => {
const key = `arg_${name}`;
if (key in vars) {
return vars[key];
}
return _match; // leave unresolved tokens as-is
},
);

const meta = [
"You are analyzing a Claude Code plugin command prompt.",
"Describe step-by-step what you would do given this command and its configuration.",
"Be specific about how each configuration value affects your behavior.",
"Do NOT execute anything — just describe your plan.",
"",
"---",
"",
].join("\n");

return meta + interpolated;
}

module.exports = generatePrompt;
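In isolation, the `$ARGUMENTS.<name>` replacement above behaves as follows (the variable names and body text here are hypothetical, chosen to exercise the hyphenated-name and unresolved-token paths):

```javascript
// Standalone sketch of the interpolation step in prompt-loader.js.
const vars = { "arg_auto-fix": "true", arg_passes: "5" };
const body =
  "Run $ARGUMENTS.passes passes; auto-fix=$ARGUMENTS.auto-fix; keep $ARGUMENTS.keep.";

const out = body.replace(/\$ARGUMENTS\.([\w-]+)/g, (match, name) => {
  const key = `arg_${name}`;
  // Unresolved tokens are left as-is, matching the loader's behavior.
  return key in vars ? vars[key] : match;
});

console.log(out);
// → "Run 5 passes; auto-fix=true; keep $ARGUMENTS.keep."
```

Leaving unresolved tokens untouched (rather than replacing them with an empty string) makes a missing `arg_<name>` variable visible in the eval output instead of silently disappearing.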
141 changes: 141 additions & 0 deletions evals/promptfooconfig.yaml
@@ -0,0 +1,141 @@
description: "Behavioral evals for lookagain prompt interpolation"

prompts:
- file://prompt-loader.js

providers:
- id: anthropic:messages:claude-sonnet-4-20250514
config:
max_tokens: 2048

tests:
# ==================================================================
# again.md — auto-fix interpretation
# ==================================================================
- description: "auto-fix=true → model plans to apply fixes"
vars:
prompt_file: src/commands/again.md
arg_passes: "3"
arg_target: staged
arg_auto-fix: "true"
arg_model: thorough
arg_max-passes: "7"
assert:
- type: llm-rubric
value: >
The response must clearly state that it will automatically fix
or apply fixes for must_fix issues between review passes.
It should NOT say it will skip fixing or leave fixes to the user.

- description: "auto-fix=false → model skips fixes"
vars:
prompt_file: src/commands/again.md
arg_passes: "3"
arg_target: staged
arg_auto-fix: "false"
arg_model: thorough
arg_max-passes: "7"
assert:
- type: llm-rubric
value: >
The response must clearly state that auto-fix is disabled or false,
and that it will NOT automatically apply fixes between passes.
It should not describe applying any code fixes.

# ==================================================================
# again.md — passes count
# ==================================================================
- description: "passes=5 → model plans exactly 5 initial passes"
vars:
prompt_file: src/commands/again.md
arg_passes: "5"
arg_target: staged
arg_auto-fix: "true"
arg_model: thorough
arg_max-passes: "7"
assert:
- type: icontains
value: "5"
- type: llm-rubric
value: >
The response must indicate it will run 5 review passes
(not 3, which is the default). It should plan for exactly
5 sequential passes before considering additional passes.

# ==================================================================
# again.md — model resolution
# ==================================================================
- description: "model=fast → reviewer uses haiku"
vars:
prompt_file: src/commands/again.md
arg_passes: "3"
arg_target: staged
arg_auto-fix: "true"
arg_model: fast
arg_max-passes: "7"
assert:
- type: icontains
value: haiku
- type: llm-rubric
value: >
The response must indicate that the reviewer subagent model
will be set to haiku, since the model argument is "fast".

- description: "model=thorough → no explicit model override"
vars:
prompt_file: src/commands/again.md
arg_passes: "3"
arg_target: staged
arg_auto-fix: "true"
arg_model: thorough
arg_max-passes: "7"
assert:
- type: llm-rubric
value: >
The response must indicate that for model=thorough, the model
parameter is omitted from the Task tool call (it inherits the
current model). It should NOT set the model to haiku or sonnet.

# ==================================================================
# again.md — scope resolution
# ==================================================================
- description: "target=branch → branch-based diff scope"
vars:
prompt_file: src/commands/again.md
arg_passes: "3"
arg_target: branch
arg_auto-fix: "true"
arg_model: thorough
arg_max-passes: "7"
assert:
- type: llm-rubric
value: >
The response must indicate that the scope is branch-based,
reviewing all changes on the current branch versus the base
branch. It should reference branch comparison or merge-base.

# ==================================================================
# tidy.md — all flag
# ==================================================================
- description: "all=true → removes all runs"
vars:
prompt_file: src/commands/tidy.md
arg_keep: "1"
arg_all: "true"
assert:
- type: llm-rubric
value: >
The response must state that ALL run directories will be removed,
regardless of date. It should not apply any date-based filtering.

- description: "all=false, keep=3 → date-based retention"
vars:
prompt_file: src/commands/tidy.md
arg_keep: "3"
arg_all: "false"
assert:
- type: llm-rubric
value: >
The response must describe calculating a cutoff date by subtracting
3 days from today, and only removing runs older than that cutoff.
It should keep runs from the last 3 days.
49 changes: 49 additions & 0 deletions scripts/test.sh
@@ -129,6 +129,51 @@ test_frontmatter() {
check_frontmatter "$PROJECT_ROOT/src/skills/lookagain-output-format/SKILL.md" name description
}

test_argument_interpolation() {
# Verify that arguments defined in frontmatter are referenced using
# $ARGUMENTS.<name> syntax in the instruction body, not just in the
# Configuration display section. This prevents the executing agent
# from missing argument values and falling back to safe defaults.

for file in "$PROJECT_ROOT"/src/commands/*.md; do
local relpath="${file#"$PROJECT_ROOT"/}"

# Extract argument names from frontmatter
local args
args=$(awk '
NR==1 && /^---$/ { in_fm=1; next }
in_fm && /^---$/ { exit }
in_fm && /^ - name: / { gsub(/^ - name: /, ""); print }
' "$file")

if [[ -z "$args" ]]; then
continue
fi

# Extract the body (everything after the second ---)
local body
body=$(awk '
NR==1 && /^---$/ { in_fm=1; next }
in_fm && /^---$/ { in_fm=0; next }
!in_fm { print }
' "$file")

# For each argument, verify $ARGUMENTS.<name> appears in the body
local all_found=1
while IFS= read -r arg; do
local ref="\$ARGUMENTS.${arg}"
if ! echo "$body" | grep -qF "$ref"; then
fail "$relpath: argument '$arg' defined but \$ARGUMENTS.$arg never used in body"
all_found=0
fi
done <<< "$args"

if [[ $all_found -eq 1 ]]; then
pass "$relpath: all arguments interpolated in body"
fi
done
}

test_cross_references() {
local pjson="$PROJECT_ROOT/src/dot-claude-plugin/plugin.json"

@@ -324,6 +369,10 @@ echo "--- frontmatter ---"
test_frontmatter
echo ""

echo "--- argument interpolation ---"
test_argument_interpolation
echo ""

echo "--- cross-references ---"
test_cross_references
echo ""