HartBrook · Jan 29, 2026
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -11,14 +11,13 @@
       "name": "look",
       "source": "./src",
       "description": "Sequential code review with fresh agent contexts. Runs multiple independent review passes to catch more issues.",
-      "version": "0.3.0",
+      "version": "0.4.0",
       "author": { "name": "HartBrook" },
       "repository": "https://github.com/HartBrook/lookagain",
       "license": "MIT",
       "keywords": ["code-review", "quality", "iterative", "multi-pass"],
-      "commands": ["./commands/again.md", "./commands/tidy.md"],
       "agents": ["./agents/lookagain-reviewer.md"],
-      "skills": ["./skills/lookagain-output-format"],
+      "skills": ["./skills/again", "./skills/tidy", "./skills/lookagain-output-format"],
       "strict": false
     }
   ]
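
Note (illustrative, not part of the diff): CONTRIBUTING.md in this change says the `version` here should stay in sync with `src/dot-claude-plugin/plugin.json`, and that `make test` validates that agents and skills match between the two files. The actual check is not shown in this diff. A minimal Python sketch of the idea follows, assuming (hypothetically) that both files expose `version`, `agents`, and `skills` keys with the same shape; the function name `manifest_mismatches` is invented for this example.

```python
from typing import Any, Dict, List


def manifest_mismatches(plugin: Dict[str, Any], entry: Dict[str, Any]) -> List[str]:
    """Return one human-readable message per field that differs.

    `plugin` stands in for the parsed plugin.json and `entry` for the
    plugin's entry in marketplace.json. The shared key names are an
    assumption made for this sketch.
    """
    problems: List[str] = []
    for field in ("version", "agents", "skills"):
        left, right = plugin.get(field), entry.get(field)
        # Compare lists without regard to order.
        if isinstance(left, list) and isinstance(right, list):
            left, right = sorted(left), sorted(right)
        if left != right:
            problems.append(
                f"{field}: plugin.json has {left!r}, marketplace.json has {right!r}"
            )
    return problems


# Values copied from the marketplace entry in this diff.
entry = {
    "version": "0.4.0",
    "agents": ["./agents/lookagain-reviewer.md"],
    "skills": ["./skills/again", "./skills/tidy", "./skills/lookagain-output-format"],
}
# Hypothetical stale plugin.json that was not bumped.
stale_plugin = dict(entry, version="0.3.0")
```

Calling `manifest_mismatches(stale_plugin, entry)` would report only the `version` field, which is the kind of drift the versioning note warns about.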

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,16 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.4.0] - 2026-01-29
+
+### Changed
+
+- Migrated commands from `commands/` directory to `skills/` directory format (`SKILL.md` files) to fix CLI command resolution issues on newer Claude Code versions. See [GSD plugin issue #218](https://github.com/glittercowboy/get-shit-done/issues/218) for background.
+- Added `disable-model-invocation: true` to `/look:again` and `/look:tidy` to prevent unintended auto-invocation.
+- Renamed frontmatter field `tools` to `allowed-tools` per skills format.
+- Updated test suite, build script, evals, and documentation to reflect new file layout.
+- No changes to user-facing command names: `/look:again` and `/look:tidy` work as before.
+
 ## [0.3.0] - 2026-01-28
 
 ### Changed

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -17,18 +17,18 @@ lookagain/
 ├── .claude-plugin/
 │   └── marketplace.json    # Marketplace manifest
 ├── src/
-│   ├── commands/           # Plugin commands
-│   │   ├── again.md        # Main orchestrator
-│   │   └── tidy.md         # Tidy old review runs
 │   ├── agents/             # Subagent definitions
 │   │   └── lookagain-reviewer.md
-│   ├── skills/             # Output format specs
-│   │   └── lookagain-output-format/
+│   ├── skills/             # Plugin skills (commands + reference)
+│   │   ├── again/          # Main orchestrator (/look:again)
+│   │   ├── tidy/           # Tidy old review runs (/look:tidy)
+│   │   └── lookagain-output-format/  # Output format spec
 │   ├── dot-claude-plugin/  # Plugin manifest (becomes .claude-plugin/)
 │   └── dot-claude/         # Claude settings (becomes .claude/)
 ├── scripts/
 │   ├── package.sh          # Build script
-│   └── test.sh             # Plugin validation tests
+│   ├── test.sh             # Plugin validation tests
+│   └── integration-test.sh # End-to-end integration test
 ├── evals/                  # Behavioral evals (promptfoo)
 │   ├── promptfooconfig.yaml
 │   └── prompt-loader.js
@@ -65,13 +65,27 @@ make test
 # Behavioral evals — verifies models interpret prompt arguments correctly
 # Requires ANTHROPIC_API_KEY
 make eval
+
+# Integration test — end-to-end run of look:again with auto-fix
+# Requires ANTHROPIC_API_KEY
+make integration
 ```
 
-`make test` runs fast, offline checks that validate plugin structure: file existence, JSON validity, frontmatter fields, cross-references between manifests, and that commands accepting arguments use the correct pattern (`argument-hint` in frontmatter, `$ARGUMENTS` placeholder, and a defaults table in the body).
+The three test layers catch different kinds of regressions:
+
+| Layer | Command | What it catches | Cost |
+|-------|---------|----------------|------|
+| **Structure** | `make test` | Broken manifests, missing files, frontmatter drift, cross-reference mismatches | Free, offline, fast |
+| **Evals** | `make eval` | Model misinterpreting arguments (e.g. ignoring `auto-fix=false`, wrong pass count) | API calls (~$0.05/run) |
+| **Integration** | `make integration` | End-to-end failures: review not finding bugs, auto-fix not applying, output artifacts missing, tidy not cleaning up | API calls (~$0.50/run) |
+
+**When to run each:**
 
-`make eval` runs [promptfoo](https://promptfoo.dev) evals that send the interpolated prompts to Claude and assert on behavioral correctness. For example, it verifies that `auto-fix=false` causes the model to skip fixes, and that `passes=5` results in 5 planned passes.
+- **`make test`** — always, before every commit. It's fast and offline. Catches most structural mistakes from editing manifests, renaming files, or changing frontmatter.
+- **`make eval`** — after changing prompt wording or argument handling in skill files. The evals send the interpolated prompts to Claude and assert on behavioral correctness (e.g. that `auto-fix=false` causes the model to skip fixes, that `passes=5` results in 5 planned passes). Cheap enough to run liberally.
+- **`make integration`** — after changes that affect the review pipeline end-to-end (orchestration logic, agent prompts, output format). Spins up a temp git repo with contrived bugs, runs `/look:again` with auto-fix, and verifies that bugs are found, fixed, and that output artifacts are correct. Also tests `/look:tidy` cleanup. More expensive, so run when the cheaper layers aren't sufficient.
 
-Evals require an Anthropic API key and cost a small amount per run. Set the key before running:
+Both `make eval` and `make integration` require an Anthropic API key. Set it before running:
 
 ```bash
 # Option 1: export for the current shell session
@@ -109,22 +123,24 @@ You can also test the plugin through the marketplace install flow, which is clos
 
 ### Key Files
 
-- **[src/commands/again.md](src/commands/again.md)**: Main orchestrator logic. Controls pass execution, auto-fixing, and aggregation.
+- **[src/skills/again/SKILL.md](src/skills/again/SKILL.md)**: Main orchestrator logic. Controls pass execution, auto-fixing, and aggregation.
+- **[src/skills/tidy/SKILL.md](src/skills/tidy/SKILL.md)**: Tidy skill for pruning old review runs.
 - **[src/agents/lookagain-reviewer.md](src/agents/lookagain-reviewer.md)**: Reviewer subagent. Defines how individual review passes work.
 - **[src/skills/lookagain-output-format/SKILL.md](src/skills/lookagain-output-format/SKILL.md)**: JSON output format specification.
 - **[src/dot-claude-plugin/plugin.json](src/dot-claude-plugin/plugin.json)**: Plugin metadata and version.
-- **[src/commands/tidy.md](src/commands/tidy.md)**: Tidy command for pruning old review runs.
 - **[.claude-plugin/marketplace.json](.claude-plugin/marketplace.json)**: Marketplace manifest for plugin discovery and installation.
 
-### Writing Command Prompts
+### Writing Skill Prompts
 
-When editing or adding command prompts in `src/commands/`:
+When editing or adding skills in `src/skills/`:
 
-- **Use `$ARGUMENTS` for the raw string.** Claude Code replaces `$ARGUMENTS` with whatever the user typed after the command name. There is no `$ARGUMENTS.name` dot-access syntax — only `$ARGUMENTS` (whole string) and `$ARGUMENTS[N]` (positional). Do NOT use an `arguments:` array in frontmatter — it is not a supported Claude Code feature and will not be interpolated.
+- Each skill is a directory containing a `SKILL.md` file (e.g., `src/skills/again/SKILL.md`).
+- **Use `$ARGUMENTS` for the raw string.** Claude Code replaces `$ARGUMENTS` with whatever the user typed after the skill name. There is no `$ARGUMENTS.name` dot-access syntax — only `$ARGUMENTS` (whole string) and `$ARGUMENTS[N]` (positional). Do NOT use an `arguments:` array in frontmatter — it is not a supported Claude Code feature and will not be interpolated.
 - **Add `argument-hint`** in frontmatter to document expected input format (e.g., `argument-hint: "[key=value ...]"`).
+- **Add `disable-model-invocation: true`** for user-triggered actions to prevent Claude from auto-invoking them.
 - **Include a defaults table** in the body listing each key, its default, and a description. Instruct the agent to parse `key=value` pairs from `$ARGUMENTS` and fall back to defaults for missing keys.
 - **Log the resolved configuration** so it is visible in output and reviewable in evals.
-- `make test` enforces that commands using `$ARGUMENTS` have `argument-hint` in frontmatter, a defaults table in the body, and do NOT use the unsupported `arguments:` frontmatter array.
+- `make test` enforces that skills using `$ARGUMENTS` have `argument-hint` in frontmatter, a defaults table in the body, and do NOT use the unsupported `arguments:` frontmatter array.
 - After changing prompt logic, run `make eval` to verify models still interpret the arguments correctly.
 
 ## Pull Requests
@@ -138,4 +154,4 @@ When editing or adding command prompts in `src/commands/`:
 
 ## Versioning
 
-Update the version in `src/dot-claude-plugin/plugin.json` when making releases. The marketplace entry in `.claude-plugin/marketplace.json` also has a `version` field — keep both in sync. The test suite (`make test`) validates that commands, agents, and skills match between the two files.
+Update the version in `src/dot-claude-plugin/plugin.json` when making releases. The marketplace entry in `.claude-plugin/marketplace.json` also has a `version` field — keep both in sync. The test suite (`make test`) validates that agents and skills match between the two files.
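
Note (illustrative, not part of the diff): the "Writing Skill Prompts" guidance asks the agent to parse `key=value` pairs from `$ARGUMENTS` and fall back to defaults for missing keys. In the plugin this is done by the model following the prompt, not by code. The Python sketch below only illustrates the intended semantics. The defaults come from the eval descriptions in this change (`auto-fix=true`, `passes=3`, `thorough`, and `keep=1`, `all=false` for tidy); the key name `model` is taken from the `model=fast` eval, the default for `target` is not stated in the diff and is left out, and the function name `resolve_config` is hypothetical.

```python
from typing import Dict

# Defaults as described in the evals in this diff.
AGAIN_DEFAULTS: Dict[str, str] = {"passes": "3", "auto-fix": "true", "model": "thorough"}
TIDY_DEFAULTS: Dict[str, str] = {"keep": "1", "all": "false"}


def resolve_config(arguments: str, defaults: Dict[str, str]) -> Dict[str, str]:
    """Parse whitespace-separated key=value tokens, falling back to defaults.

    Tokens without '=' or with an empty key are ignored. Values stay as
    strings; later occurrences of a key override earlier ones.
    """
    config = dict(defaults)
    for token in arguments.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            continue
        config[key] = value
    return config


# Resolved configuration for "/look:again auto-fix=false passes=5"
resolved = resolve_config("auto-fix=false passes=5", AGAIN_DEFAULTS)
```

Logging `resolved` at the start of a run corresponds to the "log the resolved configuration" bullet, which is what makes the behavior reviewable in evals.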
diff --git a/Makefile b/Makefile
@@ -18,6 +18,9 @@ dev: build ## Build and start Claude Code with plugin loaded
 eval: ## Run behavioral evals (requires ANTHROPIC_API_KEY)
 	@npx promptfoo@latest eval -c evals/promptfooconfig.yaml
 
+integration: build ## Run integration test (requires ANTHROPIC_API_KEY)
+	@./scripts/integration-test.sh
+
 clean: ## Remove build artifacts
 	@rm -rf dist/
 	@echo "Cleaned dist/"
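
Note (illustrative, not part of the diff): with `integration` added, the Makefile exposes three test targets, and CONTRIBUTING.md describes when each is worth running. The Python sketch below turns that guidance into a rough heuristic over changed file paths. The path-to-layer mapping is an assumption drawn from the prose (skill prompts trigger evals; orchestration, agent prompts, and the output format trigger the integration test), and `layers_to_run` is a hypothetical helper, not something in the repository.

```python
from typing import Iterable, List

# Paths treated as affecting the review pipeline end-to-end (assumed mapping).
PIPELINE_PREFIXES = (
    "src/agents/",
    "src/skills/again/",
    "src/skills/lookagain-output-format/",
)


def layers_to_run(changed_paths: Iterable[str]) -> List[str]:
    """Suggest which make targets to run for a set of changed files."""
    paths = list(changed_paths)
    targets = ["test"]  # structural checks are free and offline, so always run them
    if any(p.startswith("src/skills/") and p.endswith("SKILL.md") for p in paths):
        targets.append("eval")  # prompt wording or argument handling may have changed
    if any(p.startswith(PIPELINE_PREFIXES) for p in paths):
        targets.append("integration")  # pipeline-level change
    return targets
```

For example, a change touching only `src/skills/tidy/SKILL.md` would suggest `test` and `eval`, while a change to `src/skills/again/SKILL.md` would suggest all three.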
diff --git a/README.md b/README.md
@@ -140,15 +140,26 @@ Previous runs are preserved. Use `/look:tidy` to prune old results:
 | should_fix | No          | Performance issues, poor error handling         |
 | suggestion | No          | Refactoring, documentation, style               |
 
+## Updating
+
+To update to the latest version:
+
+```bash
+/plugin marketplace update hartbrook-plugins
+/plugin uninstall look@hartbrook-plugins
+/plugin install look@hartbrook-plugins
+```
+
 ## Development
 
 ```bash
-make test   # Structural validation (offline, fast)
-make eval   # Behavioral evals via promptfoo (requires ANTHROPIC_API_KEY)
-make dev    # Build and start Claude Code with the plugin loaded
+make test          # Structural validation (offline, fast)
+make eval          # Behavioral evals via promptfoo (requires ANTHROPIC_API_KEY)
+make integration   # End-to-end integration test (requires ANTHROPIC_API_KEY)
+make dev           # Build and start Claude Code with the plugin loaded
 ```
 
-See [CONTRIBUTING.md](CONTRIBUTING.md) for full development setup and guidelines.
+Run `make test` before every commit (free, offline). Run `make eval` after changing prompt wording or argument handling. Run `make integration` after changes to the review pipeline or output format. See [CONTRIBUTING.md](CONTRIBUTING.md) for details on what each layer catches.
 
 ## License
 

diff --git a/evals/promptfooconfig.yaml b/evals/promptfooconfig.yaml
@@ -10,11 +10,11 @@ providers:
 
 tests:
   # ==================================================================
-  # again.md — no arguments (defaults)
+  # again — no arguments (defaults)
   # ==================================================================
   - description: "no arguments → model uses all defaults (auto-fix=true, passes=3, thorough)"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
     assert:
       - type: llm-rubric
         value: >
@@ -26,11 +26,11 @@ tests:
           simply means the user wants all defaults.
 
   # ==================================================================
-  # again.md — auto-fix interpretation
+  # again — auto-fix interpretation
   # ==================================================================
   - description: "auto-fix=true → model plans to apply fixes"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
       arg_passes: "3"
       arg_target: staged
       arg_auto-fix: "true"
@@ -45,7 +45,7 @@ tests:
 
   - description: "auto-fix=false → model skips fixes"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
       arg_passes: "3"
       arg_target: staged
       arg_auto-fix: "false"
@@ -59,11 +59,11 @@ tests:
           It should not describe applying any code fixes.
 
   # ==================================================================
-  # again.md — passes count
+  # again — passes count
   # ==================================================================
   - description: "passes=5 → model plans exactly 5 initial passes"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
       arg_passes: "5"
       arg_target: staged
       arg_auto-fix: "true"
@@ -79,11 +79,11 @@ tests:
           5 sequential passes before considering additional passes.
 
   # ==================================================================
-  # again.md — model resolution
+  # again — model resolution
   # ==================================================================
   - description: "model=fast → reviewer uses haiku"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
       arg_passes: "3"
       arg_target: staged
       arg_auto-fix: "true"
@@ -99,7 +99,7 @@ tests:
 
   - description: "model=thorough → no explicit model override"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
       arg_passes: "3"
       arg_target: staged
       arg_auto-fix: "true"
@@ -113,11 +113,11 @@ tests:
           current model). It should NOT set the model to haiku or sonnet.
 
   # ==================================================================
-  # again.md — scope resolution
+  # again — scope resolution
   # ==================================================================
   - description: "target=branch → branch-based diff scope"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
       arg_passes: "3"
       arg_target: branch
       arg_auto-fix: "true"
@@ -131,11 +131,11 @@ tests:
           branch. It should reference branch comparison or merge-base.
 
   # ==================================================================
-  # again.md — partial arguments (only auto-fix specified)
+  # again — partial arguments (only auto-fix specified)
   # ==================================================================
   - description: "only auto-fix=false → defaults for everything else, no fixing"
     vars:
-      prompt_file: src/commands/again.md
+      prompt_file: src/skills/again/SKILL.md
       arg_auto-fix: "false"
     assert:
       - type: llm-rubric
@@ -146,11 +146,11 @@ tests:
           auto-fix is explicitly false.
 
   # ==================================================================
-  # tidy.md — all flag
+  # tidy — all flag
   # ==================================================================
   - description: "all=true → removes all runs"
     vars:
-      prompt_file: src/commands/tidy.md
+      prompt_file: src/skills/tidy/SKILL.md
       arg_keep: "1"
       arg_all: "true"
     assert:
@@ -161,7 +161,7 @@ tests:
 
   - description: "all=false, keep=3 → date-based retention"
     vars:
-      prompt_file: src/commands/tidy.md
+      prompt_file: src/skills/tidy/SKILL.md
       arg_keep: "3"
       arg_all: "false"
     assert:
@@ -172,11 +172,11 @@ tests:
           It should keep runs from the last 3 days.
 
   # ==================================================================
-  # tidy.md — no arguments (defaults)
+  # tidy — no arguments (defaults)
   # ==================================================================
   - description: "tidy with no arguments → keep=1, all=false"
     vars:
-      prompt_file: src/commands/tidy.md
+      prompt_file: src/skills/tidy/SKILL.md
     assert:
       - type: llm-rubric
         value: >