Overview
Deep analysis of Anvil — a context engineering and project execution framework for Claude Code (and other runtimes). Anvil focuses on preventing context rot through structured decomposition and atomic execution across multi-phase projects.
Key philosophical difference: DevFlow = "Make every Claude interaction high quality" (skills, review, ambient). Anvil = "Make Claude reliably ship multi-phase projects" (milestones, phases, plans, waves). They're complementary, not competing.
Priority Findings (Investigate First)
P1: Context Monitor Hook with Graduated Thresholds
Anvil implements a PostToolUse hook that monitors context window consumption and injects additionalContext warnings at graduated thresholds:
- WARNING (≤35% remaining / ~65% used): Agent should wrap up current task
- CRITICAL (≤25% remaining / ~75% used): Agent should save state immediately
- Debouncing: 5 tool uses between warnings to prevent spam; severity escalation (WARNING→CRITICAL) bypasses debounce
- Bridge file pattern: Statusline hook writes metrics to /tmp/claude-ctx-{session_id}.json; the context monitor reads it — decoupled inter-hook communication without modifying settings.json
- GSD-aware: Detects project state and tailors recommendations ("save state using pause command" vs generic advice)
Why this matters: Our agents (Coder, Reviewer, etc.) have zero awareness of context consumption. They keep working until compaction triggers. Anvil's agents know when to wrap up. Our PreCompact hook saves snapshots reactively — this would add proactive awareness before compaction fires.
What we'd build: A PostToolUse hook that monitors context usage and injects graduated warnings. Combined with our existing PreCompact snapshot pattern, this creates a defense-in-depth approach to context management.
P1: Model Profiles (quality/balanced/budget)
Anvil uses a single model_profile config setting to control which model each agent uses:
| Agent Role | quality | balanced | budget |
|---|---|---|---|
| Planner/Architect | opus | opus | sonnet |
| Executor/Coder | opus | sonnet | sonnet |
| Researcher | opus | sonnet | haiku |
| Verifier | sonnet | sonnet | haiku |
| Mapper/Explorer | sonnet | haiku | haiku |
Key details:
- Per-agent overrides: "model_overrides": { "executor": "opus" } for fine-grained control
- Opus → "inherit": Opus agents resolve to the session model, respecting org policies that block specific model versions
- Profile switching: Single config change affects all agents simultaneously
- Global user defaults: ~/.gsd/defaults.json persists preferences across projects
Why this matters: Our /implement spawns ~8 agents with no cost control. Users can't choose between maximum quality vs fast-and-cheap. A model profile system would let users make this tradeoff explicitly.
What we'd build: A --profile quality|balanced|budget flag on orchestration commands (/implement, /code-review, /debug), with a resolution table mapping agent roles to models per profile. Could also be set in project config for persistent preference.
P1: Goal-Backward Verification for Shepherd
Anvil's verification philosophy: Don't check "did you complete all tasks?" — check "does the codebase actually deliver what was promised?"
Their verifier agent:
- Reads the phase goal and success criteria
- Inspects the actual codebase (not the summary/report)
- Checks observable truths: things that must be TRUE for the goal to be met
- Verifies artifacts at 3 levels: exists → substantive → wired (connected to the system)
- Detects stubs: components returning <div>Component</div>, APIs returning "Not implemented", empty handlers
- Re-verification mode: when previous gaps exist, focuses only on failed items
Verification hierarchy:
- Pre-execution (plan-checker): Will these plans achieve the goal?
- During execution (deviation rules): Is the executor staying on track?
- Post-execution (verifier): Did the result actually deliver the goal?
Why this matters: Our Shepherd checks intent alignment ("does implementation match what was asked"), which is close but less structured. Anvil's approach is more systematic — it has explicit observable truths, artifact depth checks (exists vs substantive vs wired), and re-verification mode.
What we'd enhance: Strengthen Shepherd to do explicit goal-backward verification: define observable truths from the original request, verify artifacts are substantive (not stubs), and check wiring (components connected, APIs consumed, state rendered). Add stub detection patterns to Scrutinizer.
P2: Decision Preservation (Locked/Flexible/Deferred)
Anvil has a CONTEXT.md pattern created during a "discuss phase" step:
## Implementation Decisions
### Authentication
- **LOCKED**: Use JWT with httpOnly cookies (user decision)
- **Claude's discretion**: Token expiry duration, refresh strategy
- **Deferred**: OAuth providers (out of scope for this phase)

Key aspects:
- Locked decisions are NON-NEGOTIABLE — all downstream agents (planner, researcher, executor) MUST honor them
- Claude's discretion areas give the agent explicit freedom
- Deferred ideas are scope guardrails — explicitly captured but excluded
- Created once during discussion, consumed everywhere downstream
- Plan-checker verifies plans comply with locked decisions
Why this matters: Our /specify captures requirements but doesn't separate "user already decided this" from "Claude can choose." When Coder implements, it might re-debate something the user already settled. Our Shepherd validates intent alignment, but doesn't enforce decision preservation at the constraint level.
What we'd build: During /specify or /implement planning phase, capture user decisions with locked/flexible/deferred classification. Propagate locked decisions to Coder (must honor) and Shepherd (must verify compliance). This prevents re-debating and gives Claude clear autonomy boundaries.
Secondary Findings (Documented for Future Reference)
Quick Tasks with Composable Flags
Anvil has a /quick command for lightweight single-task execution with composable flags:
- Default: Just implement the task with atomic commits
- --discuss: Pre-planning discussion to surface gray areas, captures decisions
- --full: Plan checking (max 2 iterations) + post-execution verification
- --discuss --full: Full workflow with all guardrails
Relevance: We have a gap between ambient/QUICK (zero overhead) and full /implement (full ceremony). A /quick command would fill this middle ground — single task, optional discussion, optional verification.
Deviation Rules for Autonomous Decision-Making
Anvil codifies explicit rules for when executors should auto-fix vs ask:
- Rule 1 (Auto-fix bugs): Code doesn't work as intended → fix immediately
- Rule 2 (Auto-add critical): Missing error handling, validation, auth → add it
- Rule 3 (Auto-fix blockers): Something prevents completing task → fix it
- Rule 4 (Ask about architecture): Significant structural changes → STOP and ask
Relevance: Our Coder agent has implicit behavior about when to ask vs proceed. Codifying this makes behavior predictable and documented.
Health Check Command
Anvil runs 8 diagnostic checks on project integrity:
- E001-E005: Critical errors (missing dirs, invalid config JSON)
- W001-W009: Warnings (orphaned files, validation consistency)
- Auto-repair path for fixable issues (createConfig, resetConfig, regenerateState)
- Structured output with code/message/repairable flags
Relevance: A /health command that validates DevFlow installation, hook integrity, settings.json consistency, and plugin state would help troubleshooting.
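The structured code/message/repairable output described above could look like this. The two checks shown are illustrative stand-ins, not Anvil's actual E/W rules.

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Finding:
    code: str        # e.g. "E001" for errors, "W001" for warnings
    message: str
    repairable: bool  # whether an auto-repair path exists

def run_health_checks(root: Path) -> list[Finding]:
    """Run a couple of illustrative integrity checks on a project root."""
    findings: list[Finding] = []
    if not (root / ".planning").is_dir():
        findings.append(Finding("E001", "missing .planning directory", True))
    config = root / "config.json"
    if config.exists():
        try:
            json.loads(config.read_text())
        except json.JSONDecodeError:
            findings.append(Finding("E002", "invalid config JSON", True))
    return findings
```

A DevFlow `/health` would swap in checks for hook integrity, settings.json consistency, and plugin state, keeping the same structured output.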
Disk-Status Inference for Progress
Instead of explicit status tracking, Anvil infers phase status from files on disk:
- Has PLAN.md? → "planned"
- Has PLAN.md + SUMMARY.md? → "complete"
- Has RESEARCH.md only? → "researched"
- Self-healing: manually adding artifacts auto-updates status
Relevance: Our .docs/reviews/ directory could provide similar intelligence for review progress without explicit state management.
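The inference rules above reduce to a few existence checks. A minimal sketch, using the artifact names from this writeup (the "not-started" fallback label is an assumption):

```python
from pathlib import Path

def infer_phase_status(phase_dir: Path) -> str:
    """Infer phase status purely from which artifacts exist on disk."""
    has = lambda name: (phase_dir / name).exists()
    if has("PLAN.md") and has("SUMMARY.md"):
        return "complete"
    if has("PLAN.md"):
        return "planned"
    if has("RESEARCH.md"):
        return "researched"
    return "not-started"
```

Because status is derived rather than stored, manually dropping an artifact into the directory "heals" the status with no state file to update.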
Wave-Based Parallel Execution
Anvil groups plans by dependency and runs independent plans in parallel within waves:
- Wave 1: All independent plans (parallel)
- Wave 2: Plans depending on Wave 1 (wait, then parallel)
- Wave N: Sequential dependency chain
Relevance: Our /implement already parallelizes some agents, but for multi-task implementations, wave-based parallelism could improve throughput.
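Wave grouping is a layered topological sort: repeatedly take every plan whose dependencies are already satisfied, run that wave in parallel, then repeat. A sketch with hypothetical plan names:

```python
def group_into_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group plans into waves; each wave's members can run in parallel."""
    done: set[str] = set()
    waves: list[list[str]] = []
    remaining = dict(deps)
    while remaining:
        # Every plan whose dependencies are all complete joins this wave.
        wave = sorted(p for p, d in remaining.items() if d <= done)
        if not wave:
            raise ValueError("dependency cycle among remaining plans")
        waves.append(wave)
        done.update(wave)
        for p in wave:
            del remaining[p]
    return waves
```

Wave 1 holds all independent plans, Wave 2 the plans that depend only on Wave 1, and a strict dependency chain degenerates to one plan per wave.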
Verification Pattern Templates (Stub Detection)
Detailed patterns for detecting real vs stub implementations:
React stubs: return <div>Component</div>, return null, onClick={() => {}}
API stubs: return Response.json({ message: "Not implemented" })
Hook stubs: return { user: null, login: () => {}, logout: () => {} }
Wiring gaps: fetch without await, query without return, state exists but not rendered
Relevance: Could enhance Scrutinizer's detection capabilities with these patterns.
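The patterns above could seed a first-pass detector. Robust detection would use an AST, but regexes over the stub examples listed in this writeup show the idea; the patterns here are illustrative, not Anvil's.

```python
import re

# Regexes derived from the stub examples above.
STUB_PATTERNS = [
    re.compile(r"return\s+<div>\s*\w*\s*</div>"),   # placeholder JSX
    re.compile(r"return\s+null\s*;?\s*$", re.M),    # empty render
    re.compile(r"onClick=\{\(\)\s*=>\s*\{\}\}"),    # no-op handler
    re.compile(r'"Not implemented"'),                # stub API response
]

def looks_like_stub(source: str) -> bool:
    """Flag source text that matches any known stub pattern."""
    return any(p.search(source) for p in STUB_PATTERNS)
```

Wiring gaps (fetch without await, state never rendered) need data-flow awareness and would not fit this regex approach.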
Per-Task Atomic Commits
Anvil commits after every task (sub-feature level), not just per feature. This enables git bisect at a granular level and failure recovery (task 1 committed, task 2 failed → the next session knows exactly where to resume).
Relevance: Our Coder does atomic commits per feature. Going more granular might add overhead, but the failure recovery benefit is worth considering.
Session Handoff Files (.continue-here.md)
Explicit handoff files created on pause with YAML frontmatter + structured sections (current_state, completed_work, remaining_work, decisions_made, blockers, next_action). Deleted after resume.
Relevance: Our WORKING-MEMORY.md serves a similar purpose but is automatic (background write). Anvil's approach is more explicit and targeted. Both have merit.
Architecture Comparison Summary
| Dimension | DevFlow | Anvil |
|---|---|---|
| Core metaphor | Plugin marketplace + skills | Project lifecycle manager |
| Unit of work | Single task/PR | Phase → Plan → Task (3 levels) |
| State | .memory/WORKING-MEMORY.md (session) | .planning/STATE.md + ROADMAP.md + CONTEXT.md (persistent) |
| Quality | Skills loaded per-prompt (ambient) | Goal-backward verification at every stage |
| Execution | Agent spawns per command | Wave-based parallel execution of plans |
| Git | Atomic commits per feature | Atomic commits per task (sub-feature) |
| Context mgmt | Pre-compact hook + session start | Context monitor hook + bridge files + statusline |
| Model selection | Hardcoded per agent | Config-driven profiles with per-agent overrides |
| Session continuity | WORKING-MEMORY.md (background write) | .continue-here.md handoff files + STATE.md |
| Multi-runtime | Claude Code only | Claude Code + OpenCode + Gemini CLI + Codex |
What NOT to Borrow
- Full milestone/phase/roadmap hierarchy — Too opinionated for DevFlow's composable philosophy
- Multi-runtime support — Dilutes focus; Claude Code is our target
- XML task format — Our markdown-based approach with agent prompts is cleaner
- Single-system monolith — Our plugin architecture is more composable
- Interactive project setup wizard — Heavy for DevFlow's "enhance every prompt" approach
Priority Matrix
| Priority | Feature | Effort | Impact |
|---|---|---|---|
| P1 | Context monitor hook (graduated thresholds) | Medium | High |
| P1 | Model profiles (quality/balanced/budget) | Medium | High |
| P1 | Goal-backward verification in Shepherd | Low | Medium-High |
| P2 | Decision preservation (locked/flexible/deferred) | Low | Medium |
| P3 | /quick command with composable flags | Medium | Medium |
| P3 | Deviation rules for Coder | Low | Medium |
| P4 | Health check command | Low | Low |
| P4 | Stub detection patterns for Scrutinizer | Low | Low |
| P4 | Bridge files for inter-hook communication | Low | Low |