Overview
Deep analysis of Anvil — a context engineering and project execution framework for Claude Code (and other runtimes). Anvil focuses on preventing context rot through structured decomposition and atomic execution across multi-phase projects.
Key philosophical difference: DevFlow = "Make every Claude interaction high quality" (skills, review, ambient). Anvil = "Make Claude reliably ship multi-phase projects" (milestones, phases, plans, waves). They're complementary, not competing.
Priority Findings (Investigate First)
P1: Context Monitor Hook with Graduated Thresholds
Anvil implements a PostToolUse hook that monitors context window consumption and injects additionalContext warnings at graduated thresholds:
- WARNING (≤35% remaining / ~65% used): Agent should wrap up current task
- CRITICAL (≤25% remaining / ~75% used): Agent should save state immediately
- Debouncing: 5 tool uses between warnings to prevent spam; severity escalation (WARNING→CRITICAL) bypasses debounce
- Bridge file pattern: Statusline hook writes metrics to /tmp/claude-ctx-{session_id}.json; the context monitor reads it — decoupled inter-hook communication without modifying settings.json
- GSD-aware: Detects project state and tailors recommendations ("save state using pause command" vs generic advice)
Why this matters: Our agents (Coder, Reviewer, etc.) have zero awareness of context consumption. They keep working until compaction triggers. Anvil's agents know when to wrap up. Our PreCompact hook saves snapshots reactively — this would add proactive awareness before compaction fires.
What we'd build: A PostToolUse hook that monitors context usage and injects graduated warnings. Combined with our existing PreCompact snapshot pattern, this creates a defense-in-depth approach to context management.
P1: Model Profiles (quality/balanced/budget)
Anvil uses a single model_profile config setting to control which model each agent uses:
| Agent Role | quality | balanced | budget |
|---|---|---|---|
| Planner/Architect | opus | opus | sonnet |
| Executor/Coder | opus | sonnet | sonnet |
| Researcher | opus | sonnet | haiku |
| Verifier | sonnet | sonnet | haiku |
| Mapper/Explorer | sonnet | haiku | haiku |
Key details:
- Per-agent overrides: "model_overrides": { "executor": "opus" } for fine-grained control
- Opus → "inherit": Opus agents resolve to the session model, respecting org policies that block specific model versions
- Profile switching: Single config change affects all agents simultaneously
- Global user defaults: ~/.gsd/defaults.json persists preferences across projects
Why this matters: Our /implement spawns ~8 agents with no cost control. Users can't choose between maximum quality vs fast-and-cheap. A model profile system would let users make this tradeoff explicitly.
What we'd build: A --profile quality|balanced|budget flag on orchestration commands (/implement, /code-review, /debug), with a resolution table mapping agent roles to models per profile. Could also be set in project config for persistent preference.
P1: Goal-Backward Verification for Shepherd
Anvil's verification philosophy: Don't check "did you complete all tasks?" — check "does the codebase actually deliver what was promised?"
Their verifier agent:
- Reads the phase goal and success criteria
- Inspects the actual codebase (not the summary/report)
- Checks observable truths: things that must be TRUE for the goal to be met
- Verifies artifacts at 3 levels: exists → substantive → wired (connected to the system)
- Detects stubs: components returning <div>Component</div>, APIs returning "Not implemented", empty handlers
- Re-verification mode: when previous gaps exist, focuses only on failed items
Verification hierarchy:
- Pre-execution (plan-checker): Will these plans achieve the goal?
- During execution (deviation rules): Is the executor staying on track?
- Post-execution (verifier): Did the result actually deliver the goal?
Why this matters: Our Shepherd checks intent alignment ("does implementation match what was asked"), which is close but less structured. Anvil's approach is more systematic — it has explicit observable truths, artifact depth checks (exists vs substantive vs wired), and re-verification mode.
What we'd enhance: Strengthen Shepherd to do explicit goal-backward verification: define observable truths from the original request, verify artifacts are substantive (not stubs), and check wiring (components connected, APIs consumed, state rendered). Add stub detection patterns to Scrutinizer.
P2: Decision Preservation (Locked/Flexible/Deferred)
Anvil has a CONTEXT.md pattern created during a "discuss phase" step:
## Implementation Decisions
### Authentication
- **LOCKED**: Use JWT with httpOnly cookies (user decision)
- **Claude's discretion**: Token expiry duration, refresh strategy
- **Deferred**: OAuth providers (out of scope for this phase)

Key aspects:
- Locked decisions are NON-NEGOTIABLE — all downstream agents (planner, researcher, executor) MUST honor them
- Claude's discretion areas give the agent explicit freedom
- Deferred ideas are scope guardrails — explicitly captured but excluded
- Created once during discussion, consumed everywhere downstream
- Plan-checker verifies plans comply with locked decisions
Why this matters: Our /specify captures requirements but doesn't separate "user already decided this" from "Claude can choose." When Coder implements, it might re-debate something the user already settled. Our Shepherd validates intent alignment, but doesn't enforce decision preservation at the constraint level.
What we'd build: During /specify or /implement planning phase, capture user decisions with locked/flexible/deferred classification. Propagate locked decisions to Coder (must honor) and Shepherd (must verify compliance). This prevents re-debating and gives Claude clear autonomy boundaries.
Secondary Findings (Documented for Future Reference)
Quick Tasks with Composable Flags
Anvil has a /quick command for lightweight single-task execution with composable flags:
- Default: Just implement the task with atomic commits
- --discuss: Pre-planning discussion to surface gray areas, captures decisions
- --full: Plan checking (max 2 iterations) + post-execution verification
- --discuss --full: Full workflow with all guardrails
Relevance: We have a gap between ambient/QUICK (zero overhead) and full /implement (full ceremony). A /quick command would fill this middle ground — single task, optional discussion, optional verification.
Deviation Rules for Autonomous Decision-Making
Anvil codifies explicit rules for when executors should auto-fix vs ask:
- Rule 1 (Auto-fix bugs): Code doesn't work as intended → fix immediately
- Rule 2 (Auto-add critical): Missing error handling, validation, auth → add it
- Rule 3 (Auto-fix blockers): Something prevents completing task → fix it
- Rule 4 (Ask about architecture): Significant structural changes → STOP and ask
Relevance: Our Coder agent has implicit behavior about when to ask vs proceed. Codifying this makes behavior predictable and documented.
Health Check Command
Anvil runs 8 diagnostic checks on project integrity:
- E001-E005: Critical errors (missing dirs, invalid config JSON)
- W001-W009: Warnings (orphaned files, validation consistency)
- Auto-repair path for fixable issues (createConfig, resetConfig, regenerateState)
- Structured output with code/message/repairable flags
Relevance: A /health command that validates DevFlow installation, hook integrity, settings.json consistency, and plugin state would help troubleshooting.
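The structured code/message/repairable output described above could look like this. The two checks shown are illustrative stand-ins, not Anvil's actual E/W rules.

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Finding:
    code: str        # e.g. "E001" for errors, "W001" for warnings
    message: str
    repairable: bool  # whether an auto-repair path exists

def run_health_checks(root: Path) -> list[Finding]:
    """Run a couple of illustrative integrity checks on a project root."""
    findings: list[Finding] = []
    if not (root / ".planning").is_dir():
        findings.append(Finding("E001", "missing .planning directory", True))
    config = root / "config.json"
    if config.exists():
        try:
            json.loads(config.read_text())
        except json.JSONDecodeError:
            findings.append(Finding("E002", "invalid config JSON", True))
    return findings
```

A DevFlow `/health` would swap in checks for hook integrity, settings.json consistency, and plugin state, keeping the same structured output.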
Disk-Status Inference for Progress
Instead of explicit status tracking, Anvil infers phase status from files on disk:
- Has PLAN.md? → "planned"
- Has PLAN.md + SUMMARY.md? → "complete"
- Has RESEARCH.md only? → "researched"
- Self-healing: manually adding artifacts auto-updates status
Relevance: Our .docs/reviews/ directory could provide similar intelligence for review progress without explicit state management.
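The inference rules above reduce to a few existence checks. A minimal sketch, using the artifact names from this writeup (the "not-started" fallback label is an assumption):

```python
from pathlib import Path

def infer_phase_status(phase_dir: Path) -> str:
    """Infer phase status purely from which artifacts exist on disk."""
    has = lambda name: (phase_dir / name).exists()
    if has("PLAN.md") and has("SUMMARY.md"):
        return "complete"
    if has("PLAN.md"):
        return "planned"
    if has("RESEARCH.md"):
        return "researched"
    return "not-started"
```

Because status is derived rather than stored, manually dropping an artifact into the directory "heals" the status with no state file to update.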
Wave-Based Parallel Execution
Anvil groups plans by dependency and runs independent plans in parallel within waves:
- Wave 1: All independent plans (parallel)
- Wave 2: Plans depending on Wave 1 (wait, then parallel)
- Wave N: Sequential dependency chain
Relevance: Our /implement already parallelizes some agents, but for multi-task implementations, wave-based parallelism could improve throughput.
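Wave grouping is a layered topological sort: repeatedly take every plan whose dependencies are already satisfied, run that wave in parallel, then repeat. A sketch with hypothetical plan names:

```python
def group_into_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group plans into waves; each wave's members can run in parallel."""
    done: set[str] = set()
    waves: list[list[str]] = []
    remaining = dict(deps)
    while remaining:
        # Every plan whose dependencies are all complete joins this wave.
        wave = sorted(p for p, d in remaining.items() if d <= done)
        if not wave:
            raise ValueError("dependency cycle among remaining plans")
        waves.append(wave)
        done.update(wave)
        for p in wave:
            del remaining[p]
    return waves
```

Wave 1 holds all independent plans, Wave 2 the plans that depend only on Wave 1, and a strict dependency chain degenerates to one plan per wave.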
Verification Pattern Templates (Stub Detection)
Detailed patterns for detecting real vs stub implementations:
React stubs: return <div>Component</div>, return null, onClick={() => {}}
API stubs: return Response.json({ message: "Not implemented" })
Hook stubs: return { user: null, login: () => {}, logout: () => {} }
Wiring gaps: fetch without await, query without return, state exists but not rendered
Relevance: Could enhance Scrutinizer's detection capabilities with these patterns.
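The patterns above could seed a first-pass detector. Robust detection would use an AST, but regexes over the stub examples listed in this writeup show the idea; the patterns here are illustrative, not Anvil's.

```python
import re

# Regexes derived from the stub examples above.
STUB_PATTERNS = [
    re.compile(r"return\s+<div>\s*\w*\s*</div>"),   # placeholder JSX
    re.compile(r"return\s+null\s*;?\s*$", re.M),    # empty render
    re.compile(r"onClick=\{\(\)\s*=>\s*\{\}\}"),    # no-op handler
    re.compile(r'"Not implemented"'),                # stub API response
]

def looks_like_stub(source: str) -> bool:
    """Flag source text that matches any known stub pattern."""
    return any(p.search(source) for p in STUB_PATTERNS)
```

Wiring gaps (fetch without await, state never rendered) need data-flow awareness and would not fit this regex approach.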
Per-Task Atomic Commits
Anvil commits after every task (sub-feature level), not just per feature. This enables git bisect at a granular level and failure recovery (task 1 committed, task 2 failed → the next session knows exactly where to resume).
Relevance: Our Coder does atomic commits per feature. Going more granular might add overhead, but the failure recovery benefit is worth considering.
Session Handoff Files (.continue-here.md)
Explicit handoff files created on pause with YAML frontmatter + structured sections (current_state, completed_work, remaining_work, decisions_made, blockers, next_action). Deleted after resume.
Relevance: Our WORKING-MEMORY.md serves a similar purpose but is automatic (background write). Anvil's approach is more explicit and targeted. Both have merit.
Architecture Comparison Summary
| Dimension | DevFlow | Anvil |
|---|---|---|
| Core metaphor | Plugin marketplace + skills | Project lifecycle manager |
| Unit of work | Single task/PR | Phase → Plan → Task (3 levels) |
| State | .memory/WORKING-MEMORY.md (session) | .planning/STATE.md + ROADMAP.md + CONTEXT.md (persistent) |
| Quality | Skills loaded per-prompt (ambient) | Goal-backward verification at every stage |
| Execution | Agent spawns per command | Wave-based parallel execution of plans |
| Git | Atomic commits per feature | Atomic commits per task (sub-feature) |
| Context mgmt | Pre-compact hook + session start | Context monitor hook + bridge files + statusline |
| Model selection | Hardcoded per agent | Config-driven profiles with per-agent overrides |
| Session continuity | WORKING-MEMORY.md (background write) | .continue-here.md handoff files + STATE.md |
| Multi-runtime | Claude Code only | Claude Code + OpenCode + Gemini CLI + Codex |
What NOT to Borrow
- Full milestone/phase/roadmap hierarchy — Too opinionated for DevFlow's composable philosophy
- Multi-runtime support — Dilutes focus; Claude Code is our target
- XML task format — Our markdown-based approach with agent prompts is cleaner
- Single-system monolith — Our plugin architecture is more composable
- Interactive project setup wizard — Heavy for DevFlow's "enhance every prompt" approach
Priority Matrix
| Priority | Feature | Effort | Impact |
|---|---|---|---|
| P1 | Context monitor hook (graduated thresholds) | Medium | High |
| P1 | Model profiles (quality/balanced/budget) | Medium | High |
| P1 | Goal-backward verification in Shepherd | Low | Medium-High |
| P2 | Decision preservation (locked/flexible/deferred) | Low | Medium |
| P3 | /quick command with composable flags | Medium | Medium |
| P3 | Deviation rules for Coder | Low | Medium |
| P4 | Health check command | Low | Low |
| P4 | Stub detection patterns for Scrutinizer | Low | Low |
| P4 | Bridge files for inter-hook communication | Low | Low |