
feat: Anvil competitive analysis — borrowable patterns for DevFlow #114

@dean0x


Overview

Deep analysis of Anvil — a context engineering and project execution framework for Claude Code (and other runtimes). Anvil focuses on preventing context rot through structured decomposition and atomic execution across multi-phase projects.

Key philosophical difference: DevFlow = "Make every Claude interaction high quality" (skills, review, ambient). Anvil = "Make Claude reliably ship multi-phase projects" (milestones, phases, plans, waves). They're complementary, not competing.


Priority Findings (Investigate First)

P1: Context Monitor Hook with Graduated Thresholds

Anvil implements a PostToolUse hook that monitors context window consumption and injects additionalContext warnings at graduated thresholds:

  • WARNING (≤35% remaining / ~65% used): Agent should wrap up current task
  • CRITICAL (≤25% remaining / ~75% used): Agent should save state immediately
  • Debouncing: 5 tool uses between warnings to prevent spam; severity escalation (WARNING→CRITICAL) bypasses debounce
  • Bridge file pattern: Statusline hook writes metrics to /tmp/claude-ctx-{session_id}.json, context monitor reads it — decoupled inter-hook communication without modifying settings.json
  • GSD-aware: Detects project state and tailors recommendations ("save state using pause command" vs generic advice)

Why this matters: Our agents (Coder, Reviewer, etc.) have zero awareness of context consumption. They keep working until compaction triggers. Anvil's agents know when to wrap up. Our PreCompact hook saves snapshots reactively — this would add proactive awareness before compaction fires.

What we'd build: A PostToolUse hook that monitors context usage and injects graduated warnings. Combined with our existing PreCompact snapshot pattern, this creates a defense-in-depth approach to context management.
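The graduated-threshold logic could be sketched as follows. This is a minimal illustration, not Anvil's actual hook: the threshold constants mirror the numbers above, but the function names and the debounce bookkeeping are assumptions about how such a hook might be structured.

```python
# Hypothetical graduated-threshold logic for a PostToolUse context monitor.
# Thresholds and the 5-tool-use debounce follow the description above.

WARNING_REMAINING = 0.35   # <= 35% context remaining -> WARNING
CRITICAL_REMAINING = 0.25  # <= 25% context remaining -> CRITICAL
DEBOUNCE_TOOL_USES = 5     # minimum tool uses between repeated warnings


def classify(remaining: float):
    """Map fraction of context remaining to a severity, or None."""
    if remaining <= CRITICAL_REMAINING:
        return "CRITICAL"
    if remaining <= WARNING_REMAINING:
        return "WARNING"
    return None


def should_warn(remaining: float, tool_uses_since_last: int, last_severity):
    """Return a severity to inject as additionalContext, or None to stay quiet."""
    severity = classify(remaining)
    if severity is None:
        return None
    # Severity escalation (WARNING -> CRITICAL) bypasses the debounce window.
    escalated = severity == "CRITICAL" and last_severity == "WARNING"
    if tool_uses_since_last < DEBOUNCE_TOOL_USES and not escalated:
        return None
    return severity
```

The key design point is the escalation bypass: a jump from WARNING to CRITICAL should never be suppressed by debouncing, since that is exactly the moment the agent must save state.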


P1: Model Profiles (quality/balanced/budget)

Anvil uses a single model_profile config setting to control which model each agent uses:

| Agent Role | quality | balanced | budget |
|---|---|---|---|
| Planner/Architect | opus | opus | sonnet |
| Executor/Coder | opus | sonnet | sonnet |
| Researcher | opus | sonnet | haiku |
| Verifier | sonnet | sonnet | haiku |
| Mapper/Explorer | sonnet | haiku | haiku |

Key details:

  • Per-agent overrides: "model_overrides": { "executor": "opus" } for fine-grained control
  • Opus → "inherit": Opus agents resolve to session model, respecting org policies that block specific model versions
  • Profile switching: Single config change affects all agents simultaneously
  • Global user defaults: ~/.gsd/defaults.json persists preferences across projects

Why this matters: Our /implement spawns ~8 agents with no cost control. Users can't choose between maximum quality vs fast-and-cheap. A model profile system would let users make this tradeoff explicitly.

What we'd build: A --profile quality|balanced|budget flag on orchestration commands (/implement, /code-review, /debug), with a resolution table mapping agent roles to models per profile. Could also be set in project config for persistent preference.
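A resolution table like the one above could be implemented as a small lookup with override precedence. This is an illustrative sketch only: the role keys (`planner`, `coder`, etc.) and the `resolve_model` API are invented here, not DevFlow or Anvil identifiers.

```python
# Hypothetical profile -> model resolution table, mirroring the table above.
# Role names are simplified assumptions (e.g. "coder" for Executor/Coder).
PROFILES = {
    "quality":  {"planner": "opus", "coder": "opus", "researcher": "opus",
                 "verifier": "sonnet", "explorer": "sonnet"},
    "balanced": {"planner": "opus", "coder": "sonnet", "researcher": "sonnet",
                 "verifier": "sonnet", "explorer": "haiku"},
    "budget":   {"planner": "sonnet", "coder": "sonnet", "researcher": "haiku",
                 "verifier": "haiku", "explorer": "haiku"},
}


def resolve_model(role: str, profile: str = "balanced", overrides=None) -> str:
    """Per-agent overrides win over the profile table, as in Anvil's
    "model_overrides" setting."""
    if overrides and role in overrides:
        return overrides[role]
    return PROFILES[profile][role]
```

Keeping the table data-driven means a `--profile` flag, a project-config default, and per-agent overrides all resolve through one function.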


P1: Goal-Backward Verification for Shepherd

Anvil's verification philosophy: Don't check "did you complete all tasks?" — check "does the codebase actually deliver what was promised?"

Their verifier agent:

  • Reads the phase goal and success criteria
  • Inspects the actual codebase (not the summary/report)
  • Checks observable truths: things that must be TRUE for the goal to be met
  • Verifies artifacts at 3 levels: exists → substantive → wired (connected to the system)
  • Detects stubs: components returning <div>Component</div>, APIs returning "Not implemented", empty handlers
  • Re-verification mode: when previous gaps exist, focuses only on failed items

Verification hierarchy:

  1. Pre-execution (plan-checker): Will these plans achieve the goal?
  2. During execution (deviation rules): Is the executor staying on track?
  3. Post-execution (verifier): Did the result actually deliver the goal?

Why this matters: Our Shepherd checks intent alignment ("does implementation match what was asked"), which is close but less structured. Anvil's approach is more systematic — it has explicit observable truths, artifact depth checks (exists vs substantive vs wired), and re-verification mode.

What we'd enhance: Strengthen Shepherd to do explicit goal-backward verification: define observable truths from the original request, verify artifacts are substantive (not stubs), and check wiring (components connected, APIs consumed, state rendered). Add stub detection patterns to Scrutinizer.
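The exists → substantive → wired hierarchy can be sketched as a single classifier. The size threshold and the "referenced by another file" heuristic below are loose assumptions for illustration; a real verifier would use language-aware checks.

```python
# Sketch of the three-level artifact check (exists -> substantive -> wired).
# The 80-char threshold and the referenced_by heuristic are assumptions.
import os


def artifact_level(path: str, referenced_by: set) -> str:
    """Classify an artifact: missing, exists (stub), substantive, or wired."""
    if not os.path.exists(path):
        return "missing"
    with open(path) as f:
        source = f.read()
    # "Substantive": more than a trivial stub (crude heuristic).
    if len(source.strip()) < 80 or "Not implemented" in source:
        return "exists"
    # "Wired": some other file actually imports/references this artifact.
    if path not in referenced_by:
        return "substantive"
    return "wired"
```

The point of the ordering is that each level subsumes the previous one: a file that is wired but hollow still fails at "substantive", which is exactly the stub case the verifier is hunting.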


P2: Decision Preservation (Locked/Flexible/Deferred)

Anvil has a CONTEXT.md pattern created during a "discuss phase" step:

```markdown
## Implementation Decisions
### Authentication
- **LOCKED**: Use JWT with httpOnly cookies (user decision)
- **Claude's discretion**: Token expiry duration, refresh strategy
- **Deferred**: OAuth providers (out of scope for this phase)
```

Key aspects:

  • Locked decisions are NON-NEGOTIABLE — all downstream agents (planner, researcher, executor) MUST honor them
  • Claude's discretion areas give the agent explicit freedom
  • Deferred ideas are scope guardrails — explicitly captured but excluded
  • Created once during discussion, consumed everywhere downstream
  • Plan-checker verifies plans comply with locked decisions

Why this matters: Our /specify captures requirements but doesn't separate "user already decided this" from "Claude can choose." When Coder implements, it might re-debate something the user already settled. Our Shepherd validates intent alignment, but doesn't enforce decision preservation at the constraint level.

What we'd build: During /specify or /implement planning phase, capture user decisions with locked/flexible/deferred classification. Propagate locked decisions to Coder (must honor) and Shepherd (must verify compliance). This prevents re-debating and gives Claude clear autonomy boundaries.
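A propagation step would need to parse the decision block into machine-readable buckets. A minimal sketch, assuming the `- **LABEL**: text` line format from the CONTEXT.md example above (the function name and bucket keys are invented):

```python
# Hypothetical parser turning a CONTEXT.md decision block into buckets
# that downstream agents (Coder, Shepherd) could consume.
import re


def classify_decisions(markdown: str) -> dict:
    buckets = {"locked": [], "flexible": [], "deferred": []}
    for line in markdown.splitlines():
        m = re.match(r"-\s+\*\*(.+?)\*\*:\s*(.+)", line.strip())
        if not m:
            continue
        label, text = m.group(1).lower(), m.group(2)
        if "locked" in label:
            buckets["locked"].append(text)        # non-negotiable
        elif "discretion" in label:
            buckets["flexible"].append(text)      # agent's choice
        elif "deferred" in label:
            buckets["deferred"].append(text)      # scope guardrail
    return buckets
```

With the decisions in structured form, the Coder prompt can embed the `locked` list verbatim and Shepherd can verify compliance item by item.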


Secondary Findings (Documented for Future Reference)

Quick Tasks with Composable Flags

Anvil has a /quick command for lightweight single-task execution with composable flags:

  • Default: Just implement the task with atomic commits
  • --discuss: Pre-planning discussion to surface gray areas, captures decisions
  • --full: Plan checking (max 2 iterations) + post-execution verification
  • --discuss --full: Full workflow with all guardrails

Relevance: We have a gap between ambient/QUICK (zero overhead) and full /implement (full ceremony). A /quick command would fill this middle ground — single task, optional discussion, optional verification.

Deviation Rules for Autonomous Decision-Making

Anvil codifies explicit rules for when executors should auto-fix vs ask:

  • Rule 1 (Auto-fix bugs): Code doesn't work as intended → fix immediately
  • Rule 2 (Auto-add critical): Missing error handling, validation, auth → add it
  • Rule 3 (Auto-fix blockers): Something prevents completing task → fix it
  • Rule 4 (Ask about architecture): Significant structural changes → STOP and ask

Relevance: Our Coder agent has implicit behavior about when to ask vs proceed. Codifying this makes behavior predictable and documented.

Health Check Command

Anvil runs 8 diagnostic checks on project integrity:

  • E001-E005: Critical errors (missing dirs, invalid config JSON)
  • W001-W009: Warnings (orphaned files, validation consistency)
  • Auto-repair path for fixable issues (createConfig, resetConfig, regenerateState)
  • Structured output with code/message/repairable flags

Relevance: A /health command that validates DevFlow installation, hook integrity, settings.json consistency, and plugin state would help troubleshooting.
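The structured code/message/repairable output could look like the sketch below. The E-codes follow the convention above, but the specific check (settings.json validity) and the function are hypothetical DevFlow examples, not Anvil's implementation.

```python
# Illustrative health-check result with code/message/repairable flags.
# The check codes reuse the E00x convention described above.
import json
import os


def check_settings(path: str = ".claude/settings.json") -> dict:
    """One diagnostic check; a /health command would run a list of these."""
    if not os.path.exists(path):
        return {"code": "E001", "message": f"{path} missing",
                "repairable": True}   # auto-repair: recreate default config
    try:
        with open(path) as f:
            json.load(f)
    except json.JSONDecodeError as exc:
        return {"code": "E002", "message": f"invalid JSON: {exc}",
                "repairable": False}  # needs human review
    return {"code": "OK", "message": "settings valid", "repairable": False}
```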

Disk-Status Inference for Progress

Instead of explicit status tracking, Anvil infers phase status from files on disk:

  • Has PLAN.md? → "planned"
  • Has PLAN.md + SUMMARY.md? → "complete"
  • Has RESEARCH.md only? → "researched"
  • Self-healing: manually adding artifacts auto-updates status

Relevance: Our .docs/reviews/ directory could provide similar intelligence for review progress without explicit state management.
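The inference rules above amount to a small precedence check. A sketch, assuming a directory-per-phase layout and adding a fallback "pending" state not named in the source:

```python
# Status-from-disk inference following the rules above; the precedence
# order (complete > planned > researched) and "pending" are assumptions.
import os


def phase_status(phase_dir: str) -> str:
    has = lambda name: os.path.exists(os.path.join(phase_dir, name))
    if has("PLAN.md") and has("SUMMARY.md"):
        return "complete"
    if has("PLAN.md"):
        return "planned"
    if has("RESEARCH.md"):
        return "researched"
    return "pending"
```

Because status is derived rather than stored, manually dropping a SUMMARY.md into a phase directory is all it takes to mark it complete, which is the self-healing property the source describes.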

Wave-Based Parallel Execution

Anvil groups plans by dependency and runs independent plans in parallel within waves:

  • Wave 1: All independent plans (parallel)
  • Wave 2: Plans depending on Wave 1 (wait, then parallel)
  • Wave N: Sequential dependency chain

Relevance: Our /implement already parallelizes some agents, but for multi-task implementations, wave-based parallelism could improve throughput.
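Wave grouping is a layered topological sort: each wave contains every not-yet-placed plan whose dependencies are already placed. A minimal sketch with an invented dict-of-dependencies input shape:

```python
# Minimal wave grouping by dependency depth (layered topological sort).
# The deps input shape (plan -> set of prerequisite plans) is illustrative.
def build_waves(deps: dict) -> list:
    waves, placed = [], set()
    while len(placed) < len(deps):
        # A plan is ready when all of its dependencies are already placed.
        wave = sorted(p for p, d in deps.items()
                      if p not in placed and d <= placed)
        if not wave:
            raise ValueError("dependency cycle detected")
        waves.append(wave)
        placed.update(wave)
    return waves
```

Plans within a wave can then be dispatched to parallel agents, with the orchestrator waiting on the whole wave before starting the next.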

Verification Pattern Templates (Stub Detection)

Detailed patterns for detecting real vs stub implementations:

  • React stubs: `return <div>Component</div>`, `return null`, `onClick={() => {}}`
  • API stubs: `return Response.json({ message: "Not implemented" })`
  • Hook stubs: `return { user: null, login: () => {}, logout: () => {} }`
  • Wiring gaps: `fetch` without `await`, query without `return`, state exists but not rendered

Relevance: Could enhance Scrutinizer's detection capabilities with these patterns.
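A first pass at these patterns could be regex-based. This is a heuristic sketch, not a parser: the regexes encode the examples above and will miss variants, which is acceptable for a reviewer agent that flags candidates rather than blocking on them.

```python
# Hedged sketch of stub-pattern detection for JS/TSX sources; regexes
# encode the examples above and are heuristics, not a real parser.
import re

STUB_PATTERNS = [
    r"return\s+<div>\s*\w*\s*</div>",   # placeholder React component
    r"return\s+null\s*;?\s*$",          # empty render
    r"onClick=\{\(\)\s*=>\s*\{\}\}",    # no-op click handler
    r"Not implemented",                 # placeholder API response
]


def looks_like_stub(source: str) -> bool:
    return any(re.search(p, source, re.MULTILINE) for p in STUB_PATTERNS)
```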

Per-Task Atomic Commits

Anvil commits after every task (sub-feature level), not just per feature. Enables git bisect at granular level and failure recovery (task 1 committed, task 2 failed → next session knows exactly where to resume).

Relevance: Our Coder does atomic commits per feature. Going more granular might add overhead, but the failure recovery benefit is worth considering.

Session Handoff Files (.continue-here.md)

Explicit handoff files created on pause with YAML frontmatter + structured sections (current_state, completed_work, remaining_work, decisions_made, blockers, next_action). Deleted after resume.

Relevance: Our WORKING-MEMORY.md serves a similar purpose but is automatic (background write). Anvil's approach is more explicit and targeted. Both have merit.


Architecture Comparison Summary

| Dimension | DevFlow | Anvil |
|---|---|---|
| Core metaphor | Plugin marketplace + skills | Project lifecycle manager |
| Unit of work | Single task/PR | Phase → Plan → Task (3 levels) |
| State | .memory/WORKING-MEMORY.md (session) | .planning/STATE.md + ROADMAP.md + CONTEXT.md (persistent) |
| Quality | Skills loaded per-prompt (ambient) | Goal-backward verification at every stage |
| Execution | Agent spawns per command | Wave-based parallel execution of plans |
| Git | Atomic commits per feature | Atomic commits per task (sub-feature) |
| Context mgmt | Pre-compact hook + session start | Context monitor hook + bridge files + statusline |
| Model selection | Hardcoded per agent | Config-driven profiles with per-agent overrides |
| Session continuity | WORKING-MEMORY.md (background write) | .continue-here.md handoff files + STATE.md |
| Multi-runtime | Claude Code only | Claude Code + OpenCode + Gemini CLI + Codex |

What NOT to Borrow

  • Full milestone/phase/roadmap hierarchy — Too opinionated for DevFlow's composable philosophy
  • Multi-runtime support — Dilutes focus; Claude Code is our target
  • XML task format — Our markdown-based approach with agent prompts is cleaner
  • Single-system monolith — Our plugin architecture is more composable
  • Interactive project setup wizard — Heavy for DevFlow's "enhance every prompt" approach

Priority Matrix

| Priority | Feature | Effort | Impact |
|---|---|---|---|
| P1 | Context monitor hook (graduated thresholds) | Medium | High |
| P1 | Model profiles (quality/balanced/budget) | Medium | High |
| P1 | Goal-backward verification in Shepherd | Low | Medium-High |
| P2 | Decision preservation (locked/flexible/deferred) | Low | Medium |
| P3 | /quick command with composable flags | Medium | Medium |
| P3 | Deviation rules for Coder | Low | Medium |
| P4 | Health check command | Low | Low |
| P4 | Stub detection patterns for Scrutinizer | Low | Low |
| P4 | Bridge files for inter-hook communication | Low | Low |
