Thesis: When working with AI coding agents, the specification -- not the code -- is the load-bearing artifact. Code becomes a generated, disposable, regenerable byproduct. The spec is what you version, review, and defend.
| Person / Source | Role / Affiliation | Key Claim | Link |
|---|---|---|---|
| Addy Osmani | Engineering Lead, Google | Specs are the "single source of truth" for AI agents; use a Specify -> Plan -> Tasks -> Implement workflow | How to write a good spec for AI agents |
| Addy Osmani | Engineering Lead, Google | AI-augmented (not automated) engineering requires classic discipline applied to agent collaboration | My LLM coding workflow going into 2026 |
| Andrej Karpathy | Co-founder, OpenAI; former Tesla AI | Coined "vibe coding" (Feb 2025), then declared it passé in favor of "agentic engineering" -- orchestrating agents under structured oversight | 2025 LLM Year in Review |
| Deepak Babu Piskala | Researcher (arXiv) | Proposes three levels of spec rigor (spec-first, spec-anchored, spec-as-source); specs invert the traditional dev workflow | arXiv 2602.00180 |
| Den Delimarsky | Principal PM, GitHub | Specs are "living, executable artifacts" and the "shared source of truth"; agents can't read minds, they need contracts | GitHub Blog: Spec-driven development |
| GitHub | Spec Kit (open source) | Four-phase gated workflow: Specify, Plan, Tasks, Implement -- with human review checkpoints at each gate | github/spec-kit |
| Anthropic | Claude Code documentation | "Explore first, then plan, then code"; CLAUDE.md as persistent spec; recommends writing spec then starting fresh session to implement | Best Practices for Claude Code |
| Anthropic | Engineering blog | Long-running agents need structured feature inventories and progress files; agents drift without explicit scaffolding | Effective harnesses for long-running agents |
| Thoughtworks | Technology Radar (Nov 2025) | Placed SDD in "Assess" ring; calls it the key emerging practice of 2025 but warns workflows remain "elaborate and opinionated" | Technology Radar: SDD |
| Thoughtworks | Blog | SDD addresses the failure mode of vibe coding: too fast, too haphazard, too much unmaintainable one-off code | Unpacking SDD |
| Martin Fowler | Thoughtworks | Tested Kiro, spec-kit, and Tessl; found agents ignore spec instructions, create duplicates, and claim success when builds fail; warns of MDD parallels | SDD Tools Analysis |
| Leigh Griffin & Ray Carroll | InfoQ | SDD makes architecture executable and enforceable through continuous validation | InfoQ: When Architecture Becomes Executable |
| Simon Willison | Independent developer, Datasette creator | Formalizing "agentic engineering patterns"; robust test suites give agents "superpowers"; warns of "house of cards code" | Agentic Engineering Patterns |
| Steve Yegge & Gene Kim | Sourcegraph / IT Revolution | "The IDE is dead by 2026"; developers shift from writing code to articulating intent and managing agent ensembles; three-loop framework for AI-assisted dev | Pragmatic Engineer interview |
| Kent Beck | Creator of XP, TDD | Experiments with AI tools confirm that test-first discipline (a form of spec) remains essential even when agents write the code | TDD, AI agents and coding |
Osmani's position is the most operationally detailed. In "How to write a good spec for AI agents," he argues that a spec is the "single source of truth for both humans and AI agents." His framework covers six areas every spec should address: commands, testing, project structure, code style, git workflow, and boundaries (what agents must never touch).
His workflow is explicitly gated: Specify, Plan, Tasks, Implement -- with human validation at each gate. This mirrors GitHub Spec Kit's structure, and Osmani has endorsed it. In "My LLM coding workflow going into 2026," he emphasizes that the best results come from "applying classic software engineering discipline to AI collaborations" -- not from giving agents more autonomy, but from giving them better specs.
Key insight: specs should focus on what and why, not the nitty-gritty how. The agent fills in the how. The human owns the what.
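The gated workflow described above can be pictured as a small state machine in which no phase can be approved until every earlier gate has a human sign-off. This is only a sketch: the `GatedWorkflow` class, its method names, and the reviewer name are hypothetical and do not come from Osmani's posts or from Spec Kit.

```python
from dataclasses import dataclass, field
from typing import Optional

# Phase names follow the Specify -> Plan -> Tasks -> Implement sequence.
PHASES = ("specify", "plan", "tasks", "implement")


@dataclass
class GatedWorkflow:
    """Tracks which phases a human reviewer has signed off on (hypothetical helper)."""

    approvals: dict = field(default_factory=dict)  # phase name -> reviewer name

    @property
    def current_phase(self) -> Optional[str]:
        """Return the first phase not yet approved, or None once every gate is passed."""
        for phase in PHASES:
            if phase not in self.approvals:
                return phase
        return None

    def approve(self, phase: str, reviewer: str) -> None:
        """Record a human sign-off; refuse to skip past an unapproved gate."""
        if phase != self.current_phase:
            raise ValueError(
                f"cannot approve {phase!r}; waiting on {self.current_phase!r}"
            )
        self.approvals[phase] = reviewer


workflow = GatedWorkflow()
workflow.approve("specify", reviewer="alice")
workflow.approve("plan", reviewer="alice")
print(workflow.current_phase)  # tasks
```

The point of the sketch is the `ValueError`: an agent (or an impatient human) cannot jump to "implement" while the plan or task breakdown is still unreviewed.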
Karpathy coined "vibe coding" in February 2025 to describe the practice of "fully giving in to the vibes, embracing exponentials, and forgetting that the code even exists." It was a provocation -- and it worked. The term became Collins Dictionary's Word of the Year for 2025.
But by early 2026, Karpathy himself declared vibe coding passé. His replacement term is "agentic engineering": "'agentic' because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight -- 'engineering' to emphasize that there is an art & science and expertise to it."
The arc from vibe coding to agentic engineering is the arc from "let the AI figure it out" to "give the AI a spec and verify the output." Karpathy's evolution mirrors the industry's.
Anthropic's own Claude Code best practices are explicit: "Letting Claude jump straight to coding can produce code that solves the wrong problem." The recommended workflow has four phases: Explore (read files, understand context), Plan (create detailed implementation plan), Implement (code against the plan), Commit.
Anthropic goes further in their engineering blog post on effective harnesses for long-running agents. They recommend structured feature inventories (JSON-formatted, initially marked "failing"), progress files (claude-progress.txt), and explicit testing as alignment mechanisms. The key finding: "agents tend toward one-shot approaches," so explicit feature inventories prevent them from prematurely declaring work complete.
Their recommended interview pattern is particularly telling: start with a minimal prompt, have Claude interview you about edge cases and tradeoffs, then write a complete spec to SPEC.md, then start a fresh session to implement it. The spec is the handoff artifact between sessions. The conversation is disposable; the spec persists.
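The feature-inventory idea described above (a JSON list whose entries all start as "failing") might look roughly like the following. The field names (`id`, `description`, `status`) and the helper functions are assumptions made for illustration; they are not taken from Anthropic's post.

```python
import json


def new_inventory(descriptions):
    """Build an inventory in which every feature starts out failing."""
    return [
        {"id": i, "description": text, "status": "failing"}
        for i, text in enumerate(descriptions, start=1)
    ]


def mark_passing(inventory, feature_id):
    """Flip one feature to passing, e.g. after its tests have actually run green."""
    for feature in inventory:
        if feature["id"] == feature_id:
            feature["status"] = "passing"
            return
    raise KeyError(f"no feature with id {feature_id}")


def is_complete(inventory):
    """Work is done only when the inventory is non-empty and nothing is failing."""
    return bool(inventory) and all(f["status"] == "passing" for f in inventory)


def progress_summary(inventory):
    """One line suitable for appending to a progress file."""
    done = sum(f["status"] == "passing" for f in inventory)
    return f"{done}/{len(inventory)} features passing"


inventory = new_inventory(["user can log in", "user can reset password"])
mark_passing(inventory, 1)
print(json.dumps(inventory, indent=2))
print(progress_summary(inventory))  # 1/2 features passing
print(is_complete(inventory))       # False
```

Because completion is computed from the file rather than asserted by the agent, a session that claims to be finished while entries are still "failing" is easy to catch.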
Fowler's analysis of Kiro, spec-kit, and Tessl is the most rigorous critique available. He tested these tools on real tasks and found:
- Kiro turned a small bug fix into "4 user stories with a total of 16 acceptance criteria" -- "like using a sledgehammer to crack a nut"
- Agents frequently ignored spec instructions: "it just took them as a new specification and generated them all over again, creating duplicates"
- He found himself preferring to "review code than all these markdown files"
- He warns of parallels to Model-Driven Development (MDD), which "never took off for business applications" because it "sits at an awkward abstraction level"
Fowler's critique is important because it identifies a real tension: specs can become bureaucracy if they're not the right size for the problem. But notably, even Fowler doesn't argue against specs -- he argues against over-specifying small tasks and against tools that enforce a single workflow regardless of problem size.
Willison's agentic engineering patterns emphasize that "having a robust test suite is like giving the agents superpowers -- they can validate and iterate quickly when tests fail." His coinage of "house of cards code" describes the failure mode of unspecified agent output: code that looks right but collapses under scrutiny.
This aligns with a key insight: tests are specs. A test suite is a machine-readable specification. When you give an agent tests to pass, you're giving it a spec. When you give it a written spec without tests, you're trusting the agent to interpret natural language correctly -- which, as Fowler demonstrated, it frequently does not.
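To make "tests are specs" concrete, here is a minimal sketch in which three plain-assert tests pin down behavior that a sentence like "turn a title into a URL slug" leaves open. The `slugify` function and its rules are a hypothetical example, not something drawn from the sources above.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical function under specification: turn a title into a URL slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


# Each test states one requirement the prose description left ambiguous.
def test_output_is_lowercase():
    assert slugify("Hello") == "hello"


def test_runs_of_separators_collapse_to_one_hyphen():
    assert slugify("Hello,   World!") == "hello-world"


def test_blank_input_yields_empty_slug():
    assert slugify("   ") == ""


if __name__ == "__main__":
    test_output_is_lowercase()
    test_runs_of_separators_collapse_to_one_hyphen()
    test_blank_input_yields_empty_slug()
    print("all spec tests pass")
```

An agent handed only the three tests has an unambiguous target; an agent handed only the sentence has to guess what happens to punctuation and blank input.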
Yegge's prediction that "the IDE is dead... it will be gone by 2026" (Pragmatic Engineer) is provocative, but the underlying argument is about where developers spend their time. If agents write the code, the developer's primary artifact becomes the intent specification -- the description of what should exist. Yegge and Kim's three-loop framework for AI-assisted development formalizes this, organizing work into loops that each require specification, execution, and verification.
Den Delimarsky of GitHub states it plainly: AI agents "are exceptional at pattern completion, but not at mind reading." Without a spec, the agent must guess at unstated requirements. It will produce code that compiles but does not match intent. The spec eliminates the guessing.
When an agent can generate 500 lines of code in seconds, the bottleneck shifts. Writing code is no longer the hard part. Knowing what code to write -- and being able to verify that the output matches intent -- is the hard part. The spec captures the expensive thing (intent) and lets the cheap thing (code generation) be automated.
The arXiv paper and Augment Code's analysis both identify a shared spec as critical to coordination: in multi-agent workflows, "every agent reads from and writes to the same living spec, so the Coordinator, Implementors, and Verifier stay aligned." Without a shared spec, each agent operates from its own interpretation of conversation context, producing misaligned assumptions that only surface during integration.
The most practical argument concerns context limits. Anthropic's best practices note that "Claude's context window fills up fast, and performance degrades as it fills." A conversation is ephemeral and degrades. A spec file on disk is permanent. Anthropic explicitly recommends writing a spec, then starting a fresh session to implement it. The spec is the bridge across context boundaries.
As Osmani, Willison, and Anthropic all emphasize, the single highest-leverage thing you can do is give the agent a way to verify its work. A spec with test cases, acceptance criteria, and success conditions enables this. Without a spec, "you become the only feedback loop, and every mistake requires your attention" (Anthropic).
Reviewing a 50-line spec is faster and more reliable than reviewing 500 lines of generated code. The spec describes intent in human language. The code describes mechanism in machine language. Humans are better at validating intent than mechanism. (Fowler's counterpoint -- that verbose specs can themselves become a burden -- is valid, but the answer is better specs, not no specs.)
Thoughtworks warns that SDD can recreate waterfall's failure modes: too much upfront specification, too little iteration. Fowler's experiments confirm this -- when a small bug fix becomes 4 user stories with 16 acceptance criteria, something has gone wrong. The mitigation is to right-size specs: a one-line bug fix gets a one-line spec; a new feature gets a full spec document.
Fowler observes that "the mapping from spec to code is non-deterministic -- the same specification given to the same model on different days produces different implementations." This is a fundamental limitation of treating specs as compilable source code. Specs work best as contracts and constraints, not as deterministic compilation inputs.
Specs can drift from code just as code drifts from requirements. If the spec is not actively enforced (through tests, CI, or agent validation), it becomes stale documentation -- exactly the failure mode that traditional software specs have always suffered from.
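One lightweight way to enforce a spec in CI is to check that every acceptance criterion in the spec is referenced by at least one test. The sketch below assumes a purely hypothetical convention in which criteria carry IDs such as `AC-1` and tests mention those IDs in comments or names; none of the cited tools is claimed to work this way.

```python
import re

# Assumed ID convention for acceptance criteria: "AC-" followed by digits.
AC_ID = re.compile(r"\bAC-\d+\b")


def uncovered_criteria(spec_text: str, test_text: str) -> list:
    """Return criterion IDs that appear in the spec but in no test."""
    in_spec = set(AC_ID.findall(spec_text))
    in_tests = set(AC_ID.findall(test_text))
    return sorted(in_spec - in_tests)


spec_text = """
AC-1: Login succeeds with valid credentials.
AC-2: Login fails after three bad attempts.
AC-3: Password reset emails expire after one hour.
"""

test_text = """
def test_login_ok():        # covers AC-1
    ...
def test_login_lockout():   # covers AC-2
    ...
"""

missing = uncovered_criteria(spec_text, test_text)
print(missing)  # ['AC-3']
```

A CI job could fail the build whenever this list is non-empty, which turns a stale or unenforced criterion into a visible error rather than silent drift.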
Fowler's most damning finding: agents "ignore the notes... just took them as a new specification and generated them all over again, creating duplicates." Current LLMs do not reliably follow all instructions in long, detailed specifications. This is a capability limitation, not a conceptual one -- it will improve -- but it is real today.
Fowler found that extensive markdown artifacts created "a tedious review experience." If the spec is so long that reviewing it is harder than reviewing the code, it has failed its purpose. Specs must be concise enough to actually read.
Thoughtworks notes that SDD tools "behave inconsistently based on task size and type" and that "tool-specific specification formats create lock-in risk." The ecosystem is immature -- Kiro, spec-kit, and Tessl all launched in 2025 and are still evolving rapidly.
The argument for a frozen spec is strongest in multi-agent workflows. Consider a workflow with multiple agents:
- Agent A writes the backend implementation
- Agent B writes the frontend
- Agent C writes the tests
- Agent D performs security review
Each agent runs in a separate context window. Each has no memory of the others' conversations. What keeps them aligned?
Without a frozen spec, the answer is: nothing. Each agent interprets its prompt independently. Agent A decides the API returns camelCase keys. Agent B assumes snake_case. Agent C writes tests for a third interpretation. You discover this at integration time, after all four agents have consumed their context budgets.
With a frozen, versioned spec committed to version control:
- Every agent reads the same spec.md and tasks.md from the repo
- The spec defines the API contract, data formats, and acceptance criteria
- If an agent's output doesn't match the spec, the mismatch is detectable -- by tests, by CI, or by a reviewer comparing output to spec
- The spec can be updated through a deliberate, reviewable process (a PR to the spec file), not through conversational drift
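The camelCase versus snake_case mismatch from the example above is exactly the kind of disagreement a machine-checkable contract can catch before integration. The following is a minimal sketch, assuming a hypothetical `SPEC` dictionary that records the key style and required fields; the field names are invented for illustration.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Hypothetical contract that every agent would read from the shared spec.
SPEC = {
    "key_style": "snake_case",
    "required_fields": {"user_id": int, "display_name": str},
}


def contract_violations(payload: dict) -> list:
    """Compare one API response payload against the shared contract."""
    problems = []
    for key in payload:
        if not SNAKE_CASE.match(key):
            problems.append(f"key {key!r} is not snake_case")
    for name, expected_type in SPEC["required_fields"].items():
        if name not in payload:
            problems.append(f"missing required field {name!r}")
        elif not isinstance(payload[name], expected_type):
            problems.append(f"field {name!r} should be {expected_type.__name__}")
    return problems


# Agent A followed the spec; a second backend draft used camelCase instead.
print(contract_violations({"user_id": 7, "display_name": "Ada"}))  # []
for problem in contract_violations({"userId": 7, "displayName": "Ada"}):
    print(problem)
```

Run in CI against each agent's output, a check like this surfaces the disagreement as soon as one agent deviates, rather than at integration time.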
Anthropic's guidance on long-running agents confirms this pattern. They use structured feature inventories and progress files precisely because "agents tend toward one-shot approaches" and will "prematurely declare work complete" without explicit, file-based tracking. The claude-progress.txt pattern is a lightweight spec -- a persistent, file-based source of truth that survives context boundaries.
The arXiv paper formalizes this with three levels of rigor:
| Level | Description | When to Use |
|---|---|---|
| Spec-first | Write the spec before any code; use it to guide implementation | New features, greenfield projects |
| Spec-anchored | Spec guides and validates ongoing development; updated as decisions emerge | Evolving systems, multi-sprint work |
| Spec-as-source | Spec generates or verifies code directly; code is a derived artifact | API contracts, schema-driven systems |
For multi-agent workflows, spec-anchored is the practical sweet spot. The spec is versioned in git. Agents read it at the start of each session. Changes to the spec go through code review. The spec is the contract between agents, between sessions, and between humans and machines.
Despite the criticisms, a rough consensus is forming across these sources:
- Vibe coding is a starting point, not a destination. Karpathy's own evolution from coining "vibe coding" to advocating "agentic engineering" captures the industry's trajectory.
- Specs don't have to be heavyweight. The best specs are concise, focused on what and why, and include testable acceptance criteria. Fowler's critique of over-specified workflows is valid, but it's a critique of bad specs, not of specs themselves.
- Tests are the most important form of spec. Willison, Osmani, and Anthropic all converge on this: give the agent a way to verify its own work. A test suite is a machine-readable spec that the agent can execute against.
- The spec is the handoff artifact. When you switch context windows, switch agents, switch sessions, or switch humans -- the spec is what persists. Conversations are ephemeral. Code is regenerable. The spec is the durable artifact.
- The tooling is immature but the pattern is sound. Spec-kit, Kiro, and Tessl are v0.x tools. But the underlying pattern -- write a spec, gate the workflow, verify against the spec -- is robust and will survive regardless of which tools win.
The strongest version of this argument: in a world where code is generated, the spec is source code. It is what you version-control, review, and defend. It is the artifact that carries intent across context boundaries, agent boundaries, and session boundaries. It is the load-bearing artifact.
- Osmani, A. How to write a good spec for AI agents
- Osmani, A. My LLM coding workflow going into 2026
- Osmani, A. The future of agentic coding
- Karpathy, A. 2025 LLM Year in Review
- Piskala, D.B. Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants
- Delimarsky, D. Spec-driven development with AI (GitHub Blog)
- GitHub. spec-kit repository
- Anthropic. Best Practices for Claude Code
- Anthropic. Effective harnesses for long-running agents
- Fowler, M. Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl
- Thoughtworks. Spec-driven development (Technology Radar)
- Thoughtworks. Unpacking one of 2025's key new AI-assisted engineering practices
- Willison, S. Agentic Engineering Patterns
- Yegge, S. Steve Yegge on AI Agents and the Future of Software Engineering (Pragmatic Engineer)
- Beck, K. TDD, AI agents and coding with Kent Beck (Pragmatic Engineer)
- Griffin, L. & Carroll, R. Spec Driven Development: When Architecture Becomes Executable (InfoQ)