tl;dr — I used 8 AI agents to systematically analyze a production agentic codebase (512K lines, ~1890 files, millions of sessions). I distilled everything into 91 battle-tested patterns across 9 domains, with the anti-patterns that actually caused production incidents. Every pattern includes implementation-ready pseudocode. This is the engineering playbook I wish existed when I started building agents.
Everyone is building AI agents. Almost nobody is building them well.
The hard problems aren't the LLM calls — they're everything around them:
- Your prompt cache busts silently and your API bill 12x's overnight
- Your agent loop crashes mid-conversation and the user's 30-minute session is gone
- Your MCP tools run on a different pipeline than built-in tools, so bugs only surface in plugins
- Your permission model has TOCTOU races that a crafted hook can exploit mid-session
- Your context window fills up and your "smart" compaction deletes the file state the model needs next
- Your 292 concurrent agents OOM because nobody set a message queue cap
I've hit every one of these building Abel AI. This repo is the result of turning those lessons into reusable engineering patterns — not as theory, but as implementation-ready specifications with the exact anti-patterns to avoid.
91 patterns. 9 domains. 58-item audit checklist. Zero hand-waving.
| Module | Patterns | You'll learn... |
|---|---|---|
| Architecture | 7 | Why one binary with four execution modes beats four codebases. How a 34-line reactive store outperforms Redux. Why your bootstrap state module should import nothing. |
| Agentic Loop | 8 | The while(true) AsyncGenerator pattern that handles streaming, cancellation, and backpressure in one construct. Context management as an ordered pipeline. Autocompact with circuit breakers. |
| LLM Integration | 9 | Why you should disable SDK retry and build your own. The 6-layer system prompt pipeline. How beta header latching saves millions in cache costs. |
| Tool System | 10 | The three-tier tool interface where Tier 2 defaults fail-closed. The seven-step lifecycle that prevents permission bypass. Why tool order matters for your API bill. |
| Agent Orchestration | 9 | Spawning a sub-agent IS running another conversation — same query(), zero feature drift. How to share prompt cache across N forked agents. The 50-message cap that prevented OOM at 292 concurrent agents. |
| Permission & Safety | 11 | Six permission modes as strategy objects. The six-layer Bash permission cascade with tree-sitter AST analysis. Why stripping code interpreter rules in auto mode prevents the AI from approving arbitrary code. |
| Hooks & Extensibility | 12 | 26 lifecycle events with typed frozen payloads. Six hook types from shell scripts to LLM calls. Why your hook config must be snapshot-frozen at startup (TOCTOU injection vector). |
| UI & Infrastructure | 13 | Integer interning for 60fps terminal rendering. Hardware scroll via DECSTBM. Virtual scroll with quantized React commits. Git ref validation that blocks injection. |
| Philosophy | 12 | The 12 principles that generate correct patterns — so you can derive the right answer for situations these patterns don't cover. |
| Audit Checklist | 58 items | Grade any agentic tool across 8 categories. Minimum viable = 70%. Production-grade = 90%. |
The core of any AI agent is a loop: call the LLM, execute tools, repeat. The 7 principles below govern every design decision in that loop. They're extracted from the Philosophy module — the "why behind the why" that generates correct patterns for situations this repo doesn't explicitly cover.
1. AsyncGenerator as lingua franca
→ One composition primitive. Streaming, backpressure, cancellation, type safety.
If you're using callbacks AND promises AND event emitters, you're paying
complexity tax for zero compositional benefit.
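A minimal sketch of the idea, with hypothetical event names (not the repo's actual code): one AsyncGenerator carries every event, the consumer pulls with `for await` (which is the backpressure), and cancellation is just exiting the loop, which triggers the generator's cleanup.

```typescript
// Hypothetical event shape for illustration.
type AgentEvent =
  | { kind: "text"; chunk: string }
  | { kind: "tool"; name: string }
  | { kind: "done" };

// One generator handles streaming, completion, and cancellation.
// The consumer drives it: nothing advances until the next pull.
async function* agentLoop(steps: AgentEvent[]): AsyncGenerator<AgentEvent> {
  try {
    for (const step of steps) {
      yield step; // backpressure: suspended here until the consumer pulls
      if (step.kind === "done") return;
    }
  } finally {
    // runs on normal completion AND on early break / cancellation
  }
}

async function collect(events: AsyncGenerator<AgentEvent>): Promise<AgentEvent[]> {
  const out: AgentEvent[] = [];
  for await (const e of events) out.push(e);
  return out;
}
```

Callbacks, promises, and emitters each cover one of these concerns; the generator covers all of them in one typed construct.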
2. Prompt cache is sacred
→ Cache bust = 12x cost at fleet scale. Sort tool pools deterministically.
Latch beta headers. Hash content paths. Never put timestamps in your prefix.
This is not an optimization — it's architecture.
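The mechanism, sketched with a hypothetical `buildCacheablePrefix` helper: the cached prefix must be byte-identical across requests, so the tool pool is sorted deterministically and volatile data (timestamps, request IDs) never enters it.

```typescript
// Hypothetical tool shape for illustration.
interface Tool { name: string; description: string }

// Same tools in any order must produce the same prefix bytes,
// or every request is a cache miss.
function buildCacheablePrefix(tools: Tool[], systemPrompt: string): string {
  const sorted = [...tools].sort((a, b) => a.name.localeCompare(b.name));
  return JSON.stringify({ system: systemPrompt, tools: sorted });
}
```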
3. Fail-fast for safety, fail-open for UX
→ Permission denied? exit(1). MCP server unreachable? Show what you have.
One strategy for both = guaranteed wrong for one.
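A sketch of the two strategies side by side (hypothetical function names): the safety path throws on the spot, the UX path swallows the failure and degrades.

```typescript
// Safety path: a denied permission must stop execution immediately.
function requirePermission(granted: boolean): void {
  if (!granted) throw new Error("permission denied"); // fail-fast
}

// UX path: an unreachable server should degrade, not crash the session.
function listTools(servers: Array<() => string[]>): string[] {
  const tools: string[] = [];
  for (const fetchTools of servers) {
    try {
      tools.push(...fetchTools());
    } catch {
      // fail-open: skip the broken server, show what we have
    }
  }
  return tools;
}
```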
4. Just enough complexity
→ 34-line store > Redux (when you have 3 pieces of state).
If you can't explain a component in one sentence, it's too complex.
If explaining it can't fill one sentence, it's too simple.
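The kind of store the "34 lines beats Redux" claim is about can be sketched like this (my reconstruction, not the repo's code): a value, a setter, and a subscriber set.

```typescript
// A minimal reactive store: get, set, subscribe. No reducers, no actions,
// no middleware. Enough for a handful of state slices.
function createStore<T>(initial: T) {
  let state = initial;
  const listeners = new Set<(s: T) => void>();
  return {
    get: () => state,
    set(next: T) {
      state = next;
      for (const fn of listeners) fn(state);
    },
    subscribe(fn: (s: T) => void) {
      listeners.add(fn);
      return () => listeners.delete(fn); // unsubscribe handle
    },
  };
}
```

One sentence explains it: hold a value, notify subscribers when it changes.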
5. Tool interface IS the extension point
→ MCP tools = built-in tools. Same pipeline, same validation, same permissions.
If external contributors learn a different abstraction than internal code uses,
you've created an unnecessary seam.
6. Persist before the crash boundary
→ If the process can die at line X, state must be on disk before X.
User messages saved before the API call. Transcript saved before compaction.
"I lost my conversation" is an architecture bug, not bad luck.
7. Hide latency, don't reduce it
→ Start I/O before you need the result. Preconnect during setup.
Read files A, B, C concurrently. If every I/O call blocks something,
you're leaving latency on the table.
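A sketch of the concurrent-read version, with a simulated 50ms read standing in for real file I/O: all three reads start before any is awaited, so total latency is one read, not three.

```typescript
// Simulated read with fixed latency; a real fs.promises.readFile
// behaves the same way for this purpose.
const fakeRead = (name: string): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(`contents of ${name}`), 50));

async function readAllConcurrently(names: string[]): Promise<string[]> {
  const pending = names.map(fakeRead); // all I/O starts here, before any await
  return Promise.all(pending);        // ~50ms total, not ~150ms
}
```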
The Tool System module covers the 10 patterns that make MCP tools first-class citizens — same validation pipeline, same permissions, same lifecycle as built-in tools. Key insight: MCP tools default fail-closed (isConcurrencySafe: false, isReadOnly: false). Omitting a security field is safe, not dangerous.
The Permission & Safety module covers the 11 patterns for the full permission model — from six strategy-based modes to the six-layer Bash permission cascade that uses tree-sitter AST analysis (not regex) to catch rm -rf "$VAR".
The Agentic Loop module covers the 5-stage context management pipeline: tool result budgets → history snip → microcompact (per-tool-type retention thresholds) → context collapse → autocompact (forked summarization agent with circuit breaker). Each stage is a pure function. Cheap stages run first. Expensive summarization fires only when everything else fails.
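The "ordered pipeline of pure functions" shape can be sketched like this (hypothetical types; the real stages are far richer): each stage maps history to history, stages run cheapest-first, and the loop stops as soon as the budget is met.

```typescript
type Message = { role: string; tokens: number };
type Stage = (history: Message[]) => Message[];

// Run stages in cost order; skip everything once we're under budget.
function compact(history: Message[], stages: Stage[], budget: number): Message[] {
  let current = history;
  for (const stage of stages) {
    const total = current.reduce((n, m) => n + m.tokens, 0);
    if (total <= budget) break; // cheap exit: later (expensive) stages never fire
    current = stage(current);
  }
  return current;
}

// Example cheap stage: keep only the last k messages.
const historySnip = (k: number): Stage => (h) => h.slice(-k);
```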
The Agent Orchestration module covers 9 patterns for multi-agent systems. The key pattern: spawning a sub-agent IS running another conversation — the Agent tool calls the same query() function as the main loop. Zero feature drift between parent and child. Also covers: fork cache sharing (N children share one prompt-cache entry), the 50-message queue cap that prevented OOM at 292 concurrent agents, and mailbox-based permission synchronization for swarms.
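The zero-drift property falls out of the structure, which a toy sketch can show (hypothetical signatures; the real query() is the full loop): the "spawn" path re-enters the exact same function the parent runs, so there is no second runtime to fall out of sync.

```typescript
type Query = (prompt: string, depth: number) => Promise<string>;

// Toy query(): a "spawn:" prefix stands in for the Agent tool call.
// The sub-agent path is a recursive call into the SAME function.
const query: Query = async (prompt, depth) => {
  if (depth > 0 && prompt.startsWith("spawn:")) {
    const child = await query(prompt.slice("spawn:".length), depth - 1);
    return `parent(${child})`;
  }
  return `answered(${prompt})`;
};
```

Any fix or feature that lands in query() is instantly shared by parent and child.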
The Hooks & Extensibility module covers 12 patterns: 26 typed lifecycle events, 6 hook execution types (shell → LLM call → subagent), exit-code-as-contract, TOCTOU-safe snapshot isolation, frontmatter-driven skill configuration, conditional path-based activation, namespaced plugin architecture with impersonation protection, and self-authoring via /skillify.
git clone https://github.com/cauchyturing/agent-harness-engineering.git
ln -s "$(pwd)/agent-harness-engineering" ~/.claude/skills/agent-harness-engineering

Then invoke with /agent-harness-engineering. The skill loads only the 1-2 modules relevant to your current task; context discipline comes first.
Works with Claude Code, Codex, and any AI coding assistant that supports markdown skills.
Each module is self-contained. Read what you need:
| I'm building... | Start here |
|---|---|
| An agentic loop | 01-agentic-loop.md |
| A tool execution system | 03-tool-system.md |
| A permission model | 05-permission-safety.md |
| A hook/plugin system | 06-hooks-extensibility.md |
| Nothing yet — just want to understand | 08-philosophy.md |
Run the 58-item checklist against your harness:
- 70% overall, no category < 50% → Minimum viable
- 90% overall, no category < 75% → Production-grade
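The two thresholds combine an overall score with a per-category floor. A hypothetical grading sketch (scoring per-category pass fractions equally; the actual checklist may weight by item count):

```typescript
type Scores = Record<string, number>; // per-category fraction passed, 0..1

function grade(categories: Scores): "production" | "viable" | "fail" {
  const values = Object.values(categories);
  const overall = values.reduce((a, b) => a + b, 0) / values.length;
  const floor = Math.min(...values);
  if (overall >= 0.9 && floor >= 0.75) return "production";
  if (overall >= 0.7 && floor >= 0.5) return "viable";
  return "fail";
}
```

The floor matters: a harness can average 90% while its permission model scores 40%, and the floor is what catches that.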
Every pattern follows a consistent structure:
### N. Pattern Name
Problem: What engineering challenge does this solve?
Pattern: The solution in 2-3 sentences.
Implementation: Concrete pseudocode (language-agnostic principles, TypeScript-flavored examples).
Why it works: The engineering reasoning — not "because best practice", but the actual mechanism.
Anti-pattern: What to avoid and why — often from real incidents.
See also: Cross-references to related patterns in other modules.
8 parallel deep-analysis agents, each specializing in one subsystem of a 512K-line production agentic codebase:
- Core Runtime & Bootstrap
- Tool System Architecture
- Hook & Permission Model
- Services & LLM Integration
- UI Components & Rendering
- Skill & Command System
- Bridge, Remote & Task System
- Utils & Infrastructure
Raw analysis → 3,100-line synthesis → modular distillation → this repo.
The agents found the patterns. I verified them against production incidents. If a pattern didn't have a real anti-pattern that actually went wrong, it didn't make the cut.
agent-harness-engineering/
├── SKILL.md # AI skill entry point (routing table)
├── references/
│ ├── 00-architecture.md # 7 patterns — system design
│ ├── 01-agentic-loop.md # 8 patterns — the core loop
│ ├── 02-llm-integration.md # 9 patterns — LLM API layer
│ ├── 03-tool-system.md # 10 patterns — tool execution
│ ├── 04-agent-orchestration.md # 9 patterns — multi-agent
│ ├── 05-permission-safety.md # 11 patterns — security model
│ ├── 06-hooks-extensibility.md # 12 patterns — extension system
│ ├── 07-ui-infrastructure.md # 13 patterns — terminal & infra
│ └── 08-philosophy.md # 12 principles — generative wisdom
└── checklists/
└── harness-audit.md # 58-item evaluation checklist
Patterns must be:
- Proven — from production code, not whiteboards
- Generalizable — language/framework agnostic where possible
- Actionable — pseudocode or it didn't happen
- Honest — every pattern needs an anti-pattern from a real failure
Built by Stephen — founder of Abel AI, the social-physical engine driven by causal AI.