cauchyturing/agent-harness-engineering


We reverse-engineered a 512K-line CC codebase. Here are the 91 patterns that actually matter.

tl;dr — I used 8 AI agents to systematically analyze a production agentic codebase (512K lines, ~1890 files, millions of sessions). I distilled everything into 91 battle-tested patterns across 9 domains, with the anti-patterns that actually caused production incidents. Every pattern includes implementation-ready pseudocode. This is the engineering playbook I wish existed when I started building agents.


Why most AI agents break in production

Everyone is building AI agents. Almost nobody is building them well.

The hard problems aren't the LLM calls — they're everything around them:

  • Your prompt cache busts silently and your API bill 12x's overnight
  • Your agent loop crashes mid-conversation and the user's 30-minute session is gone
  • Your MCP tools run on a different pipeline than built-in tools, so bugs only surface in plugins
  • Your permission model has TOCTOU races that a crafted hook can exploit mid-session
  • Your context window fills up and your "smart" compaction deletes the file state the model needs next
  • Your 292 concurrent agents OOM because nobody set a message queue cap

I've hit every one of these building Abel AI. This repo is the result of turning those lessons into reusable engineering patterns — not as theory, but as implementation-ready specifications with the exact anti-patterns to avoid.


What's inside

91 patterns. 9 domains. 58-item audit checklist. Zero hand-waving.

| Module | Patterns | You'll learn… |
| --- | --- | --- |
| Architecture | 7 | Why one binary with four execution modes beats four codebases. How a 34-line reactive store outperforms Redux. Why your bootstrap state module should import nothing. |
| Agentic Loop | 8 | The `while(true)` AsyncGenerator pattern that handles streaming, cancellation, and backpressure in one construct. Context management as an ordered pipeline. Autocompact with circuit breakers. |
| LLM Integration | 9 | Why you should disable SDK retry and build your own. The 6-layer system prompt pipeline. How beta header latching saves millions in cache costs. |
| Tool System | 10 | The three-tier tool interface where Tier 2 defaults fail-closed. The seven-step lifecycle that prevents permission bypass. Why tool order matters for your API bill. |
| Agent Orchestration | 9 | Spawning a sub-agent IS running another conversation — same `query()`, zero feature drift. How to share prompt cache across N forked agents. The 50-message cap that prevented OOM at 292 concurrent agents. |
| Permission & Safety | 11 | Six permission modes as strategy objects. The six-layer Bash permission cascade with tree-sitter AST analysis. Why stripping code interpreter rules in auto mode prevents the AI from approving arbitrary code. |
| Hooks & Extensibility | 12 | 26 lifecycle events with typed frozen payloads. Six hook types from shell scripts to LLM calls. Why your hook config must be snapshot-frozen at startup (TOCTOU injection vector). |
| UI & Infrastructure | 13 | Integer interning for 60fps terminal rendering. Hardware scroll via DECSTBM. Virtual scroll with quantized React commits. Git ref validation that blocks injection. |
| Philosophy | 12 | The 12 principles that generate correct patterns — so you can derive the right answer for situations these patterns don't cover. |
| Audit Checklist | 58 items | Grade any agentic tool across 8 categories. Minimum viable = 70%. Production-grade = 90%. |

How to design an agentic loop that doesn't break

The core of any AI agent is a loop: call the LLM, execute tools, repeat. The 7 principles below govern every design decision in that loop. They're extracted from the Philosophy module — the "why behind the why" that generates correct patterns for situations this repo doesn't explicitly cover.

1. AsyncGenerator as lingua franca
   → One composition primitive. Streaming, backpressure, cancellation, type safety.
     If you're using callbacks AND promises AND event emitters, you're paying
     complexity tax for zero compositional benefit.
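As a minimal sketch of what "one composition primitive" means in practice (all names here, `AgentEvent`, `callModel`, `runTool`, are illustrative, not from the analyzed codebase):

```typescript
// The while(true) AsyncGenerator loop: streaming (yield), backpressure
// (the consumer pulls), and cancellation (generator .return()) in one shape.
type AgentEvent =
  | { kind: "text"; delta: string }
  | { kind: "tool_use"; name: string; input: unknown };

async function* agentLoop(
  callModel: (history: string[]) => AsyncGenerator<AgentEvent>,
  runTool: (name: string, input: unknown) => Promise<string>,
  history: string[],
): AsyncGenerator<AgentEvent> {
  while (true) {
    let sawToolUse = false;
    for await (const ev of callModel(history)) {
      yield ev; // streaming falls out of the generator for free
      if (ev.kind === "tool_use") {
        sawToolUse = true;
        history.push(await runTool(ev.name, ev.input));
      }
    }
    if (!sawToolUse) return; // model finished without requesting tools
  }
}
```

Because the consumer drives the generator, a slow terminal renderer naturally throttles the loop, and calling `.return()` on it cancels everything downstream.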

2. Prompt cache is sacred
   → Cache bust = 12x cost at fleet scale. Sort tool pools deterministically.
     Latch beta headers. Hash content paths. Never put timestamps in your prefix.
     This is not an optimization — it's architecture.
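A sketch of what "deterministic prefix" means (the builder and its shape are illustrative assumptions, not the repo's actual code):

```typescript
// Deterministic cacheable prefix: sort the tool pool, exclude anything
// volatile. Two requests with the same tools in any order get the same key.
interface ToolDef {
  name: string;
  description: string;
}

function buildCacheablePrefix(system: string, tools: ToolDef[]): string {
  const sorted = [...tools].sort((a, b) => a.name.localeCompare(b.name));
  // No Date.now(), no request IDs: any volatile byte here busts the cache.
  return JSON.stringify({ system, tools: sorted });
}
```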

3. Fail-fast for safety, fail-open for UX
   → Permission denied? exit(1). MCP server unreachable? Show what you have.
     One strategy for both = guaranteed wrong for one.
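The two strategies, side by side (names are illustrative):

```typescript
// Fail-fast: a safety check that cannot pass must abort loudly.
class PermissionDenied extends Error {}

function checkPermission(allowed: boolean): void {
  if (!allowed) throw new PermissionDenied("permission denied");
}

// Fail-open: a UX feature that cannot load must degrade quietly.
function listTools(fetchMcpTools: () => string[]): string[] {
  const builtIn = ["read", "write"];
  try {
    return [...builtIn, ...fetchMcpTools()];
  } catch {
    return builtIn; // MCP server unreachable: show what you have
  }
}
```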

4. Just enough complexity
   → 34-line store > Redux (when you have 3 pieces of state).
     If you can't explain a component in one sentence, it's too complex.
     If you can't fill one sentence, it's too simple.
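In the spirit of the "34-line store" claim, here is a minimal subscribe/notify store; this is an illustrative sketch, not the repo's implementation:

```typescript
// One sentence: holds a value, lets you read it, set it, and be notified.
function createStore<T>(initial: T) {
  let state = initial;
  const listeners = new Set<(s: T) => void>();
  return {
    get: () => state,
    set(next: T) {
      state = next;
      listeners.forEach((l) => l(state));
    },
    subscribe(l: (s: T) => void) {
      listeners.add(l);
      return () => listeners.delete(l); // unsubscribe handle
    },
  };
}
```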

5. Tool interface IS the extension point
   → MCP tools = built-in tools. Same pipeline, same validation, same permissions.
     If external contributors learn a different abstraction than internal code uses,
     you've created an unnecessary seam.

6. Persist before the crash boundary
   → If the process can die at line X, state must be on disk before X.
     User messages saved before the API call. Transcript saved before compaction.
     "I lost my conversation" is an architecture bug, not bad luck.
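A sketch of the ordering (the `persist` callback stands in for an append-only transcript write, e.g. `appendFileSync`; all names are illustrative):

```typescript
// Persist-before-the-crash-boundary: the user message hits durable storage
// before the API call that might kill the process or be interrupted.
async function sendMessage(
  persist: (line: string) => void, // in production: append a line to disk
  message: string,
  callApi: (m: string) => Promise<string>,
): Promise<string> {
  persist(JSON.stringify({ role: "user", message })); // before the boundary
  const reply = await callApi(message); // the process may die here
  persist(JSON.stringify({ role: "assistant", message: reply }));
  return reply;
}
```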

7. Hide latency, don't reduce it
   → Start I/O before you need the result. Preconnect during setup.
     Read files A, B, C concurrently. If every I/O call blocks something,
     you're leaving latency on the table.
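Reading A, B, C concurrently is one `Promise.all` away (a sketch; `read` is any async source):

```typescript
// Start every read before awaiting any of them: total latency is the
// slowest read, not the sum of all reads.
async function readAll(
  read: (path: string) => Promise<string>,
  paths: string[],
): Promise<string[]> {
  return Promise.all(paths.map((p) => read(p)));
}
```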

How to build an MCP tool system with proper security

The Tool System module covers the 10 patterns that make MCP tools first-class citizens — same validation pipeline, same permissions, same lifecycle as built-in tools. Key insight: MCP tools default fail-closed (isConcurrencySafe: false, isReadOnly: false). Omitting a security field is safe, not dangerous.
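The fail-closed default can be sketched in a few lines (the field names mirror the ones above; the resolver itself is an illustrative assumption):

```typescript
// Fail-closed defaults: an omitted security field resolves to the safest
// value, so forgetting to declare something can't widen permissions.
interface ToolSecurity {
  isConcurrencySafe?: boolean;
  isReadOnly?: boolean;
}

function resolveSecurity(decl: ToolSecurity) {
  return {
    isConcurrencySafe: decl.isConcurrencySafe ?? false, // default: serialize
    isReadOnly: decl.isReadOnly ?? false, // default: assume it writes
  };
}
```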

The Permission & Safety module covers the 11 patterns for the full permission model — from six strategy-based modes to the six-layer Bash permission cascade that uses tree-sitter AST analysis (not regex) to catch rm -rf "$VAR".


How to manage context windows without losing state

The Agentic Loop module covers the 5-stage context management pipeline: tool result budgets → history snip → microcompact (per-tool-type retention thresholds) → context collapse → autocompact (forked summarization agent with circuit breaker). Each stage is a pure function. Cheap stages run first. Expensive summarization fires only when everything else fails.
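"Each stage is a pure function, cheap stages run first" can be sketched as a fold that stops as soon as the budget is met (stage names and the character-count budget are illustrative):

```typescript
// Staged context management: each stage is (messages) => messages; the
// next, more expensive stage only runs if the budget is still blown.
type Msg = { role: string; text: string };
type Stage = (msgs: Msg[]) => Msg[];

function runPipeline(stages: Stage[], msgs: Msg[], budget: number): Msg[] {
  for (const stage of stages) {
    const size = msgs.reduce((n, m) => n + m.text.length, 0);
    if (size <= budget) return msgs; // cheap stage already fixed it
    msgs = stage(msgs);
  }
  return msgs;
}

// Two illustrative stages, cheapest first.
const truncateToolResults: Stage = (msgs) =>
  msgs.map((m) => (m.role === "tool" ? { ...m, text: m.text.slice(0, 5) } : m));
const snipHistory: Stage = (msgs) => msgs.slice(-2);
```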


How to orchestrate multiple AI agents

The Agent Orchestration module covers 9 patterns for multi-agent systems. The key pattern: spawning a sub-agent IS running another conversation — the Agent tool calls the same query() function as the main loop. Zero feature drift between parent and child. Also covers: fork cache sharing (N children share one prompt-cache entry), the 50-message queue cap that prevented OOM at 292 concurrent agents, and mailbox-based permission synchronization for swarms.
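The "same `query()`" idea reduces to recursion through one entry point; this sketch (the `SPAWN:` convention, `makeQuery`, the depth cap) is invented for illustration:

```typescript
// Sub-agent spawning as recursion: the Agent tool calls the same query()
// the main loop uses, so parent and child can never drift apart.
type Query = (prompt: string, depth: number) => Promise<string>;

function makeQuery(callModel: (p: string) => Promise<string>): Query {
  const query: Query = async (prompt, depth) => {
    const reply = await callModel(prompt);
    // Model requested a sub-agent: recurse through the same entry point.
    if (reply.startsWith("SPAWN:") && depth < 3) {
      return query(reply.slice("SPAWN:".length), depth + 1);
    }
    return reply;
  };
  return query;
}
```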


How to build a hook and plugin system for AI tools

The Hooks & Extensibility module covers 12 patterns: 26 typed lifecycle events, 6 hook execution types (shell → LLM call → subagent), exit-code-as-contract, TOCTOU-safe snapshot isolation, frontmatter-driven skill configuration, conditional path-based activation, namespaced plugin architecture with impersonation protection, and self-authoring via /skillify.
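The snapshot-isolation idea is small enough to sketch (the `HookConfig` shape is illustrative):

```typescript
// TOCTOU-safe snapshot: deep-copy and freeze hook config at startup, so
// nothing edited or mutated mid-session can change which hooks run.
interface HookConfig {
  event: string;
  command: string;
}

function snapshotHooks(live: HookConfig[]): readonly Readonly<HookConfig>[] {
  return Object.freeze(live.map((h) => Object.freeze({ ...h })));
}
```

Note that `Object.freeze` is shallow, which is why each entry is copied and frozen individually before the array is frozen.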


Quick start

Use as a Claude Code / Codex skill

```sh
git clone https://github.com/cauchyturing/agent-harness-engineering.git
ln -s "$(pwd)/agent-harness-engineering" ~/.claude/skills/agent-harness-engineering
```

Then invoke with /agent-harness-engineering. The skill only loads the 1-2 modules relevant to your current task — context discipline is principle #1.

Works with Claude Code, Codex, and any AI coding assistant that supports markdown skills.

Use as a standalone reference

Each module is self-contained. Read what you need:

| I'm building… | Start here |
| --- | --- |
| An agentic loop | `01-agentic-loop.md` |
| A tool execution system | `03-tool-system.md` |
| A permission model | `05-permission-safety.md` |
| A hook/plugin system | `06-hooks-extensibility.md` |
| Nothing yet — just want to understand | `08-philosophy.md` |

Audit an existing agentic tool

Run the 58-item checklist against your harness:

  • 70% overall, no category < 50% → Minimum viable
  • 90% overall, no category < 75% → Production-grade

Pattern format

Every pattern follows a consistent structure:

```
### N. Pattern Name
Problem:        What engineering challenge does this solve?
Pattern:        The solution in 2-3 sentences.
Implementation: Concrete pseudocode (language-agnostic principles, TypeScript-flavored examples).
Why it works:   The engineering reasoning — not "because best practice", but the actual mechanism.
Anti-pattern:   What to avoid and why — often from real incidents.
See also:       Cross-references to related patterns in other modules.
```

How this was made

8 parallel deep-analysis agents, each specializing in one subsystem of a 512K-line production agentic codebase:

  1. Core Runtime & Bootstrap
  2. Tool System Architecture
  3. Hook & Permission Model
  4. Services & LLM Integration
  5. UI Components & Rendering
  6. Skill & Command System
  7. Bridge, Remote & Task System
  8. Utils & Infrastructure

Raw analysis → 3,100-line synthesis → modular distillation → this repo.

The agents found the patterns. I verified them against production incidents. If a pattern didn't have a real anti-pattern that actually went wrong, it didn't make the cut.


Repo structure

```
agent-harness-engineering/
├── SKILL.md                        # AI skill entry point (routing table)
├── references/
│   ├── 00-architecture.md          #  7 patterns — system design
│   ├── 01-agentic-loop.md          #  8 patterns — the core loop
│   ├── 02-llm-integration.md       #  9 patterns — LLM API layer
│   ├── 03-tool-system.md           # 10 patterns — tool execution
│   ├── 04-agent-orchestration.md   #  9 patterns — multi-agent
│   ├── 05-permission-safety.md     # 11 patterns — security model
│   ├── 06-hooks-extensibility.md   # 12 patterns — extension system
│   ├── 07-ui-infrastructure.md     # 13 patterns — terminal & infra
│   └── 08-philosophy.md            # 12 principles — generative wisdom
└── checklists/
    └── harness-audit.md            # 58-item evaluation checklist
```

Contributing

Patterns must be:

  • Proven — from production code, not whiteboards
  • Generalizable — language/framework agnostic where possible
  • Actionable — pseudocode or it didn't happen
  • Honest — every pattern needs an anti-pattern from a real failure

License

MIT


Built by Stephen — founder of Abel AI, the social-physical engine driven by causal AI.
