harness-builder

A minimal, educational AI agent harness — built to learn the plumbing behind tools like Claude Code: dynamic tool calling, progressive disclosure of skills, deferred tool loading, subagents, permission gating, and context compaction.

It is driven by a deterministic mock LLM that speaks the real Anthropic wire format (text / tool_use / tool_result content blocks). The "brain" is faked so runs are free and reproducible — but every other line is authentic harness code. Swapping in a real model would be a single new file implementing the LLM interface.
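As a rough illustration of the swap point described above, here is a minimal sketch of a scripted mock behind an `LLM` interface. The names `LLM`, `complete`, `MockLLM`, and `Scenario` come from this README; the type shapes, and the idea that a scenario is simply an array of canned assistant turns, are assumptions made for the example and may differ from the code in `src/llm/`.

```typescript
// Assumed, simplified shapes modeled on Anthropic-style content blocks.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// The single swap point: a real adapter would implement this with a network call.
interface LLM {
  complete(messages: Message[]): Promise<Message>;
}

// Hypothetical scenario format: one canned assistant turn per model call.
type Scenario = ContentBlock[][];

class MockLLM implements LLM {
  private readonly scenario: Scenario;
  private turn = 0;

  constructor(scenario: Scenario) {
    this.scenario = scenario;
  }

  // Ignores the conversation and replays the next scripted turn,
  // which is what makes every run deterministic.
  async complete(_messages: Message[]): Promise<Message> {
    const content = this.scenario[this.turn];
    if (content === undefined) {
      throw new Error(`scenario exhausted after ${this.turn} turns`);
    }
    this.turn += 1;
    return { role: "assistant", content };
  }
}
```

Because the mock is async even though it does no I/O, a network-backed implementation of the same interface could replace it without the caller changing.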

The goal is to see the machinery. Every demo prints a step-by-step trace of the loop: what was sent, what the model asked for, what ran, and what was fed back.

Quick start

npm install
npm run demo:01      # run one demo
npm run demo:all     # run them all in order
npm run typecheck    # tsc --noEmit

No build step — tsx runs the TypeScript directly. Zero runtime dependencies (the skill frontmatter is hand-parsed so there's no magic).
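To give a sense of what hand-parsing frontmatter can involve, here is a minimal sketch. It assumes skill files start with a `---` delimited block of flat `key: value` lines followed by the body; the function name `parseFrontmatter` and the `ParsedSkill` shape are hypothetical, and the loader in `src/skills/loader.ts` may work differently.

```typescript
interface ParsedSkill {
  meta: Record<string, string>;
  body: string;
}

// Splits a skill file into flat key/value metadata and the remaining body.
// Files without a well-formed frontmatter block are returned as body only.
function parseFrontmatter(source: string): ParsedSkill {
  const lines = source.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") {
    return { meta: {}, body: source };
  }
  // Find the closing delimiter, skipping the opening one at index 0.
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) {
    return { meta: {}, body: source };
  }
  const meta: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const colon = line.indexOf(":");
    if (colon === -1) continue; // skip blank or malformed lines
    // Split on the first colon only, so values may contain colons.
    meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: lines.slice(end + 1).join("\n").trim() };
}
```

A parser this small handles only flat string values, which is enough for a name and a description and keeps the dependency count at zero.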

The one idea

A harness is a small agentic loop (src/core/loop.ts, ~60 lines):

user → [ model → tool_use → execute → tool_result ] → … → model → text

The model never runs anything; it only asks. The harness runs the tools and feeds results back as a user message (that's where tool_result blocks live). Every "feature" below is just a tool or a thin wrapper around this loop — the loop itself never grows.
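The loop described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a copy of `src/core/loop.ts`: `runLoop` is the name used in this README, but its signature here, the `Complete` function type, the `maxTurns` guard, and the `unknown tool` fallback message are all hypothetical.

```typescript
// Assumed, simplified Anthropic-shaped types.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> }
  | { type: "tool_result"; tool_use_id: string; content: string };
type ToolUse = Extract<ContentBlock, { type: "tool_use" }>;

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

interface Tool {
  name: string;
  handler(input: Record<string, unknown>): string | Promise<string>;
}

type Complete = (messages: Message[]) => Promise<Message>;

async function runLoop(
  complete: Complete,
  tools: Map<string, Tool>,
  prompt: string,
  maxTurns = 10, // hypothetical safety limit
): Promise<string> {
  const messages: Message[] = [{ role: "user", content: [{ type: "text", text: prompt }] }];

  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await complete(messages);
    messages.push(reply);

    const calls = reply.content.filter((b): b is ToolUse => b.type === "tool_use");
    if (calls.length === 0) {
      // No tool requests: the text blocks are the final answer.
      return reply.content.map((b) => (b.type === "text" ? b.text : "")).join("");
    }

    // The model only asked; the harness is what actually runs each tool.
    const results: ContentBlock[] = [];
    for (const call of calls) {
      const tool = tools.get(call.name);
      const content = tool ? await tool.handler(call.input) : `unknown tool: ${call.name}`;
      results.push({ type: "tool_result", tool_use_id: call.id, content });
    }

    // Results go back to the model inside a user message.
    messages.push({ role: "user", content: results });
  }
  throw new Error("maxTurns exceeded");
}
```

Note that nothing in the loop knows what any particular tool does, which is why features can be added as tools without the loop growing.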

The demos

| Demo | Concept | What to watch for in the trace |
| --- | --- | --- |
| `demo:01` | Core loop + dynamic tool calling | `tool_use` → `✓ tool` → model answers using the result |
| `demo:02` | Permission gating | `gate [ALLOW]` runs the tool; `gate [DENY]` blocks it before execution |
| `demo:03` | Parallel tool calls | 3 tools requested in one turn; wall-clock ≈ slowest tool, not the sum |
| `demo:04` | Progressive disclosure of skills | system prompt lists skills by description only; `Skill` tool loads the body on demand |
| `demo:05` | Tool search / deferred loading | "tools available" count grows 1→2 after `ToolSearch` registers a match |
| `demo:06` | Subagents / sub-loops | indented nested transcript; parent context stays tiny, child does the work |
| `demo:07` | Context compaction | message count climbs, then a `compaction:` line drops it back under budget |
| `demo:08` | Skill script execution | a skill loads its instructions, then runs its bundled async script |
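The `demo:03` timing behavior (wall-clock close to the slowest tool rather than the sum) is what you get when all handlers are started before any is awaited. The sketch below shows one way to do that with `Promise.all`; whether the harness itself uses `Promise.all` is an assumption, and `sleep`, `slowTool`, and `runParallel` are hypothetical helpers written for this example.

```typescript
type Handler = () => Promise<string>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical slow tool: waits `ms` milliseconds, then returns its label.
const slowTool = (label: string, ms: number): Handler => async () => {
  await sleep(ms);
  return label;
};

// Starts every handler before awaiting any of them.
// Promise.all keeps results in request order, regardless of finish order.
async function runParallel(
  handlers: Handler[],
): Promise<{ results: string[]; elapsedMs: number }> {
  const started = Date.now();
  const results = await Promise.all(handlers.map((handler) => handler()));
  return { results, elapsedMs: Date.now() - started };
}
```

Calling `runParallel` with three tools that take different amounts of time should report an elapsed time near the longest delay. Awaiting each handler in a `for` loop instead would add the delays together.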

Architecture

src/
  core/
    types.ts       Anthropic-shaped Message / ContentBlock / Tool contract (the keystone)
    loop.ts        runLoop(): the agentic loop
    registry.ts    ToolRegistry: register / expose schemas / execute (can grow at runtime)
    trace.ts       step-by-step console observability
  llm/
    types.ts       LLM interface — the single swap point for a real model
    mock.ts        MockLLM: replays a scripted Scenario (deterministic)
  permissions/
    gate.ts        Allow / Deny policy gate (+ an async approval gate)
  skills/
    loader.ts      scan dir, parse frontmatter, lazy-read bodies & scripts
    skillTool.ts   built-in `Skill` tool (discloses instructions)
    skillScriptTool.ts  built-in `run_skill_script` tool (runs bundled scripts)
    skills/        the skill files: *.md (+ wordcount.mjs bundled script)
  deferred/
    catalog.ts     dormant tool pool + keyword search
    toolSearchTool.ts  built-in `ToolSearch` tool (registers matches at runtime)
  subagents/
    spawn.ts       runSubLoop(): a nested loop with context isolation
    agentTool.ts   built-in `Agent` tool
  context/
    compaction.ts  token estimate + summarize-old-turns compactor
scenarios/         one runnable demo per concept (the verification)

Three of the built-in tools mirror real Claude Code mechanisms exactly: Skill (progressive disclosure), ToolSearch (deferred tools), and Agent (subagents). Each is "a normal tool whose handler does something interesting" — Skill reads a file, ToolSearch mutates the registry, Agent recurses into the loop.
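To make "a normal tool whose handler does something interesting" concrete, here is a minimal sketch of a registry that can grow at runtime and a `ToolSearch`-style tool that mutates it. `ToolRegistry` and `ToolSearch` are names from this README; the method names, the `makeToolSearch` factory, the `query` input field, and the substring matching are assumptions for illustration only.

```typescript
interface Tool {
  name: string;
  description: string;
  handler(input: Record<string, unknown>): string | Promise<string>;
}

class ToolRegistry {
  private readonly tools = new Map<string, Tool>();

  register(tool: Tool): void {
    this.tools.set(tool.name, tool);
  }

  names(): string[] {
    return Array.from(this.tools.keys());
  }

  async execute(name: string, input: Record<string, unknown>): Promise<string> {
    const tool = this.tools.get(name);
    if (!tool) return `unknown tool: ${name}`;
    return tool.handler(input);
  }
}

// Hypothetical factory: builds a ToolSearch tool whose handler registers
// every dormant catalog entry that matches the query.
function makeToolSearch(registry: ToolRegistry, catalog: Tool[]): Tool {
  return {
    name: "ToolSearch",
    description: "Find and enable dormant tools by keyword.",
    handler(input) {
      const query = String(input["query"] ?? "").toLowerCase();
      if (query === "") return "no query given";
      const matches = catalog.filter((tool) =>
        `${tool.name} ${tool.description}`.toLowerCase().includes(query),
      );
      for (const tool of matches) registry.register(tool);
      return matches.length > 0
        ? `registered: ${matches.map((tool) => tool.name).join(", ")}`
        : "no matches";
    },
  };
}
```

From the loop's point of view `ToolSearch` is an ordinary tool call that returns a string. The side effect on the registry is what makes more tools visible on the next turn.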

Sync vs. async: the five execution seams

Anywhere the harness executes something, a real implementation does I/O — so these seams are all async-capable (awaited by the loop), even though the mock implementations are often synchronous:

1. LLM completion: `LLM.complete()` (network call)
2. Tool handlers: `Tool.handler` returns `string | Promise<string>` (exercised live in demos 03 & 06)
3. Permission gate: `gate.check()` (a real "ask" mode awaits a human; see `asyncApprovalGate`)
4. Compactor: `maybeCompact()` (real compaction awaits an LLM to write the summary)
5. Skill script execution: `run_skill_script` dynamically imports and runs a bundled script (demo 08)

Designing these async from the start means a real model, real approval prompts, and real scripts all drop in without reworking the loop.
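As an illustration of an async-capable seam, the sketch below shows a synchronous policy gate and an asynchronous approval gate behind one interface, with a caller that awaits both the same way. The `check()` method name and the `gate [DENY]` wording appear in this README; the `Decision` type, the `denyListGate`, `approvalGate`, and `gatedExecute` helpers, and their signatures are assumptions and are not taken from `src/permissions/gate.ts`.

```typescript
type Decision = { allow: true } | { allow: false; reason: string };

interface Gate {
  check(toolName: string, input: Record<string, unknown>): Decision | Promise<Decision>;
}

type Handler = (input: Record<string, unknown>) => string | Promise<string>;

// Synchronous policy gate: denies any tool named on the list.
function denyListGate(denied: string[]): Gate {
  const blocked = new Set(denied);
  return {
    check(toolName): Decision {
      return blocked.has(toolName)
        ? { allow: false, reason: `${toolName} is denied by policy` }
        : { allow: true };
    },
  };
}

// Asynchronous gate: defers to an approver, for example a human prompt.
function approvalGate(ask: (toolName: string) => Promise<boolean>): Gate {
  return {
    async check(toolName): Promise<Decision> {
      const approved = await ask(toolName);
      return approved ? { allow: true } : { allow: false, reason: "approval refused" };
    },
  };
}

// Awaiting works for both kinds of gate: awaiting a plain value just yields it.
async function gatedExecute(
  gate: Gate,
  toolName: string,
  input: Record<string, unknown>,
  handler: Handler,
): Promise<string> {
  const decision = await gate.check(toolName, input);
  if (!decision.allow) return `gate [DENY] ${decision.reason}`;
  return handler(input);
}
```

The caller never needs to know which kind of gate it was given, so replacing a static policy with an interactive prompt touches only the gate.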

Deliberately left out

A real Anthropic adapter (the LLM interface leaves room for one), streaming, a REPL/TUI, and persistence are all excluded to keep the focus on the plumbing.
