A library for deciding whether an LLM-driven generator did its job.
You hand it the thing the generator produced — a code scaffold, a patch, a tweet, a JSON config — and you get back a structured verdict: pass/fail, dimension scores, plain-English rationale. Built to catch the LLM failure modes that LLM-as-judge alone misses.
```ts
import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'

const scaffoldDir = './my-app' // wherever your generator wrote its output

const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
await session.startChat()

const ship = await session.ship({
  harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
})

console.log(ship.result.passed, ship.result.score)
```

This is for you if:

- You ship a code generator (scaffolder, patcher, refactor agent) and need to gate on whether its output actually works.
- You ship a content generator and need quality signal beyond "the LLM said it's good".
- You want a release gate that fails on regressions you can name, not vibes.
If that's you, start with docs/concepts.md — 5-minute mental model — then come back here.
The fastest path. agent-eval ships a CLI that runs as either an HTTP server or a stdio RPC binary. Drive it from Python, Rust, Go, anything.
```sh
npm i -g @tangle-network/agent-eval

# HTTP — long-running
agent-eval serve --port 5005

# stdio RPC — one-shot, batch
echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge
```

Python:

```sh
pip install tangle-agent-eval
```

```python
from tangle_agent_eval import Client

c = Client()
r = c.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
print(r.composite, r.failure_modes)
```

See docs/wire-protocol.md for the full surface.
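If your caller is a Node script that shouldn't take the SDK as a dependency (a CI gate, a batch job), the same stdio RPC surface can be driven through a child process. A minimal sketch, assuming the reply arrives as a single JSON object on stdout; the exact response shape is in docs/wire-protocol.md:

```ts
import { execFileSync } from 'node:child_process'

// Pipe one judge request into the stdio RPC binary and parse the reply.
// Assumption: the reply is one JSON object on stdout; verify against
// docs/wire-protocol.md before relying on this.
const request = JSON.stringify({ rubricName: 'anti-slop', content: 'our scaffold ships zero-copy IO' })
const stdout = execFileSync('agent-eval', ['rpc', 'judge'], { input: request, encoding: 'utf8' })
const verdict = JSON.parse(stdout)
console.log(verdict)
```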
In-process; no wire round-trip. Use this when your eval lives in the same Node process as your generator.
```sh
pnpm add @tangle-network/agent-eval
```

The recipe for a code-generator eval is in SKILL.md §Minimal working path.
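For content output, the in-process path goes through the judge factories listed in the module map below. A minimal sketch, assuming `createAntiSlopJudge()` needs no required options and returns a judge with an async `judge({ content })` method; treat both assumptions as placeholders and take the real signatures from SKILL.md:

```ts
import { createAntiSlopJudge } from '@tangle-network/agent-eval'

// Assumed surface: a no-argument factory and an async judge() call.
// Check SKILL.md for the actual signature before copying this.
const judge = createAntiSlopJudge()
const verdict = await judge.judge({ content: 'our scaffold ships zero-copy IO' })
console.log(verdict)
```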
- You're a human onboarding — read docs/concepts.md for the mental model, then docs/wire-protocol.md if you'll call from another language, or SKILL.md if you'll embed in TS.
- You're an LLM agent writing integration code — read SKILL.md. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
| Module | What it does | Doc |
|---|---|---|
| `BuilderSession` | Three-layer eval orchestrator (builder → app-build → app-runtime) for code generators. | concepts.md §three-layer eval |
| `MultiLayerVerifier` | Pipeline of layers (install → typecheck → build → semantic). Skip-on-fail, weighted aggregate (sketched below). | concepts.md §verifiers |
| `judges`, `createCustomJudge`, `createAntiSlopJudge` | LLM and deterministic judges. | SKILL.md |
| Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
| `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
| `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
| `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
| Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
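The skip-on-fail, weighted-aggregate behavior in `MultiLayerVerifier` is worth spelling out. The sketch below is conceptual, not the library's implementation: once a layer fails, later layers never run, and the aggregate score is weight-averaged over the layers that did run. The `LayerResult` shape is hypothetical; the real API is in concepts.md §verifiers.

```ts
// Hypothetical result shape for one verification layer.
interface LayerResult {
  name: string
  weight: number
  passed: boolean
  score: number // 0..1
}

// Conceptual skip-on-fail weighted aggregation over the layers that ran.
function aggregate(layers: LayerResult[]): { passed: boolean; score: number } {
  let weightedSum = 0
  let totalWeight = 0
  for (const layer of layers) {
    weightedSum += layer.weight * layer.score
    totalWeight += layer.weight
    // Skip-on-fail: a failing layer ends the pipeline, so the aggregate
    // only covers the layers reached so far.
    if (!layer.passed) return { passed: false, score: totalWeight > 0 ? weightedSum / totalWeight : 0 }
  }
  return { passed: true, score: totalWeight > 0 ? weightedSum / totalWeight : 0 }
}
```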
- TypeScript strict, no semicolons, single quotes, 2-space indent
- `tsup` for bundling, `vitest` for tests
- `@tangle-network/tcloud` for LLM calls (judges, driver)
- `hono` + `@asteasolutions/zod-to-openapi` for the wire protocol
```sh
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi  # write dist/openapi.json from the wire schemas

# Run the server locally
node dist/cli.js serve --port 5005

# Python client tests (require pnpm build first)
cd clients/python && pip install -e ".[dev]" && pytest
```

`@tangle-network/agent-eval` (npm) and `tangle-agent-eval` (PyPI) ship from the same git tag in the same CI workflow. If either fails to publish, neither does. Versions are locked.
MIT