Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
Updated Mar 11, 2026 - TypeScript
Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations; capture answers, latency, token usage, tool calls, and file reads; then generate a single comparison report.