An open-source agentic coding harness that learns from its own failures.
Cairn records structured telemetry on every run. An evolution loop reads the telemetry, identifies failure modes, and proposes scaffolding changes — context strategies, tool ordering, retry policies, prompt templates. Each run leaves a marker. The next run reads the pile.
The metaphor: a cairn is a stack of stones placed by prior travelers so later ones know the path is real.
Pre-alpha. Building in public. The first baseline benchmark on SWE-bench Verified lands in the project's first writeup.
- Open-weight-first. Cairn is built to run against any open-weight coding model — Kimi K2.6, Qwen 3.6, DeepSeek V3.2, GLM-5.1, Devstral 2 — via OpenAI-compatible APIs or local backends (Ollama, llama.cpp).
- Observability-driven. Every run produces structured traces: tool calls, context construction, retries, terminations, failures. The evolution loop reads those traces to propose scaffolding changes.
- Reproducible. All benchmark numbers are published with the eval script and the commit that produced them. Self-reported numbers don't count.
The first 90 days, in public:
- Days 0–30: Eval pipeline + baseline. Run an open-weight backbone through a baseline harness on SWE-bench Lite; publish the number, honestly.
- Days 31–60: First evolution loop. A/B-test scaffolding changes against the baseline.
- Days 61–90: Submit to a tracked SWE-bench Verified leaderboard.
See docs/ for design notes and weekly progress.
Cairn is the first project from Unleashed Labs — a US-based AI research lab focused on agentic capability, tool calling, and the harness layer.
MIT.