Almide Dojo

Daily modification-survival-rate (MSR) measurement for Almide.

📊 Live dashboard: https://almide.github.io/almide-dojo/

Almide's existence rests on one metric: how often LLM-written code survives modification. Almide Dojo makes that measurement continuous — and feeds the failures back as a backlog for improving Almide's compiler diagnostics and stdlib.

The metric that matters

The headline metric is the final pass rate after up to 3 retries: did the model converge to a passing solution given diagnostic feedback? The 1-shot rate is informational only; the real signal is whether the diagnostic loop is good enough for the model to recover. Anything that improves retry success (clearer diagnostics, better hints, smarter retry prompts, fixed compiler bugs) counts as a win.

What the dashboard shows

  • Pass rate over time — per model (final pass after retries; this is the headline)
  • Failure breakdown by category, per model: parse-error / type-error / name-error / import-error / runtime-error / wrong-output / unknown. The categories tell you why the retry loop didn't converge.
  • Top diagnostic codes — which error[E0xx] codes are most often blocking LLMs even after 3 retries (drives the diagnostic-improvement backlog in almide/almide)
  • Pass rate by Almide feature × model — heatmap over tasks/*/meta.toml tags, surfacing which language features each model handles vs. trips on
  • Per-task results with category, code, and retry count
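
The two aggregate views above (failure breakdown per model, and pass rate by feature × model) can be sketched as plain counting over per-task results. This is a minimal illustration in Python, not the dashboard's actual code: the `Result` record, its field names, and the example tags are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Result:
    """One task result for one model (hypothetical shape, not the real schema)."""
    model: str
    task: str
    tags: List[str]           # feature tags, assumed to be read from meta.toml
    passed: bool              # final pass after retries
    category: Optional[str]   # e.g. "type-error"; None when the task passed


def failure_breakdown(results: List[Result]) -> Dict[str, Counter]:
    """Count failed tasks per model and failure category."""
    breakdown: Dict[str, Counter] = defaultdict(Counter)
    for r in results:
        if not r.passed:
            # A failure with no recorded category is counted as "unknown".
            breakdown[r.model][r.category or "unknown"] += 1
    return dict(breakdown)


def pass_rate_by_feature(results: List[Result]) -> Dict[Tuple[str, str], float]:
    """Pass rate for each (feature tag, model) cell of the heatmap."""
    totals: Counter = Counter()
    passes: Counter = Counter()
    for r in results:
        for tag in r.tags:
            totals[(tag, r.model)] += 1
            passes[(tag, r.model)] += int(r.passed)
    return {cell: passes[cell] / totals[cell] for cell in totals}
```

A task with several tags contributes to every matching heatmap cell, so one hard task can lower the rate for more than one feature.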

What happens here

Every day (once Phase 2 is live):

  1. LLMs are given tasks from tasks/ via prompt.md
  2. Their output is compiled with a pinned Almide compiler
  3. If compilation fails, the diagnostic is fed back and the model retries (up to N times)
  4. Successful solutions are tested against tests.almd
  5. Results land in runs/YYYY-MM-DD/
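
The compile-and-retry loop in steps 2 to 4 can be sketched as follows. The real harness is written in Almide; this Python version only illustrates the control flow. The `generate`, `compile_source`, and `run_tests` callables are hypothetical stand-ins for the model call, the pinned compiler, and the tests.almd run. The sketch also assumes that only compile failures trigger a retry, which is how the steps above read; a failed test run ends the task.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    """Outcome of compiling one candidate solution (hypothetical shape)."""
    ok: bool
    diagnostic: str = ""  # compiler output fed back to the model on failure


@dataclass
class TaskResult:
    passed: bool          # compiled and passed the tests
    retries: int          # retries used (0 means the first attempt compiled)
    last_diagnostic: str  # last compiler diagnostic, empty if it compiled


def run_task(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],  # (prompt, diagnostic) -> source
    compile_source: Callable[[str], Attempt],
    run_tests: Callable[[str], bool],
    max_retries: int = 3,
) -> TaskResult:
    diagnostic: Optional[str] = None
    for attempt_index in range(max_retries + 1):
        source = generate(prompt, diagnostic)
        attempt = compile_source(source)
        if attempt.ok:
            # Step 4: only solutions that compile are run against the tests.
            return TaskResult(run_tests(source), attempt_index, "")
        # Step 3: feed the diagnostic back into the next attempt.
        diagnostic = attempt.diagnostic
    return TaskResult(False, max_retries, diagnostic or "")
```

Passing the three stages in as callables keeps the loop testable without a model or a compiler on hand.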

The signal we care about:

  • 1-shot success rate — did the LLM compile without any retry?
  • N-shot success rate (N = 2, 3, 5)
  • Average retry count per task
  • Diagnostics that helped (LLM fixed its code after reading the hint)
  • Malicious hints — diagnostics that led the LLM astray
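
The first three signals above reduce to simple arithmetic over per-task outcomes. The sketch below is illustrative Python under stated assumptions: the `Outcome` record is hypothetical, "N-shot" is taken to mean "passed within N attempts in total", and tasks that never passed are charged the full retry budget when averaging.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Outcome:
    """Per-task outcome (hypothetical shape)."""
    task: str
    retries_to_pass: Optional[int]  # retries used before passing; None if never passed


def n_shot_rate(outcomes: Sequence[Outcome], n: int) -> float:
    """Fraction of tasks that passed within n attempts (n = 1 is the 1-shot rate)."""
    if not outcomes:
        return 0.0
    passed = sum(
        1 for o in outcomes
        # n attempts means at most n - 1 retries.
        if o.retries_to_pass is not None and o.retries_to_pass < n
    )
    return passed / len(outcomes)


def average_retries(outcomes: Sequence[Outcome], max_retries: int) -> float:
    """Mean retries per task; tasks that never passed count as max_retries."""
    if not outcomes:
        return 0.0
    total = sum(
        o.retries_to_pass if o.retries_to_pass is not None else max_retries
        for o in outcomes
    )
    return total / len(outcomes)
```

For example, with four tasks that passed after 0, 1, and 2 retries and one that never passed, the 1-shot rate is 0.25, the 3-shot rate is 0.75, and the average retry count with a budget of 3 is 1.5.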

Structure

almide.toml          Package manifest (the harness itself is an Almide package)
src/main.almd        Harness — written in Almide, of course
tasks/               Task bank (prompts + tests + metadata)
runs/                Per-day results, committed to git
dashboards/          Static site for visualizing trends (GitHub Pages)
almide-pin.toml      Which Almide compiler commit we evaluate against
malicious-hints.md   Incident log of hint texts that misled models

The harness is deliberately written in Almide itself — Dojo is the first place that dogfoods Almide for a non-trivial I/O-heavy program (HTTP, fs, process, json). Every line of the harness is another data point for the language it tests.

Running the harness

Requires the claude CLI to be installed and authenticated. The harness calls it via process.exec, so no API key handling is needed in the harness itself.

# single task
almide run src/main.almd -- fizzbuzz

# all tasks, writes runs/YYYY-MM-DD/summary.md
almide run src/main.almd -- all

Task bank (30 tasks)

Basic (15 tasks, < 20 LOC) — single function, core language features: fizzbuzz, factorial, fibonacci, gcd, is-prime, is-palindrome, string-reverse, sum-digits, count-vowels, clamp, max-of-list, list-sum, title-case, repeat-string, remove-duplicates

Intermediate (10 tasks, 20–80 LOC) — multiple functions, stdlib composition: caesar-cipher, roman-numeral, run-length-encoding, word-count, balanced-parens, anagram-check, binary-search, flatten-nested, partition-list, zip-with

Advanced (5 tasks, > 80 LOC) — custom ADTs, pattern matching, error handling: expression-eval, custom-linked-list, result-pipeline, mini-json-query, matrix-ops

Current phase

Phase 3: a 30-task bank with three difficulty tiers; the harness searches the basic/, intermediate/, and advanced/ directories. Next: add the GitHub Actions daily workflow and build the dashboards.

See docs/roadmap/active/almide-dojo.md in the main Almide repo for the full roadmap.

License

MIT — see LICENSE.
