Daily modification-survival-rate (MSR) measurement for Almide.
📊 Live dashboard: https://almide.github.io/almide-dojo/
Almide's existence rests on one metric: how often LLM-written code survives modification. Almide Dojo makes that measurement continuous — and feeds the failures back as a backlog for improving Almide's compiler diagnostics and stdlib.
The headline is final pass rate after up to 3 retries — i.e. did the model converge to a passing solution given diagnostic feedback. 1-shot rate is informational only; the real signal is whether the diagnostic loop is good enough for the model to recover. Anything that improves retry-success — clearer diagnostics, better hints, smarter retry prompts, fixed compiler bugs — counts as a win.
- Pass rate over time — per model (final pass after retries; this is the headline)
- Failure breakdown by category — `parse-error` / `type-error` / `name-error` / `import-error` / `runtime-error` / `wrong-output` / `unknown`, per model. The categories tell you why the retry loop didn't converge.
- Top diagnostic codes — which `error[E0xx]` codes most often block LLMs even after 3 retries (drives the diagnostic-improvement backlog in `almide/almide`)
- Pass rate by Almide feature × model — heatmap over `tasks/*/meta.toml` tags, surfacing which language features each model handles vs. trips on
- Per-task results with category, code, and retry count
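A rough sketch, in Python rather than Almide, of how these breakdowns could be derived from a day's per-task results (the record fields are illustrative, not the real `runs/` schema):

```python
from collections import Counter

# Hypothetical per-task result records, mirroring the fields the
# dashboard needs: model, pass/fail, failure category, diagnostic code.
results = [
    {"model": "claude", "passed": True,  "category": None,         "code": None},
    {"model": "claude", "passed": False, "category": "type-error", "code": "E012"},
    {"model": "gpt",    "passed": False, "category": "type-error", "code": "E012"},
    {"model": "gpt",    "passed": True,  "category": None,         "code": None},
]

def pass_rate_by_model(results):
    totals, passes = Counter(), Counter()
    for r in results:
        totals[r["model"]] += 1
        passes[r["model"]] += r["passed"]
    return {m: passes[m] / totals[m] for m in totals}

def failure_breakdown(results):
    # why the retry loop didn't converge, by category
    return Counter(r["category"] for r in results if not r["passed"])

def top_diagnostic_codes(results):
    # which error[E0xx] codes block models most often
    return Counter(r["code"] for r in results if r["code"]).most_common()

print(pass_rate_by_model(results))   # {'claude': 0.5, 'gpt': 0.5}
print(failure_breakdown(results))    # Counter({'type-error': 2})
print(top_diagnostic_codes(results)) # [('E012', 2)]
```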
Every day (once Phase 2 is live):
- LLMs are given tasks from `tasks/` via `prompt.md`
- Their output is compiled with a pinned Almide compiler
- If compilation fails, the diagnostic is fed back and the model retries (up to N times)
- Successful solutions are tested against `tests.almd`
- Results land in `runs/YYYY-MM-DD/`
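The harness itself is written in Almide, but the shape of that loop can be sketched in Python (every function name here is a hypothetical stand-in, not the harness's real API):

```python
def run_task(prompt, compile_fn, test_fn, ask_model, max_retries=3):
    """Hypothetical shape of the Phase-2 loop: ask the model, compile,
    feed the diagnostic back on failure, retry up to max_retries,
    then run the tests on the first compiling solution."""
    code = ask_model(prompt)
    for attempt in range(max_retries + 1):
        ok, diagnostic = compile_fn(code)
        if ok:
            return {"passed": test_fn(code), "retries": attempt}
        if attempt == max_retries:
            break
        # feed the compiler diagnostic back and retry
        code = ask_model(prompt + "\n\nYour last attempt failed:\n" + diagnostic)
    return {"passed": False, "retries": max_retries, "diagnostic": diagnostic}
```

The retry count returned here is exactly the quantity the 1-shot and N-shot metrics below are computed from.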
The signal we care about:
- 1-shot success rate — did the LLM compile without any retry?
- N-shot success rate (N = 2, 3, 5)
- Average retry count per task
- Diagnostics that helped (LLM fixed its code after reading the hint)
- Malicious hints — diagnostics that led the LLM astray
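Under the convention that "N-shot" means passing within N attempts (i.e. at most N − 1 retries), the first three signals reduce to a small fold over per-task retry counts. A Python sketch with hypothetical inputs:

```python
def nshot_rates(retry_counts, ns=(1, 2, 3, 5)):
    """retry_counts: retries-to-success per task, or None if the task
    never passed. N-shot success = passed with at most N - 1 retries."""
    total = len(retry_counts)
    return {n: sum(1 for r in retry_counts if r is not None and r <= n - 1) / total
            for n in ns}

def avg_retries(retry_counts):
    # average retry count over the tasks that eventually passed
    solved = [r for r in retry_counts if r is not None]
    return sum(solved) / len(solved) if solved else None

counts = [0, 0, 1, 2, None]  # two 1-shot passes, one 2-shot, one 3-shot, one never
print(nshot_rates(counts))   # {1: 0.4, 2: 0.6, 3: 0.8, 5: 0.8}
print(avg_retries(counts))   # 0.75
```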
| Path | Purpose |
| --- | --- |
| `almide.toml` | Package manifest (the harness itself is an Almide package) |
| `src/main.almd` | Harness — written in Almide, of course |
| `tasks/` | Task bank (prompts + tests + metadata) |
| `runs/` | Per-day results, committed to git |
| `dashboards/` | Static site for visualizing trends (GitHub Pages) |
| `almide-pin.toml` | Which Almide compiler commit we evaluate against |
| `malicious-hints.md` | Incident log of hint texts that misled models |
The harness is deliberately written in Almide itself — Dojo is the first place that dogfoods Almide for a non-trivial I/O-heavy program (HTTP, fs, process, json). Every line of the harness is another data point for the language it tests.
Requires the `claude` CLI to be installed and authenticated. The harness calls it via `process.exec`, so no API-key handling is needed in the harness itself.
```shell
# single task
almide run src/main.almd -- fizzbuzz

# all tasks, writes runs/YYYY-MM-DD/summary.md
almide run src/main.almd -- all
```

Basic (15 tasks, < 20 LOC) — single function, core language features:
fizzbuzz, factorial, fibonacci, gcd, is-prime, is-palindrome, string-reverse, sum-digits, count-vowels, clamp, max-of-list, list-sum, title-case, repeat-string, remove-duplicates
Intermediate (10 tasks, 20–80 LOC) — multiple functions, stdlib composition:
caesar-cipher, roman-numeral, run-length-encoding, word-count, balanced-parens, anagram-check, binary-search, flatten-nested, partition-list, zip-with
Advanced (5 tasks, > 80 LOC) — custom ADTs, pattern matching, error handling:
expression-eval, custom-linked-list, result-pipeline, mini-json-query, matrix-ops
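The `meta.toml` schema isn't spelled out in this README; a plausible example for a basic task, with every field name illustrative rather than the real schema, might look like:

```toml
# tasks/fizzbuzz/meta.toml — hypothetical fields, not the actual schema
name = "fizzbuzz"
tier = "basic"
# feature tags like these would drive the pass-rate-by-feature heatmap
tags = ["pattern-matching", "ranges", "string-formatting"]
```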
Phase 3 — 30-task bank with three difficulty tiers, harness searches across basic/, intermediate/, advanced/ directories. Next: add GitHub Actions daily workflow, build the dashboards.
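That directory search can be as simple as walking the three tier directories. A Python sketch, assuming a `tasks/<tier>/<task>/prompt.md` layout (the exact layout isn't documented here):

```python
from pathlib import Path

def discover_tasks(root="tasks"):
    """Collect (tier, task-name) pairs from the three tier directories.
    Assumes each task is a directory containing a prompt.md."""
    found = []
    for tier in ("basic", "intermediate", "advanced"):
        tier_dir = Path(root) / tier
        if not tier_dir.is_dir():
            continue
        for task_dir in sorted(p for p in tier_dir.iterdir() if p.is_dir()):
            if (task_dir / "prompt.md").exists():
                found.append((tier, task_dir.name))
    return found
```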
See docs/roadmap/active/almide-dojo.md in the main Almide repo for the full roadmap.
MIT — see LICENSE.