A narrative-first benchmark observatory comparing ARC-AGI and Humanity's Last Exam (HLE) across multiple capability dimensions.
Core thesis: capability is rising fast, but efficiency, calibration, and domain robustness are not rising equally.
- How progress differs across ARC-AGI and HLE over release time.
- Do higher ARC scores come from smarter models, higher cost, or both?
- A model is safer when it is both accurate and well-calibrated.
- How much does a model's ARC-AGI score differ from its HLE score?
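The last question reduces to a per-model score difference. A minimal sketch, assuming each record carries hypothetical `model`, `arc_score`, and `hle_score` fields (both scores in percentage points; the field names are illustrative, not the pipeline's actual schema):

```python
def score_gaps(records):
    """Per-model ARC-AGI minus HLE score, in percentage points.

    `records` is a list of dicts with hypothetical keys
    `model`, `arc_score`, and `hle_score`.
    """
    return {r["model"]: r["arc_score"] - r["hle_score"] for r in records}
```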
https://mathias3.github.io/BenchmarkAtlas/
The interactive version has tooltips, legends, and hover details for every data point.
- Run the pipeline: `python -m pipeline.run_pipeline`
- Serve the static site: `python -m http.server 8000`
- Open: http://localhost:8000/site/
- ARC evaluations: https://arcprize.org/media/data/leaderboard/evaluations.json
- ARC models: https://arcprize.org/media/data/models.json
- HLE models API: https://dashboard.safe.ai/api/models
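A minimal sketch of fetching these sources with the Python standard library; the `SOURCES` short names and the `fetch_source` helper are illustrative, not the pipeline's actual API:

```python
import json
import urllib.request

# The three upstream endpoints listed above; short names are illustrative.
SOURCES = {
    "arc_evaluations": "https://arcprize.org/media/data/leaderboard/evaluations.json",
    "arc_models": "https://arcprize.org/media/data/models.json",
    "hle_models": "https://dashboard.safe.ai/api/models",
}

def fetch_source(name, timeout=10.0):
    """Fetch one upstream JSON document by its short name."""
    with urllib.request.urlopen(SOURCES[name], timeout=timeout) as resp:
        return json.load(resp)
```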
- If a network fetch fails, cached files under `data/sources/` are used.
- Model matching across ARC and HLE is handled via `data/model_aliases.json`.
- Phase 2 (subject-level HLE blind spots) is deferred until local eval data is available.
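The cached-fallback and alias-matching notes above can be sketched as follows; the `<name>.json` cache filename convention and the flat alias-table shape are assumptions, not confirmed details of the repo:

```python
import json
import pathlib

CACHE_DIR = pathlib.Path("data/sources")  # cached copies, per the note above

def load_with_fallback(name, fetcher):
    """Try a live fetch; on any failure, fall back to the cached JSON file.

    `fetcher` is any callable(name) -> parsed JSON. The `<name>.json`
    cache filename is an assumed convention.
    """
    try:
        return fetcher(name)
    except Exception:
        return json.loads((CACHE_DIR / f"{name}.json").read_text())

def canonical_name(raw, aliases):
    """Resolve a raw ARC/HLE model name to its canonical form.

    `aliases` is assumed to be a flat {lowercased alias: canonical} dict,
    the presumed shape of data/model_aliases.json.
    """
    return aliases.get(raw.strip().lower(), raw)
```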