Skip to content

mrazakhan/ContinuumAI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 

Repository files navigation

ContinuumAI

Solving amnesia for coding agents. An open-weight stack with a learned-from-failure skill loop, measured at roughly 80 % of the top frontier model's score on Terminal-Bench 2.1 — at a fraction of the inference cost.

Read the launch post

BLOG.mdAI agents are amnesiacs. First post in a three-part series. Covers the amnesia problem, the skill-generation loop, the Terminal-Bench 2.1 measurement, and what comes next.

Headline result

Stack TB-2.1 score Lift
Claude Opus 4.8 (TB-2.1 leaderboard #2, frontier reference) 78.9 %
GLM-5.1 baseline (no skill) 59.4 %
GLM-5.1 + GLM-authored skill 62.8 % +3.4 %
GLM-5.1 + Opus-authored skill 64.0 % +4.6 %

89 long-horizon coding and systems tasks; K=3 attempts per task; 87 measured under every condition. Skills are generated only from failed runs, not from worked solutions — no solution leakage between skill authoring and measurement.

The series

Post Topic Status
Post 1 The amnesia problem and the accuracy measurement Live in BLOG.md
Post 2 Cost economics: per-attempt and per-passed-task dollars Coming
Post 3 Inside individual SKILL.md files — what each one teaches, and the regression cases Coming after Post 2

The mechanism in one paragraph

After each agent session ends, a Reflector LLM call reads the session trajectory and extracts the single load-bearing lesson from the failure. A SkillCreator LLM call writes that lesson as a short Markdown file (SKILL.md). The next related session loads the file before starting. The library grows by one per session — no model weights touched, no fine-tuning, no RL, no vector store, no human curation step. Skills are plain Markdown files that work with any model your routing layer (OpenRouter, AWS Bedrock, a direct provider) hands the next request to.

About

Solving the amnesiac problem for LLM agents. Research series on agents that compound knowledge across sessions — first measurement: +4.6 pp accuracy lift on Terminal-Bench 2.1 with an open-weight executor and a single failure-derived skill file.

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors