Recursive Fine-Tuning Stability — a Continual-Learning POC

This is the recursive_finetuning_stability POC. It lives under a continual_learning/poc/ workspace that holds sibling POCs (e.g. live_stream_stability), each its own git repo.

What I'm testing. Whether a coding/research model becomes a measurably better researcher by consolidating its own grounded research experience into its weights — recursively. I take a capable open-weight coding model (the researcher), let it run autonomous coding-research experiments in a real environment that returns a hard number, capture everything it produced (reasoning, code, execution logs, the metric), fine-tune the researcher on those traces, and repeat — V0 → V1 → V2 → … — watching whether the trajectory climbs or collapses.

Hypothesis. Recursive/Karpathy-style automated research keeps accumulated experience in context (notes/RAG carried between experiments). I'm betting you can instead bake that experience into the weights, so it compounds across versions without an ever-growing context — and that a researcher updated this way improves faster than the identical model frozen at V0 that keeps the same experience in-context. The honest scientific question underneath: does interleaving real execution signal into the training data turn recursive self-training from collapse (the default for self-generated data) into improvement?

As the research environment I use Karpathy's autoresearch (the closest public proxy for a self-contained, verifiable coding-research loop): an agent edits one file train.py to minimize val_bpb (validation bits-per-byte, lower = better) of a tiny GPT trained from scratch under a fixed 5-minute, single-GPU budget. The environment's own number is my grounding signal — no external judge/verifier model is ever used.

I own this work: Murali Nandan Nagarapu — Head of Engineering, Nucleus AI · nmn@withnucleus.ai

The one idea that reorganizes everything: there are TWO models

Every system in this space (Karpathy, Recursive) has two models, and conflating them causes endless confusion. Keep them separate:

Role	What it is	In this POC
The researcher	the LLM that reads, reasons, writes code, reads logs, decides the next edit	The thing I fine-tune. Qwen2.5-Coder-7B to start (32B-dense as a strength check; Kimi-K2.7-Code-class as the scale-up).
The artifact	the tiny model the research is about — trained from scratch each experiment, scored by `val_bpb`	The environment/seed. A depth-8 nanochat GPT (~50M params), trained 5 min on 1 GPU. It only exists to emit the grounding number.

Karpathy/Recursive keep the researcher frozen and accumulate learning in its context. My bet is to update the researcher's weights instead. That single inversion is the whole POC.

Where this stands today

Design is converged. Phase 0 (the harness) is done and Phases 1 to 4 are not yet built; this scaffold pins down what gets built so each phase can be handed to its own agent to work out the how.

Phase	What it does	Status
0 — Harness	Vendor `autoresearch`; build the custom ReAct agent loop; reproduce the depth-8 baseline; lock the trace schema; freeze `program.md`	✅ Done
1 — Collect (STEP 1)	Run one round = 48 parallel agent rollouts in a single 5-min wave; capture every trace to Parquet/GCS	⏳
2 — Consolidate (STEP 2)	Turn one round's traces into V_{n+1}: status-aware labels, outcome-stamp, loss-mask, replay, KL-anchor, LoRA	⏳
3 — The recursive loop	Run the arcs × arms + the CTRL/FLOOR baselines, scheduled by priority; auto-detect collapse	⏳
4 — Eval & verdict	Held-out transfer probe + forgetting suite + collapse diagnostics; the hero plot + decomposition table	⏳

I keep the live working state — what's in flight, gotchas, next steps — in HANDOFF.md and handoff/. This README is the stable overview.

Decisions locked so far:

Researcher: dense coder (Qwen2.5-Coder-32B-Instruct, open-Q#1 resolved 2026-06-15; 7B retained as a fast fallback).
Fine-tuning: LoRA on all in-block projections (q,k,v,o,gate,up,down; embeddings/head frozen), r=64/α=64, merged into base each round.
Environment: autoresearch, depth-8, 5-min, frozen program.md.
Training: recursive self-SFT on collected traces with outcome-stamping (the env's own number, deterministic, no judge), status-aware labels, 80/20 replay (last-5-round window) and a KL-anchor to V0.
Experiment: a 2×2 of arcs × data-rules plus two baselines.
Success: a relative trajectory result (fine-tuned curve vs the in-context CTRL curve), not the absolute val_bpb leaderboard.

The experiment design

The loop (one round = one V_n → V_{n+1})

        ┌──────────────────────────────────────────────────────────────────┐
        │  Researcher V_n   (V0 = Qwen2.5-Coder-7B), served via vLLM        │
        └──────────────────────────────┬───────────────────────────────────┘
   STEP 1 — COLLECT                      │
   48 parallel ReAct rollouts, ONE 5-min nanochat training each:
     read FROZEN program.md + pristine train.py → reason → edit → `uv run train.py`
     → capture { reasoning, code, diff, stdout/stderr, exit_code, STATUS, val_bpb,
                 steps, vram, gpu_type, token counts }  → one Parquet row / rollout
                                        │
   STEP 2 — CONSOLIDATE                  │   (the three collapse fixes live here)
     ├─ status-aware  : broken code → status≠success, val_bpb=NaN, NEVER a "good" label
     ├─ outcome-stamp : prepend "<val_bpb=0.943 status=success steps=940>"  (env's own number; raw)
     ├─ accumulate    : train on 80% this round + 20% from last-5 rounds  + KL-anchor to V0 (λ≈0.05)
     └─ LoRA SFT      : loss on [code + exec-logs + stamp]; mask the prompt & the agent's CoT
                        → merge-and-unload → V_{n+1}
                                        │
                                        └──────────── repeat (no fixed V cap) ──────────┘
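
The status-aware and outcome-stamp rules in STEP 2 could be sketched roughly as follows. The stamp layout is copied from the diagram; the function names and the non-success status value `crash` are assumptions made for illustration, not the locked schema.

```python
import math


def format_stamp(status: str, val_bpb: float, steps: int) -> str:
    """Render the outcome stamp that gets prepended to a trace."""
    if status == "success":
        if math.isnan(val_bpb):
            # A successful run must carry a real number from the environment.
            raise ValueError("status=success requires a finite val_bpb")
        bpb_text = f"{val_bpb:.3f}"
    else:
        # Status-aware rule: broken code never carries a metric,
        # whatever number the caller passed in.
        bpb_text = "NaN"
    return f"<val_bpb={bpb_text} status={status} steps={steps}>"


def stamp_trace(trace_text: str, status: str, val_bpb: float, steps: int) -> str:
    """Prepend the stamp on its own line (the separator is an assumption)."""
    return format_stamp(status, val_bpb, steps) + "\n" + trace_text
```

For example, `format_stamp("success", 0.943, 940)` reproduces the stamp shown in the diagram, while any other status is stamped with `val_bpb=NaN`.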

Why these fixes, briefly. Training a model on its own output replaces its data distribution → unbounded error / collapse (Shumailov 2024). Two cheap, judge-free moves flip it: accumulate instead of replace (Gerstgrasser 2024), and use the environment's number to condition (done by string-stamping — Decision-Transformer/RvS style, no model). Execution logs buy stability (the model stops believing broken code ran); the stamp buys improvement.
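
One way to read "accumulate instead of replace" is the sampling sketch below. It assumes the 80/20 split means the final training mix is 80% current-round traces and 20% replay drawn uniformly from the last five rounds; the function name, the uniform sampling, and the fixed seed are illustrative choices.

```python
import random


def build_training_mix(current, history, replay_frac=0.2, window=5, seed=0):
    """Mix this round's traces with replay from recent rounds.

    current: list of traces collected this round.
    history: list of earlier rounds (each a list of traces), oldest first.
    """
    rng = random.Random(seed)
    # Only the most recent `window` rounds are eligible for replay.
    pool = [trace for past_round in history[-window:] for trace in past_round]
    # Solve n_replay / (n_current + n_replay) = replay_frac for n_replay.
    n_replay = round(len(current) * replay_frac / (1.0 - replay_frac))
    n_replay = min(n_replay, len(pool))  # round 1 has nothing to replay
    replay = rng.sample(pool, n_replay)
    mix = list(current) + replay
    rng.shuffle(mix)
    return mix
```

With 48 current traces this yields 12 replayed traces, so the current round makes up 48 of 60 examples.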

The arms (two independent axes) and baselines

Axis 1 — how each round seeds train.py:

Arc A — Reset: start from the pristine train.py every round → cleanly measures researcher skill in the weights (starting point held constant).
Arc B — Continue: start from the best train.py so far → mirrors Recursive/Karpathy; a rising curve is partly the better starting artifact, so it's interpreted via improvement-beyond-the-seed.
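
The two arcs differ only in which train.py a round starts from, which a small seeder can make explicit. This is a sketch under assumptions: the `Candidate` type and both function names are hypothetical, and "improvement beyond the seed" is taken to be the seed's val_bpb minus the round's best val_bpb.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    code: str        # full contents of a train.py
    val_bpb: float   # the environment's score for it (lower is better)


def pick_seed(arc: str, pristine: Candidate, best_so_far: Optional[Candidate]) -> Candidate:
    """Choose the train.py a round starts from."""
    if arc == "A":
        return pristine                      # Reset: constant starting point
    if arc == "B":
        return best_so_far or pristine       # Continue: fall back in round 1
    raise ValueError(f"unknown arc: {arc!r}")


def improvement_beyond_seed(seed_bpb: float, round_best_bpb: float) -> float:
    """Positive when the round beat the train.py it was seeded with."""
    return seed_bpb - round_best_bpb
```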

Axis 2 — what STEP-2 trains on (both are "raw"; no judge model):

Arm S — Stamped: all traces + deterministic outcome stamp. My bet.
Arm F — Filtered: keep only val_bpb-improving traces. The Karpathy/Recursive selection rule — the comparable baseline.

A third rule — pure-Unfiltered (raw traces, no stamp) — was considered and dropped: stamping strictly dominates it (the stamp adds only deterministic ground-truth metadata, zero probabilistic judgment), so S is the raw-but-better version of U. Both S and F stay fully "raw" — no judge model.
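
The difference between the two data rules fits in a few lines. The sketch assumes each trace is a dict with the `status` and `val_bpb` fields from the trace capture, and that "improving" means scoring below the val_bpb of the train.py the round was seeded with; that reference point is an assumption.

```python
def select_for_arm(traces, arm, seed_bpb):
    """Return the traces that STEP 2 trains on for the given arm."""
    if arm == "S":
        # Stamped: keep everything; the stamp tells the truth about each trace.
        return list(traces)
    if arm == "F":
        # Filtered: keep only successful runs that beat the seed.
        # NaN compares False against any number, so broken runs drop out too.
        return [
            t for t in traces
            if t["status"] == "success" and t["val_bpb"] < seed_bpb
        ]
    raise ValueError(f"unknown arm: {arm!r}")
```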

Baselines (no weight update):

CTRL — In-context: frozen V0, experience carried in the prompt (Recursive/Karpathy paradigm). The denominator of the whole thesis: weights beat context ⇔ (Arm-S curve) below (CTRL curve).
FLOOR — No memory: frozen V0, no memory, re-sampled. The flat "nothing improves" line. Cheap (1–2 waves establish it).

The matrix and run priority (arms are independent loops — they diverge after round 1, so each costs its own 48-GPU wave per round; ~1 full-width loop fits the cluster at a time → we sequence):

Priority	Run	Why
1	Arc A+S (gold) + CTRL, concurrent (~28 threads each)	The core thesis: do weights beat context? These two curves are the headline.
2	Arc B+F, then Arc B+S	The continue-arc / "Recursive-with-weights" story
3	Arc A+F	Tight attribution of the gold (stamping vs filtering on the clean arc)
anytime	FLOOR	the floor line, 1–2 waves

What "success" means (and explicitly does NOT)

I am not chasing Recursive's absolute 0.9109 val_bpb — that took a frontier agent + ~10k runs / ~14k GPU-hours of engineered search, all in-context. My claim is relative and within-experiment:

Does fine-tuning bend the researcher's own V0→V_N trajectory below the same researcher frozen with experience in-context (CTRL)? That gap is scale-invariant — it can hold on a 7B and is the thing that would justify the Kimi-scale follow-up where absolute-leaderboard contention lives.

Hero deliverable: best val_bpb vs version, colored by solution diversity, with the community baseline line and a ±2σ band. Decomposition table: headline Δ, CTRL (in-context) Δ, weight-attributable Δ = headline − CTRL, and an overfit penalty = held-out Δ. Held-out probes (depth-12 nanochat transfer + a HumanEval+ subset) separate "better researcher" from "memorized the one fixed task" and catch silent forgetting.
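
The decomposition table reduces to simple arithmetic, sketched below. The sign convention (Δ = starting val_bpb minus final val_bpb, so positive means improvement) and all names are assumptions; only the relation weight-attributable Δ = headline Δ − CTRL Δ comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decomposition:
    headline_delta: float       # fine-tuned arm, V0 -> V_N
    ctrl_delta: float           # frozen V0 with in-context experience
    weight_attributable: float  # headline minus CTRL
    heldout_delta: float        # change on the held-out probe


def decompose(ft_start, ft_end, ctrl_start, ctrl_end, heldout_start, heldout_end):
    """Build one row of the decomposition table from best-val_bpb values."""
    headline = ft_start - ft_end
    ctrl = ctrl_start - ctrl_end
    return Decomposition(
        headline_delta=headline,
        ctrl_delta=ctrl,
        weight_attributable=headline - ctrl,
        heldout_delta=heldout_start - heldout_end,
    )
```

Under this convention, a positive `weight_attributable` corresponds to the Arm-S curve sitting below the CTRL curve.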

Scale & feasibility (what the weekend buys, and what it doesn't)

Knob	POC value	Note
Researcher (V0)	Qwen2.5-Coder-7B (32B-dense strength-check; Kimi-K2.7-class = scale-up)	must be fine-tunable in minutes → rules out the 1T MoE here
Environment	autoresearch depth-8, 5-min, 1 GPU/run	depth-12 = held-out transfer probe
Per round	~25–30 min (48-rollout wave + overlapped LoRA + even-round evals)	a round (V_n→V_{n+1}) ⊃ 48 rollouts; rounds are sequential, rollouts parallel
Horizon	no fixed cap — dashboard-driven stop	literature puts the bootstrap by V2–3 and the plateau/collapse bend at V4–10, so the curve's shape shows early
Cluster	64× H100 (8× a3-mega)	~48 wave + ~serving + ~LoRA + evals = the whole box per full-width loop

Honest gap: the V1000 × 1000-seeds × Kimi-1T vision is a ~10⁴× FLOPs extrapolation this POC does not prove. The POC does answer the three questions that decide whether the vision is worth chasing: (1) does the trajectory climb or collapse, (2) do weights beat context, (3) does the raw/stamped rule suffice vs filtering.

Repository layout

poc/recursive_finetuning_stability/        ← this POC (its own git repo / submodule)
├── README.md                              ← this file (stable human-facing overview)
├── HANDOFF.md  handoff/                    ← agent working canvas + per-phase notes (live state)
├── env/                                    ← vendored Karpathy autoresearch (re-clonable) + the FROZEN program.md
├── scripts/                                ← SLURM/mesh orchestration (launch waves, schedule loops, sync GCS)
└── experiments/                            ← per-phase code lands here (agent loop, consolidation, loop driver, evals)

Third-party reference (gitignored, re-clonable; origin in env/README.md): karpathy/autoresearch — the research environment; only the pinned commit + frozen program.md are tracked.

Data locations

Asset	Location	In git?	In GCS?
Pipeline code, docs, frozen `program.md`, manifest seeds	this repo (NFS `/home`)	✅	✅ (via repo)
nanochat inner-training data + tokenizer (`prepare.py`)	`/mnt/localssd/.../env/data/` (per node)	❌	✅ (re-buildable)
Rollout traces (one Parquet row/rollout) + stdout blobs	`gs://…/recursive_finetuning_stability/traces/<arm>/round-<n>/`	✅ small manifests	✅ bulk
Researcher checkpoints (merged V0,V2,…) + LoRA adapters	`gs://…/recursive_finetuning_stability/checkpoints/`	❌	✅
Metrics parquets + dashboard data + eval results	`gs://…/recursive_finetuning_stability/metrics/`	✅ schema/code	✅ data
HF model cache (Qwen2.5-Coder etc.)	`/mnt/localssd/.hf-home/`	❌	❌ (re-downloadable)

Storage discipline. Git = source of truth for code/docs; GCS = source of truth for bulk data. Reuse the sibling POC's bucket gs://nucleus-continual-learning/ (us-east4, additive/never-delete) under a poc/recursive_finetuning_stability/ prefix. Checkpoint to GCS before each LoRA step so a SLURM preemption never loses a round.

Infra

Cluster, nodes, NFS/SSD/GCS discipline, and conventions live once in HANDOFF.md → Global context (single source of truth); per-phase build specs are in handoff/phase-N-*.md. The one thing to flag here: this runs on the same 64× H100 cluster as live_stream_stability, and a full-width loop needs ~all of it — so coordinate an exclusive (or preemption-checkpointed) node block.

Phase kickoff prompts (for fresh per-phase sessions)

How to use. Each phase is built in its own session at this repo root — paste the matching prompt to start. Every prompt assumes the agent first reads README.md + HANDOFF.md + the phase's handoff/phase-N-*.md, designs the how, keeps that handoff doc current (stamping it), and follows the conventions: no-attribution commits; git = code/docs, GCS = bulk; idempotent/resumable; program.md frozen + hashed; val_bpb only — no judge model. Build order is 0 → 1 → 2 → 3, with Phase-4's V0 baselines run early.

Phase 0 — Harness

We're building the recursive_finetuning_stability POC (continual-learning / recursive self-SFT).
Read README.md, HANDOFF.md, and handoff/phase-0-harness.md fully before doing anything — they hold the
converged design; your job is the HOW for Phase-0 only.

Phase-0 goal: stand up the autoresearch environment and prove ONE agent rollout runs end-to-end into a
valid trace with a real val_bpb on one H100. Deliverables: (1) vendor karpathy/autoresearch under
env/autoresearch/ at a pinned commit, run prepare.py once, reproduce the depth-8 baseline (~0.998 bpb);
(2) build the custom ~400-LOC ReAct agent loop (vLLM-served researcher = Qwen2.5-Coder-7B unless open-Q#1
is resolved to 32B) with a 360s hard timeout + Python syntax pre-check; (3) implement + validate the
locked trace schema (one Parquet row/rollout; status-aware, NaN val_bpb for broken code); (4) freeze +
hash program.md.

Constraints: custom ReAct (NOT a framework — we must capture reasoning/diffs/stdout); val_bpb only, no
judge model; bulk → GCS, code/docs → git; keep handoff/phase-0-harness.md current and stamped. Surface
open-Q#1 (7B vs 32B) if the user hasn't decided. Do not start Phase-1.
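
A minimal sketch of the syntax pre-check and hard timeout named in deliverable (2), using only the standard library. The function names and the status strings `timeout` and `crash` are assumptions; the 360s default comes from the prompt above.

```python
import ast
import subprocess


def is_valid_python(source: str) -> bool:
    """Cheap pre-check: reject edits that do not even parse."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def run_with_timeout(cmd, timeout_s: float = 360.0):
    """Run one experiment and return (status, exit_code, stdout, stderr)."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return ("timeout", None, "", "")
    status = "success" if proc.returncode == 0 else "crash"
    return (status, proc.returncode, proc.stdout, proc.stderr)
```

In the harness the command would presumably be `["uv", "run", "train.py"]`, executed only after `is_valid_python` accepts the edited file.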

Phase 1 — Collect (STEP 1)

We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-1-collect.md fully first. Phase-0 (agent loop + locked trace schema) must exist — build on it.

Phase-1 goal: run ONE round of STEP-1 collection — 48 parallel agent rollouts in a single ~5-min nanochat
wave across 48 H100s — and aggregate into one Parquet shard in GCS with live metrics. Deliverables: SLURM
wave orchestrator (vLLM serving all 48; node-local-write-then-rsync); aggregation to
traces/<arm>/round-<n>.parquet + zstd stdout blobs; a raw reward-hack heuristic pass → anomaly_flags
(flag, don't drop); round_aggregates.parquet (mean/std/median val_bpb, compile/run rates, ast diversity,
reasoning density, data_quality_score).

Constraints: 48 threads (reserve GPUs for serving/LoRA/eval), 1 experiment/thread, round-1 is shared
across all arms (collect once), normalize stdout, GCS-before-train, idempotent/resumable. Keep
handoff/phase-1-collect.md current. Do not start Phase-2.
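
The per-round summary could start from something like the sketch below, which covers only the val_bpb statistics and a run rate. It assumes each row is a dict with `status` and `val_bpb`; the output key names are hypothetical and the remaining columns (AST diversity, reasoning density, data_quality_score) are left out.

```python
import math
import statistics


def round_aggregates(rows):
    """Summarize one round of rollouts; broken runs count toward rates only."""
    scores = [
        r["val_bpb"] for r in rows
        if r["status"] == "success" and not math.isnan(r["val_bpb"])
    ]
    n = len(rows)
    return {
        "n_rollouts": n,
        "run_rate": len(scores) / n if n else 0.0,
        "mean_val_bpb": statistics.fmean(scores) if scores else math.nan,
        "median_val_bpb": statistics.median(scores) if scores else math.nan,
        # Sample standard deviation needs at least two successful runs.
        "std_val_bpb": statistics.stdev(scores) if len(scores) > 1 else math.nan,
    }
```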

Phase 2 — Consolidate (STEP 2)

We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-2-consolidate.md fully first. Build on Phase-1's trace Parquet.

Phase-2 goal: turn ONE round's ~48 traces into V_{n+1} for a given arm — STEP-2, the weight update.
Deliverables: consolidation pipeline (Arm S = all traces + deterministic outcome-stamp; Arm F = only
val_bpb-improving traces); loss-mask collator (train on code+exec-logs+stamp; MASK program.md/base
train.py/agent CoT); 80/20 replay over a 5-round window; KL-anchor to V0 (λ=0.05, 50 fixed prompts);
LoRA fit (r=64/α=64, ALL in-block projections q,k,v,o,gate,up,down, embeddings/head frozen, 2 epochs,
LR re-warm→cosine, wd 0.01, dropout 0.05) → merge_and_unload → V_{n+1}; checkpoint to GCS every 2 rounds;
log train/val loss + LoRA effective rank (SVD).

Constraints: these are the LOCKED defaults (open-Q#4) — change only with justification logged in the phase
doc; never stamp a NaN/broken trace as good (status must tell the truth); unsloth for 7B, torchtune+FSDP
for 32B. Keep handoff/phase-2-consolidate.md current. Do not start Phase-3.
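
The loss-mask collator can be illustrated on plain token lists. This sketch assumes each example arrives as ordered `(role, token_ids)` segments and that the training library shifts labels internally, as Hugging Face causal-LM heads do; the role names are hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

# Roles that receive loss, per the prompt above: code, execution logs, stamp.
TRAINED_ROLES = {"stamp", "code", "exec_log"}


def build_labels(segments):
    """Flatten segments into (input_ids, labels) with masked positions.

    segments: list of (role, token_ids). Any role outside TRAINED_ROLES
    (for example program.md, the base train.py, or the agent's CoT) is
    kept in the input for context but masked out of the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in TRAINED_ROLES:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels
```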

Phase 3 — The recursive loop

We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-3-loop.md fully first. Build on Phase-1 (collect) + Phase-2 (consolidate).

Phase-3 goal: wire the recursive loop V0→V_N and run the experiment matrix. Deliverables: loop driver
(collect→consolidate→V_{n+1}→even-round evals; resumable from GCS; no fixed V cap); arc seeders (A=reset
pristine train.py each round; B=continue from previous-round best, reported as Δ-beyond-seed); CTRL
(frozen V0, experience in-context — the thesis denominator); FLOOR (frozen V0, no memory, 1–2 waves);
priority scheduler honoring the GPU budget (arms are INDEPENDENT loops → ~1 full-width loop at a time →
P1 gold ArcA+S + CTRL concurrent ~28 threads each; P2 ArcB+F then ArcB+S; P3 ArcA+F); collapse auto-pause.

BLOCKER: resolve open-Q#2 (CTRL in-context memory design — see the phase doc) before the priority-1 deep
run; CTRL must match the gold loop on everything except weights-vs-context. Success is RELATIVE (Arm-S
curve vs CTRL curve), NOT the absolute 0.9109 leaderboard. Keep handoff/phase-3-loop.md current.

Phase 4 — Eval & verdict

We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-4-eval.md fully first. This is the measurement spine — start its V0 baselines EARLY, even
before the loop is deep.

Phase-4 goal: build the eval/diagnostics that accept or refute the hypothesis. Deliverables: the three
metric Parquets (thread_round_sample / round_aggregates / version_eval); zero-shot first-attempt harness
(Arc-A signal: no in-context exemplars, pristine train.py, ~16 seeds); held-out probes at V0,V2,V4…
(depth-12 nanochat transfer + HumanEval+ subset); cheap collapse diagnostics every round (AST diversity,
reasoning-token density, compile/run rate, data_quality_score); the dashboard; and the verdict =
decomposition table (headline Δ / CTRL Δ / weight-attributable = headline−CTRL / overfit penalty =
held-out Δ) + the hero plot (best val_bpb vs version, colored by diversity, baseline + ±2σ band).

BLOCKER: resolve open-Q#3 (stop/collapse-alert thresholds — see the phase doc). Telescope expensive evals;
run V0 baselines (depth-8 repro, FLOOR, CTRL start, held-out @ V0) first. Success is RELATIVE, not the
absolute leaderboard. Keep handoff/phase-4-eval.md current.
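
One cheap reading of the AST-diversity diagnostic is the share of structurally distinct programs in a round, sketched below with the standard `ast` module. The definition (unique AST dumps divided by parseable programs) is an assumption, as are the function names.

```python
import ast


def ast_fingerprint(source: str):
    """Structural fingerprint that ignores comments, spacing and line numbers."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return None  # unparseable code has no fingerprint
    # ast.dump omits position attributes by default.
    return ast.dump(tree)


def ast_diversity(sources) -> float:
    """Fraction of parseable programs that are structurally unique."""
    prints = [fp for fp in map(ast_fingerprint, sources) if fp is not None]
    if not prints:
        return 0.0
    return len(set(prints)) / len(prints)
```

A value drifting toward `1 / n` over successive versions would suggest the researcher is converging on a single edit, which is one of the collapse signatures the diagnostics are meant to catch.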
