This is the recursive_finetuning_stability POC. It lives under a continual_learning/poc/ workspace that holds sibling POCs (e.g. live_stream_stability), each its own git repo.
What I'm testing. Whether a coding/research model becomes a measurably better researcher by consolidating its own grounded research experience into its weights — recursively. I take a capable open-weight coding model (the researcher), let it run autonomous coding-research experiments in a real environment that returns a hard number, capture everything it produced (reasoning, code, execution logs, the metric), fine-tune the researcher on those traces, and repeat — V0 → V1 → V2 → … — watching whether the trajectory climbs or collapses.
Hypothesis. Recursive/Karpathy-style automated research keeps accumulated experience in context (notes/RAG carried between experiments). I'm betting you can instead bake that experience into the weights, so it compounds across versions without an ever-growing context — and that a researcher updated this way improves faster than the identical model frozen at V0 that keeps the same experience in-context. The honest scientific question underneath: does interleaving real execution signal into the training data turn recursive self-training from collapse (the default for self-generated data) into improvement?
As the research environment I use Karpathy's autoresearch (the closest public proxy for a
self-contained, verifiable coding-research loop): an agent edits one file train.py to minimize
val_bpb (validation bits-per-byte, lower = better) of a tiny GPT trained from scratch under a
fixed 5-minute, single-GPU budget. The environment's own number is my grounding signal — no
external judge/verifier model is ever used.
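For readers unfamiliar with the metric, the sketch below shows the standard definition of bits-per-byte: the summed negative log-likelihood over the validation text, converted from nats to bits and divided by the byte count. The function name and arguments are illustrative assumptions; the environment's own computation is not shown in this README and may differ in detail.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    Hypothetical helper for illustration; not taken from autoresearch.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes
    # away the tokenizer, so the score depends only on the raw text length.
    return total_nll_nats / (math.log(2) * total_bytes)


# A model that spends exactly ln(2) nats per byte scores 1 bit per byte.
example = bits_per_byte(math.log(2) * 1000, 1000)
```

Because the denominator is bytes rather than tokens, the number stays comparable across edits that change the tokenizer or vocabulary.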
I own this work: Murali Nandan Nagarapu — Head of Engineering, Nucleus AI · nmn@withnucleus.ai
Every system in this space (Karpathy, Recursive) has two models, and conflating them causes endless confusion. Keep them separate:
| Role | What it is | In this POC |
|---|---|---|
| The researcher | the LLM that reads, reasons, writes code, reads logs, decides the next edit | The thing I fine-tune. Qwen2.5-Coder-7B to start (32B-dense as a strength check; Kimi-K2.7-Code-class as the scale-up). |
| The artifact | the tiny model the research is about, trained from scratch each experiment, scored by val_bpb | The environment/seed. A depth-8 nanochat GPT (~50M params), trained 5 min on 1 GPU. It only exists to emit the grounding number. |
Karpathy/Recursive keep the researcher frozen and accumulate learning in its context. My bet is to update the researcher's weights instead. That single inversion is the whole POC.
Design is converged; no code written yet — this scaffold crystallizes what gets built so each phase can be handed to its own agent for how.
| Phase | What it does | Status |
|---|---|---|
| 0 — Harness | Vendor autoresearch; build the custom ReAct agent loop; reproduce the depth-8 baseline; lock the trace schema; freeze program.md | ✅ Done |
| 1 — Collect (STEP 1) | Run one round = 48 parallel agent rollouts in a single 5-min wave; capture every trace to Parquet/GCS | ⏳ |
| 2 — Consolidate (STEP 2) | Turn one round's traces into V_{n+1}: status-aware labels, outcome-stamp, loss-mask, replay, KL-anchor, LoRA | ⏳ |
| 3 — The recursive loop | Run the arcs × arms + the CTRL/FLOOR baselines, scheduled by priority; auto-detect collapse | ⏳ |
| 4 — Eval & verdict | Held-out transfer probe + forgetting suite + collapse diagnostics; the hero plot + decomposition table | ⏳ |
I keep the live working state — what's in flight, gotchas, next steps — in
HANDOFF.md and handoff/. This README is the stable overview.
Decisions locked so far: researcher = dense coder (Qwen2.5-Coder-32B-Instruct, open-Q#1 resolved
2026-06-15; 7B retained as a fast fallback), fine-tuned
by LoRA on all in-block projections (q,k,v,o,gate,up,down; embeddings/head frozen), r=64/α=64,
merged into base each round; environment = autoresearch, depth-8, 5-min, frozen program.md;
training = recursive self-SFT on collected traces with outcome-stamping (the env's own number,
deterministic, no judge), status-aware labels, 80/20 replay (last-5-round window) and a
KL-anchor to V0; the experiment is a 2×2 of arcs × data-rules plus two baselines; success is a
relative trajectory result (fine-tuned curve vs the in-context CTRL curve), not the absolute
val_bpb leaderboard.
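A minimal sketch of the 80/20 replay mix, under two assumptions that this README does not pin down: "80/20" is read as shares of the final training set (all current-round traces kept, replay sized to 20% of the total), and replay traces are sampled uniformly from the last five rounds. The function and argument names are hypothetical.

```python
import random
from typing import List, Sequence


def build_training_mix(
    current: Sequence,
    history: Sequence[Sequence],
    replay_fraction: float = 0.2,
    window: int = 5,
    seed: int = 0,
) -> List:
    """Mix this round's traces with replay from the most recent rounds.

    `history` holds earlier rounds, oldest first; only the last `window`
    rounds are eligible for replay.
    """
    rng = random.Random(seed)
    pool = [trace for rnd in history[-window:] for trace in rnd]
    if not pool or not current:
        return list(current)
    # Solve n_replay / (n_current + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(current) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(pool))
    mix = list(current) + rng.sample(pool, n_replay)
    rng.shuffle(mix)
    return mix
```

With 48 traces in the current round this yields 12 replayed traces, so the mix is 60 examples at an 80/20 split. In round 1 there is no history and the function returns the current traces unchanged.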
┌──────────────────────────────────────────────────────────────────┐
│ Researcher V_n (V0 = Qwen2.5-Coder-7B), served via vLLM │
└──────────────────────────────┬───────────────────────────────────┘
STEP 1 — COLLECT │
48 parallel ReAct rollouts, ONE 5-min nanochat training each:
read FROZEN program.md + pristine train.py → reason → edit → `uv run train.py`
→ capture { reasoning, code, diff, stdout/stderr, exit_code, STATUS, val_bpb,
steps, vram, gpu_type, token counts } → one Parquet row / rollout
│
STEP 2 — CONSOLIDATE │ (the three collapse fixes live here)
├─ status-aware : broken code → status≠success, val_bpb=NaN, NEVER a "good" label
├─ outcome-stamp : prepend "<val_bpb=0.943 status=success steps=940>" (env's own number; raw)
├─ accumulate : train on 80% this round + 20% from last-5 rounds + KL-anchor to V0 (λ≈0.05)
└─ LoRA SFT : loss on [code + exec-logs + stamp]; mask the prompt & the agent's CoT
→ merge-and-unload → V_{n+1}
│
└──────────── repeat (no fixed V cap) ──────────┘
Why these fixes, briefly. Training a model on its own output replaces its data distribution → unbounded error / collapse (Shumailov 2024). Two cheap, judge-free moves flip it: accumulate instead of replace (Gerstgrasser 2024), and use the environment's number to condition (done by string-stamping — Decision-Transformer/RvS style, no model). Execution logs buy stability (the model stops believing broken code ran); the stamp buys improvement.
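The stamping rule can be sketched as a plain string function. The stamp format follows the example in the diagram above; the non-success status values (`"crash"`, `"timeout"`) and the choice to raise on an inconsistent row are assumptions made for illustration, not the locked trace schema.

```python
import math


def outcome_stamp(val_bpb: float, status: str, steps: int) -> str:
    """Render the deterministic outcome stamp for one rollout.

    Status-aware: only a successful run may carry a numeric val_bpb.
    """
    if status == "success":
        if val_bpb is None or math.isnan(val_bpb):
            # A "success" row without a score is inconsistent; refuse to stamp it.
            raise ValueError("successful trace must have a finite val_bpb")
        score = val_bpb
    else:
        # Broken or timed-out code never gets a score, whatever was logged.
        score = float("nan")
    return f"<val_bpb={score:.3f} status={status} steps={steps}>"


def stamp_trace(trace_text: str, val_bpb: float, status: str, steps: int) -> str:
    """Prepend the stamp so the model can condition on the outcome."""
    return outcome_stamp(val_bpb, status, steps) + "\n" + trace_text
```

Because the stamp is built only from fields the environment already emitted, the same trace always produces the same stamp, and no model is involved in labelling.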
Axis 1 — how each round seeds train.py:
- Arc A — Reset: start from the pristine train.py every round → cleanly measures researcher skill in the weights (starting point held constant).
- Arc B — Continue: start from the best train.py so far → mirrors Recursive/Karpathy; a rising curve is partly the better starting artifact, so it's interpreted via improvement-beyond-the-seed.
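The two seeding rules can be sketched as follows. This assumes the loop keeps a list of `(val_bpb, train_py_source)` pairs for successful runs; that bookkeeping structure and both function names are hypothetical.

```python
from typing import List, Tuple


def pick_seed(arc: str, pristine_src: str, successes: List[Tuple[float, str]]) -> str:
    """Return the train.py source that a round's rollouts start from."""
    if arc == "A":
        # Reset: the starting point is held constant across rounds.
        return pristine_src
    if arc == "B":
        # Continue: start from the lowest-val_bpb artifact seen so far,
        # falling back to the pristine file before any success exists.
        if not successes:
            return pristine_src
        return min(successes, key=lambda pair: pair[0])[1]
    raise ValueError(f"unknown arc: {arc!r}")


def delta_beyond_seed(seed_bpb: float, round_best_bpb: float) -> float:
    """Improvement over the seed; positive means the round beat its seed."""
    return seed_bpb - round_best_bpb
```

For Arc B, reporting `delta_beyond_seed` rather than the raw best val_bpb separates what the researcher added this round from what it inherited through the seed.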
Axis 2 — what STEP-2 trains on (both are "raw"; no judge model):
- Arm S — Stamped: all traces + deterministic outcome stamp. My bet.
- Arm F — Filtered: keep only val_bpb-improving traces. The Karpathy/Recursive selection rule — the comparable baseline.
A third rule — pure-Unfiltered (raw traces, no stamp) — was considered and dropped: stamping strictly dominates it (the stamp adds only deterministic ground-truth metadata, zero probabilistic judgment), so S is the raw-but-better version of U. Both S and F stay fully "raw" — no judge model.
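As a rough illustration of how the two data rules differ, the sketch below selects traces per arm. It assumes each trace is a dict with `status` and `val_bpb` keys and that "improving" means beating a reference score such as the seed's val_bpb; the README does not fix that reference, so `baseline_bpb` is an assumption.

```python
import math
from typing import Dict, List


def select_traces(traces: List[Dict], arm: str, baseline_bpb: float) -> List[Dict]:
    """Choose which of a round's traces feed STEP-2 for a given arm."""
    if arm == "S":
        # Stamped: keep everything, failures included; the stamp tells the truth.
        return list(traces)
    if arm == "F":
        # Filtered: keep only successful runs that beat the reference score.
        return [
            t for t in traces
            if t["status"] == "success"
            and not math.isnan(t["val_bpb"])
            and t["val_bpb"] < baseline_bpb
        ]
    raise ValueError(f"unknown arm: {arm!r}")
```

Arm S therefore always trains on the full round, while Arm F can shrink to very few examples in rounds where little improves.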
Baselines (no weight update):
- CTRL — In-context: frozen V0, experience carried in the prompt (Recursive/Karpathy paradigm). The denominator of the whole thesis: weights beat context ⇔ (Arm-S curve) below (CTRL curve).
- FLOOR — No memory: frozen V0, no memory, re-sampled. The flat "nothing improves" line. Cheap (1–2 waves establish it).
The matrix and run priority (arms are independent loops — they diverge after round 1, so each costs its own 48-GPU wave per round; ~1 full-width loop fits the cluster at a time → we sequence):
| Priority | Run | Why |
|---|---|---|
| 1 | Arc A+S (gold) + CTRL, concurrent (~28 threads each) | The core thesis: do weights beat context? These two curves are the headline. |
| 2 | Arc B+F, then Arc B+S | The continue-arc / "Recursive-with-weights" story |
| 3 | Arc A+F | Tight attribution of the gold (stamping vs filtering on the clean arc) |
| anytime | FLOOR | the floor line, 1–2 waves |
I am not chasing Recursive's absolute 0.9109 val_bpb — that took a frontier agent + ~10k runs /
~14k GPU-hours of engineered search, all in-context. My claim is relative and within-experiment:
Does fine-tuning bend the researcher's own V0→V_N trajectory below the same researcher frozen with experience in-context (CTRL)? That gap is scale-invariant — it can hold on a 7B and is the thing that would justify the Kimi-scale follow-up where absolute-leaderboard contention lives.
Hero deliverable: best val_bpb vs version, colored by solution diversity, with the community
baseline line and a ±2σ band. Decomposition table: headline Δ, CTRL (in-context) Δ,
weight-attributable Δ = headline − CTRL, and an overfit penalty = held-out Δ. Held-out probes
(depth-12 nanochat transfer + a HumanEval+ subset) separate "better researcher" from "memorized
the one fixed task" and catch silent forgetting.
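The decomposition table reduces to a few subtractions. The sketch below assumes each Δ is measured as V0 best minus V_N best, so a positive value means improvement (lower val_bpb); that sign convention, the class, and the field names are assumptions. The held-out Δ is passed through unchanged as the overfit penalty, as the table defines it.

```python
from dataclasses import dataclass


def improvement(v0_best_bpb: float, vn_best_bpb: float) -> float:
    """Delta under the assumed convention: positive means val_bpb went down."""
    return v0_best_bpb - vn_best_bpb


@dataclass(frozen=True)
class Decomposition:
    headline_delta: float   # fine-tuned arm, V0 -> V_N
    ctrl_delta: float       # frozen V0 with in-context experience
    held_out_delta: float   # change on the held-out probes

    @property
    def weight_attributable(self) -> float:
        # The part of the gain that in-context memory alone does not explain.
        return self.headline_delta - self.ctrl_delta

    @property
    def overfit_penalty(self) -> float:
        return self.held_out_delta
```

A usage example with made-up placeholder numbers (not results): `Decomposition(improvement(1.00, 0.97), improvement(1.00, 0.99), 0.005).weight_attributable` is approximately 0.02, meaning two thirds of the headline gain would be attributed to the weight update.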
| Knob | POC value | Note |
|---|---|---|
| Researcher (V0) | Qwen2.5-Coder-7B (32B-dense strength-check; Kimi-K2.7-class = scale-up) | must be fine-tunable in minutes → rules out the 1T MoE here |
| Environment | autoresearch depth-8, 5-min, 1 GPU/run | depth-12 = held-out transfer probe |
| Per round | ~25–30 min (48-rollout wave + overlapped LoRA + even-round evals) | a round (V_n→V_{n+1}) ⊃ 48 rollouts; rounds are sequential, rollouts parallel |
| Horizon | no fixed cap — dashboard-driven stop | literature puts the bootstrap by V2–3 and the plateau/collapse bend at V4–10, so the curve's shape shows early |
| Cluster | 64× H100 (8× a3-mega) | ~48 wave + ~serving + ~LoRA + evals = the whole box per full-width loop |
Honest gap: the V1000 × 1000-seeds × Kimi-1T vision is a ~10⁴× FLOPs extrapolation this POC does not prove. The POC does answer the three questions that decide whether the vision is worth chasing: (1) does the trajectory climb or collapse, (2) do weights beat context, (3) does the raw/stamped rule suffice vs filtering.
poc/recursive_finetuning_stability/ ← this POC (its own git repo / submodule)
├── README.md ← this file (stable human-facing overview)
├── HANDOFF.md handoff/ ← agent working canvas + per-phase notes (live state)
├── env/ ← vendored Karpathy autoresearch (re-clonable) + the FROZEN program.md
├── scripts/ ← SLURM/mesh orchestration (launch waves, schedule loops, sync GCS)
└── experiments/ ← per-phase code lands here (agent loop, consolidation, loop driver, evals)
Third-party reference (gitignored, re-clonable; origin in env/README.md):
karpathy/autoresearch — the research environment; only the pinned commit + frozen program.md
are tracked.
| Asset | Location | In git? | In GCS? |
|---|---|---|---|
| Pipeline code, docs, frozen program.md, manifest seeds | this repo (NFS /home) | ✅ | ✅ (via repo) |
| nanochat inner-training data + tokenizer (prepare.py) | /mnt/localssd/.../env/data/ (per node) | ❌ | ✅ (re-buildable) |
| Rollout traces (one Parquet row/rollout) + stdout blobs | gs://…/recursive_finetuning_stability/traces/<arm>/round-<n>/ | ✅ small manifests | ✅ bulk |
| Researcher checkpoints (merged V0,V2,…) + LoRA adapters | gs://…/recursive_finetuning_stability/checkpoints/ | ❌ | ✅ |
| Metrics parquets + dashboard data + eval results | gs://…/recursive_finetuning_stability/metrics/ | ✅ schema/code | ✅ data |
| HF model cache (Qwen2.5-Coder etc.) | /mnt/localssd/.hf-home/ | ❌ | ❌ (re-downloadable) |
Storage discipline. Git = source of truth for code/docs; GCS = source of truth for bulk data.
Reuse the sibling POC's bucket gs://nucleus-continual-learning/ (us-east4, additive/never-delete)
under a poc/recursive_finetuning_stability/ prefix. Checkpoint to GCS before each LoRA step so a
SLURM preemption never loses a round.
Cluster, nodes, NFS/SSD/GCS discipline, and conventions live once in HANDOFF.md →
Global context (single source of truth); per-phase build specs are in handoff/phase-N-*.md. The one
thing to flag here: this runs on the same 64× H100 cluster as live_stream_stability, and a
full-width loop needs ~all of it — so coordinate an exclusive (or preemption-checkpointed) node block.
How to use. Each phase is built in its own session at this repo root — paste the matching prompt to start. Every prompt assumes the agent first reads
README.md + HANDOFF.md + the phase's handoff/phase-N-*.md, designs the how, keeps that handoff doc current (stamping it), and follows the conventions: no-attribution commits; git = code/docs, GCS = bulk; idempotent/resumable; program.md frozen + hashed; val_bpb only, no judge model. Build order is 0 → 1 → 2 → 3, with Phase-4's V0 baselines run early.
We're building the recursive_finetuning_stability POC (continual-learning / recursive self-SFT).
Read README.md, HANDOFF.md, and handoff/phase-0-harness.md fully before doing anything — they hold the
converged design; your job is the HOW for Phase-0 only.
Phase-0 goal: stand up the autoresearch environment and prove ONE agent rollout runs end-to-end into a
valid trace with a real val_bpb on one H100. Deliverables: (1) vendor karpathy/autoresearch under
env/autoresearch/ at a pinned commit, run prepare.py once, reproduce the depth-8 baseline (~0.998 bpb);
(2) build the custom ~400-LOC ReAct agent loop (vLLM-served researcher = Qwen2.5-Coder-7B unless open-Q#1
is resolved to 32B) with a 360s hard timeout + Python syntax pre-check; (3) implement + validate the
locked trace schema (one Parquet row/rollout; status-aware, NaN val_bpb for broken code); (4) freeze +
hash program.md.
Constraints: custom ReAct (NOT a framework — we must capture reasoning/diffs/stdout); val_bpb only, no
judge model; bulk → GCS, code/docs → git; keep handoff/phase-0-harness.md current and stamped. Surface
open-Q#1 (7B vs 32B) if the user hasn't decided. Do not start Phase-1.
We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-1-collect.md fully first. Phase-0 (agent loop + locked trace schema) must exist — build on it.
Phase-1 goal: run ONE round of STEP-1 collection — 48 parallel agent rollouts in a single ~5-min nanochat
wave across 48 H100s — and aggregate into one Parquet shard in GCS with live metrics. Deliverables: SLURM
wave orchestrator (vLLM serving all 48; node-local-write-then-rsync); aggregation to
traces/<arm>/round-<n>.parquet + zstd stdout blobs; a raw reward-hack heuristic pass → anomaly_flags
(flag, don't drop); round_aggregates.parquet (mean/std/median val_bpb, compile/run rates, ast diversity,
reasoning density, data_quality_score).
Constraints: 48 threads (reserve GPUs for serving/LoRA/eval), 1 experiment/thread, round-1 is shared
across all arms (collect once), normalize stdout, GCS-before-train, idempotent/resumable. Keep
handoff/phase-1-collect.md current. Do not start Phase-2.
We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-2-consolidate.md fully first. Build on Phase-1's trace Parquet.
Phase-2 goal: turn ONE round's ~48 traces into V_{n+1} for a given arm — STEP-2, the weight update.
Deliverables: consolidation pipeline (Arm S = all traces + deterministic outcome-stamp; Arm F = only
val_bpb-improving traces); loss-mask collator (train on code+exec-logs+stamp; MASK program.md/base
train.py/agent CoT); 80/20 replay over a 5-round window; KL-anchor to V0 (λ=0.05, 50 fixed prompts);
LoRA fit (r=64/α=64, ALL in-block projections q,k,v,o,gate,up,down, embeddings/head frozen, 2 epochs,
LR re-warm→cosine, wd 0.01, dropout 0.05) → merge_and_unload → V_{n+1}; checkpoint to GCS every 2 rounds;
log train/val loss + LoRA effective rank (SVD).
Constraints: these are the LOCKED defaults (open-Q#4) — change only with justification logged in the phase
doc; never stamp a NaN/broken trace as good (status must tell the truth); unsloth for 7B, torchtune+FSDP
for 32B. Keep handoff/phase-2-consolidate.md current. Do not start Phase-3.
We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-3-loop.md fully first. Build on Phase-1 (collect) + Phase-2 (consolidate).
Phase-3 goal: wire the recursive loop V0→V_N and run the experiment matrix. Deliverables: loop driver
(collect→consolidate→V_{n+1}→even-round evals; resumable from GCS; no fixed V cap); arc seeders (A=reset
pristine train.py each round; B=continue from previous-round best, reported as Δ-beyond-seed); CTRL
(frozen V0, experience in-context — the thesis denominator); FLOOR (frozen V0, no memory, 1–2 waves);
priority scheduler honoring the GPU budget (arms are INDEPENDENT loops → ~1 full-width loop at a time →
P1 gold ArcA+S + CTRL concurrent ~28 threads each; P2 ArcB+F then ArcB+S; P3 ArcA+F); collapse auto-pause.
BLOCKER: resolve open-Q#2 (CTRL in-context memory design — see the phase doc) before the priority-1 deep
run; CTRL must match the gold loop on everything except weights-vs-context. Success is RELATIVE (Arm-S
curve vs CTRL curve), NOT the absolute 0.9109 leaderboard. Keep handoff/phase-3-loop.md current.
We're building the recursive_finetuning_stability POC. Read README.md, HANDOFF.md, and
handoff/phase-4-eval.md fully first. This is the measurement spine — start its V0 baselines EARLY, even
before the loop is deep.
Phase-4 goal: build the eval/diagnostics that accept or refute the hypothesis. Deliverables: the three
metric Parquets (thread_round_sample / round_aggregates / version_eval); zero-shot first-attempt harness
(Arc-A signal: no in-context exemplars, pristine train.py, ~16 seeds); held-out probes at V0,V2,V4…
(depth-12 nanochat transfer + HumanEval+ subset); cheap collapse diagnostics every round (AST diversity,
reasoning-token density, compile/run rate, data_quality_score); the dashboard; and the verdict =
decomposition table (headline Δ / CTRL Δ / weight-attributable = headline−CTRL / overfit penalty =
held-out Δ) + the hero plot (best val_bpb vs version, colored by diversity, baseline + ±2σ band).
BLOCKER: resolve open-Q#3 (stop/collapse-alert thresholds — see the phase doc). Telescope expensive evals;
run V0 baselines (depth-8 repro, FLOOR, CTRL start, held-out @ V0) first. Success is RELATIVE, not the
absolute leaderboard. Keep handoff/phase-4-eval.md current.