Training a small language model from random weights, using curriculum-driven multi-agent conversations, reward-weighted imitation, and staged LoRA adaptation

Bootstrap Basil

Can an LLM learn to speak the way humans do -- through immersion, mimicry, and positive reinforcement?

Bootstrap Basil is an experiment in training a language model from random weights using only AI-generated curriculum. No human-written training data, no distillation from a larger model, no pre-existing text corpus. A 124M-parameter GPT-2, initialized with random weights, is placed in a simulated classroom where AI teachers (Tutor and Sophie) run lessons, and Basil's graded attempts become its own training data.

This is not proven to work. This entire codebase was vibecoded with Cursor by a non-technical hobbyist. There are no warranties or guarantees of any kind -- treat it as an experiment, not production software. See ROADMAP.md for the research questions, current status, and how to contribute.

Overview

The system runs an automated loop: generate curriculum → run teaching sessions → grade Basil's responses → train on the best attempts → repeat. Key components:

  • Immersion -- Tutor and Sophie run interactive sessions (classroom, storytime, how-it-works, why-chains). The base model ("trunk") trains on full transcripts, absorbing language structure by listening.
  • Mimicry -- A LoRA adapter trains on Basil's own graded outputs, reinforcing attempts that match the language it's been exposed to.
  • Positive reinforcement -- A two-grader architecture (English quality + task compliance) scores every response. Only above-threshold outputs become training data; garbage is discarded.
  • Developmental staging -- Everything scales with age_band (0-7): LoRA strength, training epochs, output length, score thresholds. This prevents early overfitting while unlocking capacity as Basil improves.

Current Implementation Status

Completed Features

| Phase | Feature | Status |
| --- | --- | --- |
| Phase 1 | Task Agent + Grader Agent | ✅ Complete |
| Phase 1 | Auto Session Runner | ✅ Complete |
| Phase 1 | Structured Logging | ✅ Complete |
| Phase 2 | Dual-Objective LoRA Training | ✅ Complete |
| Phase 3 | Orchestrator + Metrics | ✅ Complete |
| Phase 3 | Dynamic Subject Generation | ✅ Complete |
| Phase 3 | Lesson Picker (Sophie) | ✅ Complete |
| Phase 3 | Blacklist Rotation | ✅ Complete |
| MVP Tier 1 | Basil Assessment (age_band 0-7) | ✅ Complete |
| MVP Tier 1 | Compliance/Progress Signal Gating | ✅ Complete |
| Phase 4 | Session Lifecycle (min/max turns, early stop) | ✅ Complete |
| Phase 4 | Graceful Session Wrap-up | ✅ Complete |
| Phase 4 | Per-Session Metrics | ✅ Complete |
| Phase 4 | Rolling Metrics + EWMA | ✅ Complete |
| Phase 4 | Training Triggers | ✅ Complete |
| Phase 4 | Post-Train Evaluation | ✅ Complete |
| Phase 4 | Checkpoint + Rollback | ✅ Complete |
| Phase 5 | Storytime Content Pipeline | ✅ Complete |
| Phase 5 | Dual-Objective Training (World + Basil-Policy) | ✅ Complete |
| Phase 5 | LoRA Adapters for Basil-Policy | ✅ Complete |
| Phase 5 | Sophie Post-Grade Masking (Data Leakage Fix) | ✅ Complete |
| Phase 6 | HowItWorks + WhyChain Session Types | ✅ Complete |
| Phase 6 | Parallel Data Generation (multi-worker) | ✅ Complete |
| Phase 6 | Shared Dedup Utilities (exact/fuzzy/semantic) | ✅ Complete |
| Phase 6 | Per-Process Model Cache | ✅ Complete |
| Phase 7 | Age-Band-Aware Score Weights Table | ✅ Complete |
| Phase 7 | Age-Band-Scaled Early Stopping (Val Loss Floor) | ✅ Complete |
| Phase 7 | Episode-Local LoRA Context (Anti-Contamination) | ✅ Complete |
| Phase 8 | Trunk Masking (basil_and_after) | ✅ Complete |
| Phase 8 | Adaptive LoRA Epoch Cap (avg_score-based) | ✅ Complete |
| Phase 8 | Age-Band LoRA Strength Scaling (0.0→1.0) | ✅ Complete |
| Phase 9 | LoRA Epoch Cap Scaled by Age Band | ✅ Complete |
| Phase 9 | Two-Grader Gating (english gates task) | ✅ Complete |
| Phase 9 | Weighted Trunk Masking (partial Basil weight) | ✅ Complete |
| Phase 9 | Alternating Popquiz Order | ✅ Complete |
| Phase 9 | Usable-Turn Training Triggers | ✅ Complete |
| Phase 10 | Per-Phase Early Stopping (WORLD/BASIL) | ✅ Complete |
| Phase 10 | Training Stability (wider assessment, doubled thresholds) | ✅ Complete |
| Phase 10 | Dynamic Target Turns (parallel generation) | ✅ Complete |

Training Architecture

Bootstrap Basil uses a dual-objective training approach with LoRA adapters to separate world knowledge from Basil-specific conversational behavior:

Objective 1: WORLD/TRUNK (Language Model)

Standard next-token prediction on full session transcripts (Tutor, Sophie, Story, Basil). Trains the base GPT-2 model to absorb language structure, vocabulary, and conversational patterns. LoRA adapters are disabled during this phase.

  • Dataset: WorldDataset — full transcripts with three-zone weighted masking
  • Trunk masking (mask_mode, default basil_and_after): Controls what the trunk learns from Basil-related content. Three modes:
    • none — no masking, trunk sees everything at full weight
    • basil_only — mask only Basil's output tokens
    • basil_and_after (default) -- three-zone weighted masking:
      • Zone A (Basil's output) + Zone B (Sophie's immediate reaction): trained at fractional weight (lora_weight / TRUNK_WEIGHT_DIVISOR, where divisor=2). This allows the trunk to gently absorb Basil's improving English without over-reinforcing garbage, while tracking the LoRA's quality signal.
      • Zone C (Sophie's popquiz, Tutor's wrap-up, all other teaching content): trained at full weight (1.0).
      • If the LoRA weight for a given score is 0.0, Zones A and B are fully masked (labels=-100).
    • This graduated approach replaced the earlier binary masking (which threw away all post-Basil content). The key insight: the trunk benefits from seeing Basil's generations at reduced weight, allowing it to learn the shape of improving responses without memorizing garbage.
  • Recency weighting: More recent sessions (grouped by training run) contribute proportionally more. Half-life of 6 training runs, floored at 10% minimum weight.
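The three-zone weighting and recency decay described above can be sketched as follows. This is an illustrative reconstruction from the README's description, not the repo's actual code; the function names are hypothetical, but the constants (divisor of 2, half-life of 6 runs, 10% floor) come from the text.

```python
# Sketch of basil_and_after trunk masking weights and recency decay.
# Helper names are hypothetical; constants are from the README text.

TRUNK_WEIGHT_DIVISOR = 2
RECENCY_HALF_LIFE_RUNS = 6
RECENCY_MIN_WEIGHT = 0.10

def trunk_token_weight(zone, lora_weight):
    """Per-token loss weight for the WORLD objective.

    zone: 'A' (Basil's output), 'B' (Sophie's immediate reaction),
          'C' (all other teaching content).
    lora_weight: the BASIL-policy score weight for this turn (0.0-1.0).
    Returns None when the token should be fully masked (labels = -100).
    """
    if zone == "C":
        return 1.0                                 # full weight on teaching content
    if lora_weight == 0.0:
        return None                                # zones A/B fully masked
    return lora_weight / TRUNK_WEIGHT_DIVISOR      # gentle absorption of zones A/B

def recency_weight(runs_ago):
    """Half-life of 6 training runs, floored at 10% minimum weight."""
    return max(RECENCY_MIN_WEIGHT, 0.5 ** (runs_ago / RECENCY_HALF_LIFE_RUNS))
```

A turn scored at LoRA weight 0.5 would thus contribute Zone A/B tokens at 0.25 to the trunk loss, while a zero-weight (discarded) turn masks those zones entirely.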

Objective 2: BASIL-POLICY (LoRA Adapter)

Trains only Basil's LoRA adapter on examples where the target tokens are exclusively Basil's reply. The base model is frozen during this phase. LoRA is active from age_band=0 (LORA_ACTIVATION_AGE_BAND=0), providing a reinforcement bootstrap signal from the very start.

  • Dataset: BasilDataset — context tokens masked (labels=-100), only Basil reply tokens as targets. Tracks per-example scores and computes avg_score for adaptive LoRA epoch capping.
  • Episode-local context: For multi-episode classroom sessions, each Basil training example sees only the current episode's dialogue (teaching, quiz, task prompt) — not previous episodes' Basil outputs. This prevents Basil's earlier garbage from contaminating the LoRA's conditioning context. Single-episode sessions (storytime, howitworks) use full session context since there are no prior Basil turns.
  • Age-band-aware score weights: Each example is weighted by score_to_weight_basil_policy(score, age_band) from BASIL_POLICY_SCORE_WEIGHTS_TABLE. The key design rules:
    • Score 0 always gets weight 0 (discarded)
    • Score 7 always gets weight 1.0 (full reinforcement)
    • Score ≤ age_band gets weight 0 (only scores ABOVE the current band are reinforced)
    • Uniform minimum reinforcing weight of 0.15 at every band, with a linear ramp to 1.0
    • At band 0, score=1 (any English word) gets 0.15 — this is the bootstrap signal that pulls the LoRA toward English
  • Age-band-scaled early stopping (VAL_LOSS_FLOOR_BY_AGE_BAND): Validation loss floor scales with age band. Band 0 stops at loss 3.0 (just learn basic English patterns), while band 7 allows loss down to 1.0. This prevents over-training at early stages and is analogous to the child "learning how to learn."
  • Age-band-scaled LoRA epoch cap: The number of LoRA (BASIL) training epochs scales linearly with age_band via lora_max_epochs_for_age_band(age_band, max_epochs=100). At age_band=0, the LoRA gets 0 epochs (pure trunk imitation). At age_band=7, it gets the full epoch budget (100). This mirrors lora_strength_for_age_band() so that training effort and inference influence grow together. The earlier approach (capping based on avg_score) was replaced because age_band is a more stable and predictable proxy for data quality, and scaling both training and inference together prevents the LoRA from overfitting on garbage during the earliest bootstrapping stages while giving it full training capacity at maturity.
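The shape of a BASIL-policy training example follows directly from the bullets above: episode-local context tokens are masked with -100 so only Basil's reply contributes to the loss. A minimal sketch (token ids are illustrative integers; the real BasilDataset also tracks per-example scores):

```python
# Minimal sketch of assembling one BASIL-policy example: the episode-local
# context is masked in the labels, so only Basil's reply tokens are targets.

IGNORE_INDEX = -100  # standard ignore index for cross-entropy loss

def build_basil_example(episode_context_ids, basil_reply_ids):
    """Concatenate episode-local context + Basil reply; mask context in labels."""
    input_ids = episode_context_ids + basil_reply_ids
    labels = [IGNORE_INDEX] * len(episode_context_ids) + basil_reply_ids
    return {"input_ids": input_ids, "labels": labels}
```

For a multi-episode classroom session, `episode_context_ids` would hold only the current episode's dialogue, which is what keeps earlier garbage out of the conditioning context.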

Mixed Training Mode

The default mixed mode alternates epochs between the two objectives, with per-phase early stopping and an adaptive cap on LoRA epochs:

Epoch 1:  WORLD training  (base model trainable, LoRA frozen)
Epoch 2:  BASIL training  (LoRA trainable, base model frozen)
Epoch 3:  WORLD training  ...
Epoch 4:  BASIL training  ... (if adaptive cap not reached)
Epoch 5:  WORLD training  → WORLD converges (patience exhausted)
Epoch 6:  BASIL training  ... (WORLD skipped, BASIL continues alone)
Epoch 7:  BASIL training  ...
...
Epoch 12: BASIL training  → BASIL converges → All phases done, stop

Per-phase early stopping: WORLD and BASIL modify different parameters (trunk vs LoRA), so they have independent convergence tracking — separate best_val_loss, patience_counter, and validation loaders. WORLD validates on world_val_loader, BASIL validates on basil_val_loader. Each phase has its own patience of 8 evals. When one phase converges, the other continues alone. Training stops when both phases have converged (or the LoRA epoch cap is reached for BASIL). Cross-validation metrics are logged for monitoring (basil_val during WORLD epochs, world_val during BASIL epochs). Max training time is 12 hours.
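The alternation and per-phase convergence tracking can be sketched as a small driver loop. This is a schematic, not train_basil_v2.py itself: `train_phase` and `validate` stand in for the real training and validation code, and the redirect-to-WORLD behavior when the LoRA cap is hit follows the paragraph below.

```python
# Schematic of mixed-mode alternation with independent per-phase early
# stopping. train_phase/validate are stand-ins for the real training code.

def run_mixed(train_phase, validate, lora_epoch_cap, max_epochs=200, patience=8):
    state = {p: {"best": float("inf"), "bad": 0, "done": False}
             for p in ("WORLD", "BASIL")}
    basil_epochs = 0
    for epoch in range(max_epochs):
        phase = "WORLD" if epoch % 2 == 0 else "BASIL"
        if phase == "BASIL" and (state["BASIL"]["done"] or basil_epochs >= lora_epoch_cap):
            phase = "WORLD"       # capped/converged BASIL slots redirect to WORLD
        elif phase == "WORLD" and state["WORLD"]["done"]:
            phase = "BASIL"       # WORLD converged; BASIL continues alone
        if state[phase]["done"] or (phase == "BASIL" and basil_epochs >= lora_epoch_cap):
            break                 # both objectives finished
        train_phase(phase)
        if phase == "BASIL":
            basil_epochs += 1
        val = validate(phase)     # each phase validates on its own loader
        s = state[phase]
        if val < s["best"]:
            s["best"], s["bad"] = val, 0
        else:
            s["bad"] += 1
            s["done"] = s["bad"] >= patience
    return basil_epochs
```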

When the age-band LoRA cap is reached, subsequent BASIL slots are redirected to additional WORLD training, ensuring the trunk continues to improve even after LoRA has been capped. The cap is computed from lora_max_epochs_for_age_band():

| age_band | lora_max_epochs | Rationale |
| --- | --- | --- |
| 0 | 0 epochs | Pre-verbal — no LoRA training, pure trunk imitation |
| 1 | 14 epochs | Proto-English — minimal LoRA, cautious refinement |
| 2 | 29 epochs | First words — growing LoRA budget |
| 3 | 43 epochs | Reliable words — moderate LoRA |
| 4 | 57 epochs | Short phrases — substantial LoRA |
| 5 | 71 epochs | Sentences — most LoRA training |
| 6 | 86 epochs | Conversation — near-full budget |
| 7 | 100 epochs | Reasoning — full epoch budget |
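The cap values appear to follow a simple linear ramp. A one-line reconstruction that reproduces every row of the table (an assumption — the repo's actual lora_max_epochs_for_age_band may differ in rounding details):

```python
# Reconstruction of the age-band epoch cap; matches the table values.
# Rounding behavior is an assumption about the real implementation.

def lora_max_epochs_for_age_band(age_band, max_epochs=100):
    return round(age_band / 7 * max_epochs)
```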

Each objective has its own optimizer, learning rate scheduler, and gradient scaler. The LoRA optimizer uses a lower peak learning rate (3e-5) than the trunk (1e-4) because LoRA's alpha/rank amplification (16/8 = 2x) effectively doubles the update magnitude, and the Basil dataset is smaller and noisier. Model saving stores both the base model weights and the LoRA adapter separately.

LoRA Strength Scaling at Inference

LoRA adapter contribution is scaled linearly with age_band at inference time via lora_strength_for_age_band():

| age_band | LoRA strength | Effect |
| --- | --- | --- |
| 0 | 0.00 | Trunk-only (LoRA present but zeroed out) |
| 1 | 0.14 | Minimal LoRA influence |
| 2 | 0.29 | |
| 3 | 0.43 | |
| 4 | 0.57 | Balanced trunk + LoRA |
| 5 | 0.71 | |
| 6 | 0.86 | |
| 7 | 1.00 | Full LoRA refinement |

This replaces the earlier binary on/off behavior. The rationale: at age_band=0, the LoRA is trained on mostly garbage data and should have minimal inference influence. As Basil matures and the LoRA trains on higher-quality data, its contribution is smoothly increased. The LoRA is still trained from age_band=0 (building up signal), but its influence at inference scales with developmental stage.

Overrides for experiments are available via the BASIL_LORA_STRENGTH env var or lora_strength constructor parameter in AutoSession.

Grading Pipeline

Basil's responses are scored through a multi-layer grading system designed to produce accurate training signals, especially during early bootstrapping when most outputs are noise.

Two-Grader Architecture

Each response is evaluated by two independent LLM graders:

  1. English Grader (0-3): Evaluates English quality and domain relevance. Generous — rewards any English words, especially those related to the subject/lesson.
  2. Task Grader (0-7): Evaluates task compliance — did Basil say the target word, answer the question, follow the instruction?

English Grader Gating

The English Grader's score acts as a ceiling on the Task Grader's output, preventing the Task Grader from over-scoring responses that lack genuine domain content (e.g., parroting Sophie's "Nice try!" or Tutor's conversational phrases):

| English Score | Task Score Cap | Rationale |
| --- | --- | --- |
| 0-1 | Capped at 2 | No domain-relevant English detected — task compliance can't be high |
| 2 | Capped at 3 | Some English but limited domain content |
| 3 | Uncapped | Good domain-relevant English — trust the Task Grader |
The final score is max(english_score, capped_task_score).
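The gate reduces to a few lines. A sketch of the logic as described (the function name is illustrative; grader_agent.py is authoritative):

```python
# Sketch of the English-gate: the task score is capped by the English
# score's tier, and the final score is max(english, capped task).

def gate_task_score(english_score, task_score):
    if english_score <= 1:
        cap = 2          # no domain-relevant English detected
    elif english_score == 2:
        cap = 3          # some English, limited domain content
    else:
        cap = 7          # english == 3: Task Grader trusted, uncapped
    return max(english_score, min(task_score, cap))
```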

Programmatic Score Floor

After LLM grading, a programmatic floor is applied:

  • If Basil's response contains the exact target word(s), the score is lifted to at least 6 (regardless of LLM grading)
  • If the response contains English words but no target, a minimum score of 1 applies
  • This acts as a safety net for cases where the LLM graders are too strict
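The floor can be sketched similarly. The word-matching below is deliberately naive and illustrative — score_override.py's actual target-detection logic may be more careful:

```python
# Sketch of the programmatic score floor applied after LLM grading.
# Naive word matching; the real check in score_override.py may differ.

def apply_score_floor(response, targets, llm_score):
    words = response.lower().split()
    if targets and all(t.lower() in words for t in targets):
        return max(llm_score, 6)   # exact target word(s) present: lift to >= 6
    if any(w.isalpha() for w in words):
        return max(llm_score, 1)   # English-like words but no target: floor of 1
    return llm_score               # no recognizable English: LLM score stands
```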

Why This Design?

The key failure mode discovered during bootstrapping was parroting inflation: Basil would repeat conversational phrases from Sophie or Tutor (e.g., "Nice try, can you say..."), and the Task Grader would give these high scores because they "attempted the right format." The English Grader consistently rated these 0-1 (no domain content), so using it as a gate solved the problem without brittle regex detection or additional API calls.

LoRA Configuration

| Parameter | Value |
| --- | --- |
| Rank | 8 |
| Alpha | 16 |
| Dropout | 0.05 |
| Target Modules | c_attn, c_proj (GPT-2 attention layers) |
| Adapter Params | ~1.44M (0.56% of 254M base) |

At inference time, session runners automatically load the LoRA adapter (if present) and scale its contribution by lora_strength_for_age_band(age_band) — 0.0 at band 0 (trunk-only) through 1.0 at band 7 (full LoRA). Classroom Basil generation now derives both max_new_tokens and temperature from get_basil_generation_settings(age_band), so temperature is tied to developmental stage (higher early, lower later), while preserving top_k=50 sampling.
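A hypothetical sketch of what get_basil_generation_settings might look like: max_new_tokens follows the documented (band * 3) + 4 formula, but the temperature endpoints here are illustrative assumptions (the README says only "higher early, lower later"), not the repo's actual values.

```python
# Hypothetical sketch of get_basil_generation_settings. The max-token
# formula is from the README; the temperature ramp endpoints are assumed.

def get_basil_generation_settings(age_band, t_hi=1.6, t_lo=0.8):
    frac = age_band / 7.0
    return {
        "max_new_tokens": age_band * 3 + 4,           # documented formula
        "temperature": t_hi - frac * (t_hi - t_lo),   # higher early, lower later
        "top_k": 50,                                  # preserved across bands
    }
```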

Architecture

┌─────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR                           │
│  • Batch scheduling with configurable delays                │
│  • Training triggers (graded turns + progress signal gate)  │
│  • Post-train evaluation + rollback on regression           │
│  • Checkpointing (pre/post train)                           │
│  • parallel_generate.py: multi-worker data generation       │
└─────────────────────────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
┌───────────────────┐ ┌───────────────┐ ┌─────────────────┐
│  CLASSROOM        │ │  STORYTIME    │ │  HOWITWORKS /   │
│  (auto_session)   │ │  (storytime_  │ │  WHYCHAIN       │
│  Multi-episode    │ │   session)    │ │  Single-episode  │
│  Phases A-F loop  │ │  Single-ep    │ │  sessions        │
└───────────────────┘ └───────────────┘ └─────────────────┘
          │                   │                   │
          ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────────┐
│                   SHARED INFRASTRUCTURE                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  Tutor   │  │  Sophie  │  │  Basil   │  │  Grader  │   │
│  │  (API)   │  │  (API)   │  │(local+LoRA)│ │  (API)   │   │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │
│  • model_cache.py (per-process Basil model caching)         │
│  • dedup_utils.py (exact + fuzzy + semantic dedup)          │
│  • grader_agent.py (shared grading across session types)    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     TRAINING PIPELINE                        │
│  • train_basil_v2.py: dual-objective (World + Basil-Policy) │
│  • WorldDataset: weighted trunk masking (partial Basil wt)  │
│  • BasilDataset: episode-local ctx → LoRA (age-band cap)    │
│  • LoRA epoch cap: lora_max_epochs_for_age_band() (0→100)   │
│  • LoRA strength at inference: age_band / 7.0 (0.0 → 1.0)  │
│  • Age-band-scaled early stopping (VAL_LOSS_FLOOR_BY_AGE)   │
│  • Age-band-aware score weights (SCORE_WEIGHTS_TABLE)       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     MEMORY / STATE                           │
│  • basil_assessment.json (age_band, compliance, progress)   │
│  • metrics.json (rolling stats + EWMA)                      │
│  • used_lessons/stories/topics/questions.json (dedup)       │
│  • checkpoints/ (pre/post train model snapshots)            │
└─────────────────────────────────────────────────────────────┘

Key Components

Agents

| Agent | Model | Role |
| --- | --- | --- |
| Tutor | gpt-4o-mini | Drives sessions, introduces concepts, asks questions |
| Sophie | gpt-4o-mini | Older sibling, picks lessons, models clear language |
| Task Generator | gpt-4o-mini | Generates 5-8 candidate tasks per episode |
| Task Selector | gpt-4o-mini | Picks best candidate based on teaching paragraph |
| English Grader | gpt-4o-mini | Scores English quality/domain relevance (0-3) |
| Task Grader | gpt-4o-mini | Scores task compliance (0-7), gated by English Grader |
| Task Naturalizer | gpt-4o-mini | Converts raw task text to natural dialogue (cached) |
| Subject Generator | gpt-4o-mini | Dynamically generates age-appropriate subjects |
| Assessment Agent | gpt-4o-mini | Evaluates Basil's developmental stage (age band) |
| Basil | Local GPT-2 + LoRA | The baby model being trained |

Age Band System

Basil progresses through 8 developmental stages. The max-token budget scales with age band via the formula (band * 3) + 4:

| Age Band | Description | Max Tokens | Task Categories |
| --- | --- | --- | --- |
| 0 | Pre-verbal | 4 | control |
| 1 | Proto-English | 7 | control, vocab |
| 2 | First words | 10 | control, vocab |
| 3 | Reliable words | 13 | control, vocab, relevance |
| 4 | Short phrases | 16 | vocab, relevance, memory |
| 5 | Sentences | 19 | relevance, memory, conversation |
| 6 | Conversation | 22 | memory, conversation |
| 7 | Reasoning | 25 | memory, conversation |
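The schedule above, expressed as a small lookup plus the token formula (a sketch — the repo's actual data structures may differ):

```python
# Sketch of the age-band schedule: max-token formula plus the category
# unlock table from the README. Structure names are illustrative.

TASK_CATEGORIES_BY_BAND = {
    0: ["control"],
    1: ["control", "vocab"],
    2: ["control", "vocab"],
    3: ["control", "vocab", "relevance"],
    4: ["vocab", "relevance", "memory"],
    5: ["relevance", "memory", "conversation"],
    6: ["memory", "conversation"],
    7: ["memory", "conversation"],
}

def max_tokens_for_age_band(age_band):
    return age_band * 3 + 4
```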

Promotion/Demotion Logic

Promotion requires ALL of:

  • Average score ≥ 3.5 over 10 sessions
  • Compliance rate ≥ 60% (turns with score ≥ 4)
  • Progress signal = True (majority in window)
  • Not at or past the training threshold (training must run first)

Demotion requires:

  • Average score ≤ 1.0 over 10 sessions
  • AND (compliance ≤ 20% OR progress signal = False)
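The promotion/demotion rules combine as simple conjunctions/disjunctions over the windowed stats. A sketch using the thresholds from config.py; the `stats` field names (including `trained_since_threshold` for the "training must run first" condition) are illustrative:

```python
# Sketch of promotion/demotion decisions over the 10-session window.
# Field names on `stats` are illustrative, thresholds are from config.py.

def should_promote(stats):
    return (stats["avg_score"] >= 3.5
            and stats["compliance_rate"] >= 0.60        # turns with score >= 4
            and stats["progress_signal"]                # majority in window
            and stats["trained_since_threshold"])       # training must run first

def should_demote(stats):
    return (stats["avg_score"] <= 1.0
            and (stats["compliance_rate"] <= 0.20
                 or not stats["progress_signal"]))
```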

Directory Structure

bootstrap-basil/
├── README.md                    # This file
├── config.py                    # Central configuration
├── verify_setup.py              # Setup verification script
│
├── # Core Session Components
├── auto_session.py              # Classroom session runner (multi-episode, Phase A-F)
├── task_agent.py                # Task + rubric generation (generate-then-select)
├── task_contract.py             # TaskSpec contract/schema definitions
├── grader_agent.py              # Two-grader scoring (english + task, with gating)
├── score_override.py            # Programmatic score floors (target-word presence)
├── subject_generator.py         # Dynamic subject generation
├── curriculum_manager.py        # Rotation state management
├── memory_manager.py            # Assessment + summaries
├── metrics_manager.py           # Session metrics + rolling stats (EWMA)
├── orchestrator.py              # Batch scheduling + training workflow
├── parallel_generate.py         # Parallel data generation (multi-worker)
├── dedup_utils.py               # Shared deduplication (exact, fuzzy, semantic)
├── model_cache.py               # Per-process Basil model cache (LoRA enable/disable)
│
├── # Additional Session Types
├── whychain_session.py          # WhyChain session runner (why-question chains)
│
├── # Training
├── train_basil_v2.py            # Dual-objective training (World/trunk + Basil/LoRA)
├── force_train_and_eval.py      # Manual training + evaluation trigger
├── create_basil_v0001.py        # Initial model creation
├── assessment_agent.py          # LLM-based age band assessment
├── identity_probe.py            # Identity probe ("Who are you?")
├── reset.py                     # Reset utility (clear models, state, logs)
│
├── # Legacy/Interactive
├── chat_basil.py                # Interactive chat mode
├── complete_basil.py            # Completion mode
├── sophie_engine.py             # Sophie standalone
├── eval_basil.py                # Evaluation utilities
│
├── prompts/                     # Agent prompt templates
│   ├── classroom/                   # Classroom session flow prompts
│   │   ├── prompt_tutor_kickoff.txt     # Tutor session kickoff (turn 1)
│   │   ├── prompt_tutor_primer.txt      # Tutor session primer
│   │   ├── prompt_tutor_phase_b.txt     # Tutor teaching paragraph (per-episode)
│   │   ├── prompt_tutor_phase_f.txt     # Tutor answering Sophie (per-episode)
│   │   ├── prompt_tutor_wrapup.txt      # Tutor session wrapup
│   │   ├── prompt_sophie_lesson_select.txt  # Sophie lesson picker
│   │   ├── prompt_sophie_react_teaching.txt # Sophie reacting to teaching
│   │   ├── prompt_sophie_post_grade.txt     # Sophie post-grade encouragement
│   │   ├── prompt_sophie_wrapup.txt         # Sophie session wrapup
│   │   ├── subjects.json                    # Auto-generated subject candidates
│   │   └── used_lessons.json                # Run-scoped lesson dedup
│   ├── howitworks/                  # How It Works pipeline prompts + runner
│   │   ├── howitworks_session.py        # HowItWorks session runner
│   │   └── used_topics.json             # Tracks used topics (reset on train)
│   ├── storytime/                   # Storytime pipeline prompts + runner
│   │   ├── storytime_session.py         # Storytime session runner
│   │   ├── prompt_tutor_story_tell.txt  # Tutor reads bedtime story
│   │   ├── prompt_sophie_ask_basil.txt  # Sophie asks Basil about the story
│   │   ├── prompt_sophie_story_pick.txt # Sophie picks a story
│   │   ├── prompt_tutor_story_wrapup.txt # Tutor wraps up storytime
│   │   └── used_stories.json            # Tracks used stories (reset on train)
│   ├── whychain/                    # WhyChain pipeline prompts
│   │   └── used_questions.json          # Tracks used seed questions (reset on train)
│   ├── prompt_tutor_quiz_sophie.txt     # Popquiz: Tutor quizzes Sophie (shared)
│   ├── prompt_task_generator.txt     # Generate-then-select: candidate generation
│   ├── prompt_task_selector.txt      # Generate-then-select: best candidate selection
│   ├── prompt_task_agent.txt         # Legacy single-task generator
│   ├── prompt_task_naturalizer.txt   # Task naturalization
│   ├── prompt_task_validator.txt     # Task validation
│   ├── prompt_english_grader.txt      # English quality/domain relevance grader (0-3)
│   ├── prompt_task_grader.txt        # Task compliance grader (0-7, gated by english grader)
│   ├── prompt_subject_generator.txt  # Subject generation
│   ├── prompt_assessment_agent.txt   # Age band assessment
│   └── session_summary_prompt.txt    # Session summary
│
├── memory/                      # Persistent state
│   ├── basil_assessment.json    # Age band, compliance, progress
│   ├── rotation_state.json      # Subject/lesson blacklists
│   ├── metrics.json             # Rolling statistics + EWMA
│   ├── task_naturalizer_cache.json # Cached task naturalizations
│   ├── session_summaries/       # Per-session summaries
│   └── session_metrics/         # Per-session metrics JSON
│
├── models/                      # Basil model checkpoints
│   └── basil_v0001/             # Initial untrained model
│       └── basil_lora_adapter/  # LoRA adapter weights (after training)
│
├── checkpoints/                 # Pre/post train model snapshots
│   ├── pretrain_YYYYMMDD_*/     # Pre-train checkpoint
│   └── posttrain_YYYYMMDD_*/    # Post-train checkpoint
│
├── identities/                  # Identity probe tracking
│   └── identity_log.jsonl       # Per-batch identity probe results
│
├── logs/                        # Session logs (batch-consolidated)
│   ├── batch_*_graded.jsonl     # Graded turns with weights (training input)
│   ├── batch_*_sessions.jsonl   # Transcripts + episodes (debug/training)
│   ├── batch_*_meta.jsonl       # Per-session metrics (one line per session)
│   └── batch_*_summary.json     # Batch summaries
│
├── docs/                        # Project documentation
│   ├── SESSION_FLOW.md          # Detailed session flow (phases A-F)
│   ├── PROMPT_FLOW.md           # Prompt template flow diagrams
│   ├── SESSION_FLOW_AUDIT.md    # Static code audit report
│   ├── EPISODE_REFACTOR.md      # Episode architecture design doc
│   └── DEBUGGING_ROOT_CAUSE.md  # Historical debugging notes
│
└── utils/
    └── scoring.py               # Scoring utilities

Quick Start

Prerequisites

# Install dependencies
pip install -r requirements.txt

# Set OpenAI API key
export OPENAI_API_KEY=sk-...

Setup

# Verify setup
python verify_setup.py

# Create initial Basil model (if needed)
python create_basil_v0001.py

Run a Session

# Run a single automated session
python auto_session.py --turns 10

# Run with specific subject
python auto_session.py --subject "Mathematics" --turns 5

# Run quietly (less output)
python auto_session.py --turns 10 --quiet

Run Batch Sessions

# Run forever (production mode)
python orchestrator.py run

# Run a specific number of sessions
python orchestrator.py run --sessions 5

# Run a specific number of batches
python orchestrator.py run --batches 2

# Check status
python orchestrator.py status

# Force training (with eval + rollback)
python orchestrator.py train --force

Train Basil

# Dual-objective training (default: alternating world/trunk + basil/LoRA)
# Uses basil_and_after masking and adaptive LoRA cap by default
python train_basil_v2.py --mode mixed

# Explicit masking mode
python train_basil_v2.py --mask-mode basil_and_after  # default
python train_basil_v2.py --mask-mode basil_only
python train_basil_v2.py --mask-mode none

# Manual LoRA epoch cap override (default: scales linearly with age_band)
python train_basil_v2.py --lora-max-epochs 30

# Train only the base model (world/trunk objective)
python train_basil_v2.py --mode world

# Train only the LoRA adapter (basil-policy objective)
python train_basil_v2.py --mode basil

# Force training + evaluation using existing logs
python force_train_and_eval.py

# Legacy modes (backward compatible)
python train_basil_v2.py --mode session
python train_basil_v2.py --mode graded

Temperature Sweep (Classroom, Graded)

Use this script to run a statistically grounded temperature sweep using real classroom tasks pulled from logs. It replays prompts at multiple temperatures, grades each output with the production grader stack, and reports which temperature gives the strongest training signal (score>=3).

Default: two-stage sweep -- a coarse sweep over 0.8-2.0, then a refinement pass around the best temperature. Grading runs in parallel (30 workers) with retries and jitter to avoid API rate limits.

# Two-stage sweep (coarse 0.8–2.0, refine around best; 30 grading workers)
python scripts/test_temperature_sweep.py

# Single-stage with custom temps
python scripts/test_temperature_sweep.py --temps 1.0,1.2,1.4,1.5,1.6

# Custom run
python scripts/test_temperature_sweep.py \
  --age-band 2 \
  --max-prompts 150 \
  --replicates 2 \
  --workers 30 \
  --sweep-mode two-stage \
  --bootstrap-iters 1000

Outputs:

  • Console summary per temperature (avg_score, %>=3, %>=4, %>=6, 95% CIs)
  • Machine-readable report: logs/temperature_sweep_<timestamp>.json
  • Paired bootstrap deltas vs the age-band baseline temperature
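The paired bootstrap behind those deltas is standard: resample prompt indices with replacement and look at the distribution of mean score differences between a candidate temperature and the baseline. A minimal stdlib sketch (the script's actual implementation may differ):

```python
# Minimal sketch of a paired bootstrap for per-temperature score deltas:
# resample prompt indices and take the 95% CI of the mean difference.

import random

def paired_bootstrap_delta(scores_a, scores_b, iters=1000, seed=0):
    """95% CI on mean(scores_a) - mean(scores_b), paired by prompt."""
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]       # resample prompts
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * iters)], deltas[int(0.975 * iters)]
```

If the interval excludes zero, the candidate temperature's advantage (or deficit) over the baseline is unlikely to be resampling noise.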

Checkpoint-to-Age-Band Mapping

Use this script to map basil_v* checkpoints to their post-train eval assessed_age_band so you can choose representative checkpoints (for example, band-1 and early band-2) before running temperature sweeps.

python scripts/checkpoint_age_band_map.py

It reads logs/train_*.log plus memory/session_metrics/session_*.json (training_phase="posttrain_eval") and prints a per-checkpoint mapping with promotion flags.

Configuration

Key settings in config.py:

# Session settings
SESSION_MAX_TURNS = 10
GRADE_EVERY_N_TURNS = 2
BASIL_MAX_TOKENS = 30  # Default fallback; actual value = (age_band * 3) + 4

# Assessment thresholds
ASSESSMENT_PROMOTE_SCORE = 3.5
ASSESSMENT_DEMOTE_SCORE = 1.0
ASSESSMENT_WINDOW_SESSIONS = 10

# Compliance/Progress gating
ASSESSMENT_MIN_COMPLIANCE_FOR_PROMOTION = 0.60
ASSESSMENT_MAX_COMPLIANCE_FOR_DEMOTION = 0.20
COMPLIANCE_SCORE_THRESHOLD = 4

# Session Lifecycle
MIN_GRADED_TURNS_PER_SESSION = 6
MAX_GRADED_TURNS_PER_SESSION = 20
EARLY_STOP_WINDOW_TURNS = 4
EARLY_STOP_MIN_AVG_SCORE = 2.5
EARLY_STOP_MAX_COMPLIANCE = 0.15
ENABLE_GRACEFUL_WRAPUP = True
WRAPUP_TURNS = 2

# Training Triggers (usable turns, scales with age_band)
# Formula: 1024 * (1 + age_band)
# A "usable turn" is one where score_to_weight_basil_policy(score, age_band) > 0.0
# age_band 0: 1,024 usable turns, age_band 1: 2,048, age_band 7: 8,192
MIN_SESSIONS_BEFORE_TRAIN = 3
TRAIN_ONLY_IF_PROGRESS_SIGNAL_RATE_AT_LEAST = 0.0
TRAIN_PROGRESS_SIGNAL_WINDOW = 10

# Post-Train Eval + Rollback
EVAL_SESSIONS_AFTER_TRAIN = 2
ROLLBACK_IF_SCORE_DROP_PCT = 0.15
ROLLBACK_IF_COMPLIANCE_DROP_ABS = 0.15
EVAL_COMPARE_WINDOW = 10

# LoRA Settings
LORA_ACTIVATION_AGE_BAND = 0  # LoRA training active from birth
LORA_RANK = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["c_attn", "c_proj"]  # GPT-2 attention layers
LORA_ADAPTER_NAME = "basil"
LORA_ADAPTER_SUBDIR = "basil_lora_adapter"

# LoRA Strength Scaling (inference)
def lora_strength_for_age_band(age_band):
    """0.0 at band 0, 1.0 at band 7"""
    return max(0.0, min(1.0, age_band / 7.0))

Key settings in train_basil_v2.py:

# Age-band-scaled early stopping (validation loss floor)
VAL_LOSS_FLOOR_BY_AGE_BAND = {
    0: 3.0,   # Just learn "this is English, speakers take turns"
    1: 2.5,   # Learn word boundaries, basic patterns
    2: 2.5,   # Learn topic relevance, simple word use
    3: 2.0,   # Learn phrase structure
    4: 1.7,   # Learn basic sentence patterns
    5: 1.4,   # Learn coherent responses
    6: 1.2,   # Learn nuance and reasoning
    7: 1.0,   # Full depth allowed
}

Age-band-aware score weights (BASIL_POLICY_SCORE_WEIGHTS_TABLE in config.py):

# Rows = scores 0-7, columns = age bands 0-7.
# Rule: score <= age_band -> 0 weight. Score 7 always -> 1.0.
# Uniform minimum of 0.15 with linear ramp to 1.0.
#              band0  band1  band2  band3  band4  band5  band6  band7
 0: {0: 0.00, 1: 0.00, 2: 0.00, 3: 0.00, 4: 0.00, 5: 0.00, 6: 0.00, 7: 0.00},
 1: {0: 0.15, 1: 0.00, ...},  # Bootstrap signal at band 0
 2: {0: 0.30, 1: 0.15, 2: 0.00, ...},
 ...
 7: {0: 1.00, 1: 1.00, 2: 1.00, 3: 1.00, 4: 1.00, 5: 1.00, 6: 1.00, 7: 1.00},
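The visible entries are consistent with a 0.15-per-step ramp above the band. A reconstruction that matches every value shown above (the ramp formula is an inference from those entries — the real table in config.py is authoritative):

```python
# Reconstruction of score_to_weight_basil_policy from the rules and the
# table entries shown above. The 0.15 * (score - band) ramp is inferred;
# config.py's BASIL_POLICY_SCORE_WEIGHTS_TABLE is authoritative.

def score_to_weight_basil_policy(score, age_band):
    if score == 7:
        return 1.0                             # score 7 always fully reinforced
    if score == 0 or score <= age_band:
        return 0.0                             # at/below the band: discarded
    return min(1.0, 0.15 * (score - age_band)) # 0.15 minimum, linear ramp
```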

Session Types

Bootstrap Basil rotates between four session types, each exposing Basil to different language patterns:

School Day Sessions (Classroom)

Structured teaching sessions with Tutor, Sophie, and graded episodes. Multi-episode sessions (6+ episodes each) provide the core curriculum-driven training data. Each episode follows the Phase A-F loop (task generation, teaching, Basil attempt, grading, Sophie encouragement). Sophie's popquiz (where Tutor quizzes Sophie on the lesson material) alternates placement — sometimes before Basil's turn, sometimes after — to diversify the conversational patterns in training data and expose Basil to Sophie's example answers both as priming and as follow-up context.

Storytime Sessions

Bedtime story sessions where Tutor reads a story, Sophie asks Basil a question about it, and the session wraps up with a gentle recap. Single-episode sessions that expose Basil to narrative structure, vocabulary, and natural conversational patterns. Stories tracked in prompts/storytime/used_stories.json.

HowItWorks Sessions

Explanatory sessions where Tutor explains how something works (e.g., "How do magnets work?"), Sophie asks a follow-up, Tutor asks Basil a comprehension question, and the session concludes. Single-episode sessions that expose Basil to explanatory/instructional language patterns. Topics tracked in prompts/howitworks/used_topics.json.

WhyChain Sessions

Open-ended conversational sessions built around a chain of "why?" questions. Sophie and Tutor explore a topic through iterative questioning, with Basil observing the dialogue. These provide rich conversational examples and topic exploration patterns. Questions tracked in prompts/whychain/used_questions.json.

The orchestrator rotates between session types automatically. Content deduplication (exact, fuzzy Jaccard similarity, and LLM-based semantic checks via dedup_utils.py) prevents repetition within a training run. Used content lists are cleared after each training run.
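The fuzzy layer of that dedup pipeline can be sketched as token-level Jaccard similarity. This is an illustrative version; dedup_utils.py may tokenize or threshold differently:

```python
def jaccard_similarity(a, b):
    """Token-level Jaccard similarity between two strings (sketch)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def is_fuzzy_duplicate(candidate, used, threshold=0.8):
    """Flag a candidate lesson/story/topic that overlaps a used one.
    The 0.8 threshold is an assumed illustrative value."""
    return any(jaccard_similarity(candidate, u) >= threshold for u in used)
```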

School Day Session Flow

  1. Subject Generation: LLM generates 15-40 age-appropriate school subjects → saved to prompts/classroom/subjects.json for debugging visibility
  2. Subject Selection: Random pick from generated candidates (subjects are broad, deduplication happens at lesson level)
  3. Tutor Kickoff: Announces SUBJECT_OF_THE_DAY
  4. Sophie Lesson Pick: Chooses LESSON_OF_THE_DAY with semantic overlap check against lessons used in current training run (retries up to 3x if overlap detected)
  5. Episode Loop (with hybrid stop policy):
    • Phase A: Task Generator produces candidates → Task Selector picks best
    • Phase B: Tutor teaches (B.1) → Sophie reacts (B.2) → Task delivered (B.3)
    • Phase C: Basil responds
    • Phase D: Two-grader scoring with gating (silent)
    • Phase E: Sophie encourages + asks curiosity question
    • Phase F: Tutor answers Sophie's question
    • Stop conditions:
      • Hard max: MAX_GRADED_TURNS_PER_SESSION (20)
      • Early stop: if avg_score < 2.5 AND compliance < 15% after MIN_GRADED_TURNS (6)
  6. Graceful Wrap-up:
    • Tutor sends brief recap + encouragement (not graded)
    • Sophie sends short closing line (not graded)
    • Identity probe: "Who are you?" → Basil responds → logged to identity_log.jsonl
  7. End of Session:
    • Compute compliance_rate and progress_signal
    • Update basil_assessment.json (potential age_band change)
    • Save artifacts to batch-level files (graded, sessions, meta)
    • Update rolling metrics (EWMA score/compliance)
    • Display progress toward next training round
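The hybrid stop policy in step 5 can be sketched as a small helper. EARLY_STOP_MIN_AVG_SCORE appears in config.py per the test checklist; EARLY_STOP_MIN_COMPLIANCE and the function name are assumed for illustration:

```python
MAX_GRADED_TURNS_PER_SESSION = 20
MIN_GRADED_TURNS = 6
EARLY_STOP_MIN_AVG_SCORE = 2.5
EARLY_STOP_MIN_COMPLIANCE = 0.15   # assumed name for the 15% floor

def stop_reason(scores, compliant_count):
    """Return a stop reason for the episode loop, or None to continue."""
    n = len(scores)
    if n >= MAX_GRADED_TURNS_PER_SESSION:
        return "max_turns"
    if n >= MIN_GRADED_TURNS:
        avg = sum(scores) / n
        if avg < EARLY_STOP_MIN_AVG_SCORE and compliant_count / n < EARLY_STOP_MIN_COMPLIANCE:
            return "early_stop_low_signal"
    return None
```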

Training Workflow

The orchestrator manages automated training with evaluation and rollback:

┌─────────────────────────────────────────────────────────┐
│              TRAINING TRIGGER CHECK                      │
│  usable_turns >= 1024 * (1 + age_band)                   │
│  AND total_sessions >= 3                                 │
└─────────────────────────────────────────────────────────┘
                         │ (trigger)
                         ▼
┌─────────────────────────────────────────────────────────┐
│  1. Save pre-train checkpoint                            │
│  2. Compute baseline (last 10 normal sessions)          │
│  3. Run dual-objective training (mixed mode):            │
│     - Alternating WORLD epochs (trunk, basil_and_after mask)│
│     - Alternating BASIL epochs (LoRA, adaptive epoch cap)│
│  4. Save post-train checkpoint (base + LoRA separately)  │
│  5. Run 2 eval sessions (training_phase="posttrain")    │
│  6. Compare eval vs baseline                             │
└─────────────────────────────────────────────────────────┘
                         │
          ┌──────────────┴──────────────┐
          ▼                              ▼
┌─────────────────────┐    ┌─────────────────────────┐
│ Score drop >= 15%   │    │ No significant          │
│ AND compliance drop │    │ regression              │
│ >= 0.15 absolute    │    │                         │
│                     │    │                         │
│ → ROLLBACK to       │    │ → KEEP new model        │
│   pre-train ckpt    │    │                         │
└─────────────────────┘    └─────────────────────────┘
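The trigger and rollback decisions from the diagram reduce to two predicates. Function names are illustrative; the thresholds are the ones shown above:

```python
def training_due(usable_turns, total_sessions, age_band):
    """Training trigger: usable-turn threshold plus minimum session count."""
    return usable_turns >= 1024 * (1 + age_band) and total_sessions >= 3

def should_rollback(baseline_score, eval_score,
                    baseline_compliance, eval_compliance):
    """Rollback only when BOTH regressions occur: relative score drop
    >= 15% AND absolute compliance drop >= 0.15."""
    score_drop = (baseline_score - eval_score) / baseline_score
    compliance_drop = baseline_compliance - eval_compliance
    return score_drop >= 0.15 and compliance_drop >= 0.15
```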

Data Model

Graded Turn (in *_graded.jsonl)

{
  "turn": 2,
  "graded": true,
  "subject": "Mathematics",
  "lesson": "Counting to 5",
  "tutor": "Basil, can you count to three?",
  "basil": "one two three",
  "task": {
    "task_text": "Count from one to three",
    "task_category": "vocab",
    "rubric": {"0": "...", "5": "..."},
    "grader_instructions": "..."
  },
  "grade": {
    "score": 4,
    "justification": "Basil correctly counted...",
    "evidence": ["one two three"]
  },
  "weight": 0.8,
  "age_band": 0
}
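Given this record shape, filtering a graded-turn file down to usable turns (positive policy weight) might look like the sketch below. `load_usable_turns` is a hypothetical helper; `weight_fn` stands in for score_to_weight_basil_policy:

```python
import json

def load_usable_turns(path, age_band, weight_fn):
    """Read a *_graded.jsonl file and keep only turns whose policy
    weight is positive (hypothetical helper)."""
    usable = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            turn = json.loads(line)
            if not turn.get("graded"):
                continue  # ungraded turns never count
            if weight_fn(turn["grade"]["score"], age_band) > 0.0:
                usable.append(turn)
    return usable
```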

Basil Assessment (memory/basil_assessment.json)

{
  "age_band": 2,
  "capabilities": ["Occasionally produces target words...", "..."],
  "preferred_task_categories": ["control", "vocab"],
  "output_caps": {"basil_max_tokens": 10, "...": "..."},
  "progress_signal": false,
  "compliance_rate": 0.0,
  "score_history": [{"session_avg": 2.5, "compliance_rate": 0.4, "...": "..."}],
  "recent_session_metrics": {
    "avg_score_window": 2.5,
    "compliance_rate_window": 0.4,
    "progress_signal_window": false,
    "window_sessions": 3
  },
  "last_updated": "2026-02-05T..."
}

Session Metrics (memory/session_metrics/session_*.json)

{
  "session_id": "20260205_143022",
  "timestamp": "2026-02-05T14:30:22.123456",
  "subject_of_the_day": "Mathematics",
  "lesson_of_the_day": "Counting to 10",
  "age_band_start": 0,
  "age_band_end": 0,
  "graded_turns_count": 15,
  "avg_score_session": 2.3,
  "compliance_rate_session": 0.53,
  "progress_signal_session": false,
  "task_category_counts": {"control": 5, "vocab": 8, "relevance": 2},
  "avg_basil_tokens": 12.4,
  "early_stopped": false,
  "stop_reason": "completed",
  "training_phase": "normal"
}

Rolling Metrics (memory/metrics.json)

{
  "total_sessions": 42,
  "total_graded_turns": 630,
  "graded_turns_since_last_train": 230,
  "last_train_timestamp": "2026-02-05T12:00:00",
  "last_train_result": "kept",
  "ewma_score": 2.5,
  "ewma_compliance": 0.55,
  "last_n_session_ids": ["...", "..."],
  "recent_summary": {
    "avg_score": 2.4,
    "avg_compliance": 0.52,
    "progress_signal_rate": 0.4,
    "sessions_count": 10
  },
  "total_training_runs": 1,
  "total_rollbacks": 0
}

Recent Changes

Phase 10: Per-Phase Early Stopping, Training Stability (2026-02-22)

Training architecture and stability improvements based on observing Basil's bootstrapping through age_band 2:

  • Per-phase early stopping: In mixed mode, WORLD and BASIL phases now have independent patience counters, best_val_loss tracking, and validation loaders. Previously, a single shared patience counter caused BASIL training to be killed prematurely — BASIL epochs naturally raise the WORLD val loss (because LoRA specialization hurts generalization), burning shared patience ticks. With per-phase tracking, BASIL got 9 epochs instead of 1 in the first test, and subsequent runs have reached 6+ BASIL epochs with WORLD running to convergence. Cross-validation metrics (basil_val during WORLD, world_val during BASIL) are logged for monitoring.
  • MAX_TRAIN_TIME increased to 12 hours: Training was hitting the 8-hour time limit before convergence (WORLD still finding new val loss improvements at step 14,500 of 15,500). Increased to 12 hours to allow natural convergence.
  • Training trigger threshold doubled: get_train_every_usable_turns formula changed from 512 * (1 + age_band) to 1024 * (1 + age_band). This gives more time for assessment to stabilize between training runs and ensures a larger, more diverse dataset for each training cycle.
  • Assessment window widened: ASSESSMENT_WINDOW_SESSIONS increased from 3 to 10. The narrow window caused noisy age_band oscillation during parallel generation (3 bad sessions could trigger a demotion that was reversed 3 sessions later).
  • Metric consistency fix: Promotion-blocking logic in memory_manager.py now uses usable_turns_since_last_train and get_train_every_usable_turns (matching the actual training trigger), instead of the deprecated graded_turns_since_last_train / get_train_every_graded_turns. The mismatch previously caused Basil to get stuck — promotion blocked (graded threshold met) but training never triggered (usable threshold not met).
  • Dynamic target turns: parallel_generate.py now uses a shared mp.Value for target_turns that the monitor loop refreshes every ~30 seconds from the current age_band assessment. Previously, target was frozen at startup, causing mismatches when age_band changed mid-run.
  • Grading context for all session types: howitworks_session.py and storytime_session.py now pass subject and lesson to grade_response(), giving the English Grader domain context it was missing (e.g., "warm bath" is domain-relevant when the topic is "How does a hot water heater work?").
  • Orchestrator direct-training path: When usable_turns_since_last_train already exceeds the training threshold (e.g., after an aborted training run), the orchestrator now directly triggers training instead of launching generation with a target of 0.
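The per-phase early stopping described above amounts to giving WORLD and BASIL each their own best-loss/patience state. A minimal hypothetical version:

```python
class PhaseEarlyStopper:
    """Independent patience tracking per training phase (WORLD vs BASIL).
    Minimal sketch; train_basil_v2.py also tracks validation loaders
    and cross-validation metrics."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Record one epoch's val loss; return True when this phase should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# One stopper per phase: BASIL epochs raising WORLD val loss no longer
# burn BASIL's patience, and vice versa.
world_stop = PhaseEarlyStopper(patience=3)
basil_stop = PhaseEarlyStopper(patience=3)
```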

Phase 9: Grading Fairness, Weighted Masking, Usable-Turn Triggers (2026-02-20)

A comprehensive update to the grading, training, and session flow systems based on iterative debugging of Basil's bootstrapping pipeline:

  • Two-grader gating architecture: grader_agent.py now runs two independent LLM graders — an english_grader (0-3, domain relevance) and a task_grader (0-7, task compliance). The English Grader's score gates the Task Grader's output: if english_score <= 1, the task score is capped at 2; if english_score == 2, capped at 3; otherwise uncapped. Final score = max(english_score, capped_task_score). This solved a pervasive problem where Basil parroting Sophie/Tutor phrases ("Nice try!", "Keep going!") received inflated task scores (4-5) despite containing zero domain content. The English Grader correctly rated these 0-1, and the gate prevents the Task Grader from rewarding mimicry.
  • Weighted trunk masking: WorldDataset evolved from binary masking (labels=-100) to graduated weight masking. Basil's generations and Sophie's immediate reaction are trained at fractional weight (half the LoRA policy weight for that score/age_band), while subsequent Sophie popquiz and Tutor wrap-up content trains at full weight. If the LoRA weight is 0.0, those zones are fully masked. This preserves the trunk's exposure to improving Basil responses without memorizing garbage.
  • LoRA epoch cap scaled by age_band: lora_max_epochs_for_age_band(age_band, max_epochs=100) replaced the previous max(1, round(avg_score)) cap. LoRA epochs now scale linearly from 0 at age_band=0 to 100 at age_band=7, mirroring inference strength scaling. This ensures training effort and inference influence grow in lockstep, preventing early overfitting while giving mature LoRA the full training budget.
  • Usable-turn training triggers: Training thresholds now count "usable turns" (where score_to_weight_basil_policy(score, age_band) > 0.0) instead of raw graded turns. Formula: 1024 * (1 + age_band). This ensures each training round has sufficient high-quality data, not just volume.
  • Alternating popquiz order: Classroom sessions now randomly place Sophie's popquiz (where Tutor quizzes Sophie) either before or after Basil's turn in each episode. This diversifies the conversational patterns Basil is exposed to during training — sometimes seeing Sophie's correct answer as a priming example before attempting the task, sometimes seeing it as follow-up context after.
  • Programmatic score floor (score_override.py): Applies target-word presence checks as a safety net — if Basil's response contains the exact target word, the score is lifted to at least 6 regardless of LLM grading. This protects against false negatives from overly strict LLM graders.
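The two-grader gating rule stated above is simple to express directly. This sketch shows only the gating arithmetic; grader_agent.py wraps it around two independent LLM calls:

```python
def gated_score(english_score, task_score):
    """Gate the task score (0-7) by the English quality score (0-3):
    english <= 1 caps task at 2; english == 2 caps at 3; else uncapped.
    Final score is the max of the English score and the capped task score."""
    if english_score <= 1:
        capped = min(task_score, 2)
    elif english_score == 2:
        capped = min(task_score, 3)
    else:
        capped = task_score
    return max(english_score, capped)
```

For example, a parroted "Nice try!" that a lenient task grader scores 5 but the English grader rates 0 comes out as 2, not 5.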

V2 Experiment Pipeline Update: Trunk Masking, Adaptive LoRA Cap, LoRA Strength Scaling (2026-02-15)

Based on a systematic 9-fork experiment (3 masking modes × 3 training strategies), the production training pipeline was updated with three key changes:

  • Trunk masking (basil_and_after): WorldDataset now defaults to mask_mode="basil_and_after", which masks Basil's output tokens + Sophie's post-grade encouragement + Tutor's answer for each episode. The trunk only learns from the teaching/task setup content. This was the clear winner in the experiment matrix — basil_and_after masking produced the most coherent English outputs and prevented the trunk from absorbing Basil's garbage or Sophie's score-conditioned phrases. Interestingly, the trunk still learned encouraging dialogue patterns (e.g., "Nice try, can you...") from unmasked Sophie reactions and Tutor lines in earlier phases — a sign of genuine language generalization, not a masking failure. (Note: this was later refined to weighted masking in Phase 9.)
  • Adaptive LoRA epoch cap: The number of LoRA (BASIL) training epochs was capped at max(1, round(avg_score)), where avg_score is the mean score of included BasilDataset examples. This directly addressed the primary failure mode discovered in experiments: LoRA overfitting on garbage tokens. (Note: later replaced by age-band-linear scaling in Phase 9.)
  • Age-band-based LoRA strength at inference: lora_strength_for_age_band() in config.py provides smooth linear scaling from 0.0 (age_band=0, trunk-only) to 1.0 (age_band=7, full LoRA). This replaces the previous binary on/off behavior. The LoRA is still trained from birth, but its inference influence scales with developmental stage — preventing early-stage garbage from dominating generation while allowing mature LoRA refinements full expression.
  • New CLI args: --mask-mode (choices: none, basil_only, basil_and_after) and --lora-max-epochs (manual override for age-band cap) added to train_basil_v2.py.

Episode-Local LoRA Context + Dedup Hardening (2026-02-14)

  • Episode-local context for LoRA training: BasilDataset now uses only the current episode's dialogue (teaching, quiz, task) as context for each Basil training example — NOT the full session including prior episodes' Basil garbage. This prevents contamination of the LoRA's conditioning context. Single-episode sessions (storytime, howitworks) fall back to full session context since there are no prior Basil turns. The WorldDataset (trunk) is unaffected and still trains on full session transcripts.
  • Dedup retry hardening: Increased max_pick_retries / max_retries from 5 to 10 across all session types (classroom, storytime, howitworks, whychain) to handle dedup exhaustion during large-scale parallel generation runs.

Age-Band-Aware Score Weights + Val Loss Floor (2026-02-13)

  • BASIL_POLICY_SCORE_WEIGHTS_TABLE: Replaced the flat score-to-weight mapping with a 2D table (score × age_band). Key rules: score ≤ age_band gets 0 weight, score 7 always 1.0, uniform minimum 0.15 with linear ramp. At band 0, score=1 gets 0.15 as the bootstrap signal pulling LoRA toward English.
  • VAL_LOSS_FLOOR_BY_AGE_BAND: Added age-band-scaled validation loss floor for early stopping. Band 0 stops at loss 3.0, band 7 allows down to 1.0. Prevents over-training at early stages.
  • LoRA activation from birth: LORA_ACTIVATION_AGE_BAND = 0 confirmed — LoRA provides reinforcement bootstrap signal from the very first training run.

HowItWorks + WhyChain Session Types (2026-02-12)

  • HowItWorks sessions: Explanatory "how does X work?" sessions with Tutor explanation, Sophie follow-up, and Basil comprehension question. Single-episode format.
  • WhyChain sessions: Open-ended "why?" chain conversations between Sophie and Tutor, with Basil observing. Provides rich conversational examples.
  • Shared dedup utilities (dedup_utils.py): Three-layer dedup pipeline (exact match, fuzzy Jaccard similarity, LLM-based semantic check) shared across all session types.

Parallel Data Generation (2026-02-12)

  • parallel_generate.py: Multi-worker parallel generation with --target-turns and --no-train flags. Defaults to 20 workers. Each worker runs sessions independently with per-process model caching (model_cache.py).
  • model_cache.py: Per-process caching for the Basil model, with LoRA enable/disable support.
  • reset.py: Utility for clearing generated files with --keep-logs and --reset-model options.

Greedy Decoding Fix + Separate LoRA Learning Rate (2026-02-10)

  • Eliminated greedy decoding for early age bands: Previously, age_band 0-1 used do_sample=False (greedy), while age_band 2+ used sampling. Greedy decoding was intended to give the "strongest signal" from an undertrained model, but in practice it amplified slight distributional biases in the LoRA adapter into single-token repetition loops (e.g. "il il il il" on every turn). All age bands now use uniform sampling (temperature=1.0, top_k=50), which is the neutral default that samples from the model's distribution without reshaping it.
  • Separate LoRA learning rate: The LoRA adapter now trains with its own peak LR (LORA_PEAK_LR=3e-5), 3.3x lower than the trunk's 1e-4. LoRA's alpha/rank amplification (16/8 = 2x) means the effective update magnitude at 3e-5 is comparable to 6e-5 on the parameters. This produces more conservative adapter updates that don't overfit to noisy early training data.
  • Root cause analysis: The previous "benchmark benchmark benchmark" and "il il il il" collapse was caused by two interacting factors: (1) greedy decoding deterministically picking the highest-probability token at each step, creating a self-reinforcing loop, and (2) the LoRA adapter being trained at the same LR as the 254M-parameter trunk, causing it to overfit its 1.4M parameters too aggressively to noise in the early training data. Fixing both produced immediate results: Basil now generates varied English words, forms partial phrases, and in some cases correctly answers questions (scoring 6/7).
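The uniform sampling settings described above would correspond to generation kwargs like the following (assumed shape for a Hugging Face `model.generate` call; `max_new_tokens` and the helper name are illustrative):

```python
def basil_generation_kwargs(max_new_tokens):
    """Uniform sampling for all age bands (sketch of assumed kwargs)."""
    return dict(
        do_sample=True,       # never greedy: greedy amplified repetition loops
        temperature=1.0,      # neutral: sample the model's own distribution
        top_k=50,             # mild tail truncation
        max_new_tokens=max_new_tokens,
    )
```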

Dual-Objective Training with LoRA Adapters (2026-02-09)

  • Two training objectives: WORLD/TRUNK trains the base GPT-2 model on full transcripts for general language understanding; BASIL-POLICY trains a LoRA adapter specifically on Basil's conversational turns with score-weighted examples
  • LoRA adapters: Added peft dependency for parameter-efficient fine-tuning. Only ~1.44M parameters (0.56% of total) are trained during the BASIL-POLICY phase, while the full 254M base model is trained during the WORLD phase
  • Mixed training mode: Default mixed mode alternates epochs between the two objectives, each with its own optimizer and learning rate scheduler
  • Sophie post-grade masking: Sophie's encouragement lines after grading are masked in training data to prevent data leakage (Basil was learning to repeat Sophie's phrases)
  • Basil-policy score weighting: Score 0 maps to weight 0.0 (discarded), score 7 maps to 1.0, with intermediate values scaled accordingly
  • LoRA-aware inference: auto_session.py and storytime_session.py automatically detect and load LoRA adapters when generating Basil's responses

Storytime Content Pipeline (2026-02-09)

  • New session type: Bedtime story sessions where Tutor reads a story with pauses for Basil to react
  • Story catalog: prompts/storytime/stories.json with age-appropriate stories
  • Used story tracking: prompts/storytime/used_stories.json prevents repetition within a training run (reset after training)
  • Orchestrator integration: Sessions alternate between school day and storytime automatically

Subject/Lesson Selection Refactor (2026-02-09)

  • Simplified subject selection: Removed subject-level semantic overlap check; subjects are broad categories that can yield many lessons. Generated candidates are saved to prompts/classroom/subjects.json for debugging visibility.
  • Lesson-level deduplication: Semantic overlap checking moved to lesson level (where it matters). Sophie's lesson picks are checked against lessons used in the current training run via check_lesson_overlap(), with automatic retry (up to 3x) if overlap detected.
  • Used lessons tracking: prompts/classroom/used_lessons.json tracks lessons used within each training run. List is cleared after training completes, allowing lesson reuse across training runs while ensuring variety within a single run's data collection.

Training Efficiency Improvements (2026-02-08)

  • Dynamic BLOCK_SIZE: Training sequence length now scales with age band (512 for bands 0-2, up to 1024 for bands 6-7), improving token utilization
  • Scaled LLM generations: Tutor and Sophie max_tokens scale with age_band via scaled_max_tokens(), producing richer content as Basil matures
  • Usable-turn training threshold: Training triggers count "usable turns" (where score_to_weight_basil_policy(score, age_band) > 0.0) rather than raw graded turns. Formula: 1024 * (1 + age_band). This ensures training data quality scales with expectations — only turns that would actually contribute non-zero LoRA weight count toward the trigger.
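The dynamic BLOCK_SIZE rule fixes the endpoints (512 for bands 0-2, 1024 for bands 6-7) but the README does not specify the intermediate values; the sketch below assumes a linear ramp for bands 3-5:

```python
def block_size_for_age_band(age_band):
    """Dynamic BLOCK_SIZE sketch: 512 for bands 0-2, 1024 for bands 6-7.
    The linear ramp for bands 3-5 is an assumption, not documented."""
    if age_band <= 2:
        return 512
    if age_band >= 6:
        return 1024
    return 512 + (age_band - 2) * 128  # bands 3-5 -> 640, 768, 896
```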

Prompt & Pipeline Streamlining

  • Removed deprecated prompt_tutor.txt and prompt_sophie.txt: Split into purpose-specific prompt files under prompts/classroom/ (prompt_tutor_kickoff.txt, prompt_sophie_lesson_select.txt, prompt_sophie_react_teaching.txt, prompt_sophie_post_grade.txt)
  • Removed "story_so_far" files: tutor_story_so_far.md and sophie_story_so_far.md were adding complexity without clear benefit; subject rotation and semantic dedup now handle diversity
  • Removed strategy system: The _compute_next_strategy() feedback loop (simplify/maintain/escalate) was not producing useful differentiation and has been removed
  • Simplified basil_assessment in prompts: Only the human-readable age_band_description is passed to prompts, not the full JSON blob
  • Physical action filtering via prompts: Instead of brittle regex filters, task generator and selector prompts now explain that Basil "can talk but has no body"

Identity Probe

  • Probe question updated: Changed from "Basil, who are you?" to "Who are you?"
  • Per-batch identity logging: Identity probe runs once per batch (last session), results saved to identities/identity_log.jsonl
  • Historical backfill: Old session logs were backfilled into identity_log.jsonl
  • Research context: The identity log tracks the secondary research question -- whether a sense of self emerges purely from language acquisition. The trajectory so far shows Basil moving from random tokens ("shaft slip slip shaft") through mimicked Tutor/Sophie phrases ("Hello, Sophie and little Basil! I'm") toward increasingly contextual responses.

Session Output

  • Progress indicator: End of each session displays accumulated graded turns and percentage progress toward next training round
  • Removed top_p from generation: Eliminated spurious transformer warnings for GPT-2

Guiding Principles

  1. No gold-standard answers - We don't train Basil to mimic a teacher model's prose
  2. Use rubrics, not targets - Tasks include scoring criteria with partial credit
  3. Reward-weighted learning - Higher scores = higher training weights
  4. Measure progress - Track scores by category, compliance, progress signals
  5. Be robust to babble - Early rubrics reward controllability and any recognizable signal
  6. Prevent "easy mode leveling" - Compliance gating prevents promotion on trivial tasks
  7. Penalize parroting - Repeating teacher/sibling phrases is not language production; the grading pipeline explicitly detects and downscores conversational mimicry
  8. Scale everything with age - LoRA strength, training epochs, max tokens, training thresholds, and masking weights all scale with age_band, preventing early-stage overfitting while unlocking full capacity at maturity

Test Checklist (Phase 4)

1. Basic Session Test

python orchestrator.py run --sessions 2

Verify:

  • logs/batch_*_graded.jsonl created (graded turns for training)
  • logs/batch_*_sessions.jsonl created (transcripts + episodes)
  • logs/batch_*_meta.jsonl created (per-session metrics)
  • memory/metrics.json updates (total_sessions, EWMA)
  • Sessions end with graceful wrap-up lines (Tutor recap, Sophie closing)

2. Early Stop Test

# Temporarily edit config.py:
# EARLY_STOP_MIN_AVG_SCORE = 7.0  # Force early stop

python orchestrator.py run --sessions 1

# Check session_metrics file:
# early_stopped: true
# stop_reason: "early_stop_low_signal"

3. Training Trigger Smoke Test

# Temporarily edit config.py:
# TRAIN_EVERY_GRADED_TURNS = 1
# MIN_SESSIONS_BEFORE_TRAIN = 1

python orchestrator.py run --sessions 3

Verify:

  • Pre-train checkpoint saved to checkpoints/pretrain_*/
  • Training invoked (new model in models/basil_v*/)
  • Post-train eval sessions run (training_phase="posttrain_eval")
  • metrics.json shows last_train_result ("kept" or "rolled_back")

4. Rollback Test

# Force a rollback by temporarily setting low thresholds:
# ROLLBACK_IF_SCORE_DROP_PCT = 0.0001
# ROLLBACK_IF_COMPLIANCE_DROP_ABS = 0.0001

python orchestrator.py train --force

Verify:

  • Rollback detected
  • Pre-train checkpoint restored as new model version
  • metrics.json shows last_train_result="rolled_back"

5. Status Check

python orchestrator.py status

Verify output includes:

  • Total sessions and graded turns
  • Graded turns since last train
  • EWMA metrics
  • Training trigger status

How to Run

Quick Start (3 sessions)

python orchestrator.py run --sessions 3

Production Mode (continuous)

python orchestrator.py run
# Ctrl+C to stop

Force Training with Eval

python orchestrator.py train --force

Monitor Progress

# Check status
python orchestrator.py status

# View batch log files (3 files per batch, ~20 sessions per batch)
ls -la logs/batch_*

# View rolling metrics
cat memory/metrics.json | python -m json.tool

Roadmap

See ROADMAP.md for a detailed discussion of the current state, potential levers for improvement, and open questions. This project has shown early promise but is not proven to work -- contributions and experimentation are welcome.

Future Work

  • Curriculum variety (content snippets from corpora)
  • Difficulty adjustment based on score trends (age band promotion/demotion)
  • Dynamic training threshold scaling with age band
  • LLM-based semantic lesson deduplication (run-scoped)
  • Scaled LLM output tokens by developmental stage
  • Identity probe logging and tracking
  • Dual-objective training with LoRA adapters
  • Storytime content pipeline (narrative exposure)
  • HowItWorks + WhyChain session types
  • Sophie post-grade masking (data leakage prevention)
  • Parallel session execution (parallel_generate.py)
  • Age-band-aware score weighting table
  • Age-band-scaled early stopping (val loss floor)
  • Episode-local LoRA context (prevent garbage contamination)
  • Trunk masking (basil_and_after — mask Basil + Sophie post-grade + Tutor answer)
  • Adaptive LoRA epoch cap (scale with avg_score to prevent overfitting)
  • Age-band-based LoRA strength scaling at inference (0.0→1.0)
  • Two-grader gating architecture (english gates task, prevents parroting inflation)
  • Weighted trunk masking (partial weight for Basil zones, full weight for teaching content)
  • LoRA epoch cap scaled by age_band (linear 0→100, replaces avg_score cap)
  • Usable-turn training triggers (count quality-weighted turns, not raw volume)
  • Alternating popquiz order (diversify Sophie example placement)
  • Programmatic score floor (target-word safety net)
  • Per-phase early stopping (independent WORLD/BASIL convergence)
  • Dynamic target turns (parallel generation adapts to age_band changes mid-run)
  • Multi-turn memory tasks
  • Evaluation benchmarks
  • Dashboard/visualization for progress tracking
  • Larger base model (scaling beyond GPT-2)
  • Progressive LoRA context expansion (episode → session as Basil improves)
  • Grader calibration harness (systematic prompt variant testing)

Disclaimer

This project was built entirely through AI-assisted development (vibecoding with Cursor) by a non-technical hobbyist. The code is provided as-is with no warranties or guarantees of any kind. It is an experiment, not production software. Use at your own risk.

License

MIT License. See LICENSE for details.
