Training a small language model from random weights, using curriculum-driven multi-agent conversations, reward-weighted imitation, and staged LoRA adaptation

Bootstrap Basil

Can an LLM learn to speak the way humans do -- through immersion, mimicry, and positive reinforcement?

Bootstrap Basil is an experiment in training a language model from random weights using only AI-generated curriculum. No human-written training data, no distillation from a larger model, no pre-existing text corpus. A 124M-parameter GPT-2, initialized with random weights, is placed in a simulated classroom where AI teachers (Tutor and Sophie) run lessons, and Basil's graded attempts become its own training data.

This is not proven to work. This entire codebase was vibecoded with Cursor by a non-technical hobbyist. There are no warranties or guarantees of any kind -- treat it as an experiment, not production software. See ROADMAP.md for the research questions, current status, and how to contribute.

Overview

The system runs an automated loop: generate curriculum → run teaching sessions → grade Basil's responses → train on the best attempts → repeat. Key components:

  • Immersion -- Tutor and Sophie run interactive sessions (classroom, storytime, how-it-works, why-chains). The base model ("trunk") trains on full transcripts, absorbing language structure by listening.
  • Mimicry -- A LoRA adapter trains on Basil's own graded outputs, reinforcing attempts that match the language it's been exposed to.
  • Positive reinforcement -- A two-grader architecture (English quality + task compliance) scores every response. Only above-threshold outputs become training data; garbage is discarded.
  • Developmental staging -- Everything scales with age_band (0-7): LoRA strength, training epochs, output length, score thresholds. This prevents early overfitting while unlocking capacity as Basil improves.

Current Implementation Status

Completed Features

| Phase | Feature | Status |
| --- | --- | --- |
| Phase 1 | Task Agent + Grader Agent | ✅ Complete |
| Phase 1 | Auto Session Runner | ✅ Complete |
| Phase 1 | Structured Logging | ✅ Complete |
| Phase 2 | Dual-Objective LoRA Training | ✅ Complete |
| Phase 3 | Orchestrator + Metrics | ✅ Complete |
| Phase 3 | Dynamic Subject Generation | ✅ Complete |
| Phase 3 | Lesson Picker (Sophie) | ✅ Complete |
| Phase 3 | Blacklist Rotation | ✅ Complete |
| MVP Tier 1 | Basil Assessment (age_band 0-7) | ✅ Complete |
| MVP Tier 1 | Compliance/Progress Signal Gating | ✅ Complete |
| Phase 4 | Session Lifecycle (min/max turns, early stop) | ✅ Complete |
| Phase 4 | Graceful Session Wrap-up | ✅ Complete |
| Phase 4 | Per-Session Metrics | ✅ Complete |
| Phase 4 | Rolling Metrics + EWMA | ✅ Complete |
| Phase 4 | Training Triggers | ✅ Complete |
| Phase 4 | Post-Train Evaluation | ✅ Complete |
| Phase 4 | Checkpoint + Rollback | ✅ Complete |
| Phase 5 | Storytime Content Pipeline | ✅ Complete |
| Phase 5 | Dual-Objective Training (World + Basil-Policy) | ✅ Complete |
| Phase 5 | LoRA Adapters for Basil-Policy | ✅ Complete |
| Phase 5 | Sophie Post-Grade Masking (Data Leakage Fix) | ✅ Complete |
| Phase 6 | HowItWorks + WhyChain Session Types | ✅ Complete |
| Phase 6 | Parallel Data Generation (multi-worker) | ✅ Complete |
| Phase 6 | Shared Dedup Utilities (exact/fuzzy/semantic) | ✅ Complete |
| Phase 6 | Per-Process Model Cache | ✅ Complete |
| Phase 7 | Age-Band-Aware Score Weights Table | ✅ Complete |
| Phase 7 | Age-Band-Scaled Early Stopping (Val Loss Floor) | ✅ Complete |
| Phase 7 | Episode-Local LoRA Context (Anti-Contamination) | ✅ Complete |
| Phase 8 | Trunk Masking (basil_and_after) | ✅ Complete |
| Phase 8 | Adaptive LoRA Epoch Cap (avg_score-based) | ✅ Complete |
| Phase 8 | Age-Band LoRA Strength Scaling (0.0→1.0) | ✅ Complete |
| Phase 9 | LoRA Epoch Cap Scaled by Age Band | ✅ Complete |
| Phase 9 | Two-Grader Gating (english gates task) | ✅ Complete |
| Phase 9 | Weighted Trunk Masking (partial Basil weight) | ✅ Complete |
| Phase 9 | Alternating Popquiz Order | ✅ Complete |
| Phase 9 | Usable-Turn Training Triggers | ✅ Complete |
| Phase 10 | Per-Phase Early Stopping (WORLD/BASIL) | ✅ Complete |
| Phase 10 | Training Stability (wider assessment, doubled thresholds) | ✅ Complete |
| Phase 10 | Dynamic Target Turns (parallel generation) | ✅ Complete |

Training Architecture

Bootstrap Basil uses a dual-objective training approach with LoRA adapters to separate world knowledge from Basil-specific conversational behavior:

Objective 1: WORLD/TRUNK (Language Model)

Standard next-token prediction on full session transcripts (Tutor, Sophie, Story, Basil). Trains the base GPT-2 model to absorb language structure, vocabulary, and conversational patterns. LoRA adapters are disabled during this phase.

  • Dataset: WorldDataset — full transcripts with three-zone weighted masking
  • Trunk masking (mask_mode, default basil_and_after): Controls what the trunk learns from Basil-related content. Three modes:
    • none — no masking, trunk sees everything at full weight
    • basil_only — mask only Basil's output tokens
    • basil_and_after (default) -- three-zone weighted masking:
      • Zone A (Basil's output) + Zone B (Sophie's immediate reaction): trained at fractional weight (lora_weight / TRUNK_WEIGHT_DIVISOR, where divisor=2). This allows the trunk to gently absorb Basil's improving English without over-reinforcing garbage, while tracking the LoRA's quality signal.
      • Zone C (Sophie's popquiz, Tutor's wrap-up, all other teaching content): trained at full weight (1.0).
      • If the LoRA weight for a given score is 0.0, Zones A and B are fully masked (labels=-100).
    • This graduated approach replaced the earlier binary masking (which threw away all post-Basil content). The key insight: the trunk benefits from seeing Basil's generations at reduced weight, allowing it to learn the shape of improving responses without memorizing garbage.
  • Recency weighting: More recent sessions (grouped by training run) contribute proportionally more. Half-life of 6 training runs, floored at 10% minimum weight.
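The three-zone weighting and recency decay described above can be sketched as follows. This is an illustrative reconstruction from the README's description, not the repo's actual code; the function names are hypothetical, but the constants (divisor of 2, half-life of 6 runs, 10% floor) come from the text.

```python
# Sketch of basil_and_after trunk masking weights and recency decay.
# Helper names are hypothetical; constants are from the README text.

TRUNK_WEIGHT_DIVISOR = 2
RECENCY_HALF_LIFE_RUNS = 6
RECENCY_MIN_WEIGHT = 0.10

def trunk_token_weight(zone, lora_weight):
    """Per-token loss weight for the WORLD objective.

    zone: 'A' (Basil's output), 'B' (Sophie's immediate reaction),
          'C' (all other teaching content).
    lora_weight: the BASIL-policy score weight for this turn (0.0-1.0).
    Returns None when the token should be fully masked (labels = -100).
    """
    if zone == "C":
        return 1.0                                 # full weight on teaching content
    if lora_weight == 0.0:
        return None                                # zones A/B fully masked
    return lora_weight / TRUNK_WEIGHT_DIVISOR      # gentle absorption of zones A/B

def recency_weight(runs_ago):
    """Half-life of 6 training runs, floored at 10% minimum weight."""
    return max(RECENCY_MIN_WEIGHT, 0.5 ** (runs_ago / RECENCY_HALF_LIFE_RUNS))
```

A turn scored at LoRA weight 0.5 would thus contribute Zone A/B tokens at 0.25 to the trunk loss, while a zero-weight (discarded) turn masks those zones entirely.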

Objective 2: BASIL-POLICY (LoRA Adapter)

Trains only Basil's LoRA adapter on examples where the target tokens are exclusively Basil's reply. The base model is frozen during this phase. LoRA is active from age_band=0 (LORA_ACTIVATION_AGE_BAND=0), providing a reinforcement bootstrap signal from the very start.

  • Dataset: BasilDataset — context tokens masked (labels=-100), only Basil reply tokens as targets. Tracks per-example scores and computes avg_score for adaptive LoRA epoch capping.
  • Episode-local context: For multi-episode classroom sessions, each Basil training example sees only the current episode's dialogue (teaching, quiz, task prompt) — not previous episodes' Basil outputs. This prevents Basil's earlier garbage from contaminating the LoRA's conditioning context. Single-episode sessions (storytime, howitworks) use full session context since there are no prior Basil turns.
  • Age-band-aware score weights: Each example is weighted by score_to_weight_basil_policy(score, age_band) from BASIL_POLICY_SCORE_WEIGHTS_TABLE. The key design rules:
    • Score 0 always gets weight 0 (discarded)
    • Score 7 always gets weight 1.0 (full reinforcement)
    • Score ≤ age_band gets weight 0 (only scores ABOVE the current band are reinforced)
    • Uniform minimum reinforcing weight of 0.15 at every band, with a linear ramp to 1.0
    • At band 0, score=1 (any English word) gets 0.15 — this is the bootstrap signal that pulls the LoRA toward English
  • Age-band-scaled early stopping (VAL_LOSS_FLOOR_BY_AGE_BAND): Validation loss floor scales with age band. Band 0 stops at loss 3.0 (just learn basic English patterns), while band 7 allows loss down to 1.0. This prevents over-training at early stages and is analogous to the child "learning how to learn."
  • Age-band-scaled LoRA epoch cap: The number of LoRA (BASIL) training epochs scales linearly with age_band via lora_max_epochs_for_age_band(age_band, max_epochs=100). At age_band=0, the LoRA gets 0 epochs (pure trunk imitation). At age_band=7, it gets the full epoch budget (100). This mirrors lora_strength_for_age_band() so that training effort and inference influence grow together. The earlier approach (capping based on avg_score) was replaced because age_band is a more stable and predictable proxy for data quality, and scaling both training and inference together prevents the LoRA from overfitting on garbage during the earliest bootstrapping stages while giving it full training capacity at maturity.
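The shape of a BASIL-policy training example follows directly from the bullets above: episode-local context tokens are masked with -100 so only Basil's reply contributes to the loss. A minimal sketch (token ids are illustrative integers; the real BasilDataset also tracks per-example scores):

```python
# Minimal sketch of assembling one BASIL-policy example: the episode-local
# context is masked in the labels, so only Basil's reply tokens are targets.

IGNORE_INDEX = -100  # standard ignore index for cross-entropy loss

def build_basil_example(episode_context_ids, basil_reply_ids):
    """Concatenate episode-local context + Basil reply; mask context in labels."""
    input_ids = episode_context_ids + basil_reply_ids
    labels = [IGNORE_INDEX] * len(episode_context_ids) + basil_reply_ids
    return {"input_ids": input_ids, "labels": labels}
```

For a multi-episode classroom session, `episode_context_ids` would hold only the current episode's dialogue, which is what keeps earlier garbage out of the conditioning context.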

Mixed Training Mode

The default mixed mode alternates epochs between the two objectives, with per-phase early stopping and an adaptive cap on LoRA epochs:

Epoch 1:  WORLD training  (base model trainable, LoRA frozen)
Epoch 2:  BASIL training  (LoRA trainable, base model frozen)
Epoch 3:  WORLD training  ...
Epoch 4:  BASIL training  ... (if adaptive cap not reached)
Epoch 5:  WORLD training  → WORLD converges (patience exhausted)
Epoch 6:  BASIL training  ... (WORLD skipped, BASIL continues alone)
Epoch 7:  BASIL training  ...
...
Epoch 12: BASIL training  → BASIL converges → All phases done, stop

Per-phase early stopping: WORLD and BASIL modify different parameters (trunk vs LoRA), so they have independent convergence tracking — separate best_val_loss, patience_counter, and validation loaders. WORLD validates on world_val_loader, BASIL validates on basil_val_loader. Each phase has its own patience of 8 evals. When one phase converges, the other continues alone. Training stops when both phases have converged (or the LoRA epoch cap is reached for BASIL). Cross-validation metrics are logged for monitoring (basil_val during WORLD epochs, world_val during BASIL epochs). Max training time is 12 hours.
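The alternation and per-phase convergence tracking can be sketched as a small driver loop. This is a schematic, not train_basil_v2.py itself: `train_phase` and `validate` stand in for the real training and validation code, and the redirect-to-WORLD behavior when the LoRA cap is hit follows the paragraph below.

```python
# Schematic of mixed-mode alternation with independent per-phase early
# stopping. train_phase/validate are stand-ins for the real training code.

def run_mixed(train_phase, validate, lora_epoch_cap, max_epochs=200, patience=8):
    state = {p: {"best": float("inf"), "bad": 0, "done": False}
             for p in ("WORLD", "BASIL")}
    basil_epochs = 0
    for epoch in range(max_epochs):
        phase = "WORLD" if epoch % 2 == 0 else "BASIL"
        if phase == "BASIL" and (state["BASIL"]["done"] or basil_epochs >= lora_epoch_cap):
            phase = "WORLD"       # capped/converged BASIL slots redirect to WORLD
        elif phase == "WORLD" and state["WORLD"]["done"]:
            phase = "BASIL"       # WORLD converged; BASIL continues alone
        if state[phase]["done"] or (phase == "BASIL" and basil_epochs >= lora_epoch_cap):
            break                 # both objectives finished
        train_phase(phase)
        if phase == "BASIL":
            basil_epochs += 1
        val = validate(phase)     # each phase validates on its own loader
        s = state[phase]
        if val < s["best"]:
            s["best"], s["bad"] = val, 0
        else:
            s["bad"] += 1
            s["done"] = s["bad"] >= patience
    return basil_epochs
```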

When the age-band LoRA cap is reached, subsequent BASIL slots are redirected to additional WORLD training, ensuring the trunk continues to improve even after LoRA has been capped. The cap is computed from lora_max_epochs_for_age_band():

| age_band | lora_max_epochs | Rationale |
| --- | --- | --- |
| 0 | 0 epochs | Pre-verbal — no LoRA training, pure trunk imitation |
| 1 | 14 epochs | Proto-English — minimal LoRA, cautious refinement |
| 2 | 29 epochs | First words — growing LoRA budget |
| 3 | 43 epochs | Reliable words — moderate LoRA |
| 4 | 57 epochs | Short phrases — substantial LoRA |
| 5 | 71 epochs | Sentences — most LoRA training |
| 6 | 86 epochs | Conversation — near-full budget |
| 7 | 100 epochs | Reasoning — full epoch budget |
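The cap values appear to follow a simple linear ramp. A one-line reconstruction that reproduces every row of the table (an assumption — the repo's actual lora_max_epochs_for_age_band may differ in rounding details):

```python
# Reconstruction of the age-band epoch cap; matches the table values.
# Rounding behavior is an assumption about the real implementation.

def lora_max_epochs_for_age_band(age_band, max_epochs=100):
    return round(age_band / 7 * max_epochs)
```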

Each objective has its own optimizer, learning rate scheduler, and gradient scaler. The LoRA optimizer uses a lower peak learning rate (3e-5) than the trunk (1e-4) because LoRA's alpha/rank amplification (16/8 = 2x) effectively doubles the update magnitude, and the Basil dataset is smaller and noisier. Model saving stores both the base model weights and the LoRA adapter separately.

LoRA Strength Scaling at Inference

LoRA adapter contribution is scaled linearly with age_band at inference time via lora_strength_for_age_band():

| age_band | LoRA strength | Effect |
| --- | --- | --- |
| 0 | 0.00 | Trunk-only (LoRA present but zeroed out) |
| 1 | 0.14 | Minimal LoRA influence |
| 2 | 0.29 | |
| 3 | 0.43 | |
| 4 | 0.57 | Balanced trunk + LoRA |
| 5 | 0.71 | |
| 6 | 0.86 | |
| 7 | 1.00 | Full LoRA refinement |

This replaces the earlier binary on/off behavior. The rationale: at age_band=0, the LoRA is trained on mostly garbage data and should have minimal inference influence. As Basil matures and the LoRA trains on higher-quality data, its contribution is smoothly increased. The LoRA is still trained from age_band=0 (building up signal), but its influence at inference scales with developmental stage.

Overrides for experiments are available via the BASIL_LORA_STRENGTH env var or lora_strength constructor parameter in AutoSession.

Grading Pipeline

Basil's responses are scored through a multi-layer grading system designed to produce accurate training signals, especially during early bootstrapping when most outputs are noise.

Two-Grader Architecture

Each response is evaluated by two independent LLM graders:

  1. English Grader (0-3): Evaluates English quality and domain relevance. Generous — rewards any English words, especially those related to the subject/lesson.
  2. Task Grader (0-7): Evaluates task compliance — did Basil say the target word, answer the question, follow the instruction?

English Grader Gating

The English Grader's score acts as a ceiling on the Task Grader's output, preventing the Task Grader from over-scoring responses that lack genuine domain content (e.g., parroting Sophie's "Nice try!" or Tutor's conversational phrases):

| English Score | Task Score Cap | Rationale |
| --- | --- | --- |
| 0-1 | Capped at 2 | No domain-relevant English detected — task compliance can't be high |
| 2 | Capped at 3 | Some English but limited domain content |
| 3 | Uncapped | Good domain-relevant English — trust the Task Grader |
The final score is max(english_score, capped_task_score).
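The gate reduces to a few lines. A sketch of the logic as described (the function name is illustrative; grader_agent.py is authoritative):

```python
# Sketch of the English-gate: the task score is capped by the English
# score's tier, and the final score is max(english, capped task).

def gate_task_score(english_score, task_score):
    if english_score <= 1:
        cap = 2          # no domain-relevant English detected
    elif english_score == 2:
        cap = 3          # some English, limited domain content
    else:
        cap = 7          # english == 3: Task Grader trusted, uncapped
    return max(english_score, min(task_score, cap))
```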

Programmatic Score Floor

After LLM grading, a programmatic floor is applied:

  • If Basil's response contains the exact target word(s), the score is lifted to at least 6 (regardless of LLM grading)
  • If the response contains English words but no target, a minimum score of 1 applies
  • This acts as a safety net for cases where the LLM graders are too strict
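The floor can be sketched similarly. The word-matching below is deliberately naive and illustrative — score_override.py's actual target-detection logic may be more careful:

```python
# Sketch of the programmatic score floor applied after LLM grading.
# Naive word matching; the real check in score_override.py may differ.

def apply_score_floor(response, targets, llm_score):
    words = response.lower().split()
    if targets and all(t.lower() in words for t in targets):
        return max(llm_score, 6)   # exact target word(s) present: lift to >= 6
    if any(w.isalpha() for w in words):
        return max(llm_score, 1)   # English-like words but no target: floor of 1
    return llm_score               # no recognizable English: LLM score stands
```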

Why This Design?

The key failure mode discovered during bootstrapping was parroting inflation: Basil would repeat conversational phrases from Sophie or Tutor (e.g., "Nice try, can you say..."), and the Task Grader would give these high scores because they "attempted the right format." The English Grader consistently rated these 0-1 (no domain content), so using it as a gate solved the problem without brittle regex detection or additional API calls.

LoRA Configuration

| Parameter | Value |
| --- | --- |
| Rank | 8 |
| Alpha | 16 |
| Dropout | 0.05 |
| Target Modules | c_attn, c_proj (GPT-2 attention layers) |
| Adapter Params | ~1.44M (0.56% of 254M base) |

At inference time, session runners automatically load the LoRA adapter (if present) and scale its contribution by lora_strength_for_age_band(age_band) — 0.0 at band 0 (trunk-only) through 1.0 at band 7 (full LoRA). Classroom Basil generation now derives both max_new_tokens and temperature from get_basil_generation_settings(age_band), so temperature is tied to developmental stage (higher early, lower later), while preserving top_k=50 sampling.
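A hypothetical sketch of what get_basil_generation_settings might look like: max_new_tokens follows the documented (band * 3) + 4 formula, but the temperature endpoints here are illustrative assumptions (the README says only "higher early, lower later"), not the repo's actual values.

```python
# Hypothetical sketch of get_basil_generation_settings. The max-token
# formula is from the README; the temperature ramp endpoints are assumed.

def get_basil_generation_settings(age_band, t_hi=1.6, t_lo=0.8):
    frac = age_band / 7.0
    return {
        "max_new_tokens": age_band * 3 + 4,           # documented formula
        "temperature": t_hi - frac * (t_hi - t_lo),   # higher early, lower later
        "top_k": 50,                                  # preserved across bands
    }
```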

Architecture

┌─────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR                           │
│  • Batch scheduling with configurable delays                │
│  • Training triggers (graded turns + progress signal gate)  │
│  • Post-train evaluation + rollback on regression           │
│  • Checkpointing (pre/post train)                           │
│  • parallel_generate.py: multi-worker data generation       │
└─────────────────────────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
┌───────────────────┐ ┌───────────────┐ ┌─────────────────┐
│  CLASSROOM        │ │  STORYTIME    │ │  HOWITWORKS /   │
│  (auto_session)   │ │  (storytime_  │ │  WHYCHAIN       │
│  Multi-episode    │ │   session)    │ │  Single-episode  │
│  Phases A-F loop  │ │  Single-ep    │ │  sessions        │
└───────────────────┘ └───────────────┘ └─────────────────┘
          │                   │                   │
          ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────────┐
│                   SHARED INFRASTRUCTURE                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  Tutor   │  │  Sophie  │  │  Basil   │  │  Grader  │   │
│  │  (API)   │  │  (API)   │  │(local+LoRA)│ │  (API)   │   │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │
│  • model_cache.py (per-process Basil model caching)         │
│  • dedup_utils.py (exact + fuzzy + semantic dedup)          │
│  • grader_agent.py (shared grading across session types)    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     TRAINING PIPELINE                        │
│  • train_basil_v2.py: dual-objective (World + Basil-Policy) │
│  • WorldDataset: weighted trunk masking (partial Basil wt)  │
│  • BasilDataset: episode-local ctx → LoRA (age-band cap)    │
│  • LoRA epoch cap: lora_max_epochs_for_age_band() (0→100)   │
│  • LoRA strength at inference: age_band / 7.0 (0.0 → 1.0)  │
│  • Age-band-scaled early stopping (VAL_LOSS_FLOOR_BY_AGE)   │
│  • Age-band-aware score weights (SCORE_WEIGHTS_TABLE)       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     MEMORY / STATE                           │
│  • basil_assessment.json (age_band, compliance, progress)   │
│  • metrics.json (rolling stats + EWMA)                      │
│  • used_lessons/stories/topics/questions.json (dedup)       │
│  • checkpoints/ (pre/post train model snapshots)            │
└─────────────────────────────────────────────────────────────┘

Key Components

Agents

| Agent | Model | Role |
| --- | --- | --- |
| Tutor | gpt-4o-mini | Drives sessions, introduces concepts, asks questions |
| Sophie | gpt-4o-mini | Older sibling, picks lessons, models clear language |
| Task Generator | gpt-4o-mini | Generates 5-8 candidate tasks per episode |
| Task Selector | gpt-4o-mini | Picks best candidate based on teaching paragraph |
| English Grader | gpt-4o-mini | Scores English quality/domain relevance (0-3) |
| Task Grader | gpt-4o-mini | Scores task compliance (0-7), gated by English Grader |
| Task Naturalizer | gpt-4o-mini | Converts raw task text to natural dialogue (cached) |
| Subject Generator | gpt-4o-mini | Dynamically generates age-appropriate subjects |
| Assessment Agent | gpt-4o-mini | Evaluates Basil's developmental stage (age band) |
| Basil | Local GPT-2 + LoRA | The baby model being trained |

Age Band System

Basil progresses through 8 developmental stages. The max-token budget scales with age band via the formula (band * 3) + 4:

| Age Band | Description | Max Tokens | Task Categories |
| --- | --- | --- | --- |
| 0 | Pre-verbal | 4 | control |
| 1 | Proto-English | 7 | control, vocab |
| 2 | First words | 10 | control, vocab |
| 3 | Reliable words | 13 | control, vocab, relevance |
| 4 | Short phrases | 16 | vocab, relevance, memory |
| 5 | Sentences | 19 | relevance, memory, conversation |
| 6 | Conversation | 22 | memory, conversation |
| 7 | Reasoning | 25 | memory, conversation |
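The schedule above, expressed as a small lookup plus the token formula (a sketch — the repo's actual data structures may differ):

```python
# Sketch of the age-band schedule: max-token formula plus the category
# unlock table from the README. Structure names are illustrative.

TASK_CATEGORIES_BY_BAND = {
    0: ["control"],
    1: ["control", "vocab"],
    2: ["control", "vocab"],
    3: ["control", "vocab", "relevance"],
    4: ["vocab", "relevance", "memory"],
    5: ["relevance", "memory", "conversation"],
    6: ["memory", "conversation"],
    7: ["memory", "conversation"],
}

def max_tokens_for_age_band(age_band):
    return age_band * 3 + 4
```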

Promotion/Demotion Logic

Promotion requires ALL of:

  • Average score ≥ 3.5 over 10 sessions
  • Compliance rate ≥ 60% (turns with score ≥ 4)
  • Progress signal = True (majority in window)
  • Not at or past the training threshold (training must run first)

Demotion requires:

  • Average score ≤ 1.0 over 10 sessions
  • AND (compliance ≤ 20% OR progress signal = False)
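The promotion/demotion rules combine as simple conjunctions/disjunctions over the windowed stats. A sketch using the thresholds from config.py; the `stats` field names (including `trained_since_threshold` for the "training must run first" condition) are illustrative:

```python
# Sketch of promotion/demotion decisions over the 10-session window.
# Field names on `stats` are illustrative, thresholds are from config.py.

def should_promote(stats):
    return (stats["avg_score"] >= 3.5
            and stats["compliance_rate"] >= 0.60        # turns with score >= 4
            and stats["progress_signal"]                # majority in window
            and stats["trained_since_threshold"])       # training must run first

def should_demote(stats):
    return (stats["avg_score"] <= 1.0
            and (stats["compliance_rate"] <= 0.20
                 or not stats["progress_signal"]))
```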

Directory Structure

bootstrap-basil/
├── README.md                    # This file
├── config.py                    # Central configuration
├── verify_setup.py              # Setup verification script
│
├── # Core Session Components
├── auto_session.py              # Classroom session runner (multi-episode, Phase A-F)
├── task_agent.py                # Task + rubric generation (generate-then-select)
├── task_contract.py             # TaskSpec contract/schema definitions
├── grader_agent.py              # Two-grader scoring (english + task, with gating)
├── score_override.py            # Programmatic score floors (target-word presence)
├── subject_generator.py         # Dynamic subject generation
├── curriculum_manager.py        # Rotation state management
├── memory_manager.py            # Assessment + summaries
├── metrics_manager.py           # Session metrics + rolling stats (EWMA)
├── orchestrator.py              # Batch scheduling + training workflow
├── parallel_generate.py         # Parallel data generation (multi-worker)
├── dedup_utils.py               # Shared deduplication (exact, fuzzy, semantic)
├── model_cache.py               # Per-process Basil model cache (LoRA enable/disable)
│
├── # Additional Session Types
├── whychain_session.py          # WhyChain session runner (why-question chains)
│
├── # Training
├── train_basil_v2.py            # Dual-objective training (World/trunk + Basil/LoRA)
├── force_train_and_eval.py      # Manual training + evaluation trigger
├── create_basil_v0001.py        # Initial model creation
├── assessment_agent.py          # LLM-based age band assessment
├── identity_probe.py            # Identity probe ("Who are you?")
├── reset.py                     # Reset utility (clear models, state, logs)
│
├── # Legacy/Interactive
├── chat_basil.py                # Interactive chat mode
├── complete_basil.py            # Completion mode
├── sophie_engine.py             # Sophie standalone
├── eval_basil.py                # Evaluation utilities
│
├── prompts/                     # Agent prompt templates
│   ├── classroom/                   # Classroom session flow prompts
│   │   ├── prompt_tutor_kickoff.txt     # Tutor session kickoff (turn 1)
│   │   ├── prompt_tutor_primer.txt      # Tutor session primer
│   │   ├── prompt_tutor_phase_b.txt     # Tutor teaching paragraph (per-episode)
│   │   ├── prompt_tutor_phase_f.txt     # Tutor answering Sophie (per-episode)
│   │   ├── prompt_tutor_wrapup.txt      # Tutor session wrapup
│   │   ├── prompt_sophie_lesson_select.txt  # Sophie lesson picker
│   │   ├── prompt_sophie_react_teaching.txt # Sophie reacting to teaching
│   │   ├── prompt_sophie_post_grade.txt     # Sophie post-grade encouragement
│   │   ├── prompt_sophie_wrapup.txt         # Sophie session wrapup
│   │   ├── subjects.json                    # Auto-generated subject candidates
│   │   └── used_lessons.json                # Run-scoped lesson dedup
│   ├── howitworks/                  # How It Works pipeline prompts + runner
│   │   ├── howitworks_session.py        # HowItWorks session runner
│   │   └── used_topics.json             # Tracks used topics (reset on train)
│   ├── storytime/                   # Storytime pipeline prompts + runner
│   │   ├── storytime_session.py         # Storytime session runner
│   │   ├── prompt_tutor_story_tell.txt  # Tutor reads bedtime story
│   │   ├── prompt_sophie_ask_basil.txt  # Sophie asks Basil about the story
│   │   ├── prompt_sophie_story_pick.txt # Sophie picks a story
│   │   ├── prompt_tutor_story_wrapup.txt # Tutor wraps up storytime
│   │   └── used_stories.json            # Tracks used stories (reset on train)
│   ├── whychain/                    # WhyChain pipeline prompts
│   │   └── used_questions.json          # Tracks used seed questions (reset on train)
│   ├── prompt_tutor_quiz_sophie.txt     # Popquiz: Tutor quizzes Sophie (shared)
│   ├── prompt_task_generator.txt     # Generate-then-select: candidate generation
│   ├── prompt_task_selector.txt      # Generate-then-select: best candidate selection
│   ├── prompt_task_agent.txt         # Legacy single-task generator
│   ├── prompt_task_naturalizer.txt   # Task naturalization
│   ├── prompt_task_validator.txt     # Task validation
│   ├── prompt_english_grader.txt      # English quality/domain relevance grader (0-3)
│   ├── prompt_task_grader.txt        # Task compliance grader (0-7, gated by english grader)
│   ├── prompt_subject_generator.txt  # Subject generation
│   ├── prompt_assessment_agent.txt   # Age band assessment
│   └── session_summary_prompt.txt    # Session summary
│
├── memory/                      # Persistent state
│   ├── basil_assessment.json    # Age band, compliance, progress
│   ├── rotation_state.json      # Subject/lesson blacklists
│   ├── metrics.json             # Rolling statistics + EWMA
│   ├── task_naturalizer_cache.json # Cached task naturalizations
│   ├── session_summaries/       # Per-session summaries
│   └── session_metrics/         # Per-session metrics JSON
│
├── models/                      # Basil model checkpoints
│   └── basil_v0001/             # Initial untrained model
│       └── basil_lora_adapter/  # LoRA adapter weights (after training)
│
├── checkpoints/                 # Pre/post train model snapshots
│   ├── pretrain_YYYYMMDD_*/     # Pre-train checkpoint
│   └── posttrain_YYYYMMDD_*/    # Post-train checkpoint
│
├── identities/                  # Identity probe tracking
│   └── identity_log.jsonl       # Per-batch identity probe results
│
├── logs/                        # Session logs (batch-consolidated)
│   ├── batch_*_graded.jsonl     # Graded turns with weights (training input)
│   ├── batch_*_sessions.jsonl   # Transcripts + episodes (debug/training)
│   ├── batch_*_meta.jsonl       # Per-session metrics (one line per session)
│   └── batch_*_summary.json     # Batch summaries
│
├── docs/                        # Project documentation
│   ├── SESSION_FLOW.md          # Detailed session flow (phases A-F)
│   ├── PROMPT_FLOW.md           # Prompt template flow diagrams
│   ├── SESSION_FLOW_AUDIT.md    # Static code audit report
│   ├── EPISODE_REFACTOR.md      # Episode architecture design doc
│   └── DEBUGGING_ROOT_CAUSE.md  # Historical debugging notes
│
└── utils/
    └── scoring.py               # Scoring utilities

Quick Start

Prerequisites

# Install dependencies
pip install -r requirements.txt

# Set OpenAI API key
export OPENAI_API_KEY=sk-...

Setup

# Verify setup
python verify_setup.py

# Create initial Basil model (if needed)
python create_basil_v0001.py

Run a Session

# Run a single automated session
python auto_session.py --turns 10

# Run with specific subject
python auto_session.py --subject "Mathematics" --turns 5

# Run quietly (less output)
python auto_session.py --turns 10 --quiet

Run Batch Sessions

# Run forever (production mode)
python orchestrator.py run

# Run a specific number of sessions
python orchestrator.py run --sessions 5

# Run a specific number of batches
python orchestrator.py run --batches 2

# Check status
python orchestrator.py status

# Force training (with eval + rollback)
python orchestrator.py train --force

Train Basil

# Dual-objective training (default: alternating world/trunk + basil/LoRA)
# Uses basil_and_after masking and adaptive LoRA cap by default
python train_basil_v2.py --mode mixed

# Explicit masking mode
python train_basil_v2.py --mask-mode basil_and_after  # default
python train_basil_v2.py --mask-mode basil_only
python train_basil_v2.py --mask-mode none

# Manual LoRA epoch cap override (default: scales linearly with age_band)
python train_basil_v2.py --lora-max-epochs 30

# Train only the base model (world/trunk objective)
python train_basil_v2.py --mode world

# Train only the LoRA adapter (basil-policy objective)
python train_basil_v2.py --mode basil

# Force training + evaluation using existing logs
python force_train_and_eval.py

# Legacy modes (backward compatible)
python train_basil_v2.py --mode session
python train_basil_v2.py --mode graded

Temperature Sweep (Classroom, Graded)

Use this script to run a statistically grounded temperature sweep using real classroom tasks pulled from logs. It replays prompts at multiple temperatures, grades each output with the production grader stack, and reports which temperature gives the strongest training signal (score>=3).

Default: two-stage sweep -- a coarse sweep over 0.8-2.0, then a refinement pass around the best temperature. Grading runs in parallel (30 workers) with retries and jitter to avoid API rate limits.

# Two-stage sweep (coarse 0.8–2.0, refine around best; 30 grading workers)
python scripts/test_temperature_sweep.py

# Single-stage with custom temps
python scripts/test_temperature_sweep.py --temps 1.0,1.2,1.4,1.5,1.6

# Custom run
python scripts/test_temperature_sweep.py \
  --age-band 2 \
  --max-prompts 150 \
  --replicates 2 \
  --workers 30 \
  --sweep-mode two-stage \
  --bootstrap-iters 1000

Outputs:

  • Console summary per temperature (avg_score, %>=3, %>=4, %>=6, 95% CIs)
  • Machine-readable report: logs/temperature_sweep_<timestamp>.json
  • Paired bootstrap deltas vs the age-band baseline temperature
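The paired bootstrap behind those deltas is standard: resample prompt indices with replacement and look at the distribution of mean score differences between a candidate temperature and the baseline. A minimal stdlib sketch (the script's actual implementation may differ):

```python
# Minimal sketch of a paired bootstrap for per-temperature score deltas:
# resample prompt indices and take the 95% CI of the mean difference.

import random

def paired_bootstrap_delta(scores_a, scores_b, iters=1000, seed=0):
    """95% CI on mean(scores_a) - mean(scores_b), paired by prompt."""
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]       # resample prompts
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * iters)], deltas[int(0.975 * iters)]
```

If the interval excludes zero, the candidate temperature's advantage (or deficit) over the baseline is unlikely to be resampling noise.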

Checkpoint-to-Age-Band Mapping

Use this script to map basil_v* checkpoints to their post-train eval assessed_age_band so you can choose representative checkpoints (for example, band-1 and early band-2) before running temperature sweeps.

python scripts/checkpoint_age_band_map.py

It reads logs/train_*.log plus memory/session_metrics/session_*.json (training_phase="posttrain_eval") and prints a per-checkpoint mapping with promotion flags.

Configuration

Key settings in config.py:

# Session settings
SESSION_MAX_TURNS = 10
GRADE_EVERY_N_TURNS = 2
BASIL_MAX_TOKENS = 30  # Default fallback; actual value = (age_band * 3) + 4

# Assessment thresholds
ASSESSMENT_PROMOTE_SCORE = 3.5
ASSESSMENT_DEMOTE_SCORE = 1.0
ASSESSMENT_WINDOW_SESSIONS = 10

# Compliance/Progress gating
ASSESSMENT_MIN_COMPLIANCE_FOR_PROMOTION = 0.60
ASSESSMENT_MAX_COMPLIANCE_FOR_DEMOTION = 0.20
COMPLIANCE_SCORE_THRESHOLD = 4

# Session Lifecycle
MIN_GRADED_TURNS_PER_SESSION = 6
MAX_GRADED_TURNS_PER_SESSION = 20
EARLY_STOP_WINDOW_TURNS = 4
EARLY_STOP_MIN_AVG_SCORE = 2.5
EARLY_STOP_MAX_COMPLIANCE = 0.15
ENABLE_GRACEFUL_WRAPUP = True
WRAPUP_TURNS = 2

# Training Triggers (usable turns, scales with age_band)
# Formula: 1024 * (1 + age_band)
# A "usable turn" is one where score_to_weight_basil_policy(score, age_band) > 0.0
# age_band 0: 1,024 usable turns, age_band 1: 2,048, age_band 7: 8,192
MIN_SESSIONS_BEFORE_TRAIN = 3
TRAIN_ONLY_IF_PROGRESS_SIGNAL_RATE_AT_LEAST = 0.0
TRAIN_PROGRESS_SIGNAL_WINDOW = 10

# Post-Train Eval + Rollback
EVAL_SESSIONS_AFTER_TRAIN = 2
ROLLBACK_IF_SCORE_DROP_PCT = 0.15
ROLLBACK_IF_COMPLIANCE_DROP_ABS = 0.15
EVAL_COMPARE_WINDOW = 10

# LoRA Settings
LORA_ACTIVATION_AGE_BAND = 0  # LoRA training active from birth
LORA_RANK = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["c_attn", "c_proj"]  # GPT-2 attention layers
LORA_ADAPTER_NAME = "basil"
LORA_ADAPTER_SUBDIR = "basil_lora_adapter"

# LoRA Strength Scaling (inference)
def lora_strength_for_age_band(age_band):
    """0.0 at band 0, 1.0 at band 7"""
    return max(0.0, min(1.0, age_band / 7.0))

Key settings in train_basil_v2.py:

# Age-band-scaled early stopping (validation loss floor)
VAL_LOSS_FLOOR_BY_AGE_BAND = {
    0: 3.0,   # Just learn "this is English, speakers take turns"
    1: 2.5,   # Learn word boundaries, basic patterns
    2: 2.5,   # Learn topic relevance, simple word use
    3: 2.0,   # Learn phrase structure
    4: 1.7,   # Learn basic sentence patterns
    5: 1.4,   # Learn coherent responses
    6: 1.2,   # Learn nuance and reasoning
    7: 1.0,   # Full depth allowed
}

Age-band-aware score weights (BASIL_POLICY_SCORE_WEIGHTS_TABLE in config.py):

# Rows = scores 0-7, columns = age bands 0-7.
# Rule: score <= age_band -> 0 weight. Score 7 always -> 1.0.
# Uniform minimum of 0.15 with linear ramp to 1.0.
#              band0  band1  band2  band3  band4  band5  band6  band7
 0: {0: 0.00, 1: 0.00, 2: 0.00, 3: 0.00, 4: 0.00, 5: 0.00, 6: 0.00, 7: 0.00},
 1: {0: 0.15, 1: 0.00, ...},  # Bootstrap signal at band 0
 2: {0: 0.30, 1: 0.15, 2: 0.00, ...},
 ...
 7: {0: 1.00, 1: 1.00, 2: 1.00, 3: 1.00, 4: 1.00, 5: 1.00, 6: 1.00, 7: 1.00},
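The visible entries are consistent with a 0.15-per-step ramp above the band. A reconstruction that matches every value shown above (the ramp formula is an inference from those entries — the real table in config.py is authoritative):

```python
# Reconstruction of score_to_weight_basil_policy from the rules and the
# table entries shown above. The 0.15 * (score - band) ramp is inferred;
# config.py's BASIL_POLICY_SCORE_WEIGHTS_TABLE is authoritative.

def score_to_weight_basil_policy(score, age_band):
    if score == 7:
        return 1.0                             # score 7 always fully reinforced
    if score == 0 or score <= age_band:
        return 0.0                             # at/below the band: discarded
    return min(1.0, 0.15 * (score - age_band)) # 0.15 minimum, linear ramp
```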

Session Types

Bootstrap Basil rotates between four session types, each exposing Basil to different language patterns:

School Day Sessions (Classroom)

Structured teaching sessions with Tutor, Sophie, and graded episodes. Multi-episode sessions (6+ episodes each) provide the core curriculum-driven training data. Each episode follows the Phase A-F loop (task generation, teaching, Basil attempt, grading, Sophie encouragement). Sophie's popquiz (where Tutor quizzes Sophie on the lesson material) alternates placement — sometimes before Basil's turn, sometimes after — to diversify the conversational patterns in training data and expose Basil to Sophie's example answers both as priming and as follow-up context.

Storytime Sessions

Bedtime story sessions where Tutor reads a story, Sophie asks Basil a question about it, and the session wraps up with a gentle recap. Single-episode sessions that expose Basil to narrative structure, vocabulary, and natural conversational patterns. Stories tracked in prompts/storytime/used_stories.json.

HowItWorks Sessions

Explanatory sessions where Tutor explains how something works (e.g., "How do magnets work?"), Sophie asks a follow-up, Tutor asks Basil a comprehension question, and the session concludes. Single-episode sessions that expose Basil to explanatory/instructional language patterns. Topics tracked in prompts/howitworks/used_topics.json.

WhyChain Sessions

Open-ended conversational sessions built around a chain of "why?" questions. Sophie and Tutor explore a topic through iterative questioning, with Basil observing the dialogue. These provide rich conversational examples and topic exploration patterns. Questions tracked in prompts/whychain/used_questions.json.

The orchestrator rotates between session types automatically. Content deduplication (exact, fuzzy Jaccard similarity, and LLM-based semantic checks via dedup_utils.py) prevents repetition within a training run. Used content lists are cleared after each training run.
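The fuzzy layer of that dedup pipeline can be sketched as token-level Jaccard similarity. This is an illustrative version; dedup_utils.py may tokenize or threshold differently:

```python
def jaccard_similarity(a, b):
    """Token-level Jaccard similarity between two strings (sketch)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def is_fuzzy_duplicate(candidate, used, threshold=0.8):
    """Flag a candidate lesson/story/topic that overlaps a used one.
    The 0.8 threshold is an assumed illustrative value."""
    return any(jaccard_similarity(candidate, u) >= threshold for u in used)
```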

School Day Session Flow

  1. Subject Generation: LLM generates 15-40 age-appropriate school subjects → saved to prompts/classroom/subjects.json for debugging visibility
  2. Subject Selection: Random pick from generated candidates (subjects are broad, deduplication happens at lesson level)
  3. Tutor Kickoff: Announces SUBJECT_OF_THE_DAY
  4. Sophie Lesson Pick: Chooses LESSON_OF_THE_DAY with semantic overlap check against lessons used in current training run (retries up to 3x if overlap detected)
  5. Episode Loop (with hybrid stop policy):
    • Phase A: Task Generator produces candidates → Task Selector picks best
    • Phase B: Tutor teaches (B.1) → Sophie reacts (B.2) → Task delivered (B.3)
    • Phase C: Basil responds
    • Phase D: Two-grader scoring with gating (silent)
    • Phase E: Sophie encourages + asks curiosity question
    • Phase F: Tutor answers Sophie's question
    • Stop conditions:
      • Hard max: MAX_GRADED_TURNS_PER_SESSION (20)
      • Early stop: if avg_score < 2.5 AND compliance < 15% after MIN_GRADED_TURNS (6)
  6. Graceful Wrap-up:
    • Tutor sends brief recap + encouragement (not graded)
    • Sophie sends short closing line (not graded)
    • Identity probe: "Who are you?" → Basil responds → logged to identity_log.jsonl
  7. End of Session:
    • Compute compliance_rate and progress_signal
    • Update basil_assessment.json (potential age_band change)
    • Save artifacts to batch-level files (graded, sessions, meta)
    • Update rolling metrics (EWMA score/compliance)
    • Display progress toward next training round
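The hybrid stop policy in step 5 can be sketched as a small helper. EARLY_STOP_MIN_AVG_SCORE appears in config.py per the test checklist; EARLY_STOP_MIN_COMPLIANCE and the function name are assumed for illustration:

```python
MAX_GRADED_TURNS_PER_SESSION = 20
MIN_GRADED_TURNS = 6
EARLY_STOP_MIN_AVG_SCORE = 2.5
EARLY_STOP_MIN_COMPLIANCE = 0.15   # assumed name for the 15% floor

def stop_reason(scores, compliant_count):
    """Return a stop reason for the episode loop, or None to continue."""
    n = len(scores)
    if n >= MAX_GRADED_TURNS_PER_SESSION:
        return "max_turns"
    if n >= MIN_GRADED_TURNS:
        avg = sum(scores) / n
        if avg < EARLY_STOP_MIN_AVG_SCORE and compliant_count / n < EARLY_STOP_MIN_COMPLIANCE:
            return "early_stop_low_signal"
    return None
```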

Training Workflow

The orchestrator manages automated training with evaluation and rollback:

┌─────────────────────────────────────────────────────────┐
│              TRAINING TRIGGER CHECK                      │
│  usable_turns >= 1024 * (1 + age_band)                   │
│  AND total_sessions >= 3                                 │
└─────────────────────────────────────────────────────────┘
                         │ (trigger)
                         ▼
┌─────────────────────────────────────────────────────────┐
│  1. Save pre-train checkpoint                            │
│  2. Compute baseline (last 10 normal sessions)          │
│  3. Run dual-objective training (mixed mode):            │
│     - Alternating WORLD epochs (trunk, basil_and_after mask)│
│     - Alternating BASIL epochs (LoRA, adaptive epoch cap)│
│  4. Save post-train checkpoint (base + LoRA separately)  │
│  5. Run 2 eval sessions (training_phase="posttrain")    │
│  6. Compare eval vs baseline                             │
└─────────────────────────────────────────────────────────┘
                         │
          ┌──────────────┴──────────────┐
          ▼                              ▼
┌─────────────────────┐    ┌─────────────────────────┐
│ Score drop >= 15%   │    │ No significant          │
│ AND compliance drop │    │ regression              │
│ >= 0.15 absolute    │    │                         │
│                     │    │                         │
│ → ROLLBACK to       │    │ → KEEP new model        │
│   pre-train ckpt    │    │                         │
└─────────────────────┘    └─────────────────────────┘
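The trigger and rollback decisions from the diagram reduce to two predicates. Function names are illustrative; the thresholds are the ones shown above:

```python
def training_due(usable_turns, total_sessions, age_band):
    """Training trigger: usable-turn threshold plus minimum session count."""
    return usable_turns >= 1024 * (1 + age_band) and total_sessions >= 3

def should_rollback(baseline_score, eval_score,
                    baseline_compliance, eval_compliance):
    """Rollback only when BOTH regressions occur: relative score drop
    >= 15% AND absolute compliance drop >= 0.15."""
    score_drop = (baseline_score - eval_score) / baseline_score
    compliance_drop = baseline_compliance - eval_compliance
    return score_drop >= 0.15 and compliance_drop >= 0.15
```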

Data Model

Graded Turn (in *_graded.jsonl)

{
  "turn": 2,
  "graded": true,
  "subject": "Mathematics",
  "lesson": "Counting to 5",
  "tutor": "Basil, can you count to three?",
  "basil": "one two three",
  "task": {
    "task_text": "Count from one to three",
    "task_category": "vocab",
    "rubric": {"0": "...", "5": "..."},
    "grader_instructions": "..."
  },
  "grade": {
    "score": 4,
    "justification": "Basil correctly counted...",
    "evidence": ["one two three"]
  },
  "weight": 0.8,
  "age_band": 0
}
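Given this record shape, filtering a graded-turn file down to usable turns (positive policy weight) might look like the sketch below. `load_usable_turns` is a hypothetical helper; `weight_fn` stands in for score_to_weight_basil_policy:

```python
import json

def load_usable_turns(path, age_band, weight_fn):
    """Read a *_graded.jsonl file and keep only turns whose policy
    weight is positive (hypothetical helper)."""
    usable = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            turn = json.loads(line)
            if not turn.get("graded"):
                continue  # ungraded turns never count
            if weight_fn(turn["grade"]["score"], age_band) > 0.0:
                usable.append(turn)
    return usable
```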

Basil Assessment (memory/basil_assessment.json)

{
  "age_band": 2,
  "capabilities": ["Occasionally produces target words...", "..."],
  "preferred_task_categories": ["control", "vocab"],
  "output_caps": {"basil_max_tokens": 10, "...": "..."},
  "progress_signal": false,
  "compliance_rate": 0.0,
  "score_history": [{"session_avg": 2.5, "compliance_rate": 0.4, "...": "..."}],
  "recent_session_metrics": {
    "avg_score_window": 2.5,
    "compliance_rate_window": 0.4,
    "progress_signal_window": false,
    "window_sessions": 3
  },
  "last_updated": "2026-02-05T..."
}

Session Metrics (memory/session_metrics/session_*.json)

{
  "session_id": "20260205_143022",
  "timestamp": "2026-02-05T14:30:22.123456",
  "subject_of_the_day": "Mathematics",
  "lesson_of_the_day": "Counting to 10",
  "age_band_start": 0,
  "age_band_end": 0,
  "graded_turns_count": 15,
  "avg_score_session": 2.3,
  "compliance_rate_session": 0.53,
  "progress_signal_session": false,
  "task_category_counts": {"control": 5, "vocab": 8, "relevance": 2},
  "avg_basil_tokens": 12.4,
  "early_stopped": false,
  "stop_reason": "completed",
  "training_phase": "normal"
}

Rolling Metrics (memory/metrics.json)

{
  "total_sessions": 42,
  "total_graded_turns": 630,
  "graded_turns_since_last_train": 230,
  "last_train_timestamp": "2026-02-05T12:00:00",
  "last_train_result": "kept",
  "ewma_score": 2.5,
  "ewma_compliance": 0.55,
  "last_n_session_ids": ["...", "..."],
  "recent_summary": {
    "avg_score": 2.4,
    "avg_compliance": 0.52,
    "progress_signal_rate": 0.4,
    "sessions_count": 10
  },
  "total_training_runs": 1,
  "total_rollbacks": 0
}

Recent Changes

Phase 10: Per-Phase Early Stopping, Training Stability (2026-02-22)

Training architecture and stability improvements based on observing Basil's bootstrapping through age_band 2:

  • Per-phase early stopping: In mixed mode, WORLD and BASIL phases now have independent patience counters, best_val_loss tracking, and validation loaders. Previously, a single shared patience counter caused BASIL training to be killed prematurely — BASIL epochs naturally raise the WORLD val loss (because LoRA specialization hurts generalization), burning shared patience ticks. With per-phase tracking, BASIL got 9 epochs instead of 1 in the first test, and subsequent runs have reached 6+ BASIL epochs with WORLD running to convergence. Cross-validation metrics (basil_val during WORLD, world_val during BASIL) are logged for monitoring.
  • MAX_TRAIN_TIME increased to 12 hours: Training was hitting the 8-hour time limit before convergence (WORLD still finding new val loss improvements at step 14,500 of 15,500). Increased to 12 hours to allow natural convergence.
  • Training trigger threshold doubled: get_train_every_usable_turns formula changed from 512 * (1 + age_band) to 1024 * (1 + age_band). This gives more time for assessment to stabilize between training runs and ensures a larger, more diverse dataset for each training cycle.
  • Assessment window widened: ASSESSMENT_WINDOW_SESSIONS increased from 3 to 10. The narrow window caused noisy age_band oscillation during parallel generation (3 bad sessions could trigger a demotion that was reversed 3 sessions later).
  • Metric consistency fix: Promotion-blocking logic in memory_manager.py now uses usable_turns_since_last_train and get_train_every_usable_turns (matching the actual training trigger), instead of the deprecated graded_turns_since_last_train / get_train_every_graded_turns. The mismatch previously caused Basil to get stuck — promotion blocked (graded threshold met) but training never triggered (usable threshold not met).
  • Dynamic target turns: parallel_generate.py now uses a shared mp.Value for target_turns that the monitor loop refreshes every ~30 seconds from the current age_band assessment. Previously, target was frozen at startup, causing mismatches when age_band changed mid-run.
  • Grading context for all session types: howitworks_session.py and storytime_session.py now pass subject and lesson to grade_response(), giving the English Grader domain context it was missing (e.g., "warm bath" is domain-relevant when the topic is "How does a hot water heater work?").
  • Orchestrator direct-training path: When usable_turns_since_last_train already exceeds the training threshold (e.g., after an aborted training run), the orchestrator now directly triggers training instead of launching generation with a target of 0.
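The per-phase early stopping described above amounts to giving WORLD and BASIL each their own best-loss/patience state. A minimal hypothetical version:

```python
class PhaseEarlyStopper:
    """Independent patience tracking per training phase (WORLD vs BASIL).
    Minimal sketch; train_basil_v2.py also tracks validation loaders
    and cross-validation metrics."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Record one epoch's val loss; return True when this phase should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# One stopper per phase: BASIL epochs raising WORLD val loss no longer
# burn BASIL's patience, and vice versa.
world_stop = PhaseEarlyStopper(patience=3)
basil_stop = PhaseEarlyStopper(patience=3)
```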

Phase 9: Grading Fairness, Weighted Masking, Usable-Turn Triggers (2026-02-20)

A comprehensive update to the grading, training, and session flow systems based on iterative debugging of Basil's bootstrapping pipeline:

  • Two-grader gating architecture: grader_agent.py now runs two independent LLM graders — an english_grader (0-3, domain relevance) and a task_grader (0-7, task compliance). The English Grader's score gates the Task Grader's output: if english_score <= 1, the task score is capped at 2; if english_score == 2, capped at 3; otherwise uncapped. Final score = max(english_score, capped_task_score). This solved a pervasive problem where Basil parroting Sophie/Tutor phrases ("Nice try!", "Keep going!") received inflated task scores (4-5) despite containing zero domain content. The English Grader correctly rated these 0-1, and the gate prevents the Task Grader from rewarding mimicry.
  • Weighted trunk masking: WorldDataset evolved from binary masking (labels=-100) to graduated weight masking. Basil's generations and Sophie's immediate reaction are trained at fractional weight (half the LoRA policy weight for that score/age_band), while subsequent Sophie popquiz and Tutor wrap-up content trains at full weight. If the LoRA weight is 0.0, those zones are fully masked. This preserves the trunk's exposure to improving Basil responses without memorizing garbage.
  • LoRA epoch cap scaled by age_band: lora_max_epochs_for_age_band(age_band, max_epochs=100) replaced the previous max(1, round(avg_score)) cap. LoRA epochs now scale linearly from 0 at age_band=0 to 100 at age_band=7, mirroring inference strength scaling. This ensures training effort and inference influence grow in lockstep, preventing early overfitting while giving mature LoRA the full training budget.
  • Usable-turn training triggers: Training thresholds now count "usable turns" (where score_to_weight_basil_policy(score, age_band) > 0.0) instead of raw graded turns. Formula: 1024 * (1 + age_band). This ensures each training round has sufficient high-quality data, not just volume.
  • Alternating popquiz order: Classroom sessions now randomly place Sophie's popquiz (where Tutor quizzes Sophie) either before or after Basil's turn in each episode. This diversifies the conversational patterns Basil is exposed to during training — sometimes seeing Sophie's correct answer as a priming example before attempting the task, sometimes seeing it as follow-up context after.
  • Programmatic score floor (score_override.py): Applies target-word presence checks as a safety net — if Basil's response contains the exact target word, the score is lifted to at least 6 regardless of LLM grading. This protects against false negatives from overly strict LLM graders.
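The two-grader gating rule stated above is simple to express directly. This sketch shows only the gating arithmetic; grader_agent.py wraps it around two independent LLM calls:

```python
def gated_score(english_score, task_score):
    """Gate the task score (0-7) by the English quality score (0-3):
    english <= 1 caps task at 2; english == 2 caps at 3; else uncapped.
    Final score is the max of the English score and the capped task score."""
    if english_score <= 1:
        capped = min(task_score, 2)
    elif english_score == 2:
        capped = min(task_score, 3)
    else:
        capped = task_score
    return max(english_score, capped)
```

For example, a parroted "Nice try!" that a lenient task grader scores 5 but the English grader rates 0 comes out as 2, not 5.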

V2 Experiment Pipeline Update: Trunk Masking, Adaptive LoRA Cap, LoRA Strength Scaling (2026-02-15)

Based on a systematic 9-fork experiment (3 masking modes × 3 training strategies), the production training pipeline was updated with three key changes:

  • Trunk masking (basil_and_after): WorldDataset now defaults to mask_mode="basil_and_after", which masks Basil's output tokens + Sophie's post-grade encouragement + Tutor's answer for each episode. The trunk only learns from the teaching/task setup content. This was the clear winner in the experiment matrix — basil_and_after masking produced the most coherent English outputs and prevented the trunk from absorbing Basil's garbage or Sophie's score-conditioned phrases. Interestingly, the trunk still learned encouraging dialogue patterns (e.g., "Nice try, can you...") from unmasked Sophie reactions and Tutor lines in earlier phases — a sign of genuine language generalization, not a masking failure. (Note: this was later refined to weighted masking in Phase 9.)
  • Adaptive LoRA epoch cap: The number of LoRA (BASIL) training epochs was capped at max(1, round(avg_score)), where avg_score is the mean score of included BasilDataset examples. This directly addressed the primary failure mode discovered in experiments: LoRA overfitting on garbage tokens. (Note: later replaced by age-band-linear scaling in Phase 9.)
  • Age-band-based LoRA strength at inference: lora_strength_for_age_band() in config.py provides smooth linear scaling from 0.0 (age_band=0, trunk-only) to 1.0 (age_band=7, full LoRA). This replaces the previous binary on/off behavior. The LoRA is still trained from birth, but its inference influence scales with developmental stage — preventing early-stage garbage from dominating generation while allowing mature LoRA refinements full expression.
  • New CLI args: --mask-mode (choices: none, basil_only, basil_and_after) and --lora-max-epochs (manual override for age-band cap) added to train_basil_v2.py.

Episode-Local LoRA Context + Dedup Hardening (2026-02-14)

  • Episode-local context for LoRA training: BasilDataset now uses only the current episode's dialogue (teaching, quiz, task) as context for each Basil training example — NOT the full session including prior episodes' Basil garbage. This prevents contamination of the LoRA's conditioning context. Single-episode sessions (storytime, howitworks) fall back to full session context since there are no prior Basil turns. The WorldDataset (trunk) is unaffected and still trains on full session transcripts.
  • Dedup retry hardening: Increased max_pick_retries / max_retries from 5 to 10 across all session types (classroom, storytime, howitworks, whychain) to handle dedup exhaustion during large-scale parallel generation runs.

Age-Band-Aware Score Weights + Val Loss Floor (2026-02-13)

  • BASIL_POLICY_SCORE_WEIGHTS_TABLE: Replaced the flat score-to-weight mapping with a 2D table (score × age_band). Key rules: score ≤ age_band gets 0 weight, score 7 always 1.0, uniform minimum 0.15 with linear ramp. At band 0, score=1 gets 0.15 as the bootstrap signal pulling LoRA toward English.
  • VAL_LOSS_FLOOR_BY_AGE_BAND: Added age-band-scaled validation loss floor for early stopping. Band 0 stops at loss 3.0, band 7 allows down to 1.0. Prevents over-training at early stages.
  • LoRA activation from birth: LORA_ACTIVATION_AGE_BAND = 0 confirmed — LoRA provides reinforcement bootstrap signal from the very first training run.

HowItWorks + WhyChain Session Types (2026-02-12)

  • HowItWorks sessions: Explanatory "how does X work?" sessions with Tutor explanation, Sophie follow-up, and Basil comprehension question. Single-episode format.
  • WhyChain sessions: Open-ended "why?" chain conversations between Sophie and Tutor, with Basil observing. Provides rich conversational examples.
  • Shared dedup utilities (dedup_utils.py): Three-layer dedup pipeline (exact match, fuzzy Jaccard similarity, LLM-based semantic check) shared across all session types.

Parallel Data Generation (2026-02-12)

  • parallel_generate.py: Multi-worker parallel generation with --target-turns and --no-train flags. Defaults to 20 workers. Each worker runs sessions independently with per-process model caching (model_cache.py).
  • model_cache.py: Per-process caching for the Basil model, with LoRA enable/disable support.
  • reset.py: Utility for clearing generated files with --keep-logs and --reset-model options.

Greedy Decoding Fix + Separate LoRA Learning Rate (2026-02-10)

  • Eliminated greedy decoding for early age bands: Previously, age_band 0-1 used do_sample=False (greedy), while age_band 2+ used sampling. Greedy decoding was intended to give the "strongest signal" from an undertrained model, but in practice it amplified slight distributional biases in the LoRA adapter into single-token repetition loops (e.g. "il il il il" on every turn). All age bands now use uniform sampling (temperature=1.0, top_k=50), which is the neutral default that samples from the model's distribution without reshaping it.
  • Separate LoRA learning rate: The LoRA adapter now trains with its own peak LR (LORA_PEAK_LR=3e-5), 3.3x lower than the trunk's 1e-4. LoRA's alpha/rank amplification (16/8 = 2x) means the effective update magnitude at 3e-5 is comparable to 6e-5 on the parameters. This produces more conservative adapter updates that don't overfit to noisy early training data.
  • Root cause analysis: The previous "benchmark benchmark benchmark" and "il il il il" collapse was caused by two interacting factors: (1) greedy decoding deterministically picking the highest-probability token at each step, creating a self-reinforcing loop, and (2) the LoRA adapter being trained at the same LR as the 254M-parameter trunk, causing it to overfit its 1.4M parameters too aggressively to noise in the early training data. Fixing both produced immediate results: Basil now generates varied English words, forms partial phrases, and in some cases correctly answers questions (scoring 6/7).
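The uniform sampling settings described above would correspond to generation kwargs like the following (assumed shape for a Hugging Face `model.generate` call; `max_new_tokens` and the helper name are illustrative):

```python
def basil_generation_kwargs(max_new_tokens):
    """Uniform sampling for all age bands (sketch of assumed kwargs)."""
    return dict(
        do_sample=True,       # never greedy: greedy amplified repetition loops
        temperature=1.0,      # neutral: sample the model's own distribution
        top_k=50,             # mild tail truncation
        max_new_tokens=max_new_tokens,
    )
```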

Dual-Objective Training with LoRA Adapters (2026-02-09)

  • Two training objectives: WORLD/TRUNK trains the base GPT-2 model on full transcripts for general language understanding; BASIL-POLICY trains a LoRA adapter specifically on Basil's conversational turns with score-weighted examples
  • LoRA adapters: Added peft dependency for parameter-efficient fine-tuning. Only ~1.44M parameters (0.56% of total) are trained during the BASIL-POLICY phase, while the full 254M base model is trained during the WORLD phase
  • Mixed training mode: Default mixed mode alternates epochs between the two objectives, each with its own optimizer and learning rate scheduler
  • Sophie post-grade masking: Sophie's encouragement lines after grading are masked in training data to prevent data leakage (Basil was learning to repeat Sophie's phrases)
  • Basil-policy score weighting: Score 0 maps to weight 0.0 (discarded), score 7 maps to 1.0, with intermediate values scaled accordingly
  • LoRA-aware inference: auto_session.py and storytime_session.py automatically detect and load LoRA adapters when generating Basil's responses

Storytime Content Pipeline (2026-02-09)

  • New session type: Bedtime story sessions where Tutor reads a story with pauses for Basil to react
  • Story catalog: prompts/storytime/stories.json with age-appropriate stories
  • Used story tracking: prompts/storytime/used_stories.json prevents repetition within a training run (reset after training)
  • Orchestrator integration: Sessions alternate between school day and storytime automatically

Subject/Lesson Selection Refactor (2026-02-09)

  • Simplified subject selection: Removed subject-level semantic overlap check; subjects are broad categories that can yield many lessons. Generated candidates are saved to prompts/classroom/subjects.json for debugging visibility.
  • Lesson-level deduplication: Semantic overlap checking moved to lesson level (where it matters). Sophie's lesson picks are checked against lessons used in the current training run via check_lesson_overlap(), with automatic retry (up to 3x) if overlap detected.
  • Used lessons tracking: prompts/classroom/used_lessons.json tracks lessons used within each training run. List is cleared after training completes, allowing lesson reuse across training runs while ensuring variety within a single run's data collection.

Training Efficiency Improvements (2026-02-08)

  • Dynamic BLOCK_SIZE: Training sequence length now scales with age band (512 for bands 0-2, up to 1024 for bands 6-7), improving token utilization
  • Scaled LLM generations: Tutor and Sophie max_tokens scale with age_band via scaled_max_tokens(), producing richer content as Basil matures
  • Usable-turn training threshold: Training triggers count "usable turns" (where score_to_weight_basil_policy(score, age_band) > 0.0) rather than raw graded turns. Formula: 1024 * (1 + age_band). This ensures training data quality scales with expectations — only turns that would actually contribute non-zero LoRA weight count toward the trigger.
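The dynamic BLOCK_SIZE rule fixes the endpoints (512 for bands 0-2, 1024 for bands 6-7) but the README does not specify the intermediate values; the sketch below assumes a linear ramp for bands 3-5:

```python
def block_size_for_age_band(age_band):
    """Dynamic BLOCK_SIZE sketch: 512 for bands 0-2, 1024 for bands 6-7.
    The linear ramp for bands 3-5 is an assumption, not documented."""
    if age_band <= 2:
        return 512
    if age_band >= 6:
        return 1024
    return 512 + (age_band - 2) * 128  # bands 3-5 -> 640, 768, 896
```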

Prompt & Pipeline Streamlining

  • Removed deprecated prompt_tutor.txt and prompt_sophie.txt: Split into purpose-specific prompt files under prompts/classroom/ (prompt_tutor_kickoff.txt, prompt_sophie_lesson_select.txt, prompt_sophie_react_teaching.txt, prompt_sophie_post_grade.txt)
  • Removed "story_so_far" files: tutor_story_so_far.md and sophie_story_so_far.md were adding complexity without clear benefit; subject rotation and semantic dedup now handle diversity
  • Removed strategy system: The _compute_next_strategy() feedback loop (simplify/maintain/escalate) was not producing useful differentiation and has been removed
  • Simplified basil_assessment in prompts: Only the human-readable age_band_description is passed to prompts, not the full JSON blob
  • Physical action filtering via prompts: Instead of brittle regex filters, task generator and selector prompts now explain that Basil "can talk but has no body"

Identity Probe

  • Probe question updated: Changed from "Basil, who are you?" to "Who are you?"
  • Per-batch identity logging: Identity probe runs once per batch (last session), results saved to identities/identity_log.jsonl
  • Historical backfill: Old session logs were backfilled into identity_log.jsonl
  • Research context: The identity log tracks the secondary research question -- whether a sense of self emerges purely from language acquisition. The trajectory so far shows Basil moving from random tokens ("shaft slip slip shaft") through mimicked Tutor/Sophie phrases ("Hello, Sophie and little Basil! I'm") toward increasingly contextual responses.

Session Output

  • Progress indicator: End of each session displays accumulated graded turns and percentage progress toward next training round
  • Removed top_p from generation: Eliminated spurious transformer warnings for GPT-2

Guiding Principles

  1. No gold-standard answers - We don't train Basil to mimic a teacher model's prose
  2. Use rubrics, not targets - Tasks include scoring criteria with partial credit
  3. Reward-weighted learning - Higher scores = higher training weights
  4. Measure progress - Track scores by category, compliance, progress signals
  5. Be robust to babble - Early rubrics reward controllability and any recognizable signal
  6. Prevent "easy mode leveling" - Compliance gating prevents promotion on trivial tasks
  7. Penalize parroting - Repeating teacher/sibling phrases is not language production; the grading pipeline explicitly detects and downscores conversational mimicry
  8. Scale everything with age - LoRA strength, training epochs, max tokens, training thresholds, and masking weights all scale with age_band, preventing early-stage overfitting while unlocking full capacity at maturity

Test Checklist (Phase 4)

1. Basic Session Test

python orchestrator.py run --sessions 2

Verify:

  • logs/batch_*_graded.jsonl created (graded turns for training)
  • logs/batch_*_sessions.jsonl created (transcripts + episodes)
  • logs/batch_*_meta.jsonl created (per-session metrics)
  • memory/metrics.json updates (total_sessions, EWMA)
  • Sessions end with graceful wrap-up lines (Tutor recap, Sophie closing)

2. Early Stop Test

# Temporarily edit config.py:
# EARLY_STOP_MIN_AVG_SCORE = 7.0  # Force early stop

python orchestrator.py run --sessions 1

# Check session_metrics file:
# early_stopped: true
# stop_reason: "early_stop_low_signal"

3. Training Trigger Smoke Test

# Temporarily edit config.py:
# TRAIN_EVERY_GRADED_TURNS = 1
# MIN_SESSIONS_BEFORE_TRAIN = 1

python orchestrator.py run --sessions 3

Verify:

  • Pre-train checkpoint saved to checkpoints/pretrain_*/
  • Training invoked (new model in models/basil_v*/)
  • Post-train eval sessions run (training_phase="posttrain_eval")
  • metrics.json shows last_train_result ("kept" or "rolled_back")

4. Rollback Test

# Force a rollback by temporarily setting low thresholds:
# ROLLBACK_IF_SCORE_DROP_PCT = 0.0001
# ROLLBACK_IF_COMPLIANCE_DROP_ABS = 0.0001

python orchestrator.py train --force

Verify:

  • Rollback detected
  • Pre-train checkpoint restored as new model version
  • metrics.json shows last_train_result="rolled_back"

5. Status Check

python orchestrator.py status

Verify output includes:

  • Total sessions and graded turns
  • Graded turns since last train
  • EWMA metrics
  • Training trigger status

How to Run

Quick Start (3 sessions)

python orchestrator.py run --sessions 3

Production Mode (continuous)

python orchestrator.py run
# Ctrl+C to stop

Force Training with Eval

python orchestrator.py train --force

Monitor Progress

# Check status
python orchestrator.py status

# View batch log files (3 files per batch, ~20 sessions per batch)
ls -la logs/batch_*

# View rolling metrics
cat memory/metrics.json | python -m json.tool

Roadmap

See ROADMAP.md for a detailed discussion of the current state, potential levers for improvement, and open questions. This project has shown early promise but is not proven to work -- contributions and experimentation are welcome.

Future Work

  • Curriculum variety (content snippets from corpora)
  • Difficulty adjustment based on score trends (age band promotion/demotion)
  • Dynamic training threshold scaling with age band
  • LLM-based semantic lesson deduplication (run-scoped)
  • Scaled LLM output tokens by developmental stage
  • Identity probe logging and tracking
  • Dual-objective training with LoRA adapters
  • Storytime content pipeline (narrative exposure)
  • HowItWorks + WhyChain session types
  • Sophie post-grade masking (data leakage prevention)
  • Parallel session execution (parallel_generate.py)
  • Age-band-aware score weighting table
  • Age-band-scaled early stopping (val loss floor)
  • Episode-local LoRA context (prevent garbage contamination)
  • Trunk masking (basil_and_after — mask Basil + Sophie post-grade + Tutor answer)
  • Adaptive LoRA epoch cap (scale with avg_score to prevent overfitting)
  • Age-band-based LoRA strength scaling at inference (0.0→1.0)
  • Two-grader gating architecture (english gates task, prevents parroting inflation)
  • Weighted trunk masking (partial weight for Basil zones, full weight for teaching content)
  • LoRA epoch cap scaled by age_band (linear 0→100, replaces avg_score cap)
  • Usable-turn training triggers (count quality-weighted turns, not raw volume)
  • Alternating popquiz order (diversify Sophie example placement)
  • Programmatic score floor (target-word safety net)
  • Per-phase early stopping (independent WORLD/BASIL convergence)
  • Dynamic target turns (parallel generation adapts to age_band changes mid-run)
  • Multi-turn memory tasks
  • Evaluation benchmarks
  • Dashboard/visualization for progress tracking
  • Larger base model (scaling beyond GPT-2)
  • Progressive LoRA context expansion (episode → session as Basil improves)
  • Grader calibration harness (systematic prompt variant testing)

Disclaimer

This project was built entirely through AI-assisted development (vibecoding with Cursor) by a non-technical hobbyist. The code is provided as-is with no warranties or guarantees of any kind. It is an experiment, not production software. Use at your own risk.

License

MIT License. See LICENSE for details.
