The simplest repository for training a tiny Python coding model on your own CPU. No GPU required.
Sample training output:

```
step 10  | loss 9.7521 | ppl 17189.75 | lr 2.70e-06 | tok/s 2573 | elapsed 31.8s
  [CPU: 76% | RAM: 18.3/24GB (76%) | Process: 1.57GB]
step 50  | loss 8.6149 | ppl 5513.08  | lr 1.47e-05 | tok/s 2296 | elapsed 35.7s
  [CPU: 79% | RAM: 18.7/24GB (78%) | Process: 1.56GB]
step 100 | loss 7.8829 | ppl 2651.57  | lr 2.97e-05 | tok/s 2998 | elapsed 27.3s
  [CPU: 74% | RAM: 17.8/24GB (74%) | Process: 1.79GB]
```
nanoCoder is a from-scratch implementation of a decoder-only transformer optimized for CPU training. Point it at a Python codebase, and it learns to write code like it. The model architecture follows modern best practices (RoPE, GQA, SwiGLU, RMSNorm) in ~300 lines, the training loop is ~400 lines, and the whole project is plain, readable Python.
100-step proof-of-concept run on Python 3.14 standard library (11.4M tokens, M4 Pro):
```
Loss
10.0 |*
 9.5 |   *
 9.0 |      *  *
 8.5 |            *  *
 8.0 |                  *  *  *
 7.5 |                           *  *
 7.0 |                                 *
     +--------+--------+--------+--------+
     0       25       50       75      100   Step
```
| Step | Loss | Perplexity | tok/s |
|---|---|---|---|
| 10 | 9.752 | 17,190 | 2,573 |
| 20 | 9.479 | 13,087 | 2,772 |
| 30 | 9.059 | 8,593 | 2,111 |
| 40 | 8.778 | 6,488 | 2,355 |
| 50 | 8.615 | 5,513 | 2,296 |
| 60 | 8.509 | 4,957 | 2,895 |
| 70 | 8.377 | 4,348 | 2,817 |
| 80 | 8.185 | 3,586 | 2,889 |
| 90 | 8.033 | 3,080 | 2,826 |
| 100 | 7.883 | 2,652 | 2,998 |
Resource usage: avg 76% CPU, 1.5-2GB process memory, 76% system RAM on an M4 Pro 24GB.
During development we discovered that PyTorch's scaled_dot_product_attention with bf16 on Apple Silicon CPU falls back to a naive kernel, making it ~50x slower than fp32 (53 tok/s vs 3,000 tok/s). This project defaults to fp32 for CPU training. bf16 is only beneficial on CUDA/MPS backends.
```
seq=512 fp32 bs=4:  0.75s/step   2716 tok/s
seq=512 bf16 bs=4: 38.38s/step     53 tok/s   <-- 50x slower!
```
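The dtype decision above can be captured in a small helper. This is a sketch, not nanoCoder's actual code: the function name `pick_train_dtype` and the string return values are illustrative assumptions.

```python
def pick_train_dtype(device: str) -> str:
    """Choose a training dtype per backend.

    bf16 has no fast SDPA kernel on CPU (the benchmark above measured
    a ~50x slowdown), so CPU runs should stick to fp32; CUDA and MPS
    backends can benefit from bf16.
    (Hypothetical helper; nanoCoder's actual flags/names may differ.)
    """
    return "bfloat16" if device in ("cuda", "mps") else "float32"
```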
```
pip install torch numpy tokenizers datasets tqdm
```

Dependencies: PyTorch, NumPy, tokenizers (HuggingFace fast BPE), datasets (HuggingFace dataset loading), tqdm (progress bars). Optional: gradio (web UI), wandb (logging).
Verify everything works (no training, just builds all 3 model sizes and runs forward/backward):
```
python run.py --smoke-test
```

Train on a codebase (prepares tokenizer + data, then prints the train command):

```
python run.py --codebase /path/to/your/python/project
```

Train on Python's standard library (good first test, ~11M tokens):

```
python run.py --codebase $(python -c "import sysconfig; print(sysconfig.get_paths()['stdlib'])")
```

Then train:

```
OMP_NUM_THREADS=8 python -u -m nanocoder.train \
  --preset 50m \
  --data-dir output/data \
  --tokenizer output/tokenizer.json \
  --output-dir output/checkpoints \
  --max-steps 10000 \
  --batch-size 4 \
  --grad-accum 4 \
  --no-grad-ckpt
```

On an M4 Pro at ~3,000 tok/s, 10K steps processes ~40M tokens and takes about 3.5 hours.
After training, generate code:
```
python -m nanocoder.generate output/checkpoints/final.pt \
  --tokenizer output/tokenizer.json \
  --interactive
```

Or launch the web UI:

```
pip install gradio
python app.py
```

nanoCoder uses a modern decoder-only transformer with the same building blocks as LLaMA/Mistral, scaled down:
| Component | Choice | Why |
|---|---|---|
| Position encoding | RoPE | Standard, no learned embeddings |
| Attention | Grouped Query (GQA) | Fewer KV heads saves memory |
| FFN | SwiGLU | Better than GELU at small scale |
| Normalization | RMSNorm (pre-norm) | Faster than LayerNorm |
| Weight tying | Embedding = LM head | Saves parameters |
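Two of these building blocks are small enough to sketch directly. The NumPy version below shows what RMSNorm and a SwiGLU FFN compute; it is an illustration of the general technique, not nanoCoder's actual `model.py` code, and the function/parameter names are assumptions.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by reciprocal root-mean-square of the features.
    # Unlike LayerNorm there is no mean subtraction and no bias,
    # which is why it is cheaper.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: silu(x @ W_gate) gates (x @ W_up), then project back down.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```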
Three preset sizes:
| Preset | Params | Layers | Hidden | Heads | KV Heads | FFN | Context |
|---|---|---|---|---|---|---|---|
| 50m | 25.2M | 12 | 384 | 6 | 2 | 1024 | 512 |
| 100m | 61.5M | 16 | 512 | 8 | 2 | 1408 | 2048 |
| 200m | 175.6M | 24 | 768 | 12 | 4 | 2048 | 2048 |
The 50m preset is the recommended starting point for CPU training. It uses ~1.5GB RAM, trains at ~3,000 tok/s on an M4 Pro, and can complete a meaningful training run in hours.
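The preset parameter counts can be sanity-checked by hand. The sketch below tallies GQA attention, SwiGLU FFN, RMSNorm, and a tied embedding; the vocabulary size of 16,384 is an assumption (the README does not state it), chosen because it makes the 50m total come out near 25.2M.

```python
def count_params(vocab, layers, hidden, heads, kv_heads, ffn):
    head_dim = hidden // heads
    attn = hidden * hidden                      # Q projection
    attn += 2 * hidden * kv_heads * head_dim    # K and V (GQA: fewer KV heads)
    attn += hidden * hidden                     # output projection
    ffn_p = 3 * hidden * ffn                    # SwiGLU: gate, up, down matrices
    norms = 2 * hidden                          # two RMSNorms per block
    embed = vocab * hidden                      # tied with LM head, counted once
    return layers * (attn + ffn_p + norms) + embed + hidden  # + final norm

# 50m preset; vocab=16,384 is an assumption, not from the README
print(count_params(16_384, 12, 384, 6, 2, 1024))  # ~25.2M, matching the table
```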
Benchmarked on Apple M4 Pro (24GB, 12 cores), fp32, OMP_NUM_THREADS=8:
| Batch Size | Seq Length | Time/Step | Throughput |
|---|---|---|---|
| 1 | 256 | 0.19s | 1,313 tok/s |
| 1 | 512 | 0.25s | 2,082 tok/s |
| 2 | 512 | 0.32s | 3,243 tok/s |
| 4 | 256 | 0.27s | 3,811 tok/s |
| 4 | 512 | 0.69s | 2,977 tok/s |
Estimated training times for the 50M model:
| Dataset | Steps | Time |
|---|---|---|
| Small codebase (~1M tokens) | 2,000 | ~20 min |
| Python stdlib (~11M tokens) | 10,000 | ~3.5 hr |
| Medium codebase (~50M tokens) | 50,000 | ~18 hr |
| Stack-Edu-Python subset (~500M tokens) | 200,000 | ~3 days |
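The time estimates above follow from simple arithmetic on token budget and sustained throughput, as a rough sketch (the helper name is illustrative):

```python
def eta_hours(total_tokens: float, tok_per_s: float) -> float:
    """Rough wall-clock estimate from token budget and sustained throughput."""
    return total_tokens / tok_per_s / 3600.0

# The 10K-step stdlib run processes ~40M tokens at ~3,000 tok/s:
print(f"{eta_hours(40e6, 3000):.1f} h")  # ~3.7 h, in line with the ~3.5 hr figure
```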
```
nanocoder/
  config.py      134 lines   Model configs and presets
  model.py       307 lines   Transformer (RoPE, GQA, SwiGLU, RMSNorm)
  tokenizer.py   174 lines   Python-specific BPE tokenizer
  data.py        390 lines   Data loading (local files, HuggingFace, memmap)
  train.py       434 lines   Training loop with resource monitoring
  generate.py    283 lines   Inference (completion + fill-in-the-middle)
  evaluate.py    432 lines   HumanEval + MBPP benchmarks
  synthetic.py   600 lines   Phi-1 style textbook data generation
  distill.py     389 lines   Knowledge distillation from larger models
  monitor.py     394 lines   CPU/RAM/temp monitor with auto-throttle
app.py           541 lines   Gradio web UI
run.py           175 lines   Quick-start pipeline
```
Training
- Gradient accumulation for effective large batch sizes with tiny micro-batches
- Cosine learning rate schedule with linear warmup
- Resource monitoring: CPU, RAM, temperature tracking with auto-throttle
- Periodic checkpointing with resume support
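The cosine schedule with linear warmup listed above can be sketched in a few lines; this is an illustration of the standard schedule, and the function and argument names are assumptions rather than nanoCoder's actual API.

```python
import math

def lr_at(step, max_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```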
Data
- Python-specific BPE tokenizer (preserves indentation)
- Fill-in-the-middle (FIM) data augmentation for code completion
- Memory-mapped datasets for efficient large-scale training
- AST-based quality filtering (rejects files that don't parse)
- HuggingFace dataset integration (The Stack, CodeParrot, etc.)
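Fill-in-the-middle augmentation rearranges a document so the model conditions on both sides of a gap. A minimal sketch, assuming the common `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` sentinel convention (nanoCoder's actual special tokens may differ):

```python
import random

def fim_transform(code: str, rng: random.Random) -> str:
    """Split a document into (prefix, middle, suffix) and rearrange so the
    model sees prefix + suffix first, then learns to predict the middle."""
    a, b = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```

At generation time the same sentinels let the model infill: supply a prefix and suffix, and sample tokens after `<fim_middle>`.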
Generation
- Top-k, top-p (nucleus) sampling
- Fill-in-the-middle code infilling
- Interactive REPL mode
- Web UI with Gradio
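Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability exceeds p, then samples within it. A self-contained NumPy sketch of the technique (not nanoCoder's actual `generate.py`):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus sampling: restrict to the top-probability tokens whose
    cumulative mass exceeds p, renormalize, and sample from that set."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]         # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()  # renormalize within the nucleus
    return int(rng.choice(keep, p=kept))
```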
Synthetic Data (the phi-1 playbook)
- Topic-seeded exercise generation across 60+ Python topics
- Textbook-style teaching content generation
- Supports Ollama (local), Anthropic API, or OpenAI API
- Execution-based quality filtering (runs generated code, keeps what passes)
Distillation
- Generate training labels from a larger teacher model
- Train the small student model on teacher outputs
- Works with local Ollama models (zero API cost)
Evaluation
- HumanEval (164 problems, pass@k)
- MBPP (974 problems, pass@k)
- Sandboxed execution with timeouts
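The pass@k metric reported by both benchmarks is usually computed with the unbiased estimator from the HumanEval paper (Chen et al.): draw n samples per problem, count the c that pass, and estimate the chance that at least one of k random draws passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the probability that k samples drawn from n contain >= 1 passing one."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```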
Synthetic data is the single biggest quality lever. Microsoft's phi-1 work showed that a 350M model trained on 7B "textbook quality" tokens (45% HumanEval) crushes a 350M model trained on 580B generic tokens (12.76% HumanEval). To generate synthetic data:

```
# Using a local Ollama model (free, no API key needed):
ollama pull qwen2.5-coder:7b
python -m nanocoder.synthetic all --provider ollama --model qwen2.5-coder:7b

# Using the Anthropic API:
export ANTHROPIC_API_KEY=sk-...
python -m nanocoder.synthetic all --provider anthropic

# Compile into training format:
python -m nanocoder.synthetic compile
```

Distill knowledge from a larger model into nanoCoder:
```
# Step 1: Generate teacher solutions
python -m nanocoder.distill generate \
  --teacher ollama:qwen2.5-coder:7b \
  --num-samples 5000

# Step 2: Train student on teacher data
python -m nanocoder.distill train \
  --teacher-data output/distill/teacher_labels.jsonl \
  --tokenizer output/tokenizer.json \
  --preset 50m
```

Evaluate a trained checkpoint on the coding benchmarks:

```
python -m nanocoder.evaluate output/checkpoints/final.pt \
  --tokenizer output/tokenizer.json \
  --benchmarks humaneval mbpp
```

Why fp32? PyTorch's SDPA attention kernel has no optimized bf16 path on CPU; we measured a 50x slowdown with bf16 vs fp32. Always use fp32 for CPU training.
Why seq_len=512? At batch_size=4, seq_len=512 hits the throughput sweet spot (~3,000 tok/s). Longer sequences increase SDPA's quadratic cost without proportional benefit for the small model.
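The quadratic cost is easy to see with back-of-the-envelope FLOP counting. A rough sketch (the helper name and constant factors are illustrative, counting only the two seq-by-seq matmuls inside attention):

```python
def sdpa_flops(seq: int, hidden: int) -> int:
    """Approximate FLOPs for the attention score computation:
    QK^T and attention-weighted V are each a (seq x seq) matmul,
    so the cost grows quadratically in sequence length."""
    return 2 * 2 * seq * seq * hidden  # 2 matmuls, 2 FLOPs per multiply-add

ratio = sdpa_flops(1024, 384) / sdpa_flops(512, 384)
print(ratio)  # 4.0: doubling seq quadruples attention cost but only doubles tokens
```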
Why GQA with 2 KV heads? Grouped Query Attention with few KV heads saves memory and compute while retaining most of multi-head attention's expressiveness. Critical at small scale where every parameter counts.
Why no gradient checkpointing by default? For the 25M model, checkpointing adds ~2x overhead while saving negligible memory (the model only uses ~1.5GB total). Enable it for the 100M+ models where memory is the bottleneck.
Depth over width. Following SmolLM's finding, the 100M preset uses 16 layers with 512 hidden dim rather than fewer, wider layers. Deeper networks outperform wider ones at small scale.
- phi-1 (Microsoft) - "Textbooks Are All You Need": data quality > quantity
- SmolLM (HuggingFace) - depth > width for small models
- nanoGPT (Karpathy) - clean, minimal implementation philosophy
- TinyStarCoder-Py (BigCode) - Python-only small model baseline
MIT