The simplest repository for training a tiny Python coding model on your own CPU. No GPU required.
Sample training output:

```
step 10  | loss 9.7521 | ppl 17189.75 | lr 2.70e-06 | tok/s 2573 | elapsed 31.8s
  [CPU: 76% | RAM: 18.3/24GB (76%) | Process: 1.57GB]
step 50  | loss 8.6149 | ppl 5513.08  | lr 1.47e-05 | tok/s 2296 | elapsed 35.7s
  [CPU: 79% | RAM: 18.7/24GB (78%) | Process: 1.56GB]
step 100 | loss 7.8829 | ppl 2651.57  | lr 2.97e-05 | tok/s 2998 | elapsed 27.3s
  [CPU: 74% | RAM: 17.8/24GB (74%) | Process: 1.79GB]
```
nanoCoder is a from-scratch implementation of a decoder-only transformer optimized for CPU training. Point it at a Python codebase, and it learns to write code like it. The model architecture follows modern best practices (RoPE, GQA, SwiGLU, RMSNorm) in ~300 lines, the training loop is ~400 lines, and the whole project is plain, readable Python.
100-step proof-of-concept run on Python 3.14 standard library (11.4M tokens, M4 Pro):
```
Loss
10.0 |*
 9.5 |   *
 9.0 |      *  *
 8.5 |            *  *
 8.0 |                  *  *  *
 7.5 |                           *  *
 7.0 |                                 *
     +--------+--------+--------+--------+
     0       25       50       75      100   Step
```
| Step | Loss | Perplexity | tok/s |
|---|---|---|---|
| 10 | 9.752 | 17,190 | 2,573 |
| 20 | 9.479 | 13,087 | 2,772 |
| 30 | 9.059 | 8,593 | 2,111 |
| 40 | 8.778 | 6,488 | 2,355 |
| 50 | 8.615 | 5,513 | 2,296 |
| 60 | 8.509 | 4,957 | 2,895 |
| 70 | 8.377 | 4,348 | 2,817 |
| 80 | 8.185 | 3,586 | 2,889 |
| 90 | 8.033 | 3,080 | 2,826 |
| 100 | 7.883 | 2,652 | 2,998 |
Resource usage: avg 76% CPU, 1.5-2GB process memory, 76% system RAM on an M4 Pro 24GB.
During development we discovered that PyTorch's scaled_dot_product_attention with bf16 on Apple Silicon CPU falls back to a naive kernel, making it ~50x slower than fp32 (53 tok/s vs 3,000 tok/s). This project defaults to fp32 for CPU training. bf16 is only beneficial on CUDA/MPS backends.
```
seq=512 fp32 bs=4:  0.75s/step   2716 tok/s
seq=512 bf16 bs=4: 38.38s/step     53 tok/s   <-- 50x slower!
```
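The dtype decision above can be captured in a small helper. This is a sketch, not nanoCoder's actual code: the function name `pick_train_dtype` and the string return values are illustrative assumptions.

```python
def pick_train_dtype(device: str) -> str:
    """Choose a training dtype per backend.

    bf16 has no fast SDPA kernel on CPU (the benchmark above measured
    a ~50x slowdown), so CPU runs should stick to fp32; CUDA and MPS
    backends can benefit from bf16.
    (Hypothetical helper; nanoCoder's actual flags/names may differ.)
    """
    return "bfloat16" if device in ("cuda", "mps") else "float32"
```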
```
pip install torch numpy tokenizers datasets tqdm
```

Dependencies: PyTorch, NumPy, tokenizers (HuggingFace fast BPE), datasets (HuggingFace dataset loading), tqdm (progress bars). Optional: gradio (web UI), wandb (logging).
Verify everything works (no training, just builds all 3 model sizes and runs forward/backward):
```
python run.py --smoke-test
```

Train on a codebase (prepares tokenizer + data, then prints the train command):

```
python run.py --codebase /path/to/your/python/project
```

Train on Python's standard library (good first test, ~11M tokens):

```
python run.py --codebase $(python -c "import sysconfig; print(sysconfig.get_paths()['stdlib'])")
```

Then train:

```
OMP_NUM_THREADS=8 python -u -m nanocoder.train \
  --preset 50m \
  --data-dir output/data \
  --tokenizer output/tokenizer.json \
  --output-dir output/checkpoints \
  --max-steps 10000 \
  --batch-size 4 \
  --grad-accum 4 \
  --no-grad-ckpt
```

On an M4 Pro at ~3,000 tok/s, 10K steps processes ~40M tokens and takes about 3.5 hours.
After training, generate code:
```
python -m nanocoder.generate output/checkpoints/final.pt \
  --tokenizer output/tokenizer.json \
  --interactive
```

Or launch the web UI:

```
pip install gradio
python app.py
```

nanoCoder uses a modern decoder-only transformer with the same building blocks as LLaMA/Mistral, scaled down:
| Component | Choice | Why |
|---|---|---|
| Position encoding | RoPE | Standard, no learned embeddings |
| Attention | Grouped Query (GQA) | Fewer KV heads saves memory |
| FFN | SwiGLU | Better than GELU at small scale |
| Normalization | RMSNorm (pre-norm) | Faster than LayerNorm |
| Weight tying | Embedding = LM head | Saves parameters |
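Two of these building blocks are small enough to sketch directly. The NumPy version below shows what RMSNorm and a SwiGLU FFN compute; it is an illustration of the general technique, not nanoCoder's actual `model.py` code, and the function/parameter names are assumptions.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by reciprocal root-mean-square of the features.
    # Unlike LayerNorm there is no mean subtraction and no bias,
    # which is why it is cheaper.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: silu(x @ W_gate) gates (x @ W_up), then project back down.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```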
Three preset sizes:
| Preset | Params | Layers | Hidden | Heads | KV Heads | FFN | Context |
|---|---|---|---|---|---|---|---|
| 50m | 25.2M | 12 | 384 | 6 | 2 | 1024 | 512 |
| 100m | 61.5M | 16 | 512 | 8 | 2 | 1408 | 2048 |
| 200m | 175.6M | 24 | 768 | 12 | 4 | 2048 | 2048 |
The 50m preset is the recommended starting point for CPU training. It uses ~1.5GB RAM, trains at ~3,000 tok/s on an M4 Pro, and can complete a meaningful training run in hours.
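The preset parameter counts can be sanity-checked by hand. The sketch below tallies GQA attention, SwiGLU FFN, RMSNorm, and a tied embedding; the vocabulary size of 16,384 is an assumption (the README does not state it), chosen because it makes the 50m total come out near 25.2M.

```python
def count_params(vocab, layers, hidden, heads, kv_heads, ffn):
    head_dim = hidden // heads
    attn = hidden * hidden                      # Q projection
    attn += 2 * hidden * kv_heads * head_dim    # K and V (GQA: fewer KV heads)
    attn += hidden * hidden                     # output projection
    ffn_p = 3 * hidden * ffn                    # SwiGLU: gate, up, down matrices
    norms = 2 * hidden                          # two RMSNorms per block
    embed = vocab * hidden                      # tied with LM head, counted once
    return layers * (attn + ffn_p + norms) + embed + hidden  # + final norm

# 50m preset; vocab=16,384 is an assumption, not from the README
print(count_params(16_384, 12, 384, 6, 2, 1024))  # ~25.2M, matching the table
```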
Benchmarked on Apple M4 Pro (24GB, 12 cores), fp32, OMP_NUM_THREADS=8:
| Batch Size | Seq Length | Time/Step | Throughput |
|---|---|---|---|
| 1 | 256 | 0.19s | 1,313 tok/s |
| 1 | 512 | 0.25s | 2,082 tok/s |
| 2 | 512 | 0.32s | 3,243 tok/s |
| 4 | 256 | 0.27s | 3,811 tok/s |
| 4 | 512 | 0.69s | 2,977 tok/s |
Estimated training times for the 50M model:
| Dataset | Steps | Time |
|---|---|---|
| Small codebase (~1M tokens) | 2,000 | ~20 min |
| Python stdlib (~11M tokens) | 10,000 | ~3.5 hr |
| Medium codebase (~50M tokens) | 50,000 | ~18 hr |
| Stack-Edu-Python subset (~500M tokens) | 200,000 | ~3 days |
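The time estimates above follow from simple arithmetic on token budget and sustained throughput, as a rough sketch (the helper name is illustrative):

```python
def eta_hours(total_tokens: float, tok_per_s: float) -> float:
    """Rough wall-clock estimate from token budget and sustained throughput."""
    return total_tokens / tok_per_s / 3600.0

# The 10K-step stdlib run processes ~40M tokens at ~3,000 tok/s:
print(f"{eta_hours(40e6, 3000):.1f} h")  # ~3.7 h, in line with the ~3.5 hr figure
```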
```
nanocoder/
  config.py      134 lines   Model configs and presets
  model.py       307 lines   Transformer (RoPE, GQA, SwiGLU, RMSNorm)
  tokenizer.py   174 lines   Python-specific BPE tokenizer
  data.py        390 lines   Data loading (local files, HuggingFace, memmap)
  train.py       434 lines   Training loop with resource monitoring
  generate.py    283 lines   Inference (completion + fill-in-the-middle)
  evaluate.py    432 lines   HumanEval + MBPP benchmarks
  synthetic.py   600 lines   Phi-1 style textbook data generation
  distill.py     389 lines   Knowledge distillation from larger models
  monitor.py     394 lines   CPU/RAM/temp monitor with auto-throttle
app.py           541 lines   Gradio web UI
run.py           175 lines   Quick-start pipeline
```
Training
- Gradient accumulation for effective large batch sizes with tiny micro-batches
- Cosine learning rate schedule with linear warmup
- Resource monitoring: CPU, RAM, temperature tracking with auto-throttle
- Periodic checkpointing with resume support
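The cosine schedule with linear warmup listed above can be sketched in a few lines; this is an illustration of the standard schedule, and the function and argument names are assumptions rather than nanoCoder's actual API.

```python
import math

def lr_at(step, max_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```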
Data
- Python-specific BPE tokenizer (preserves indentation)
- Fill-in-the-middle (FIM) data augmentation for code completion
- Memory-mapped datasets for efficient large-scale training
- AST-based quality filtering (rejects files that don't parse)
- HuggingFace dataset integration (The Stack, CodeParrot, etc.)
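Fill-in-the-middle augmentation rearranges a document so the model conditions on both sides of a gap. A minimal sketch, assuming the common `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` sentinel convention (nanoCoder's actual special tokens may differ):

```python
import random

def fim_transform(code: str, rng: random.Random) -> str:
    """Split a document into (prefix, middle, suffix) and rearrange so the
    model sees prefix + suffix first, then learns to predict the middle."""
    a, b = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```

At generation time the same sentinels let the model infill: supply a prefix and suffix, and sample tokens after `<fim_middle>`.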
Generation
- Top-k, top-p (nucleus) sampling
- Fill-in-the-middle code infilling
- Interactive REPL mode
- Web UI with Gradio
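Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability exceeds p, then samples within it. A self-contained NumPy sketch of the technique (not nanoCoder's actual `generate.py`):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus sampling: restrict to the top-probability tokens whose
    cumulative mass exceeds p, renormalize, and sample from that set."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]         # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()  # renormalize within the nucleus
    return int(rng.choice(keep, p=kept))
```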
Synthetic Data (the phi-1 playbook)
- Topic-seeded exercise generation across 60+ Python topics
- Textbook-style teaching content generation
- Supports Ollama (local), Anthropic API, or OpenAI API
- Execution-based quality filtering (runs generated code, keeps what passes)
Distillation
- Generate training labels from a larger teacher model
- Train the small student model on teacher outputs
- Works with local Ollama models (zero API cost)
Evaluation
- HumanEval (164 problems, pass@k)
- MBPP (974 problems, pass@k)
- Sandboxed execution with timeouts
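The pass@k metric reported by both benchmarks is usually computed with the unbiased estimator from the HumanEval paper (Chen et al.): draw n samples per problem, count the c that pass, and estimate the chance that at least one of k random draws passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the probability that k samples drawn from n contain >= 1 passing one."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```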
Synthetic data is the single biggest quality lever. Microsoft's phi-1 work showed that a 350M model trained on 7B "textbook quality" tokens (45% HumanEval) crushes a 350M model trained on 580B generic tokens (12.76% HumanEval). To generate synthetic data:

```
# Using a local Ollama model (free, no API key needed):
ollama pull qwen2.5-coder:7b
python -m nanocoder.synthetic all --provider ollama --model qwen2.5-coder:7b

# Using the Anthropic API:
export ANTHROPIC_API_KEY=sk-...
python -m nanocoder.synthetic all --provider anthropic

# Compile into training format:
python -m nanocoder.synthetic compile
```

Distill knowledge from a larger model into nanoCoder:
```
# Step 1: Generate teacher solutions
python -m nanocoder.distill generate \
  --teacher ollama:qwen2.5-coder:7b \
  --num-samples 5000

# Step 2: Train student on teacher data
python -m nanocoder.distill train \
  --teacher-data output/distill/teacher_labels.jsonl \
  --tokenizer output/tokenizer.json \
  --preset 50m
```

Evaluate a trained checkpoint on the coding benchmarks:

```
python -m nanocoder.evaluate output/checkpoints/final.pt \
  --tokenizer output/tokenizer.json \
  --benchmarks humaneval mbpp
```

Why fp32? PyTorch's SDPA attention kernel has no optimized bf16 path on CPU; we measured a 50x slowdown with bf16 vs fp32. Always use fp32 for CPU training.
Why seq_len=512? At batch_size=4, seq_len=512 hits the throughput sweet spot (~3,000 tok/s). Longer sequences increase SDPA's quadratic cost without proportional benefit for the small model.
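The quadratic cost is easy to see with back-of-the-envelope FLOP counting. A rough sketch (the helper name and constant factors are illustrative, counting only the two seq-by-seq matmuls inside attention):

```python
def sdpa_flops(seq: int, hidden: int) -> int:
    """Approximate FLOPs for the attention score computation:
    QK^T and attention-weighted V are each a (seq x seq) matmul,
    so the cost grows quadratically in sequence length."""
    return 2 * 2 * seq * seq * hidden  # 2 matmuls, 2 FLOPs per multiply-add

ratio = sdpa_flops(1024, 384) / sdpa_flops(512, 384)
print(ratio)  # 4.0: doubling seq quadruples attention cost but only doubles tokens
```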
Why GQA with 2 KV heads? Grouped Query Attention with few KV heads saves memory and compute while retaining most of multi-head attention's expressiveness. Critical at small scale where every parameter counts.
Why no gradient checkpointing by default? For the 25M model, checkpointing adds ~2x overhead while saving negligible memory (the model only uses ~1.5GB total). Enable it for the 100M+ models where memory is the bottleneck.
Depth over width. Following SmolLM's finding, the 100M preset uses 16 layers with 512 hidden dim rather than fewer, wider layers. Deeper networks outperform wider ones at small scale.
- phi-1 (Microsoft) - "Textbooks Are All You Need": data quality > quantity
- SmolLM (HuggingFace) - depth > width for small models
- nanoGPT (Karpathy) - clean, minimal implementation philosophy
- TinyStarCoder-Py (BigCode) - Python-only small model baseline
MIT