jeffelin/nanocoder

nanoCoder

The simplest repository for training a tiny Python coding model on your own CPU. No GPU required.

step     10 | loss 9.7521 | ppl 17189.75 | lr 2.70e-06 | tok/s 2573 | elapsed 31.8s
  [CPU: 76% | RAM: 18.3/24GB (76%) | Process: 1.57GB]
step     50 | loss 8.6149 | ppl 5513.08 | lr 1.47e-05 | tok/s 2296 | elapsed 35.7s
  [CPU: 79% | RAM: 18.7/24GB (78%) | Process: 1.56GB]
step    100 | loss 7.8829 | ppl 2651.57 | lr 2.97e-05 | tok/s 2998 | elapsed 27.3s
  [CPU: 74% | RAM: 17.8/24GB (74%) | Process: 1.79GB]

nanoCoder is a from-scratch implementation of a decoder-only transformer optimized for CPU training. Point it at a Python codebase, and it learns to write code like it. The model architecture follows modern best practices (RoPE, GQA, SwiGLU, RMSNorm) in ~300 lines, the training loop is ~400 lines, and the whole project is plain, readable Python.

Training Loss

100-step proof-of-concept run on Python 3.14 standard library (11.4M tokens, M4 Pro):

Loss
 10.0 |*
  9.5 | *
  9.0 |   * *
  8.5 |       * *
  8.0 |           * * *
  7.5 |                 * *
  7.0 |                     *
      +-----+-----+-----+-----+
      0    25    50    75   100   Step
Step    Loss   Perplexity   tok/s
  10   9.752       17,190   2,573
  20   9.479       13,087   2,772
  30   9.059        8,593   2,111
  40   8.778        6,488   2,355
  50   8.615        5,513   2,296
  60   8.509        4,957   2,895
  70   8.377        4,348   2,817
  80   8.185        3,586   2,889
  90   8.033        3,080   2,826
 100   7.883        2,652   2,998

Resource usage: avg 76% CPU, 1.5-2GB process memory, 76% system RAM on an M4 Pro 24GB.

Key Finding: bf16 is 50x slower on CPU

During development we discovered that PyTorch's scaled_dot_product_attention with bf16 on Apple Silicon CPU falls back to a naive kernel, making it ~50x slower than fp32 (53 tok/s vs 3,000 tok/s). This project defaults to fp32 for CPU training. bf16 is only beneficial on CUDA/MPS backends.

seq=512 fp32 bs=4:  0.75s/step  2716 tok/s
seq=512 bf16 bs=4: 38.38s/step    53 tok/s   <-- 50x slower!

Install

pip install torch numpy tokenizers datasets tqdm

Dependencies: torch, numpy, tokenizers (HuggingFace fast BPE), datasets (HuggingFace dataset loading), tqdm (progress bars). Optional: gradio (web UI), wandb (logging).

Quick Start

Verify everything works (no training, just builds all 3 model sizes and runs forward/backward):

python run.py --smoke-test

Train on a codebase (prepares tokenizer + data, then prints the train command):

python run.py --codebase /path/to/your/python/project

Train on Python's standard library (good first test, ~11M tokens):

python run.py --codebase $(python -c "import sysconfig; print(sysconfig.get_paths()['stdlib'])")

Then train:

OMP_NUM_THREADS=8 python -u -m nanocoder.train \
    --preset 50m \
    --data-dir output/data \
    --tokenizer output/tokenizer.json \
    --output-dir output/checkpoints \
    --max-steps 10000 \
    --batch-size 4 \
    --grad-accum 4 \
    --no-grad-ckpt

On an M4 Pro at ~3,000 tok/s, 10K steps processes ~40M tokens and takes about 3.5 hours.

After training, generate code:

python -m nanocoder.generate output/checkpoints/final.pt \
    --tokenizer output/tokenizer.json \
    --interactive

Or launch the web UI:

pip install gradio
python app.py

Architecture

nanoCoder uses a modern decoder-only transformer with the same building blocks as LLaMA/Mistral, scaled down:

Component           Choice                Why
Position encoding   RoPE                  Standard, no learned embeddings
Attention           Grouped Query (GQA)   Fewer KV heads save memory
FFN                 SwiGLU                Better than GELU at small scale
Normalization       RMSNorm (pre-norm)    Faster than LayerNorm
Weight tying        Embedding = LM head   Saves parameters
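Of these blocks, RMSNorm is the simplest to show concretely. A pure-Python sketch of the idea (model.py's real implementation operates on PyTorch tensors; the function name here is illustrative):

```python
import math

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root-mean-square.

    Unlike LayerNorm there is no mean subtraction and no bias,
    which saves a reduction pass and a parameter vector.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

# With unit gain, the output has RMS ~1 regardless of input scale.
out = rmsnorm([1.0, 2.0, 3.0, 4.0], gain=[1.0] * 4)
```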

Three preset sizes:

Preset   Params   Layers   Hidden   Heads   KV Heads    FFN   Context
50m       25.2M       12      384       6          2   1024       512
100m      61.5M       16      512       8          2   1408      2048
200m     175.6M       24      768      12          4   2048      2048

The 50m preset is the recommended starting point for CPU training. It uses ~1.5GB RAM, trains at ~3,000 tok/s on an M4 Pro, and can complete a meaningful training run in hours.
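The 25.2M figure can be reproduced from the preset table. A back-of-envelope count -- the 16,384-token vocabulary and 64-dim heads are assumptions for illustration, not values stated in the table:

```python
hidden, layers, heads, kv_heads, ffn = 384, 12, 6, 2, 1024
head_dim = hidden // heads                 # 64 (assumed)
vocab = 16_384                             # assumed tokenizer size

attn = 2 * hidden * hidden                 # q and o projections
attn += 2 * hidden * kv_heads * head_dim   # k and v use only kv_heads each (GQA)
mlp = 3 * hidden * ffn                     # SwiGLU: gate, up, down matrices
norms = 2 * hidden                         # two RMSNorm gain vectors per block
per_layer = attn + mlp + norms

# Final norm plus one tied embedding/LM-head matrix.
total = layers * per_layer + hidden + vocab * hidden
print(f"{total / 1e6:.1f}M")  # -> 25.2M
```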

CPU Training Performance

Benchmarked on Apple M4 Pro (24GB, 12 cores), fp32, OMP_NUM_THREADS=8:

Batch Size   Seq Length   Time/Step   Throughput
         1          256       0.19s   1,313 tok/s
         1          512       0.25s   2,082 tok/s
         2          512       0.32s   3,243 tok/s
         4          256       0.27s   3,811 tok/s
         4          512       0.69s   2,977 tok/s

Estimated training times for the 50M model:

Dataset                                    Steps     Time
Small codebase (~1M tokens)                2,000   ~20 min
Python stdlib (~11M tokens)               10,000   ~3.5 hr
Medium codebase (~50M tokens)             50,000    ~18 hr
Stack-Edu-Python subset (~500M tokens)   200,000   ~3 days
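These rows follow from throughput alone: time ≈ steps × tokens-per-step ÷ tok/s. A sketch, where the ~4,000 tokens-per-step figure is inferred from the "10K steps ≈ 40M tokens" statement earlier in this README and should be treated as an assumption:

```python
def eta_hours(steps, tokens_per_step=4_000, tok_per_s=3_000):
    """Rough wall-clock estimate for a CPU training run."""
    return steps * tokens_per_step / tok_per_s / 3600

# 10K steps comes out near the ~3.5 hr quoted for the stdlib run.
print(f"{eta_hours(10_000):.1f} h")  # -> 3.7 h
```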

Project Structure

nanocoder/
    config.py      134 lines   Model configs and presets
    model.py       307 lines   Transformer (RoPE, GQA, SwiGLU, RMSNorm)
    tokenizer.py   174 lines   Python-specific BPE tokenizer
    data.py        390 lines   Data loading (local files, HuggingFace, memmap)
    train.py       434 lines   Training loop with resource monitoring
    generate.py    283 lines   Inference (completion + fill-in-the-middle)
    evaluate.py    432 lines   HumanEval + MBPP benchmarks
    synthetic.py   600 lines   Phi-1 style textbook data generation
    distill.py     389 lines   Knowledge distillation from larger models
    monitor.py     394 lines   CPU/RAM/temp monitor with auto-throttle
app.py             541 lines   Gradio web UI
run.py             175 lines   Quick-start pipeline

Features

Training

  • Gradient accumulation for effective large batch sizes with tiny micro-batches
  • Cosine learning rate schedule with linear warmup
  • Resource monitoring: CPU, RAM, temperature tracking with auto-throttle
  • Periodic checkpointing with resume support
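The warmup-plus-cosine schedule can be sketched in a few lines. The default values below (max_lr, warmup, min_lr) are illustrative, not the ones nanocoder/train.py actually uses:

```python
import math

def lr_at(step, max_steps, max_lr=3e-4, warmup=200, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# LR ramps linearly, peaks at the end of warmup, and decays
# smoothly to min_lr by the final step.
```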

Data

  • Python-specific BPE tokenizer (preserves indentation)
  • Fill-in-the-middle (FIM) data augmentation for code completion
  • Memory-mapped datasets for efficient large-scale training
  • AST-based quality filtering (rejects files that don't parse)
  • HuggingFace dataset integration (The Stack, CodeParrot, etc.)
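FIM augmentation rearranges each document into prefix/suffix/middle order so the model learns to condition on surrounding context. A minimal sketch; the sentinel names follow the common FIM convention, and nanocoder's actual special tokens may differ:

```python
import random

def fim_transform(code, rng):
    """Split a document at two random points and emit it in PSM order:
    <fim_prefix> P <fim_suffix> S <fim_middle> M
    so the model predicts the middle given both sides."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

sample = fim_transform("def add(a, b):\n    return a + b\n", random.Random(0))
```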

Generation

  • Top-k, top-p (nucleus) sampling
  • Fill-in-the-middle code infilling
  • Interactive REPL mode
  • Web UI with Gradio

Synthetic Data (the phi-1 playbook)

  • Topic-seeded exercise generation across 60+ Python topics
  • Textbook-style teaching content generation
  • Supports Ollama (local), Anthropic API, or OpenAI API
  • Execution-based quality filtering (runs generated code, keeps what passes)
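Execution-based filtering reduces to "run it, keep it if it exits cleanly." A stdlib sketch of the idea; the real pipeline would additionally sandbox imports, filesystem, and network access:

```python
import os
import subprocess
import sys
import tempfile

def passes_execution(code, timeout=5.0):
    """Run a generated snippet in a subprocess; keep it only if it
    exits with status 0 within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Snippets whose embedded asserts pass survive; crashing ones are dropped.
```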

Distillation

  • Generate training labels from a larger teacher model
  • Train the small student model on teacher outputs
  • Works with local Ollama models (zero API cost)

Evaluation

  • HumanEval (164 problems, pass@k)
  • MBPP (974 problems, pass@k)
  • Sandboxed execution with timeouts
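pass@k is conventionally computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count c correct, and estimate 1 − C(n−c, k)/C(n, k). A sketch (whether evaluate.py uses exactly this estimator is an assumption):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n is among the c correct ones."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: pass@1 is simply the raw success rate (~0.3).
rate = pass_at_k(10, 3, 1)
```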

Synthetic Data Generation

The single biggest quality lever. Microsoft's phi-1 work showed that a 350M model trained on 7B "textbook quality" tokens (45% HumanEval) far outperforms a 350M model trained on 580B generic tokens (12.76% HumanEval). To generate synthetic data:

# Using a local Ollama model (free, no API key needed):
ollama pull qwen2.5-coder:7b
python -m nanocoder.synthetic all --provider ollama --model qwen2.5-coder:7b

# Using Anthropic API:
export ANTHROPIC_API_KEY=sk-...
python -m nanocoder.synthetic all --provider anthropic

# Compile into training format:
python -m nanocoder.synthetic compile

Distillation

Distill knowledge from a larger model into nanoCoder:

# Step 1: Generate teacher solutions
python -m nanocoder.distill generate \
    --teacher ollama:qwen2.5-coder:7b \
    --num-samples 5000

# Step 2: Train student on teacher data
python -m nanocoder.distill train \
    --teacher-data output/distill/teacher_labels.jsonl \
    --tokenizer output/tokenizer.json \
    --preset 50m

Evaluation

python -m nanocoder.evaluate output/checkpoints/final.pt \
    --tokenizer output/tokenizer.json \
    --benchmarks humaneval mbpp

Design Decisions

Why fp32? PyTorch's scaled_dot_product_attention has no optimized bf16 path on CPU; we measured a 50x slowdown with bf16 vs fp32. Always use fp32 for CPU training.

Why seq_len=512? At batch_size=4, seq_len=512 hits the throughput sweet spot (~3,000 tok/s). Longer sequences increase SDPA's quadratic cost without proportional benefit for the small model.

Why GQA with 2 KV heads? Grouped Query Attention with few KV heads saves memory and compute while retaining most of multi-head attention's expressiveness. Critical at small scale where every parameter counts.
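In GQA, each KV head is shared by a group of query heads. For the 50m preset (6 query heads, 2 KV heads) the grouping can be sketched as:

```python
n_heads, n_kv_heads = 6, 2       # 50m preset
group = n_heads // n_kv_heads    # 3 query heads share each KV head

# Which KV head each query head attends with.
mapping = {q: q // group for q in range(n_heads)}
print(mapping)  # -> {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# K and V projections shrink from n_heads*head_dim to n_kv_heads*head_dim
# output columns each -- a 3x saving in KV parameters and KV-cache memory here.
```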

Why no gradient checkpointing by default? For the 25M model, checkpointing adds ~2x overhead while saving negligible memory (the model only uses ~1.5GB total). Enable it for the 100M+ models where memory is the bottleneck.

Depth over width. Following SmolLM's finding, the 100M preset uses 16 layers with 512 hidden dim rather than fewer, wider layers. Deeper networks outperform wider ones at small scale.

Inspired By

  • phi-1 (Microsoft) - "Textbooks Are All You Need": data quality > quantity
  • SmolLM (HuggingFace) - depth > width for small models
  • nanoGPT (Karpathy) - clean, minimal implementation philosophy
  • TinyStarCoder-Py (BigCode) - Python-only small model baseline

License

MIT

About

A from-scratch 25M parameter Python coding model that trains on your CPU in hours. Modern transformer architecture (RoPE, GQA, SwiGLU) in 4,200 lines of readable Python.
