```
 _     _     __  __   _          _
| |   | |   |  \/  | | |    __ _| |__  ___
| |   | |   | |\/| | | |   / _` | '_ \/ __|
| |___| |___| |  | | | |__| (_| | |_) \__ \
|_____|_____|_|  |_| |_____\__,_|_.__/|___/
```
Made with ❤️ by Kumar Ishan, with a lot of help from ✨ 🤖 — so please be kind 😍.
This repo holds two tutorial tracks that follow code from Andrej Karpathy: a minimal microgpt walkthrough (single-file GPT, no ML frameworks) and a nanochat walkthrough (full tokenizer → pretrain → SFT → RL → eval → inference → web UI). Each chapter is reading-first: you trace real code, run hands-on steps, and build intuition for why each piece exists.
The material here was generated with the repo-tutorial skill from kumarishan/kistack — structured, chapter-by-chapter guides derived from the repositories themselves.
## microgpt — `microgpt.py`
“The most atomic way to train and run inference for a GPT in pure, dependency-free Python.”
| Chapter | What your reading covers |
|---|---|
| Chapter 1: Setup and First Run | Running the file with zero deps, reading train/inference output, a map of the whole pipeline |
| Chapter 2: Dataset and Tokenization | Character vocab, token IDs, BOS trick, sliding-window targets |
| Chapter 3: Autograd from Scratch | The Value type, computation graphs, local grads, reverse-mode autodiff |
| Chapter 4: Model Architecture Foundations | Init, linear(), stable softmax(), rmsnorm() |
| Chapter 5: Multi-Head Self-Attention | Q/K/V, scaled dot-product attention, KV cache, causal masking |
| Chapter 6: The Full GPT Forward Pass | MLP block, residual stream, token → logits |
| Chapter 7: Training — Loss, Backprop, and Adam | Cross-entropy, backward() through the graph, Adam |
| Chapter 8: Inference and Sampling | Autoregressive generation, temperature, sampling, what to try next |
Start here: microgpt/README.md
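As a taste of what the autograd chapter builds toward, here is a minimal sketch of a scalar `Value` type with reverse-mode autodiff. It is illustrative only: the class layout and method names here are assumptions for this sketch, not the repo's actual `Value` implementation, though the ideas (computation graph, local grads, `backward()`) are the ones Chapter 3 traces.

```python
# Tiny scalar autograd sketch (illustrative, not microgpt's actual code).
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # inputs in the computation graph
        self._local_grads = local_grads  # d(out)/d(input) for each child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Reverse-mode autodiff: topo-sort the graph, then apply the chain rule.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

x = Value(2.0); y = Value(3.0)
z = x * y + x          # z = x*y + x = 8
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 4, dz/dy = x = 2
```

The real file builds every model operation (matmul, softmax, rmsnorm, attention) out of exactly this kind of node, which is why the whole pipeline can run with zero dependencies.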
## nanochat — full LLM stack (review in progress)
A path from BPE and architecture through distributed pretraining, SFT, RL (GRPO), evaluation, fast inference, and a streaming chat server.
| Chapter | What your reading covers |
|---|---|
| Chapter 1: Setup and First Run | uv, project layout, device selection, tiny CPU end-to-end run, full pipeline overview |
| Chapter 2: LLM Fundamentals — Tokens, Embeddings, and Attention | Autoregressive LM, embeddings, self-attention, transformer blocks, cross-entropy |
| Chapter 3: Tokenization — Training BPE from Scratch | BPE, split patterns, RustBPE / tiktoken, specials, chat tokens, loss masking |
| Chapter 4: The GPT Architecture — Attention, RoPE, GQA, and Modern Innovations | RoPE, QK-norm, GQA, sliding windows, FA3, smearing, scalars, logit softcap |
| Chapter 5: Data Pipeline — Loading, Packing, and Distributing Training Data | Parquet / mix data, BOS-aligned cropping, DDP sharding, resumable state |
| Chapter 6: Pretraining — The Training Loop, Optimizers, and Mixed Precision | Training loop, CE + BPB, AdamW, Muon / Polar Express, AMP, accumulation, LR schedule |
| Chapter 7: Distributed Training — Multi-GPU with DDP | Data parallel, process groups, torchrun, all_reduce, checkpoints, MFU |
| Chapter 8: Evaluation — Bits-Per-Byte, CORE, and In-Context Learning | Bits-per-byte, in-context learning, CORE / MMLU-style evals, running eval scripts |
| Chapter 9: Supervised Fine-Tuning — Teaching the Model to Chat | Chat templates, assistant-only loss, benchmarks, mixtures, forgetting |
| Chapter 10: Reinforcement Learning — GRPO, Rewards, and Tool Use | Policy gradients, REINFORCE, GRPO/DAPO, rewards, tool use (e.g. Python sandbox) |
| Chapter 11: Inference — KV Cache, Flash Attention, and Sampling | KV cache, prefill vs decode, Flash vs SDPA, temperature / top-k, streaming |
| Chapter 12: Chat Interfaces — CLI and Web Server | CLI state, FastAPI, SSE, OpenAI-style API, async workers, multi-GPU serve |
| Chapter 13: Scaling Laws, Compute-Optimal Training, and Going Further | Chinchilla-style scaling, compute-optimal training, depth dial, MFU, FP8, where to go next |
Start here: nanochat/README.md
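To preview the tokenizer track, here is a sketch of a single BPE merge step in plain Python. The helper names `most_common_pair` and `merge` are invented for this sketch and are not nanochat's RustBPE / tiktoken API; they only show the core idea Chapter 3 develops: repeatedly mint a new token id for the most frequent adjacent pair.

```python
# One BPE merge step (illustrative sketch, not the repo's tokenizer code).
from collections import Counter

def most_common_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")    # raw bytes as initial token ids
pair = most_common_pair(ids)  # (97, 97): "aa" is the most frequent pair
ids = merge(ids, pair, 256)   # mint a new token id for "aa"
```

Training a real vocabulary just repeats this loop until the target vocab size is reached; the walkthrough covers why split patterns, special tokens, and loss masking complicate the picture.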
- Pick microgpt if you want every idea visible in ~200 lines of Python, and then head over to nanochat when you want the full production-shaped stack.
- Open the track README for time estimates and prerequisites, then follow chapters in order.
- Do the hands-on steps in each chapter — the goal is not skimming but tracing and running the real code paths.
See CONTRIBUTING.md. Please open an issue first to discuss improvements, vetting, or ideas for other repos that could use a similar tutorial.