Build an LLM from Scratch

A progressive, notebook-driven course that builds a modern decoder-only language model from raw text all the way up to a Mixture-of-Experts variant — entirely from scratch in PyTorch (no HuggingFace), sized to train on a laptop in minutes.

The Learning Journey

graph LR
  A[Bag-of-Words] --> B[Embeddings]
  B --> C[Attention]
  C --> D[Modern Components<br/>RoPE/RMSNorm/SwiGLU]
  D --> E[Llama-style Block]
  E --> F[Full Transformer Training]
  F --> G[Mixture-of-Experts]

Each new method earns its place by beating a measured number from the step before. Each notebook teaches step by step in plain language, with runnable code, inline plots, and assert-based sanity checks that act as a mathematical "grade" on your implementation.

The Curriculum

| # | Notebook | What you build |
|---|---|---|
| 00 | Setup & tour | environment check, auto device selection, the roadmap |
| 01 | Data & Bag-of-Words | word-level tokenizer; a BoW next-word baseline (the order-blind starting point) |
| 02 | Embeddings | dense embeddings that beat BoW with far fewer parameters; word geometry |
| 03 | Attention | scaled dot-product $\rightarrow$ causal mask $\rightarrow$ multi-head $\rightarrow$ grouped-query attention (GQA) |
| 04 | Modern components | RoPE, RMSNorm, SwiGLU: each vs. the older idea it replaces |
| 05 | Assembling the model | tour of the full Llama-style decoder block (imported from `model.py`), weight tying, param counts |
| 06 | Training | the training loop, loss curves, checkpointing |
| 07 | Evaluation & generation | perplexity, sampling (greedy/temperature/top-k/top-p), KV-cache |
| 08 | Tuning | learning-rate warmup + cosine decay, hyperparameter sweeps |
| 09 | BPE tokenizer | Byte-Pair Encoding from scratch vs. char-level ($\approx$2x compression) |
| 10 | Mixture-of-Experts | top-k routing + load balancing; the capstone |
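
As a taste of what a later notebook covers, the warmup-plus-cosine-decay schedule from notebook 08 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notebook's code: the function name `lr_at` and every parameter name and default value (`max_lr`, `min_lr`, `warmup_iters`, `max_iters`) are hypothetical.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_iters=100, max_iters=1000):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`.

    All names and default values are hypothetical, chosen for illustration.
    """
    if step < warmup_iters:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_iters
    if step >= max_iters:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_iters) / (max_iters - warmup_iters)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine


if __name__ == "__main__":
    for s in (0, 50, 99, 100, 550, 1000):
        print(s, f"{lr_at(s):.2e}")
```

The warmup keeps early updates small while the optimizer statistics settle, and the cosine phase lowers the rate smoothly rather than in abrupt steps.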

Architecture Overview

The reusable model lives in model.py. It implements a modern Llama-style architecture:

Positional Embeddings: Rotary Positional Embeddings (RoPE)
Normalization: RMSNorm (Root Mean Square Layer Normalization)
Activation Function: SwiGLU
Attention Mechanism: Grouped-Query Attention (GQA)
Inference Optimization: KV-Caching for efficient generation
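
To make one of these components concrete, here is a minimal pure-Python sketch of RMSNorm on a single feature vector. It is not the code in model.py: the function name `rms_norm`, the `eps` value, and the use of plain lists instead of tensors are assumptions made to keep the example dependency-free.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Rescale `x` to a root-mean-square of about 1, then apply a per-feature gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    The eps default is an assumed value for illustration.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
    print(out)
    # The mean of the squared outputs should be very close to 1.
    print(sum(v * v for v in out) / len(out))
```

Dropping the mean subtraction and bias makes RMSNorm slightly cheaper than LayerNorm while still keeping activations at a stable scale.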

Setup

Prerequisites

Python 3.10+
pip or uv
jupyterlab and jupytext

Installation

# Using standard pip
python3 -m venv .venv
source .venv/bin/activate          # macOS / Linux
# .venv\Scripts\Activate.ps1       # Windows (PowerShell)
pip install -r requirements.txt

# OR using uv (recommended for speed)
uv venv && uv pip install -r requirements.txt

Hardware Support

The notebooks automatically detect and use the best available compute engine:

NVIDIA GPU (CUDA) on Windows or Linux — preferred for training.
Apple Silicon GPU (MPS) on macOS (M1–M5) — high performance on Mac.
CPU — universal fallback; works on any machine, but significantly slower for training.

Windows / Linux NVIDIA users: Ensure you have a CUDA-enabled PyTorch build installed from pytorch.org. The default pip install torch may only provide CPU support.
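
The priority order described in this section (CUDA, then MPS, then CPU) might look roughly like the following. This is a sketch, not the notebooks' actual helper: `pick_device` and `detect_device` are hypothetical names. The only PyTorch calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(cuda_available, mps_available):
    """Return a device name using the priority order CUDA > MPS > CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Ask PyTorch which backends are usable; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Guard the attribute lookup in case the installed build lacks the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print("Using device:", detect_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, a CPU-only PyTorch build is the likely cause.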

How to Run & Learn

Running Notebooks

Notebooks live in notebooks/ as paired files: a jupytext .py source (the version-controlled source of truth) and a generated .ipynb. Open the .ipynb in JupyterLab for an interactive experience:

jupyter lab

Alternatively, run a notebook headlessly:

python notebooks/01_data_and_bag_of_words.py

Running on Google Colab

You can also run this project directly in a Google Colab notebook.

To get started, open a new Colab notebook and run the following cell:

!git clone https://github.com/orbek/train-llm.git
%cd train-llm
!pip install -r requirements.txt

Tip: For best performance during training, go to Runtime $\rightarrow$ Change runtime type and select T4 GPU.

How to Learn Effectively

Observe the Plots: Each step includes inline plots to visualize how embeddings or attention patterns change as you add complexity.
The "Assert" Check: Every notebook contains assert statements that validate your implementation against mathematical ground truths. If an assertion fails, stop and investigate! It means your code isn't behaving like a real LLM component should.
Modify & Break: The best way to learn is to change a hyperparameter (e.g., the number of attention heads or the learning rate) and observe how it impacts the loss curves in Notebook 06.
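
To illustrate the kind of assert-based check described above, here is a small self-contained sketch for causal attention weights. These are not the notebooks' actual assertions. The function name `causal_attention_weights` and the example scores are made up for illustration, and plain Python lists stand in for tensors.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked.

    Position i may attend only to positions j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Hypothetical raw attention scores for a 3-token sequence.
scores = [[0.5, 2.0, -1.0], [1.0, 0.0, 3.0], [0.2, 0.2, 0.2]]
w = causal_attention_weights(scores)

# Sanity checks in the spirit of the notebooks' asserts.
for i, row in enumerate(w):
    assert abs(sum(row) - 1.0) < 1e-9, "each row must be a probability distribution"
    assert all(p == 0.0 for p in row[i + 1:]), "no attention to future tokens"
assert w[0][0] == 1.0, "the first token can only attend to itself"
```

A failure in checks like these points to a specific property that broke (normalization or masking), which is usually faster to debug than a loss curve that merely looks wrong.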

Results

By notebook 07, you will have a ~9.4M-parameter model that generates recognizable Shakespearean text:

ROMEO:
The sea of the sea that would not see the sea
That thou hast seen the sea of the sea

(Note: The committed training run is capped for fast notebook rendering; raise max_iters in notebook 06 for sharper samples.)

