
LLM-from-Scratch

Build a Large Language Model from scratch, following the Stanford CS336 assignments.

assignment1

Official upstream repository: https://github.com/stanford-cs336/assignment1-basics/tree/main

What you will implement

  1. Byte-pair encoding (BPE) tokenizer (§2)
  2. Transformer language model (LM) (§3)
  3. The cross-entropy loss function and the AdamW optimizer (§4)
  4. The training loop, with support for serializing and loading model and optimizer state (§5); the loss, optimizer, and checkpointing pieces are sketched after this list
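To give a flavor of §4 and §5, here is a minimal training-step sketch. It uses PyTorch built-ins (torch.nn.functional.cross_entropy, torch.optim.AdamW) as stand-ins for the from-scratch implementations, and the model and paths are placeholders:

    import torch

    model = torch.nn.Linear(64, 1000)  # placeholder for the Transformer LM
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

    def train_step(inputs, targets):
        logits = model(inputs)
        # §4: cross-entropy over the vocabulary (implemented by hand in the
        # assignment; F.cross_entropy is the stand-in here).
        loss = torch.nn.functional.cross_entropy(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # §5: model and optimizer state are both plain state dicts, so
    # checkpointing reduces to torch.save / torch.load.
    def save_checkpoint(path, step):
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, path)

    def load_checkpoint(path):
        ckpt = torch.load(path)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt["step"]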

What you will run

  1. Train a BPE tokenizer on the TinyStories dataset.
  2. Run your trained tokenizer on the dataset to convert it into a sequence of integer IDs.
  3. Train a Transformer LM on the TinyStories dataset.
  4. Generate samples and evaluate perplexity using the trained Transformer LM; see the perplexity sketch after this list.
  5. Train models on OpenWebText and submit your attained perplexities to a leaderboard.
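Perplexity (used in run steps 4 and 5) is just the exponentiated mean per-token cross-entropy. A minimal sketch, assuming logits of shape (batch, seq_len, vocab_size) and integer targets of shape (batch, seq_len):

    import torch

    def perplexity(logits, targets):
        # Mean negative log-likelihood per token, then exponentiate.
        nll = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), targets.flatten())
        return torch.exp(nll).item()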

Directory structure

assignment1-basics directory structure:
assignment1-basics/
├── config/          # configs for TinyStories / OWT experiments
├── cs336_basics/    # starter code and basic tokenizer examples
├── data/            # TinyStories and OpenWebText data
├── model/           # BPE vocab and merges used by the tokenizer
├── runs/            # training runs and saved checkpoints
├── script/          # scripts for training, tokenization, and experiments
├── src/             # core tokenizer and transformer implementation
│   ├── attention.py     # attention modules
│   ├── config.py        # model & training configs
│   ├── dataloader.py    # dataset & dataloader
│   ├── embedding.py     # token & positional embeddings
│   ├── generate.py      # text generation logic
│   ├── linear.py        # linear layers
│   ├── optimizer.py     # optimizer implementations
│   ├── rmsnorm.py       # RMSNorm layer
│   ├── rope.py          # rotary positional embeddings (RoPE)
│   ├── softmax.py       # numerically stable softmax
│   ├── swiglu.py        # SwiGLU feedforward components
│   ├── tokenizer.py     # BPE tokenizer (train / encode / decode)
│   ├── tracker.py       # training metrics & logging
│   ├── transformer.py   # Transformer language model
│   └── utils.py         # shared utilities
├── tests/           # pytest tests and fixtures
├── wandb/           # Weights & Biases logs and metadata
└── ...

Setup

Environment

Prepare the environment with uv as described in assignment1 README – Environment.

Data

Download the pretraining datasets as described in assignment1 README – Download Data.

Quick Start

Unit tests

Run unit tests for the components that have been implemented:

  1. cd assignment1-basics

  2. Run unit tests (they call adapter functions in assignment1-basics/tests/adapters.py; an illustrative adapter is sketched after this list):

    • Run all 48 tests: uv run pytest
    • Run a specific component test, e.g. uv run pytest -k test_transformer_lm
    [Screenshots: unit test results]
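Each adapter is a thin shim that routes a test's inputs into your own implementation under src/. A hypothetical example (the function name and import path below are illustrative, not the graders' exact signature):

    # tests/adapters.py (illustrative shape only)
    from src.softmax import softmax

    def run_softmax(in_features, dim):
        # Delegate straight to the from-scratch implementation.
        return softmax(in_features, dim)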

Run

  1. Train a BPE tokenizer on the TinyStories dataset.

    uv run python script/train_bpe_tokenizer.py
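    Under the hood, BPE training repeatedly merges the most frequent adjacent token pair and assigns it a fresh id. A minimal sketch of that core loop (the real trainer also handles pre-tokenization, special tokens, tie-breaking, and parallelism):

      from collections import Counter

      def train_bpe(words, num_merges, first_new_id=256):
          # words: corpus as lists of initial byte ids.
          words = [list(w) for w in words]
          merges = {}
          for new_id in range(first_new_id, first_new_id + num_merges):
              pairs = Counter()
              for w in words:
                  pairs.update(zip(w, w[1:]))
              if not pairs:
                  break
              best = max(pairs, key=pairs.get)   # most frequent adjacent pair
              merges[best] = new_id
              for w in words:                    # apply the merge in place
                  i = 0
                  while i < len(w) - 1:
                      if (w[i], w[i + 1]) == best:
                          w[i:i + 2] = [new_id]
                      else:
                          i += 1
          return merges

      # e.g. train_bpe([list(b"low low lower")], num_merges=5)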

  2. Run your trained tokenizer on the dataset to convert it into a sequence of integer IDs.

    uv run python script/tokenize_and_bin.py
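    A plausible shape for this step (the script's exact interface may differ): encode the corpus with your trained tokenizer and dump the ids as a flat binary file. uint16 covers vocabularies up to 65,535 tokens and halves disk usage versus int32:

      import numpy as np

      def tokenize_to_bin(text_path, bin_path, encode):
          # encode: your tokenizer's str -> list[int] method.
          with open(text_path, encoding="utf-8") as f:
              ids = encode(f.read())
          np.array(ids, dtype=np.uint16).tofile(bin_path)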

  3. Train a Transformer LM on the TinyStories dataset.

    • Use the tokenized TinyStories dataset to train the model and evaluate perplexity

      uv run python script/train.py

    • Sweep the learning rate over [1e-1, 5e-2, 2e-2, 1e-2, 5e-3, 2e-3, 1e-3]

      uv run python script/learning_rate_experiment.py

      Learning rate experiment result: the optimal learning rate was 1e-2 (learning curves omitted).

    • Vary the batch size over [8, 16, 32, 64] (upper-bounded by GPU memory)

      uv run python script/batch_size_experiment.py

      Batch size experiment result: the optimal batch size was 64 (learning curves omitted).
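    Training samples random fixed-length windows out of the flat token file from step 2. A minimal sketch of that batch loader (the memmapped file name is illustrative):

      import numpy as np
      import torch

      def get_batch(tokens, batch_size, context_length, device="cpu"):
          # tokens: 1-D uint16 array, e.g.
          # np.memmap("data/tinystories_train.bin", dtype=np.uint16, mode="r")
          starts = np.random.randint(0, len(tokens) - context_length,
                                     size=batch_size)
          x = torch.stack([torch.from_numpy(
              tokens[s:s + context_length].astype(np.int64)) for s in starts])
          y = torch.stack([torch.from_numpy(
              tokens[s + 1:s + 1 + context_length].astype(np.int64))
              for s in starts])
          return x.to(device), y.to(device)  # inputs, next-token targets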

  4. Generate samples and evaluate perplexity using the trained Transformer LM.

    • Generate and decode

      uv run python script/generate_and_decode.py

      Generate and decode example:

      Input

      Once upon a time
      

      Output

      Once upon a time, there was a small dog named Spot. Spot loved to play with his toy car. He would run around the park to play with. They all day, and the sun went on a tree. They liked to play with the toys with a ball.
      One day, Tom and Sam were playing with the ball together. They played together and had lots of fun. At the end of the day, Tim and his friends were very happy. They played together all day, laughing and having fun.
      <|endoftext|>
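    Decoding is a loop of forward pass, temperature-scaled softmax, and sampling until <|endoftext|>. A minimal sketch (the model call and eot_id plumbing are illustrative):

      import torch

      @torch.no_grad()
      def generate(model, ids, max_new_tokens, temperature=1.0, eot_id=None):
          # ids: 1-D LongTensor holding the prompt's token ids.
          for _ in range(max_new_tokens):
              logits = model(ids.unsqueeze(0))[0, -1]       # next-token logits
              probs = torch.softmax(logits / temperature, dim=-1)
              next_id = torch.multinomial(probs, num_samples=1)
              ids = torch.cat([ids, next_id])
              if eot_id is not None and next_id.item() == eot_id:
                  break  # stop at <|endoftext|>
          return ids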
      
  5. Train models on OpenWebText and submit your attained perplexities to a leaderboard.

    • Train the tokenizer

      uv run python script/train_bpe_tokenizer.py --config owt

    • Tokenize the dataset

      uv run python script/tokenize_and_bin.py --config owt

    • Pretrain the model

      uv run python script/train.py --config owt

    Training result: see the wandb report at https://api.wandb.ai/links/viko/axveizyy (learning curves omitted).

    Official leaderboard: Assignment 1 (Basics) Leaderboard

Other

  1. Example: tuning the learning rate with plain SGD

    uv run python script/learning_rate_tuning_sgd.py
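    The idea (assuming the script mirrors the handout's toy problem) is to run SGD at several learning rates on a simple quadratic and watch which ones converge and which diverge. A hedged re-creation:

      import torch

      for lr in [1e1, 1e2, 1e3]:
          weights = torch.nn.Parameter(5 * torch.randn(10, 10))
          opt = torch.optim.SGD([weights], lr=lr)
          for _ in range(10):
              opt.zero_grad()
              loss = (weights ** 2).mean()   # toy quadratic objective
              loss.backward()
              opt.step()
          print(f"lr={lr:g}: final loss {loss.item():.3g}")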

Ablation results

  1. Layer normalization

    layer_norm_ablation: with vs. without RMSNorm (learning curves omitted)

    pre_norm_ablation: pre-norm vs. post-norm (learning curves omitted)

  2. Position embeddings

    no_pos_emb: RoPE vs. NoPE (learning curves omitted)

  3. SwiGLU vs. SiLU

    swiglu_ablation: SwiGLU vs. SiLU (learning curves omitted)
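For reference, the ablated variants differ only in where (or whether) normalization is applied and which feedforward activation is used. A schematic sketch of the building blocks (attn and ffn stand in for the sublayers; weights are passed in for brevity):

    import torch

    def rmsnorm(x, weight, eps=1e-5):
        # RMSNorm rescales by the root mean square; unlike LayerNorm it
        # subtracts no mean and has no bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        return weight * (x / rms)

    def block_prenorm(x, attn, ffn, w1, w2):
        # Pre-norm (the baseline): normalize each sublayer's input.
        x = x + attn(rmsnorm(x, w1))
        return x + ffn(rmsnorm(x, w2))

    def block_postnorm(x, attn, ffn, w1, w2):
        # Post-norm (the ablation): normalize after the residual add.
        x = rmsnorm(x + attn(x), w1)
        return rmsnorm(x + ffn(x), w2)

    def swiglu(x, w1, w2, w3):
        # SwiGLU FFN: a SiLU-gated linear unit, vs. a plain SiLU MLP.
        return (torch.nn.functional.silu(x @ w1) * (x @ w3)) @ w2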

License

This project is licensed under the MIT License - see the LICENSE file for details.
