This project explores the implementation of language models from the ground up, starting with a simple bigram model and progressing to a complete decoder-only transformer. The goal is to deeply understand the theory behind LLMs and gain practical experience in implementing them.
- Understand the theory: Master the fundamental concepts of language models
- Practical implementation: Code architectures from scratch using PyTorch
- Educational progression: Move from simple (bigram) to complex (transformer)
- Detailed documentation: Explain every component and concept
For training, we use the Tiny Shakespeare dataset (1M character-level tokens), a ~1 MB collection of several Shakespeare plays. It provides a rich but compact corpus of classical English text, making it ideal for experimenting with character-level language models. The goal is to train a model to generate new passages in a Shakespeare-like style.
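To make the character-level setup concrete, here is a minimal sketch of how such a vocabulary and integer encoding can be built from the raw text. The file name input.txt and the 90/10 train/validation split are assumptions for illustration, not necessarily what the scripts in this repository do.

```python
# Minimal sketch of character-level tokenization on Tiny Shakespeare.
# Assumes the dataset has been saved to input.txt; the file name and
# the train/val split are illustrative, not taken from this repo.
import torch

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))          # unique characters = vocabulary
vocab_size = len(chars)            # roughly 65 for Tiny Shakespeare

stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))           # 90% train / 10% validation (assumed split)
train_data, val_data = data[:n], data[n:]
```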
1. Bigram Model (models/bigram_model.py)
The starting point – A simple character-level model that predicts the next character based solely on the previous character.
Purpose:
Introduce the basics of language modeling and autoregressive generation in the simplest possible setup (a minimal code sketch follows the parameter list below).
Key concepts covered:
- Character-level tokenization
- Token embeddings
- Cross-entropy loss
- Text generation through sampling
Parameters:
- Embedding dimension: 32
- Context window: 8 characters
- ~10K parameters
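As a concrete illustration of these pieces, here is a hedged sketch of a bigram-style model: a token embedding of dimension 32 feeding a linear head, trained with cross-entropy and sampled autoregressively. Class and variable names are illustrative and may differ from models/bigram_model.py.

```python
# Minimal bigram language model sketch. The layer names and the n_embd=32
# default mirror the parameters listed above but are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, n_embd: int = 32):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)  # token embeddings
        self.lm_head = nn.Linear(n_embd, vocab_size)             # next-character logits

    def forward(self, idx, targets=None):
        # Each position predicts the next character from its own token only.
        logits = self.lm_head(self.token_embedding(idx))          # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, V = logits.shape
            loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens: int):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution over next char
            next_id = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_id], dim=1)       # append sampled character
        return idx
```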
2. Intermediate Transformer Block (models/single_head_attention.py, models/multiple_head_attention.py, models/single_block_transformer.py)
The bridge to modern transformers – Introduces the core components of transformer architectures in a simplified setting.
Key concepts covered:
- Single-Head Self-Attention: Learn how the model weighs the importance of each token relative to others (see the sketch after this list).
- Multi-Head Self-Attention: Capture different types of relationships simultaneously.
- Feed-Forward Networks: Enrich token representations independently.
- Residual Connections & Layer Normalization: Stabilize and improve training of deeper models.
- Modular Transformer Block: Combines attention, feed-forward, residuals, and normalization into one unit (a sketch appears after the parameter list below).
- Causal Masking: Ensures autoregressive generation.
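As a reference point for the first two bullets, here is a sketch of one causal self-attention head; the class name, shapes, and use of a registered lower-triangular buffer are illustrative and may not match models/single_head_attention.py exactly.

```python
# Sketch of a single causal self-attention head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    def __init__(self, n_embd: int, head_size: int, block_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may only attend to positions <= t.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)           # (B, T, head_size)
        att = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # scaled dot-product
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        att = F.softmax(att, dim=-1)
        return att @ v                                                # (B, T, head_size)
```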
Learning goals:
- Understand attention mechanisms and multi-head benefits.
- Experiment with a modular transformer before scaling up.
Parameters (CPU-optimized):
- Embedding dimension: 64
- Number of heads: 2–4
- Number of layers: 1–2
- Context window: 32 tokens
- ~100–200K parameters
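Putting the pieces together, the following sketch shows one pre-norm transformer block: multi-head causal self-attention, a feed-forward network, residual connections, and layer normalization. Names and layer sizes are assumptions for illustration rather than a copy of models/single_block_transformer.py.

```python
# Sketch of a pre-norm transformer block built from the components above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadCausalAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # queries, keys, values in one matmul
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # Reshape to (B, n_head, T, head_size) so every head attends in parallel.
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        out = F.softmax(att, dim=-1) @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

class TransformerBlock(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadCausalAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(                 # position-wise feed-forward network
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention (pre-norm)
        x = x + self.ffwd(self.ln2(x))   # residual around feed-forward
        return x
```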
3. Transformer Model (models/transformer_model.py)
The modern architecture – A GPT-style decoder-only transformer that generates text autoregressively using causal attention.
Key concepts implemented:
- Multi-Head Self-Attention
- Positional Embeddings
- Feed-Forward Networks
- Layer Normalization
- Residual Connections
- Causal Masking
Why decoder-only?
Unlike the original Transformer, which uses both encoder and decoder, GPT-style models rely solely on the decoder. This allows the model to generate text one token at a time while attending only to previous tokens.
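The decoding loop below illustrates this token-by-token generation. It assumes a model whose forward pass returns logits of shape (batch, time, vocab) and crops the running sequence to the context window; the function and argument names are illustrative.

```python
# Autoregressive decoding sketch: generate one token at a time, feeding the
# growing sequence back into the model (interface assumed, not from this repo).
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens: int, block_size: int):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits = model(idx_cond)                     # (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution for the next token
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)       # append and continue
    return idx
```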
Current Parameters (CPU-optimized):
- Embedding dimension: 128
- Number of heads: 4
- Number of layers: 3
- Context window: 64 tokens
- ~850K parameters
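Wired together with the parameters above, a decoder-only model might look like the sketch below. It reuses the TransformerBlock class from the earlier sketch and adds learned token and position embeddings, a final layer norm, and a language-modeling head; again, names are illustrative rather than copied from models/transformer_model.py.

```python
# Decoder-only GPT-style model sketch using the CPU-optimized parameters listed
# above (n_embd=128, n_head=4, n_layer=3, block_size=64). Assumes the
# TransformerBlock class from the earlier sketch is defined.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd=128, n_head=4, n_layer=3, block_size=64):
        super().__init__()
        self.block_size = block_size
        self.token_embedding = nn.Embedding(vocab_size, n_embd)      # what the token is
        self.position_embedding = nn.Embedding(block_size, n_embd)   # where it sits
        self.blocks = nn.Sequential(
            *[TransformerBlock(n_embd, n_head, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)                              # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.token_embedding(idx)                                    # (B, T, n_embd)
        pos = self.position_embedding(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = self.blocks(tok + pos)
        logits = self.lm_head(self.ln_f(x))                                # (B, T, vocab)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```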
This implementation was built following:
- Attention is all you need - The seminal transformer paper by Vaswani et al.
- Let's build GPT: from scratch, in code, spelled out. - Andrej Karpathy's video lecture
pip install torch
On my MacBook the interpreter is invoked as python3, but this depends on your setup (python or python3, ...).
- Bigram Model:
python3 models/bigram_model.py
- Intermediate models:
python3 models/single_head_attention.py
python3 models/multiple_head_attention.py
python3 models/single_block_transformer.py
- Transformer Model:
python3 models/transformer_model.py
Bigram model:
- Simple but limited: Only considers the immediate previous character
- Fast training: Converges quickly on CPU
- Baseline performance: Good for understanding fundamentals but very limited in generation quality
Transformer model:
- Context awareness: Can attend to all previous tokens in the sequence
- Better generation quality: More coherent and contextual text. The quality depends on hyperparameters and training duration; with reasonable training on a CPU, the output shows English-like structure but is still not real English.
- Scalable architecture: Foundation for larger models
- GPU Training: Use a GPU to train deeper models faster
- Hyperparameter Scaling (an illustrative configuration is sketched after this list):
- Increase embedding dimensions (384, 512, 768...)
- More attention heads (8, 12, 16...)
- Deeper networks (6, 12, 24+ layers)
- Larger context windows (256, 512, 1024+ tokens)
- Better Tokenization: Move from character-level to subword tokenization (BPE, SentencePiece)
- Regularization: Experiment with dropout rates, weight decay
- Encoder-Decoder Transformer: Implement the full transformer architecture for sequence-to-sequence tasks such as translation
- Attention Variants: Explore different attention mechanisms (sparse attention, local attention)
- Model Scaling: Experiment with larger architectures (GPT-2, GPT-3 scale)
- Fine-tuning Capabilities: Add support for task-specific fine-tuning
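Purely as an illustration of what a scaled-up run might look like, here is one hypothetical configuration drawn from the ranges listed above; none of these values are tested or recommended settings from this repository.

```python
# Hypothetical scaled-up configuration for GPU training (illustrative only).
gpu_config = dict(
    n_embd=384,        # embedding dimension (vs. 128 on CPU)
    n_head=8,          # attention heads (vs. 4)
    n_layer=6,         # transformer blocks (vs. 3)
    block_size=256,    # context window in tokens (vs. 64)
    dropout=0.2,       # regularization matters more at this scale
    device="cuda",     # assumes a CUDA-capable GPU is available
)
```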
Full Transformer (Original Paper):
Input → Encoder → Decoder → Output
Decoder-Only (This Implementation):
Input → Decoder → Output
The decoder-only architecture is simpler but just as powerful for language generation tasks, which is why it's used in models like GPT, PaLM, and LLaMA.
- Causal Self-Attention: Each token can only see previous tokens
- Multi-Head Attention: Multiple attention patterns learned in parallel
- Position Embeddings: Learned representations of token positions
- Layer Norm: Applied before (pre-norm) each sub-layer
- Residual Connections: Help gradient flow in deep networks