MLLm is an educational repository dedicated to demystifying the architecture and training of Large Language Models (LLMs). This project guides you through the complete journey of building a GPT-style transformer from the ground up, fine-tuning it for instruction-following, and adapting it for specialized classification tasks.
This repository is organized into three progressive modules. We recommend following them in this order:
- Build from Scratch: Understand the core architecture. Implement tokenization, embeddings, multi-head attention, and transformer blocks.
- Instruction Fine-tuning: Learn how to take a pre-trained model and teach it to follow user instructions using the Alpaca format.
- Spam Classification: Master transfer learning. Learn how to freeze model layers and adapt a language model for binary classification.
The project implements a Decoder-only Transformer (GPT-style) with the following components:
- Tokenization: Byte-Pair Encoding (BPE) using OpenAI's `tiktoken`.
- Embeddings: Combined token and positional embeddings.
- Attention Mechanism: Scaled dot-product self-attention with causal masking (look-ahead mask).
- Multi-Head Attention: Parallel attention heads for capturing diverse context.
- Transformer Block: Integrated Layer Normalization, GELU activations, and Residual (Skip) connections.
- Output Head: Linear layer projecting to vocabulary size (for generation) or class size (for classification).
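The attention components above can be combined into a single module. The sketch below is illustrative, not the repository's exact implementation; the hyperparameter names (`d_model`, `num_heads`, `max_len`) are assumptions.

```python
# Minimal sketch of scaled dot-product multi-head self-attention with a
# causal (look-ahead) mask. Illustrative only; see multi_head_attention.py
# and causal_attention.py for the repository's actual code.
import math
import torch
import torch.nn as nn

class CausalMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, max_len: int = 1024):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Look-ahead mask: each position may attend only to itself and the past.
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.proj(out)
```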
- Python: 3.12 or higher
- GPU: CUDA-compatible GPU recommended for training (but not required for inference)
- Package Manager: uv (recommended) or pip
```bash
# Clone the repository
git clone <repository-url>
cd MLLm

# Using uv (recommended)
uv sync

# OR using standard pip
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .
```

Located in `/build_llm_from_scratch`, this module contains the foundational code for the transformer architecture.
- Training: Train a small GPT model on Project Gutenberg books.
- Inference: Generate text using greedy or top-k sampling.
- Key Files: `gpt.py`, `transformer.py`, `multi_head_attention.py`, `causal_attention.py`
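The top-k sampling mentioned above can be sketched as follows. This is a hedged example, not the repository's `inference.py`; `logits` is assumed to be the model's output vector for the last position.

```python
# Illustrative top-k sampling for next-token generation: keep only the
# k highest-scoring tokens, renormalize, and sample among them.
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    topk_vals, topk_idx = torch.topk(logits, k)
    # Softmax over the surviving scores (temperature flattens or sharpens it).
    probs = torch.softmax(topk_vals / temperature, dim=-1)
    # Sample within the top-k set, then map back to a vocabulary id.
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])
```

With `k=1` this degenerates to greedy decoding, since only the single best token can be drawn.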
```bash
cd build_llm_from_scratch
python main.py       # Start training
python inference.py  # Interactive generation
```

Located in `/instruction_finetuning`, this module adapts a pre-trained GPT-2 (124M) model to follow instructions.
- Dataset: Uses the Alpaca-style instruction dataset (Instruction, Input, Response).
- Technique: Supervised Fine-Tuning (SFT) on the full model.
- Key Files: `dataset.py`, `utils.py` (formatting & collate functions)
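The Alpaca format mentioned above can be sketched as a prompt template. This is an assumed, illustrative formatter; the exact template used in `utils.py` may differ.

```python
# Illustrative Alpaca-style prompt formatter: the dataset's Instruction,
# optional Input, and Response fields are laid out as labeled sections.
def format_alpaca(entry: dict) -> str:
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{entry['instruction']}"
    )
    # The Input section is included only when the example provides one.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt + "\n\n### Response:\n"
```

During supervised fine-tuning, the model's target is the response text appended after the `### Response:` marker.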
```bash
cd instruction_finetuning
python download_instruction_dataset.py
python main.py  # Start fine-tuning
```

Located in `/llm_spam_classification`, this module demonstrates how to repurpose an LLM for specialized tasks.
- Transfer Learning: Loads pre-trained GPT-2 weights and replaces the language modeling head with a classification head.
- Selective Unfreezing: Freezes earlier layers to preserve general language knowledge while training only the final transformer block and head.
- Key Files: `finetune.py`, `utils.py` (accuracy & loss monitoring)
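The selective-unfreezing idea above can be sketched in a few lines. The attribute names (`trf_blocks`, `final_norm`, `out_head`) are assumptions for illustration and may not match the repository's model class.

```python
# Sketch of head replacement plus selective unfreezing for classification.
# Assumes a GPT-style model exposing `trf_blocks` (a list of transformer
# blocks), `final_norm`, and `out_head`; names are illustrative.
import torch.nn as nn

def freeze_for_classification(model: nn.Module, num_classes: int, emb_dim: int):
    # Freeze everything first to preserve general language knowledge.
    for p in model.parameters():
        p.requires_grad = False
    # Replace the LM head with a small, trainable classification head.
    model.out_head = nn.Linear(emb_dim, num_classes)
    # Unfreeze only the final transformer block and the final LayerNorm.
    for p in model.trf_blocks[-1].parameters():
        p.requires_grad = True
    for p in model.final_norm.parameters():
        p.requires_grad = True
```

Training then updates only the unfrozen parameters, which is both faster and less prone to catastrophic forgetting than full fine-tuning.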
```bash
cd llm_spam_classification
python download_sms_data.py
python main.py  # Start classification training
```

- Deep Learning: PyTorch, TensorFlow (utilities)
- NLP: Tiktoken, Scikit-learn
- Visualization: Matplotlib, Plotly, TensorBoard
- Data Handling: Pandas, NumPy
- Productivity: TQDM (progress bars), Jupyter
- Causal Masking: Why $e^{-\infty} = 0$ is the key to generative models.
- Residual Connections: Solving the vanishing gradient problem in deep networks.
- Layer Normalization: Stabilizing training and preventing internal covariate shift.
- Transfer Learning: The power of selective unfreezing and head replacement.
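The causal-masking point above can be verified in two lines: softmax exponentiates its inputs, so a score of $-\infty$ becomes $e^{-\infty} = 0$ and the masked (future) position receives exactly zero attention weight.

```python
# Demonstrates why masking attention scores with -inf before softmax
# zeroes out future positions: e^(-inf) = 0.
import torch

scores = torch.tensor([1.0, 2.0, float("-inf")])  # last position is "future"
weights = torch.softmax(scores, dim=-1)
# weights[2] is exactly 0, and the remaining weights still sum to 1.
```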
Contributions are welcome! Whether it's adding a new model architecture (like RoPE embeddings), improving the training loop, or adding more datasets, feel free to open a PR.
This project is licensed under the MIT License - see the LICENSE file for details.