# myGPT

A minimal GPT (Generative Pre-trained Transformer) implementation in PyTorch, inspired by Andrej Karpathy's nanoGPT. This project provides a clean, readable implementation of the GPT architecture with support for training from scratch or fine-tuning from pretrained GPT-2 weights.

## Features

- **Clean GPT Architecture**: Decoder-only transformer with causal self-attention
- **Flash Attention**: Automatic use of PyTorch 2.0+ scaled dot product attention for faster training
- **Multiple Model Sizes**: A small ~10M-parameter configuration and the standard GPT-2 ~124M-parameter configuration
- **Pretrained Weights**: Load pretrained GPT-2 weights from Hugging Face (`gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`)
- **Distributed Training**: Multi-GPU training via PyTorch DDP (DistributedDataParallel)
- **Mixed Precision**: BFloat16 automatic mixed precision for faster training
- **Gradient Accumulation**: Simulate larger batch sizes on limited hardware
- **Cosine LR Schedule**: Learning rate warmup with cosine decay
- **Text Generation**: Temperature and top-k sampling for text generation

## Project Structure

```
myGPT/
├── model.py        # GPT model architecture
├── train.py        # Training script (full GPT-2 124M)
├── train_small.py  # Training script (smaller ~10M model)
├── data_loader.py  # Dataset and data loading utilities
├── input.txt       # Training data (text file)
└── README.md
```

## Model Architecture

The model follows the standard GPT-2 architecture:

| Component | Description |
|-----------|-------------|
| Token Embedding | Maps vocab indices to dense vectors |
| Position Embedding | Learnable positional encodings |
| Transformer Blocks | N layers of attention + MLP |
| Causal Self-Attention | Multi-head attention with causal masking |
| MLP | Feed-forward network with GELU activation |
| Layer Norm | Pre-norm architecture |
| LM Head | Projects to vocabulary (weight-tied with token embedding) |

### Model Configurations

| Config | Layers | Heads | Embed Dim | Block Size | Parameters |
|--------|--------|-------|-----------|------------|------------|
| `GPTConfigSmall` | 6 | 6 | 384 | 256 | ~10M |
| `GPTConfig` | 12 | 12 | 768 | 1024 | ~124M |

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/myGPT.git
cd myGPT

# Install dependencies
pip install torch tiktoken transformers numpy
```

### Requirements

- Python 3.8+
- PyTorch 2.0+ (recommended for Flash Attention)
- tiktoken
- transformers (for loading pretrained weights)
- numpy

## Usage

### Training from Scratch

1. Prepare your training data as a text file (e.g., `input.txt`)
2. Train a small model:
   ```bash
   python train_small.py
   ```
3. Train a larger model:
   ```bash
   python train.py
   ```

### Multi-GPU Training (DDP)

```bash
torchrun --standalone --nproc_per_node=4 train_small.py
```

### Loading Pretrained GPT-2

```python
from model import GPT

# Load pretrained GPT-2 weights
model = GPT.from_pretrained('gpt2')  # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
```

### Text Generation

```python
import torch
import tiktoken
from model import GPT, GPTConfig

# Load model
model = GPT(GPTConfig())
model.load_state_dict(torch.load('checkpoint.pt'))
model.eval()

# Tokenize prompt
enc = tiktoken.get_encoding("gpt2")
prompt = "Once upon a time"
idx = torch.tensor([enc.encode(prompt)], dtype=torch.long)

# Generate
with torch.no_grad():
    output = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=40)
print(enc.decode(output[0].tolist()))
```

## Training Configuration

Key hyperparameters in `train_small.py`:

| Parameter | Value | Description |
|-----------|-------|-------------|
| Batch Size | 16 | Micro batch size |
| Block Size | 256 | Context length (sequence length) |
| Learning Rate | 3e-4 | Peak learning rate |
| Warmup Steps | 100 | Linear warmup steps |
| Max Steps | 2000 | Total training steps |
| Weight Decay | 0.2 | AdamW weight decay |
| Gradient Accumulation | Auto | Derived from the total batch size |

## Data Format

The `TextDataset` class in `data_loader.py` expects a plain text file. It:

- Tokenizes text using GPT-2's BPE tokenizer (tiktoken)
- Automatically splits the data into 90% train / 10% validation
- Returns (input, target) pairs where the target is the input shifted by one token

## Logging

Training logs are saved to `log/log.txt` and include:

- Training and validation loss
- Learning rate
- Tokens processed per second
- Generated text samples at evaluation intervals

## License

MIT License

## Acknowledgments

- [Andrej Karpathy's nanoGPT](https://github.com/karpathy/nanoGPT)
- [OpenAI GPT-2](https://github.com/openai/gpt-2)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
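As a supplement, the warmup-plus-cosine schedule listed under Training Configuration can be sketched as below. This is a minimal illustration using the values from that table (peak LR 3e-4, 100 warmup steps, 2000 max steps); the minimum-LR floor (`min_lr_ratio`) is an assumption, and the exact function in `train_small.py` may differ.

```python
import math

def get_lr(step, max_lr=3e-4, warmup_steps=100, max_steps=2000, min_lr_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to max_lr * min_lr_ratio.

    min_lr_ratio is a hypothetical floor, not taken from the repo.
    """
    min_lr = max_lr * min_lr_ratio
    if step < warmup_steps:
        # Linear warmup: LR grows from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # Past the schedule: hold at the floor.
        return min_lr
    # Cosine decay: coeff falls smoothly from 1 to 0 over the decay window.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

In a training loop, the result would typically be written into the optimizer each step, e.g. `for g in optimizer.param_groups: g['lr'] = get_lr(step)`.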
