# myGPT

A minimal GPT (Generative Pre-trained Transformer) implementation in PyTorch, inspired by Andrej Karpathy's nanoGPT. This project provides a clean, readable implementation of the GPT architecture with support for training from scratch or fine-tuning from pretrained GPT-2 weights.

## Features

- **Clean GPT Architecture**: Decoder-only transformer with causal self-attention
- **Flash Attention**: Automatic use of PyTorch 2.0+ scaled dot product attention for faster training
- **Multiple Model Sizes**: A small ~10M-parameter configuration and the standard GPT-2 ~124M-parameter configuration
- **Pretrained Weights**: Load pretrained GPT-2 weights from Hugging Face (`gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`)
- **Distributed Training**: Multi-GPU training via PyTorch DDP (DistributedDataParallel)
- **Mixed Precision**: BFloat16 automatic mixed precision for faster training
- **Gradient Accumulation**: Simulate larger batch sizes on limited hardware
- **Cosine LR Schedule**: Learning rate warmup with cosine decay
- **Text Generation**: Temperature and top-k sampling for text generation

## Project Structure

```
myGPT/
├── model.py        # GPT model architecture
├── train.py        # Training script (full GPT-2 124M)
├── train_small.py  # Training script (smaller ~10M model)
├── data_loader.py  # Dataset and data loading utilities
├── input.txt       # Training data (text file)
└── README.md
```

## Model Architecture

The model follows the standard GPT-2 architecture:

| Component | Description |
|-----------|-------------|
| Token Embedding | Maps vocab indices to dense vectors |
| Position Embedding | Learnable positional encodings |
| Transformer Blocks | N layers of attention + MLP |
| Causal Self-Attention | Multi-head attention with causal masking |
| MLP | Feed-forward network with GELU activation |
| Layer Norm | Pre-norm architecture |
| LM Head | Projects to vocabulary (weight-tied with token embedding) |

### Model Configurations

| Config | Layers | Heads | Embed Dim | Block Size | Parameters |
|--------|--------|-------|-----------|------------|------------|
| `GPTConfigSmall` | 6 | 6 | 384 | 256 | ~10M |
| `GPTConfig` | 12 | 12 | 768 | 1024 | ~124M |

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/myGPT.git
cd myGPT

# Install dependencies
pip install torch tiktoken transformers numpy
```

### Requirements

- Python 3.8+
- PyTorch 2.0+ (recommended for Flash Attention)
- tiktoken
- transformers (for loading pretrained weights)
- numpy

## Usage

### Training from Scratch

1. Prepare your training data as a text file (e.g., `input.txt`)
2. Train a small model:
   ```bash
   python train_small.py
   ```
3. Train a larger model:
   ```bash
   python train.py
   ```

### Multi-GPU Training (DDP)

```bash
torchrun --standalone --nproc_per_node=4 train_small.py
```

### Loading Pretrained GPT-2

```python
from model import GPT

# Load pretrained GPT-2 weights
model = GPT.from_pretrained('gpt2')  # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
```

### Text Generation

```python
import torch
import tiktoken
from model import GPT, GPTConfig

# Load model
model = GPT(GPTConfig())
model.load_state_dict(torch.load('checkpoint.pt'))
model.eval()

# Tokenize prompt
enc = tiktoken.get_encoding("gpt2")
prompt = "Once upon a time"
idx = torch.tensor([enc.encode(prompt)], dtype=torch.long)

# Generate
with torch.no_grad():
    output = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=40)
print(enc.decode(output[0].tolist()))
```

## Training Configuration

Key hyperparameters in `train_small.py`:

| Parameter | Value | Description |
|-----------|-------|-------------|
| Batch Size | 16 | Micro batch size |
| Block Size | 256 | Context length (sequence length) |
| Learning Rate | 3e-4 | Peak learning rate |
| Warmup Steps | 100 | Linear warmup steps |
| Max Steps | 2000 | Total training steps |
| Weight Decay | 0.2 | AdamW weight decay |
| Gradient Accumulation | Auto | Derived from the total batch size |

## Data Format

The `TextDataset` class in `data_loader.py` expects a plain text file. It:

- Tokenizes text using GPT-2's BPE tokenizer (tiktoken)
- Automatically splits the data into 90% train / 10% validation
- Returns (input, target) pairs where the target is the input shifted by one token

## Logging

Training logs are saved to `log/log.txt` and include:

- Training and validation loss
- Learning rate
- Tokens processed per second
- Generated text samples at evaluation intervals

## License

MIT License

## Acknowledgments

- [Andrej Karpathy's nanoGPT](https://github.com/karpathy/nanoGPT)
- [OpenAI GPT-2](https://github.com/openai/gpt-2)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
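As a supplement, the warmup-plus-cosine schedule listed under Training Configuration can be sketched as below. This is a minimal illustration using the values from that table (peak LR 3e-4, 100 warmup steps, 2000 max steps); the minimum-LR floor (`min_lr_ratio`) is an assumption, and the exact function in `train_small.py` may differ.

```python
import math

def get_lr(step, max_lr=3e-4, warmup_steps=100, max_steps=2000, min_lr_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to max_lr * min_lr_ratio.

    min_lr_ratio is a hypothetical floor, not taken from the repo.
    """
    min_lr = max_lr * min_lr_ratio
    if step < warmup_steps:
        # Linear warmup: LR grows from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # Past the schedule: hold at the floor.
        return min_lr
    # Cosine decay: coeff falls smoothly from 1 to 0 over the decay window.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

In a training loop, the result would typically be written into the optimizer each step, e.g. `for g in optimizer.param_groups: g['lr'] = get_lr(step)`.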
