# myGPT
A minimal GPT (Generative Pre-trained Transformer) implementation in PyTorch, inspired by Andrej Karpathy's nanoGPT. This project provides a clean, readable implementation of the GPT architecture with support for training from scratch or fine-tuning from pretrained GPT-2 weights.
## Features
- **Clean GPT Architecture**: Decoder-only transformer with causal self-attention
- **Flash Attention**: Automatic use of PyTorch 2.0+ scaled dot product attention for faster training
- **Multiple Model Sizes**: Support for a small (~10M params) configuration and the standard GPT-2 (~124M params) configuration
- **Pretrained Weights**: Load pretrained GPT-2 weights from Hugging Face (`gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`)
- **Distributed Training**: Multi-GPU training support via PyTorch DDP (DistributedDataParallel)
- **Mixed Precision**: BFloat16 automatic mixed precision for faster training
- **Gradient Accumulation**: Simulate larger batch sizes on limited hardware
- **Cosine LR Schedule**: Learning rate warmup with cosine decay
- **Text Generation**: Temperature and top-k sampling for text generation
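The gradient-accumulation and mixed-precision features above can be sketched together in a toy standalone loop (this is an illustration, not the project's actual training code; the BF16 autocast line is commented out so the toy runs on CPU):

```python
import torch
import torch.nn as nn

# Toy gradient-accumulation loop: four micro-batches of 16 are accumulated
# before one optimizer step, simulating an effective batch size of 64.
torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.2)
grad_accum_steps = 4

before = model.weight.detach().clone()
optimizer.zero_grad()
for micro in range(grad_accum_steps):
    x, y = torch.randn(16, 4), torch.randn(16, 1)
    # On a GPU run you would wrap the forward pass in BF16 autocast:
    # with torch.autocast(device_type="cuda", dtype=torch.bfloat16): ...
    loss = nn.functional.mse_loss(model(x), y)
    (loss / grad_accum_steps).backward()  # scale so accumulated grads average out
optimizer.step()  # one step over the whole effective batch
```

Dividing each micro-batch loss by `grad_accum_steps` makes the accumulated gradient equal the average over all micro-batches, matching what one large batch would produce.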
## Project Structure
```
myGPT/
├── model.py        # GPT model architecture
├── train.py        # Training script (full GPT-2 124M)
├── train_small.py  # Training script (smaller ~10M model)
├── data_loader.py  # Dataset and data loading utilities
├── input.txt      # Training data (text file)
└── README.md
```
## Model Architecture
The model follows the standard GPT-2 architecture:
| Component | Description |
|-----------|-------------|
| Token Embedding | Maps vocab indices to dense vectors |
| Position Embedding | Learnable positional encodings |
| Transformer Blocks | N layers of attention + MLP |
| Causal Self-Attention | Multi-head attention with causal masking |
| MLP | Feed-forward network with GELU activation |
| Layer Norm | Pre-norm architecture |
| LM Head | Projects to vocabulary (weight-tied with token embedding) |
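The pre-norm block described in the table can be sketched as follows. Class and layer names here are assumptions about `model.py`, not its actual API; the Flash Attention path uses PyTorch's `F.scaled_dot_product_attention` with `is_causal=True`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with causal masking (illustrative sketch)."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for multi-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # PyTorch 2.0+ dispatches this to Flash Attention when available
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-norm transformer block: x + attn(ln(x)), then x + mlp(ln(x))."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # 4x expansion, as in GPT-2
            nn.GELU(approximate="tanh"),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual around attention
        x = x + self.mlp(self.ln_2(x))   # residual around MLP
        return x
```

Note the pre-norm placement: LayerNorm is applied *before* attention and MLP, inside the residual branch, which is what the Layer Norm row in the table refers to.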
### Model Configurations
| Config | Layers | Heads | Embed Dim | Block Size | Parameters |
|--------|--------|-------|-----------|------------|------------|
| `GPTConfigSmall` | 6 | 6 | 384 | 256 | ~10M |
| `GPTConfig` | 12 | 12 | 768 | 1024 | ~124M |
## Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/myGPT.git
cd myGPT
# Install dependencies
pip install torch tiktoken transformers numpy
```
### Requirements
- Python 3.8+
- PyTorch 2.0+ (recommended for Flash Attention)
- tiktoken
- transformers (for loading pretrained weights)
- numpy
## Usage
### Training from Scratch
1. Prepare your training data as a text file (e.g., `input.txt`)
2. Train a small model:
```bash
python train_small.py
```
3. Train a larger model:
```bash
python train.py
```
### Multi-GPU Training (DDP)
```bash
torchrun --standalone --nproc_per_node=4 train_small.py
```
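`torchrun` sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in each process's environment; the training scripts detect these to decide whether to run in DDP mode. A sketch of that detection (the helper name is illustrative; the scripts do the equivalent inline):

```python
import os
import torch
import torch.distributed as dist

def ddp_setup():
    """Detect a torchrun launch from its env vars and pick a device."""
    is_ddp = int(os.environ.get("RANK", -1)) != -1  # torchrun always sets RANK
    if is_ddp:
        rank = int(os.environ["RANK"])
        local_rank = int(os.environ["LOCAL_RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        dist.init_process_group(backend="nccl")     # one process per GPU
        device = f"cuda:{local_rank}"
        torch.cuda.set_device(device)
    else:
        rank, local_rank, world_size = 0, 0, 1      # plain single-process run
        device = "cuda" if torch.cuda.is_available() else "cpu"
    return is_ddp, rank, world_size, device
```

With `--nproc_per_node=4`, each of the four processes sees a different `LOCAL_RANK` (0–3) and therefore binds to a different GPU.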
### Loading Pretrained GPT-2
```python
from model import GPT
# Load pretrained GPT-2 weights
model = GPT.from_pretrained('gpt2') # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
```
### Text Generation
```python
import torch
import tiktoken
from model import GPT, GPTConfig
# Load model
model = GPT(GPTConfig())
model.load_state_dict(torch.load('checkpoint.pt', map_location='cpu'))
model.eval()
# Tokenize prompt
enc = tiktoken.get_encoding("gpt2")
prompt = "Once upon a time"
idx = torch.tensor([enc.encode(prompt)], dtype=torch.long)
# Generate
with torch.no_grad():
output = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=40)
print(enc.decode(output[0].tolist()))
```
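Under the hood, temperature and top-k sampling can be sketched like this, as a standalone function. This assumes the model returns logits of shape `(B, T, vocab_size)`; the real method lives in `model.py` and its signature may differ:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None, block_size=1024):
    """Autoregressive sampling: crop to the context window, scale logits by
    temperature, optionally keep only the top-k logits, then sample."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # never exceed the context window
        logits = model(idx_cond)                     # assumed shape: (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature      # last position; <1.0 sharpens
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")  # mask below the k-th logit
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # sample one token
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```

Lower temperatures concentrate probability mass on the likeliest tokens, while `top_k` hard-limits sampling to the k most probable tokens at each step.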
## Training Configuration
Key hyperparameters in `train_small.py`:
| Parameter | Value | Description |
|-----------|-------|-------------|
| Batch Size | 16 | Micro batch size |
| Block Size | 256 | Context length (sequence length) |
| Learning Rate | 3e-4 | Peak learning rate |
| Warmup Steps | 100 | Linear warmup steps |
| Max Steps | 2000 | Total training steps |
| Weight Decay | 0.2 | AdamW weight decay |
| Gradient Accumulation | Auto | Based on total batch size |
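The warmup-plus-cosine schedule with the constants from this table can be sketched as a function of the step number (the minimum-LR ratio here is an assumption; check `train_small.py` for the actual value):

```python
import math

def get_lr(step, max_lr=3e-4, warmup_steps=100, max_steps=2000, min_lr_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    min_lr = max_lr * min_lr_ratio
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps      # linear warmup
    if step >= max_steps:
        return min_lr                                  # floor after training ends
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0 over decay
    return min_lr + coeff * (max_lr - min_lr)
```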
## Data Format
The `TextDataset` class in `data_loader.py` expects a plain text file. It:
- Tokenizes text using GPT-2's BPE tokenizer (tiktoken)
- Automatically splits into 90% train / 10% validation
- Returns (input, target) pairs where target is input shifted by one token
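The behavior above can be sketched as follows. Class and method names are assumptions about `data_loader.py`, not its actual interface; the sketch separates tokenization from splitting so it also works on pre-tokenized input:

```python
import torch

class TextDataset:
    """Tokenized text split 90/10 into train/val, served as next-token pairs."""

    def __init__(self, tokens, block_size: int = 256):
        data = torch.tensor(tokens, dtype=torch.long)
        n = int(0.9 * len(data))                        # 90% train / 10% validation
        self.train_data, self.val_data = data[:n], data[n:]
        self.block_size = block_size

    @classmethod
    def from_file(cls, path: str, block_size: int = 256):
        import tiktoken                                  # GPT-2 BPE tokenizer
        enc = tiktoken.get_encoding("gpt2")
        with open(path, "r", encoding="utf-8") as f:
            return cls(enc.encode(f.read()), block_size)

    def get_batch(self, split: str, batch_size: int):
        data = self.train_data if split == "train" else self.val_data
        ix = torch.randint(len(data) - self.block_size, (batch_size,))
        x = torch.stack([data[i : i + self.block_size] for i in ix])
        y = torch.stack([data[i + 1 : i + 1 + self.block_size] for i in ix])  # shifted by one
        return x, y
```

The target `y` is the same sequence as `x` offset by one position, so the model learns to predict each next token.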
## Logging
Training logs are saved to `log/log.txt` and include:
- Training and validation loss
- Learning rate
- Tokens processed per second
- Generated text samples at evaluation intervals
## License
MIT License
## Acknowledgments
- [Andrej Karpathy's nanoGPT](https://github.com/karpathy/nanoGPT)
- [OpenAI GPT-2](https://github.com/openai/gpt-2)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)