A complete implementation of a Transformer model built from scratch using PyTorch, including:
- Byte Pair Encoding (BPE) Tokenization
- Token Embeddings
- Positional Encoding
- Multi-Head Self-Attention
- Feed-Forward Networks
- Layer Normalization
- Complete Encoder-Decoder Architecture
```
small_transformer_model/
├── README.md
├── requirements.txt
├── config/
│   └── model_config.json
├── src/
│   ├── tokenizer/      # BPE tokenization implementation
│   ├── model/          # Transformer architecture components
│   ├── training/       # Training loop and dataset handling
│   └── utils/          # Helper functions and metrics
├── data/               # Training data
├── checkpoints/        # Saved model checkpoints
├── logs/               # Training logs
├── notebooks/          # Jupyter notebooks for demos
└── scripts/            # Training and inference scripts
```
Install the dependencies:

```bash
pip install -r requirements.txt
```

Place your training text data in the `data/raw/` directory, then build the processed dataset:

```bash
python scripts/prepare_data.py --input data/raw/corpus.txt --output data/processed/
```

Train the model:

```bash
python scripts/train.py --config config/model_config.json
```

Run inference with a trained checkpoint:

```bash
python scripts/inference.py --model checkpoints/best_model.pt --text "Your input text"
```

Edit `config/model_config.json` to customize (an example file is sketched after this list):
- Vocabulary size
- Embedding dimensions
- Number of attention heads
- Number of encoder/decoder layers
- Feed-forward dimensions
- Training hyperparameters
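The exact field names depend on the project's config loader; an illustrative `model_config.json` covering the options above might look like the following (all keys and values here are assumptions, not copied from the repository):

```json
{
  "vocab_size": 8000,
  "d_model": 256,
  "num_heads": 8,
  "num_encoder_layers": 4,
  "num_decoder_layers": 4,
  "d_ff": 1024,
  "dropout": 0.1,
  "max_seq_len": 256,
  "batch_size": 32,
  "learning_rate": 3e-4,
  "warmup_steps": 4000,
  "num_epochs": 20
}
```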
- BPE Tokenizer: Implements Byte Pair Encoding from scratch (see the merge-loop sketch below)
- Vocabulary: Manages token-to-ID mappings
- Token Embedding: Learns dense representations for tokens
- Positional Encoding: Adds position information using sinusoidal functions (see the sketch below)
- Multi-Head Attention: Self-attention mechanism with multiple heads (see the sketch below)
- Feed-Forward Network: Position-wise fully connected layers
- Layer Normalization: Normalizes activations for stable training
- Encoder: Stack of encoder layers
- Decoder: Stack of decoder layers with masked attention
- Transformer: Complete encoder-decoder architecture
- Dataset: Custom dataset class for loading and batching
- Trainer: Complete training loop with validation
- Optimizer: Adam optimizer with learning rate scheduling (see the schedule sketch below)
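The BPE tokenizer listed above is built around the classic merge loop from Gage (1994): count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat until the merge budget is exhausted. The sketch below is a minimal illustration of that loop; the function names and the `</w>` end-of-word marker are assumptions, not the repository's actual API.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus_words, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    # Start from characters, with an end-of-word marker so merges respect word boundaries.
    word_freqs = Counter(tuple(word) + ("</w>",) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

# Example: learn a handful of merges from a toy corpus.
print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=5))
```

The learned merge list, applied in order, is what turns raw text into subword tokens at encoding time.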
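For the model components, the sketch below shows one common PyTorch formulation of sinusoidal positional encoding and multi-head self-attention with an optional mask (the masked variant is what the decoder layers use). Class names, tensor shapes, and defaults are assumptions and may not match the code in `src/model/`.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds the fixed sinusoidal position signal from Vaswani et al. (2017)."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention split across several heads."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):             # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # Reshape each projection to (batch, heads, seq_len, d_head).
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:                      # e.g. a causal mask in the decoder
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)

# Quick shape check on random data.
x = PositionalEncoding(d_model=64)(torch.randn(2, 10, 64))
print(MultiHeadSelfAttention(d_model=64, num_heads=8)(x).shape)   # torch.Size([2, 10, 64])
```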
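The "Adam optimizer with learning rate scheduling" item is often realized with the warmup-then-inverse-square-root schedule from Vaswani et al. (2017); the sketch below expresses that schedule with PyTorch's `LambdaLR`. Whether this project uses exactly this schedule is an assumption.

```python
import torch

def noam_lambda(d_model, warmup_steps):
    """lr factor = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    def fn(step):
        step = max(step, 1)                       # avoid division by zero at step 0
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
    return fn

model = torch.nn.Linear(256, 256)                 # stand-in for the Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda(256, 4000))

for step in range(5):                             # the training loop would call these per batch
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())
```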
✅ Complete from-scratch implementation
✅ No high-level transformer libraries used
✅ Educational comments throughout the code
✅ Modular and extensible design
✅ Support for custom datasets
✅ Checkpoint saving and loading
✅ Training visualization and logging
MIT License
- Attention Is All You Need (Vaswani et al., 2017)
- Byte Pair Encoding (Gage, 1994)