Small Transformer Model - Built From Scratch

A complete implementation of a Transformer model built from scratch using PyTorch, including:

  • Byte Pair Encoding (BPE) Tokenization
  • Token Embeddings
  • Positional Encoding
  • Multi-Head Self-Attention
  • Feed-Forward Networks
  • Layer Normalization
  • Complete Encoder-Decoder Architecture

Directory Structure

small_transformer_model/
├── README.md
├── requirements.txt
├── config/
│   └── model_config.json
├── src/
│   ├── tokenizer/          # BPE tokenization implementation
│   ├── model/              # Transformer architecture components
│   ├── training/           # Training loop and dataset handling
│   └── utils/              # Helper functions and metrics
├── data/                   # Training data
├── checkpoints/            # Saved model checkpoints
├── logs/                   # Training logs
├── notebooks/              # Jupyter notebooks for demos
└── scripts/                # Training and inference scripts

Installation

pip install -r requirements.txt

Quick Start

1. Prepare Your Data

Place your training text data in the data/raw/ directory.

2. Train the Tokenizer

python scripts/prepare_data.py --input data/raw/corpus.txt --output data/processed/

3. Train the Model

python scripts/train.py --config config/model_config.json

4. Run Inference

python scripts/inference.py --model checkpoints/best_model.pt --text "Your input text"

Model Configuration

Edit config/model_config.json to customize:

  • Vocabulary size
  • Embedding dimensions
  • Number of attention heads
  • Number of encoder/decoder layers
  • Feed-forward dimensions
  • Training hyperparameters
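
For illustration, the snippet below generates a config of roughly this shape. Every key name here (vocab_size, d_model, num_heads, and so on) is an assumption made for the sketch; defer to the schema actually used in config/model_config.json.

import json

# Hypothetical schema -- key names are assumptions, not the repository's actual config.
example_config = {
    "vocab_size": 8000,          # BPE vocabulary size
    "d_model": 256,              # embedding dimension
    "num_heads": 4,              # attention heads per layer
    "num_encoder_layers": 4,
    "num_decoder_layers": 4,
    "d_ff": 1024,                # feed-forward hidden dimension
    "dropout": 0.1,
    "learning_rate": 3e-4,
    "batch_size": 32,
}

with open("config/model_config.json", "w") as f:
    json.dump(example_config, f, indent=2)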

Components

Tokenization

  • BPE Tokenizer: Implements Byte Pair Encoding from scratch
  • Vocabulary: Manages token-to-ID mappings
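
The core BPE idea, independent of this repository's exact implementation, is to repeatedly merge the most frequent adjacent symbol pair in the corpus. A minimal sketch (the helper names are invented for this example):

from collections import Counter
import re

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, corpus):
    # Replace the pair with its concatenation, matching whole symbols only.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

# Words are stored as space-separated symbols, starting from single characters.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(10):  # the number of merges controls the final vocabulary size
    corpus = merge_pair(most_frequent_pair(corpus), corpus)
print(corpus)  # e.g. {'low': 5, 'low e r': 2, 'newest': 6, 'widest': 3}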

Model Architecture

  • Token Embedding: Learns dense representations for tokens
  • Positional Encoding: Adds position information using sinusoidal functions (see the sketch after this list)
  • Multi-Head Attention: Self-attention mechanism with multiple heads
  • Feed-Forward Network: Position-wise fully connected layers
  • Layer Normalization: Normalizes activations for stable training
  • Encoder: Stack of encoder layers
  • Decoder: Stack of decoder layers with masked attention
  • Transformer: Complete encoder-decoder architecture
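
As a reference point for the positional-encoding item above, here is the standard sinusoidal formulation from Vaswani et al. (2017) in PyTorch; the repository's own code may be organized differently:

import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(max_len).unsqueeze(1)  # shape (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The encoding is added to the token embeddings before the first encoder layer.
pe = sinusoidal_positional_encoding(max_len=512, d_model=256)
print(pe.shape)  # torch.Size([512, 256])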

Training

  • Dataset: Custom dataset class for loading and batching
  • Trainer: Complete training loop with validation
  • Optimizer: Adam optimizer with learning rate scheduling
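
The original paper pairs Adam with a warmup-then-inverse-square-root schedule. Assuming this repository does something similar (its actual scheduler may differ), a sketch with torch.optim looks like:

import torch

d_model, warmup_steps = 256, 4000
model = torch.nn.Linear(d_model, d_model)  # stand-in for the real Transformer

# Base lr of 1.0 so LambdaLR's multiplier becomes the effective learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step: int) -> float:
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

# Inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()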

Features

✅ Complete from-scratch implementation
✅ No high-level transformer libraries used
✅ Educational comments throughout code
✅ Modular and extensible design
✅ Support for custom datasets
✅ Checkpoint saving and loading
✅ Training visualization and logging

License

MIT License

References

  • Attention Is All You Need (Vaswani et al., 2017)
  • Byte Pair Encoding (Gage, 1994)
