A complete implementation of a GPT-style Transformer architecture built from scratch using TensorFlow, featuring custom Multi-Head Self Attention, Positional Encoding, Causal Masking, Hyperparameter Tuning, Text Generation, and Attention Visualization.
- Overview
- Motivation
- Key Features
- Architecture Overview
- Transformer Components
- Project Structure
- Dataset
- Training Pipeline
- Hyperparameter Tuning
- Text Generation
- Attention Visualization
- Installation
- Project Setup
- Running The Project
- Results
- Technologies Used
- Future Improvements
- Learning Outcomes
- References
- License
- Author
ShakespeareGPT is a decoder-only Transformer Language Model inspired by modern GPT architectures and the groundbreaking paper:
Attention Is All You Need (Vaswani et al., 2017)
The project implements the complete Transformer pipeline manually without relying on TensorFlow's built-in MultiHeadAttention layer.
Every major Transformer component has been implemented from scratch, including:
- Scaled Dot Product Attention
- Multi Head Self Attention
- Sinusoidal Positional Encoding
- Causal Attention Masking
- Feed Forward Networks
- Residual Connections
- Layer Normalization
- Autoregressive Text Generation
- Hyperparameter Optimization
- Attention Visualization
The model is trained on Shakespeare's literary works and learns to generate text one character at a time in an autoregressive fashion.
Unlike high-level implementations that hide important details behind library abstractions, ShakespeareGPT focuses on understanding how modern Large Language Models work internally.
The objective of this project is not merely to generate text but to provide a practical implementation of the concepts that power systems such as:
- GPT
- GPT-2
- GPT-3
- GPT-4
- Claude
- Gemini
- LLaMA
while maintaining complete transparency over every architectural component.
Large Language Models have fundamentally transformed Natural Language Processing and Generative AI.
Modern systems are built upon Transformer architectures that utilize attention mechanisms to learn long-range dependencies and contextual relationships within text.
While many developers use pre-trained models through APIs, relatively few understand how these models operate internally.
The goal of ShakespeareGPT is to bridge that gap by building a complete Transformer architecture from first principles.
This project was developed to answer questions such as:
- How does self-attention actually work?
- Why are Transformers more effective than RNNs and LSTMs?
- How do GPT models generate coherent text?
- What role does positional encoding play?
- Why is causal masking necessary?
- How can attention maps be visualized and interpreted?
- How can hyperparameter tuning improve model performance?
By implementing every major component manually, ShakespeareGPT provides a deeper understanding of modern language modeling systems.
- GPT-Style Decoder-Only Architecture
- Custom Multi-Head Self Attention
- Scaled Dot Product Attention
- Sinusoidal Positional Encoding
- Causal Masking
- Layer Normalization
- Residual Connections
- Feed Forward Networks
- GELU Activation Function
- TensorFlow Dataset Pipeline
- Dynamic Sequence Generation
- Efficient Data Streaming
- Automatic Checkpoint Saving
- Early Stopping
- TensorBoard Integration
- Validation Monitoring
- Perplexity Tracking
- Keras Tuner Integration
- Automatic Architecture Search
- Embedding Dimension Optimization
- Transformer Depth Optimization
- Attention Head Optimization
- Feed Forward Dimension Optimization
- Learning Rate Optimization
- Dropout Optimization
- Prompt-Based Generation
- Temperature Sampling
- Top-K Sampling
- Top-P (Nucleus) Sampling
- Configurable Generation Length
- Autoregressive Decoding
- Attention Heatmap Generation
- Attention Head Inspection
- Layer-wise Attention Analysis
- Transformer Interpretability Tools
ShakespeareGPT follows a decoder-only Transformer architecture similar to GPT-style language models.
The model receives a sequence of characters and learns to predict the next character given all previous characters.
Input Characters
│
▼
Character Embedding
│
▼
Positional Encoding
│
▼
Transformer Block × N
│
▼
Layer Normalization
│
▼
Vocabulary Projection
│
▼
Softmax
│
▼
Next Character Prediction
The architecture is specifically designed for autoregressive language modeling where each prediction depends only on previously observed characters.
Each input character is mapped into a dense vector representation.
Example:
A → [0.23, -0.81, 0.52, ...]
B → [-0.14, 0.91, -0.44, ...]
C → [0.77, -0.31, 0.12, ...]
These learned embeddings allow the model to represent relationships between characters, such as which characters tend to appear in similar contexts.
Transformers process all tokens simultaneously.
Unlike recurrent architectures, they do not inherently understand sequence order.
To solve this issue, ShakespeareGPT uses sinusoidal positional encodings introduced in the original Transformer paper.
Mathematically:
PE(pos,2i)=sin(pos/10000^(2i/d_model))
PE(pos,2i+1)=cos(pos/10000^(2i/d_model))
This injects positional information into the character embeddings. Because the encoding is a fixed function of position rather than a learned table, it is also defined for positions beyond those seen during training.
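A minimal sketch of the two formulas above, written in plain Python rather than the project's TensorFlow code. The function name `sinusoidal_positional_encoding` and the toy sizes are illustrative assumptions, not names taken from `positional_encoding.py`.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # Columns 2i and 2i+1 share the frequency 1 / 10000^(2i/d_model).
            i = j // 2
            angle = pos / (10000 ** (2 * i / d_model))
            # Even columns use sine, odd columns use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1. The table is added element-wise to the character embeddings before the first Transformer block.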
Attention is computed using:
Attention(Q,K,V)=softmax(QKᵀ/√d)V
Where:
- Q = Query Matrix
- K = Key Matrix
- V = Value Matrix
- d = Dimension of the key vectors (per attention head)
This mechanism enables each character to selectively focus on relevant context from previous positions.
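To make the formula concrete, here is a framework-free sketch of scaled dot-product attention on small Python lists. It is an illustration under stated assumptions (no batching, no masking, hypothetical function names), not the code in `attention.py`.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of vectors; returns (output, attention_weights)."""
    d = len(K[0])  # key dimension used for the 1/sqrt(d) scaling
    output, weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        output.append([sum(wj * v[c] for wj, v in zip(w, V))
                       for c in range(len(V[0]))])
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and each query puts the most weight on the key it is most aligned with.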
Rather than learning a single attention distribution, multiple attention heads operate in parallel.
Benefits include:
- Learning multiple contextual relationships simultaneously
- Capturing long-range dependencies
- Improving representation quality
- Better modeling of linguistic structure
Each head learns different aspects of the sequence.
Some heads may focus on punctuation while others focus on speaker names, sentence boundaries, or semantic context.
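The sketch below shows only the bookkeeping behind "multiple heads in parallel": the embedding dimension is cut into equal slices, one per head, and the per-head results are concatenated again. The helper names are hypothetical, and the learned query, key, value, and output projections that a full multi-head layer applies around this step are omitted.

```python
def split_heads(x, num_heads):
    """x: seq_len x d_model  ->  num_heads x seq_len x head_dim."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(num_heads)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate the head slices per position."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[t]]
            for t in range(seq_len)]

x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(x, num_heads=2)
restored = merge_heads(heads)
```

In a full layer, attention runs independently on each element of `heads` before the merge, so every head can learn its own attention pattern.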
Future information must not be visible during training.
Consider:
Input:
TO BE OR N
Target:
O
The model should only observe characters before the target.
Causal masks hide future positions by adding large negative values to their attention scores before the softmax operation, so their attention weights become effectively zero.
This guarantees true autoregressive behavior.
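A small sketch of the masking step in plain Python. The constant `NEG_INF = -1e9` and the function names are assumptions made for illustration; the project may use a different constant or build the mask with TensorFlow operations.

```python
import math

NEG_INF = -1e9  # assumed "large negative value"; the exact constant is illustrative

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax after replacing disallowed scores with NEG_INF."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else NEG_INF for s, ok in zip(row, allowed)]
        m = max(masked)
        exps = [math.exp(v - m) for v in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With identical raw scores, each position spreads its attention evenly
# over itself and the positions before it, and gives zero to the future.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

The resulting weight matrix is lower triangular: row 0 attends only to position 0, row 1 splits evenly between positions 0 and 1, and so on.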
Every Transformer block contains a position-wise feed forward network:
Dense(ff_dim)
│
▼
GELU
│
▼
Dense(embed_dim)
This network enables non-linear transformations of learned representations and significantly increases the model's expressive power.
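The Dense, GELU, Dense stack above can be sketched for a single position as follows. This uses the exact erf-based GELU; whether the project uses this form or the tanh approximation is not stated, so treat that choice, the helper names, and the toy weights as assumptions.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dense(x, W, b):
    """x: in_dim vector, W: in_dim x out_dim matrix, b: out_dim vector."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """Expand to ff_dim, apply GELU, project back to embed_dim."""
    hidden = [gelu(h) for h in dense(x, W1, b1)]
    return dense(hidden, W2, b2)

# Toy sizes: embed_dim = 2, ff_dim = 3
W1 = [[1.0, 0.0, 1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]
out = feed_forward([1.0, -1.0], W1, b1, W2, b2)
```

The same weights are applied to every position independently, which is why the network is called position-wise.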
Residual connections help stabilize training and improve gradient flow.
Instead of learning complete transformations, layers learn residual updates:
Output = Layer(x) + x
This allows deeper Transformer architectures to train effectively.
Layer normalization stabilizes activations and improves convergence.
It is applied after each residual connection within the Transformer block.
Benefits include:
- Faster convergence
- Improved stability
- Reduced internal covariate shift
- Better gradient propagation
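A sketch of the residual update followed by layer normalization, in the order described above (normalization after the residual connection). The names `layer_norm` and `residual_then_norm`, the `eps` value, and the toy sub-layer are illustrative assumptions.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-6):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gamma = gamma or [1.0] * n   # learned scale, identity by default
    beta = beta or [0.0] * n     # learned shift, zero by default
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]

def residual_then_norm(x, sublayer):
    """Compute LayerNorm(x + sublayer(x))."""
    summed = [xi + si for xi, si in zip(x, sublayer(x))]
    return layer_norm(summed)

# Toy sub-layer standing in for attention or the feed forward network
out = residual_then_norm([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * t for t in v])
```

Because the input `x` is added back unchanged, the sub-layer only has to learn a correction to it, and gradients have a direct path through the addition.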
The project follows a modular architecture to ensure maintainability, scalability, and separation of concerns.
ShakespeareGPT/
│
├── configs/
│ └── config.py
│
├── data/
│ └── raw/
│ └── shakespeare.txt
│
├── src/
│ │
│ ├── data/
│ │ ├── tokenizer.py
│ │ └── dataset.py
│ │
│ ├── layers/
│ │ ├── positional_encoding.py
│ │ ├── attention.py
│ │ └── transformer_block.py
│ │
│ ├── models/
│ │ └── bardformer.py
│ │
│ ├── training/
│ │ └── trainer.py
│ │
│ ├── tuning/
│ │ └── tuner.py
│ │
│ ├── inference/
│ │ ├── sampling.py
│ │ └── generate.py
│ │
│ └── visualization/
│ └── attention_visualizer.py
│
├── train.py
├── tune.py
├── generate.py
├── visualize_attention.py
│
├── artifacts/
├── checkpoints/
├── logs/
├── outputs/
├── tuner_results/
│
├── requirements.txt
├── LICENSE
├── .gitignore
└── README.md
configs/config.py contains all configurable hyperparameters and project settings.
Examples:
- Learning Rate
- Batch Size
- Sequence Length
- Embedding Dimension
- Number of Attention Heads
- Number of Transformer Layers
- Generation Parameters
Centralizing configuration makes experimentation significantly easier.
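One possible shape for such a central configuration is sketched below. The class and field names are hypothetical and are not taken from `config.py`; the numeric values are the ones quoted in the Results table and the Text Generation section of this README. The learning rate is left out because no default is stated.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    # Model and data settings (values quoted in the Results table)
    seq_len: int = 256
    embed_dim: int = 256
    num_heads: int = 8
    num_layers: int = 6
    ff_dim: int = 1024
    batch_size: int = 32
    # Generation settings (values quoted in the Text Generation section)
    temperature: float = 0.8
    top_k: int = 20
    top_p: float = 0.9

    def __post_init__(self):
        # Each attention head needs an equal slice of the embedding.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.embed_dim // self.num_heads

cfg = Config()
small = replace(cfg, embed_dim=128, num_heads=4)  # derive an experiment variant
```

Keeping the object immutable and deriving variants with `replace` makes it harder for one experiment to silently change another's settings.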
src/data/ is responsible for all dataset processing operations.
tokenizer.py implements a custom character-level tokenizer.
Responsibilities:
- Vocabulary Creation
- Character-to-Index Mapping
- Index-to-Character Mapping
- Encoding
- Decoding
- Artifact Persistence
Generated Artifacts:
vocabulary.pkl
char2idx.pkl
idx2char.pkl
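A minimal character-level tokenizer covering the vocabulary, mapping, encoding, and decoding responsibilities listed above. The class name `CharTokenizer` is an assumption, and artifact persistence is omitted; the three attributes correspond to the objects that would be saved to the files named above.

```python
class CharTokenizer:
    """Character-level tokenizer built from a training text."""

    def __init__(self, text):
        # Sorted so that the mapping is deterministic across runs.
        self.vocabulary = sorted(set(text))
        self.char2idx = {ch: i for i, ch in enumerate(self.vocabulary)}
        self.idx2char = {i: ch for ch, i in self.char2idx.items()}

    def encode(self, text):
        """Map each character to its integer index."""
        return [self.char2idx[ch] for ch in text]

    def decode(self, ids):
        """Map integer indices back to a string."""
        return "".join(self.idx2char[i] for i in ids)

tokenizer = CharTokenizer("TO BE OR NOT TO BE")
ids = tokenizer.encode("TO BE")
text = tokenizer.decode(ids)
```

Every character in the training corpus gets an index, so the corpus itself can never produce an out-of-vocabulary token. In this sketch, a prompt containing a character that never appeared in the corpus raises a `KeyError`.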
dataset.py builds TensorFlow datasets used during training.
Responsibilities:
- Sequence Creation
- Sliding Window Generation
- Dataset Splitting
- Batching
- Prefetching
- Pipeline Optimization
Output:
(train_dataset, validation_dataset)
src/layers/ contains custom Transformer building blocks.
positional_encoding.py implements sinusoidal positional encodings from the original Transformer paper.
Purpose:
Allow the model to understand sequence order.
attention.py implements:
- Scaled Dot Product Attention
- Multi Head Self Attention
- Causal Masking
This file forms the core of the Transformer architecture.
transformer_block.py defines a complete Transformer block consisting of:
- Multi Head Attention
- Feed Forward Network
- Layer Normalization
- Residual Connections
- Dropout
models/bardformer.py defines the complete GPT-style architecture.
Components:
- Embedding Layer
- Positional Encoding
- Transformer Stack
- Final Projection Layer
This file assembles all lower-level components into a complete language model.
training/trainer.py is responsible for model training.
Features:
- Model Compilation
- Training Loop
- Checkpointing
- TensorBoard Logging
- Early Stopping
- Perplexity Calculation
- Progress Reporting
tuning/tuner.py performs hyperparameter optimization.
Search Space Includes:
- Embedding Dimension
- Attention Heads
- Transformer Depth
- Feed Forward Dimension
- Learning Rate
- Dropout
inference/sampling.py implements advanced generation strategies.
Supported Methods:
- Temperature Sampling
- Top-K Sampling
- Top-P Sampling
These techniques improve generation diversity and realism.
inference/generate.py is responsible for text generation.
Pipeline:
Prompt
↓
Tokenization
↓
Inference
↓
Sampling
↓
Decoding
↓
Generated Text
visualization/attention_visualizer.py provides interpretability tools.
Features:
- Attention Extraction
- Heatmap Generation
- Layer Inspection
- Head Inspection
The model is trained on Shakespeare's literary works.
The corpus contains:
- Tragedies
- Comedies
- Historical Plays
- Sonnets
- Character Dialogues
- Monologues
The complete corpus provides a rich source of language patterns suitable for training an autoregressive language model.
Most modern language models use:
- Word Tokenization
- Subword Tokenization
- Byte Pair Encoding (BPE)
This project intentionally uses character-level tokenization because it provides a deeper understanding of sequence modeling fundamentals.
Advantages:
- Simpler tokenizer implementation
- No Out-of-Vocabulary tokens
- Better understanding of Transformer mechanics
- Full control over preprocessing
Challenges:
- Longer training times
- Longer dependencies
- More difficult learning problem
Despite these challenges, character-level models are excellent educational tools for understanding language modeling.
The raw corpus undergoes the following transformations:
Raw Shakespeare Corpus
│
▼
Character Tokenization
│
▼
Vocabulary Creation
│
▼
Character Encoding
│
▼
Sliding Window Generation
│
▼
Input / Target Pairs
│
▼
Train / Validation Split
│
▼
TensorFlow Dataset Pipeline
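The sliding-window and split steps of this pipeline can be sketched in plain Python as below. The function names, the `stride` and `val_fraction` parameters, and the choice of a target sequence shifted by one character are assumptions for illustration; the project builds the equivalent with a TensorFlow dataset pipeline and may pair each window with a single next character instead.

```python
def make_windows(ids, seq_len, stride=1):
    """Slide a window over encoded text; the target is the input shifted by one."""
    inputs, targets = [], []
    for start in range(0, len(ids) - seq_len, stride):
        inputs.append(ids[start:start + seq_len])
        targets.append(ids[start + 1:start + seq_len + 1])
    return inputs, targets

def train_val_split(inputs, targets, val_fraction=0.1):
    """Hold out the last val_fraction of the windows for validation."""
    n_val = int(len(inputs) * val_fraction)
    split = len(inputs) - n_val
    return (inputs[:split], targets[:split]), (inputs[split:], targets[split:])

ids = list(range(12))                       # stand-in for encoded characters
inputs, targets = make_windows(ids, seq_len=2)
train, val = train_val_split(inputs, targets, val_fraction=0.2)
```

Splitting by position rather than at random keeps validation text out of the training portion, although windows that straddle the boundary still share a few characters with the last training windows.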
The training workflow is fully automated.
Running:
python train.py
executes the following sequence:
Create Project Directories
│
▼
Build Tokenizer Artifacts
│
▼
Load Shakespeare Dataset
│
▼
Create TensorFlow Datasets
│
▼
Build Transformer Model
│
▼
Compile Model
│
▼
Train Model
│
▼
Evaluate Validation Metrics
│
▼
Save Best Checkpoint
│
▼
Generate TensorBoard Logs
The following metrics are monitored during training.
Training loss measures prediction error on the training dataset.
Lower values indicate better model performance.
Validation loss measures generalization performance on unseen data.
Used for:
- Model Selection
- Early Stopping
- Hyperparameter Evaluation
Accuracy is the character-level prediction accuracy.
It represents the percentage of correctly predicted next characters.
Perplexity is a standard metric used for evaluating language models.
Perplexity is computed as:
Perplexity = exp(loss)
Lower perplexity indicates better predictive capability.
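A short worked example of the formula, assuming the loss is the mean cross-entropy measured in nats (natural logarithm). The helper names are illustrative; the vocabulary size of 97 and the loss of 1.82 are the values reported in the Results table.

```python
import math

def mean_cross_entropy(correct_char_probs):
    """Average negative log-likelihood (in nats) of the correct next characters."""
    return -sum(math.log(p) for p in correct_char_probs) / len(correct_char_probs)

def perplexity(loss):
    """Perplexity = exp(loss), with the loss measured in nats."""
    return math.exp(loss)

# A model that guesses uniformly over a 97-character vocabulary
uniform_loss = mean_cross_entropy([1 / 97] * 10)
print(round(perplexity(uniform_loss), 2))   # 97.0
print(round(perplexity(1.82), 2))           # 6.17
```

Perplexity can be read as the effective number of equally likely choices per character: a uniform guess over 97 characters scores 97, while a validation loss of 1.82 corresponds to about 6.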
The model with the lowest validation loss is saved automatically.
Checkpoint Path:
checkpoints/best_model.weights.h5
Only the highest-performing checkpoint is preserved.
This prevents storage waste while ensuring the best model remains available for inference.
Training logs are automatically generated.
Launch TensorBoard:
tensorboard --logdir logs/tensorboard
Available Visualizations:
- Training Loss Curves
- Validation Loss Curves
- Accuracy Curves
- Learning Dynamics
- Weight Histograms
ShakespeareGPT includes an automated hyperparameter optimization pipeline powered by Keras Tuner.
The objective is to discover the best-performing Transformer configuration by evaluating multiple architectural combinations.
The tuning process explores different values for:
| Hyperparameter | Search Space |
|---|---|
| Embedding Dimension | 128, 256, 384, 512 |
| Attention Heads | 4, 8 |
| Transformer Layers | 4, 6, 8 |
| Feed Forward Dimension | 512, 1024, 2048 |
| Dropout Rate | 0.1, 0.2, 0.3 |
| Learning Rate | 1e-4, 5e-4, 1e-3 |
Execute:
python tune.py
The tuning pipeline performs:
Build Model
│
▼
Train Candidate Model
│
▼
Evaluate Validation Loss
│
▼
Record Results
│
▼
Generate Next Candidate
│
▼
Select Best Configuration
The best hyperparameters are automatically reported after the search completes.
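To get a feel for the size of the search space in the table above, the sketch below enumerates every combination with the standard library. The dictionary keys are hypothetical names, and this is not how Keras Tuner is invoked; a tuner would normally evaluate only a subset of these candidates.

```python
from itertools import product

SEARCH_SPACE = {
    "embed_dim": [128, 256, 384, 512],
    "num_heads": [4, 8],
    "num_layers": [4, 6, 8],
    "ff_dim": [512, 1024, 2048],
    "dropout": [0.1, 0.2, 0.3],
    "learning_rate": [1e-4, 5e-4, 1e-3],
}

def enumerate_configs(space):
    """Yield every combination whose embedding splits evenly across the heads."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        if config["embed_dim"] % config["num_heads"] == 0:
            yield config

configs = list(enumerate_configs(SEARCH_SPACE))
print(len(configs))  # 648
```

All listed embedding dimensions are divisible by both head counts, so the divisibility check removes nothing here: 4 × 2 × 3 × 3 × 3 × 3 = 648 candidate configurations.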
After training, ShakespeareGPT can generate entirely new Shakespeare-style text.
The generation process is autoregressive.
At every step:
Previous Context
│
▼
Transformer Prediction
│
▼
Sampling Strategy
│
▼
Next Character
│
▼
Append To Sequence
│
▼
Repeat
The newly generated character becomes part of the context for the next prediction.
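The loop above can be sketched as follows. `predict_next` is a hypothetical callable standing in for the Transformer forward pass plus the sampling strategy, and `context_len` is an assumed name for the maximum sequence length the model accepts.

```python
def generate(prompt_ids, predict_next, max_new_tokens, context_len):
    """Autoregressive decoding: predict, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Feed at most the context_len most recent tokens to the model.
        context = ids[-context_len:]
        next_id = predict_next(context)
        # The new token becomes part of the context for the next step.
        ids.append(next_id)
    return ids

def toy_predict(context):
    """Toy stand-in for the model: cycle through token ids 0..4."""
    return (context[-1] + 1) % 5

out = generate([0], toy_predict, max_new_tokens=6, context_len=4)
print(out)  # [0, 1, 2, 3, 4, 0, 1]
```

Truncating the context to the most recent tokens lets generation continue past the length the model was trained on, at the cost of forgetting the earliest characters.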
Temperature controls randomness during generation.
Low Temperature:
0.3 - 0.7
Characteristics:
- Conservative
- Predictable
- Less Creative
High Temperature:
1.0 - 1.5
Characteristics:
- Creative
- Diverse
- Less Stable
Default:
TEMPERATURE = 0.8
Top-K sampling restricts sampling to the K most probable candidates.
Example:
Top K = 20
Only the 20 most likely characters remain eligible for selection.
Benefits:
- Reduces noise
- Improves coherence
- Prevents unlikely outputs
Rather than keeping a fixed number of candidates, Top-P keeps the smallest set of most probable candidates whose cumulative probability reaches P.
Example:
TOP_P = 0.9
This allows dynamic adaptation depending on prediction confidence.
Benefits:
- More natural text
- Better diversity
- Improved generation quality
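The three strategies can be combined in one sampling step, sketched below in plain Python. The function name, the order of the filters, and the choice to apply Top-P to the probabilities without renormalizing after Top-K are assumptions; `sampling.py` may combine them differently.

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=20, top_p=0.9, rng=random):
    """Pick the next token id using temperature, Top-K, and Top-P filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide the logits before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least probable, then keep the top K.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-P: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the survivors in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

next_id = sample_next([2.0, 1.0, 0.5, -1.0])
```

Passing a seeded `random.Random` instance as `rng` makes generation reproducible, which is useful when comparing sampling settings on the same prompt.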
Execute:
python generate.py
Example Prompt:
KING:
Example Output:
KING:
My lord, the heavens shine upon thee this night.
The stars themselves bear witness to thy glory,
And every wind doth whisper of thy noble deeds.
Generated text is automatically saved to:
outputs/generated_text.txt
One of the most powerful aspects of Transformer models is the ability to inspect attention patterns.
ShakespeareGPT includes a complete visualization pipeline for analyzing learned attention distributions.
Attention heatmaps allow us to understand:
- Which characters influence predictions
- Long-range dependencies
- Context utilization
- Model reasoning behavior
Attention maps give a direct, though partial, view of what the model attends to, which many other neural architectures do not expose.
Execute:
python visualize_attention.py
Example Prompt:
KING:
Output:
outputs/attention_heatmap.png
For a sequence length of 256, each attention head produces 256 × 256 attention scores.
For 8 heads and 6 layers, the model learns 48 independent attention patterns, which collectively model contextual relationships.
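Clone the repository:
git clone https://github.com/7vik2005/ShakespeareGPT.git
cd ShakespeareGPT
Create and activate a virtual environment (Windows):
python -m venv venv
venv\Scripts\activate
Create and activate a virtual environment (Linux / macOS):
python3 -m venv venv
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Verify the Python version:
python --version
Expected: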
Python 3.9+
Download the Shakespeare corpus and place it inside:
data/raw/shakespeare.txt
Expected structure:
data/
└── raw/
└── shakespeare.txt
The training pipeline automatically handles:
- Vocabulary Creation
- Character Encoding
- Artifact Generation
No manual preprocessing is required.
Train the model:
python train.py
Tune hyperparameters:
python tune.py
Generate text:
python generate.py
Visualize attention:
python visualize_attention.py
Representative results obtained using the default configuration:
| Metric | Value |
|---|---|
| Vocabulary Size | 97 |
| Sequence Length | 256 |
| Embedding Dimension | 256 |
| Attention Heads | 8 |
| Transformer Layers | 6 |
| Feed Forward Dimension | 1024 |
| Batch Size | 32 |
| Parameters | ~12 Million |
| Validation Accuracy | 53.4% |
| Validation Loss | 1.82 |
| Perplexity | 6.17 |
| Training Epochs | 20 |
Prompt:
KING:
Generated:
KING:
What sayest thou, my noble friend?
The heavens whisper through the silent air,
And all the stars bear witness unto fate.
- Python
- TensorFlow
- Keras
- NumPy
- Matplotlib
- TensorBoard
- Keras Tuner
- Transformer Architecture
- Attention Mechanisms
- Autoregressive Language Modeling
Potential future extensions include:
- Byte Pair Encoding (BPE)
- Word-Level Tokenization
- Rotary Positional Embeddings
- Flash Attention
- Mixed Precision Training
- Weight Tying
- Distributed Multi-GPU Training
- Quantization
- Fine-Tuning Support
- LoRA Integration
- Transformer Scaling Experiments
- Instruction Tuning
This project demonstrates practical understanding of:
- Neural Networks
- Optimization
- Representation Learning
- Language Modeling
- Sequence Prediction
- Text Generation
- Self Attention
- Multi Head Attention
- Positional Encoding
- Decoder-Only Design
- Modular Design
- Training Pipelines
- Experiment Management
- Configuration Management
- Checkpointing
- Logging
- Hyperparameter Optimization
- Model Evaluation
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Attention Is All You Need
NeurIPS 2017
https://arxiv.org/abs/1706.03762
This project is licensed under the MIT License.
See the LICENSE file for complete details.
AI Engineer • Full Stack Developer • Machine Learning Enthusiast
If you found this project useful, consider starring the repository.