This repository contains a transformer-based language model trained on a dataset of English-French sentence pairs. Tokenization is implemented from scratch, including a BPE (byte pair encoding) tokenizer, although the model described below is trained at the character level (see the limitations below). The goal is to understand the internal workings and implementation of language models. This 22M-parameter model is trained on ~6M of text data and serves as a demonstration/learning tool, not a general-purpose model.
Original paper: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
I have broken down the components into separate modules. Each component is initially implemented using NumPy to enhance understanding of the underlying mathematics, and then integrated using PyTorch for the complete model.
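For example, the core of the self-attention notebook is scaled dot-product attention. Below is a minimal NumPy sketch of that operation; the function name and array shapes are illustrative and not the exact code in `self_attention.ipynb`.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Minimal sketch: attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity scores between every query and key, scaled by sqrt(d_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, seq_q, seq_k)
    if mask is not None:
        # Positions where mask is False (e.g. padding or future tokens) are suppressed
        scores = np.where(mask, scores, -1e9)
    # Softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (batch, seq_q, d_v)

# Example: batch of 2 sequences, length 5, model dimension 8
Q = K = V = np.random.randn(2, 5, 8)
print(scaled_dot_product_attention(Q, K, V).shape)         # (2, 5, 8)
```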
```
transformer-from-scratch/
│
├── data/
│   ├── English.txt
│   ├── French.txt
│   └── shakespeare.txt
│
├── en-fr-transformer/
│   ├── en-fr-train.ipynb                       # Training notebook for EN-FR transformer
│   ├── model.py                                # Complete Transformer model
│   └── en-fr-toy-transformer-experiment.ipynb  # Experimentation notebook
│
├── simple-transformer/                         # Building blocks
│   ├── decoder.py
│   ├── encoder.py
│   ├── normalisation.ipynb
│   ├── positional_encoding.ipynb
│   └── self_attention.ipynb
│
└── README.md                                   # Documentation
```
The model.py file contains the main transformer model with:
- 3 encoder layers
- 3 decoder layers
- 8 attention heads
- Embedding dimension of 512
- Feed-forward dimension of 2048
- Dropout rate of 0.1
- Maximum sequence length of 100 characters

Key components:

- MultiHeadAttention: Implements the multi-head self-attention mechanism.
- PositionWiseFeedForward: The feed-forward network applied to each position.
- PositionalEncoding: Adds positional information to the embeddings.
- EncoderLayer: Full encoder layer with self-attention and feed-forward networks.
- DecoderLayer: Full decoder layer with self-attention, cross-attention, and feed-forward networks.
- Transformer: The complete model integrating all components.
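As a concrete example of one of these components, here is a minimal PyTorch sketch of sinusoidal positional encoding as described in the original paper; the actual class in `model.py` may differ in details.

```python
import math
import torch
from torch import nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding (sketch; model.py may differ in details)."""
    def __init__(self, d_model: int = 512, max_seq_length: int = 100):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_seq_length, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions
        return x + self.pe[:, : x.size(1)]
```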
Vocabulary and parameter counts:

- Source vocabulary size: 108 (English characters + special tokens)
- Target vocabulary size: 143 (French characters including accented characters + special tokens)
- Total trainable parameters: 22,254,724 (22M)
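To make the configuration concrete, here is an illustrative sketch of instantiating the model and verifying the parameter count; the constructor arguments are assumptions and may not match the exact signature in `model.py`.

```python
import torch
from model import Transformer  # run from en-fr-transformer/; see model.py

# Hypothetical constructor arguments -- check model.py for the exact signature.
model = Transformer(
    src_vocab_size=108, tgt_vocab_size=143,
    d_model=512, num_heads=8, num_layers=3,
    d_ff=2048, max_seq_length=100, dropout=0.1,
)

# Count trainable parameters (should come out around 22.2M)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")
```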
The model is trained on a dataset of English-French sentence pairs. The dataset is preprocessed to:
- Filter sentences that exceed the maximum sequence length.
- Ensure all characters in the sentences are within the defined vocabularies.
- Add special tokens (`START`, `END`, `PADDING`, `UNKNOWN`); a preprocessing sketch follows this list.
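The sketch below illustrates this character-level preprocessing; the token names and helper functions are assumptions, not the exact identifiers used in the training notebook.

```python
# Illustrative preprocessing sketch -- identifiers are assumptions, not the notebook's exact code.
MAX_SEQ_LENGTH = 100
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]  # PADDING, START, END, UNKNOWN

def build_vocab(sentences):
    """Character-level vocabulary: special tokens first, then every character seen."""
    chars = sorted({ch for s in sentences for ch in s})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}

def preprocess(pairs, src_vocab, tgt_vocab):
    kept = []
    for en, fr in pairs:
        # 1. Filter sentences that exceed the maximum sequence length
        if len(en) > MAX_SEQ_LENGTH or len(fr) > MAX_SEQ_LENGTH:
            continue
        # 2. Map characters to ids, falling back to <UNK> for out-of-vocabulary characters
        src_ids = [src_vocab.get(ch, src_vocab["<UNK>"]) for ch in en]
        # 3. Wrap the target in START/END tokens
        tgt_ids = ([tgt_vocab["<SOS>"]]
                   + [tgt_vocab.get(ch, tgt_vocab["<UNK>"]) for ch in fr]
                   + [tgt_vocab["<EOS>"]])
        kept.append((src_ids, tgt_ids))
    return kept
```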
Training setup (an illustrative training-loop sketch follows this list):

- Hardware: Trained on MPS (Metal Performance Shaders) for Apple Silicon
- Training time: 100 minutes for 2 epochs
- Batch size: 32
- Optimizer: AdamW with learning rate 1e-4
- Loss function: Cross-entropy loss (ignoring padding tokens)
- Gradient clipping: Applied at 1.0
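The following is a minimal sketch of a training step consistent with this setup; it assumes `model` and `train_loader` are already defined, and `PAD_ID` is an illustrative name for the padding token id.

```python
import torch
from torch import nn

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
PAD_ID = 0  # assumed id of the padding token

model = model.to(device)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)        # ignore padding tokens
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for src, tgt in train_loader:                                # batch size 32
    src, tgt = src.to(device), tgt.to(device)
    optimizer.zero_grad()
    # Teacher forcing: predict tgt[1:] from tgt[:-1]
    logits = model(src, tgt[:, :-1])                         # (batch, seq-1, tgt_vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
```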
Training results:

Epoch 1
- Initial loss: ~1.8342
- Final train loss: 1.3595
- Validation loss: 0.9313
Epoch 2
- Initial loss: ~0.8570
- Final train loss: 0.8604
- Validation loss: 0.6674
Sample translations:

| English Input | French Translation | Literal Translation |
|---|---|---|
| What should we do when the day starts? | `<SOS>` Ce que devrais-nous faire le dire ? | What should we do to say it? |
| What do you think about the book? | `<SOS>` Ce que vous pensez-vous ? | What you think-you? |
| Where are you going? | `<SOS>` Où vous avez en train de matin ? | Where are you in the process of morning? |
| I want a new book. | `<SOS>` Je veux un livre un livre. | I want a book a book. |
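These outputs come from autoregressive decoding of the trained model. Below is a sketch of greedy character-level decoding; the function and argument names are illustrative, and the notebook may decode differently (e.g. with a different stopping rule).

```python
import torch

@torch.no_grad()
def greedy_translate(model, src_ids, sos_id, eos_id, max_len=100):
    """Greedy character-level decoding sketch (illustrative, not the notebook's exact code)."""
    model.eval()
    src = torch.tensor([src_ids])                 # (1, src_len)
    tgt = torch.tensor([[sos_id]])                # start with the START token
    for _ in range(max_len - 1):
        logits = model(src, tgt)                  # (1, tgt_len, tgt_vocab)
        next_id = logits[0, -1].argmax().item()   # most likely next character
        tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                     # stop at the END token
            break
    return tgt[0].tolist()
```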
Limitations:

- The current model is trained at the character level rather than using subword tokens (like BPE).
- While the model shows promising results for simple sentences, it still has limitations for more complex translations.
- The sample translations indicate that the model has begun to learn patterns but still produces some inaccuracies.
Thank you for exploring this Transformer implementation! I hope it is helpful for understanding this powerful architecture. Contributions and feedback are welcome!