anushacodes/transformer-from-scratch

Transformer Implementation from Scratch

Project Overview

This repository contains a transformer-based language model trained on a dataset of English-French sentence pairs, along with a byte-pair encoding (BPE) tokenizer implemented from scratch. The goal is to understand the internal workings and implementation of language models. The resulting 22M-parameter model is trained on ~6M of text data and serves as a demonstration/learning tool, not a general-purpose translator.

Original paper: "Attention Is All You Need"

I have broken down the components into separate modules. Each component is initially implemented using NumPy to enhance understanding of the underlying mathematics, and then integrated using PyTorch for the complete model.

Repository Structure

transformer-from-scratch/
│
├── data/
│   ├── English.txt
│   ├── French.txt
│   └── shakespeare.txt
│
├── en-fr-transformer/
│   ├── en-fr-train.ipynb                          # Training notebook for EN-FR transformer
│   ├── model.py                                   # Complete Transformer model
│   └── en-fr-toy-transformer-experiment.ipynb     # Experimentation notebook
│
├── simple-transformer/                            # Building blocks
│   ├── decoder.py
│   ├── encoder.py
│   ├── normalisation.ipynb
│   ├── positional_encoding.ipynb
│   └── self_attention.ipynb
│
└── README.md                                      # Documentation

Model Architecture

The model.py file contains the main transformer model with:

  • 3 encoder layers
  • 3 decoder layers
  • 8 attention heads
  • Embedding dimension of 512
  • Feed-forward dimension of 2048
  • Dropout rate of 0.1
  • Maximum sequence length of 100 characters
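As a concrete illustration of one piece of this configuration, the sinusoidal positional encoding for d_model = 512 and max_len = 100 can be sketched in NumPy as follows (the function name is illustrative, not the repository's actual code):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    div_terms = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_terms)             # even dimensions
    pe[:, 1::2] = np.cos(positions * div_terms)             # odd dimensions
    return pe

pe = positional_encoding(max_len=100, d_model=512)
print(pe.shape)  # (100, 512)
```

Each position gets a unique pattern of sines and cosines, which is added to the token embeddings so the model can distinguish token order.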

Key Components

  • MultiHeadAttention: Implements the multi-head self-attention mechanism.
  • PositionWiseFeedForward: The feed-forward network applied to each position.
  • PositionalEncoding: Adds positional information to the embeddings.
  • EncoderLayer: Full encoder layer with self-attention and feed-forward networks.
  • DecoderLayer: Full decoder layer with self-attention, cross-attention, and feed-forward networks.
  • Transformer: The complete model integrating all components.
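A minimal NumPy sketch of the multi-head self-attention computation at the heart of MultiHeadAttention (the weight matrices here are random placeholders, not the trained model's parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads
    # Project, then split into heads: (num_heads, seq_len, d_k)
    def project(w):
        return (x @ w).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    q, k, v = project(w_q), project(w_k), project(w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)        # (heads, seq, seq)
    out = softmax(scores) @ v                               # (heads, seq, d_k)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return out @ w_o

rng = np.random.default_rng(0)
d_model, heads, seq = 512, 8, 10
ws = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
y = multi_head_attention(rng.normal(size=(seq, d_model)), *ws, num_heads=heads)
print(y.shape)  # (10, 512)
```

With d_model = 512 and 8 heads, each head attends in a 64-dimensional subspace, and the concatenated outputs are recombined by the final projection.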

Model Parameters

  • Source vocabulary size: 108 (English characters + special tokens)
  • Target vocabulary size: 143 (French characters including accented characters + special tokens)
  • Total trainable parameters: 22,254,724 (22M)

Dataset

The model is trained on a dataset of English-French sentence pairs. The dataset is preprocessed to:

  • Filter sentences that exceed the maximum sequence length.
  • Ensure all characters in the sentences are within the defined vocabularies.
  • Add special tokens (START, END, PADDING, UNKNOWN).
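The three preprocessing steps above might look roughly like this (the special-token names and helper function are hypothetical, not the repository's actual code):

```python
# Hypothetical preprocessing sketch; token names and vocabularies are illustrative.
MAX_LEN = 100

def build_pairs(en_lines, fr_lines, en_vocab, fr_vocab):
    pairs = []
    for en, fr in zip(en_lines, fr_lines):
        # 1. Drop pairs that exceed the maximum sequence length.
        if len(en) > MAX_LEN or len(fr) > MAX_LEN:
            continue
        # 2. Keep only sentences whose characters are all in-vocabulary.
        if not (set(en) <= en_vocab and set(fr) <= fr_vocab):
            continue
        # 3. Wrap the target in start/end tokens; padding is added at batch time.
        pairs.append((list(en), ["<SOS>"] + list(fr) + ["<EOS>"]))
    return pairs

pairs = build_pairs(["hi there", "x" * 200], ["salut", "y"],
                    en_vocab=set("hi there"), fr_vocab=set("saluty"))
print(len(pairs))  # 1 — the over-length pair is filtered out
```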

Training Details

  • Hardware: Trained on MPS (Metal Performance Shaders) for Apple Silicon
  • Training time: 100 minutes for 2 epochs
  • Batch size: 32
  • Optimizer: AdamW with learning rate 1e-4
  • Loss function: Cross-entropy loss (ignoring padding tokens)
  • Gradient clipping: Applied at 1.0
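The actual training loop uses PyTorch's built-ins, but the two loss-related details above, ignoring padding in the cross-entropy and clipping the gradient norm at 1.0, can be illustrated in NumPy (PAD_ID and the function names are hypothetical):

```python
import numpy as np

PAD_ID = 0  # hypothetical padding-token id

def cross_entropy_ignore_pad(logits, targets):
    """Mean cross-entropy over non-padding positions.
    logits: (n_tokens, vocab_size), targets: (n_tokens,) of token ids."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    mask = targets != PAD_ID
    return -log_probs[np.arange(len(targets)), targets][mask].mean()

def clip_grad_norm(grads, max_norm=1.0):
    """Scale gradients so their global L2 norm is at most max_norm."""
    total = np.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]

logits = np.zeros((4, 5))              # uniform logits over a 5-token vocab
targets = np.array([1, 2, PAD_ID, 3])  # the padded position is ignored
loss = cross_entropy_ignore_pad(logits, targets)
print(round(loss, 4))  # ln(5) ≈ 1.6094
```

Ignoring padding keeps the loss from rewarding the model for predicting filler tokens, and clipping keeps a single bad batch from destabilizing the AdamW updates.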

Training Progress

Epoch 1

  • Initial loss: ~1.8342
  • Final train loss: 1.3595
  • Validation loss: 0.9313

Epoch 2

  • Initial loss: ~0.8570
  • Final train loss: 0.8604
  • Validation loss: 0.6674

Sample Translations

| English Input | French Translation | Literal Translation |
| --- | --- | --- |
| What should we do when the day starts? | <SOS> Ce que devrais-nous faire le dire ? | What should we do to say it? |
| What do you think about the book? | <SOS> Ce que vous pensez-vous ? | What you think-you? |
| Where are you going? | <SOS> Où vous avez en train de matin ? | Where are you in the process of morning? |
| I want a new book. | <SOS> Je veux un livre un livre. | I want a book a book. |

Notes

  • The current model is trained at the character level rather than using subword tokens (like BPE).
  • While the model shows promising results for simple sentences, it still has limitations for more complex translations.
  • The sample translations indicate that the model has begun to learn patterns but still produces some inaccuracies.

Thank you for exploring this Transformer implementation! I hope it is helpful for understanding this powerful architecture. Contributions and feedback are welcome!

About

A PyTorch implementation from scratch of the Transformer model for English to French neural machine translation, based on "Attention Is All You Need."
