Story Maker SLM (Small Language Model)

A transformer-based Small Language Model trained from scratch to generate creative children's stories. This project implements a GPT-style architecture using PyTorch and trains it on the TinyStories dataset.

🎯 Project Overview

This project demonstrates building a complete language model pipeline from scratch, including:

  • Custom transformer architecture implementation
  • Data preprocessing and tokenization
  • Training loop with validation
  • Text generation capabilities

The model is trained on the TinyStories dataset, which contains simple children's stories, making it perfect for training a small language model.

πŸ—οΈ Architecture

The model implements a GPT-style decoder-only transformer with the following components:

Core Components

  • Embedding Layer: Converts token IDs to dense vector representations
  • Positional Encoding: Adds positional information using sinusoidal functions
  • Multi-Head Attention: Implements self-attention mechanism with multiple heads
  • Masked Multi-Head Attention: Prevents the model from looking ahead during training
  • Feed-Forward Network: Two-layer MLP with GELU activation
  • Layer Normalization: Stabilizes training
  • Residual Connections: Enables deeper networks
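To show how these components fit together, here is a minimal sketch of one decoder block (pre-norm layout with residual connections around attention and the MLP). It uses PyTorch's built-in nn.MultiheadAttention purely to keep the sketch short; the repository implements attention from scratch, and a masked-attention sketch appears under Technical Details below. The class name Block is illustrative, not necessarily the name used in SLM.ipynb.

import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder block: LayerNorm -> masked self-attention -> residual,
    then LayerNorm -> feed-forward MLP with GELU -> residual."""
    def __init__(self, n_embd=384, n_head=6, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x, causal_mask):
        # Residual connection around masked multi-head self-attention
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        # Residual connection around the feed-forward network
        return x + self.mlp(self.ln2(x))

# Example: a boolean mask that is True above the diagonal blocks attention to future tokens
block = Block()
mask = torch.triu(torch.ones(128, 128, dtype=torch.bool), diagonal=1)
y = block(torch.randn(1, 128, 384), mask)    # output shape: (1, 128, 384)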

Model Configuration

- Embedding Dimension: 384
- Number of Layers: 6
- Number of Attention Heads: 6
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
- Context Length: 128 tokens
- Dropout: 0.1

📁 Project Structure

transformer/
├── SLM.ipynb                      # Main training notebook
├── Untitled.ipynb                 # Experimental notebook
├── best_model_params.pt           # Trained model weights (162 MB)
├── best_model_params (2).pt       # Alternative model checkpoint (120 MB)
├── train.bin                      # Training data (943 MB)
└── validation.bin                 # Validation data (9.5 MB)

🚀 Getting Started

Prerequisites

pip install torch numpy tiktoken datasets nltk tqdm matplotlib

Dataset Preparation

The model uses the TinyStories dataset with GPT-2 tokenization:

from datasets import load_dataset
import tiktoken

# Load dataset
ds = load_dataset("roneneldan/TinyStories")

# Initialize tokenizer
enc = tiktoken.get_encoding("gpt2")
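The stories are then tokenized and written out as the flat binary files (train.bin, validation.bin) used during training. The exact preprocessing lives in SLM.ipynb; the snippet below is only a sketch of the general idea, assuming the dataset's "text" column and the GPT-2 end-of-text token as a story separator, and collecting all IDs in memory for brevity.

import numpy as np

def write_split(split, filename):
    ids = []
    for example in ds[split]:
        # Encode each story and append the end-of-text token as a separator
        ids.extend(enc.encode_ordinary(example["text"]))
        ids.append(enc.eot_token)
    # GPT-2 token IDs (vocab size 50,257) fit into uint16
    np.array(ids, dtype=np.uint16).tofile(filename)

write_split("train", "train.bin")
write_split("validation", "validation.bin")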

Training

The training process includes:

  • Optimizer: Adam with learning rate 1e-4
  • Loss Function: Cross-Entropy Loss
  • Learning Rate Scheduler: ReduceLROnPlateau
  • Training Steps: 19,500 iterations
  • Batch Size: 32
  • Validation: Periodic validation with best model checkpointing
# Initialize model
config = GPTConfig(
    vocab_size=50257,
    block_size=128,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
    bias=True
)
model = GPT(config)

# Train the model
# See SLM.ipynb for complete training loop
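A condensed version of such a loop, continuing from the config and model above, might look like the sketch below. It assumes the model's forward pass returns (logits, loss) and that get_batch samples random input/target windows from the .bin files (a sketch of get_batch appears under Technical Details further down). The optimizer line combines the learning rate listed above with the AdamW weight decay noted under Technical Details; the clipping value, scheduler settings, and evaluation interval are illustrative.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)
best_val_loss = float("inf")

for step in range(19_500):
    xb, yb = get_batch("train", batch_size=32, device=device)
    logits, loss = model(xb, yb)                  # assumed: forward pass returns (logits, loss)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping for stability
    optimizer.step()

    # Periodic validation with best-model checkpointing
    if step % 500 == 0:
        model.eval()
        with torch.no_grad():
            vx, vy = get_batch("validation", batch_size=32, device=device)
            _, val_loss = model(vx, vy)
        model.train()
        scheduler.step(val_loss)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model_params.pt")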

🎨 Text Generation

Basic Generation

import torch

# Assumes the GPT class, config, and the GPT-2 tiktoken encoding `enc`
# are already defined, as in SLM.ipynb.
device = "cuda" if torch.cuda.is_available() else "cpu"

def generatetext(prompt, max_new_tokens=50, temperature=1.0, top_k=None):
    # Load the trained weights into a fresh model instance
    model = GPT(config)
    state_dict = torch.load("best_model_params (2).pt", map_location=device)
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()

    # Encode the prompt into a (1, T) tensor of token IDs
    input_ids = torch.tensor([enc.encode(prompt)], device=device)

    with torch.inference_mode():
        output_ids = model.generate(
            idx=input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )

    return enc.decode(output_ids[0].tolist())
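Under the hood, model.generate repeatedly appends one sampled token at a time. The actual method is defined in SLM.ipynb; the sketch below shows one such sampling step and how temperature and top_k are typically applied, again assuming the forward pass returns (logits, loss).

import torch
import torch.nn.functional as F

@torch.inference_mode()
def sample_next_token(model, idx, temperature=1.0, top_k=None):
    idx_cond = idx[:, -config.block_size:]          # keep at most block_size tokens of context
    logits, _ = model(idx_cond)
    logits = logits[:, -1, :] / temperature         # temperature scaling of the last position
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float("-inf") # keep only the top-k candidates
    probs = F.softmax(logits, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)
    return torch.cat((idx, next_id), dim=1)         # append the sampled token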

Example Output

prompt = "Once upon a time"
generated_text = generatetext(prompt, max_new_tokens=200, temperature=0.8, top_k=50)
print(generated_text)

Sample Output:

Once upon a time. One day, there was a little girl named Timmy. One day, a little
girl said, but they got a car to play with his mommy.

When they went back and said, "Lily and her mom. "I need to play with me me?"
His mommy and said, "I can stay. He wanted to do you? I can come and said, "You
can do you too."

📊 Training Results

The model was trained with the following approach:

  • Data Processing: Binary tokenized format using memory-mapped files for efficiency
  • Validation Strategy: Regular validation checks with best model saving
  • Loss Tracking: Both training and validation loss monitored

🔧 Key Features

  1. Custom Transformer Implementation: Built from scratch without using high-level transformer libraries
  2. Efficient Data Loading: Memory-mapped binary files for handling large datasets
  3. Flexible Generation: Supports temperature and top-k sampling for diverse outputs
  4. Model Checkpointing: Automatic saving of best model based on validation loss
  5. Visualization: Training/validation loss plotting for monitoring
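For the visualization point, a minimal plotting sketch, assuming training and validation losses were collected into plain Python lists during training:

import matplotlib.pyplot as plt

plt.plot(train_losses, label="train loss")        # train_losses: list recorded during training
plt.plot(val_losses, label="validation loss")     # val_losses: list recorded at evaluation steps
plt.xlabel("evaluation step")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.show()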

📝 Technical Details

Data Preprocessing

  • Text is tokenized using the GPT-2 BPE tokenizer
  • Sequences are stored in binary format (.bin files) for fast loading
  • Memory-mapped arrays prevent RAM overflow
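A minimal sketch of batch loading from these files with np.memmap (the function name get_batch and its details are assumptions; the actual loader is in SLM.ipynb):

import numpy as np
import torch

def get_batch(split, batch_size=32, block_size=128, device="cpu"):
    # Memory-map the tokenized file so only the sampled windows are read into RAM
    data = np.memmap(f"{split}.bin", dtype=np.uint16, mode="r")
    # Pick random starting offsets, then slice out (input, shifted-target) windows
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)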

Attention Mechanism

  • Implements causal (masked) self-attention
  • Prevents information leakage from future tokens
  • Multi-head attention with 6 heads for diverse representations
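To make the masking concrete, here is a from-scratch sketch of causal multi-head self-attention with an explicit lower-triangular mask; the names and defaults mirror the configuration above but are not copied from SLM.ipynb.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=384, n_head=6, block_size=128, dropout=0.1):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)     # joint projection for queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask: position i may only attend to positions <= i
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) so each head attends independently
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))        # scaled dot-product scores
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))   # block future tokens
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)     # merge the heads
        return self.dropout(self.proj(out))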

Optimization

  • AdamW optimizer with weight decay (0.1)
  • Learning rate scheduling based on validation loss
  • Gradient clipping for training stability

🎓 Learning Outcomes

This project demonstrates:

  • Understanding of transformer architecture
  • Implementation of attention mechanisms
  • Training large language models
  • Text generation techniques
  • PyTorch best practices

📚 References

  • Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759
  • Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017
  • Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2)

🤝 Contributing

Feel free to fork this repository and experiment with:

  • Different model architectures
  • Hyperparameter tuning
  • Alternative datasets
  • Enhanced generation strategies

📄 License

This project is open source and available for educational purposes.

🙏 Acknowledgments

  • TinyStories dataset creators
  • OpenAI for the GPT-2 tokenizer
  • PyTorch team for the excellent framework

Note: This is an educational project demonstrating transformer implementation from scratch. For production use cases, consider using established libraries like Hugging Face Transformers.
