A modular pipeline for training language models with custom tokenizers.
- main.py: Main training script
- config.py: Configuration and hyperparameters
- models/: Model architectures
- data/: Data loading and preprocessing
- training/: Training loops and utilities
- tokenization/: Tokenizer training pipeline
- inference/: Scripts for testing trained models and tokenizers
This project implements a custom GRU-based Language Model with the following architecture:
The model uses Gated Recurrent Units (GRUs) with layer normalization for stable training. It processes text sequences token-by-token, maintaining a hidden state that captures contextual information.
Note: The model_arch.jpg file, showing the GRU architecture flow, should be placed in the project root.
- Converts input token IDs to dense vector representations
- Dimension: Configurable (default: 512)
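A minimal sketch of the embedding step, assuming a PyTorch implementation; the sizes below are placeholders for the values actually set in config.py:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 32_000, 512          # placeholder sizes (see config.py)

embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[5, 42, 7]])       # (batch, seq_len) integer token IDs
dense = embedding(token_ids)                 # (batch, seq_len, embed_dim) dense vectors
```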
The core recurrent component with three gates:
- Reset Gate (W_r): Controls what information from the previous hidden state to forget
- Update Gate (W_z): Determines how much of the new candidate to incorporate
- Candidate Gate (W_n): Computes the new memory candidate using reset-filtered previous state
- Applied to concatenated inputs and hidden states for training stability
- Helps prevent gradient vanishing/exploding issues
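The gating and normalization described above can be sketched as a single recurrent step. This is an illustrative PyTorch re-implementation, not the code in models/; the exact placement of LayerNorm and the dimension names are assumptions:

```python
import torch
import torch.nn as nn

class LayerNormGRUCell(nn.Module):
    """One GRU step with LayerNorm applied to the concatenated [input, hidden] vector."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(input_dim + hidden_dim)
        self.W_r = nn.Linear(input_dim + hidden_dim, hidden_dim)  # reset gate
        self.W_z = nn.Linear(input_dim + hidden_dim, hidden_dim)  # update gate
        self.W_n = nn.Linear(input_dim + hidden_dim, hidden_dim)  # candidate

    def forward(self, x, h):
        xh = self.norm(torch.cat([x, h], dim=-1))      # normalized [x_t, h_{t-1}]
        r = torch.sigmoid(self.W_r(xh))                # what to forget from h_{t-1}
        z = torch.sigmoid(self.W_z(xh))                # how much of the candidate to use
        xh_reset = torch.cat([x, r * h], dim=-1)       # reset-filtered previous state
        n = torch.tanh(self.W_n(xh_reset))             # new memory candidate
        return z * n + (1 - z) * h                     # gated interpolation of candidate and old state
```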
- Linear layer mapping hidden states to vocabulary logits
- Dimension: Vocabulary size × Hidden dimension
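A sketch of the projection to logits, using the same placeholder sizes as above:

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size = 512, 32_000               # placeholder sizes (see config.py)
output_proj = nn.Linear(hidden_dim, vocab_size)    # weight shape: (vocab_size, hidden_dim)

hidden_state = torch.randn(1, hidden_dim)          # one hidden state from the GRU
logits = output_proj(hidden_state)                 # (1, vocab_size) next-token logits
```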
- Token Processing: Each input token is embedded and processed sequentially
- Gate Computation: Reset and update gates control information flow
- State Update: Hidden state updated via gated interpolation between old and new candidates
- Prediction: Output logits generated for next token prediction
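Putting these steps together, one pass over a sequence might look like the sketch below. It uses PyTorch's built-in nn.GRUCell as a stand-in for the project's custom layer-normalized cell, and all sizes are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 32_000, 512, 512   # placeholder sizes (see config.py)

embedding = nn.Embedding(vocab_size, embed_dim)
cell = nn.GRUCell(embed_dim, hidden_dim)    # stand-in for the custom layer-normalized GRU cell
output_proj = nn.Linear(hidden_dim, vocab_size)

token_ids = torch.randint(0, vocab_size, (1, 16))      # (batch=1, seq_len=16) dummy input
h = torch.zeros(1, hidden_dim)                          # initial hidden state

logits_per_step = []
for t in range(token_ids.size(1)):
    x_t = embedding(token_ids[:, t])         # 1. embed the current token
    h = cell(x_t, h)                         # 2./3. gate computation and state update
    logits_per_step.append(output_proj(h))   # 4. next-token logits at this position

logits = torch.stack(logits_per_step, dim=1)            # (batch, seq_len, vocab_size)
```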
- Layer Normalization: Stabilizes training and improves convergence
- Custom GRU Implementation: Full control over gating mechanisms
- Flexible Dimensions: Configurable embedding and hidden sizes
- Text Generation: Built-in generation with temperature, top-k, and nucleus sampling
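The sampling strategies mentioned above (temperature, top-k, nucleus/top-p) can be illustrated on a single logits vector. This is a standalone sketch, independent of the project's actual generation code in inference/:

```python
import torch

def sample_next_token(logits: torch.Tensor,
                      temperature: float = 1.0,
                      top_k: int = 0,
                      top_p: float = 1.0) -> int:
    """Sample one token ID from a (vocab_size,) logits vector."""
    logits = logits / max(temperature, 1e-8)              # temperature scaling

    if top_k > 0:                                          # top-k: keep only the k largest logits
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")

    if top_p < 1.0:                                        # nucleus: keep the smallest set of tokens
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cumulative > top_p                        # tokens past the top_p probability mass
        remove[1:] = remove[:-1].clone()                   # shift so the crossing token is kept
        remove[0] = False                                  # always keep the most likely token
        logits[sorted_idx[remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

For example, `sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95)` sharpens the distribution, restricts it to the 50 most likely tokens, and then to the smallest subset covering 95% of their probability mass before drawing a sample.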
- Loss: Cross-entropy loss on next-token prediction
- Optimization: Adam optimizer with configurable learning rate
- Regularization: Layer normalization (no explicit dropout)
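A minimal sketch of this objective and optimizer setup, assuming PyTorch and a model that returns (batch, seq_len, vocab_size) logits; `model` and `learning_rate` are placeholders for what config.py and training/ actually define:

```python
import torch
import torch.nn as nn

def training_step(model, optimizer, token_ids):
    """One next-token prediction step on a (batch, seq_len) tensor of token IDs."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # shift targets by one position
    logits = model(inputs)                                   # (batch, seq_len - 1, vocab_size)

    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                 # flatten to (N, vocab_size)
        targets.reshape(-1),                                  # flatten to (N,)
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # lr comes from config.py
```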
- Train a tokenizer:
  python -m tokenization.main_tokenizer --tokenizer-type byte_level_bpe  # add required arguments if any
- Process the dataset:
  python -m data.process_dataset  # add required arguments if any
- Train the model:
  python -m main  # add required arguments if any

Available tokenizer types:
- byte_level_bpe: Byte-level BPE (default)
- char_bpe: Character-level BPE
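For readers who want to see what byte-level BPE training looks like in isolation, here is a standalone sketch using the Hugging Face tokenizers library; the project's own pipeline in tokenization/ may differ, and the corpus path, output directory, and special tokens are placeholders:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Placeholder corpus and settings; the tokenization/ module defines its own CLI.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
)

os.makedirs("tokenizer_output", exist_ok=True)
tokenizer.save_model("tokenizer_output")     # writes vocab.json and merges.txt

print(tokenizer.encode("Hello, world!").ids)
```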
See inference/README.md for testing trained models and tokenizers.
