Skip to content

Feature/sota optimizations#358

Open
adityagupta26 wants to merge 5 commits intoopenai:mainfrom
adityagupta26:feature/sota-optimizations
Open

Feature/sota optimizations#358
adityagupta26 wants to merge 5 commits intoopenai:mainfrom
adityagupta26:feature/sota-optimizations

Conversation

@adityagupta26
Copy link

Description

This PR upgrades the baseline train_gpt.py with several state-of-the-art techniques used by top leaderboard entries to achieve ~1.14 BPB.

Architectural Improvements
BigramHash Embedding: Adds token-pair hashing for cheap local context.
SmearGate: Implements learned gating to blend information between adjacent tokens.
Improved Initialization: Linear layers now use orthogonal initialization.
Training & Optimizer Enhancements
Quantization-Aware Training (QAT): Uses Straight-Through Estimators (STE) to simulate Int8 rounding during training.
Stochastic Weight Averaging (SWA): Averages weights during the warmdown phase for better generalization.
Muon Upgrade: Adds weight decay support to the Muon optimizer.
Compression & Evaluation
Magnitude Pruning: Zeroes out the smallest 3% of weights post-training to maximize compression.
Zstandard (Zstd-22): Replaces zlib with maximum Zstd compression for the 16MB artifact.
Sliding Window Evaluation: Implements strided evaluation (stride = 64) to provide tokens with near-full context.
Verification
Verified syntax correctness with py_compile.
Confirmed environment setup using uv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants