Description
This PR upgrades the baseline train_gpt.py with several state-of-the-art techniques used by top leaderboard entries, targeting ~1.14 bits per byte (BPB).
Architectural Improvements
BigramHash Embedding: Adds token-pair hashing for cheap local context.
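A minimal sketch of the idea in NumPy; the hash constant, table size, and function names here are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def bigram_hash_embed(tokens, tok_emb, bigram_table, mix=1_000_003):
    """Add a hashed (prev_token, token) embedding to each token embedding.

    tokens:       (T,) int token ids
    tok_emb:      (vocab, D) ordinary token embedding table
    bigram_table: (B, D) hashed bigram embedding table
    The mixing constant and table size are illustrative choices.
    """
    B = bigram_table.shape[0]
    x = np.asarray(tok_emb)[tokens]               # (T, D) unigram embeddings
    prev = np.concatenate([[0], tokens[:-1]])     # previous token; 0 at position 0
    idx = (prev * mix + tokens) % B               # cheap pair hash -> bucket id
    return x + bigram_table[idx]                  # blend in local bigram context

# Tiny demo with random tables
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(10, 4))
bigram_table = rng.normal(size=(32, 4))
out = bigram_hash_embed(np.array([1, 2, 3]), tok_emb, bigram_table)
```

Because the table is indexed by a hash rather than a full vocab×vocab matrix, collisions are possible but the parameter cost stays small.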
SmearGate: Implements learned gating to blend information between adjacent tokens.
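One plausible parameterization, sketched in NumPy; the single-vector gate projection is an assumption for illustration (the PR may gate per-channel):

```python
import numpy as np

def smear_gate(x, w):
    """Blend each token's activation with its left neighbor via a learned gate.

    x: (T, D) token activations; w: (D,) gate projection (learned in training).
    """
    g = 1.0 / (1.0 + np.exp(-(x @ w)))                # (T,) per-token gate in (0, 1)
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])  # shift right by one position
    return x + g[:, None] * prev                      # smear neighbor info forward

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w = rng.normal(size=8)
y = smear_gate(x, w)
```

The first position has no left neighbor, so it passes through unchanged.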
Improved Initialization: Linear layers now use orthogonal initialization.
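In PyTorch this is `torch.nn.init.orthogonal_`; the equivalent construction via QR decomposition, shown here in NumPy as a sketch:

```python
import numpy as np

def orthogonal_init(rows, cols, rng):
    """Orthogonal initialization for a (rows x cols) linear layer weight.

    QR-decompose a random Gaussian matrix and sign-correct Q so the result
    is drawn uniformly from the orthogonal group (Saxe et al. style init).
    """
    a = rng.normal(size=(rows, cols))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))   # fix signs so the decomposition is unique
    return q

rng = np.random.default_rng(0)
W = orthogonal_init(6, 4, rng)
```

Orthogonal columns preserve activation norms at initialization, which tends to stabilize early training.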
Training & Optimizer Enhancements
Quantization-Aware Training (QAT): Uses Straight-Through Estimators (STE) to simulate Int8 rounding during training.
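A sketch of the forward-pass fake quantizer in NumPy; the absmax/127 scale is one common convention and an assumption here. In PyTorch the STE is typically written as `x + (quant(x) - x).detach()`, so gradients flow through the rounding as if it were the identity:

```python
import numpy as np

def fake_quant_int8(x):
    """Simulate symmetric per-tensor int8 rounding in the forward pass."""
    scale = np.abs(x).max() / 127.0 + 1e-12       # map range onto [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127)   # integer grid (non-differentiable)
    return q * scale                              # dequantize back to float

x = np.linspace(-1.0, 1.0, 9)
xq = fake_quant_int8(x)
```

Training against the rounded values lets the network adapt to the int8 grid it will be stored in.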
Stochastic Weight Averaging (SWA): Averages weights during the warmdown phase for better generalization.
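The averaging itself is just a running mean over checkpoints; a minimal sketch with a dict of arrays standing in for a real model state dict:

```python
import numpy as np

class SWA:
    """Running equal-weight average of parameters seen during warmdown.

    Call update(params) once per averaging step; `avg` holds the mean.
    """
    def __init__(self):
        self.avg, self.n = {}, 0

    def update(self, params):
        self.n += 1
        for k, v in params.items():
            if k not in self.avg:
                self.avg[k] = np.asarray(v, dtype=np.float64).copy()
            else:
                # incremental mean: avg += (v - avg) / n
                self.avg[k] += (v - self.avg[k]) / self.n

swa = SWA()
swa.update({"w": np.array([1.0, 3.0])})
swa.update({"w": np.array([3.0, 5.0])})
```

The incremental-mean form avoids keeping every checkpoint in memory.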
Muon Upgrade: Adds weight decay support to the Muon optimizer.
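The natural way to add weight decay to Muon is the decoupled (AdamW-style) form, applied directly to the parameter rather than folded into the gradient. A sketch, with Muon's Newton–Schulz orthogonalization omitted and `update` standing in for the orthogonalized momentum direction:

```python
import numpy as np

def muon_step_with_wd(param, update, lr, wd):
    """One Muon-style step with decoupled weight decay (illustrative)."""
    param *= (1.0 - lr * wd)   # decoupled decay: shrink weights toward zero
    param -= lr * update       # then apply the (orthogonalized) update
    return param

p = np.array([1.0, -2.0])
p = muon_step_with_wd(p, update=np.array([0.1, 0.1]), lr=0.1, wd=0.5)
```

Keeping the decay outside the update means its strength is independent of the update's normalization.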
Compression & Evaluation
Magnitude Pruning: Zeroes out the 3% of weights with smallest magnitude after training, so the artifact compresses better.
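A sketch of global magnitude pruning in NumPy; the global (rather than per-layer) threshold is an assumption:

```python
import numpy as np

def magnitude_prune(w, frac=0.03):
    """Zero the smallest-magnitude `frac` of entries in w (in place)."""
    k = int(w.size * frac)
    if k == 0:
        return w
    # k-th smallest absolute value serves as the pruning threshold
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    w[np.abs(w) <= thresh] = 0.0
    return w

rng = np.random.default_rng(0)
w = magnitude_prune(rng.normal(size=1000), frac=0.03)
```

Runs of zeros are highly compressible, which is the point of doing this before the final Zstd pass.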
Zstandard (Zstd-22): Replaces zlib with maximum Zstd compression for the 16MB artifact.
Sliding Window Evaluation: Implements strided evaluation (stride = 64) so that each scored token is evaluated with near-full context.
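The windowing logic amounts to computing which span of tokens each window scores; a sketch (the window size of 1024 is an illustrative assumption, only the stride of 64 comes from the PR):

```python
def strided_eval_spans(n_tokens, window=1024, stride=64):
    """Yield (begin, end, score_from) spans for sliding-window evaluation.

    Each window covers tokens [begin, end); only tokens in [score_from, end)
    are scored, so every scored token after the first window sees at least
    window - stride tokens of left context.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = strided_eval_spans(10, window=8, stride=4)
```

Each token is scored exactly once, and shrinking the stride trades extra forward passes for more context per scored token.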
Verification
Verified syntax correctness with py_compile.
Confirmed environment setup using uv.