
[Non-record] MLA + SmearGate + BigramHash + SWA — pre-quant 1.2838 bpb#354

Open
Skrisps26 wants to merge 4 commits into openai:main from Skrisps26:main

Conversation

@Skrisps26

MLA + SmearGate + BigramHash + SWA

Summary

Non-record submission demonstrating a stacked architecture combining:

  • Multi-Head Latent Attention (MLA) with kv_rank=128
  • SmearGate MLP (relu^2 gated, mlp_mult=3)
  • BigramHash embeddings (10240 buckets, dim=128)
  • Stochastic Weight Averaging (start_frac=0.4, every=50 steps)
  • Muon optimizer (momentum=0.99, WD=0.04)
  • Mixed int5/int6 quantization + zstd-22
  • Sliding-window evaluation (stride=64)
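The BigramHash component listed above can be sketched roughly as follows. This is a minimal illustration, not the submission's code: the bucket count matches the config (10240), but the mixing constant and the BOS handling at position 0 are assumptions.

```python
NUM_BUCKETS = 10240  # matches bigram_buckets in this submission

def bigram_bucket(prev_tok: int, cur_tok: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a (previous, current) token pair to a hash bucket.

    The multiplicative mixing constant is illustrative; any good
    integer mixer works here.
    """
    h = (prev_tok * 1000003) ^ cur_tok
    return h % num_buckets

def bigram_buckets_for(tokens):
    """Bucket id per position; position 0 is paired with an assumed BOS id of 0."""
    prev = 0
    out = []
    for t in tokens:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```

Each position's bucket id would then index a learned embedding table of shape (10240, 128) whose output is added to the regular token embedding.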

Results

| Metric | Value |
| --- | --- |
| Pre-quantization val_bpb | 1.2838 |
| Roundtrip val_bpb | 1.3559 |
| Model size | 14.449 MB |
| Training steps | 7001 / 20000 |
| Tokens seen | ~3.7B |
| Step time | ~83 ms |

Key Finding

MLA attention, while parameter-efficient, adds significant compute overhead
per step (~83ms vs ~43ms for the baseline). In a fixed 10-minute window on
8xH100s this reduces token throughput from ~7.2B (baseline) to ~3.7B —
roughly half the training data. The pre-quantization bpb of 1.2838 suggests
the architecture itself is competitive; the bottleneck is throughput, not
capacity.

Replacing MLA with standard GQA would recover the full step budget (~11,500
steps at ~52ms/step) and likely push final bpb below 1.15.
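The throughput argument above checks out as back-of-envelope arithmetic. The sketch below assumes a fixed 600 s (10-minute) window and derives tokens/step from the reported ~3.7B tokens over ~7000 steps; both are estimates, not measured constants.

```python
WINDOW_S = 600.0                     # assumed 10-minute training window
TOKENS_PER_STEP = 3.7e9 / 7000       # ~529k tokens/step, estimated from this run

def tokens_in_window(step_time_s: float) -> float:
    """Total tokens processed in the fixed window at a given step time."""
    steps = WINDOW_S / step_time_s
    return steps * TOKENS_PER_STEP

mla_tokens = tokens_in_window(0.083)  # MLA at ~83 ms/step -> ~3.8B tokens
gqa_tokens = tokens_in_window(0.043)  # baseline at ~43 ms/step -> ~7.4B tokens
```

The ~2x ratio (83 ms / 43 ms) matches the reported drop from ~7.2B to ~3.7B tokens, supporting the claim that throughput, not capacity, is the bottleneck.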

Architecture

  • vocab_size=1024, num_layers=13, model_dim=512
  • num_heads=8, num_kv_heads=4, kv_rank=128
  • mlp_mult=3, bigram_buckets=10240, bigram_dim=128
  • SWA (stochastic weight averaging): start_frac=0.4, every=50
  • Quantization: int5 MLP, int6 attention, fp16 embeddings
  • Compression: zstd-22
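The mixed int5/int6 scheme in the config can be illustrated with a symmetric per-tensor quantizer. This is a hypothetical sketch of the general technique; the actual packing and serialization in train_gpt.py may differ.

```python
def quantize(values, bits):
    """Symmetric quantization: map floats to integers in [-qmax, qmax].

    bits=5 gives qmax=15 (int5); bits=6 gives qmax=31 (int6).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero input
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Invert quantize(); roundtrip error is bounded by ~scale/2 per value."""
    return [x * scale for x in q]
```

The roundtrip error this introduces is what separates the pre-quantization 1.2838 bpb from the roundtrip 1.3559 bpb; the quantized integers then compress well under zstd at level 22.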

Run Command

torchrun --standalone --nproc_per_node=8 train_gpt.py

Files

  • train_gpt.py — training script
  • train.log — full training log
  • submission.json — metadata

Copilot AI review requested due to automatic review settings March 21, 2026 16:43
Contributor

Copilot AI left a comment


Pull request overview

Adds a new non-record submission to the 16MB track, capturing an experiment that stacks MLA attention, SmearGate MLP, BigramHash embeddings, and SWA, and includes the training/eval code snapshot plus reported metrics and artifacts.

Changes:

  • Adds a self-contained train_gpt.py implementing MLA/SmearGate/BigramHash, SWA, Muon optimizer, and mixed int5/int6(+fp16) quantized serialization.
  • Adds record metadata (submission.json) and a writeup (README.md) describing results and reproduction.
  • Adds a training log artifact (currently UUID-named).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/train_gpt.py | New training script implementing the stacked architecture and quantized artifact roundtrip. |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/submission.json | Submission metrics/size metadata for the run. |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/README.md | Human-readable summary of configuration, results, and run command. |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/0a10b225-50af-46ef-8fb9-5183fe30fb70.txt | Captured training output/log for the run (currently not named train.log). |


Skrisps26 and others added 3 commits March 21, 2026 22:24
…ash/train_gpt.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ash/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ash/submission.json

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>