
[Non-record] MLA + SmearGate + BigramHash + SWA — pre-quant 1.2838 bpb#354

Open
Skrisps26 wants to merge 4 commits into openai:main from Skrisps26:main

Conversation

@Skrisps26

MLA + SmearGate + BigramHash + SWA

Summary

Non-record submission demonstrating a stacked architecture combining:

  • Multi-Head Latent Attention (MLA) with kv_rank=128
  • SmearGate MLP (relu^2 gated, mlp_mult=3)
  • BigramHash embeddings (10240 buckets, dim=128)
  • Stochastic Weight Averaging (start_frac=0.4, every=50 steps)
  • Muon optimizer (momentum=0.99, WD=0.04)
  • Mixed int5/int6 quantization + zstd-22
  • Sliding-window evaluation (stride=64)
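The BigramHash component listed above can be sketched roughly as follows. This is a minimal illustration, not the submission's code: the bucket count matches the config (10240), but the mixing constant and the BOS handling at position 0 are assumptions.

```python
NUM_BUCKETS = 10240  # matches bigram_buckets in this submission

def bigram_bucket(prev_tok: int, cur_tok: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a (previous, current) token pair to a hash bucket.

    The multiplicative mixing constant is illustrative; any good
    integer mixer works here.
    """
    h = (prev_tok * 1000003) ^ cur_tok
    return h % num_buckets

def bigram_buckets_for(tokens):
    """Bucket id per position; position 0 is paired with an assumed BOS id of 0."""
    prev = 0
    out = []
    for t in tokens:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```

Each position's bucket id would then index a learned embedding table of shape (10240, 128) whose output is added to the regular token embedding.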

Results

| Metric | Value |
| --- | --- |
| Pre-quantization val_bpb | 1.2838 |
| Roundtrip val_bpb | 1.3559 |
| Model size | 14.449 MB |
| Training steps | 7001 / 20000 |
| Tokens seen | ~3.7B |
| Step time | ~83 ms |

Key Finding

MLA attention, while parameter-efficient, adds significant compute overhead
per step (~83ms vs ~43ms for the baseline). In a fixed 10-minute window on
8xH100s this reduces token throughput from ~7.2B (baseline) to ~3.7B —
roughly half the training data. The pre-quantization bpb of 1.2838 suggests
the architecture itself is competitive; the bottleneck is throughput, not
capacity.

Replacing MLA with standard GQA would recover the full step budget (~11,500
steps at ~52ms/step) and likely push final bpb below 1.15.
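The throughput argument above checks out as back-of-envelope arithmetic. The sketch below assumes a fixed 600 s (10-minute) window and derives tokens/step from the reported ~3.7B tokens over ~7000 steps; both are estimates, not measured constants.

```python
WINDOW_S = 600.0                     # assumed 10-minute training window
TOKENS_PER_STEP = 3.7e9 / 7000       # ~529k tokens/step, estimated from this run

def tokens_in_window(step_time_s: float) -> float:
    """Total tokens processed in the fixed window at a given step time."""
    steps = WINDOW_S / step_time_s
    return steps * TOKENS_PER_STEP

mla_tokens = tokens_in_window(0.083)  # MLA at ~83 ms/step -> ~3.8B tokens
gqa_tokens = tokens_in_window(0.043)  # baseline at ~43 ms/step -> ~7.4B tokens
```

The ~2x ratio (83 ms / 43 ms) matches the reported drop from ~7.2B to ~3.7B tokens, supporting the claim that throughput, not capacity, is the bottleneck.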

Architecture

  • vocab_size=1024, num_layers=13, model_dim=512
  • num_heads=8, num_kv_heads=4, kv_rank=128
  • mlp_mult=3, bigram_buckets=10240, bigram_dim=128
  • SWA (stochastic weight averaging): start_frac=0.4, every=50
  • Quantization: int5 MLP, int6 attention, fp16 embeddings
  • Compression: zstd-22
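The mixed int5/int6 scheme in the config can be illustrated with a symmetric per-tensor quantizer. This is a hypothetical sketch of the general technique; the actual packing and serialization in train_gpt.py may differ.

```python
def quantize(values, bits):
    """Symmetric quantization: map floats to integers in [-qmax, qmax].

    bits=5 gives qmax=15 (int5); bits=6 gives qmax=31 (int6).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero input
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Invert quantize(); roundtrip error is bounded by ~scale/2 per value."""
    return [x * scale for x in q]
```

The roundtrip error this introduces is what separates the pre-quantization 1.2838 bpb from the roundtrip 1.3559 bpb; the quantized integers then compress well under zstd at level 22.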

Run Command

torchrun --standalone --nproc_per_node=8 train_gpt.py

Files

  • train_gpt.py — training script
  • train.log — full training log
  • submission.json — metadata

Copilot AI review requested due to automatic review settings March 21, 2026 16:43
Contributor

Copilot AI left a comment


Pull request overview

Adds a new non-record submission to the 16MB track, capturing an experiment that stacks MLA attention, SmearGate MLP, BigramHash embeddings, and SWA, and includes the training/eval code snapshot plus reported metrics and artifacts.

Changes:

  • Adds a self-contained train_gpt.py implementing MLA/SmearGate/BigramHash, SWA, Muon optimizer, and mixed int5/int6(+fp16) quantized serialization.
  • Adds record metadata (submission.json) and a writeup (README.md) describing results and reproduction.
  • Adds a training log artifact (currently UUID-named).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/train_gpt.py | New training script implementing the stacked architecture and quantized artifact roundtrip. |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/submission.json | Submission metrics/size metadata for the run. |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/README.md | Human-readable summary of configuration, results, and run command. |
| records/track_non_record_16mb/2026-03-21_MLA_SmearGate_BigramHash/0a10b225-50af-46ef-8fb9-5183fe30fb70.txt | Captured training output/log for the run (currently not named train.log). |


Skrisps26 and others added 3 commits March 21, 2026 22:24
…ash/train_gpt.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ash/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ash/submission.json

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>