
Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 107min, val_bpb=1.2207, 15.4MB)#334

Open
nathon-lee wants to merge 2 commits into openai:main from nathon-lee:nathon-lee-v1

Conversation


@nathon-lee nathon-lee commented Mar 21, 2026

Summary

Non-record submission: 11-layer 512-dim GPT with PartialRoPE, LNScale, SmearGate,
BigramHash(2048×64), U-Net skip connections, EMA + SWA, and test-time training (TTT).

Trained on 1×H100 PCIe for ~107 minutes (roughly equivalent to 10 minutes on 8×H100 SXM).

Results

Metric              Value
val_bpb (pre-TTT)   1.2207
val_loss            2.0611
Training steps      3374 / 20000
Training time       ~107 min (1×H100 PCIe)
Model params        26.7M
Artifact size       15.4 MB ✅

Key Techniques

  • Partial RoPE (16/64 dims): position encoding on subset of head dims
  • LN Scale: RMSNorm damped by 1/√(layer+1)
  • SmearGate: per-dim gate blending current + previous token
  • BigramHash(2048, dim=64): hash-based bigram context embeddings
  • U-Net skip connections with learnable weights
  • Muon optimizer (Newton-Schulz) + Adam for embeddings
  • EMA(0.997) + SWA (last 40%)
  • Uniform int5 quantization + zstd-22
  • Sliding window eval (stride=64) + SGD TTT (3 epochs)
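
A minimal numpy sketch of the partial-RoPE idea from the list above: rotary position encoding is applied to only the first 16 of 64 head dimensions, and the remaining dimensions pass through position-free. Function name, the rotate-half layout, and the base frequency are assumptions; the PR's train_gpt.py may differ.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` dims of
    each head; leave the remaining dims untouched.
    x: (seq_len, head_dim) array of queries or keys for one head."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    # inverse frequencies for the rotated slice only
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_rot, x_pass = x[:, :rope_dims], x[:, rope_dims:]
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Note that position 0 is a no-op (all angles are zero) and the pass-through dims are identical before and after, which is an easy sanity check when wiring this into attention.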

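The SmearGate entry above can be sketched as a per-dimension sigmoid gate that mixes each token with its predecessor. The convex-blend form, the function name, and zero-padding the first position are assumptions made for illustration.

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token with the previous token via a learned per-dim gate.
    x: (seq_len, dim) activations; gate_logits: (dim,) learned parameters."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid, per-dim gate in (0, 1)
    # previous-token tensor: zeros at position 0, then x shifted down by one
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return (1.0 - g) * x + g * prev
```

At initialization (gate_logits = 0) this mixes current and previous tokens 50/50; training can push each dimension toward pure current-token or pure smeared behavior.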
Changes from v1

  • BigramHash: 4096×128 → 2048×64 (-426K params)
  • Quantization: mixed int5/int6 → uniform int5
  • Artifact: 17.4MB ❌ → 15.4MB ✅
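
The uniform int5 quantization from the change list can be sketched as symmetric round-to-nearest with 31 levels in [-15, 15] and a per-tensor scale. Per-tensor scaling is an assumption, and the bit-packing plus zstd-22 compression steps that produce the final artifact are omitted.

```python
import numpy as np

def quantize_int5(w):
    """Uniform symmetric int5 quantization: codes in [-15, 15],
    per-tensor scale chosen so the max-magnitude weight maps to 15."""
    scale = max(np.abs(w).max() / 15.0, 1e-12)   # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale):
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-weight reconstruction error by scale/2, which is the usual argument for why narrow uniform quantization is tolerable on small models like this 26.7M-param one.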

Checklist

  • README.md
  • submission.json
  • train.log
  • train_gpt.py
  • Artifact under 16MB ✅

Signed-off-by: nathon-lee <nathon-lee@users.noreply.github.com>
Changes:
- BigramHash: 4096x128 -> 2048x64 (-426K params)
- Quantization: mixed int5/int6 -> uniform int5 for all weights
- Artifact: 17.4MB -> 15.4MB

Results (1xH100 PCIe, 6400s):
- val_loss: 2.0611 (pre-TTT)
- val_bpb: 1.2207 (pre-TTT)
- params: 26,666,073
- artifact: 16,132,620 bytes
- total: 16,182,081 bytes

Signed-off-by: nathon-lee <nathon-lee@users.noreply.github.com>

fix: reduce artifact to 15.4MB (under 16MB limit)

Signed-off-by: nathon-lee <nathon-lee@users.noreply.github.com>

@nathon-lee nathon-lee changed the title Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 80min, val_bpb=1.2108) Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 107min, val_bpb=1.2207, 15.4MB) Mar 21, 2026