Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254 (3-seed mean 1.1256) by alertcat · Pull Request #338 · openai/parameter-golf

alertcat · 2026-03-21T11:07:02Z

11L XSA + EMA + TTT + Int6 MLP3x

val_bpb = 1.1254 (sliding window stride=64, best seed 42) | 15.55 MB artifact | 8xH100 SXM, 600s

Key Innovation: TTT on XSA+EMA baseline

First submission combining XSA (Exclusive Self Attention) + EMA + Test-Time Training. After training and quantization, TTT performs 3 epochs of SGD fine-tuning on the validation token stream, adapting the model to the test distribution.

Results (3-seed, 8xH100 SXM)

Seed	Steps	Sliding BPB (s64)	Artifact
1337	7,070	1.1258	15.55 MB
42	7,068	1.1254	15.55 MB
2024	7,069	1.1256	15.55 MB

Mean: 1.1256 | Std: 0.0002

TTT Details

3 epochs SGD on validation tokens (lr=0.002, momentum=0.9)
First 2 transformer blocks frozen for stability
~47 seconds on 8xH100 (well under 600s eval limit)
Improves post-quant BPB by ~0.002

Architecture (from PR #315)

11L, 512d, 8H/4KV, MLP 3x, relu-squared
XSA on last 4 layers, EMA (decay=0.997)
SmearGate + BigramHash(2048) + OrthoInit
Int6 QAT + Late QAT + zstd-22
FlashAttention 3, Muon WD=0.04

Eval Timing

Training: 600s | TTT: 47s | Sliding eval: 73s | Total eval: ~120s

Reproduction

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: zstandard in e:�naconda\lib\site-packages (0.23.0)

Built on PR #315 (XSA, EMA, SmearGate, BigramHash, OrthoInit, sliding window eval).

Innovation over PR openai#198 (SOTA 1.1318): - 12 transformer layers (was 11): +2.2M params, better representation - Int5 quantization for MLP weights [-16,15]: 3 zero high bits - zstd compression 1.88x vs int6 1.51x, saves ~1.8MB - Funds the 12th layer within 16MB budget - Int6 kept for attention weights (precision-sensitive) - FA3 fallback for older PyTorch - LR=0.025 (validated as optimal in A/B testing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New CUDA presets: - pr332_12l_xsa: 12L/2xMLP, seq2048, momentum 0.99 (from PR openai#332) - pr338_11l_ttt: 11L/2xMLP, seq2048, momentum 0.99 (from PR openai#338) - bft_ensemble: 9L/3xMLP Byzantine fault tolerant checkpoint config - difficulty_adjusted: 10L/2xMLP adaptive search with tight LR - partial_rope_headtemp: baseline arch with novel attention params Expanded search: NUM_LAYERS includes 11, TRAIN_SEQ_LEN includes 4096. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Move EMA shadow weights to GPU (CPU transfers cost ~32% throughput) - Increase train seq_len from 1024 to 2048 (matches record PR openai#338) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

alertcat and others added 9 commits March 20, 2026 21:22

PR198 SOTA + FA3 fallback + LR0.025 + run script

3157704

Fix: enable QAT (was 0, should be 1) - reduces quant loss 3x

bedcff8

8xH100 3-seed results: sliding BPB 1.1539-1.1543

7511dff

Add non-record submission: 12L Int5-MLP, sliding BPB 1.1541

d21e7fb

Fix: TTT code on main, BigramHash=2048, FA3 install script

0c9924a

Fix: add zstandard install (critical for <16MB), update run script

a02847c

3-seed PR315+TTT: sliding BPB 1.1254-1.1258, artifact 15.55MB

934b4a6

Record: 11L XSA+EMA+TTT, sliding BPB 1.1254

5d7082e

notapplica mentioned this pull request Mar 21, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

sheeki03 mentioned this pull request Mar 21, 2026

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364) #339

Open

shivnarainms22 mentioned this pull request Mar 21, 2026

Non-record: 10L Int5-MLP + TTT + Backout Connection (val_bpb=1.1574 on 8xH100 SXM) #366

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254 (3-seed mean 1.1256)#338

Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254 (3-seed mean 1.1256)#338
alertcat wants to merge 9 commits intoopenai:mainfrom
alertcat:submission-pr315-ttt

alertcat commented Mar 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alertcat commented Mar 21, 2026

11L XSA + EMA + TTT + Int6 MLP3x

Key Innovation: TTT on XSA+EMA baseline

Results (3-seed, 8xH100 SXM)

TTT Details

Architecture (from PR #315)

Eval Timing

Reproduction

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant