Record: 12L Gradient-Guided Quant + Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1320) #332
Open
saml212 wants to merge 2 commits into openai:main from
Conversation
12 layers funded by gradient-guided adaptive quantization: measure gradient sensitivity during warmdown, allocate int5/int6/int7 per-tensor. Enables 12th layer in 15.7 MB. 3-seed mean 1.1320 (std 0.0002). Key finding: Late QAT is counterproductive at 12L — the per-step overhead (~7ms) costs more training steps than the quant improvement saves.
saml212 added a commit to saml212/parameter-golf that referenced this pull request (Mar 21, 2026).
ccb7206 to 9b2aec3
9b2aec3 to 4b062e0
RyanLisse added a commit to RyanLisse/parameter-golf that referenced this pull request (Mar 21, 2026):
New CUDA presets:
- pr332_12l_xsa: 12L/2xMLP, seq2048, momentum 0.99 (from PR openai#332)
- pr338_11l_ttt: 11L/2xMLP, seq2048, momentum 0.99 (from PR openai#338)
- bft_ensemble: 9L/3xMLP Byzantine fault tolerant checkpoint config
- difficulty_adjusted: 10L/2xMLP adaptive search with tight LR
- partial_rope_headtemp: baseline arch with novel attention params

Expanded search: NUM_LAYERS includes 11, TRAIN_SEQ_LEN includes 4096.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
val_bpb: 1.1320 (sliding window, stride=64) | 15.7 MB | 8xH100 SXM, 600s
Progress from prior submissions
What's new
Gradient-Guided Adaptive Quantization. Standard int6 quantization treats all weight tensors equally, but not all tensors are equally sensitive to quantization noise. We accumulate per-tensor squared gradient magnitudes during the last 10% of warmdown (zero throughput cost — gradients are already computed), then rank tensors by sensitivity at quantization time:
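The ranking and per-tensor bit allocation might look like the following sketch. The function name, the input (a dict of accumulated squared-gradient sums per tensor), and the 25%/25% split fractions are assumptions for illustration, not the PR's exact implementation:

```python
def allocate_bits(grad_sq_sums, frac_int7=0.25, frac_int5=0.25):
    """Rank tensors by accumulated squared-gradient sensitivity and
    assign int7 to the most sensitive, int5 to the least sensitive,
    and int6 to everything in between.

    grad_sq_sums: {tensor_name: sum of squared gradient magnitudes}
    Returns {tensor_name: bit width}. Split fractions are illustrative.
    """
    names = sorted(grad_sq_sums, key=grad_sq_sums.get, reverse=True)
    n7 = int(len(names) * frac_int7)   # most sensitive tensors
    n5 = int(len(names) * frac_int5)   # least sensitive tensors
    bits = {}
    for i, name in enumerate(names):
        if i < n7:
            bits[name] = 7             # extra precision where it matters
        elif i >= len(names) - n5:
            bits[name] = 5             # cheapest where noise is tolerable
        else:
            bits[name] = 6
    return bits
```

Because the sensitivity statistics come from gradients the optimizer already computes, the accumulation itself adds no forward/backward work; only the final ranking runs once at quantization time.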
This adaptive allocation saves ~1 MB vs uniform int6, funding a 12th transformer layer while staying under 16 MB.
12 layers (up from 9). Extra depth funded by gradient-guided compression headroom. MLP narrowed to 1408 (from 1536 at 11L) — extra depth outweighs narrower width at this scale.
Batch=524K. Reducing batch size from 786K to 524K tokens gives ~15% more optimization steps (8,060 vs ~7,000) at lower per-step cost (74 ms vs ~84 ms). In a fixed-time budget, the extra gradient updates outweigh the quality benefit of larger batches.
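As a sanity check on those step counts, a fixed wallclock budget divided by per-step time gives the available steps. The 600 s budget is taken from the results line; the PR's exact counts differ slightly, presumably due to warmup and evaluation overhead:

```python
def steps_in_budget(total_seconds, step_ms):
    """Number of whole optimization steps that fit in a fixed
    wallclock budget at a given per-step time."""
    return int(total_seconds * 1000 / step_ms)

# steps_in_budget(600, 74) -> 8108 steps at 74 ms/step
# steps_in_budget(600, 84) -> 7142 steps at 84 ms/step
```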
Partial RoPE (16 of 64 dims). Rotary embeddings applied to only 25% of head dimensions. Remaining dims use position-free attention, improving generalization. Zero new parameters.
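For one attention head, partial RoPE reduces to rotating only the first 16 of 64 dimensions and passing the rest through unchanged. This is a sketch for a single query vector; the pairing scheme (dim i with dim i+8) and the base of 10000 are common conventions, not confirmed details of the PR:

```python
import math

def partial_rope(q, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims`
    dimensions of a per-head vector `q`; leave the tail position-free.

    q: list of floats (e.g. head_dim=64), pos: integer token position.
    """
    half = rope_dims // 2
    out = list(q)
    for i in range(half):
        theta = pos * base ** (-i / half)      # per-frequency angle
        x1, x2 = q[i], q[i + half]             # paired dims (i, i+half)
        out[i] = x1 * math.cos(theta) - x2 * math.sin(theta)
        out[i + half] = x1 * math.sin(theta) + x2 * math.cos(theta)
    return out                                 # dims >= rope_dims untouched
```

The rotation adds no parameters: position information enters purely through the angles, and the untouched dimensions attend position-free.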
LN Scale. RMSNorm outputs scaled by 1/sqrt(layer_idx+1). Damps deeper layers' contributions, stabilizing training at 12 layers. Zero new parameters.
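A minimal version of the depth scaling, folded into RMSNorm. The eps value and the omission of a learned weight vector are simplifications for illustration:

```python
import math

def rmsnorm_scaled(x, layer_idx, eps=1e-6):
    """RMSNorm followed by the 1/sqrt(layer_idx + 1) depth scaling:
    layer 0 is unscaled, deeper layers contribute progressively less."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    scale = 1.0 / math.sqrt(layer_idx + 1)
    return [v / rms * scale for v in x]
```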
XSA (Exclusive Self Attention) on last 4 layers. Removes self-value bias from attention output via orthogonal projection. Forces attention to carry cross-token information only. Zero new parameters.
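One plausible reading of the orthogonal projection, per token: subtract from the attention output the component lying along that token's own value vector, so what remains carries only cross-token information. The PR's exact formulation may differ:

```python
def remove_self_value(out, v, eps=1e-8):
    """Project a token's attention output `out` orthogonal to that
    token's own value vector `v`, removing the self-value component.
    Both are plain lists of floats of equal length."""
    dot = sum(o * w for o, w in zip(out, v))
    norm_sq = sum(w * w for w in v) + eps      # eps avoids divide-by-zero
    c = dot / norm_sq
    return [o - c * w for o, w in zip(out, v)]
```

Since the projection is a fixed linear operation on existing activations, it adds no parameters.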
EMA (decay=0.997) replacing SWA. Exponential moving average every step instead of periodic checkpoint averaging. Smoother weight distribution, better generalization and compression.
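The per-step update is the standard exponential moving average of weights (a sketch; the dict-of-floats representation stands in for real parameter tensors):

```python
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step over named parameters:
    ema <- decay * ema + (1 - decay) * model.
    Run every optimizer step; the EMA copy is what gets evaluated."""
    return {k: decay * ema_params[k] + (1.0 - decay) * model_params[k]
            for k in ema_params}
```

With decay=0.997 the effective averaging window is roughly 1/(1-decay) ≈ 333 steps, versus the coarse snapshots that periodic checkpoint averaging (SWA) provides.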
Negative finding: Late QAT at 12 layers
We tested Late QAT (STE int6 fake-quantization in the last 4% of training). At 12 layers the per-step overhead (~7ms) forces a lower wallclock cap, costing ~770 training steps. The lost model quality exceeds the quantization improvement: 1.1361 (with Late QAT) vs 1.1321 (without). Late QAT's value depends on the step budget — at high layer counts where step time is already elevated, the throughput cost dominates.
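For reference, the fake-quantization forward pass being timed is sketched below: symmetric per-tensor quantize-then-dequantize, which in training sits behind a straight-through estimator so gradients pass through as if it were the identity. This is an illustrative version, not the PR's kernel:

```python
def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: map weights to signed
    `bits`-bit integer levels, then dequantize back to floats.
    w: list of floats. Returns the quantize-dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    scale = max(max(abs(x) for x in w) / qmax, 1e-12)
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in w]
```

The per-step cost comes from running this extra pass over every quantized tensor each training step, which is exactly the overhead that makes Late QAT a net loss at 12 layers.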
Results
Reproducibility (3 seeds)
Mean: 1.1320 | Std: 0.0002 | Submitted: seed 1337
Run command