Skip to content

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)#339

Open
sheeki03 wants to merge 1 commit intoopenai:mainfrom
sheeki03:submission/11l-backout-1.1364
Open

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)#339
sheeki03 wants to merge 1 commit intoopenai:mainfrom
sheeki03:submission/11l-backout-1.1364

Conversation

@sheeki03
Copy link

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)

val_bpb: 1.1364 (sliding window, stride=64) | 16.17 MB | 8xH100 SXM, 600s

Known Issue

Artifact is 16,170,051 bytes — 170KB over the 16,000,000 byte cap. The code supports INT5_MLP=1 which switches MLP quantization from int6 to int5, saving 1-2MB. A follow-up run is planned to bring the artifact under the cap.

Progress from prior submissions

PR #198 This Delta
val_bpb (sliding) 1.1318 (s64) 1.1364 (s64) +0.0046
Steps (600s) 7412 6642 -770
Step time 81ms 90ms +9ms
Artifact 15.7 MB 16.2 MB +0.5 MB

Note: Our baseline replication of PR #198's config yielded 1.1435 (vs their reported 1.1318), likely due to hardware/driver differences (RunPod community cloud vs dedicated). Relative to our own baseline, Backout improves by -0.0071.

What's new

Backout Connection — A learned residual subtraction from a mid-network hidden state. After the U-Net encoder-decoder forward pass, the model subtracts lambda * h_mid from the final representation, where lambda is a learned scalar (initialized at 0.2) and h_mid is the hidden state at layer num_layers // 2.

This acts as a learned negative residual that removes redundant mid-network information, sharpening the final representation for the language modeling head. Zero additional matrix parameters — only one learned scalar.

Controlled comparison (same hardware, same run)

Metric Baseline (PR #198 config) + Backout Delta
val_bpb (sliding, s=64) 1.1435 1.1364 -0.0071
val_loss 1.9307 1.9188 -0.0119
Steps (600s) 5246 6642 +1396
Step time 114ms 90ms -24ms
Artifact 17.1 MB (zlib) 16.2 MB (zstd) -0.9 MB

Results

Metric Value
Pre-quant val_bpb 1.1544
Int6 roundtrip val_bpb 1.1588
Int6 sliding val_bpb (s64) 1.1364
Steps completed (600s cap) 6642
Step time 90ms
Artifact size 16,170,051 bytes
Code size 70,854 bytes
SWA checkpoints averaged 6

Architecture

11 layers, 512 dim, 8 heads / 4 KV heads, MLP 3x, relu-squared, SmearGate, BigramHash(4096), OrthoInit, Muon + AdamW with WD=0.04, SWA, int6 mixed quant + zstd, FA3, seq 2048, sliding window eval stride=64.

Backout layer: num_layers // 2 (layer 5). Lambda: learned scalar, initialized at 0.2.

Run command

NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=4096 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
BACKOUT_ENABLED=1 BACKOUT_LAMBDA_INIT=0.2 \
LAWA_ENABLED=0 INT5_MLP=0 VE_ENABLED=0 \
python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py

Hardware

8xH100 SXM 80GB HBM3 (RunPod, EUR-IS-3)

Next steps

  1. Run with INT5_MLP=1 to bring artifact under 16MB
  2. Multi-seed validation (3 seeds)
  3. Combine Backout with XSA + EMA + TTT from PRs Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) #315, Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254 (3-seed mean 1.1256) #338

Adds Backout Connection — learned residual subtraction from mid-network
hidden state. Improves val_bpb by 0.0071 over PR openai#198 baseline with
zero additional matrix parameters (one learned scalar).

val_bpb: 1.1364 (sliding window, stride=64)
Artifact: 16,170,051 bytes (170KB over cap, fixable with INT5_MLP=1)
Hardware: 8xH100 SXM, 600s wallclock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant