
Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 107min, val_bpb=1.2207, 15.4MB)#334

Open
nathon-lee wants to merge 2 commits into openai:main from nathon-lee:nathon-lee-v1

Conversation


@nathon-lee nathon-lee commented Mar 21, 2026

Summary

Non-record submission: 11-layer 512-dim GPT with PartialRoPE, LNScale, SmearGate,
BigramHash(2048×64), U-Net skip connections, EMA + SWA, and test-time training (TTT).

Trained on 1×H100 PCIe for ~107 minutes (roughly equivalent to 10 minutes on 8×H100 SXM).

Results

Metric              Value
val_bpb (pre-TTT)   1.2207
val_loss            2.0611
Training steps      3374 / 20000
Training time       ~107 min (1×H100 PCIe)
Model params        26.7M
Artifact size       15.4 MB ✅

Key Techniques

  • Partial RoPE (16/64 dims): position encoding on subset of head dims
  • LN Scale: RMSNorm damped by 1/√(layer+1)
  • SmearGate: per-dim gate blending current + previous token
  • BigramHash(2048, dim=64): hash-based bigram context embeddings
  • U-Net skip connections with learnable weights
  • Muon optimizer (Newton-Schulz) + Adam for embeddings
  • EMA(0.997) + SWA (last 40%)
  • Uniform int5 quantization + zstd-22
  • Sliding window eval (stride=64) + SGD TTT (3 epochs)
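
A minimal numpy sketch of the partial-RoPE idea from the list above: rotary position encoding is applied to only the first 16 of 64 head dimensions, and the remaining dimensions pass through position-free. Function name, the rotate-half layout, and the base frequency are assumptions; the PR's train_gpt.py may differ.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` dims of
    each head; leave the remaining dims untouched.
    x: (seq_len, head_dim) array of queries or keys for one head."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    # inverse frequencies for the rotated slice only
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_rot, x_pass = x[:, :rope_dims], x[:, rope_dims:]
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Note that position 0 is a no-op (all angles are zero) and the pass-through dims are identical before and after, which is an easy sanity check when wiring this into attention.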

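The SmearGate entry above can be sketched as a per-dimension sigmoid gate that mixes each token with its predecessor. The convex-blend form, the function name, and zero-padding the first position are assumptions made for illustration.

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token with the previous token via a learned per-dim gate.
    x: (seq_len, dim) activations; gate_logits: (dim,) learned parameters."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid, per-dim gate in (0, 1)
    # previous-token tensor: zeros at position 0, then x shifted down by one
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return (1.0 - g) * x + g * prev
```

At initialization (gate_logits = 0) this mixes current and previous tokens 50/50; training can push each dimension toward pure current-token or pure smeared behavior.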
Changes from v1

  • BigramHash: 4096×128 → 2048×64 (-426K params)
  • Quantization: mixed int5/int6 → uniform int5
  • Artifact: 17.4MB ❌ → 15.4MB ✅
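
The uniform int5 quantization from the change list can be sketched as symmetric round-to-nearest with 31 levels in [-15, 15] and a per-tensor scale. Per-tensor scaling is an assumption, and the bit-packing plus zstd-22 compression steps that produce the final artifact are omitted.

```python
import numpy as np

def quantize_int5(w):
    """Uniform symmetric int5 quantization: codes in [-15, 15],
    per-tensor scale chosen so the max-magnitude weight maps to 15."""
    scale = max(np.abs(w).max() / 15.0, 1e-12)   # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale):
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-weight reconstruction error by scale/2, which is the usual argument for why narrow uniform quantization is tolerable on small models like this 26.7M-param one.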

Checklist

  • README.md
  • submission.json
  • train.log
  • train_gpt.py
  • Artifact under 16MB ✅

Signed-off-by: nathon-lee <nathon-lee@users.noreply.github.com>
Changes:
- BigramHash: 4096x128 -> 2048x64 (-426K params)
- Quantization: mixed int5/int6 -> uniform int5 for all weights
- Artifact: 17.4MB -> 15.4MB

Results (1xH100 PCIe, 6400s):
- val_loss: 2.0611 (pre-TTT)
- val_bpb: 1.2207 (pre-TTT)
- params: 26,666,073
- artifact: 16,132,620 bytes
- total: 16,182,081 bytes

Signed-off-by: nathon-lee <nathon-lee@users.noreply.github.com>

fix: reduce artifact to 15.4MB (under 16MB limit)

Signed-off-by: nathon-lee <nathon-lee@users.noreply.github.com>

@nathon-lee nathon-lee changed the title Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 80min, val_bpb=1.2108) Non-record: 11L PartialRoPE + LNScale + EMA + SWA + TTT (1xH100 107min, val_bpb=1.2207, 15.4MB) Mar 21, 2026