Record: 11L XSA + EMA + Int5-MLP (val_bpb=1.1399)#349

Mapika wants to merge 1 commit into openai:main from Mapika:submission/11L-XSA-EMA-Int5MLP



Mapika commented Mar 21, 2026

Summary

  • 11 layers with XSA (Exclusive Self-Attention) on last 4 layers
  • Continuous GPU float32 EMA (decay=0.997) — every step, no CPU transfers
  • Mixed int5 MLP / int6 attention / int8 embedding quantization
  • 8% magnitude pruning + zstd-22 compression
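
A minimal sketch of the continuous EMA update described above, using NumPy as a stand-in for the GPU float32 tensors (the function and variable names are illustrative, not the PR's actual code). The key property is that the shadow copy lives in float32 on the same device as the parameters and is updated in place every step, so no CPU transfers are needed:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """In-place EMA update, run after every optimizer step.

    ema <- decay * ema + (1 - decay) * param, with the shadow copy
    kept in float32 on the training device (NumPy here for illustration).
    """
    for ema, p in zip(ema_params, params):
        ema *= decay
        ema += (1.0 - decay) * p.astype(np.float32)
```

At evaluation time, the EMA weights are swapped in place of the raw parameters; with decay=0.997 the effective averaging window is roughly 1/(1-0.997) ≈ 333 steps.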

3-Seed Results (8xH100)

| Seed | val_bpb | artifact_bytes | valid |
|------|---------|----------------|-------|
| 42   | 1.14005 | 15,919,150     | yes   |
| 1337 | 1.13874 | 15,999,808     | yes   |
| 7    | 1.14080 | 15,882,678     | yes   |
| **Mean** | 1.1399 | | |
| **Std**  | 0.0009 | | |

All seeds trained in <600s, all under 16MB.
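
The sub-16MB artifact sizes come from the pruning and low-bit quantization listed in the summary. A minimal NumPy sketch of 8% magnitude pruning followed by symmetric 5-bit quantization (range [-15, 15]); this is an illustration of the technique, not the PR's exact packing or zstd pipeline:

```python
import numpy as np

def prune_and_quantize_int5(w, prune_frac=0.08):
    """Zero the smallest `prune_frac` of weights by magnitude, then
    symmetrically quantize to 5-bit integer levels in [-15, 15]."""
    w = w.astype(np.float32).copy()
    k = int(prune_frac * w.size)
    if k > 0:
        # k-th smallest absolute value becomes the pruning threshold
        thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        w[np.abs(w) <= thresh] = 0.0
    amax = np.abs(w).max()
    scale = amax / 15.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Pruning before quantization concentrates many values at exactly zero, which is what makes the final zstd pass (level 22 in the PR) so effective on the packed integer stream.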

Architecture

  • 26.8M params, 512 dim, 8 heads, 4 KV heads (GQA)
  • SmearGate + BigramHash(2048) + U-Net skip connections
  • Muon optimizer (WD=0.04), cosine warmdown (3000 iters)
  • ~5,850 steps at ~102ms/step
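
The learning-rate schedule above can be sketched as a constant phase followed by a cosine decay to zero over the final 3,000 iterations. The step counts come from the PR; the exact schedule shape (cosine-to-zero) is an assumption:

```python
import math

def lr_at(step, total_steps=5850, warmdown_iters=3000, base_lr=1.0):
    """Cosine warmdown: hold base_lr, then cosine-decay to 0 over the
    last `warmdown_iters` steps. Illustrative, not the PR's code."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_iters
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```

With ~5,850 total steps, the warmdown begins a little past the halfway mark (step 2,850) and reaches half the base LR at step 4,350.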

