docs: add TIPS.md and resolve environment dependency issues (#280, #82, #43) #357

Open
adityagupta26 wants to merge 68 commits into openai:0hq-patch-1 from adityagupta26:feature/resolve-open-issues

Conversation

@adityagupta26

This PR addresses several open issues and provides additional documentation to help participants get started more effectively.

### Key Changes
- **Environment Fix (#280):** Added flash-attn to requirements.txt to ensure the Flash Attention 3 interface is available in the RunPod environment as expected by the training scripts.
- **Newcomer Documentation (#82):** Created TIPS.md, a collection of actionable advice for participants, covering architectural ideas (like Depth Recurrence), training optimizations (Muon), and evaluation tricks (Sliding Window).
- **Rule Clarification (#43):** Explicitly documented in TIPS.md that the tokenizer size is not counted toward the 16MB artifact limit, providing clarity for those optimizing their vocabulary.
- **Architectural Guidance (#202):** Included weight-sharing (layer tying) as a suggested strategy in the new tips document to help users stay within parameter constraints.

### Verification
- Verified that TIPS.md renders correctly in Markdown.
- Confirmed requirements.txt correctly includes the new dependency.
- Synced and rebased against the latest main branch.

0hq and others added 30 commits March 18, 2026 16:33
## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding which lacks STE fake-quant. Reduces quant penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129
across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats
(t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aluating the graph after each sub-batch step
Use eager mx.eval() to fix running train script on 16GB Mac devices
keep tok_emb.weight in fp16 during int8 export (kills the quant gap),
shrink MLP hidden to 992 to fit under 16MB, bump warmdown to 3600
and matrix LR to 0.06.

tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* SOTA attempt

* Improve score on SXM

---------

Co-authored-by: spokane-way <spokane@way>
Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)

5-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1652 (std=0.0017)
  Mean rt_bpb:    1.1985
  t-statistic:    78.93 (p << 0.001)
  All artifacts under 16MB (mean: 15.64MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The window_starts filter dropped windows shorter than stride,
silently skipping up to (stride-1) tokens at the end of the
validation set. Now includes all windows with >= 1 scoreable
token, and clamps the score start for short final windows.
Co-authored-by: spokane-way <spokane@way>
…val_bpb=1.1748) (openai#60)

* Add NTK Eval + Overtone Init submission (1.2160 BPB)

Train@1024 with overtone embedding init and phase-transition residual
mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb
1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)

* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)

* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)

* Update submission: 1.2000 BPB — FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone

* Update: 1.1748 BPB — Sliding Window + FP16 Embed + 10L + Muon WD + Overtone

---------

Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Warmdown-quantization co-optimization, val_bpb=1.2154

Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization
penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate
NTK-RoPE extrapolation (eval@1408).

Full warmdown sweep across 10 values and detailed analysis in README.

* breakthrough: 1.1574 BPB via int6 + MLP 3x + sliding window stride=256

---------

Co-authored-by: Sam Larson <saml212@users.noreply.github.com>
cocohearts and others added 30 commits March 19, 2026 16:30
val_bpb: 1.1556 (post-quant int6+zstd-22, sliding window eval stride=64)
## Summary
A 22.4M parameter transformer language model trained in under 10 minutes on 8×H100 GPUs, compressed to a 15.1MB artifact via int6 quantization-aware training and zstd-22. The architecture combines a SmearGate bigram embedding layer, orthogonal weight initialization, 3× MLP expansion, U-Net skip connections, and decoupled Muon weight decay, evaluated with sliding window context at stride 64.
## Architecture
### Transformer Core
A 9-layer, 512-dim transformer with 8 attention heads (4 KV heads via grouped-query attention) and tied input/output embeddings over a 1024-token BPE vocabulary. Sequence length during training is 1024 tokens.
### SmearGate
A learned per-dimension gate (~512 params) that blends each token's embedding with the previous token's embedding before the transformer processes anything:
```python
gate = sigmoid(self.gate)  # shape [dim], init ≈ 0.95
output = gate * current_emb + (1 - gate) * prev_token_emb
```
This injects bigram (two-token) context directly into the embedding layer. Normally a transformer must discover token-pair relationships through self-attention; SmearGate provides this signal for free. The gate is initialized via `sigmoid(3.0) ≈ 0.95` so it starts near-identity (mostly current token), and the model learns per-dimension how much previous-token blending is useful.
SmearGate is applied after the embedding lookup and bigram hash addition, and before RMS normalization.
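The blend above can be sketched in plain Python for a short sequence. This is only an illustration: the helper names are hypothetical, and the handling of position 0 (which has no previous token) is an assumption, since the text does not say what the submission does there.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def smear_gate(token_embs, gate_logits):
    """Blend each position's embedding with the previous position's.

    token_embs:  list of per-position embedding vectors (lists of floats).
    gate_logits: one learnable logit per dimension (3.0 at init, per the text).
    """
    gates = [sigmoid(g) for g in gate_logits]
    out = []
    for t, cur in enumerate(token_embs):
        # Assumption: position 0 has no predecessor, so it blends with itself.
        prev = token_embs[t - 1] if t > 0 else cur
        out.append([g * c + (1.0 - g) * p for g, c, p in zip(gates, cur, prev)])
    return out


# Two positions, two dimensions, gate logits at their stated init of 3.0.
embs = [[1.0, 0.0], [0.0, 1.0]]
smeared = smear_gate(embs, gate_logits=[3.0, 3.0])
```

At initialization each output is roughly 95% current token and 5% previous token, which is the near-identity start described above.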
### Bigram Hash Embedding
A 4096-bucket hash table (dim=128, projected to 512) maps consecutive token pairs to learned embeddings via `(prev * 92821 + cur) % 4096`. This gives the model direct access to token-pair features at minimal parameter cost.
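A minimal sketch of the bucket computation, using the hash formula quoted above. The function names are hypothetical, and skipping the first token (which has no predecessor) is an assumption.

```python
def bigram_bucket(prev_id: int, cur_id: int, num_buckets: int = 4096) -> int:
    """Map a (previous token, current token) pair to a hash bucket index."""
    return (prev_id * 92821 + cur_id) % num_buckets


def bigram_buckets(token_ids, num_buckets: int = 4096):
    """Bucket index for every position that has a predecessor (positions 1..n-1)."""
    return [
        bigram_bucket(prev, cur, num_buckets)
        for prev, cur in zip(token_ids, token_ids[1:])
    ]


buckets = bigram_buckets([3, 7, 7, 3])
```

With a 1024-token vocabulary there are 1024 × 1024 = 1,048,576 possible pairs, so on average 256 pairs share each of the 4096 buckets. The embedding table trades collisions for size.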
### MLP 3× Expansion
MLP hidden dimension is 3× the model dimension (1536 for a 512-dim model). The space savings from int6 quantization fund this extra capacity — wider MLPs allow more expressive nonlinear feature transformation between attention operations.
### U-Net Skip Connections
The 9-layer transformer is split into an encoder half (4 layers) and a decoder half (5 layers) with learned skip weights connecting corresponding encoder/decoder layers. This gives the decoder direct access to earlier representations without relying solely on the residual stream.
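One way to read "corresponding encoder/decoder layers" is the usual U-Net pairing, where the first decoder layer receives the last encoder output. The sketch below enumerates that pairing. The last-in, first-out order and the function name are assumptions, not something the text confirms.

```python
def unet_skip_pairs(num_layers: int = 9):
    """Return (decoder_layer, encoder_layer) index pairs for skip connections."""
    num_encoder = num_layers // 2            # 4 for a 9-layer stack
    num_decoder = num_layers - num_encoder   # 5 for a 9-layer stack
    saved = list(range(num_encoder))         # encoder outputs, saved in order
    pairs = []
    for i in range(num_decoder):
        decoder_layer = num_encoder + i
        if saved:
            # Assumed LIFO pairing: the innermost encoder output is consumed first.
            pairs.append((decoder_layer, saved.pop()))
    return pairs


pairs = unet_skip_pairs(9)
```

Under this reading the 9-layer model has four learned skip weights, and the final decoder layer has no skip input.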
## Training
### Muon Optimizer with Weight Decay
The Muon optimizer (MomentUm Orthogonalized by Newton-Schulz) runs SGD with Nesterov momentum, then post-processes each 2D parameter's update by replacing it with an approximation of the nearest semi-orthogonal matrix, computed with a 5-step Newton-Schulz iteration. This amounts to steepest descent under the spectral norm, so every singular direction of the update receives a step of similar size.
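To make the orthogonalization step concrete, here is a pure-Python sketch of a Newton-Schulz iteration. Note the substitution: it uses the textbook cubic update `X ← 1.5·X − 0.5·(X Xᵀ) X`, not whatever tuned coefficients the submission's Muon implementation uses, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps: int = 5):
    """Approximate the orthogonal polar factor of matrix g (list of rows)."""
    # Normalize by the Frobenius norm so every singular value is at most 1,
    # which keeps the cubic iteration inside its convergence region.
    fro = math.sqrt(sum(v * v for row in g for v in row)) + 1e-7
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Each singular value s moves to 1.5*s - 0.5*s**3, which pushes it toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


# More steps than Muon's 5, so the result is orthogonal to tight tolerance.
q = newton_schulz_orthogonalize([[3.0, 1.0], [1.0, 2.0]], steps=20)
q_qt = matmul(q, transpose(q))
```

The iteration never computes an SVD; it only needs matrix multiplications, which is why it suits GPU training loops.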
Decoupled weight decay (`p.mul_(1 - wd * lr)`, wd=0.01) is applied before each gradient update. This keeps weights smaller and better-distributed, which directly benefits both generalization and downstream quantization: tighter weight distributions quantize into fewer int6 buckets with less error and compress better with zstd.
Momentum is warmed from 0.92 → 0.99 over the first 1500 steps.
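The two scalar rules in this section can be sketched as follows. The linear shape of the momentum ramp is an assumption (the text gives only the endpoints and the step count), and the function names are hypothetical.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum at a given step, assuming a linear ramp from start to end."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end


def decoupled_weight_decay(weights, lr: float, wd: float = 0.01):
    """Shrink weights by (1 - wd * lr), independently of the gradient."""
    return [w * (1.0 - wd * lr) for w in weights]


halfway = muon_momentum(750)
decayed = decoupled_weight_decay([1.0, -2.0], lr=0.02)
```

With wd=0.01 and the matrix learning rate of 0.02, each step multiplies the weights by 0.9998, so the decay is gentle per step but accumulates over roughly 12k steps.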
### Orthogonal Weight Initialization
All non-zero-init CastedLinear weight matrices are initialized with `nn.init.orthogonal_()`. Orthogonal matrices have all singular values equal to 1, meaning gradients flow uniformly through the network at initialization with no vanishing or exploding signals. Additionally, since Muon's Newton-Schulz step orthogonalizes updates, starting from an already-orthogonal matrix means early updates are immediately useful rather than spent correcting a random initialization. With only ~12k steps in the 10-minute budget, faster convergence matters.
### Int6 Quantization-Aware Training (STE)
All 2D weight matrices are fake-quantized to int6 ([-31, 31]) during every forward pass via Straight-Through Estimator — the forward pass sees quantized weights while gradients flow through the rounding operation as if it were identity. The model learns weight configurations that are inherently robust to post-training quantization. The tied embedding matrix is stored as fp16 passthrough (not quantized), since it serves double duty for both input embeddings and output predictions where errors compound in both directions.
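A sketch of the forward-pass rounding for one weight row, assuming symmetric per-row scaling (scale = max|w| / 31). The scaling rule and the function name are assumptions; the text specifies only the [-31, 31] range.

```python
def fake_quant_int6_row(row, qmax: int = 31):
    """Quantize one weight row to integers in [-qmax, qmax], then dequantize.

    Returns (integer codes, dequantized floats). In an autograd framework the
    forward pass would use the dequantized values while the backward pass
    treats the rounding as identity; a common PyTorch idiom for that is
    w + (w_dequant - w).detach().
    """
    scale = max(abs(w) for w in row) / qmax
    if scale == 0.0:
        return [0] * len(row), [0.0] * len(row)
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, [c * scale for c in codes]


weights = [0.31, -0.10, 0.02, -0.31]
codes, dequant = fake_quant_int6_row(weights)
```

Because the model trains against the rounded values from the start, the weights it converges to already sit close to representable int6 levels at export time.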
### Learning Rate Schedule
Warmup over 20 steps, followed by linear warmdown over the final 3000 steps. Separate learning rates for tied embeddings (0.030), matrix parameters (0.020), and scalar parameters (0.020).
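The schedule can be written as a multiplier on each base learning rate. This sketch assumes a linear warmup ramp and a known total step count; how the training script handles either under a wallclock cap is not stated here, and the function name is hypothetical.

```python
def lr_multiplier(step: int, total_steps: int, warmup_steps: int = 20,
                  warmdown_steps: int = 3000) -> float:
    """Scale factor applied to a base learning rate at the given step."""
    if step < warmup_steps:
        # Assumed linear ramp that reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_steps:
        # Linear decay to zero over the final warmdown_steps steps.
        return max(steps_left, 0) / warmdown_steps
    return 1.0


total = 12047  # train steps completed in the reported run
matrix_lr_mid_warmdown = 0.020 * lr_multiplier(total - 1500, total)
```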
## Evaluation
### Sliding Window (stride=64)
Instead of chopping validation text into non-overlapping chunks (where tokens near the start of each chunk lack context), sliding window uses overlapping windows with stride 64 and the full 1024-token context window. Each scored token gets 960+ tokens of prior context. This is purely an evaluation-time technique — it does not change the model.
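The window layout can be sketched as below. This is an illustrative plan generator, not the submission's evaluation code: the function name and the `(start, end, score_from)` tuple layout are hypothetical. It scores every token exactly once and clamps the final short window.

```python
def sliding_windows(n_tokens: int, seq_len: int = 1024, stride: int = 64):
    """Plan overlapping eval windows as (start, end, score_from) tuples.

    Each window feeds tokens [start, end) to the model but only tokens
    [score_from, end) count toward the metric; the rest is context.
    """
    if not 0 < stride <= seq_len:
        raise ValueError("stride must be in (0, seq_len]")
    windows = []
    scored_until = 0
    start = 0
    while scored_until < n_tokens:
        end = min(start + seq_len, n_tokens)  # clamp the final short window
        windows.append((start, end, scored_until))
        scored_until = end
        start += stride
    return windows


plan = sliding_windows(2000)
```

The first window scores all of its tokens; every later window scores at most 64 new tokens, each with at least 960 tokens of preceding context. The cost is roughly 1024 / 64 = 16 times more forward passes than non-overlapping evaluation.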
## Export
### Int6 + zstd-22 Compression
All quantized weights are packed into int8 containers and compressed with zstandard at level 22. The int6 representation plus aggressive compression brings the full submission (model + code) to 15.1MB, under the 16MB cap.
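The pack-then-compress step can be sketched with the standard library alone. Note the substitution: this uses `zlib` as a stand-in so the example runs without extra packages, whereas the submission uses zstandard at level 22. The function names are hypothetical.

```python
import array
import zlib


def pack_int6(values) -> bytes:
    """Store int6 codes (range [-31, 31]) one per signed byte."""
    if any(v < -31 or v > 31 for v in values):
        raise ValueError("value outside the int6 range [-31, 31]")
    return array.array("b", values).tobytes()


def unpack_int6(raw: bytes):
    out = array.array("b")
    out.frombytes(raw)
    return out.tolist()


# Toy codes with a narrow, repetitive distribution.
codes = [(i % 7) - 3 for i in range(4096)]
raw = pack_int6(codes)
blob = zlib.compress(raw, 9)  # stand-in for zstd level 22
restored = unpack_int6(zlib.decompress(blob))
```

Each int6 code wastes two bits of its int8 container, but those bits are redundant, so a general-purpose compressor recovers most of the space without custom bit-packing.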
## Metrics

| Metric | Value |
|--------|-------|
| Post-quant sliding window val_bpb | 1.1556 |
| Post-quant sliding window val_loss | 1.9511 |
| Post-quant standard val_bpb | 1.1891 |
| Post-quant standard val_loss | 2.0077 |
| Quantization gap (standard eval) | ~0.0001 BPB |
| Model parameters | 22,368,840 |
| Artifact size (int6+zstd-22) | 15,878,809 bytes (15.1 MB) |
| Train steps completed | 12,047 |
| Train time | 600s (10.0 min) |
| Sliding window eval time | 75s |
| Peak GPU memory | 11,340 MiB |
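As a consistency check on the table, the sketch below relates val_loss to val_bpb. It assumes val_loss is the mean cross-entropy in nats per token, which the text does not state; under that assumption, bits per byte is loss / ln 2 divided by the average bytes per token.

```python
import math


def implied_bytes_per_token(val_loss_nats: float, val_bpb: float) -> float:
    """Average bytes per token implied by a (loss, bpb) pair."""
    bits_per_token = val_loss_nats / math.log(2)
    return bits_per_token / val_bpb


sliding = implied_bytes_per_token(1.9511, 1.1556)
standard = implied_bytes_per_token(2.0077, 1.1891)
```

Both rows imply about 2.44 bytes per token, so the two evaluation modes are consistent with a single tokenizer compression ratio.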
## Configuration
```
VOCAB_SIZE=1024
NUM_LAYERS=9
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=4
MLP_MULT=3
TIE_EMBEDDINGS=1
USE_SMEARGATE=1
TRAIN_SEQ_LEN=1024
TRAIN_BATCH_TOKENS=524288
LOGIT_SOFTCAP=30.0
ROPE_BASE=10000.0
QK_GAIN_INIT=1.5
BIGRAM_HASH_BUCKETS=4096
BIGRAM_HASH_DIM=128
TIED_EMBED_LR=0.030
MATRIX_LR=0.020
SCALAR_LR=0.020
MUON_MOMENTUM=0.99
MUON_MOMENTUM_WARMUP_START=0.92
MUON_MOMENTUM_WARMUP_STEPS=1500
MUON_WEIGHT_DECAY=0.01
MUON_BACKEND_STEPS=5
WARMDOWN_ITERS=3000
WARMUP_STEPS=20
EVAL_STRIDE=64
MAX_WALLCLOCK_SECONDS=600
SEED=1337
```
## Command
```bash
RUN_ID=smeargate_orthoinit_muonwd \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
## Hardware
8× NVIDIA H100 80GB HBM3 SXM (RunPod).
…SWA — improved config (Muon WD=0.04, SWA every 50), mean val_bpb=1.1458
…/submission.json

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Major upgrade: 11 layers + decoupled weight decay + zstd-22 compression.

Key changes:
- 11 layers (was 9) — more depth, funded by int6+zstd compression
- Weight decay 0.04 on Muon + AdamW — quantization-friendly weights
- zstd-22 compression — saves 1.5MB vs zlib, critical for 11L fit
- Higher Muon momentum (0.99) + warmup tuning
- SWA attempted but dropped (hurts with QAT)

3-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1502 (std=0.0004)
  t-statistic: 313.20 (p << 0.001)
  All artifacts under 16MB (mean 15.4MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…0.04

Key improvements over original 1.1453:
- bigram_vocab_size: 4096 → 10240 (fewer hash collisions)
- SWA_start_frac: 0.5 → 0.4 (more converged checkpoints)
- warmdown: 4000 → 3000 (more full-LR training)
- weight_decay: 0.04 global (both Muon and AdamW)

3-seed results: 1.14271, 1.14298, 1.14260 (mean=1.14276, std=0.00016)
All params set as defaults in train_gpt.py. Run: bash eval/eval.sh
Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598)
Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556
Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb)
…mbed-int6

Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502)
…nt6_MLP3x_SmearGate_BigramHash_MuonWD_SWA

Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)
Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds)
Co-authored-by: Codex <noreply@openai.com>
Update the text to reflect the passive voice grammar.