Task
Integrate Muon optimizer (Newton-Schulz orthogonalization) for 35% speedup vs AdamW.
Reference: https://kellerjordan.github.io/posts/muon/
Context
- R04 SimpleBackward: ✅ DONE (tied-emb gradient correct)
- R07 Overfit-100: ✅ DONE (BPB 1.2000, gates GREEN)
- R08 Trinity-3k: ✅ DONE (hidden=243, dims 3^k)
Gates (G1-G7)
G1: Implement Muon for 2D weights
G2: Hybrid optimizer setup
G3: Smoke test on overfit-100
G4: Wall-time check
G5: Stability test
G6: A/B vs AdamW
G7: Apply to NCA pretrain config (optimizer="muon" flag)
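For G1 and G2, the core pieces can be sketched as below. This is a minimal, dependency-free illustration, not the project's implementation: it uses plain Python lists so it runs without PyTorch, the quintic coefficients are the ones given in the referenced Muon post, and the names `newton_schulz`, `split_params`, and the `"emb"`/`"head"` name filter are hypothetical. The routing rule (2D weights to Muon, everything else to AdamW) follows G1/G2; keeping embedding and head matrices on AdamW is an assumption taken from the Muon post.

```python
# Quintic Newton-Schulz coefficients as given in the referenced Muon post.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    right_cols = list(zip(*right))
    return [[sum(x * y for x, y in zip(row, col)) for col in right_cols]
            for row in left]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def newton_schulz(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D gradient (hypothetical helper).

    The result has singular values near 1 (not exactly 1), which is
    what Muon uses as the update direction.
    """
    a, b, c = NS_COEFFS
    tall = len(grad) > len(grad[0])
    # Work on the wide orientation so the Gram matrix is the smaller one.
    x = transpose(grad) if tall else [row[:] for row in grad]
    # Normalize by the Frobenius norm so every singular value is <= 1.
    norm = sum(v * v for row in x for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))                      # A = X X^T
        gram_sq = matmul(gram, gram)                        # A^2
        poly = [[b * g + c * g2 for g, g2 in zip(r1, r2)]
                for r1, r2 in zip(gram, gram_sq)]           # bA + cA^2
        update = matmul(poly, x)
        x = [[a * v + u for v, u in zip(r1, r2)]
             for r1, r2 in zip(x, update)]                  # aX + (bA + cA^2)X
    return transpose(x) if tall else x


def split_params(shapes, adamw_keywords=("emb", "head")):
    """Route parameter names to Muon or AdamW by shape and name.

    `shapes` maps parameter name -> shape tuple. The keyword filter is an
    assumption for illustration; adapt it to the real parameter names.
    """
    muon, adamw = [], []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        excluded = any(key in name for key in adamw_keywords)
        (muon if is_matrix and not excluded else adamw).append(name)
    return muon, adamw
```

In a real training loop the same iteration would run on tensors (typically in low precision), and the two name lists would become two parameter groups, one per optimizer.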
OVERALL GATE
PASS_IF: BPB_overfit100 < 0.5
AND step/sec ≥ 18
AND stable_2000_steps
AND convergence ≥ 35% vs AdamW
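The PASS_IF condition can be encoded as a small check so the gate is evaluated the same way on every run. This is a sketch under assumptions: `GateMetrics` and `overall_gate` are hypothetical names, and "convergence ≥ 35% vs AdamW" is interpreted here as a fractional improvement (0.35), which the ticket does not define precisely.

```python
from dataclasses import dataclass


@dataclass
class GateMetrics:
    bpb_overfit100: float      # bits per byte on the overfit-100 run
    steps_per_sec: float       # measured training throughput
    stable_2000_steps: bool    # True if 2000 steps ran without divergence
    convergence_gain: float    # assumed: fractional gain vs AdamW (0.35 == 35%)


def overall_gate(metrics):
    """Return (passed, per-check results) for the OVERALL GATE."""
    checks = {
        "bpb_overfit100": metrics.bpb_overfit100 < 0.5,
        "steps_per_sec": metrics.steps_per_sec >= 18,
        "stable_2000_steps": metrics.stable_2000_steps,
        "convergence_gain": metrics.convergence_gain >= 0.35,
    }
    return all(checks.values()), checks
```

Returning the per-check dictionary alongside the overall result makes it easy to report which condition failed.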
Risk Mitigation
- If NaN on 3^k dims → fallback to NorMuon (stable variant for small models)
- If step/sec drops → reduce Newton-Schulz iterations to 5
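The two mitigations above could be applied as a config adjustment between runs. The sketch below is illustrative only: `apply_mitigations`, the `ns_steps` key, and the `"normuon"` value are hypothetical names, and the throughput threshold of 18 is borrowed from the OVERALL GATE as an assumption about what counts as a drop.

```python
import math


def apply_mitigations(config, loss, steps_per_sec, min_steps_per_sec=18.0):
    """Return a copy of `config` with the ticket's fallbacks applied."""
    updated = dict(config)
    # Non-finite loss (NaN or inf): switch to the NorMuon fallback.
    if not math.isfinite(loss):
        updated["optimizer"] = "normuon"
    # Throughput below target: cap Newton-Schulz iterations at 5.
    if steps_per_sec < min_steps_per_sec:
        updated["ns_steps"] = min(updated.get("ns_steps", 5), 5)
    return updated
```

The input dictionary is left unmodified so the original settings stay available for the A/B comparison.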
Commit
"feat(optimizer): integrate Muon with Newton-Schulz (35% speedup)"
Refs: R12, EPIC-110
Dependencies
- Requires: R04, R07, R08 (all ✅ DONE)
- Blocks: R13 (5×60K Scaling), R20 (End-to-End)