🔬 R12: Muon Optimizer Integration #535

## Task
Integrate the Muon optimizer (Newton-Schulz orthogonalization), targeting a 35% speedup over AdamW (gated in G4 and G6).

**Reference:** https://kellerjordan.github.io/posts/muon/

## Context
- R04 SimpleBackward: ✅ DONE — tied-emb gradient correct
- R07 Overfit-100: ✅ DONE — BPB 1.2000, gates GREEN
- R08 Trinity-3k: ✅ DONE — hidden=243, dims 3^k

## Gates (G1-G7)

### G1: Implement Muon for 2D weights
- [ ] Newton-Schulz orthogonalization algorithm
- [ ] Apply to attn/MLP 2D weight tensors
- [ ] Default iterations=10 (note: the referenced post's implementation defaults to 5)
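
The G1 items can be prototyped without a tensor library. Below is a minimal, dependency-free sketch of the quintic Newton-Schulz iteration, using the coefficients published in the referenced post. The helper names (`matmul`, `transpose`, `newton_schulz`) and the `eps` value are assumptions for illustration, not project code; a real integration would port this to the project's tensor library.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product (illustrative helper)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def newton_schulz(g, steps, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Approximately orthogonalize a 2D matrix g.

    Each step applies X <- a*X + (b*A + c*A@A) @ X with A = X @ X^T.
    With the default quintic coefficients the singular values end up
    near 1 rather than exactly 1.
    """
    a, b, c = coeffs
    tall = len(g) > len(g[0])
    # Work on the wide orientation so X @ X^T is the smaller Gram matrix.
    x = transpose(g) if tall else [row[:] for row in g]
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        n = len(gram)
        poly = [[b * gram[i][j] + c * gram2[i][j] for j in range(n)] for i in range(n)]
        px = matmul(poly, x)
        x = [[a * x[i][j] + px[i][j] for j in range(len(x[0]))] for i in range(len(x))]
    return transpose(x) if tall else x
```

Passing `coeffs=(1.5, -0.5, 0.0)` selects the classic cubic iteration, which converges to an exactly orthogonal factor but needs more steps; that can serve as a reference when debugging the faster quintic variant.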

### G2: Hybrid optimizer setup
- [ ] Muon → attn/MLP (2D weights)
- [ ] AdamW → embed/norm (1D/3D params, plus the 2D embedding table, which the reference also keeps off Muon)
- [ ] Learning rate schedule: cosine decay (0.1 → 1e-5)
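
A rough sketch of the G2 split and schedule follows. The function names, the example parameter names, and the substring filter (`"embed"`, `"lm_head"`) are hypothetical; the actual naming scheme of the model is not given in this issue, so the filter would need adapting.

```python
import math


def route_param(name, ndim):
    """Pick an optimizer for one parameter (hypothetical naming scheme).

    2D attn/MLP matrices go to Muon; embeddings, norms and everything
    that is not a plain 2D matrix stay on AdamW.
    """
    is_embedding = any(key in name for key in ("embed", "lm_head"))
    if ndim == 2 and not is_embedding:
        return "muon"
    return "adamw"


def cosine_decay(step, total_steps, lr_max=0.1, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    t = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

Routing by tensor rank alone would send the embedding table to Muon, which is why the sketch also checks the name.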

### G3: Smoke test on overfit-100
- [ ] Target: BPB < 0.5 on 100-sample dataset
- [ ] Baseline: AdamW achieved ~0.7
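
For reference when comparing against the G3 target, here is a small sketch of the metric, assuming BPB means bits per byte and that the training loop accumulates negative log-likelihood in nats. Both points are assumptions; the issue does not define the metric, and `bits_per_byte` is a hypothetical helper name.

```python
import math


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert summed NLL (in nats) over a text of n_bytes bytes to bits per byte."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    return total_nll_nats / (math.log(2) * n_bytes)
```

If the model is token-level rather than byte-level, `n_bytes` should be the byte length of the evaluated text, not the token count.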

### G4: Wall-time check
- [ ] Target: step/sec ≥ 18 (production throughput)
- [ ] Baseline: AdamW ~13 step/sec on M1 Pro
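
One way the G4 number could be measured is sketched below. `measure_steps_per_sec` and its `step_fn` argument are hypothetical names; `step_fn` stands for whatever callable runs one full training step.

```python
import time


def measure_steps_per_sec(step_fn, n_steps=200, warmup=20):
    """Time n_steps calls of step_fn after a warmup and return steps per second."""
    for _ in range(warmup):
        step_fn()  # warmup calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = max(time.perf_counter() - start, 1e-12)
    return n_steps / elapsed
```

On backends that execute asynchronously, `step_fn` needs to block until the step has finished (for example by reading the loss back to the host), otherwise the measured rate will be too optimistic.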

### G5: Stability test
- [ ] Run 2000 steps, check for NaN/Inf
- [ ] Verify grad_norm finite throughout
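
The G5 check can be reduced to a scan over logged values. This is a minimal sketch assuming the run logs one loss and one grad_norm per step; `first_unstable_step` is a hypothetical helper name.

```python
import math


def first_unstable_step(losses, grad_norms):
    """Return the index of the first step with a NaN/Inf loss or grad_norm, else None."""
    for step, (loss, gnorm) in enumerate(zip(losses, grad_norms)):
        if not (math.isfinite(loss) and math.isfinite(gnorm)):
            return step
    return None
```

A run passes G5 under this sketch when the function returns `None` for all 2000 logged steps.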

### G6: A/B vs AdamW
- [ ] Muon reaches a fixed target loss ≥ 35% faster than AdamW
- [ ] Compare final loss curves side-by-side
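
The issue does not pin down how "35% faster" is computed. The sketch below takes one possible reading, labeled here as an assumption: the relative reduction in the number of steps needed to first reach a shared target loss. The function names are hypothetical.

```python
def steps_to_target(losses, target):
    """Return the 1-based step at which the loss first reaches target, else None."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None


def convergence_gain(adamw_losses, muon_losses, target):
    """Fraction of steps saved by Muon relative to AdamW (assumed definition).

    Returns None when either run never reaches the target.
    """
    base = steps_to_target(adamw_losses, target)
    cand = steps_to_target(muon_losses, target)
    if base is None or cand is None:
        return None
    return 1.0 - cand / base
```

Under this reading, G6 passes when `convergence_gain(...) >= 0.35`. Measuring wall time to target instead would also capture the per-step cost of the Newton-Schulz iterations.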

### G7: Apply to NCA pretrain config
- [ ] Ready for R11 swap when NCA complete
- [ ] Config switch: `optimizer="muon"` flag
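
A sketch of what the G7 switch might look like is given below. Only the `optimizer="muon"` flag comes from this issue; the `TrainConfig` class, the `ns_steps` field, and the `"adamw"` default are hypothetical.

```python
from dataclasses import dataclass

VALID_OPTIMIZERS = ("adamw", "muon")


@dataclass
class TrainConfig:
    """Hypothetical training config exposing the optimizer switch."""

    optimizer: str = "adamw"  # assumed default so existing configs keep working
    ns_steps: int = 10        # Newton-Schulz iterations, only used by Muon

    def __post_init__(self):
        if self.optimizer not in VALID_OPTIMIZERS:
            raise ValueError(f"unknown optimizer: {self.optimizer!r}")
        if self.ns_steps < 1:
            raise ValueError("ns_steps must be at least 1")


# Swapping the pretrain config over is then a one-field change.
nca_config = TrainConfig(optimizer="muon")
```

Validating the flag at construction time means a typo in the config fails before any training step runs.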

## OVERALL GATE
```
PASS_IF: BPB_overfit100 < 0.5 
      AND step/sec ≥ 18 
      AND stable_2000_steps 
      AND convergence ≥ 35% vs AdamW
```
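
The PASS_IF rule translates directly into a small check. In this sketch the function and argument names are hypothetical, and the convergence figure is assumed to be expressed as a fraction (0.35 for 35%).

```python
def overall_gate(bpb_overfit100, steps_per_sec, stable_2000_steps, convergence_gain):
    """Evaluate the four PASS_IF conditions and report each one separately."""
    checks = {
        "bpb_overfit100 < 0.5": bpb_overfit100 < 0.5,
        "step/sec >= 18": steps_per_sec >= 18,
        "stable_2000_steps": bool(stable_2000_steps),
        "convergence >= 35%": convergence_gain >= 0.35,
    }
    return all(checks.values()), checks
```

Returning the per-condition results alongside the verdict makes it clear which gate failed when the overall result is `False`.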

## Risk Mitigation
- If NaN on 3^k dims → fallback to **NorMuon** (stable variant for small models)
- If step/sec drops → reduce Newton-Schulz iterations to 5

## Commit
"feat(optimizer): integrate Muon with Newton-Schulz (35% speedup)"
Refs: R12, EPIC-110

## Dependencies
- Requires: R04, R07, R08 (all ✅ DONE)
- Blocks: R13 (5×60K Scaling), R20 (End-to-End)

