🔬 R12: Muon Optimizer Integration #535

@gHashTag

Task

Integrate Muon optimizer (Newton-Schulz orthogonalization) for 35% speedup vs AdamW.

Reference: https://kellerjordan.github.io/posts/muon/

Context

  • R04 SimpleBackward: ✅ DONE (tied-emb gradient correct)
  • R07 Overfit-100: ✅ DONE (BPB 1.2000, gates GREEN)
  • R08 Trinity-3k: ✅ DONE (hidden=243, dims 3^k)

Gates (G1-G7)

G1: Implement Muon for 2D weights

  • Newton-Schulz orthogonalization algorithm
  • Apply to attn/MLP 2D weight tensors
  • Default iterations=10 (note: the referenced Muon post defaults to 5 Newton-Schulz steps)
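A minimal sketch of the orthogonalization step in G1. It assumes the textbook cubic Newton-Schulz iteration X ← 1.5·X − 0.5·(X·Xᵀ)·X, not the tuned coefficients from the referenced post. `newton_schulz`, `matmul`, and `transpose` are hypothetical helper names. It is written in pure Python to stay dependency-free; a real integration would presumably use the training framework's tensor ops.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, iterations=10, eps=1e-7):
    """Approximate the semi-orthogonal polar factor of a 2D matrix g.

    Assumption: classical cubic iteration. Each step maps every singular
    value s to 1.5*s - 0.5*s**3, which pushes values in (0, sqrt(3)) toward 1.
    """
    # Frobenius normalization bounds all singular values by 1.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]

    # Work in the wide orientation so the Gram matrix x @ x.T is the smaller one.
    tall = len(x) > len(x[0])
    if tall:
        x = transpose(x)

    for _ in range(iterations):
        gram = matmul(x, transpose(x))
        gram_x = matmul(gram, x)
        x = [
            [1.5 * xv - 0.5 * gv for xv, gv in zip(x_row, g_row)]
            for x_row, g_row in zip(x, gram_x)
        ]

    return transpose(x) if tall else x


# Hypothetical 2x3 "gradient" whose rows are already orthogonal.
update = newton_schulz([[3.0, 0.0, 4.0], [0.0, 5.0, 0.0]])
```

In a Muon-style step, the matrix passed in would be the momentum-averaged gradient of one 2D weight, and the returned matrix would be scaled by the learning rate and subtracted from that weight.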

G2: Hybrid optimizer setup

  • Muon → attn/MLP (2D weights)
  • AdamW → embed/norm (1D/3D params)
  • Learning rate schedule: cosine decay (0.1 → 1e-5)
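The routing and schedule in G2 could be sketched as below. The parameter names (`attn.q_proj.weight` and so on) and the helper names `assign_optimizer` and `cosine_decay` are hypothetical. The sketch also assumes the schedule is a plain half-cosine from 0.1 down to 1e-5 with no warmup, and that embeddings are matched by name because they are 2D yet are listed under AdamW above.

```python
import math


def assign_optimizer(name, shape):
    """Route a parameter to "muon" or "adamw" by name and rank.

    Assumption: embeddings are identified by the substring "embed".
    """
    is_matrix = len(shape) == 2
    is_embedding = "embed" in name
    return "muon" if is_matrix and not is_embedding else "adamw"


def cosine_decay(step, total_steps, lr_max=0.1, lr_min=1e-5):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical parameter table: name -> shape.
params = {
    "embed.weight": (256, 243),
    "attn.q_proj.weight": (243, 243),
    "mlp.fc1.weight": (243, 729),
    "norm.weight": (243,),
}
groups = {name: assign_optimizer(name, shape) for name, shape in params.items()}
```

Whether both optimizers share this schedule or only Muon starts at 0.1 is not stated in this issue, so the single schedule here is a simplification.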

G3: Smoke test on overfit-100

  • Target: BPB < 0.5 on 100-sample dataset
  • Baseline: AdamW achieved ~0.7

G4: Wall-time check

  • Target: step/sec ≥ 18 (production throughput)
  • Baseline: AdamW ~13 step/sec on M1 Pro
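One way the G4 check might be measured, as a sketch only: `steps_per_second` and `meets_throughput_gate` are hypothetical helpers, `step_fn` stands in for one full training step, and the warmup count is an assumption meant to exclude one-time setup costs from the timing.

```python
import time


def steps_per_second(step_fn, num_steps=50, warmup=5):
    """Time num_steps calls of step_fn after a short untimed warmup."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_steps / elapsed


def meets_throughput_gate(rate, target=18.0):
    """G4 threshold: at least `target` steps per second."""
    return rate >= target


# Placeholder workload standing in for a real optimizer step.
rate = steps_per_second(lambda: sum(range(1000)))
```

On an accelerator backend with asynchronous execution, the step function would also need to block until the device finishes, otherwise the measured rate only reflects dispatch time.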

G5: Stability test

  • Run 2000 steps, check for NaN/Inf
  • Verify grad_norm finite throughout
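The G5 check can be expressed as a scan over logged values. This is a sketch that assumes per-step loss and grad_norm are recorded as plain floats; `first_unstable_step` is a hypothetical helper name.

```python
import math


def first_unstable_step(losses, grad_norms):
    """Return the index of the first step with a NaN/Inf loss or grad_norm.

    Returns None when every logged value is finite.
    """
    for step, (loss, grad_norm) in enumerate(zip(losses, grad_norms)):
        if not (math.isfinite(loss) and math.isfinite(grad_norm)):
            return step
    return None


# Hypothetical 4-step log in which the gradient norm overflows at step 2.
bad_step = first_unstable_step(
    [2.1, 1.8, 1.7, 1.6],
    [0.9, 1.1, float("inf"), 1.0],
)
```

A 2000-step run passes when the function returns None for the full log.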

G6: A/B vs AdamW

  • Muon convergence ≥ 35% faster
  • Compare final loss curves side-by-side
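This issue does not define how "35% faster" is measured. The sketch below assumes one possible reading: the fractional reduction in steps needed to first reach a shared target loss. `steps_to_target` and `convergence_gain` are hypothetical names, and the loss curves are made-up illustration data, not results.

```python
def steps_to_target(losses, target):
    """Return the 1-based step at which the loss first reaches target, or None."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None


def convergence_gain(adamw_losses, muon_losses, target):
    """Fraction of steps saved by Muon relative to AdamW at the same target.

    Returns None if either run never reaches the target.
    """
    adamw_steps = steps_to_target(adamw_losses, target)
    muon_steps = steps_to_target(muon_losses, target)
    if adamw_steps is None or muon_steps is None:
        return None
    return 1.0 - muon_steps / adamw_steps


# Illustrative curves only: AdamW reaches 0.5 at step 10, Muon at step 6.
adamw_curve = [2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5]
muon_curve = [2.0, 1.2, 0.8, 0.6, 0.55, 0.5]
gain = convergence_gain(adamw_curve, muon_curve, target=0.5)
```

If the intended metric is wall-clock time to target rather than step count, the same comparison would apply to timestamps instead of step indices.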

G7: Apply to NCA pretrain config

  • Ready for R11 swap when NCA complete
  • Config switch: optimizer="muon" flag

OVERALL GATE

PASS_IF: BPB_overfit100 < 0.5 
      AND step/sec ≥ 18 
      AND stable_2000_steps 
      AND convergence ≥ 35% vs AdamW
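The PASS_IF condition translates directly into a boolean check. In this sketch `overall_gate` and its argument names are hypothetical, and the convergence figure is assumed to be passed as a fraction (0.35 for 35%).

```python
def overall_gate(bpb_overfit100, steps_per_sec, stable_2000_steps, convergence_gain):
    """Evaluate the overall gate; every condition must hold."""
    return bool(
        bpb_overfit100 < 0.5
        and steps_per_sec >= 18
        and stable_2000_steps
        and convergence_gain >= 0.35
    )


# Hypothetical measurements, not results from an actual run.
passed = overall_gate(
    bpb_overfit100=0.45,
    steps_per_sec=18.5,
    stable_2000_steps=True,
    convergence_gain=0.40,
)
```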

Risk Mitigation

  • If NaN on 3^k dims → fallback to NorMuon (stable variant for small models)
  • If step/sec drops → reduce Newton-Schulz iterations to 5

Commit

"feat(optimizer): integrate Muon with Newton-Schulz (35% speedup)"
Refs: R12, EPIC-110

Dependencies

  • Requires: R04, R07, R08 (all ✅ DONE)
  • Blocks: R13 (5×60K Scaling), R20 (End-to-End)
