dusterbloom: perf(models): fused MoE gate+up — 3→2 expert matmuls per layer#12

Open
dusterbloom wants to merge 3 commits into main from dusterbloom/perf-foundational-wins

@dusterbloom
Owner

Summary

SwitchMlpWeights::forward_gather_fused() lazy-concatenates gate+up weights on first call, then dispatches a single gather_qmm instead of two separate calls. FfnBlock::forward() now routes through the fused path instead of forward_gather_global_sort().
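
The fusion idea can be sketched in isolation. This is a minimal illustration, not the code from this PR: plain row-major `f32` matrices stand in for the quantized expert weights, `matvec` stands in for `gather_qmm`, and the names `Matrix`, `gate_up_separate`, `gate_up_fused`, and `silu` are hypothetical. The SiLU gating is an assumption about the activation. Stacking the gate and up weights along the output axis lets one matmul produce both halves, which are then split before the activation.

```rust
// Illustrative sketch only: dense f32 matrices replace the quantized
// expert weights, and `matvec` replaces the real `gather_qmm` kernel.

/// Row-major matrix with `rows` output features and `cols` input features.
#[derive(Clone, Debug, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    fn new(rows: usize, cols: usize, data: Vec<f32>) -> Self {
        assert_eq!(data.len(), rows * cols, "shape does not match data length");
        Self { rows, cols, data }
    }

    /// Computes y = W x, one dot product per output row.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols, "input length must equal cols");
        self.data
            .chunks_exact(self.cols)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect()
    }

    /// Stacks `self` on top of `other` along the output (row) axis.
    fn concat_rows(&self, other: &Matrix) -> Matrix {
        assert_eq!(self.cols, other.cols, "input dims must match");
        let mut data = self.data.clone();
        data.extend_from_slice(&other.data);
        Matrix::new(self.rows + other.rows, self.cols, data)
    }
}

/// SiLU activation (assumed here; the real model may differ).
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// Unfused reference: two separate matmuls, then gating.
fn gate_up_separate(gate: &Matrix, up: &Matrix, x: &[f32]) -> Vec<f32> {
    let g = gate.matvec(x);
    let u = up.matvec(x);
    g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect()
}

/// Fused: one matmul against the stacked [gate; up] weight, then split.
fn gate_up_fused(fused: &Matrix, x: &[f32]) -> Vec<f32> {
    let y = fused.matvec(x);
    // First half of the output is the gate projection, second half is up.
    let (g, u) = y.split_at(fused.rows / 2);
    g.iter().zip(u).map(|(g, u)| silu(*g) * u).collect()
}
```

In this sketch every output row is computed by the same dot product in both paths, so the fused and unfused results match exactly. A real quantized kernel may only agree within a tolerance, which is presumably what a parity test would check.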

Numbers (35B-A3B-3bit on M4 base)

| Metric | Before | After | Δ |
|---|---|---|---|
| S=1 decode | 27 ms | 17 ms | −37% |
| S=16 verify | 253 ms | 112 ms | −56% |
| MoE/layer at K=1 | ~0.68 ms | 0.47 ms | −31% |

Test plan

  • cargo check -p higgs-models — clean
  • cargo test -p higgs-models --lib qwen3_next:: (71 passed, 0 failed; includes new test_moe_gate_up_fusion_parity)
  • cargo run -p higgs-bench --release --bin bench_decode -- --model <qwen3_5_moe-key> — A/B against main on a Qwen3.5-MoE checkpoint
  • cargo clippy -p higgs-models — clean

Notes

  • Single-file change: crates/higgs-models/src/qwen3_next.rs (+175 / −9)
  • The fused weight tensor is built lazily on first forward call; concat happens once per model load, not per token
  • Co-authored with Claude Sonnet 4.6
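
The "build once, reuse on every call" behaviour described in the notes above can be sketched with `std::cell::OnceCell`. This is an assumption about the mechanism, not the PR's actual implementation: the struct `ExpertWeights`, its fields, and the `concat_calls` counter are hypothetical and exist only to make the single concatenation observable.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical stand-in for a per-layer expert weight holder.
struct ExpertWeights {
    gate: Vec<f32>,
    up: Vec<f32>,
    /// Filled on the first call to `fused_gate_up`, then reused.
    fused: OnceCell<Vec<f32>>,
    /// Counts how often the concatenation actually runs (demo only).
    concat_calls: Cell<usize>,
}

impl ExpertWeights {
    fn new(gate: Vec<f32>, up: Vec<f32>) -> Self {
        Self {
            gate,
            up,
            fused: OnceCell::new(),
            concat_calls: Cell::new(0),
        }
    }

    /// Returns the concatenated [gate; up] buffer, building it on first use.
    fn fused_gate_up(&self) -> &[f32] {
        self.fused.get_or_init(|| {
            self.concat_calls.set(self.concat_calls.get() + 1);
            let mut fused = Vec::with_capacity(self.gate.len() + self.up.len());
            fused.extend_from_slice(&self.gate);
            fused.extend_from_slice(&self.up);
            fused
        })
    }
}
```

Calling `fused_gate_up()` any number of times runs the concatenation once. `OnceCell` is single-threaded; `std::sync::OnceLock` is the thread-safe counterpart if the weights are shared across threads.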

dusterbloom and others added 2 commits May 3, 2026 21:23
SwitchMlpWeights::forward_gather_fused() lazy-concatenates gate+up
weights on first call, then dispatches a single gather_qmm instead
of two separate calls. FfnBlock::forward() now routes through the
fused path instead of forward_gather_global_sort().

Measured on 35B-A3B-3bit M4 base:
- S=1 decode:   27ms → 17ms  (−37%)
- S=16 verify: 253ms → 112ms (−56%)
- MoE/layer at K=1: 0.47ms (down from ~0.68ms)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Reflow `let fw/fs/fb = ops::concatenate_axis(..)` from broken-line
indentation back onto single lines so `cargo fmt --all -- --check`
passes in CI.
@dusterbloom dusterbloom force-pushed the dusterbloom/perf-foundational-wins branch from 2e9e083 to 18bf33f Compare May 3, 2026 21:26
Two errors flagged by `-D clippy::doc-markdown` and
`-D clippy::cast-sign-loss`/`-D clippy::as-conversions`:

- Backtick `MoE` and `gather_qmm` in the `fused_gate_up` doc comment.
- Replace `top_k as u32` with the same `u32::try_from(top_k).map_err(...)`
  pattern already used by `forward_gather_global_sort`.
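
For reference, the checked-conversion pattern named in this commit message looks roughly like the sketch below. The source type of `top_k` is assumed to be `i32`, and `ShapeError` and `top_k_as_u32` are hypothetical names; the PR's real error type is not shown on this page.

```rust
/// Hypothetical error type standing in for the crate's own error.
#[derive(Debug, PartialEq)]
struct ShapeError(String);

/// Converts `top_k` to `u32`, returning an error instead of silently
/// wrapping a negative value the way `top_k as u32` would.
fn top_k_as_u32(top_k: i32) -> Result<u32, ShapeError> {
    u32::try_from(top_k)
        .map_err(|e| ShapeError(format!("top_k {top_k} does not fit in u32: {e}")))
}
```

With `as`, a negative `top_k` such as `-1` would become `u32::MAX`; the `try_from` version surfaces that case as an `Err` the caller has to handle.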

2 participants