feat: add expert_wise_scale support for per-expert FP8 quantization in MoE models #35

Open

lifelongeeek wants to merge 2 commits into aws-neuron:main from lifelongeeek:feat/expert-wise-scale

Conversation

@lifelongeeek

Description

Add per-expert scale quantization support for MoE expert MLPs in NeuronX Distributed Inference.

When quantizing MoE models (e.g., Qwen3-30B-A3B) to FP8, the existing implementation uses a single scale across all experts in the fused operator. This destroys per-expert precision and results in near-zero accuracy on downstream benchmarks. This PR introduces per-expert scale preservation during HF-to-Neuron state dict conversion and a two-pass quantization flow that applies specifically to expert MLP modules.

Changes

1. Config (config.py)

  • Add expert_wise_scale boolean to MoENeuronConfig (default: False, fully backward compatible)
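A minimal sketch of how this flag might sit on the config; the base class and abbreviated __init__ signature here are illustrative, and only the expert_wise_scale field itself comes from this PR:

# Sketch only: illustrates the new flag on MoENeuronConfig
# (src/neuronx_distributed_inference/models/config.py).
class MoENeuronConfig(NeuronConfig):
    def __init__(self, expert_wise_scale: bool = False, **kwargs):
        super().__init__(**kwargs)
        # True  -> keep one FP8 scale per expert, fused to [num_experts, 1, dim]
        # False -> average scales across experts to [1, 1, dim] (existing behavior)
        self.expert_wise_scale = expert_wise_scale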

2. State Dict Conversion (modeling_qwen3_moe.py)

  • When expert_wise_scale=True: fuse per-expert scales into [num_experts, 1, dim] tensors for both gate_up_proj and down_proj
  • When expert_wise_scale=False: average scales across experts into [1, 1, dim] tensors (existing behavior preserved)
  • Handle padding for non-divisible intermediate sizes
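A rough sketch of the two fusion paths described above; the helper name and the list-of-per-expert-scales input are illustrative, not the actual NxDI conversion code:

import torch

def fuse_scales(per_expert_scales, expert_wise_scale):
    """per_expert_scales: list of [dim] scale tensors, one per expert
    (for gate_up_proj, gate and up scales would be concatenated along dim
    first, with padding for non-divisible intermediate sizes)."""
    stacked = torch.stack(per_expert_scales, dim=0)  # [num_experts, dim]
    if expert_wise_scale:
        # Preserve each expert's scale: [num_experts, 1, dim]
        return stacked.unsqueeze(1)
    # Existing behavior: average across experts: [1, 1, dim]
    return stacked.mean(dim=0, keepdim=True).unsqueeze(0)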

3. Quantization Conversion (model_wrapper.py)

  • When expert_wise_scale=True with per_channel_symmetric:
    • Pass 1: Convert non-expert modules (attention, lm_head, etc.) with per_channel_symmetric, skip expert_mlps
    • Pass 2: Convert expert MoE modules with expert_wise_per_channel_symmetric (using include=["*expert_mlps.mlp_op*"])
  • When expert_wise_scale=False: standard single-pass conversion (no behavior change)
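In pseudocode, the two-pass flow looks roughly like the following; convert() and its keyword arguments stand in for the actual NxDI quantization API, while the quantization type names and the *expert_mlps.mlp_op* pattern are taken from this PR:

def convert_for_expert_wise_scale(model, neuron_config):
    if getattr(neuron_config, "expert_wise_scale", False) and (
        neuron_config.quantization_type == "per_channel_symmetric"
    ):
        # Pass 1: attention, lm_head, etc. -- everything except expert MLPs.
        model = convert(model, q_type="per_channel_symmetric",
                        exclude=["*expert_mlps.mlp_op*"])
        # Pass 2: expert MLPs only, keeping one scale per expert.
        model = convert(model, q_type="expert_wise_per_channel_symmetric",
                        include=["*expert_mlps.mlp_op*"])
    else:
        # expert_wise_scale=False: unchanged single-pass conversion.
        model = convert(model, q_type=neuron_config.quantization_type)
    return model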

Model Information

Model Name: Qwen3-30B-A3B (and applicable to all Qwen3 MoE models)

Model Architecture: MoE Decoder-only transformer (48 layers, 128 experts, top-8 routing, GQA)

Purpose: Text generation — improving FP8 quantization accuracy for MoE models served via vLLM on Trainium

Checklist

Required Components

  • Accuracy Test: Validated via lm-eval (IFEval, TruthfulQA) and vllm benchmark_serving — see Test Results below
  • README.md: N/A — this is a core infrastructure change, not a new model contribution under contrib/
  • Source Code (src/): Changes follow existing NxDI patterns (state dict conversion in modeling file, quantization config in config, convert logic in model_wrapper)

Optional Components

  • Unit Tests: Not included — validated end-to-end on hardware

Folder Structure

This PR modifies existing core files, not the contrib/ folder:

src/neuronx_distributed_inference/models/
  config.py                              (+1 line)
  model_wrapper.py                       (+47 lines, -7 lines)
  qwen3_moe/modeling_qwen3_moe.py        (+96 lines)

Testing

How did you test this change?

End-to-end quantization → compilation → serving → evaluation pipeline on Qwen3-30B-A3B with a single trn2.3xlarge instance. All tests performed sequentially to avoid OOM.

  1. Quantization: FP8 (f8e4m3, per_channel_symmetric) with --expert-wise-scale flag
  2. Compilation & Serving: vLLM with NxDI backend, tp_degree=4, moe_tp_degree=4, moe_ep_degree=1, max_model_len=1024
  3. Accuracy: lm_eval --tasks ifeval,truthfulqa_gen --num_fewshot 0 --limit 100
  4. Throughput: benchmark_serving.py --dataset-name random --num-prompts 100 --random-input-len 128 --random-output-len 128 --max-concurrency 1
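For step 3, one way to drive lm_eval against the running vLLM server from Python is sketched below; it assumes lm-evaluation-harness's local-completions backend, and the exact model_args keys may differ between harness versions:

import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "base_url=http://localhost:13579/v1/completions,"
        "model=/path/to/Qwen3-30B-A3B,"
        "tokenizer=/path/to/Qwen3-30B-A3B"
    ),
    tasks=["ifeval", "truthfulqa_gen"],
    num_fewshot=0,
    limit=100,
)
print(results["results"])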

Test Results:

| Variant | IFEval inst_strict | IFEval prompt_strict | TruthfulQA bleu_acc | TruthfulQA rougeL_acc | Output tok/s | TTFT (ms) |
|---|---|---|---|---|---|---|
| BF16 unquantized | 0.853 | 0.780 | 0.480 | 0.530 | 57.8 | 424.7 |
| FP8 per-expert scale (expert_wise_scale=True) | 0.405 | 0.220 | 0.450 | 0.420 | 58.7 | 468.0 |
| FP8 averaged scale (expert_wise_scale=False) | 0.000 | 0.000 | 0.030 | 0.030 | 57.2 | 487.5 |
| FP8 E0 scale (baseline, no fusion) | 0.000 | 0.000 | 0.010 | 0.040 | | |
| FP8 router_only (expert MLP excluded) | 0.184 | 0.060 | 0.340 | 0.460 | 56.9 | 463.5 |
| FP8 router + per-expert scale | 0.166 | 0.080 | 0.280 | 0.370 | 58.7 | 468.4 |

Key findings:

  • Per-expert scale retains ~50% of BF16 accuracy while averaged/E0 scale produces near-zero results
  • Throughput is unchanged or slightly improved (FP8 reduces memory bandwidth)
  • Router quantization degrades accuracy significantly — should NOT be combined with expert quantization

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.27
  • Instance Type(s): trn2.3xlarge (1 Neuron device, 4 NeuronCores, 96 GB HBM)
  • PyTorch Version: 2.8.0
  • Python Version: 3.12

Additional Information

Usage Example

Step 1: Quantize with per-expert scales

# quantize_qwen3_moe_per_expert_scale.py
import os
from neuronx_distributed_inference.models.config import MoENeuronConfig
from neuronx_distributed_inference.models.qwen3_moe.modeling_qwen3_moe import (
    Qwen3MoeInferenceConfig,
    NeuronQwen3MoeForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

model_path = "/path/to/Qwen3-30B-A3B"
quantized_model_path = "/path/to/quantized_checkpoint"
os.makedirs(quantized_model_path, exist_ok=True)

neuron_config = MoENeuronConfig(
    tp_degree=4,
    moe_tp_degree=4,
    moe_ep_degree=1,
    batch_size=1,
    max_context_length=1024,
    seq_len=1024,
    n_positions=1024,
    quantized=True,
    quantized_checkpoints_path=quantized_model_path,
    quantization_dtype="f8e4m3",                # FP8 E4M3
    quantization_type="per_channel_symmetric",  # per-output-channel scales, computed per expert
    modules_to_not_convert=["mlp.gate"],        # exclude router from quantization
)

config = Qwen3MoeInferenceConfig(
    neuron_config, load_config=load_pretrained_config(model_path)
)
NeuronQwen3MoeForCausalLM.save_quantized_state_dict(model_path, config)

The quantized checkpoint will contain individual per-expert scales (e.g., experts.{e}.gate_proj.scale).
At serving time, these are fused into [num_experts, 1, dim] tensors when expert_wise_scale=True.
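To confirm the per-expert scales are present before serving, something like the following can list them; it assumes the quantized checkpoint is written as safetensors shards (adjust the loading if it is a plain torch state dict), and the experts.{e}.*.scale key pattern is taken from the note above:

import glob
from safetensors import safe_open

for shard in sorted(glob.glob("/path/to/quantized_checkpoint/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            # e.g. ...experts.0.gate_proj.scale, ...experts.0.down_proj.scale
            if "experts." in key and key.endswith(".scale"):
                print(shard, key, f.get_slice(key).get_shape())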

Step 2: Serve with expert_wise_scale enabled

VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference \
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/Qwen3-30B-A3B \
  --max-model-len 1024 --tensor-parallel-size 4 \
  --port 13579 --max-num-seqs 1 \
  --override-neuron-config '{
    "quantized": true,
    "quantized_checkpoints_path": "/path/to/quantized_checkpoint",
    "quantization_dtype": "f8e4m3",
    "quantization_type": "per_channel_symmetric",
    "expert_wise_scale": true,
    "tp_degree": 4, "moe_tp_degree": 4, "moe_ep_degree": 1,
    "batch_size": 1, "max_context_length": 1024, "seq_len": 1024,
    "async_mode": true,
    "modules_to_not_convert": ["mlp.gate"],
    "on_device_sampling_config": {"dynamic": true, "global_topk": 20}
  }'
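Once the server reports ready, a quick smoke test against the OpenAI-compatible completions endpoint (port 13579, matching the command above) can be run from Python; the prompt and request fields below are just an example:

import requests

resp = requests.post(
    "http://localhost:13579/v1/completions",
    json={
        "model": "/path/to/Qwen3-30B-A3B",  # must match --model passed to the server
        "prompt": "Briefly explain mixture-of-experts routing.",
        "max_tokens": 64,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])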

Known Limitations

  • Expert Parallelism (moe_ep_degree > 1) is NOT supported with this feature due to an NxDI limitation: "Selective Loading with Expert parallelism is not supported in token generation"
  • Only validated on Qwen3 MoE architecture; other MoE models (Mixtral, DeepSeek, etc.) may require additional state dict conversion logic
  • Shared experts are excluded from quantization (modules_to_not_convert)

Related Issues

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Tested end-to-end with vLLM Neuron backend (openai.api_server + benchmark_serving)

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

…n MoE models

Add per-expert scale quantization support for MoE expert MLPs, which
preserves individual expert scale factors instead of averaging them.
This significantly improves FP8 quantization accuracy for MoE models
like Qwen3-30B-A3B.

Changes:
- config.py: Add expert_wise_scale config option to MoENeuronConfig
- modeling_qwen3_moe.py: Fuse per-expert scales during HF->Neuron
  state_dict conversion (gate_up_proj and down_proj), with fallback
  to averaged scales when expert_wise_scale=False
- model_wrapper.py: Two-pass quantization conversion when
  expert_wise_scale=True (Pass 1: per_channel_symmetric for non-expert
  modules, Pass 2: expert_wise_per_channel_symmetric for expert MLPs)

Validated on Qwen3-30B-A3B (trn2.3xlarge):
- IFEval inst_strict: 0.405 (vs 0.00 with averaged scale)
- TruthfulQA bleu_acc: 0.45 (vs 0.03 with averaged scale)
- Throughput: 58.7 tok/s (unchanged from baseline)
…config

Move use_expert_wise_scale into the per-model loop and read from
model.config.neuron_config instead of self.neuron_config. This
ensures correct behavior in fused speculation mode where a non-MoE
draft model and MoE target model are processed in the same loop —
previously the global flag would incorrectly apply two-pass expert
quantization to a draft model that has no expert_mlps module.
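A hedged sketch of the per-model check described in this commit message; apart from neuron_config.expert_wise_scale and quantization_type, the loop variable, attribute path, and helper names are illustrative:

# Read the flag from each model's own config so a non-MoE draft model in
# fused speculation never triggers the two-pass expert path.
for model_wrapper in model_wrappers:  # e.g. [draft, target]
    nc = model_wrapper.config.neuron_config
    use_expert_wise_scale = getattr(nc, "expert_wise_scale", False)
    if use_expert_wise_scale and nc.quantization_type == "per_channel_symmetric":
        run_two_pass_expert_conversion(model_wrapper)   # hypothetical helper
    else:
        run_single_pass_conversion(model_wrapper)       # hypothetical helper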