feat: add expert_wise_scale support for per-expert FP8 quantization in MoE models #35

Open

lifelongeeek wants to merge 2 commits into aws-neuron:main from lifelongeeek:feat/expert-wise-scale

Conversation

@lifelongeeek

Description

Add per-expert scale quantization support for MoE expert MLPs in NeuronX Distributed Inference.

When quantizing MoE models (e.g., Qwen3-30B-A3B) to FP8, the existing implementation uses a single scale across all experts in the fused operator. This destroys per-expert precision and results in near-zero accuracy on downstream benchmarks. This PR introduces per-expert scale preservation during HF-to-Neuron state dict conversion and a two-pass quantization flow that applies specifically to expert MLP modules.

Changes

1. Config (config.py)

  • Add expert_wise_scale boolean to MoENeuronConfig (default: False, fully backward compatible)
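A minimal sketch of how this flag might sit on the config; the base class and abbreviated __init__ signature here are illustrative, and only the expert_wise_scale field itself comes from this PR:

# Sketch only: illustrates the new flag on MoENeuronConfig
# (src/neuronx_distributed_inference/models/config.py).
class MoENeuronConfig(NeuronConfig):
    def __init__(self, expert_wise_scale: bool = False, **kwargs):
        super().__init__(**kwargs)
        # True  -> keep one FP8 scale per expert, fused to [num_experts, 1, dim]
        # False -> average scales across experts to [1, 1, dim] (existing behavior)
        self.expert_wise_scale = expert_wise_scale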

2. State Dict Conversion (modeling_qwen3_moe.py)

  • When expert_wise_scale=True: fuse per-expert scales into [num_experts, 1, dim] tensors for both gate_up_proj and down_proj
  • When expert_wise_scale=False: average scales across experts into [1, 1, dim] tensors (existing behavior preserved)
  • Handle padding for non-divisible intermediate sizes
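A rough sketch of the two fusion paths described above; the helper name and the list-of-per-expert-scales input are illustrative, not the actual NxDI conversion code:

import torch

def fuse_scales(per_expert_scales, expert_wise_scale):
    """per_expert_scales: list of [dim] scale tensors, one per expert
    (for gate_up_proj, gate and up scales would be concatenated along dim
    first, with padding for non-divisible intermediate sizes)."""
    stacked = torch.stack(per_expert_scales, dim=0)  # [num_experts, dim]
    if expert_wise_scale:
        # Preserve each expert's scale: [num_experts, 1, dim]
        return stacked.unsqueeze(1)
    # Existing behavior: average across experts: [1, 1, dim]
    return stacked.mean(dim=0, keepdim=True).unsqueeze(0)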

3. Quantization Conversion (model_wrapper.py)

  • When expert_wise_scale=True with per_channel_symmetric:
    • Pass 1: Convert non-expert modules (attention, lm_head, etc.) with per_channel_symmetric, skip expert_mlps
    • Pass 2: Convert expert MoE modules with expert_wise_per_channel_symmetric (using include=["*expert_mlps.mlp_op*"])
  • When expert_wise_scale=False: standard single-pass conversion (no behavior change)
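In pseudocode, the two-pass flow looks roughly like the following; convert() and its keyword arguments stand in for the actual NxDI quantization API, while the quantization type names and the *expert_mlps.mlp_op* pattern are taken from this PR:

def convert_for_expert_wise_scale(model, neuron_config):
    if getattr(neuron_config, "expert_wise_scale", False) and (
        neuron_config.quantization_type == "per_channel_symmetric"
    ):
        # Pass 1: attention, lm_head, etc. -- everything except expert MLPs.
        model = convert(model, q_type="per_channel_symmetric",
                        exclude=["*expert_mlps.mlp_op*"])
        # Pass 2: expert MLPs only, keeping one scale per expert.
        model = convert(model, q_type="expert_wise_per_channel_symmetric",
                        include=["*expert_mlps.mlp_op*"])
    else:
        # expert_wise_scale=False: unchanged single-pass conversion.
        model = convert(model, q_type=neuron_config.quantization_type)
    return model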

Model Information

Model Name: Qwen3-30B-A3B (and applicable to all Qwen3 MoE models)

Model Architecture: MoE Decoder-only transformer (48 layers, 128 experts, top-8 routing, GQA)

Purpose: Text generation — improving FP8 quantization accuracy for MoE models served via vLLM on Trainium

Checklist

Required Components

  • Accuracy Test: Validated via lm-eval (IFEval, TruthfulQA) and vllm benchmark_serving — see Test Results below
  • README.md: N/A — this is a core infrastructure change, not a new model contribution under contrib/
  • Source Code (src/): Changes follow existing NxDI patterns (state dict conversion in modeling file, quantization config in config, convert logic in model_wrapper)

Optional Components

  • Unit Tests: Not included — validated end-to-end on hardware

Folder Structure

This PR modifies existing core files, not the contrib/ folder:

src/neuronx_distributed_inference/models/
  config.py                              (+1 line)
  model_wrapper.py                       (+47 lines, -7 lines)
  qwen3_moe/modeling_qwen3_moe.py        (+96 lines)

Testing

How did you test this change?

End-to-end quantization → compilation → serving → evaluation pipeline on Qwen3-30B-A3B with a single trn2.3xlarge instance. All tests performed sequentially to avoid OOM.

  1. Quantization: FP8 (f8e4m3, per_channel_symmetric) with --expert-wise-scale flag
  2. Compilation & Serving: vLLM with NxDI backend, tp_degree=4, moe_tp_degree=4, moe_ep_degree=1, max_model_len=1024
  3. Accuracy: lm_eval --tasks ifeval,truthfulqa_gen --num_fewshot 0 --limit 100
  4. Throughput: benchmark_serving.py --dataset-name random --num-prompts 100 --random-input-len 128 --random-output-len 128 --max-concurrency 1
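For step 3, one way to drive lm_eval against the running vLLM server from Python is sketched below; it assumes lm-evaluation-harness's local-completions backend, and the exact model_args keys may differ between harness versions:

import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "base_url=http://localhost:13579/v1/completions,"
        "model=/path/to/Qwen3-30B-A3B,"
        "tokenizer=/path/to/Qwen3-30B-A3B"
    ),
    tasks=["ifeval", "truthfulqa_gen"],
    num_fewshot=0,
    limit=100,
)
print(results["results"])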

Test Results:

| Variant | IFEval inst_strict | IFEval prompt_strict | TruthfulQA bleu_acc | TruthfulQA rougeL_acc | Output tok/s | TTFT (ms) |
|---|---|---|---|---|---|---|
| BF16 unquantized | 0.853 | 0.780 | 0.480 | 0.530 | 57.8 | 424.7 |
| FP8 per-expert scale (expert_wise_scale=True) | 0.405 | 0.220 | 0.450 | 0.420 | 58.7 | 468.0 |
| FP8 averaged scale (expert_wise_scale=False) | 0.000 | 0.000 | 0.030 | 0.030 | 57.2 | 487.5 |
| FP8 E0 scale (baseline, no fusion) | 0.000 | 0.000 | 0.010 | 0.040 | | |
| FP8 router_only (expert MLP excluded) | 0.184 | 0.060 | 0.340 | 0.460 | 56.9 | 463.5 |
| FP8 router + per-expert scale | 0.166 | 0.080 | 0.280 | 0.370 | 58.7 | 468.4 |

Key findings:

  • Per-expert scale retains ~50% of BF16 accuracy while averaged/E0 scale produces near-zero results
  • Throughput is unchanged or slightly improved (FP8 reduces memory bandwidth)
  • Router quantization degrades accuracy significantly — should NOT be combined with expert quantization

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.27
  • Instance Type(s): trn2.3xlarge (1 Neuron device, 4 NeuronCores, 96 GB HBM)
  • PyTorch Version: 2.8.0
  • Python Version: 3.12

Additional Information

Usage Example

Step 1: Quantize with per-expert scales

# quantize_qwen3_moe_per_expert_scale.py
import os
from neuronx_distributed_inference.models.config import MoENeuronConfig
from neuronx_distributed_inference.models.qwen3_moe.modeling_qwen3_moe import (
    Qwen3MoeInferenceConfig,
    NeuronQwen3MoeForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

model_path = "/path/to/Qwen3-30B-A3B"
quantized_model_path = "/path/to/quantized_checkpoint"
os.makedirs(quantized_model_path, exist_ok=True)

neuron_config = MoENeuronConfig(
    tp_degree=4,
    moe_tp_degree=4,
    moe_ep_degree=1,
    batch_size=1,
    max_context_length=1024,
    seq_len=1024,
    n_positions=1024,
    quantized=True,
    quantized_checkpoints_path=quantized_model_path,
    quantization_dtype="f8e4m3",                # FP8 E4M3
    quantization_type="per_channel_symmetric",  # per-output-channel scales, computed per expert
    modules_to_not_convert=["mlp.gate"],        # exclude router from quantization
)

config = Qwen3MoeInferenceConfig(
    neuron_config, load_config=load_pretrained_config(model_path)
)
NeuronQwen3MoeForCausalLM.save_quantized_state_dict(model_path, config)

The quantized checkpoint will contain individual per-expert scales (e.g., experts.{e}.gate_proj.scale).
At serving time, these are fused into [num_experts, 1, dim] tensors when expert_wise_scale=True.
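To confirm the per-expert scales are present before serving, something like the following can list them; it assumes the quantized checkpoint is written as safetensors shards (adjust the loading if it is a plain torch state dict), and the experts.{e}.*.scale key pattern is taken from the note above:

import glob
from safetensors import safe_open

for shard in sorted(glob.glob("/path/to/quantized_checkpoint/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            # e.g. ...experts.0.gate_proj.scale, ...experts.0.down_proj.scale
            if "experts." in key and key.endswith(".scale"):
                print(shard, key, f.get_slice(key).get_shape())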

Step 2: Serve with expert_wise_scale enabled

VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference \
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/Qwen3-30B-A3B \
  --max-model-len 1024 --tensor-parallel-size 4 \
  --port 13579 --max-num-seqs 1 \
  --override-neuron-config '{
    "quantized": true,
    "quantized_checkpoints_path": "/path/to/quantized_checkpoint",
    "quantization_dtype": "f8e4m3",
    "quantization_type": "per_channel_symmetric",
    "expert_wise_scale": true,
    "tp_degree": 4, "moe_tp_degree": 4, "moe_ep_degree": 1,
    "batch_size": 1, "max_context_length": 1024, "seq_len": 1024,
    "async_mode": true,
    "modules_to_not_convert": ["mlp.gate"],
    "on_device_sampling_config": {"dynamic": true, "global_topk": 20}
  }'
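Once the server reports ready, a quick smoke test against the OpenAI-compatible completions endpoint (port 13579, matching the command above) can be run from Python; the prompt and request fields below are just an example:

import requests

resp = requests.post(
    "http://localhost:13579/v1/completions",
    json={
        "model": "/path/to/Qwen3-30B-A3B",  # must match --model passed to the server
        "prompt": "Briefly explain mixture-of-experts routing.",
        "max_tokens": 64,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])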

Known Limitations

  • Expert Parallelism (moe_ep_degree > 1) is NOT supported with this feature due to an NxDI limitation: "Selective Loading with Expert parallelism is not supported in token generation"
  • Only validated on Qwen3 MoE architecture; other MoE models (Mixtral, DeepSeek, etc.) may require additional state dict conversion logic
  • Shared experts are excluded from quantization (modules_to_not_convert)

Related Issues

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Tested end-to-end with vLLM Neuron backend (openai.api_server + benchmark_serving)

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

…n MoE models

Add per-expert scale quantization support for MoE expert MLPs, which
preserves individual expert scale factors instead of averaging them.
This significantly improves FP8 quantization accuracy for MoE models
like Qwen3-30B-A3B.

Changes:
- config.py: Add expert_wise_scale config option to MoENeuronConfig
- modeling_qwen3_moe.py: Fuse per-expert scales during HF->Neuron
  state_dict conversion (gate_up_proj and down_proj), with fallback
  to averaged scales when expert_wise_scale=False
- model_wrapper.py: Two-pass quantization conversion when
  expert_wise_scale=True (Pass 1: per_channel_symmetric for non-expert
  modules, Pass 2: expert_wise_per_channel_symmetric for expert MLPs)

Validated on Qwen3-30B-A3B (trn2.3xlarge):
- IFEval inst_strict: 0.405 (vs 0.00 with averaged scale)
- TruthfulQA bleu_acc: 0.45 (vs 0.03 with averaged scale)
- Throughput: 58.7 tok/s (unchanged from baseline)
…config

Move use_expert_wise_scale into the per-model loop and read from
model.config.neuron_config instead of self.neuron_config. This
ensures correct behavior in fused speculation mode where a non-MoE
draft model and MoE target model are processed in the same loop —
previously the global flag would incorrectly apply two-pass expert
quantization to a draft model that has no expert_mlps module.
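A hedged sketch of the per-model check described in this commit message; apart from neuron_config.expert_wise_scale and quantization_type, the loop variable, attribute path, and helper names are illustrative:

# Read the flag from each model's own config so a non-MoE draft model in
# fused speculation never triggers the two-pass expert path.
for model_wrapper in model_wrappers:  # e.g. [draft, target]
    nc = model_wrapper.config.neuron_config
    use_expert_wise_scale = getattr(nc, "expert_wise_scale", False)
    if use_expert_wise_scale and nc.quantization_type == "per_channel_symmetric":
        run_two_pass_expert_conversion(model_wrapper)   # hypothetical helper
    else:
        run_single_pass_conversion(model_wrapper)       # hypothetical helper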