feat: add expert_wise_scale support for per-expert FP8 quantization in MoE models #35
Open
lifelongeeek wants to merge 2 commits into aws-neuron:main from
Conversation
…n MoE models

Add per-expert scale quantization support for MoE expert MLPs, which preserves individual expert scale factors instead of averaging them. This significantly improves FP8 quantization accuracy for MoE models like Qwen3-30B-A3B.

Changes:
- config.py: Add expert_wise_scale config option to MoENeuronConfig
- modeling_qwen3_moe.py: Fuse per-expert scales during HF->Neuron state_dict conversion (gate_up_proj and down_proj), with fallback to averaged scales when expert_wise_scale=False
- model_wrapper.py: Two-pass quantization conversion when expert_wise_scale=True (Pass 1: per_channel_symmetric for non-expert modules, Pass 2: expert_wise_per_channel_symmetric for expert MLPs)

Validated on Qwen3-30B-A3B (trn2.3xlarge):
- IFEval inst_strict: 0.405 (vs 0.00 with averaged scale)
- TruthfulQA bleu_acc: 0.45 (vs 0.03 with averaged scale)
- Throughput: 58.7 tok/s (unchanged from baseline)
…config

Move use_expert_wise_scale into the per-model loop and read it from model.config.neuron_config instead of self.neuron_config. This ensures correct behavior in fused speculation mode, where a non-MoE draft model and an MoE target model are processed in the same loop — previously the global flag would incorrectly apply two-pass expert quantization to a draft model that has no expert_mlps module.
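A minimal sketch of that per-model check, with placeholder function names and attribute access (not the actual model_wrapper.py code):

```python
from typing import Callable, Iterable

def dispatch_quantization(models: Iterable,
                          single_pass: Callable,
                          expert_wise_two_pass: Callable) -> None:
    # Sketch only: the flag is read from each model's own neuron_config inside
    # the loop (not from a shared self.neuron_config), so in fused speculation
    # the non-MoE draft model never takes the expert-wise two-pass path.
    for model in models:
        neuron_config = model.config.neuron_config
        if getattr(neuron_config, "expert_wise_scale", False):
            expert_wise_two_pass(model)   # e.g. the MoE target model
        else:
            single_pass(model)            # e.g. a non-MoE draft model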
Description
Add per-expert scale quantization support for MoE expert MLPs in NeuronX Distributed Inference.
When quantizing MoE models (e.g., Qwen3-30B-A3B) to FP8, the existing implementation uses a single scale across all experts in the fused operator. This destroys per-expert precision and results in near-zero accuracy on downstream benchmarks. This PR introduces per-expert scale preservation during HF-to-Neuron state dict conversion and a two-pass quantization flow that applies specifically to expert MLP modules.
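To make the difference concrete, here is a small, self-contained sketch of the two fusing strategies. Tensor names and sizes are illustrative only; the real conversion lives in the HF-to-Neuron state dict conversion in modeling_qwen3_moe.py.

```python
import torch

# Illustrative sizes; only the output shapes mirror the PR description.
num_experts, dim = 128, 768

# One FP8 scale vector per expert, e.g. one per expert gate_up_proj.
per_expert_scales = [torch.rand(dim) for _ in range(num_experts)]

# expert_wise_scale=True: keep every expert's scale -> [num_experts, 1, dim]
fused = torch.stack(per_expert_scales).unsqueeze(1)
assert fused.shape == (num_experts, 1, dim)

# expert_wise_scale=False (previous behavior): average across experts -> [1, 1, dim]
averaged = torch.stack(per_expert_scales).mean(dim=0, keepdim=True).unsqueeze(0)
assert averaged.shape == (1, 1, dim)
```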
Changes
1. Config (config.py)
   - Add expert_wise_scale boolean to MoENeuronConfig (default: False, fully backward compatible)
2. State Dict Conversion (modeling_qwen3_moe.py)
   - expert_wise_scale=True: fuse per-expert scales into [num_experts, 1, dim] tensors for both gate_up_proj and down_proj
   - expert_wise_scale=False: average scales across experts into [1, 1, dim] tensors (existing behavior preserved)
3. Quantization Conversion (model_wrapper.py)
   - expert_wise_scale=True with per_channel_symmetric: two-pass conversion
     - Pass 1: per_channel_symmetric, skipping expert_mlps
     - Pass 2: expert_wise_per_channel_symmetric (using include=["*expert_mlps.mlp_op*"])
   - expert_wise_scale=False: standard single-pass conversion (no behavior change)
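The two-pass flow in item 3, as a hedged sketch: the convert callable and its keyword names are placeholders, not the actual NxDI quantization API; only the pass structure, the q_config names, and the include pattern come from this PR.

```python
from typing import Callable

def convert_model(model, expert_wise_scale: bool, convert: Callable):
    # `convert` stands in for NxDI's quantization conversion entry point; the
    # keyword names here are illustrative.
    if expert_wise_scale:
        # Pass 1: per-channel symmetric scales for everything except the fused expert MLPs
        model = convert(model, q_config="per_channel_symmetric", skip=["*expert_mlps*"])
        # Pass 2: per-expert, per-channel scales only for the fused expert MLP op
        model = convert(model, q_config="expert_wise_per_channel_symmetric",
                        include=["*expert_mlps.mlp_op*"])
    else:
        # expert_wise_scale=False: single-pass conversion, existing behavior
        model = convert(model, q_config="per_channel_symmetric")
    return model
```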
Model Information
Model Name: Qwen3-30B-A3B (and applicable to all Qwen3 MoE models)
Model Architecture: MoE Decoder-only transformer (48 layers, 128 experts, top-8 routing, GQA)
Purpose: Text generation — improving FP8 quantization accuracy for MoE models served via vLLM on Trainium
Checklist
Required Components
- (contrib/src/): Changes follow existing NxDI patterns (state dict conversion in modeling file, quantization config in config, convert logic in model_wrapper)

Optional Components
Folder Structure
This PR modifies existing core files (config.py, modeling_qwen3_moe.py, model_wrapper.py), not the contrib/ folder.

Testing
How did you test this change?
End-to-end quantization → compilation → serving → evaluation pipeline on Qwen3-30B-A3B with a single trn2.3xlarge instance. All tests performed sequentially to avoid OOM.
1. Quantization with the --expert-wise-scale flag
2. Evaluation: lm_eval --tasks ifeval,truthfulqa_gen --num_fewshot 0 --limit 100
3. Serving benchmark: benchmark_serving.py --dataset-name random --num-prompts 100 --random-input-len 128 --random-output-len 128 --max-concurrency 1

Test Results:
| Metric | expert_wise_scale=True | expert_wise_scale=False |
|---|---|---|
| IFEval inst_strict | 0.405 | 0.00 |
| TruthfulQA bleu_acc | 0.45 | 0.03 |

Throughput: 58.7 tok/s (unchanged from baseline)

Key findings:
- Per-expert scales recover the benchmark accuracy that collapses to near zero with averaged scales, with no throughput regression.
Compatibility
Tested with:
Additional Information
Usage Example
Step 1: Quantize with per-expert scales
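The original snippet for this step is not preserved above; the sketch below shows the intent assuming an NxDI-style config object, with every name except expert_wise_scale treated as an assumption.

```python
# Sketch only: the import path and the quantization keyword names are assumptions,
# not verified NxDI API; expert_wise_scale is the flag added by this PR.
from neuronx_distributed_inference.models.config import MoENeuronConfig  # assumed path

neuron_config = MoENeuronConfig(
    quantized=True,                               # assumed existing FP8 knobs
    quantization_type="per_channel_symmetric",
    quantization_dtype="f8e4m3",
    expert_wise_scale=True,                       # new: keep per-expert scales
)
# ...then run the usual quantize/compile flow (or the quantization script with
# the --expert-wise-scale flag mentioned in the Testing section).
```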
Step 2: Serve with expert_wise_scale enabled
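A hedged serving sketch: it assumes vLLM's Neuron backend forwards override_neuron_config entries to NxDI; treat the exact keys and paths as placeholders.

```python
# Sketch only: assumes the vLLM Neuron backend accepts override_neuron_config and
# forwards it to NxDI; the model path is hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/Qwen3-30B-A3B-fp8-expert-wise",    # hypothetical quantized checkpoint
    override_neuron_config={"expert_wise_scale": True},
)
out = llm.generate(["Write a haiku about autumn."], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```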
Known Limitations
- Expert parallelism (moe_ep_degree > 1) is NOT supported with this feature due to an NxDI limitation: "Selective Loading with Expert parallelism is not supported in token generation"
- (modules_to_not_convert)

Related Issues
vLLM Integration
By submitting this PR, I confirm that: