Support Qwen3 and Qwen2.5 Omni model quantization#1404
lvliang-intel wants to merge 28 commits into `main` from `lvl/support_omni`
Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
for more information, see https://pre-commit.ci
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Thank you for the PR! Could you help verify all inferences (vLLM, Transformers 4, and Transformers 5) before merging?
Quantization and inference verified with Transformers 5.1.0.
vLLM tests are currently blocked because the latest vLLM version depends on an outdated Transformers release. Qwen3-Omni requires Transformers >= 5.1.0 to address several known issues.
Pull request overview
Adds quantization support for the Qwen3-Omni MoE model family by integrating model-specific loading/version gating, calibration forward behavior for thinker/talker, and custom multimodal block discovery.
Changes:
- Added explicit Transformers version guard for `qwen3_omni_moe`.
- Introduced Qwen3-Omni processor/template registration and model-specific multimodal block name discovery.
- Implemented a Qwen3-Omni-specific forward path to run thinker (and optionally talker) during calibration.
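A per-model Transformers version guard like the one described could be sketched as follows. This is illustrative only: the registry, helper name, and version parsing are assumptions, not the actual auto-round API; only the `qwen3_omni_moe` model type and the 5.1.0 minimum come from this PR.

```python
# Hypothetical registry mapping model types to minimum Transformers versions.
MODEL_MIN_TRANSFORMERS = {
    "qwen3_omni_moe": (5, 1, 0),
}

def _parse_version(v: str) -> tuple:
    # Naive parse keeping the leading numeric components: "5.1.0" -> (5, 1, 0).
    parts = []
    for piece in v.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def check_transformers_version(model_type: str, installed: str) -> None:
    # Raise early if the installed Transformers is too old for this model.
    required = MODEL_MIN_TRANSFORMERS.get(model_type)
    if required and _parse_version(installed) < required:
        raise ImportError(
            f"{model_type} requires transformers>="
            f"{'.'.join(map(str, required))}, but {installed} is installed"
        )
```

Models without an entry in the registry pass through unchecked, so the guard stays opt-in per model family.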
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pyproject.toml | Adds a project-specific word to typos’ allowlist. |
| auto_round/utils/model.py | Adds Transformers version guard and adjusts lm_head discovery logic. |
| auto_round/utils/common.py | Adds _no_split_modules normalization and extends multimodal ignore-key lists. |
| auto_round/special_model_handler.py | Adds Qwen3-Omni special forward + block discovery + ignore-layer rule. |
| auto_round/compressors/shard_writer.py | Improves tie_word_embeddings lookup for nested multimodal configs. |
| auto_round/compressors/mllm/utils.py | Extends multimodal ignore-key list for Qwen3-Omni components. |
| auto_round/compressors/mllm/template.py | Registers a Qwen3-Omni model template with the new processor. |
| auto_round/compressors/mllm/processor.py | Adds a custom processor for Qwen3-Omni chat-template inputs. |
| auto_round/compressors/base.py | Imports the new normalization helper. |
| auto_round/auto_scheme/utils.py | Uses normalized _no_split_modules when dispatching across devices. |
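The `_no_split_modules` normalization mentioned for `auto_round/utils/common.py` could look roughly like this sketch. The helper name and traversal are assumptions; the idea is that multimodal models such as the Omni family carry `_no_split_modules` on submodules (thinker/talker), which need to be merged into one flat list before device dispatch.

```python
# Hypothetical sketch: collect _no_split_modules from a model and its
# common multimodal submodules into one deduplicated list.
def normalize_no_split_modules(model) -> list:
    collected = []
    stack = [model]
    while stack:
        module = stack.pop()
        names = getattr(module, "_no_split_modules", None) or []
        for name in names:
            if name not in collected:
                collected.append(name)
        # Descend into submodule attributes commonly seen on multimodal models.
        for attr in ("thinker", "talker", "model"):
            child = getattr(module, attr, None)
            if child is not None and child is not module:
                stack.append(child)
    return collected
```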
| ) | ||
|
|
||
| # Run talker forward if available (for calibration purposes) | ||
| if hasattr(model, "talker") and model.has_talker: |
This can raise AttributeError when model.has_talker doesn’t exist (the hasattr only checks talker). Use getattr(model, "has_talker", False) (and optionally also ensure model.talker is not None) to make this guard safe.
Suggested change:

```diff
-if hasattr(model, "talker") and model.has_talker:
+if getattr(model, "has_talker", False) and getattr(model, "talker", None) is not None:
```
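A quick stand-alone illustration of why the original guard can raise (class name and attributes are made up for the demo):

```python
# Demo: hasattr(model, "talker") does not protect the access to
# model.has_talker, so the second operand can still raise AttributeError.
class OmniModel:
    def __init__(self, with_flag: bool):
        self.talker = object()
        if with_flag:
            self.has_talker = True

unsafe = OmniModel(with_flag=False)
try:
    ok = hasattr(unsafe, "talker") and unsafe.has_talker  # raises here
except AttributeError:
    ok = False

# The suggested guard short-circuits instead of raising:
safe = (
    getattr(unsafe, "has_talker", False)
    and getattr(unsafe, "talker", None) is not None
)
```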
`auto_round/special_model_handler.py` (outdated)
```python
# Use text projection to convert thinker embeddings to talker space
if hasattr(model.talker, "text_projection"):
    # Get thinker embeddings
    thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
```
This path assumes input_ids is provided; if calibration runs with inputs_embeds (or other modalities without input_ids), this will throw and then be silently ignored (due to the broad except), meaning the talker forward never runs. Consider deriving inputs from inputs_embeds when present, or projecting from thinker_output.hidden_states[-1] (which you already compute) instead of re-embedding input_ids.
Suggested change:

```diff
-# Use text projection to convert thinker embeddings to talker space
-if hasattr(model.talker, "text_projection"):
-    # Get thinker embeddings
-    thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
-    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
+# Use text projection to convert thinker hidden states to talker space
+if hasattr(model.talker, "text_projection"):
+    # Project thinker hidden states directly into the talker embedding space
+    talker_inputs_embeds = model.talker.text_projection(thinker_hidden)
```
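The suggested derivation can be sketched as pure control flow (plain-Python stand-ins for tensors and modules, since the point is the branching, not the math): projecting from the thinker's last hidden state removes the dependency on `input_ids` being present during calibration.

```python
# Hypothetical helper mirroring the suggestion: prefer projecting the
# thinker's last hidden state (always available during calibration) over
# re-embedding input_ids, which may be absent with inputs_embeds-only runs.
def derive_talker_inputs(model, thinker_hidden, input_ids=None):
    text_projection = getattr(model.talker, "text_projection", None)
    if text_projection is None:
        return None  # talker has no projection layer; skip talker forward
    # thinker_hidden comes from thinker_output.hidden_states[-1], so this
    # works even when input_ids is None.
    return text_projection(thinker_hidden)
```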
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
You could update Transformers after installing vLLM.
Qwen2.5-Omni quantize and inference tests pass:

```shell
CUDA_VISIBLE_DEVICES=3 python quantize_qwen25_omni.py --model /mnt/disk2/lvl/Qwen2.5-Omni-3B --output tmp_qwen25_omni_w4a16 --iters 200
CUDA_VISIBLE_DEVICES=6 python run_qwen25_omni.py --model-dir tmp_qwen25_omni_w4a16 --enable-audio-output
```
```python
SPECIAL_MULTIMODAL_BLOCK = {"deepseek_vl_v2": _get_deepseek_vl2_multimodal_block}

def _get_qwen2_5_omni_multimodal_block(model, quant_vision=False):
```
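The registry in the diff dispatches block discovery by model type. A minimal stand-alone sketch of that pattern is below; the handler bodies and block names are placeholders, not the real discovery logic from this PR.

```python
# Placeholder handlers: the real functions walk the model to find
# quantizable decoder blocks; here they just return illustrative names.
def _get_deepseek_vl2_multimodal_block(model, quant_vision=False):
    return ["language.model.layers"]  # placeholder

def _get_qwen2_5_omni_multimodal_block(model, quant_vision=False):
    blocks = ["thinker.model.layers"]  # placeholder
    if quant_vision:
        blocks.append("thinker.visual.blocks")  # placeholder
    return blocks

# Model-type -> handler registry, mirroring SPECIAL_MULTIMODAL_BLOCK.
SPECIAL_MULTIMODAL_BLOCK = {
    "deepseek_vl_v2": _get_deepseek_vl2_multimodal_block,
    "qwen2_5_omni": _get_qwen2_5_omni_multimodal_block,
}

def find_multimodal_blocks(model_type, model, quant_vision=False):
    handler = SPECIAL_MULTIMODAL_BLOCK.get(model_type)
    if handler is None:
        return None  # caller falls back to generic block discovery
    return handler(model, quant_vision=quant_vision)
```

The `None` fallback keeps special-casing contained: only model families with a registered handler bypass the generic discovery path.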
Since the code for these two models has grown to 300+ lines, it's making the main file quite cluttered. Shall we refine this file later?
Sure, we will refactor this file later.
Awesome work, Liang Ge!
vLLM inference test with the quantized Qwen2.5-Omni model passes; accuracy is good:

```shell
CUDA_VISIBLE_DEVICES=5 python run_qwen25_omni_vllm.py --model-dir ./tmp_qwen25_omni_w4a16
```
vLLM inference test with the quantized Qwen3-Omni model shows poor accuracy. This looks like a vLLM issue, since the Transformers inference test is good for Qwen3-Omni:

```shell
CUDA_VISIBLE_DEVICES=5 python run_qwen3_omni_vllm.py --model-dir ./tmp_qwen3_omni_w4a16
```
Description
This update adds quantization support for Qwen3-Omni by integrating a custom MLLM processor and template, implementing dedicated forward logic for thinker/talker calibration, and introducing model-specific block discovery.
Note: This feature requires Transformers >= 5.1.0, as earlier versions contain compatibility issues with Qwen3-Omni.
Type of Change
Related Issues
#1387
Fixes or relates to #
Checklist Before Submitting