Conversation
Pull request overview
This PR adds Qwen3.5 model support (dense and MoE variants) to the ATOM framework, targeting vLLM plugin mode. It introduces model configurations, model implementations with hybrid attention (GatedDeltaNet linear attention + full attention), multimodal support via Qwen3VL integration, and a new GDN attention backend for vLLM.
Changes:
- Added Qwen3.5 dense and MoE model configs, model implementations, and conditional generation wrappers with multimodal (vision-language) support
- Added a new GatedDeltaNet attention backend for vLLM plugin mode and refactored existing attention/loader code for Qwen3.5 compatibility
- Extended weight loading, quantization config, and MoE layers to handle Qwen3.5's separate projection weights and fused expert patterns
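The weight-name remapping mentioned above can be pictured with a small sketch. This is a hypothetical helper, not ATOM's actual `WeightsMapper` API: it remaps checkpoint weight names (e.g. an HF prefix) onto the model's module names while iterating over `(name, tensor)` pairs.

```python
# Hypothetical sketch of a WeightsMapper-style helper (names assumed,
# not the actual ATOM API).
class WeightsMapper:
    def __init__(self, prefix_map):
        # {checkpoint_prefix: model_prefix}
        self.prefix_map = prefix_map

    def map_name(self, name):
        # Rewrite the first matching prefix; leave unmatched names untouched.
        for src, dst in self.prefix_map.items():
            if name.startswith(src):
                return dst + name[len(src):]
        return name

    def apply(self, weights):
        # weights: iterable of (name, tensor) pairs from the checkpoint
        return ((self.map_name(n), t) for n, t in weights)
```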
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/model_config/qwen3_5.py | New: Qwen3.5 dense model configuration |
| atom/model_config/qwen3_5_moe.py | New: Qwen3.5 MoE model configuration |
| atom/models/qwen3_5.py | New: Main Qwen3.5 model implementation (GDN, attention, CausalLM, ConditionalGeneration) |
| atom/models/interfaces.py | New: Multimodal protocol interfaces for model support |
| atom/models/utils.py | Added utility classes (StageMissingLayer, collect_children, no_init_weights, common_prefix) |
| atom/models/qwen3_next.py | Refactored for Qwen3.5 compatibility: changed config access to .text_config, added vLLM attention path |
| atom/plugin/vllm/attention_backend/attention_gdn.py | New: GatedDeltaNet attention backend for vLLM |
| atom/plugin/vllm/attention_backend/gdn_attn.py | New: GDN attention backend wrapper |
| atom/plugin/vllm/model_wrapper.py | Added ATOMForConditionalGeneration multimodal wrapper |
| atom/plugin/vllm/register.py | Registered Qwen3.5 model architectures |
| atom/plugin/vllm/platform.py | Updated get_attn_backend_cls signature for vLLM compatibility |
| atom/plugin/config.py | Added vllm_config field to PluginConfig |
| atom/plugin/attention.py | Changed supported kernel block sizes from [16, 32] to [16] |
| atom/config.py | Enhanced quant config for packed modules; fallback config loading via vLLM; missing return bug |
| atom/model_loader/loader.py | Added WeightsMapper for weight name remapping; plugin mode weight loading changes |
| atom/model_engine/model_runner.py | Added Qwen3.5 to model registry; changed hf_config to text_config |
| atom/model_ops/base_attention.py | Updated linear attention forward to pass layer_name |
| atom/model_ops/attention_gdn.py | Added vLLM forward context support (with is_vllm bug) |
| atom/model_ops/layernorm.py | Switched RMSNormGated to vLLM's rmsnorm_fn (breaks non-vllm) |
| atom/model_ops/linear.py | Extended weight_loader for tuple shard_ids; new QKVZBAParallelLinear shard types |
| atom/model_ops/moe.py | Added w13 shard support; parameterized gate_up_proj name |
| atom/utils/selector.py | Added vLLM-specific GDN attention backend selection |
atom/model_ops/layernorm.py
Outdated
```python
# if torch.compiler.is_compiling():
#     return self.forward_native(x, z)
from vllm.model_executor.layers.fla.ops.layernorm_guard import rmsnorm_fn

return rmsnorm_fn(
    x,
    self.weight,
    self.bias,
    z=z,
    eps=self.eps,
    group_size=self.group_size,
    norm_before_gate=self.norm_before_gate,
)
```
Signed-off-by: ganyi <ygan@amd.com>
Pull request overview
This PR adds Qwen3.5 model support (dense and MoE variants) to the ATOM framework, leveraging vLLM's plugin mode for multimodal inference. It also refactors shared components (layernorm fusion, weight loading, attention backends) and improves the CI benchmark infrastructure.
Changes:
- Adds Qwen3.5 dense and MoE model implementations with hybrid attention (GatedDeltaNet linear + full attention), multimodal support via Qwen3VL integration, and vLLM plugin registration
- Refactors RMSNorm quantization fusion into reusable `DualRMSNorm` and `fuse_rmsnorm_group_quant` utilities in `layernorm.py`, removing duplicate code from `deepseek_v2.py`
- Restructures the CI benchmark workflow to use an external `models.json` config and adds a dedicated profiler analysis job with a new `regression_rerun.py` script
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/models/qwen3_5.py | New Qwen3.5 model (dense, MoE, multimodal) implementation |
| atom/models/qwen3_next.py | Refactored to support Qwen3.5 via text_config, method extraction |
| atom/model_config/qwen3_5.py | Qwen3.5 dense model configuration |
| atom/model_config/qwen3_5_moe.py | Qwen3.5 MoE model configuration |
| atom/model_ops/layernorm.py | DualRMSNorm, fuse_rmsnorm_group_quant, RMSNorm fused quant refactor |
| atom/model_ops/linear.py | Tuple shard loading, new QKVZBAParallelLinear shard IDs, import change |
| atom/model_ops/moe.py | w13 shard support, make_expert_params_mapping extension |
| atom/models/deepseek_v2.py | Migrated fusion code to layernorm.py, uses DualRMSNorm |
| atom/config.py | QuantizationConfig updates, vLLM config fallback, env var generalization |
| atom/model_loader/loader.py | WeightsMapper class, text_config-aware loading |
| atom/model_engine/model_runner.py | Qwen3.5 model registration, text_config loading |
| atom/plugin/vllm/model_wrapper.py | ATOMForConditionalGeneration multimodal wrapper |
| atom/plugin/vllm/register.py | Qwen3.5 model registration |
| atom/plugin/vllm/attention_backend/attention_gdn.py | GatedDeltaNet vLLM attention backend |
| atom/plugin/vllm/attention_backend/gdn_attn.py | GDN backend class for vLLM plugin |
| atom/plugin/config.py | Added vllm_config field |
| atom/plugin/attention.py | Block size change (16 only) |
| atom/utils/selector.py | vLLM-aware GDN backend selection |
| atom/utils/envs.py | New master RMSNORM_QUANT_FUSION switch, custom all-gather |
| atom/models/utils.py | StageMissingLayer, collect_children, no_init_weights utilities |
| atom/models/interfaces.py | SupportsMultiModal protocol for ATOM models |
| atom/model_ops/base_attention.py | layer_name passthrough, prefix storage |
| atom/model_ops/attention_gdn.py | GatedDeltaNet rename fix, layer_name param |
| atom/model_ops/attentions/gdn_attn.py | GatedDeltaNet rename fix |
| atom/model_ops/embed_head.py | Configurable custom all-gather |
| docs/model_ops_guide.md | DualRMSNorm documentation |
| docs/environment_variables.md | Updated env var docs |
| .github/workflows/atom-benchmark.yaml | Refactored CI: models.json, profiler analysis job |
| .github/benchmark/models.json | Externalized model configs |
| .github/scripts/regression_rerun.py | New regression re-run config generator |
| .github/scripts/summarize.py | model_id in regression report |
| .github/scripts/atom_test.sh | Cache clearing, stop command, result filename |
Comments suppressed due to low confidence (1)
atom/model_ops/layernorm.py:708
- Bug: `forward_cuda` now unconditionally imports from `vllm.model_executor.layers.fla.ops.layernorm_guard`, which will fail with `ImportError` in standalone ATOM mode (non-vLLM). The previous code correctly fell back to `self.forward_native(x, z)`. Since `RMSNormGated` is a general utility in `model_ops/layernorm.py` (not specific to vLLM plugin mode), this import should be guarded, e.g. with a try/except that falls back to the native implementation, or a check for vLLM availability.
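The guard the suppressed comment asks for might look like the following sketch (helper name is invented): resolve vLLM's fused kernel lazily and return `None` when vLLM is absent, so `RMSNormGated` can fall back to `forward_native` in standalone ATOM mode.

```python
# Hedged sketch of a guarded import with a native fallback signal.
def resolve_rmsnorm_kernel():
    try:
        # Only available when running under vLLM plugin mode.
        from vllm.model_executor.layers.fla.ops.layernorm_guard import rmsnorm_fn
        return rmsnorm_fn
    except ImportError:
        # Standalone ATOM mode: caller should use self.forward_native(x, z).
        return None
```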
atom/models/qwen3_5.py
Outdated
```python
self.config.n_shared_experts = 1
self.config.n_routed_experts = self.config.num_experts
```
```diff
 @staticmethod
 def get_supported_kernel_block_sizes():
-    return [16, 32]
+    return [16]
```
Gluon PA does not support block size 32, so I removed it.
Can we add a comment here explaining why only block size 16 is supported?
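The requested comment might look like this sketch (the class name is a stand-in, not ATOM's actual backend class):

```python
# Illustrative stand-in for the plugin attention backend class.
class GDNPluginAttentionBackend:
    @staticmethod
    def get_supported_kernel_block_sizes():
        # The gluon paged-attention kernel only implements KV block size 16;
        # block size 32 was removed because gluon PA does not support it.
        return [16]
```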
atom/model_engine/model_runner.py
Outdated
```python
self.model = model_class(config)
torch.set_default_device(None)
load_model(self.model, config.model, config.hf_config, config.load_dummy)
if hasattr(config.hc_config, "text_config"):
```
This PR requires ROCm/aiter#2292 to be merged into aiter.
Signed-off-by: ganyi <ygan@amd.com>
atom/model_ops/layernorm.py
Outdated
```python
# return self.forward_native(x, z)
from vllm.model_executor.layers.fla.ops.layernorm_guard import rmsnorm_fn
```
I'll revert this.
atom/models/qwen3_5.py
Outdated
```python
@triton.jit
def shard_qkvzba_kernel(
```
Move this to model_ops or aiter.
```python
########################################################

# ConditionalGeneration model scope should only works on plugin mode
if is_vllm():
```
Many required multimodal modules are missing in ATOM, so Qwen3.5 is only supported in vLLM plugin mode for now. We will extend it to ATOM native support in the future.
ChuanLi1101
left a comment
Review focusing on critical and potential bug issues in the newly added code.
atom/model_engine/model_runner.py
Outdated
```python
self.model = model_class(config)
torch.set_default_device(None)
load_model(self.model, config.model, config.hf_config, config.load_dummy)
if hasattr(config.hc_config, "text_config"):
```
[Critical] Typo: `config.hc_config` should be `config.hf_config`.
`config` (which is `Config`) has no attribute `hc_config`. This will raise `AttributeError` at runtime for any model with a `text_config` sub-config (including Qwen3.5). Should be:
```python
if hasattr(config.hf_config, "text_config"):
```
atom/models/qwen3_5.py
Outdated
```python
self.config = config
self.config.n_shared_experts = 1
self.config.n_routed_experts = self.config.num_experts
```
[Critical] `self.config.num_experts` will crash for dense (non-MoE) Qwen3.5 models.
`Qwen3_5Model` is shared by both `Qwen3_5ForCausalLM` (dense) and `Qwen3_5MoeForCausalLM` (MoE). For dense models, the config is `Qwen3_5TextConfig`, which does not define `num_experts`, so this line will raise `AttributeError`.
Suggested fix — guard with a conditional or use `getattr`:
```python
self.config.n_shared_experts = getattr(self.config, "n_shared_experts", 1)
self.config.n_routed_experts = getattr(self.config, "num_experts", 0)
```
Or split `Qwen3_5Model` to avoid unconditionally setting MoE-specific attributes on a dense config.
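A runnable illustration of why the `getattr` guard works (the helper name and `SimpleNamespace` configs are invented for demonstration): a dense config without `num_experts` gets safe defaults instead of raising.

```python
from types import SimpleNamespace

# Invented helper demonstrating the suggested getattr guard.
def apply_moe_defaults(config):
    # Dense Qwen3.5 configs lack num_experts; getattr avoids AttributeError.
    config.n_shared_experts = getattr(config, "n_shared_experts", 1)
    config.n_routed_experts = getattr(config, "num_experts", 0)
    return config

dense = apply_moe_defaults(SimpleNamespace())            # no num_experts
moe = apply_moe_defaults(SimpleNamespace(num_experts=64))
```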
atom/model_ops/linear.py
Outdated
```python
gemm_a8w8_blockscale_bpreshuffle_triton = None

# For Triton FP8 Blockscale GEMM is mostly slower then AITER GEMM, we turn off Triton FP8 GEMM
from aiter.ops.triton.gemm.basic.gemm_a8w8_blockscale import (
```
[Potential Bug] Unprotected import lost its fallback guard.
Previously `gemm_a8w8_blockscale_bpreshuffle_triton` was set to `None` as a safe fallback. Now this import is outside the try/except block — if `aiter.ops.triton.gemm.basic.gemm_a8w8_blockscale` is not available (e.g., an older aiter version), it will crash module import entirely.
Consider wrapping it in its own try/except with a `None` fallback:
```python
try:
    from aiter.ops.triton.gemm.basic.gemm_a8w8_blockscale import (
        gemm_a8w8_blockscale_preshuffle as gemm_a8w8_blockscale_bpreshuffle_triton,
    )
except ImportError:
    gemm_a8w8_blockscale_bpreshuffle_triton = None
```
Signed-off-by: ganyi <ygan@amd.com>
Pull request overview
This PR adds Qwen3.5 model support (dense and MoE variants) to ATOM, leveraging vLLM's plugin mode for hybrid attention architectures (GatedDeltaNet linear attention + full attention). It includes multimodal (vision-language) support via Qwen3VL integration.
Changes:
- Adds Qwen3.5 dense and MoE model implementations, configs, and vLLM plugin integration including a GatedDeltaNet attention backend
- Refactors shared components (Qwen3Next, weight loading, attention selection) to support the new model architecture and vLLM plugin mode
- Adds utility classes (WeightsMapper, StageMissingLayer, interfaces) and environment variable controls for the new model support
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/models/qwen3_5.py | New Qwen3.5 model implementation (dense, MoE, multimodal) |
| atom/model_config/qwen3_5.py | Qwen3.5 dense model configuration |
| atom/model_config/qwen3_5_moe.py | Qwen3.5 MoE model configuration |
| atom/plugin/vllm/attention_backend/attention_gdn.py | GatedDeltaNet attention backend for vLLM plugin mode |
| atom/plugin/vllm/attention_backend/gdn_attn.py | GDN attention backend wrapper for vLLM |
| atom/plugin/vllm/model_wrapper.py | ATOMForConditionalGeneration wrapper for multimodal |
| atom/plugin/vllm/register.py | Register Qwen3.5 models in vLLM plugin registry |
| atom/plugin/config.py | Add vllm_config to PluginConfig |
| atom/plugin/attention.py | Reduce supported block sizes to [16] |
| atom/models/qwen3_next.py | Refactor for Qwen3.5 compatibility and vLLM support |
| atom/models/interfaces.py | New multimodal protocol interfaces (currently unused) |
| atom/models/utils.py | Add utility classes (StageMissingLayer, collect_children, etc.) |
| atom/model_ops/linear.py | Support tuple shard IDs and new QKV loading modes |
| atom/model_ops/layernorm.py | Switch RMSNormGated to use vLLM's rmsnorm_fn |
| atom/model_ops/base_attention.py | Pass layer_name to GDN forward, store prefix |
| atom/model_ops/attention_gdn.py | Rename GatedDetlaNet → GatedDeltaNet, add layer_name param |
| atom/model_ops/attentions/gdn_attn.py | Update import for renamed GatedDeltaNet |
| atom/model_ops/embed_head.py | Configurable custom all-gather via env var |
| atom/model_loader/loader.py | Add WeightsMapper, support weights remapping and text_config |
| atom/model_engine/model_runner.py | Add Qwen3.5 to model arch dict, text_config handling |
| atom/config.py | Enhanced quantization config, vLLM fallback for config loading |
| atom/utils/selector.py | Route GDN attention to vLLM-specific backend when in plugin mode |
| atom/utils/envs.py | Add ATOM_USE_CUSTOM_ALL_GATHER env var |
atom/model_engine/model_runner.py
Outdated
```python
self.model = model_class(config)
torch.set_default_device(None)
load_model(self.model, config.model, config.hf_config, config.load_dummy)
if hasattr(config.hc_config, "text_config"):
```
atom/models/qwen3_5.py
Outdated
```python
compilation_config = get_current_vllm_config().compilation_config
if prefix in compilation_config.static_forward_context:
    raise ValueError(f"Duplicate layer name: {prefix}")
print(f"register layer {prefix} to static forward context for Mamba")
```
atom/model_ops/linear.py
Outdated
```python
from aiter.ops.triton.gemm.basic.gemm_a8w8_blockscale import (
    gemm_a8w8_blockscale_preshuffle as gemm_a8w8_blockscale_bpreshuffle_triton,
)
```
atom/models/qwen3_5.py
Outdated
```python
    self,
    hidden_states: torch.Tensor,
) -> torch.Tensor | None:
    print("hidden states shape: ", hidden_states.shape)
```
atom/model_ops/layernorm.py
Outdated
```python
# if torch.compiler.is_compiling():
#     return self.forward_native(x, z)
from vllm.model_executor.layers.fla.ops.layernorm_guard import rmsnorm_fn

return rmsnorm_fn(
    x,
    self.weight,
    self.bias,
    z=z,
    eps=self.eps,
    group_size=self.group_size,
    norm_before_gate=self.norm_before_gate,
)
```
atom/models/interfaces.py
Outdated
```python
    on those tokens. Note however that doing so increases memory usage
    as an additional buffer is needed to hold the input embeddings.
    """
from .utils import _merge_multimodal_embeddings
```
atom/models/qwen3_5.py
Outdated
```python
self.config.n_shared_experts = 1
self.config.n_routed_experts = self.config.num_experts
```
atom/models/qwen3_5.py
Outdated
```python
@MULTIMODAL_REGISTRY.register_processor(
    Qwen3VLMultiModalProcessor,
    info=Qwen3_5MoeProcessingInfo,
```
Signed-off-by: ganyi <ygan@amd.com>
Pull request overview
This PR adds Qwen3.5 model support (both dense and MoE variants) to the ATOM framework, operating exclusively in vLLM plugin mode. The implementation provides a hybrid attention architecture combining GatedDeltaNet linear attention layers (using ATOM's optimized kernels) with full attention layers (using vLLM's native implementation), plus multimodal (vision-language) support via Qwen3VL integration.
Changes:
- Added Qwen3.5 model configs, model classes, and multimodal wrappers with dedicated weight loading and mapping logic
- Refactored `Qwen3NextGatedDeltaNet`, `Qwen3NextAttention`, and `Qwen3NextDecoderLayer` to support both standalone and vLLM plugin modes, with a new GatedDeltaNet attention backend for vLLM
- Enhanced the weight loading infrastructure with `WeightsMapper`, fused expert weight support, and tuple-shard loading for `MergedColumnParallelLinear`
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/model_config/qwen3_5.py | New Qwen3.5 dense model configuration (text + vision) |
| atom/model_config/qwen3_5_moe.py | New Qwen3.5 MoE model configuration |
| atom/models/qwen3_5.py | Main Qwen3.5 model implementation with dense, MoE, and multimodal variants |
| atom/models/qwen3_next.py | Refactored base classes for Qwen3.5 compatibility (config access, method extraction) |
| atom/models/interfaces.py | New multimodal interface protocols (appears unused) |
| atom/models/utils.py | Added utility classes/functions for model registration and weight init |
| atom/plugin/vllm/attention_backend/attention_gdn.py | New GatedDeltaNet attention backend for vLLM plugin mode |
| atom/plugin/vllm/attention_backend/gdn_attn.py | Backend wrapper for GDN attention |
| atom/plugin/vllm/model_wrapper.py | Added ATOMForConditionalGeneration wrapper for multimodal models |
| atom/plugin/vllm/register.py | Registered Qwen3.5 model classes |
| atom/plugin/config.py | Added vllm_config to plugin config |
| atom/plugin/attention.py | Reduced supported block sizes to [16] |
| atom/config.py | Enhanced QuantizationConfig with vllm integration, fallback config loading |
| atom/model_loader/loader.py | Added WeightsMapper, fused expert loading, weight name remapping |
| atom/model_ops/linear.py | Added tuple shard_id support and new QKVZBA shard ids |
| atom/model_ops/layernorm.py | Switched RMSNormGated to use vllm's rmsnorm kernel |
| atom/model_ops/base_attention.py | Added layer_name parameter and prefix tracking |
| atom/model_ops/attention_gdn.py | Fixed typo GatedDetlaNet → GatedDeltaNet, added layer_name param |
| atom/model_ops/attentions/gdn_attn.py | Updated to use corrected GatedDeltaNet class name |
| atom/model_ops/embed_head.py | Made custom all-gather configurable via env var |
| atom/utils/selector.py | Added vLLM-specific GDN attention backend selection |
| atom/utils/envs.py | Added ATOM_USE_CUSTOM_ALL_GATHER env var |
atom/model_ops/layernorm.py
Outdated
```python
from vllm.model_executor.layers.fla.ops.layernorm_guard import rmsnorm_fn

return rmsnorm_fn(
    x,
    self.weight,
    self.bias,
    z=z,
    eps=self.eps,
    group_size=self.group_size,
    norm_before_gate=self.norm_before_gate,
)
```
atom/models/qwen3_next.py
Outdated
```python
self.tp_size = get_tensor_model_parallel_world_size()
self.tp_rank = get_tensor_model_parallel_rank()
self.atom_config = atom_config
config = atom_config.hf_config.text_config
```
Signed-off-by: ganyi <ygan@amd.com>
atom/models/qwen3_5.py
Outdated
```python
compilation_config = get_current_vllm_config().compilation_config
if prefix in compilation_config.static_forward_context:
    raise ValueError(f"Duplicate layer name: {prefix}")
print(f"register layer {prefix} to static forward context for Mamba")
```
You can use the logger here.
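The suggested change might look like this sketch (the function shape is hypothetical; only the `print` → `logger` swap reflects the comment): a module logger makes registration noise filterable by log level.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical registration helper showing logger.debug instead of print.
def register_mamba_layer(prefix, static_forward_context, layer):
    if prefix in static_forward_context:
        raise ValueError(f"Duplicate layer name: {prefix}")
    logger.debug("register layer %s to static forward context for Mamba", prefix)
    static_forward_context[prefix] = layer
```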
atom/models/interfaces.py
Outdated
```python
@runtime_checkable
class SupportsMultiModal(Protocol):
```
Hi, @ganyi1996ppo,
Is this class entirely ported from vLLM? If so, we can use vLLM's SupportsMultiModal directly to reduce maintenance effort.
Signed-off-by: ganyi <ygan@amd.com>
atom/models/interfaces.py
Outdated
```python
@contextmanager
def _mark_language_model(
    self,
    atom_config: nn.Module,
```
Hi @ganyi1996ppo,
Here the argument is `atom_config` while the type annotation is `nn.Module`, which seems misleading. Meanwhile, the argument actually passed is `vllm_config`; the call site is:
```python
with self._mark_language_model(vllm_config):
```
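A toy version of such a context manager (flag name and dict-based config are invented) showing the corrected parameter naming: it receives a config object, not an `nn.Module`, and restores state on exit.

```python
from contextlib import contextmanager

# Hypothetical sketch: parameter named for what is actually passed.
@contextmanager
def mark_language_model(vllm_config: dict):
    # Flag name is invented for illustration.
    vllm_config["building_language_model"] = True
    try:
        yield vllm_config
    finally:
        vllm_config["building_language_model"] = False
```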
Pull request overview
This PR adds Qwen3.5 model support to ATOM (both dense and MoE variants), operating exclusively in vLLM plugin mode. The implementation leverages a hybrid attention architecture with GatedDeltaNet (linear attention) and full attention layers, multimodal (vision-language) capabilities via Qwen3VL integration, and a dual-class pattern for vLLM compatibility.
Changes:
- Added Qwen3.5 dense and MoE model implementations with hybrid attention, multimodal support, and custom weight loading for both FP8 and BF16 checkpoint formats
- Refactored `Qwen3NextGatedDeltaNet` and related classes to support Qwen3.5's separate QKVZ/BA projections, and enhanced the weight loader to handle fused expert weights and tuple shard IDs
- Added vLLM plugin infrastructure: GDN attention backend, conditional generation wrapper (`ATOMForConditionalGeneration`), model registration, and config enhancements
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/models/qwen3_5.py | New Qwen3.5 model implementations (dense, MoE, multimodal) with triton kernels and dual-class vLLM pattern |
| atom/models/qwen3_next.py | Refactored GatedDeltaNet, decoder layer, and MoE block for Qwen3.5 compatibility |
| atom/model_config/qwen3_5.py | Qwen3.5 dense configuration (text, vision, composite) |
| atom/model_config/qwen3_5_moe.py | Qwen3.5 MoE configuration |
| atom/plugin/vllm/attention_backend/attention_gdn.py | GatedDeltaNet attention backend for vLLM plugin mode |
| atom/plugin/vllm/attention_backend/gdn_attn.py | Backend wrapper class for GDN attention |
| atom/plugin/vllm/model_wrapper.py | New ATOMForConditionalGeneration wrapper with multimodal/MRoPE support |
| atom/plugin/vllm/register.py | Model registration entries for Qwen3.5 |
| atom/model_loader/loader.py | WeightsMapper, fused expert loading, and enhanced weight loading pipeline |
| atom/model_ops/linear.py | Tuple shard_id support and new shard types (qkv, z, b, a) in weight loaders |
| atom/config.py | QuantizationConfig enhancements, vLLM fallback for config loading, packed_modules_mapping support |
| atom/model_ops/base_attention.py | Added layer_name param and prefix attribute to linear attention |
| atom/model_ops/attention_gdn.py | Fixed GatedDetlaNet → GatedDeltaNet typo, added layer_name param |
| atom/model_ops/attentions/gdn_attn.py | Updated import for renamed GatedDeltaNet class |
| atom/model_ops/embed_head.py | Configurable custom all-gather via env variable |
| atom/models/utils.py | Added utility classes/functions for model construction |
| atom/plugin/config.py | Added vllm_config field to PluginConfig |
| atom/plugin/attention.py | Reduced supported block sizes to [16] only |
| atom/utils/selector.py | vLLM-aware attention backend selection |
| atom/utils/envs.py | New ATOM_USE_CUSTOM_ALL_GATHER env variable |
Signed-off-by: ganyi <ygan@amd.com>
Pull request overview
This PR adds Qwen3.5 model family support to ATOM (dense + MoE, including multimodal ConditionalGeneration) targeting vLLM plugin mode, and extends ATOM’s plugin/loader infrastructure to support Qwen3.5-specific attention backends and checkpoint weight mappings.
Changes:
- Add new Qwen3.5 / Qwen3.5-MoE HF config definitions and a new `atom/models/qwen3_5.py` implementation (hybrid linear attention + full attention + multimodal wrappers).
- Add vLLM-plugin GatedDeltaNet (GDN) attention backend and wire backend selection to choose plugin vs non-plugin implementations.
- Enhance weight loading utilities (name mapping + packed-module handling + fused expert handling) and expose a new env toggle for TP all-gather.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/utils/selector.py | Passes plugin-mode flag into attention backend selection; routes GDN backend to vLLM plugin implementation. |
| atom/utils/envs.py | Adds ATOM_USE_CUSTOM_ALL_GATHER env var definition. |
| atom/plugin/vllm/register.py | Registers Qwen3.5 ConditionalGeneration architectures for vLLM plugin mode. |
| atom/plugin/vllm/model_wrapper.py | Extends vLLM wrapper interfaces to support multimodal + MRoPE forwarding. |
| atom/plugin/vllm/attention_backend/gdn_attn.py | Adds a vLLM plugin attention-backend entrypoint for GDN. |
| atom/plugin/vllm/attention_backend/attention_gdn.py | Implements GatedDeltaNet attention for vLLM v1 attention metadata/caching. |
| atom/plugin/config.py | Stores the raw vllm_config on the plugin config object. |
| atom/plugin/attention.py | Changes supported kernel KV block sizes exposed to vLLM plugin attention backend. |
| atom/models/utils.py | Adds utilities for "no init weights" / collecting children and a common_prefix helper. |
| atom/models/qwen3_next.py | Refactors Qwen3-Next to improve compatibility with Qwen3.5 + vLLM attention integration. |
| atom/models/qwen3_5.py | New Qwen3.5 implementation including multimodal ConditionalGeneration in vLLM plugin mode. |
| atom/model_ops/linear.py | Extends packed-shard weight loading for merged column-parallel linears; adds additional shard ids. |
| atom/model_ops/embed_head.py | Adds env-controlled toggle for custom TP all-gather in LM head. |
| atom/model_ops/base_attention.py | Threads layer_name into linear-attention op path and stores prefix. |
| atom/model_ops/attentions/gdn_attn.py | Fixes GatedDeltaNet naming typo and type annotations. |
| atom/model_ops/attention_gdn.py | Renames GatedDetlaNet → GatedDeltaNet and updates forward signature to accept layer_name. |
| atom/model_loader/loader.py | Adds WeightsMapper, supports mapping during iteration, adds fused-expert handling hooks, extends packed-module loading behavior. |
| atom/model_config/qwen3_5.py | Adds Qwen3.5 dense config definitions. |
| atom/model_config/qwen3_5_moe.py | Adds Qwen3.5 MoE config definitions. |
| atom/config.py | Adds vLLM config fallback for unsupported HF configs; extends quant config with packed-module mapping awareness. |
```python
"ATOM_USE_CUSTOM_ALL_GATHER": lambda: os.getenv(
    "ATOM_USE_CUSTOM_ALL_GATHER", "0"
).lower() == "1",
```
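Shown standalone, the lazily evaluated boolean-flag pattern this fragment uses (the `ENV_FLAGS` dict name is invented): the default string `"0"` means the custom all-gather is off unless explicitly opted in.

```python
import os

# Invented container name; the lambda pattern mirrors atom/utils/envs.py.
ENV_FLAGS = {
    "ATOM_USE_CUSTOM_ALL_GATHER": lambda: os.getenv(
        "ATOM_USE_CUSTOM_ALL_GATHER", "0"
    ).lower() == "1",
}
```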
```python
for shard_idx, target_name in enumerate(packed_value):
    param_name = name.replace(k, target_name)
    if "output_scale" not in param_name:
        param = model.get_parameter(param_name)
        weight_loader = getattr(param, "weight_loader")
        futures.append(
            executor.submit(
                weight_loader, param, weight_tensor, shard_idx
```
```python
for ori_param, (
    model_param,
    shard_id,
) in self.packed_modules_mapping.items():
    if proj_name in model_param:
```
Signed-off-by: ganyi <ygan@amd.com>
Pull request overview
Adds Qwen3.5 (dense + MoE, incl. multimodal ConditionalGeneration) support to ATOM, targeting vLLM plugin mode integration and hybrid attention (GatedDeltaNet + full attention) interoperability, along with weight-loading/mapping enhancements needed for Qwen3.5 checkpoints.
Changes:
- Introduces Qwen3.5 model/config implementations (dense + MoE) and a vLLM-side GatedDeltaNet attention backend.
- Extends plugin-mode integration (model registry/wrapper, config plumbing) and improves weight loading/mapping to support packed/fused patterns.
- Adds a fused Triton split/chunk kernel used by the Qwen3.5 linear-attention path and adds an env toggle for custom all-gather.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_envs.py | Tracks new env var in test cleanup list (needs explicit default/override assertions). |
| atom/utils/selector.py | Routes GDN attention backend selection differently when running under vLLM plugin mode. |
| atom/utils/envs.py | Adds ATOM_USE_CUSTOM_ALL_GATHER env var (default enabled). |
| atom/plugin/vllm/register.py | Registers Qwen3.5 ConditionalGeneration architectures for vLLM model registry overrides. |
| atom/plugin/vllm/model_wrapper.py | Extends wrapper interfaces for multimodal/MRoPE and maps Qwen3.5 arch → ATOM implementations. |
| atom/plugin/vllm/attention_backend/gdn_attn.py | Adds vLLM plugin attention backend wrapper for GDN. |
| atom/plugin/vllm/attention_backend/attention_gdn.py | Implements vLLM-plugin-mode GatedDeltaNet attention path. |
| atom/plugin/vllm/attention_backend/init.py | Package marker for vLLM attention backend modules. |
| atom/plugin/config.py | Stores full vllm_config in plugin config for downstream access. |
| atom/plugin/attention.py | Updates supported kernel block sizes list for plugin attention backend. |
| atom/models/utils.py | Adds utilities for module-child collection and meta-device init avoidance. |
| atom/models/qwen3_next.py | Refactors Qwen3-Next components for improved vLLM compatibility and config variations. |
| atom/models/qwen3_5.py | Adds Qwen3.5 dense/MoE model implementations + vLLM multimodal wrappers and weight mapping hooks. |
| atom/model_ops/split_chunk.py | New fused Triton kernel to split/chunk Qwen3.5 projection outputs efficiently. |
| atom/model_ops/linear.py | Enhances packed shard weight loading and expands QKVZBA shard IDs; adjusts Triton GEMM imports. |
| atom/model_ops/embed_head.py | Makes TP all-gather implementation selectable via env var. |
| atom/model_ops/base_attention.py | Passes layer_name through linear-attention custom op to backend impl. |
| atom/model_ops/attentions/gdn_attn.py | Fixes GatedDeltaNet naming and type annotations. |
| atom/model_ops/attention_gdn.py | Renames class to GatedDeltaNet and updates forward signature to accept layer_name. |
| atom/model_loader/loader.py | Adds WeightsMapper, plugin-mode config selection, packed-module list handling, and fused-expert loading hooks. |
| atom/model_config/qwen3_5.py | Adds Qwen3.5 dense HF config definitions (text + vision). |
| atom/model_config/qwen3_5_moe.py | Adds Qwen3.5 MoE HF config definitions (text + vision). |
| atom/config.py | Improves HF config loading fallback in plugin mode and passes vLLM quant info into QuantizationConfig. |
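The loader changes above mention a WeightsMapper for checkpoint-name remapping. A minimal stand-alone sketch of the idea follows; the class and the prefix mapping here are hypothetical illustrations, not ATOM's actual implementation or the real Qwen3.5 mapping.

```python
# Hypothetical sketch of checkpoint-name prefix remapping, in the spirit
# of the WeightsMapper hook added to loader.py.
class SimpleWeightsMapper:
    def __init__(self, orig_to_new_prefix):
        # e.g. remap bare "model." names under a multimodal wrapper
        self.orig_to_new_prefix = orig_to_new_prefix

    def map_name(self, name):
        for orig, new in self.orig_to_new_prefix.items():
            if name.startswith(orig):
                return new + name[len(orig):]
        return name  # names without a matching prefix pass through

# Assumed example mapping for illustration only
mapper = SimpleWeightsMapper({"model.": "language_model.model."})
```

Such a mapper lets the same model code load checkpoints whose tensor names differ only by a wrapper prefix.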
```python
if isinstance(packed_value, list):
    # Checkpoint has fused weight, split into separate params
    for shard_idx, target_name in enumerate(packed_value):
        param_name = name.replace(k, target_name)
        if "output_scale" not in param_name:
            param = model.get_parameter(param_name)
            weight_loader = getattr(param, "weight_loader")
            futures.append(
                executor.submit(
                    weight_loader, param, weight_tensor, shard_idx
                )
            )
            loaded_weights_record.add(prefix + param_name)
```
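The loop above expands one fused checkpoint tensor into its separate target parameters, one shard index per target. A self-contained sketch of just the name expansion (the mapping and parameter names below are hypothetical examples, not the real Qwen3.5 packed-module list):

```python
# Hypothetical packed-module mapping: one fused checkpoint tensor feeds
# several separate model parameters, identified by shard index.
packed_mapping = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}

def expand_packed(name, mapping):
    """Return (shard_idx, target_param_name) pairs for a checkpoint name."""
    for fused, targets in mapping.items():
        if fused in name:
            return [(i, name.replace(fused, t)) for i, t in enumerate(targets)]
    return [(0, name)]  # not packed: single shard, name unchanged
```

Each `(shard_idx, param_name)` pair then drives one `weight_loader` call, as in the loader loop above.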
atom/model_ops/split_chunk.py (outdated)

```python
tl.store(z_ptr + z_out_base + dim_idx, z_vals, mask=mask)
```

```python
# Store zeros to core_attn_out: coalesced write
zeros = tl.zeros([BLOCK_SIZE], dtype=tl.float16)
```

```python
# Must be >= head_v_dim (128) and >= 2*num_v_heads_tp (32)
BLOCK_SIZE = 128
```
The `ba` projection data only needs a single program to handle it; all the other outputs are covered by the 128-wide block.
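The reviewer's point can be checked with a quick grid-size calculation: with `BLOCK_SIZE = 128`, any output dimension up to 128 fits in a single program. The sketch below uses the dimensions named in the kernel comments (`head_v_dim = 128`, `2 * num_v_heads_tp = 32`); the helper itself is illustrative, not part of the kernel.

```python
def num_programs(dim, block_size=128):
    """Number of programs needed to cover `dim` elements at this block size."""
    return -(-dim // block_size)  # ceiling division

head_v_dim = 128   # from the kernel comment: BLOCK_SIZE must be >= this
ba_dim = 2 * 16    # 2 * num_v_heads_tp (= 32), the ba projection width
```

So both the 128-wide value heads and the 32-wide `ba` projection are each covered by one program per row.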
```python
    "ATOM_DISABLE_MMAP",
    "ATOM_DISABLE_VLLM_PLUGIN",
    "ATOM_DISABLE_VLLM_PLUGIN_ATTENTION",
    "ATOM_USE_CUSTOM_ALL_GATHER",
]
```
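Switches like these are typically read as boolean on/off environment flags. A minimal reader is sketched below; the parsing convention is an assumption for illustration, not ATOM's actual helper.

```python
import os

def env_flag(name, default=False):
    """Interpret an environment variable as a boolean on/off switch.

    Assumed convention: unset -> default; "", "0", "false", "no" -> False;
    anything else -> True.
    """
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() not in ("", "0", "false", "no")

# e.g. ATOM_DISABLE_VLLM_PLUGIN=1 would disable plugin mode
```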
Signed-off-by: ganyi <ygan@amd.com>
Summary
This PR adds Qwen3.5 model support to ATOM, covering both dense and MoE variants. The implementation leverages ATOM's vLLM plugin mode for seamless integration with the vLLM ecosystem while using ATOM's optimized kernels for the linear-attention (GatedDeltaNet) layers. It works only in vLLM plugin mode.
Key Features
1. Qwen3.5 Model Support
2. Multi-Modal Input Support
Technical Details
Files Added
Model Configuration:
- `atom/model_config/qwen3_5.py` - Qwen3.5 dense config
- `atom/model_config/qwen3_5_moe.py` - Qwen3.5 MoE config

Model Implementation:

- `atom/models/qwen3_5.py` - Main Qwen3.5 model implementation
  - `Qwen3_5GatedDeltaNet` - Linear attention layer (optimized with ATOM kernels)
  - `Qwen3_5Attention` - Full attention layer (uses vLLM native implementation)
  - `Qwen3_5ForCausalLM` - Dense model wrapper
  - `Qwen3_5MoeForCausalLM` - MoE model wrapper
  - `Qwen3_5ForConditionalGeneration` - Multimodal model with vision encoder

Plugin Integration:

- `atom/plugin/vllm/attention_backend/attention_gdn.py` - GatedDeltaNet backend for vLLM
- `atom/plugin/vllm/model_wrapper.py` - Enhanced wrapper for Qwen3.5 models

Files Modified
Core Components:
- `atom/config.py` - Added Qwen3.5 config support
- `atom/model_loader/loader.py` - Enhanced weight loading for Qwen3.5
- `atom/model_engine/model_runner.py` - vLLM plugin mode integration
- `atom/models/qwen3_next.py` - Refactored for Qwen3.5 compatibility
- `atom/model_ops/base_attention.py` - Enhanced attention abstraction
- `atom/model_ops/layernorm.py` - Added support for Qwen3.5 norm layers
- `atom/model_ops/linear.py` - Weight loading improvements
- `atom/model_ops/moe.py` - MoE layer enhancements

Architecture
Qwen3.5 uses a hybrid architecture with:
- Linear Attention (GatedDeltaNet): Layers 0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 22, 24, 25, 26, 28, 29, 30, 32, 33, 34, 36, 37, 38
  - Uses `torch.ops.aiter.linear_attention_with_output_base`
- Full Attention: Layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39
  - Uses `torch.ops.vllm.unified_attention_with_output`
- MoE Layers: Integrated into most layers
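The hybrid layout follows a regular pattern: every fourth layer (index ≡ 3 mod 4) uses full attention, and the rest use GatedDeltaNet. A small sketch that reproduces the layer lists above; the interval is inferred from those indices rather than read from the HF config.

```python
def layer_types(num_layers=40, full_attn_interval=4):
    """Label each layer: 'full_attention' every Nth layer, else 'linear_attention'.

    The interval of 4 is inferred from the layer indices listed for Qwen3.5.
    """
    return [
        "full_attention" if (i + 1) % full_attn_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]
```

With 40 layers this gives 10 full-attention layers and 30 GatedDeltaNet layers, matching the lists above.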
Usage
Starting Server
Text-Only Inference
Multimodal (Image + Text) Inference
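To illustrate the request shape for multimodal inference, here is a sketch that builds an OpenAI-compatible chat payload with one image and a text prompt. The model name and image URL are placeholders; the served model name depends on your deployment.

```python
def build_mm_request(model, image_url, prompt):
    """Assemble an OpenAI-compatible chat payload with one image + text part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# Placeholder model name and URL, for illustration only
payload = build_mm_request(
    "qwen3.5", "https://example.com/cat.png", "Describe the image."
)
```

The payload would then be POSTed to the server's `/v1/chat/completions` endpoint.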
Environment Variables
Required:
Testing
Validated on:
Text Only
GSM8K results on Qwen/Qwen3.5-35B-A3B-FP8
Image
Request
Result:
Implementation Notes
vLLM Plugin Mode Only: This implementation requires vLLM plugin mode. Standalone ATOM mode is not supported for Qwen3.5.
Hybrid Attention: GatedDeltaNet layers use ATOM's optimized kernels, while full attention uses vLLM's native implementation for maximum compatibility.
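The split between the two kernel paths can be sketched as a per-layer dispatch. This is a simplification of the real wiring; the op names are the ones quoted in the Architecture section above.

```python
def attention_op(layer_type):
    """Pick the custom-op path for a layer in the hybrid stack (simplified)."""
    if layer_type == "linear_attention":
        # GatedDeltaNet layers go through ATOM's optimized kernel
        return "torch.ops.aiter.linear_attention_with_output_base"
    # Full-attention layers use vLLM's native implementation
    return "torch.ops.vllm.unified_attention_with_output"
```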
Multimodal: Inherits from `Qwen3VLForConditionalGeneration` for vision-language support.

Breaking Changes
None - this is a new model addition.
Related Issues
Submission Checklist