add fix for qwen35 mtp layer name #2
Open
artem-osmosis wants to merge 8 commits into
Guard EnergonProvider import with try/except so megatron-bridge works when megatron.energon is not installed.

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
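The guard described in this commit can be sketched roughly as below. This is only an illustration of the optional-import pattern: the helper `optional_import`, the flag `HAVE_ENERGON`, and the module path `hypothetical_pkg.energon_provider` are placeholders invented for the example, not the names used in megatron-bridge.

```python
import importlib
from typing import Any, Optional


def optional_import(module_name: str, attr: str) -> Optional[Any]:
    """Return `attr` from `module_name`, or None when the module is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Optional dependency is not installed; callers must check for None.
        return None
    return getattr(module, attr, None)


# Hypothetical module path, used only for illustration.
EnergonProvider = optional_import("hypothetical_pkg.energon_provider", "EnergonProvider")
HAVE_ENERGON = EnergonProvider is not None
```

Code that needs the provider can then check `HAVE_ENERGON` and raise a clear error only when the feature is actually requested, instead of failing at import time.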
Add megatron_param_name field to HFWeightTuple so downstream consumers (e.g. RL pipelines) can map exported HF weights back to their original Megatron parameter names.

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
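As a rough sketch of what the extended tuple might look like and how a consumer could use it: the three field names come from the commit messages in this PR, while the `Any` weight type, the `None` default, the helper `hf_to_megatron_names`, and the Megatron-side name in the usage note are assumptions made for this example.

```python
from typing import Any, Dict, Iterable, NamedTuple, Optional


class HFWeightTuple(NamedTuple):
    """Sketch of the 3-field tuple; the real class lives in megatron-bridge."""

    param_name: str                             # HF parameter name
    weight: Any                                 # tensor in practice; Any keeps the sketch dependency-free
    megatron_param_name: Optional[str] = None   # assumed default, for illustration only


def hf_to_megatron_names(weights: Iterable[HFWeightTuple]) -> Dict[str, str]:
    """Map each exported HF name to the Megatron parameter it came from."""
    mapping = {}
    for hf_name, _weight, megatron_name in weights:  # callers must unpack three fields
        if megatron_name is not None:
            mapping[hf_name] = megatron_name
    return mapping
```

For instance, `hf_to_megatron_names([HFWeightTuple("model.embed_tokens.weight", [0.0], "embedding.word_embeddings.weight")])` would return a one-entry dictionary. The Megatron-side name here is made up for the example.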
Propagate actual megatron linear_in/linear_out parameter names through HFWeightTuple in peft_bridge adapter weight streaming instead of None.

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
- Add _pad_right_dim0 padding for vocab size mismatch in ColumnParallelMapping (embedding/output_layer weights)
- Fix HFWeightTuple unpacking in state.py, utils.py, and auto_bridge.py to handle the new 3-field NamedTuple (param_name, weight, megatron_param_name)

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
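To illustrate the padding idea, here is a dependency-free sketch that pads a 2D weight, represented as a list of rows, along dimension 0 until it reaches a target vocabulary size. It is not the actual `_pad_right_dim0` implementation: the signature, the zero fill value, and the choice to leave larger inputs untouched are all assumptions, and the real code presumably operates on tensors.

```python
from typing import List


def pad_right_dim0(rows: List[List[float]], target_rows: int, fill: float = 0.0) -> List[List[float]]:
    """Append filler rows so the result has at least `target_rows` rows."""
    missing = target_rows - len(rows)
    if missing <= 0:
        # Already large enough; this sketch never truncates.
        return rows
    width = len(rows[0]) if rows else 0
    # New rows go at the end of dim 0, matching the "pad right" naming.
    return rows + [[fill] * width for _ in range(missing)]
```

In the embedding case, `rows` would correspond to vocabulary entries, so padding adds unused token rows rather than altering existing ones.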
Handle MoE expert layers (e.g. GPT-OSS gate_up_proj) whose HF param names do not end with the conventional ".weight" suffix:
- _select_hf_base_param_name: return hf_param even when it lacks the expected suffix, so MoE expert mappings are no longer silently dropped.
- _megatron_to_hf_adapter_name: append hf_suffix directly when hf_base_name does not end with base_suffix instead of returning None.
- _make_lora_param_name: remove hard ".weight" suffix requirement; gracefully handle both standard and MoE-style param names.

For grouped MoE experts:
- Use a single ".weight0" lookup instead of per-expert iteration, since grouped expert adapters share 2D weights across all experts.
- unsqueeze(0) on adapter tensors to restore the expected 3D shape when yielding HFWeightTuples for grouped experts.

Made-with: Cursor
This reverts commit d123265.
Refactor grouped expert adapter handling in peft_bridge to use overridable hook methods instead of hardcoded GPT-OSS-specific logic:
- _select_hf_base_param_name: accept HF params that don't end with the expected suffix (e.g. MoE expert names like gate_up_proj)
- _resolve_hf_adapter_param_name / _make_lora_param_name: handle param names without .weight suffix gracefully
- Add _get_grouped_expert_base_suffixes() hook: default per-expert iteration; GPT-OSS overrides to single .weight0 lookup
- Add _prepare_expert_adapter_for_hf() hook: default no-op; GPT-OSS overrides to unsqueeze(0) for 3D shape restoration
- Add unit test for grouped expert LoRA adapter export path

Made-with: Cursor
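The hook-method structure described above might look something like the following. The two hook names are taken from the commit message, but the class names, the `num_experts` argument, and the use of a nested list as a stand-in for `tensor.unsqueeze(0)` are assumptions for this sketch; the real signatures in peft_bridge may differ.

```python
from typing import Any, List


class BasePeftBridgeSketch:
    """Hypothetical base class showing the default hook behaviour."""

    def _get_grouped_expert_base_suffixes(self, num_experts: int) -> List[str]:
        # Default: look up one base parameter per expert.
        return [f".weight{i}" for i in range(num_experts)]

    def _prepare_expert_adapter_for_hf(self, adapter: Any) -> Any:
        # Default: no-op, the adapter is exported unchanged.
        return adapter


class GptOssBridgeSketch(BasePeftBridgeSketch):
    """Hypothetical GPT-OSS override: one shared lookup, extra leading dim."""

    def _get_grouped_expert_base_suffixes(self, num_experts: int) -> List[str]:
        # Grouped expert adapters share weights, so one lookup is enough.
        return [".weight0"]

    def _prepare_expert_adapter_for_hf(self, adapter: Any) -> Any:
        # Stand-in for tensor.unsqueeze(0): wrap to add a leading dimension.
        return [adapter]
```

Keeping the model-specific behaviour in overrides means the shared export loop only calls the hooks and does not need to know which model family it is handling.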
When we load Qwen3.5 into miles through megatron bridge, the MTP modules are occasionally (specifically for the MoE models) loaded under names such as
language_model.mtp.layers.0.transformer_layer.mlp.router.weight rather than language_model.mtp.layers.0.mtp_model_layer.mlp.router.weight. This change handles both naming schemes.
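One way to tolerate both spellings is to expand a parameter name into its two candidate forms and try each when looking up a mapping. The sketch below shows that idea only; the function name `mtp_name_variants` and the regular expression are illustrative and are not taken from the actual patch.

```python
import re
from typing import List

# Matches the segment after ".mtp.layers.<index>." that differs between the two schemes.
_MTP_LAYER = re.compile(r"\.mtp\.layers\.\d+\.(transformer_layer|mtp_model_layer)\.")


def mtp_name_variants(name: str) -> List[str]:
    """Return both MTP spellings of `name`, or just `name` if it is not an MTP param."""
    match = _MTP_LAYER.search(name)
    if match is None:
        return [name]
    start, end = match.start(1), match.end(1)
    return [
        name[:start] + variant + name[end:]
        for variant in ("transformer_layer", "mtp_model_layer")
    ]
```

A loader could then check each variant against its known parameter names and use whichever one is present.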