
Add fix for qwen35 mtp layer name #2

Open
artem-osmosis wants to merge 8 commits into radixark:bridge from artem-osmosis:qwen35_miles_lora_v4



@artem-osmosis artem-osmosis commented May 11, 2026

When loading Qwen3.5 into miles through Megatron Bridge, the MTP modules are occasionally (specifically for the MoE models) named with a transformer_layer segment, e.g. language_model.mtp.layers.0.transformer_layer.mlp.router.weight, instead of the mtp_model_layer segment, e.g. language_model.mtp.layers.0.mtp_model_layer.mlp.router.weight; the same applies to the other MTP parameters. This change handles both naming schemes.
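
To illustrate the two naming schemes, here is a minimal sketch of one way to normalize them. It is not the code in this PR: the helper name `canonical_mtp_name` and the choice of `mtp_model_layer` as the canonical spelling are assumptions made for the example.

```python
MTP_PREFIX = "language_model.mtp.layers."
# The two spellings of the MTP submodule segment described above.
MTP_ALIASES = ("transformer_layer", "mtp_model_layer")


def canonical_mtp_name(name: str, canonical: str = "mtp_model_layer") -> str:
    """Rewrite either MTP segment spelling to one canonical spelling.

    Hypothetical helper: names outside the MTP prefix are returned unchanged.
    """
    if not name.startswith(MTP_PREFIX):
        return name
    rest = name[len(MTP_PREFIX):]
    layer_idx, sep, tail = rest.partition(".")
    if not sep or not layer_idx.isdigit():
        return name
    head, sep2, remainder = tail.partition(".")
    if head in MTP_ALIASES:
        return f"{MTP_PREFIX}{layer_idx}.{canonical}{sep2}{remainder}"
    return name
```

With this, a lookup table keyed on the canonical spelling matches whichever form the loader produced.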

yushengsu-thu and others added 8 commits April 9, 2026 01:06
Guard EnergonProvider import with try/except so megatron-bridge works
when megatron.energon is not installed.

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
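
The import guard described in the commit above could look roughly like the following. This is a generic sketch of the optional-dependency pattern, not the actual diff: the flag `HAVE_ENERGON` and the helper `require_energon` are hypothetical names.

```python
# Optional dependency: megatron.energon may not be installed.
try:
    import megatron.energon  # noqa: F401
    HAVE_ENERGON = True
except ImportError:
    HAVE_ENERGON = False


def require_energon() -> None:
    """Raise a clear error only when an energon-backed feature is used."""
    if not HAVE_ENERGON:
        raise ImportError(
            "megatron.energon is required for this feature but is not installed"
        )
```

Deferring the failure to the point of use keeps the rest of the package importable without the optional dependency.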
Add megatron_param_name field to HFWeightTuple so downstream consumers
(e.g. RL pipelines) can map exported HF weights back to their original
Megatron parameter names.

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Propagate actual megatron linear_in/linear_out parameter names through
HFWeightTuple in peft_bridge adapter weight streaming instead of None.

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
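
A sketch of the tuple shape the two commits above describe, and of how a downstream consumer might use the new field. The three field names come from the commit messages; the `None` default, the `Any` weight type, and the helper `build_reverse_map` are assumptions for illustration.

```python
from typing import Any, Dict, Iterable, NamedTuple, Optional


class HFWeightTuple(NamedTuple):
    param_name: str                             # exported HF parameter name
    weight: Any                                 # tensor in practice
    megatron_param_name: Optional[str] = None   # assumed default


def build_reverse_map(weights: Iterable[HFWeightTuple]) -> Dict[str, str]:
    """Map HF names back to Megatron names, skipping entries without one."""
    return {
        w.param_name: w.megatron_param_name
        for w in weights
        if w.megatron_param_name is not None
    }
```

Note that adding a third field breaks any existing two-name unpacking such as `for name, weight in stream`, so call sites need to unpack three values or use attribute access.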
- Add _pad_right_dim0 padding for vocab size mismatch in
  ColumnParallelMapping (embedding/output_layer weights)
- Fix HFWeightTuple unpacking in state.py, utils.py, and auto_bridge.py
  to handle the new 3-field NamedTuple (param_name, weight,
  megatron_param_name)

Co-Authored-By: Yusheng Su <yushengsu.thu@gmail.com>
Co-Authored-By: gongyisheng <yishenggong9437@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
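
The idea behind padding dimension 0 for a vocab size mismatch can be sketched as below. Plain nested lists stand in for a 2-D weight so the example runs without a tensor library; the function name `pad_right_dim0`, the zero fill value, and the error on oversize input are assumptions, not the behavior of the real `_pad_right_dim0`.

```python
from typing import List


def pad_right_dim0(
    rows: List[List[float]], target_rows: int, fill: float = 0.0
) -> List[List[float]]:
    """Append fill rows at the end of dim 0 until there are target_rows rows."""
    if target_rows < len(rows):
        raise ValueError("target_rows is smaller than the current row count")
    width = len(rows[0]) if rows else 0
    padding = [[fill] * width for _ in range(target_rows - len(rows))]
    return [list(r) for r in rows] + padding
```

For an embedding or output-layer weight, the rows correspond to vocabulary entries, so the appended rows cover the extra (padded) vocabulary slots.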
Handle MoE expert layers (e.g. GPT-OSS gate_up_proj) whose HF param
names do not end with the conventional ".weight" suffix:

- _select_hf_base_param_name: return hf_param even when it lacks the
  expected suffix, so MoE expert mappings are no longer silently dropped.
- _megatron_to_hf_adapter_name: append hf_suffix directly when
  hf_base_name does not end with base_suffix instead of returning None.
- _make_lora_param_name: remove hard ".weight" suffix requirement;
  gracefully handle both standard and MoE-style param names.

For grouped MoE experts:
- Use a single ".weight0" lookup instead of per-expert iteration, since
  grouped expert adapters share 2D weights across all experts.
- unsqueeze(0) on adapter tensors to restore the expected 3D shape when
  yielding HFWeightTuples for grouped experts.

Made-with: Cursor
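
The suffix handling described in the commit above can be sketched as follows. This is a simplified stand-in, not the real `_make_lora_param_name`: the function name, its signature, and the example parameter names are hypothetical.

```python
def make_lora_param_name(
    hf_base_name: str, adapter_suffix: str, base_suffix: str = ".weight"
) -> str:
    """Build an adapter parameter name from a base parameter name.

    If the base name ends with base_suffix, replace it with the adapter
    suffix; otherwise (e.g. MoE expert names) append the adapter suffix
    directly instead of giving up.
    """
    if hf_base_name.endswith(base_suffix):
        stem = hf_base_name[: -len(base_suffix)]
    else:
        stem = hf_base_name
    return stem + adapter_suffix
```

For example, a hypothetical `...q_proj.weight` becomes `...q_proj.lora_A.weight`, while a suffix-less `...experts.gate_up_proj` becomes `...experts.gate_up_proj.lora_A.weight`.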
Refactor grouped expert adapter handling in peft_bridge to use
overridable hook methods instead of hardcoded GPT-OSS-specific logic:

- _select_hf_base_param_name: accept HF params that don't end with
  the expected suffix (e.g. MoE expert names like gate_up_proj)
- _resolve_hf_adapter_param_name / _make_lora_param_name: handle
  param names without .weight suffix gracefully
- Add _get_grouped_expert_base_suffixes() hook: default per-expert
  iteration; GPT-OSS overrides to single .weight0 lookup
- Add _prepare_expert_adapter_for_hf() hook: default no-op;
  GPT-OSS overrides to unsqueeze(0) for 3D shape restoration
- Add unit test for grouped expert LoRA adapter export path

Made-with: Cursor
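
The hook pattern from the refactor above might be shaped like this. Only the two hook names are taken from the commit message; the class names, the signatures, the `.weight{i}` default suffixes, and the use of a wrapping list in place of `unsqueeze(0)` are assumptions so the sketch runs without a tensor library.

```python
from typing import Any, List


class BasePeftBridge:
    """Hypothetical base class with default hook behavior."""

    def _get_grouped_expert_base_suffixes(self, num_experts: int) -> List[str]:
        # Default: look up one suffix per expert.
        return [f".weight{i}" for i in range(num_experts)]

    def _prepare_expert_adapter_for_hf(self, adapter: Any) -> Any:
        # Default: no-op.
        return adapter


class GroupedExpertBridge(BasePeftBridge):
    """Hypothetical override for models whose experts share 2D adapter weights."""

    def _get_grouped_expert_base_suffixes(self, num_experts: int) -> List[str]:
        # A single lookup covers all experts.
        return [".weight0"]

    def _prepare_expert_adapter_for_hf(self, adapter: Any) -> Any:
        # Stand-in for tensor.unsqueeze(0): add a leading dimension.
        return [adapter]
```

Keeping the model-specific behavior in overrides lets the shared export loop call the hooks without knowing which model it is handling.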