feat(engine): lora support for MoE models (single node / cross node) #1159
gursimar wants to merge 1 commit into inclusionAI:main from
Conversation
Code Review
This pull request introduces support for LoRA fine-tuning on MoE architectures, specifically Qwen3 MoE, within the Megatron engine. It includes the necessary conversion functions, registry updates, and configuration examples. However, the implementation of expert adapter collection in megatron_engine.py is flawed, as it risks incomplete weight updates in multi-node expert parallel settings; it should be moved to the expert-specific update loop. Additionally, the warning logic for missing tensors in the pipeline parallel head was incorrectly restricted to LoRA-only scenarios and should be restored to cover all cases. Finally, while setting cpu=False for adapter exports resolves hangs in certain Slurm environments, it may increase the risk of OOM errors in large-scale MoE configurations.
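The review's main point is that expert LoRA tensors must be gathered inside the expert-specific update loop, because in expert-parallel setups each rank only hosts a subset of experts. A minimal sketch of that idea, with purely illustrative names (`collect_expert_lora_params`, the dict layout) that are not AReaL's actual API:

```python
def collect_expert_lora_params(local_experts, ep_rank):
    """Gather LoRA tensors only from the experts owned by this EP rank.

    Collecting once outside the expert loop (e.g. on a single rank) would
    miss experts hosted on other nodes, yielding incomplete weight updates.
    """
    params = {}
    for expert_id, tensors in local_experts.items():
        for name, tensor in tensors.items():
            if "lora_A" in name or "lora_B" in name:
                params[f"experts.{expert_id}.{name}"] = tensor
    return params

# Toy example: this EP rank owns experts 0 and 1; base weights are skipped.
rank0_experts = {
    0: {"gate_proj.lora_A": [0.1], "gate_proj.weight": [1.0]},
    1: {"gate_proj.lora_B": [0.2]},
}
collected = collect_expert_lora_params(rank0_experts, ep_rank=0)
print(sorted(collected))  # → ['experts.0.gate_proj.lora_A', 'experts.1.gate_proj.lora_B']
```

Each EP rank would run this over its own shard, so the union across ranks covers every expert's adapters.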
Description
Add LoRA support for Megatron MoE models and enable cross-node LoRA training with Megatron + vLLM.
This PR extends the Megatron-to-HF LoRA conversion path for Qwen3 MoE adapters, updates Megatron parameter collection so MoE LoRA tensors are included in distributed weight updates, and improves vLLM-side LoRA loading so adapter shards can be reconstructed and activated correctly during XCCL updates.
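The Megatron-to-HF conversion path is essentially a key-renaming problem: Megatron expert LoRA tensor names must be rewritten into the HF/PEFT naming scheme before vLLM can load them. A hedged sketch of one such mapping (only the `linear_fc2` → `down_proj` case; the real converter handles many more, e.g. fused QKV and gate/up splits, and the exact key patterns here are assumptions):

```python
import re

# Assumed Megatron-style key pattern for per-expert down-projection LoRA weights.
_FC2_PAT = re.compile(
    r"decoder\.layers\.(\d+)\.mlp\.experts\.(\d+)\.linear_fc2\.(lora_[AB])\.weight"
)

def megatron_moe_lora_to_hf(name: str) -> str:
    """Map a Megatron MoE LoRA key to its HF-style equivalent (sketch)."""
    m = _FC2_PAT.fullmatch(name)
    if m:
        layer, expert, lora = m.groups()
        return f"model.layers.{layer}.mlp.experts.{expert}.down_proj.{lora}.weight"
    return name  # pass through keys this sketch does not handle

print(megatron_moe_lora_to_hf(
    "decoder.layers.3.mlp.experts.7.linear_fc2.lora_A.weight"
))  # → model.layers.3.mlp.experts.7.down_proj.lora_A.weight
```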
With these changes, AReaL can train LoRA adapters on Megatron MoE models both on a single node and across nodes with Megatron + vLLM.
Validated manually on Qwen3-30B-A3B.
Observed eval reward improvement in the tested setup from about 0.23 to about 0.90.
Related Issue
Fixes #(issue)
Type of Change
Checklist
- `pre-commit run --all-files`
- `./docs/build_all.sh`
- `main`
- `/review-pr` command
- `/create-pr`

Breaking Change Details (if applicable):
N/A
Additional Context
Key implementation changes:
- `qwen3_moe_lora` conversion support in the Megatron conversion registry

Main files:
- areal/engine/megatron_engine.py
- areal/engine/megatron_utils/megatron.py
- areal/engine/megatron_utils/megatron_lora.py
- areal/engine/vllm_ext/vllm_worker_extension.py
- examples/math/gsm8k_grpo_megatron_lora_moe.yaml

Docs updated:
- docs/en/reference/lora.md
- docs/zh/reference/lora.md
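For context on what "adding `qwen3_moe_lora` to the conversion registry" involves, here is a minimal sketch of a decorator-based converter registry. All names (`CONVERSION_REGISTRY`, `register_converter`) are illustrative, not AReaL's actual internals:

```python
# Hypothetical registry mapping a model type string to its state-dict converter.
CONVERSION_REGISTRY = {}

def register_converter(model_type):
    """Register a Megatron->HF state-dict converter under a model-type key."""
    def decorator(fn):
        CONVERSION_REGISTRY[model_type] = fn
        return fn
    return decorator

@register_converter("qwen3_moe_lora")
def convert_qwen3_moe_lora(state_dict):
    # Toy conversion: rename one Megatron-style key to its HF-style name.
    return {k.replace("linear_fc2", "down_proj"): v for k, v in state_dict.items()}

converted = CONVERSION_REGISTRY["qwen3_moe_lora"](
    {"experts.0.linear_fc2.lora_A": 1}
)
print(converted)  # → {'experts.0.down_proj.lora_A': 1}
```

The PR's registry entry plays this role: given a model type, the engine looks up the converter that knows how to rewrite that architecture's LoRA tensors.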