feat(engine): lora support for MoE models (single node / cross node) #1159

Open
gursimar wants to merge 1 commit into inclusionAI:main from gursimar:lora-megatron-moe

Conversation


@gursimar (Contributor) commented Apr 9, 2026

Description

Add LoRA support for Megatron MoE models and enable cross-node LoRA training with Megatron + vLLM.

This PR extends the Megatron-to-HF LoRA conversion path for Qwen3 MoE adapters, updates Megatron parameter collection so MoE LoRA tensors are included in distributed weight updates, and improves vLLM-side LoRA loading so adapter shards can be reconstructed and activated correctly during XCCL updates.
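The grouped-to-per-expert conversion described above can be sketched roughly as follows. This is a hedged illustration, not AReaL's actual conversion API: the function name, the key template, and the tensor shapes are hypothetical stand-ins for the real registry entries.

```python
# Hypothetical sketch: splitting a Megatron grouped expert LoRA tensor into
# per-expert HF/vLLM adapter entries. Names and shapes are illustrative only.
import torch


def split_grouped_expert_lora(
    name_template: str, grouped: torch.Tensor
) -> dict[str, torch.Tensor]:
    """Split a grouped expert LoRA tensor of shape [num_experts, ...] into
    per-expert entries keyed by an expert-indexed HF-style parameter name."""
    num_experts = grouped.shape[0]
    shards = {}
    for e in range(num_experts):
        # e.g. "model.layers.0.mlp.experts.3.gate_proj.lora_A.weight"
        hf_name = name_template.format(expert=e)
        shards[hf_name] = grouped[e].contiguous()
    return shards


# Example: 4 experts, LoRA rank 8, input dim 16 (toy sizes).
template = "model.layers.0.mlp.experts.{expert}.gate_proj.lora_A.weight"
grouped = torch.randn(4, 8, 16)
shards = split_grouped_expert_lora(template, grouped)
```

The real conversion additionally has to handle both grouped and per-expert Megatron layouts and the lora_B counterpart; this sketch only shows the expert-indexing idea.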

With these changes, AReaL can now:

  • train LoRA adapters on MoE models with Megatron
  • synchronize LoRA updates from Megatron to vLLM rollout workers
  • run LoRA training across multiple nodes

Validated manually on Qwen3-30B-A3B with:

  • single-node 8 x 80 GB GPUs [still in progress] (training curve screenshot: train_sn)
  • multi-node 3 x 8 x 80 GB GPUs

Observed eval reward improvement in the tested setup from about 0.23 to about 0.90.

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Key implementation changes:

  • add qwen3_moe_lora conversion support in Megatron conversion registry
  • convert grouped and per-expert Qwen3 MoE LoRA weights into expert-indexed HF/vLLM adapter names
  • include MoE LoRA expert tensors in Megatron distributed weight collection when LoRA is enabled
  • update vLLM LoRA XCCL loading path to materialize received adapter tensors on CPU before rebuilding the adapter
  • add a Megatron LoRA MoE GSM8K example config for Qwen3-30B Base model
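The CPU materialization step from the vLLM XCCL loading bullet above might look roughly like this. A minimal sketch under stated assumptions: `materialize_adapter_on_cpu` and the tensor names are hypothetical, not vLLM's actual worker-extension API.

```python
# Hedged sketch: copy adapter tensors received over XCCL onto CPU before the
# adapter object is rebuilt, so the rebuild does not pin extra GPU memory
# during a distributed weight update. Function name is illustrative only.
import torch


def materialize_adapter_on_cpu(
    received: dict[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
    """Detach and move received adapter tensors to CPU memory."""
    return {
        name: t.detach().to("cpu").contiguous() for name, t in received.items()
    }


# Example: a tiny fake adapter state as it might arrive from the trainer side.
received = {"lora_A": torch.randn(8, 16), "lora_B": torch.randn(16, 8)}
cpu_state = materialize_adapter_on_cpu(received)
```

Rebuilding from a CPU-resident state dict trades a host-device copy for lower peak GPU memory, which matters most in large MoE configurations where many expert adapters arrive at once.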

Main files:

  • areal/engine/megatron_engine.py
  • areal/engine/megatron_utils/megatron.py
  • areal/engine/megatron_utils/megatron_lora.py
  • areal/engine/vllm_ext/vllm_worker_extension.py
  • examples/math/gsm8k_grpo_megatron_lora_moe.yaml

Docs updated:

  • docs/en/reference/lora.md
  • docs/zh/reference/lora.md


@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for LoRA fine-tuning on MoE architectures, specifically Qwen3 MoE, within the Megatron engine. It includes the necessary conversion functions, registry updates, and configuration examples. However, the implementation of expert adapter collection in megatron_engine.py is flawed, as it risks incomplete weight updates in multi-node expert parallel settings; it should be moved to the expert-specific update loop. Additionally, the warning logic for missing tensors in the pipeline parallel head was incorrectly restricted to LoRA-only scenarios and should be restored to cover all cases. Finally, while setting cpu=False for adapter exports resolves hangs in certain Slurm environments, it may increase the risk of OOM errors in large-scale MoE configurations.

Two comment threads on areal/engine/megatron_engine.py
@gursimar force-pushed the lora-megatron-moe branch 2 times, most recently from d2f252c to a1f16c2 on April 14, 2026 at 18:36
@gursimar force-pushed the lora-megatron-moe branch from a1f16c2 to 1fbaf0b on April 15, 2026 at 17:39