feat(engine): lora support for MoE models (single node / cross node) #1159

Open
gursimar wants to merge 1 commit into inclusionAI:main from gursimar:lora-megatron-moe

Conversation


@gursimar (Contributor) commented Apr 9, 2026

Description

Add LoRA support for Megatron MoE models and enable cross-node LoRA training with Megatron + vLLM.

This PR extends the Megatron-to-HF LoRA conversion path for Qwen3 MoE adapters, updates Megatron parameter collection so MoE LoRA tensors are included in distributed weight updates, and improves vLLM-side LoRA loading so adapter shards can be reconstructed and activated correctly during XCCL updates.
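The grouped-to-per-expert conversion described above can be sketched roughly as follows. This is a hedged illustration, not AReaL's actual conversion API: the function name, the key template, and the tensor shapes are hypothetical stand-ins for the real registry entries.

```python
# Hypothetical sketch: splitting a Megatron grouped expert LoRA tensor into
# per-expert HF/vLLM adapter entries. Names and shapes are illustrative only.
import torch


def split_grouped_expert_lora(
    name_template: str, grouped: torch.Tensor
) -> dict[str, torch.Tensor]:
    """Split a grouped expert LoRA tensor of shape [num_experts, ...] into
    per-expert entries keyed by an expert-indexed HF-style parameter name."""
    num_experts = grouped.shape[0]
    shards = {}
    for e in range(num_experts):
        # e.g. "model.layers.0.mlp.experts.3.gate_proj.lora_A.weight"
        hf_name = name_template.format(expert=e)
        shards[hf_name] = grouped[e].contiguous()
    return shards


# Example: 4 experts, LoRA rank 8, input dim 16 (toy sizes).
template = "model.layers.0.mlp.experts.{expert}.gate_proj.lora_A.weight"
grouped = torch.randn(4, 8, 16)
shards = split_grouped_expert_lora(template, grouped)
```

The real conversion additionally has to handle both grouped and per-expert Megatron layouts and the lora_B counterpart; this sketch only shows the expert-indexing idea.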

With these changes, AReaL can now:

  • train LoRA adapters on MoE models with Megatron
  • synchronize LoRA updates from Megatron to vLLM rollout workers
  • run LoRA training across multiple nodes

Validated manually on Qwen3-30B-A3B with:

  • single-node 8 x 80 GB GPUs [still in progress] (training curve screenshot: train_sn)
  • multi-node 3 x 8 x 80 GB GPUs

Observed eval reward improvement in the tested setup from about 0.23 to about 0.90.

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Key implementation changes:

  • add qwen3_moe_lora conversion support in Megatron conversion registry
  • convert grouped and per-expert Qwen3 MoE LoRA weights into expert-indexed HF/vLLM adapter names
  • include MoE LoRA expert tensors in Megatron distributed weight collection when LoRA is enabled
  • update vLLM LoRA XCCL loading path to materialize received adapter tensors on CPU before rebuilding the adapter
  • add a Megatron LoRA MoE GSM8K example config for Qwen3-30B Base model
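The CPU materialization step from the vLLM XCCL loading bullet above might look roughly like this. A minimal sketch under stated assumptions: `materialize_adapter_on_cpu` and the tensor names are hypothetical, not vLLM's actual worker-extension API.

```python
# Hedged sketch: copy adapter tensors received over XCCL onto CPU before the
# adapter object is rebuilt, so the rebuild does not pin extra GPU memory
# during a distributed weight update. Function name is illustrative only.
import torch


def materialize_adapter_on_cpu(
    received: dict[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
    """Detach and move received adapter tensors to CPU memory."""
    return {
        name: t.detach().to("cpu").contiguous() for name, t in received.items()
    }


# Example: a tiny fake adapter state as it might arrive from the trainer side.
received = {"lora_A": torch.randn(8, 16), "lora_B": torch.randn(16, 8)}
cpu_state = materialize_adapter_on_cpu(received)
```

Rebuilding from a CPU-resident state dict trades a host-device copy for lower peak GPU memory, which matters most in large MoE configurations where many expert adapters arrive at once.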

Main files:

  • areal/engine/megatron_engine.py
  • areal/engine/megatron_utils/megatron.py
  • areal/engine/megatron_utils/megatron_lora.py
  • areal/engine/vllm_ext/vllm_worker_extension.py
  • examples/math/gsm8k_grpo_megatron_lora_moe.yaml

Docs updated:

  • docs/en/reference/lora.md
  • docs/zh/reference/lora.md


@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for LoRA fine-tuning on MoE architectures, specifically Qwen3 MoE, within the Megatron engine. It includes the necessary conversion functions, registry updates, and configuration examples. However, the implementation of expert adapter collection in megatron_engine.py is flawed, as it risks incomplete weight updates in multi-node expert parallel settings; it should be moved to the expert-specific update loop. Additionally, the warning logic for missing tensors in the pipeline parallel head was incorrectly restricted to LoRA-only scenarios and should be restored to cover all cases. Finally, while setting cpu=False for adapter exports resolves hangs in certain Slurm environments, it may increase the risk of OOM errors in large-scale MoE configurations.

Two comment threads on areal/engine/megatron_engine.py
@gursimar force-pushed the lora-megatron-moe branch 2 times, most recently from d2f252c to a1f16c2 on April 14, 2026 at 18:36
@gursimar force-pushed the lora-megatron-moe branch from a1f16c2 to 1fbaf0b on April 15, 2026 at 17:39