Fix moe amax remedy for dsr1 and remove global barrier in quantization megatron plugins #808
base: main
Conversation
Force-pushed from 9201fe8 to 9d65635
chg: remove global barrier in SequentialMLP
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Force-pushed from 9d65635 to 5bc77c5
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #808   +/-   ##
=======================================
  Coverage   74.18%   74.18%
=======================================
  Files         192      192
  Lines       19236    19236
=======================================
  Hits        14271    14271
  Misses       4965     4965
```

View full report in Codecov by Sentry.
Thanks for catching this bug. We should not need the collective comm barrier.
```diff
     share the same amax.
     """
-    torch.distributed.barrier()
+    # torch.distributed.barrier()
```
```
# torch.distributed.barrier()
```
Done.
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
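The diff above comments out the global barrier. As the overview below notes, any synchronization in this path should instead use a collective on the module-local process group (pg_collection) rather than the global world. A minimal sketch under that assumption — sync_expert_amax, the amax tensor, and the expert_group handle are illustrative names, not the plugin's actual API:

```python
# Illustrative only: shows the sub-group pattern, not the real plugin code.
import torch
import torch.distributed as dist

def sync_expert_amax(amax: torch.Tensor, expert_group) -> torch.Tensor:
    """Share an amax statistic across the ranks holding copies of the same
    expert by reducing over a module-local process group (e.g. one taken from
    the module's pg_collection) instead of synchronizing the whole world.

    A world-wide barrier here can deadlock under pipeline parallelism, because
    not every PP (or EP) rank reaches this code path during the pipeline
    forward pass.
    """
    if dist.is_initialized() and expert_group is not None:
        # MAX-reduce only over the sub communication group; ranks outside the
        # group never participate, so no global synchronization is required.
        dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=expert_group)
    return amax
```

Restricting the reduction to the sub-group means ranks that never enter this code path (for example, other PP stages) are not required to participate, which removes the deadlock.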
What does this PR do?
Type of change: Bug fix

Overview:
This PR fixes two bugs that impact DeepSeek calibration as well as pipeline-parallel (PP) forward of MoE models.

1. The WAR in MoELayer that changes topk to num_experts only works when group-topk (a.k.a. group routing) is not used. Changing only topk leads to an out-of-range error, because topk can never equal num_experts when group_topk != None. Currently only DeepSeek-V3 uses group_topk, and DeepSeek-V3 has no difficulty calibrating all experts, so we disable the WAR when group_topk is detected (see the sketch after this list).
2. A previous PR inserted a global barrier in quantization.plugin.megatron (6ef9954#diff-0fa2ba4ecc36c5ff031be9f9a5af080e7aa3afa331c438f02f501b9432ec6d6aL228-R515). This leads to deadlock under PP, because PP ranks can never synchronize during the pipeline forward; for MoE it can be even worse if the barrier is reached by only some EP/PP ranks. Collective communication over the global world (a.k.a. the global comm) should be prohibited in the megatron plugin. Collectives on sub communication groups should also avoid megatron.core.parallel_state (a.k.a. mpu) going forward; instead, use the local pg_collection from each module. Any use of collective communication must be inspected carefully and tested with PP, TP, and EP.
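A minimal sketch of the gating described in item 1, assuming a Megatron-style router config with num_experts, topk, and group_topk fields; the class and function names below are illustrative, not the actual MoELayer code:

```python
# Hypothetical helper: mirrors the fields discussed above, not the plugin code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RouterConfig:
    num_experts: int
    topk: int
    group_topk: Optional[int] = None  # set when group routing is used (e.g. DeepSeek-V3)

def calibration_topk(cfg: RouterConfig) -> int:
    """Pick the topk to use while collecting amax statistics.

    The WAR expands topk to num_experts so every expert receives calibration
    data. Under group-limited routing the router only selects experts inside
    the chosen groups, so topk can never reach num_experts and the expanded
    value would index out of range; in that case keep the original topk.
    """
    if cfg.group_topk is not None:
        return cfg.topk        # group routing detected: disable the WAR
    return cfg.num_experts     # calibration WAR: route tokens to all experts
```

Since DeepSeek-V3 is currently the only group_topk user and already calibrates all experts without the WAR, falling back to the original topk is sufficient.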
Usage

# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
Additional Information