
Conversation

@ChenhanYu ChenhanYu commented Jan 22, 2026

What does this PR do?

Type of change: Bug fix

Overview:

This PR fixes two bugs that impact DeepSeek calibration and the pipeline-parallel (PP) forward pass of MoE models.

  1. The WAR (workaround) in MoELayer that changes topk to num_experts only works when group-topk (a.k.a. group routing) is not used. Changing only topk leads to an out-of-range error, because topk can never equal num_experts when group_topk != None. Currently only DeepSeek-V3 uses group_topk, and DeepSeek-V3 has no difficulty calibrating all experts, so we disable the WAR when group_topk is detected (see the sketch after this list).

  2. A previous PR inserted a global barrier in quantization.plugin.megatron (6ef9954#diff-0fa2ba4ecc36c5ff031be9f9a5af080e7aa3afa331c438f02f501b9432ec6d6aL228-R515). This leads to a deadlock when using PP, since PP ranks can never sync during the pipeline forward. For MoE this can be even worse if the barrier is only reached by some EP/PP ranks. Collective communication over the global world (a.k.a. the global comm) should be prohibited in the Megatron plugin. Collectives on sub communication groups should also avoid megatron.core.parallel_state (a.k.a. mpu) going forward; instead, use the local pg_collection from each module. Any use of collective communication must be inspected carefully and tested with PP, TP, and EP.
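
For illustration, a minimal sketch of the guarded workaround in item 1, assuming a Megatron-style MoE layer whose router exposes topk and num_experts and whose config carries moe_router_num_groups; the helper name and the is_calibrating flag are assumptions for this sketch, not the exact modelopt implementation:

from contextlib import contextmanager

@contextmanager
def route_to_all_experts(moe_layer, config, is_calibrating):
    # Temporarily route tokens to all experts during calibration so every
    # expert receives calibration data, but only when group routing
    # (moe_router_num_groups) is disabled: with group routing enabled,
    # forcing topk == num_experts would push the grouped top-k selection
    # out of range.
    router = moe_layer.router
    apply_war = is_calibrating and getattr(config, "moe_router_num_groups", None) is None
    original_topk = router.topk
    if apply_war:
        router.topk = router.num_experts
    try:
        yield
    finally:
        # Always restore the original top-k, even if the forward pass raises.
        router.topk = original_topk

A patched forward would then run its body inside "with route_to_all_experts(self, self.config, calibrating):" so the override never leaks outside calibration.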

Usage

# Add a code snippet demonstrating how to use this
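
An illustrative usage sketch, not taken from this PR: calibrating a Megatron MoE model through the standard modelopt PTQ entry point (mtq.quantize). The model, the calibration dataloader, and the chosen quantization config are placeholders, and the exact config name may differ in your setup:

import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run a few calibration batches through the model; with this fix, MoE
    # calibration also works under PP/EP and with DeepSeek-style group routing.
    for batch in calib_dataloader:  # placeholder dataloader
        model(**batch)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # placeholder model/config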

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@ChenhanYu ChenhanYu requested a review from a team as a code owner January 22, 2026 17:46
@ChenhanYu ChenhanYu requested a review from mxinO January 22, 2026 17:46
coderabbitai bot commented Jan 22, 2026

Important

Review skipped: auto incremental reviews are disabled on this repository. Check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


📝 Walkthrough


Removes barrier synchronization in SequentialMLP.sync_moe_local_experts_amax and adds conditional top-k routing adjustments during calibration mode. When calibration is active and config.moe_router_num_groups is None, temporarily sets router.topk to router.num_experts before forwarding.

Changes

Cohort / File(s): Megatron MoE Calibration Logic (modelopt/torch/quantization/plugins/megatron.py)
Summary: Removed the barrier synchronization in sync_moe_local_experts_amax. Added conditional top-k routing adjustments in SequentialMLP.forward and MoELayer.forward: when in calibration mode and config.moe_router_num_groups is None, router.topk is temporarily set to router.num_experts for the forward pass and then restored.
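
For context, a minimal sketch of the principle behind the barrier removal: synchronization should use a collective over the module's local sub-group (for example, a group taken from the module's pg_collection) rather than a global barrier, so pipeline stages that never reach this code are not blocked. The function and argument names below are illustrative assumptions, not the actual modelopt implementation:

import torch.distributed as dist

def sync_amax_across_group(amax_tensors, process_group):
    # Illustrative only: MAX-reduce each amax over a sub communication group.
    # The collective involves exactly the ranks in process_group; ranks on
    # other pipeline stages are never blocked, unlike a global
    # torch.distributed.barrier().
    for amax in amax_tensors:
        dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=process_group)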

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title directly addresses the main changes, fixing the MoE amax remedy for DSR1 and removing a global barrier in the Megatron quantization plugins, which aligns with the code modifications.
  • Description check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.


@ChenhanYu ChenhanYu force-pushed the chenhany/fix_moe_amax_remedy_for_dsr1 branch from 9201fe8 to 9d65635 Compare January 22, 2026 17:48
chg: remove global barrier in SequentialMLP
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu force-pushed the chenhany/fix_moe_amax_remedy_for_dsr1 branch from 9d65635 to 5bc77c5 Compare January 22, 2026 17:49
@ChenhanYu ChenhanYu changed the title from "Chenhany/fix moe amax remedy for dsr1" to "Fix moe amax remedy for dsr1 and remove global barrier in quantization megatron plugins" Jan 22, 2026
@ChenhanYu ChenhanYu requested review from jenchen13, kinjalpatel27 and realAsma and removed request for mxinO January 22, 2026 17:50
@codecov
Copy link

codecov bot commented Jan 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.18%. Comparing base (945ee02) to head (606cfba).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #808   +/-   ##
=======================================
  Coverage   74.18%   74.18%           
=======================================
  Files         192      192           
  Lines       19236    19236           
=======================================
  Hits        14271    14271           
  Misses       4965     4965           

☔ View full report in Codecov by Sentry.

@realAsma (Contributor) commented:

Thanks for catching this bug. We should not need the collective comm barrier.

Code context in modelopt/torch/quantization/plugins/megatron.py (the barrier was commented out rather than deleted):

     share the same amax.
     """
-    torch.distributed.barrier()
+    # torch.distributed.barrier()
Suggested change (delete the commented-out line):
-    # torch.distributed.barrier()

@ChenhanYu (Collaborator, Author) replied:
Done.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>