Fix moe amax remedy for dsr1 and remove global barrier in quantization megatron plugins #808
base: main
Conversation
Force-pushed from 9201fe8 to 9d65635
chg: remove global barrier in SequentialMLP
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Force-pushed from 9d65635 to 5bc77c5
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #808   +/-   ##
=======================================
  Coverage   74.18%   74.18%
=======================================
  Files         192      192
  Lines       19236    19236
=======================================
  Hits        14271    14271
  Misses       4965     4965
```

View full report in Codecov by Sentry.
Thanks for catching this bug. We should not need the collective comm barrier.
```diff
     share the same amax.
     """
-    torch.distributed.barrier()
+    # torch.distributed.barrier()
```
```
# torch.distributed.barrier()
```
Done.
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
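The diff above comments out the global barrier. As the overview below notes, any synchronization in this path should instead use a collective on the module-local process group (pg_collection) rather than the global world. A minimal sketch under that assumption — sync_expert_amax, the amax tensor, and the expert_group handle are illustrative names, not the plugin's actual API:

```python
# Illustrative only: shows the sub-group pattern, not the real plugin code.
import torch
import torch.distributed as dist

def sync_expert_amax(amax: torch.Tensor, expert_group) -> torch.Tensor:
    """Share an amax statistic across the ranks holding copies of the same
    expert by reducing over a module-local process group (e.g. one taken from
    the module's pg_collection) instead of synchronizing the whole world.

    A world-wide barrier here can deadlock under pipeline parallelism, because
    not every PP (or EP) rank reaches this code path during the pipeline
    forward pass.
    """
    if dist.is_initialized() and expert_group is not None:
        # MAX-reduce only over the sub communication group; ranks outside the
        # group never participate, so no global synchronization is required.
        dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=expert_group)
    return amax
```

Restricting the reduction to the sub-group means ranks that never enter this code path (for example, other PP stages) are not required to participate, which removes the deadlock.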
What does this PR do?
Type of change: Bug fix

Overview:
This PR fixes two bugs that impact DeepSeek calibration as well as pipeline-parallel (PP) forward of MoE models.

1. The WAR in MoELayer that changes topk to num_experts only works when group-topk (a.k.a. group routing) is not used. Changing only topk leads to an out-of-range error, because topk can never equal num_experts when group_topk != None. Currently only DeepSeek-V3 uses group_topk, and DeepSeek-V3 has no difficulty calibrating all experts, so we disable the WAR when group_topk is detected (see the sketch after this list).
2. A previous PR inserted a global barrier in quantization.plugin.megatron (6ef9954#diff-0fa2ba4ecc36c5ff031be9f9a5af080e7aa3afa331c438f02f501b9432ec6d6aL228-R515). This leads to deadlock under PP, because PP ranks can never synchronize during the pipeline forward; for MoE it can be even worse if the barrier is reached by only some EP/PP ranks. Collective communication over the global world (a.k.a. the global comm) should be prohibited in the megatron plugin. Collectives on sub communication groups should also avoid megatron.core.parallel_state (a.k.a. mpu) going forward; instead, use the local pg_collection from each module. Any use of collective communication must be inspected carefully and tested with PP, TP, and EP.
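A minimal sketch of the gating described in item 1, assuming a Megatron-style router config with num_experts, topk, and group_topk fields; the class and function names below are illustrative, not the actual MoELayer code:

```python
# Hypothetical helper: mirrors the fields discussed above, not the plugin code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RouterConfig:
    num_experts: int
    topk: int
    group_topk: Optional[int] = None  # set when group routing is used (e.g. DeepSeek-V3)

def calibration_topk(cfg: RouterConfig) -> int:
    """Pick the topk to use while collecting amax statistics.

    The WAR expands topk to num_experts so every expert receives calibration
    data. Under group-limited routing the router only selects experts inside
    the chosen groups, so topk can never reach num_experts and the expanded
    value would index out of range; in that case keep the original topk.
    """
    if cfg.group_topk is not None:
        return cfg.topk        # group routing detected: disable the WAR
    return cfg.num_experts     # calibration WAR: route tokens to all experts
```

Since DeepSeek-V3 is currently the only group_topk user and already calibrates all experts without the WAR, falling back to the original topk is sufficient.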
Usage

# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
Additional Information