Skip to content

Fix hardcoded PP size#221

Open
alizmhdi wants to merge 1 commit intoaliyun:masterfrom
alizmhdi:FIX/hardcoded-PP
Open

Fix hardcoded PP size#221
alizmhdi wants to merge 1 commit intoaliyun:masterfrom
alizmhdi:FIX/hardcoded-PP

Conversation

@alizmhdi
Copy link

@alizmhdi alizmhdi commented Feb 5, 2026

This change fixes a hardcoded PP size that caused incorrect DP group construction, which in turn produced wrong collectives and extra communication traffic. With the proper PP value, DP groups are formed per stage and collective operations reflect the intended topology.

Copilot AI review requested due to automatic review settings February 5, 2026 04:02
@CLAassistant
Copy link

CLAassistant commented Feb 5, 2026

CLA assistant check
All committers have signed the CLA.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where the Pipeline Parallelism (PP) size was hardcoded to 1, causing incorrect Data Parallelism (DP) group construction. The fix properly reads the PP size from the workload configuration, which ensures correct collective operations and eliminates unnecessary communication traffic.

Changes:

  • Replaced hardcoded PP_size = 1 with conditional logic that reads from workload->pipeline_model_parallelism
  • Applied the fix consistently to both mock_nccl_grobal_group_init and mock_nccl_comms_init functions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants