[megatron] support mlp_padding_free & sp; refactor TransformerLayer#62
Conversation
Code Review
This pull request introduces a CustomTransformerLayer to centralize transformer logic and replaces previous monkey-patching, while also refactoring model loaders and registration for increased flexibility. Review feedback identifies a potential TypeError in the CustomTransformerLayer constructor and suggests more robust attention_mask handling in the forward method to account for positional arguments. Additionally, improvements were recommended for layer numbering consistency in specific MLP modules and for the accuracy of warning logs.
```python
hidden_states, context = self._forward_attention(*args, **kwargs)
mlp_padding_free = self.config.mlp_padding_free and 'attention_mask' in kwargs
mask = None
enable_sp = self.config.sequence_parallel and self.config.tensor_model_parallel_size > 1
pad_size = 0
if mlp_padding_free and hidden_states.shape[1] > 1:
    if enable_sp:
        hidden_states = gather_from_sequence_parallel_region(hidden_states, tensor_parallel_output_grad=False)
    mask = ((~kwargs['attention_mask']).sum(dim=(1, 2)) > 0).t()
```
The forward method assumes attention_mask is always passed as a keyword argument. In Megatron-Core's TransformerBlock, layers are typically called with attention_mask as the second positional argument. This means kwargs.get('attention_mask') will be None, effectively disabling mlp_padding_free or causing a KeyError at line 252. Additionally, using the bitwise NOT operator ~ assumes a boolean mask; consider making this more robust for float masks.
Suggested change:

```diff
 hidden_states, context = self._forward_attention(*args, **kwargs)
-mlp_padding_free = self.config.mlp_padding_free and 'attention_mask' in kwargs
+attention_mask = kwargs.get('attention_mask', args[1] if len(args) > 1 else None)
+mlp_padding_free = self.config.mlp_padding_free and attention_mask is not None
 mask = None
 enable_sp = self.config.sequence_parallel and self.config.tensor_model_parallel_size > 1
 pad_size = 0
 if mlp_padding_free and hidden_states.shape[1] > 1:
     if enable_sp:
         hidden_states = gather_from_sequence_parallel_region(hidden_states, tensor_parallel_output_grad=False)
-    mask = ((~kwargs['attention_mask']).sum(dim=(1, 2)) > 0).t()
+    mask = ((~attention_mask).sum(dim=(1, 2)) > 0).t()
```
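For reference, here is the same positional/keyword fallback factored into a small helper, as a minimal sketch. The helper name `resolve_attention_mask` is hypothetical, and the argument order it assumes (hidden_states first, attention_mask second) mirrors the suggestion above and would need to match the actual `forward` signature of the layer.

```python
from typing import Optional

import torch


def resolve_attention_mask(args: tuple, kwargs: dict) -> Optional[torch.Tensor]:
    """Return attention_mask whether it was passed positionally or by keyword.

    Assumes the layer is invoked as forward(hidden_states, attention_mask, ...),
    so the mask is the second positional argument when no keyword is given.
    """
    if 'attention_mask' in kwargs:
        return kwargs['attention_mask']
    if len(args) > 1:
        return args[1]
    return None


# usage inside CustomTransformerLayer.forward(*args, **kwargs):
# attention_mask = resolve_attention_mask(args, kwargs)
# mlp_padding_free = self.config.mlp_padding_free and attention_mask is not None
```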
/gemini review
Code Review
This pull request introduces a CustomTransformerLayer to replace the previous monkey-patching of Megatron-Core's TransformerLayer, providing a more structured way to handle custom forward logic such as padding-free MLPs. It updates the model loaders for GLM4 and MinimaxM2, refactors the model registration process to inject the custom layer, and moves utility functions such as patch_deepcopy to a central location. However, the review identified several high-severity issues in the new CustomTransformerLayer: the constructor bypasses TransformerLayer's initialization in a fragile way, the sequence-parallel path of the padding-free logic risks OOM by gathering full activations on all ranks, and the attention-mask handling assumes a boolean tensor, which may cause runtime errors with float masks. In addition, hardcoding the gradient execution handler may bypass performance optimizations.
```python
if mlp_padding_free and hidden_states.shape[1] > 1:
    if enable_sp:
        hidden_states = gather_from_sequence_parallel_region(hidden_states, tensor_parallel_output_grad=False)
    mask = ((~kwargs['attention_mask']).sum(dim=(1, 2)) > 0).t()
```
The expression (~kwargs['attention_mask']) assumes that attention_mask is a boolean tensor. In many Megatron and HuggingFace configurations, attention_mask is provided as a float tensor (e.g., 0.0 for valid tokens and a large negative value for masked ones). Applying the bitwise NOT operator ~ to a float tensor will raise a TypeError. You should ensure the mask is boolean or use a comparison (e.g., kwargs['attention_mask'] == 0) to identify valid tokens.
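A minimal sketch of one way to normalize the mask before the reduction. The helper name `to_bool_padding_mask` is hypothetical, and the mask semantics assumed here (True / non-zero additive values mean "masked", following the convention described above) would need to be verified against the tensors this layer actually receives.

```python
import torch


def to_bool_padding_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Coerce an attention mask to boolean 'is masked' semantics.

    - bool masks are returned unchanged (True == masked)
    - float additive masks (0.0 for valid tokens, large negative for masked)
      are converted by comparing against zero
    """
    if attention_mask.dtype == torch.bool:
        return attention_mask
    # any non-zero entry in an additive float mask marks a masked position
    return attention_mask != 0


# usage inside the padding-free branch:
# bool_mask = to_bool_padding_mask(attention_mask)
# mask = ((~bool_mask).sum(dim=(1, 2)) > 0).t()
```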
```python
# TORCH_MINOR = int(torch.__version__.split('.')[1])
# use_nvfuser = TORCH_MAJOR > 1 or (TORCH_MAJOR == 1 and TORCH_MINOR >= 10)
# self.bias_dropout_add_exec_handler = nullcontext if use_nvfuser else torch.enable_grad
self.bias_dropout_add_exec_handler = torch.enable_grad
```
self.bias_dropout_add_exec_handler is hardcoded to torch.enable_grad. In the original Megatron-Core implementation, this is typically conditional on the availability of nvfuser (using nullcontext if available). Hardcoding it may bypass performance optimizations or lead to unnecessary gradient tracking in certain fusion scenarios.
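For comparison, a sketch of the conditional selection that the commented-out lines above appear to come from. The helper name `select_bias_dropout_add_exec_handler` is hypothetical; the torch 1.10 threshold is taken from the commented-out check itself.

```python
from contextlib import nullcontext

import torch


def select_bias_dropout_add_exec_handler():
    """Pick the execution context for the fused bias-dropout-add.

    With a sufficiently recent torch (nvfuser available, >= 1.10 per the
    original check) a plain nullcontext suffices; otherwise fall back to
    torch.enable_grad so the jit-fused kernel still records gradients.
    """
    torch_major = int(torch.__version__.split('.')[0])
    torch_minor = int(torch.__version__.split('.')[1])
    use_nvfuser = torch_major > 1 or (torch_major == 1 and torch_minor >= 10)
    return nullcontext if use_nvfuser else torch.enable_grad


# in CustomTransformerLayer.__init__:
# self.bias_dropout_add_exec_handler = select_bias_dropout_add_exec_handler()
```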
No description provided.