[megatron] support mlp_padding_free & sp; refactor TransformerLayer#62
Conversation
Code Review
This pull request introduces a CustomTransformerLayer to centralize transformer logic and replaces previous monkey-patching, while also refactoring model loaders and registration for increased flexibility. Review feedback identifies a potential TypeError in the CustomTransformerLayer constructor and suggests more robust attention_mask handling in the forward method to account for positional arguments. Additionally, improvements were recommended for layer numbering consistency in specific MLP modules and for the accuracy of warning logs.
```python
hidden_states, context = self._forward_attention(*args, **kwargs)
mlp_padding_free = self.config.mlp_padding_free and 'attention_mask' in kwargs
mask = None
enable_sp = self.config.sequence_parallel and self.config.tensor_model_parallel_size > 1
pad_size = 0
if mlp_padding_free and hidden_states.shape[1] > 1:
    if enable_sp:
        hidden_states = gather_from_sequence_parallel_region(hidden_states, tensor_parallel_output_grad=False)
    mask = ((~kwargs['attention_mask']).sum(dim=(1, 2)) > 0).t()
```
The forward method assumes attention_mask is always passed as a keyword argument. In Megatron-Core's TransformerBlock, layers are typically called with attention_mask as the second positional argument. This means kwargs.get('attention_mask') will be None, effectively disabling mlp_padding_free or causing a KeyError at line 252. Additionally, using the bitwise NOT operator ~ assumes a boolean mask; consider making this more robust for float masks.
Suggested change:

```diff
 hidden_states, context = self._forward_attention(*args, **kwargs)
-mlp_padding_free = self.config.mlp_padding_free and 'attention_mask' in kwargs
+attention_mask = kwargs.get('attention_mask', args[1] if len(args) > 1 else None)
+mlp_padding_free = self.config.mlp_padding_free and attention_mask is not None
 mask = None
 enable_sp = self.config.sequence_parallel and self.config.tensor_model_parallel_size > 1
 pad_size = 0
 if mlp_padding_free and hidden_states.shape[1] > 1:
     if enable_sp:
         hidden_states = gather_from_sequence_parallel_region(hidden_states, tensor_parallel_output_grad=False)
-    mask = ((~kwargs['attention_mask']).sum(dim=(1, 2)) > 0).t()
+    mask = ((~attention_mask).sum(dim=(1, 2)) > 0).t()
```
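For reference, here is the same positional/keyword fallback factored into a small helper, as a minimal sketch. The helper name `resolve_attention_mask` is hypothetical, and the argument order it assumes (hidden_states first, attention_mask second) mirrors the suggestion above and would need to match the actual `forward` signature of the layer.

```python
from typing import Optional

import torch


def resolve_attention_mask(args: tuple, kwargs: dict) -> Optional[torch.Tensor]:
    """Return attention_mask whether it was passed positionally or by keyword.

    Assumes the layer is invoked as forward(hidden_states, attention_mask, ...),
    so the mask is the second positional argument when no keyword is given.
    """
    if 'attention_mask' in kwargs:
        return kwargs['attention_mask']
    if len(args) > 1:
        return args[1]
    return None


# usage inside CustomTransformerLayer.forward(*args, **kwargs):
# attention_mask = resolve_attention_mask(args, kwargs)
# mlp_padding_free = self.config.mlp_padding_free and attention_mask is not None
```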
/gemini review
Code Review
This pull request introduces a CustomTransformerLayer to replace the previous monkey-patching of Megatron-Core's TransformerLayer, providing a more structured way to handle custom forward logic such as padding-free MLPs. It updates the model loaders for GLM4 and MinimaxM2, refactors the model registration process to inject the custom layer, and moves utility functions such as patch_deepcopy to a central location. However, the review identified several high-severity issues in the new CustomTransformerLayer: the constructor bypasses TransformerLayer's initialization in a fragile way, the sequence-parallel path of the padding-free logic risks OOM by gathering full activations on all ranks, and the attention-mask handling assumes a boolean tensor, which may cause runtime errors with float masks. In addition, hardcoding the gradient execution handler may bypass performance optimizations.
```python
if mlp_padding_free and hidden_states.shape[1] > 1:
    if enable_sp:
        hidden_states = gather_from_sequence_parallel_region(hidden_states, tensor_parallel_output_grad=False)
    mask = ((~kwargs['attention_mask']).sum(dim=(1, 2)) > 0).t()
```
The expression (~kwargs['attention_mask']) assumes that attention_mask is a boolean tensor. In many Megatron and HuggingFace configurations, attention_mask is provided as a float tensor (e.g., 0.0 for valid tokens and a large negative value for masked ones). Applying the bitwise NOT operator ~ to a float tensor will raise a TypeError. You should ensure the mask is boolean or use a comparison (e.g., kwargs['attention_mask'] == 0) to identify valid tokens.
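A minimal sketch of one way to normalize the mask before the reduction. The helper name `to_bool_padding_mask` is hypothetical, and the mask semantics assumed here (True / non-zero additive values mean "masked", following the convention described above) would need to be verified against the tensors this layer actually receives.

```python
import torch


def to_bool_padding_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Coerce an attention mask to boolean 'is masked' semantics.

    - bool masks are returned unchanged (True == masked)
    - float additive masks (0.0 for valid tokens, large negative for masked)
      are converted by comparing against zero
    """
    if attention_mask.dtype == torch.bool:
        return attention_mask
    # any non-zero entry in an additive float mask marks a masked position
    return attention_mask != 0


# usage inside the padding-free branch:
# bool_mask = to_bool_padding_mask(attention_mask)
# mask = ((~bool_mask).sum(dim=(1, 2)) > 0).t()
```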
```python
# TORCH_MINOR = int(torch.__version__.split('.')[1])
# use_nvfuser = TORCH_MAJOR > 1 or (TORCH_MAJOR == 1 and TORCH_MINOR >= 10)
# self.bias_dropout_add_exec_handler = nullcontext if use_nvfuser else torch.enable_grad
self.bias_dropout_add_exec_handler = torch.enable_grad
```
self.bias_dropout_add_exec_handler is hardcoded to torch.enable_grad. In the original Megatron-Core implementation, this is typically conditional on the availability of nvfuser (using nullcontext if available). Hardcoding it may bypass performance optimizations or lead to unnecessary gradient tracking in certain fusion scenarios.
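For comparison, a sketch of the conditional selection that the commented-out lines above appear to come from. The helper name `select_bias_dropout_add_exec_handler` is hypothetical; the torch 1.10 threshold is taken from the commented-out check itself.

```python
from contextlib import nullcontext

import torch


def select_bias_dropout_add_exec_handler():
    """Pick the execution context for the fused bias-dropout-add.

    With a sufficiently recent torch (nvfuser available, >= 1.10 per the
    original check) a plain nullcontext suffices; otherwise fall back to
    torch.enable_grad so the jit-fused kernel still records gradients.
    """
    torch_major = int(torch.__version__.split('.')[0])
    torch_minor = int(torch.__version__.split('.')[1])
    use_nvfuser = torch_major > 1 or (torch_major == 1 and torch_minor >= 10)
    return nullcontext if use_nvfuser else torch.enable_grad


# in CustomTransformerLayer.__init__:
# self.bias_dropout_add_exec_handler = select_bias_dropout_add_exec_handler()
```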
No description provided.