[gemma4] feat: add Gemma-4 31B dense model support#8
Open
Zhichenzzz wants to merge 1 commit into
Port of upstream NVIDIA-NeMo#3885 (limited Gemma4 dense support) onto the radixark fork, adapted to this fork's norm fusion.

- gemma4_provider.py: add attention_k_eq_v field; gate _install_tied_kv on attention_k_eq_v (covers MoE and dense; 26B and 31B both set it True, so no regression to the 26B MoE path).
- gemma4_vl_bridge.py: unblock the dense path for models without per-layer embeddings (hidden_size_per_layer_input==0, e.g. 31B); branch provider config on enable_moe_block; add dense GatedMLP weight mappings (inert on MoE). Dense mlp.linear_fc1.layer_norm_weight maps to HF pre_feedforward_layernorm (linear_proj already carries post_attention_layernorm).

Validated e2e: 31B-it RL train-inference lpdiff ~0.007, stable through weight updates; 26B MoE not regressed.
Stacked on #7 (Gemma-4 MoE HF-faithfulness fixes).
Ports upstream NVIDIA-NeMo/Megatron-Bridge#3885 ("limited Gemma4 dense support") onto this fork, adapted to our fork's norm fusion, adding Gemma-4 31B-it (dense, non-MoE) RL support.
Changes
gemma4_provider.py
- Add an attention_k_eq_v: bool = False field.
- _install_tied_kv: gate on attention_k_eq_v (covers both MoE and dense) instead of num_moe_experts is None. Both 26B-A4B-it and 31B-it set attention_k_eq_v=True, so this is a no-op for the 26B MoE path.

gemma4_vl_bridge.py
- Unblock the dense path for models without per-layer embeddings (PLE), i.e. hidden_size_per_layer_input == 0 (e.g. 31B-it). Dense models with PLE still error since MCore lacks PLE support.
- Branch the provider config on enable_moe_block: the dense path sets num_moe_experts=None and ffn_hidden_size=intermediate_size. MoE path unchanged.
- Set provider.attention_k_eq_v = text_config.attention_k_eq_v.
- Add dense GatedMLP weight mappings (inert on MoE):
  - mlp.linear_fc1.weight via GatedMLPMapping(gate_proj, up_proj)
  - mlp.linear_fc2.weight ← down_proj
  - mlp.linear_fc1.layer_norm_weight ← pre_feedforward_layernorm.weight. This differs from upstream NVIDIA-NeMo/Megatron-Bridge#3885 ("[model] feat: Add limited Gemma4 dense model support"), which maps it to post_attention_layernorm. Our fork fuses post_attention_layernorm into linear_proj.post_layernorm (TERowParallelLinearLayerNorm), so the MLP's fused fc1 norm corresponds to HF's MLP input norm (pre_feedforward_layernorm), not post_attention_layernorm. Wrong norm → high lpdiff; the observed lpdiff of ~0.007 is consistent with this mapping.

Validation
31B-it dense (e2e RL on dapo-math-17k, 8× H200, TP=4, no expert parallelism, n=8 / response 768 for reward variance):
- train_rollout_logprob_abs_diff ≈ 0.007 (range 0.004–0.008, well under the 0.02 target).
- Stable through weight updates: across weight_version 1 → 2 → 3, lpdiff stays in the 0.004–0.008 band (no blowup).

26B-A4B-it MoE regression check on this branch: lpdiff 0.0086, matching the validated baseline from #7. No regression.
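For readers unfamiliar with the lpdiff numbers above, here is a minimal sketch of how such a metric could be computed. It assumes that train_rollout_logprob_abs_diff is the masked mean absolute difference between the per-token log-probs reported by the training engine and by the rollout (inference) engine for the same sampled tokens. The PR does not spell out the exact definition, so the function names, the masking convention, and the strict comparison against the 0.02 target are illustrative assumptions, not the framework's implementation.

```python
from typing import Optional, Sequence

# Target quoted in the validation notes above.
LPDIFF_TARGET = 0.02


def logprob_abs_diff(
    train_logprobs: Sequence[float],
    rollout_logprobs: Sequence[float],
    mask: Optional[Sequence[int]] = None,
) -> float:
    """Masked mean of |train - rollout| over per-token log-probs.

    Hypothetical helper: the real metric may aggregate differently
    (e.g. per sequence first, then across the batch).
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if mask is None:
        mask = [1] * len(train_logprobs)
    if len(mask) != len(train_logprobs):
        raise ValueError("mask must have the same length as the log-probs")

    total = 0.0
    count = 0
    for train_lp, rollout_lp, keep in zip(train_logprobs, rollout_logprobs, mask):
        if keep:  # skip prompt/padding positions
            total += abs(train_lp - rollout_lp)
            count += 1
    if count == 0:
        raise ValueError("mask selects no tokens")
    return total / count


def within_target(diff: float, target: float = LPDIFF_TARGET) -> bool:
    """True when the train/rollout mismatch is below the target."""
    return diff < target
```

A systematic weight-mapping error (for example, loading the wrong norm into a fused layer) would typically shift every token's log-prob and push this value well above the target, which is why it is a useful end-to-end check for a bridge change like this one.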
Notes
bridge, this PR's base will retarget to bridge and the visible diff narrows to the dense delta.
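As a compact summary of the Changes section, the following sketch restates the dense-path provider branch and the dense MLP name mapping as plain Python. It is not the code in gemma4_provider.py or gemma4_vl_bridge.py: TextConfigSketch, provider_overrides, and dense_mlp_mapping are hypothetical names, the mapping is shown as a simple dict rather than the bridge's mapping classes, and the ".weight" suffix on the upstream norm name is assumed.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TextConfigSketch:
    """Hypothetical stand-in for the HF text-config fields named in this PR."""
    enable_moe_block: bool
    attention_k_eq_v: bool
    intermediate_size: int
    hidden_size_per_layer_input: int = 0  # 0 means no per-layer embeddings (PLE)


def provider_overrides(cfg: TextConfigSketch) -> Dict[str, Optional[object]]:
    """Provider fields this PR sets; MoE-specific settings are omitted."""
    is_dense = not cfg.enable_moe_block
    if is_dense and cfg.hidden_size_per_layer_input != 0:
        # Dense models with PLE still error, since MCore lacks PLE support.
        raise NotImplementedError("dense models with PLE are not supported")

    # Tied K/V is gated on attention_k_eq_v for both MoE and dense.
    overrides: Dict[str, Optional[object]] = {
        "attention_k_eq_v": cfg.attention_k_eq_v,
    }
    if is_dense:
        overrides["num_moe_experts"] = None
        overrides["ffn_hidden_size"] = cfg.intermediate_size
    return overrides


def dense_mlp_mapping(
    post_attn_norm_fused_into_linear_proj: bool = True,
) -> Dict[str, Tuple[str, ...]]:
    """Megatron dense-MLP parameter -> HF parameter name(s).

    With post_attention_layernorm fused into linear_proj (this fork), the
    fused fc1 norm is the HF MLP input norm. Without that fusion (upstream),
    the PR states it maps to post_attention_layernorm instead.
    """
    if post_attn_norm_fused_into_linear_proj:
        fc1_norm = "pre_feedforward_layernorm.weight"
    else:
        fc1_norm = "post_attention_layernorm.weight"  # suffix assumed
    return {
        "mlp.linear_fc1.weight": ("gate_proj", "up_proj"),  # gated MLP pair
        "mlp.linear_fc2.weight": ("down_proj",),
        "mlp.linear_fc1.layer_norm_weight": (fc1_norm,),
    }
```

For example, provider_overrides(TextConfigSketch(enable_moe_block=False, attention_k_eq_v=True, intermediate_size=1024)) yields the three dense-path fields, where 1024 is a placeholder rather than the real 31B-it intermediate size.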