fix: lazily import megatron-core in model_utils#2648
Open
RayenTian wants to merge 3 commits into
Open
Conversation
model_utils.py is on the GRPO driver's import path, so its top-level megatron-core import (added in #2036/#2078) forced the driver env to include the optional "mcore" extra just to import the module. Move the megatron-core imports into the two linear-CE-fusion functions that use them and guard the GPTModel annotation with TYPE_CHECKING, so the module imports without mcore. megatron is imported only when the GPTModel forward patch runs (megatron paths that already have mcore). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ruit <ruit@nvidia.com>
Contributor
Author
|
/ok to test db80063 |
…failure Tests that build a bare in-process vllm.LLM (rather than going through a Ray actor) crash with "Cannot re-initialize CUDA in forked subprocess" when CUDA is already initialized in the parent pytest process and vLLM forks its EngineCore. This surfaces when such a test runs first/alone in a shard (e.g. under FAST mode, where most other vLLM tests are deselected). Add an autouse fixture that sets VLLM_WORKER_MULTIPROC_METHOD=spawn for any vllm-marked test and restores the previous value afterward, making these tests robust to ordering. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ruit <ruit@nvidia.com>
Contributor
Author
|
/ok to test 686d3db |
Signed-off-by: ruit <ruit@nvidia.com>
Contributor
Author
|
/ok to test 7a46396 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do ?
Move the top-level megatron-core imports in nemo_rl/distributed/model_utils.py into the two functions that actually use them, so the module can be imported without the optional mcore extra installed.
Why
nemo_rl/distributed/model_utils.pyis on the GRPO driver's import path:examples/nemo_gym/run_grpo_nemo_gym.py
→ nemo_rl.algorithms.grpo
→ nemo_rl.algorithms.loss.loss_functions (imports model_utils since #2078)
→ nemo_rl.distributed.model_utils (top-level
import megatron.coresince #2036)megatron-core is an optional dependency (the mcore extra), and it is not part of the driver's base environment (UV_PROJECT_ENVIRONMENT). Since #2036 added an unconditional, module-level from megatron.core... import GPTModel and #2078 wired model_utils onto the driver import chain, simply importing grpo now eagerly imports megatron-core — for every backend, including FSDP/non-megatron runs. When the driver env lacks the mcore extra this fails at startup with:
The only consumers of the megatron symbols are the two linear-CE-fusion functions (patch_gpt_model_forward_for_linear_ce_fusion and _gpt_forward_with_linear_ce_fusion), which are a Megatron-GPTModel monkeypatch and only ever run on Megatron training paths (where mcore is installed by definition). Nothing else in the module — including the widely imported DistributedCrossEntropy — touches megatron.
What changed
No runtime behavior change: megatron-core is now imported only when the GPTModel forward patch is invoked, i.e. exactly the paths that already have mcore.