Skip to content

fix: lazily import megatron-core in model_utils#2648

Open
RayenTian wants to merge 3 commits into
mainfrom
ruit/fix_mcore_import
Open

fix: lazily import megatron-core in model_utils#2648
RayenTian wants to merge 3 commits into
mainfrom
ruit/fix_mcore_import

Conversation

@RayenTian
Copy link
Copy Markdown
Contributor

What does this PR do ?

Move the top-level megatron-core imports in nemo_rl/distributed/model_utils.py into the two functions that actually use them, so the module can be imported without the optional mcore extra installed.

Why

nemo_rl/distributed/model_utils.py is on the GRPO driver's import path:

examples/nemo_gym/run_grpo_nemo_gym.py
→ nemo_rl.algorithms.grpo
→ nemo_rl.algorithms.loss.loss_functions (imports model_utils since #2078)
→ nemo_rl.distributed.model_utils (top-level import megatron.core since #2036)

megatron-core is an optional dependency (the mcore extra), and it is not part of the driver's base environment (UV_PROJECT_ENVIRONMENT). Since #2036 added an unconditional, module-level from megatron.core... import GPTModel and #2078 wired model_utils onto the driver import chain, simply importing grpo now eagerly imports megatron-core — for every backend, including FSDP/non-megatron runs. When the driver env lacks the mcore extra this fails at startup with:

  ModuleNotFoundError: No module named 'megatron'

The only consumers of the megatron symbols are the two linear-CE-fusion functions (patch_gpt_model_forward_for_linear_ce_fusion and _gpt_forward_with_linear_ce_fusion), which are a Megatron-GPTModel monkeypatch and only ever run on Megatron training paths (where mcore is installed by definition). Nothing else in the module — including the widely imported DistributedCrossEntropy — touches megatron.

What changed

  • Removed the three top-level from megatron.core... imports.
  • Import them lazily inside the two functions that use them.
  • Guarded the GPTModel type annotation with TYPE_CHECKING (and made the parameter annotation a forward-reference string, since this module has no from future import annotations).

No runtime behavior change: megatron-core is now imported only when the GPTModel forward patch is invoked, i.e. exactly the paths that already have mcore.

model_utils.py is on the GRPO driver's import path, so its top-level
megatron-core import (added in #2036/#2078) forced the driver env to
include the optional "mcore" extra just to import the module.

Move the megatron-core imports into the two linear-CE-fusion functions
that use them and guard the GPTModel annotation with TYPE_CHECKING, so
the module imports without mcore. megatron is imported only when the
GPTModel forward patch runs (megatron paths that already have mcore).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian requested a review from a team as a code owner May 31, 2026 05:56
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 31, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@RayenTian RayenTian added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 31, 2026
@RayenTian
Copy link
Copy Markdown
Contributor Author

/ok to test db80063

…failure

Tests that build a bare in-process vllm.LLM (rather than going through a Ray
actor) crash with "Cannot re-initialize CUDA in forked subprocess" when CUDA is
already initialized in the parent pytest process and vLLM forks its EngineCore.
This surfaces when such a test runs first/alone in a shard (e.g. under FAST mode,
where most other vLLM tests are deselected).

Add an autouse fixture that sets VLLM_WORKER_MULTIPROC_METHOD=spawn for any
vllm-marked test and restores the previous value afterward, making these tests
robust to ordering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian requested a review from a team as a code owner June 1, 2026 02:48
@RayenTian
Copy link
Copy Markdown
Contributor Author

/ok to test 686d3db

Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian
Copy link
Copy Markdown
Contributor Author

/ok to test 7a46396

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant