fix: lazily import megatron-core in model_utils by RayenTian · Pull Request #2648 · NVIDIA-NeMo/RL

RayenTian · 2026-05-31T05:56:52Z

What does this PR do?

Move the top-level megatron-core imports in nemo_rl/distributed/model_utils.py into the two functions that actually use them, so the module can be imported without the optional mcore extra installed.

Why

nemo_rl/distributed/model_utils.py is on the GRPO driver's import path:

examples/nemo_gym/run_grpo_nemo_gym.py
→ nemo_rl.algorithms.grpo
→ nemo_rl.algorithms.loss.loss_functions (imports model_utils since #2078)
→ nemo_rl.distributed.model_utils (top-level import megatron.core since #2036)

megatron-core is an optional dependency (the mcore extra), and it is not part of the driver's base environment (UV_PROJECT_ENVIRONMENT). Since #2036 added an unconditional, module-level from megatron.core... import GPTModel and #2078 wired model_utils onto the driver import chain, simply importing grpo now eagerly imports megatron-core — for every backend, including FSDP/non-megatron runs. When the driver env lacks the mcore extra this fails at startup with:

  ModuleNotFoundError: No module named 'megatron'
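The failure mode can be reproduced in miniature without NeMo RL or megatron-core. The sketch below is only an illustration under stated assumptions: `demo_driver`, `demo_loss`, `demo_utils`, and `missing_optional_pkg_xyz` are hypothetical names that stand in for the driver, the loss module, model_utils, and the uninstalled optional dependency.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical three-module chain: driver -> loss -> utils.
    # Only the last module touches the (missing) optional package,
    # and it does so at module level.
    (root / "demo_utils.py").write_text("import missing_optional_pkg_xyz\n")
    (root / "demo_loss.py").write_text("import demo_utils\n")
    (root / "demo_driver.py").write_text("import demo_loss\n")

    sys.path.insert(0, tmp)
    try:
        importlib.import_module("demo_driver")
        error_name = None
    except ModuleNotFoundError as exc:
        # The error surfaces at the top of the chain even though the
        # driver itself never references the optional package.
        error_name = exc.name
    finally:
        # Clean up so the demo leaves the interpreter state untouched.
        sys.path.remove(tmp)
        for name in ("demo_driver", "demo_loss", "demo_utils"):
            sys.modules.pop(name, None)

print(error_name)
```

The point of the sketch is that a single module-level import deep in the chain is enough to make every importer above it fail.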

The only consumers of the megatron symbols are the two linear-CE-fusion functions (patch_gpt_model_forward_for_linear_ce_fusion and _gpt_forward_with_linear_ce_fusion), which are a Megatron-GPTModel monkeypatch and only ever run on Megatron training paths (where mcore is installed by definition). Nothing else in the module — including the widely imported DistributedCrossEntropy — touches megatron.

What changed

Removed the three top-level from megatron.core... imports.
Import them lazily inside the two functions that use them.
Guarded the GPTModel type annotation with TYPE_CHECKING (and made the parameter annotation a forward-reference string, since this module has no from __future__ import annotations).
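The pattern behind these three changes can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `heavy_optional_pkg`, `BigModel`, `helper`, `patch_forward`, and `always_available` are hypothetical stand-ins for megatron-core, GPTModel, and the functions in model_utils.py.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so the
    # optional package does not need to be installed to import this module.
    # `heavy_optional_pkg` and `BigModel` are hypothetical names.
    from heavy_optional_pkg import BigModel


def patch_forward(model: "BigModel") -> object:
    """Hypothetical function that genuinely needs the optional package.

    The annotation is a string (forward reference) because `BigModel`
    does not exist at runtime in this module's namespace.
    """
    # Lazy import: resolved only when the function is actually called.
    from heavy_optional_pkg import helper

    return helper(model)


def always_available(x: int) -> int:
    """Hypothetical function that must work without the optional package."""
    return x * 2
```

With this layout, importing the module and calling `always_available` works in an environment without the optional package, and only a call to `patch_forward` triggers the import.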

No runtime behavior change: megatron-core is now imported only when the GPTModel forward patch is invoked, i.e. exactly the paths that already have mcore.
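One way to check a claim like this is to inspect `sys.modules` before and after the lazily importing function runs. The sketch below does that on throwaway modules; `demo_lazy_utils`, `demo_heavy_backend`, and `use_backend` are hypothetical names, and the real check would target nemo_rl.distributed.model_utils and megatron instead.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical "heavy" dependency and a module that imports it lazily.
    (root / "demo_heavy_backend.py").write_text("VALUE = 7\n")
    (root / "demo_lazy_utils.py").write_text(
        "def use_backend():\n"
        "    import demo_heavy_backend\n"
        "    return demo_heavy_backend.VALUE\n"
    )

    sys.path.insert(0, tmp)
    try:
        utils = importlib.import_module("demo_lazy_utils")
        # Importing the utility module alone must not load the backend.
        loaded_after_import = "demo_heavy_backend" in sys.modules
        value = utils.use_backend()
        # The backend is loaded only once the function body has run.
        loaded_after_call = "demo_heavy_backend" in sys.modules
    finally:
        sys.path.remove(tmp)
        for name in ("demo_lazy_utils", "demo_heavy_backend"):
            sys.modules.pop(name, None)

print(loaded_after_import, value, loaded_after_call)
```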

model_utils.py is on the GRPO driver's import path, so its top-level megatron-core import (added in #2036/#2078) forced the driver env to include the optional "mcore" extra just to import the module.

Move the megatron-core imports into the two linear-CE-fusion functions that use them and guard the GPTModel annotation with TYPE_CHECKING, so the module imports without mcore. megatron is imported only when the GPTModel forward patch runs (megatron paths that already have mcore).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ruit <ruit@nvidia.com>

copy-pr-bot · 2026-05-31T05:56:55Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

RayenTian · 2026-05-31T05:57:20Z

/ok to test db80063

…failure

Tests that build a bare in-process vllm.LLM (rather than going through a Ray actor) crash with "Cannot re-initialize CUDA in forked subprocess" when CUDA is already initialized in the parent pytest process and vLLM forks its EngineCore. This surfaces when such a test runs first/alone in a shard (e.g. under FAST mode, where most other vLLM tests are deselected).

Add an autouse fixture that sets VLLM_WORKER_MULTIPROC_METHOD=spawn for any vllm-marked test and restores the previous value afterward, making these tests robust to ordering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ruit <ruit@nvidia.com>
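The set-and-restore step this commit message describes can be sketched with the standard library alone. This is an assumption-laden illustration, not the commit's code: `temporary_env` is a hypothetical helper, and the actual fixture is not shown in this PR page.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable and restore its prior state on exit.

    Hypothetical helper; the name and shape are assumptions.
    """
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable was unset before, so unset it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with temporary_env("VLLM_WORKER_MULTIPROC_METHOD", "spawn"):
    inside = os.environ["VLLM_WORKER_MULTIPROC_METHOD"]

print(inside)
```

In a pytest suite, an autouse fixture could wrap its `yield` in this context manager whenever `request.node.get_closest_marker("vllm")` is not `None`, and yield unchanged otherwise.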

RayenTian · 2026-06-01T02:48:20Z

/ok to test 686d3db

Signed-off-by: ruit <ruit@nvidia.com>

RayenTian · 2026-06-01T09:51:12Z

/ok to test 7a46396

RayenTian requested a review from a team as a code owner May 31, 2026 05:56

RayenTian added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 31, 2026

RayenTian requested a review from yuki-97 May 31, 2026 05:57

RayenTian requested a review from a team as a code owner June 1, 2026 02:48

remove unit test fix

7a46396

Signed-off-by: ruit <ruit@nvidia.com>

RayenTian wants to merge 3 commits into main from ruit/fix_mcore_import
