
fix: nemotron-super-v3-hellaswag checkpoint robustness #2056

Open

adil-a wants to merge 4 commits into r0.4.0 from fix/nemotron-super-v3-hellaswag-ckpt-robustness

Conversation

adil-a (Collaborator) commented Apr 26, 2026

Summary

Three independent failures hit the `nemotron_super_v3_hellaswag` checkpoint robustness CI test (first seen in, e.g., https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/304597767). Verified end-to-end on cw-dfw, 4 nodes / 32 GPUs, then on the eos CI runners.

  • **Phase 4 NCCL watchdog SIGABRT** — `dist_env.timeout_minutes: 1` (60 s) is way short of the rank-0 vanilla-HF reload of the 120 B model (~3 min). While other ranks wait at the post-Phase-4 `_barrier()`, the 60 s collective timeout fires. This is also what caused the earlier `print_trainable_parameters` SIGABRT — the per-parameter implicit `.norm()` collectives on DTensor params didn't have headroom under the tight timeout. Override `dist_env.timeout_minutes: 30` in `ci.checkpoint_robustness:` so only the robustness run sees the longer timeout; normal training keeps the tight default (see the sketch after this list).
  • **Phase 4 HF forward crash on Nemotron-H** — forcing `attn_implementation="flash_attention_2"` for `trust_remote_code` models routes Nemotron-H through HF's `flash_attention_forward → flash_attn_varlen_func` path, which trips a `vectorized_gather_kernel` index OOB inside the CUDA kernel. Detect `model_type=="nemotron_h"` from the consolidated config and fall back to `sdpa` (still flash-backed via PyTorch's optimized SDPA, just skips HF's varlen routing). Phase 4 KL = 2.16e-2, well under the 7e-2 threshold.
  • **End-of-test interpreter teardown hang** — even with `destroy_global_state` already unregistered, PyTorch's internal atexit handlers + DTensor/FSDP destructors + ProcessGroupNCCL watchdog shutdown still issue collectives on the default PG during interpreter teardown. With MoE+EP this races against ranks still in pytest's AST machinery (non-deterministic ~50% hang). `os._exit(0)` after the test bypasses every Python finalizer — the OS reclaims resources without ever entering NCCL.
  • **Side fix: `ci.time` 25 min → 45 min** — the CI launcher runs the finetune phase (~5 min) and the robustness phase (~21 min) under the same SLURM allocation, so a 25 min wall left no headroom and TIMEOUT-killed the robustness phase mid-Phase-4.
  • **Side fix: drop `ci.known_issue_id: AM-156`** — the marker was making `tests/ci_tests/utils/generate_ci_tests.py` skip the recipe entirely. It now generates a real CI job for the recipe (which is the underlying reason this PR exists).
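
For item 1, a minimal sketch of the mechanism at the PyTorch level, assuming the YAML key `dist_env.timeout_minutes` ultimately feeds the `timeout` argument of `torch.distributed.init_process_group` (the wrapper below is illustrative, not the repo's actual plumbing):

```python
from datetime import timedelta

import torch.distributed as dist

# Hypothetical plumbing: assume dist_env.timeout_minutes lands here as
# `timeout_minutes`. With the old value of 1, any rank blocked on a
# collective for more than 60 s (e.g. waiting at the post-Phase-4
# _barrier() while rank 0 spends ~3 min reloading the 120 B model) gets
# SIGABRTed by the NCCL watchdog; 30 minutes gives the reload headroom.
def init_dist(timeout_minutes: int = 30) -> None:
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=timeout_minutes),
    )
```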

Test plan

  • cw-dfw 4×8 H100 — direct `python` invocation, all 4 phases pass: Phase 3 KL = 0, Phase 4 KL = 2.16e-2, exit 0:0, 19 min wall (slurm 11336876)
  • cw-dfw 4×8 H100 — `pytest` invocation, exit 0:0, 19 min wall (slurm 11335402)
  • eos CI 4×8 H100 (this branch) — `nemotron_super_v3_hellaswag` SUCCESS, finetune 327 s + robustness 1259 s under the 45 min wall: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/pipelines/49532337 (the `vllm_deploy` variant in this run failed with a 404 on `automodel-deploy:pipe.49532337` — an unrelated container-registry infra issue, not from this PR)

Note: this branch also has a temporary commit (`ci: TEMP scope nightly_recipes`) that slims `tests/ci_tests/configs/llm_finetune/nightly_recipes.yml` to a single recipe so the focused pipeline above only schedules `nemotron_super_v3_hellaswag`. That commit will be reverted before merge.

🤖 Generated with Claude Code

Three independent failures in `tests/functional_tests/checkpoint_robustness/test_checkpoint_robustness_llm.py` for the `nemotron_super_v3_hellaswag` config (verified end-to-end on cw-dfw, 4 nodes / 32 GPUs):

1. **Phase 4 NCCL watchdog SIGABRT** — `dist_env.timeout_minutes: 1` (60 s) in the YAML is way short of the rank-0 vanilla-HF reload of the 120 B model (~3 min). While other ranks wait at the post-Phase-4 `_barrier()`, the 60 s collective timeout fires and aborts the job. This is also what caused the earlier `print_trainable_parameters` SIGABRT (the per-parameter implicit `.norm()` collectives on DTensor params didn't have headroom). Override `dist_env.timeout_minutes: 30` in `ci.checkpoint_robustness:` so only the robustness run sees the longer timeout — normal training keeps the tight default.

2. **Phase 4 HF forward crash on Nemotron-H** — forcing `attn_implementation="flash_attention_2"` for `trust_remote_code` models routes Nemotron-H through HF's `flash_attention_forward → flash_attn_varlen_func` path, which trips a `vectorized_gather_kernel` index OOB inside the CUDA kernel. Sniff `model_type=="nemotron_h"` from the consolidated config and fall back to `sdpa` (still flash-backed via PyTorch's optimized SDPA, just skips HF's varlen routing); see the first sketch after this list. Phase 4 KL = 2.16e-2, well under the 7e-2 threshold.

3. **End-of-test interpreter teardown hang** — even with `destroy_global_state` already unregistered, PyTorch's own internal atexit handlers + DTensor/FSDP destructors + ProcessGroupNCCL watchdog shutdown still issue collectives on the default PG during interpreter teardown. With MoE+EP this races against ranks still in pytest's AST machinery, causing a non-deterministic ~50% hang. `os._exit(0)` after the test bypasses every Python finalizer — the OS reclaims resources without ever entering NCCL (see the second sketch after this list).
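
A minimal sketch of the fallback in item 2, assuming the consolidated config is readable with `transformers.AutoConfig` (the loader function below is hypothetical; only the `model_type` check mirrors the fix):

```python
from transformers import AutoConfig, AutoModelForCausalLM

def load_reference_model(model_path: str):
    # Hypothetical loader; the surrounding code in the repo differs.
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    # Nemotron-H trips an index-OOB in flash_attn_varlen_func when routed
    # through HF's flash_attention_forward, so fall back to PyTorch SDPA
    # (still flash-backed, just without HF's varlen routing).
    if getattr(config, "model_type", None) == "nemotron_h":
        attn_impl = "sdpa"
    else:
        attn_impl = "flash_attention_2"
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        attn_implementation=attn_impl,
    )
```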
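
And for item 3, the shape of the teardown guard (the helper name and its exact placement in the test file are assumptions; the PR's actual change is the `os._exit(0)` call itself):

```python
import os
import sys

def hard_exit(status: int = 0) -> None:
    # Illustrative helper, called as the very last statement of the test
    # after all assertions have passed. os._exit skips every Python
    # finalizer: no atexit handlers, no DTensor/FSDP destructors, no
    # ProcessGroupNCCL watchdog-shutdown collectives on the default PG.
    # The OS reclaims GPU and NCCL resources without re-entering NCCL.
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(status)
```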

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

copy-pr-bot Bot commented Apr 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

adil-a added 3 commits April 26, 2026 07:42
Temporary commit to get a focused CI signal for the checkpoint-robustness
fix on this branch. Revert before merging the PR.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
The recipe carried `ci.known_issue_id: AM-156`, which causes
`tests/ci_tests/utils/generate_ci_tests.py` to skip the recipe entirely
(`return None, None` at line 140). This PR fixes the underlying issue, so
the recipe should now generate a CI job again.
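
For reference, the skip has roughly this shape (a paraphrase, not the file's exact code; the function and variable names are assumptions):

```python
def build_ci_job(recipe: dict):
    # Paraphrase of the guard near line 140 of generate_ci_tests.py: a
    # recipe that still carries ci.known_issue_id is skipped outright, so
    # no CI job is emitted for it until the marker is removed.
    if recipe.get("ci", {}).get("known_issue_id"):
        return None, None
    ...  # normal job generation continues here
```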

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
CI run on eos H100 hit SLURM TIMEOUT at the 25-min wall: finetune phase took
322 s and the robustness test got cancelled mid-Phase-4 reload after another
~17 min. The launcher runs finetune + robustness in the same SLURM
allocation, so the wall has to cover both. 45 min gives ~15 min of headroom
on top of the ~30 min wall-clock we observed locally.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
