
fix: nemotron-super-v3-hellaswag checkpoint robustness #2056

Open

adil-a wants to merge 4 commits into r0.4.0 from fix/nemotron-super-v3-hellaswag-ckpt-robustness

Conversation

adil-a (Collaborator) commented Apr 26, 2026

Summary

Three independent failures hit the `nemotron_super_v3_hellaswag` checkpoint robustness CI test (first seen in, e.g., https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/304597767). Verified end-to-end on cw-dfw, 4 nodes / 32 GPUs, then on the eos CI runners.

  • **Phase 4 NCCL watchdog SIGABRT** — `dist_env.timeout_minutes: 1` (60 s) is way short of the rank-0 vanilla-HF reload of the 120 B model (~3 min). While other ranks wait at the post-Phase-4 `_barrier()`, the 60 s collective timeout fires. This is also what caused the earlier `print_trainable_parameters` SIGABRT — the per-parameter implicit `.norm()` collectives on DTensor params didn't have headroom under the tight timeout. Override `dist_env.timeout_minutes: 30` in `ci.checkpoint_robustness:` so only the robustness run sees the longer timeout; normal training keeps the tight default (see the sketch after this list).
  • **Phase 4 HF forward crash on Nemotron-H** — forcing `attn_implementation="flash_attention_2"` for `trust_remote_code` models routes Nemotron-H through HF's `flash_attention_forward → flash_attn_varlen_func` path, which trips a `vectorized_gather_kernel` index OOB inside the CUDA kernel. Detect `model_type=="nemotron_h"` from the consolidated config and fall back to `sdpa` (still flash-backed via PyTorch's optimized SDPA, just skips HF's varlen routing). Phase 4 KL = 2.16e-2, well under the 7e-2 threshold.
  • **End-of-test interpreter teardown hang** — even with `destroy_global_state` already unregistered, PyTorch's internal atexit handlers + DTensor/FSDP destructors + ProcessGroupNCCL watchdog shutdown still issue collectives on the default PG during interpreter teardown. With MoE+EP this races against ranks still in pytest's AST machinery (non-deterministic ~50% hang). `os._exit(0)` after the test bypasses every Python finalizer — the OS reclaims resources without ever entering NCCL.
  • **Side fix: `ci.time` 25 min → 45 min** — the CI launcher runs the finetune phase (~5 min) and the robustness phase (~21 min) under the same SLURM allocation, so a 25 min wall left no headroom and TIMEOUT-killed the robustness phase mid-Phase-4.
  • **Side fix: drop `ci.known_issue_id: AM-156`** — the marker was making `tests/ci_tests/utils/generate_ci_tests.py` skip the recipe entirely. It now generates a real CI job for the recipe (which is the underlying reason this PR exists).
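
For item 1, a minimal sketch of the mechanism at the PyTorch level, assuming the YAML key `dist_env.timeout_minutes` ultimately feeds the `timeout` argument of `torch.distributed.init_process_group` (the wrapper below is illustrative, not the repo's actual plumbing):

```python
from datetime import timedelta

import torch.distributed as dist

# Hypothetical plumbing: assume dist_env.timeout_minutes lands here as
# `timeout_minutes`. With the old value of 1, any rank blocked on a
# collective for more than 60 s (e.g. waiting at the post-Phase-4
# _barrier() while rank 0 spends ~3 min reloading the 120 B model) gets
# SIGABRTed by the NCCL watchdog; 30 minutes gives the reload headroom.
def init_dist(timeout_minutes: int = 30) -> None:
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=timeout_minutes),
    )
```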

Test plan

  • cw-dfw 4×8 H100 — direct `python` invocation, all 4 phases pass: Phase 3 KL = 0, Phase 4 KL = 2.16e-2, exit 0:0, 19 min wall (slurm 11336876)
  • cw-dfw 4×8 H100 — `pytest` invocation, exit 0:0, 19 min wall (slurm 11335402)
  • eos CI 4×8 H100 (this branch) — `nemotron_super_v3_hellaswag` SUCCESS, finetune 327 s + robustness 1259 s under the 45 min wall: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/pipelines/49532337 (the `vllm_deploy` variant in this run failed with a 404 on `automodel-deploy:pipe.49532337` — an unrelated container-registry infra issue, not from this PR)

Note: this branch also has a temporary commit (`ci: TEMP scope nightly_recipes`) that slims `tests/ci_tests/configs/llm_finetune/nightly_recipes.yml` to a single recipe so the focused pipeline above only schedules `nemotron_super_v3_hellaswag`. That commit will be reverted before merge.

🤖 Generated with Claude Code

Three independent failures in `tests/functional_tests/checkpoint_robustness/test_checkpoint_robustness_llm.py` for the `nemotron_super_v3_hellaswag` config (verified end-to-end on cw-dfw, 4 nodes / 32 GPUs):

1. **Phase 4 NCCL watchdog SIGABRT** — `dist_env.timeout_minutes: 1` (60 s) in the YAML is way short of the rank-0 vanilla-HF reload of the 120 B model (~3 min). While other ranks wait at the post-Phase-4 `_barrier()`, the 60 s collective timeout fires and aborts the job. This is also what caused the earlier `print_trainable_parameters` SIGABRT (the per-parameter implicit `.norm()` collectives on DTensor params didn't have headroom). Override `dist_env.timeout_minutes: 30` in `ci.checkpoint_robustness:` so only the robustness run sees the longer timeout — normal training keeps the tight default.

2. **Phase 4 HF forward crash on Nemotron-H** — forcing `attn_implementation="flash_attention_2"` for `trust_remote_code` models routes Nemotron-H through HF's `flash_attention_forward → flash_attn_varlen_func` path, which trips a `vectorized_gather_kernel` index OOB inside the CUDA kernel. Sniff `model_type=="nemotron_h"` from the consolidated config and fall back to `sdpa` (still flash-backed via PyTorch's optimized SDPA, just skips HF's varlen routing); see the first sketch after this list. Phase 4 KL = 2.16e-2, well under the 7e-2 threshold.

3. **End-of-test interpreter teardown hang** — even with `destroy_global_state` already unregistered, PyTorch's own internal atexit handlers + DTensor/FSDP destructors + ProcessGroupNCCL watchdog shutdown still issue collectives on the default PG during interpreter teardown. With MoE+EP this races against ranks still in pytest's AST machinery, causing a non-deterministic ~50% hang. `os._exit(0)` after the test bypasses every Python finalizer — the OS reclaims resources without ever entering NCCL (see the second sketch after this list).
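
A minimal sketch of the fallback in item 2, assuming the consolidated config is readable with `transformers.AutoConfig` (the loader function below is hypothetical; only the `model_type` check mirrors the fix):

```python
from transformers import AutoConfig, AutoModelForCausalLM

def load_reference_model(model_path: str):
    # Hypothetical loader; the surrounding code in the repo differs.
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    # Nemotron-H trips an index-OOB in flash_attn_varlen_func when routed
    # through HF's flash_attention_forward, so fall back to PyTorch SDPA
    # (still flash-backed, just without HF's varlen routing).
    if getattr(config, "model_type", None) == "nemotron_h":
        attn_impl = "sdpa"
    else:
        attn_impl = "flash_attention_2"
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        attn_implementation=attn_impl,
    )
```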
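
And for item 3, the shape of the teardown guard (the helper name and its exact placement in the test file are assumptions; the PR's actual change is the `os._exit(0)` call itself):

```python
import os
import sys

def hard_exit(status: int = 0) -> None:
    # Illustrative helper, called as the very last statement of the test
    # after all assertions have passed. os._exit skips every Python
    # finalizer: no atexit handlers, no DTensor/FSDP destructors, no
    # ProcessGroupNCCL watchdog-shutdown collectives on the default PG.
    # The OS reclaims GPU and NCCL resources without re-entering NCCL.
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(status)
```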

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

copy-pr-bot Bot commented Apr 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

adil-a added 3 commits April 26, 2026 07:42
Temporary commit to get a focused CI signal for the checkpoint-robustness
fix on this branch. Revert before merging the PR.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
The recipe carried `ci.known_issue_id: AM-156`, which causes
`tests/ci_tests/utils/generate_ci_tests.py` to skip the recipe entirely
(`return None, None` at line 140). This PR fixes the underlying issue, so
the recipe should now generate a CI job again.
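
For reference, the skip has roughly this shape (a paraphrase, not the file's exact code; the function and variable names are assumptions):

```python
def build_ci_job(recipe: dict):
    # Paraphrase of the guard near line 140 of generate_ci_tests.py: a
    # recipe that still carries ci.known_issue_id is skipped outright, so
    # no CI job is emitted for it until the marker is removed.
    if recipe.get("ci", {}).get("known_issue_id"):
        return None, None
    ...  # normal job generation continues here
```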

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
CI run on eos H100 hit SLURM TIMEOUT at the 25-min wall: finetune phase took
322 s and the robustness test got cancelled mid-Phase-4 reload after another
~17 min. The launcher runs finetune + robustness in the same SLURM
allocation, so the wall has to cover both. 45 min gives ~15 min of headroom
on top of the ~30 min wall-clock we observed locally.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
