fix: nemotron-super-v3-hellaswag checkpoint robustness #2056
Open
Conversation
Three independent failures in `tests/functional_tests/checkpoint_robustness/test_checkpoint_robustness_llm.py` for the `nemotron_super_v3_hellaswag` config (verified end-to-end on cw-dfw, 4 nodes / 32 GPUs):

1. **Phase 4 NCCL watchdog SIGABRT** — `dist_env.timeout_minutes: 1` (60 s) in the YAML is way short of the rank-0 vanilla-HF reload of the 120 B model (~3 min). While other ranks wait at the post-Phase-4 `_barrier()`, the 60 s collective timeout fires and aborts the job. This is also what caused the earlier `print_trainable_parameters` SIGABRT (the per-parameter implicit `.norm()` collectives on DTensor params didn't have headroom). Override `dist_env.timeout_minutes: 30` in `ci.checkpoint_robustness:` so only the robustness run sees the longer timeout — normal training keeps the tight default (see the YAML sketch below).

2. **Phase 4 HF forward crash on Nemotron-H** — forcing `attn_implementation="flash_attention_2"` for `trust_remote_code` models routes Nemotron-H through HF's `flash_attention_forward → flash_attn_varlen_func` path, which trips a `vectorized_gather_kernel` index OOB inside the CUDA kernel. Sniff `model_type == "nemotron_h"` from the consolidated config and fall back to `sdpa` (still flash-backed via PyTorch's optimized SDPA, just skips HF's varlen routing). Phase 4 KL = 2.16e-2, well under the 7e-2 threshold.

3. **End-of-test interpreter teardown hang** — even with `destroy_global_state` already unregistered, PyTorch's own internal atexit handlers + DTensor/FSDP destructors + ProcessGroupNCCL watchdog shutdown still issue collectives on the default PG during interpreter teardown. With MoE+EP this races against ranks still in pytest's AST machinery, causing a non-deterministic ~50% hang. `os._exit(0)` after the test bypasses every Python finalizer — the OS reclaims resources without ever entering NCCL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
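A minimal sketch of the recipe-level override from item 1, assuming the usual nested-YAML recipe layout (only the keys named in the commit message are taken from the PR; the surrounding structure is illustrative):

```yaml
# Hedged sketch, not the literal recipe diff.
dist_env:
  timeout_minutes: 1            # tight default, kept for normal training

ci:
  checkpoint_robustness:
    dist_env:
      timeout_minutes: 30       # only the robustness run gets headroom for the
                                # ~3 min rank-0 vanilla-HF reload of the 120 B model
```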
Temporary commit to get a focused CI signal for the checkpoint-robustness fix on this branch. Revert before merging the PR.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
The recipe carried `ci.known_issue_id: AM-156`, which made `tests/ci_tests/utils/generate_ci_tests.py` skip the recipe entirely (`return None, None` at line 140). This PR fixes the underlying issue, so the recipe should now generate a CI job again.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
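For context, a hedged sketch of the skip gate in `generate_ci_tests.py`; only the `return None, None` behaviour on a set `known_issue_id` comes from this commit message, while the function shape and names here are assumptions:

```python
# Hedged sketch: approximate shape of the known-issue gate in
# tests/ci_tests/utils/generate_ci_tests.py (around line 140).
def generate_test_for_recipe(recipe_cfg: dict):
    known_issue = recipe_cfg.get("ci", {}).get("known_issue_id")
    if known_issue:
        # Recipe is skipped wholesale: no CI job is generated for it.
        return None, None
    ...  # normal job generation continues below
```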
The CI run on eos (H100) hit a SLURM TIMEOUT at the 25 min wall: the finetune phase took 322 s and the robustness test was cancelled mid-Phase-4 reload after another ~17 min. The launcher runs finetune + robustness in the same SLURM allocation, so the wall has to cover both. 45 min gives ~15 min of headroom on top of the ~30 min wall clock we observed locally.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
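In recipe terms this is just the wall-clock bump; a hedged sketch, assuming `ci.time` takes minutes (the exact value format in the real recipe may differ):

```yaml
# Hedged sketch: only the ci.time key and the 25 -> 45 change are from the PR.
ci:
  time: 45   # was 25; one SLURM allocation must cover finetune (~5.5 min)
             # plus the robustness phase (~21 min)
```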
Summary
Three independent failures hit the `nemotron_super_v3_hellaswag` checkpoint robustness CI test (originally e.g. https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/304597767). Verified end-to-end on cw-dfw, 4 nodes / 32 GPUs, then on the eos CI runners.

- `dist_env.timeout_minutes: 1` (60 s) is way short of the rank-0 vanilla-HF reload of the 120 B model (~3 min). While other ranks wait at the post-Phase-4 `_barrier()`, the 60 s collective timeout fires. This is also what caused the earlier `print_trainable_parameters` SIGABRT — the per-parameter implicit `.norm()` collectives on DTensor params didn't have headroom under the tight timeout. Override `dist_env.timeout_minutes: 30` in `ci.checkpoint_robustness:` so only the robustness run sees the longer timeout; normal training keeps the tight default.
- `attn_implementation="flash_attention_2"` for `trust_remote_code` models routes Nemotron-H through HF's `flash_attention_forward → flash_attn_varlen_func` path, which trips a `vectorized_gather_kernel` index-OOB inside the CUDA kernel. Detect `model_type == "nemotron_h"` from the consolidated config and fall back to `sdpa` (still flash-backed via PyTorch's optimized SDPA, just skips HF's varlen routing); see the first sketch after this list. Phase 4 KL = 2.16e-2, well under the 7e-2 threshold.
- Even with `destroy_global_state` already unregistered, PyTorch's internal atexit handlers + DTensor/FSDP destructors + ProcessGroupNCCL watchdog shutdown still issue collectives on the default PG during interpreter teardown. With MoE+EP this races against ranks still in pytest's AST machinery (non-deterministic ~50% hang). `os._exit(0)` after the test bypasses every Python finalizer — the OS reclaims resources without ever entering NCCL; see the second sketch after this list.
- `ci.time` 25 m → 45 m — the CI launcher runs the finetune phase (~5 min) and the robustness phase (~21 min) under the same SLURM allocation, so a 25 min wall left no headroom and TIMEOUT-killed the robustness phase mid-Phase-4.
- Dropped `ci.known_issue_id: AM-156` — the marker was making `tests/ci_tests/utils/generate_ci_tests.py` skip the recipe entirely. It now generates a real CI job for the recipe (which is the underlying reason this PR exists).
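A minimal sketch of the attention-implementation fallback from the second bullet; the `model_type` sniff and the `sdpa` choice are from this PR, while the function shape and helper names are assumptions:

```python
# Hedged sketch of the Phase-4 reload fallback; names are illustrative only.
from transformers import AutoConfig, AutoModelForCausalLM

def load_vanilla_hf(consolidated_dir: str):
    cfg = AutoConfig.from_pretrained(consolidated_dir, trust_remote_code=True)
    # Nemotron-H trips a vectorized_gather_kernel index OOB in HF's
    # flash_attn_varlen_func routing, so drop to PyTorch SDPA (still
    # flash-backed) instead of forcing flash_attention_2.
    attn_impl = "sdpa" if getattr(cfg, "model_type", None) == "nemotron_h" else "flash_attention_2"
    return AutoModelForCausalLM.from_pretrained(
        consolidated_dir,
        trust_remote_code=True,
        attn_implementation=attn_impl,
    )
```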
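And a sketch of the teardown bypass from the third bullet; `os._exit(0)` is from the PR, while the pre-exit barrier and the wrapper function are assumptions:

```python
# Hedged sketch: skip every Python finalizer once the test body has passed,
# so interpreter teardown never issues NCCL collectives.
import os
import torch.distributed as dist

def finish_test_and_exit():
    if dist.is_initialized():
        dist.barrier()  # make sure every rank has finished its assertions
    # os._exit bypasses atexit handlers, DTensor/FSDP destructors, and the
    # ProcessGroupNCCL watchdog shutdown; the OS reclaims all resources.
    os._exit(0)
```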
Test plan

- `nemotron_super_v3_hellaswag` SUCCESS, finetune 327 s + robustness 1259 s under the 45 min wall: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/pipelines/49532337 (the `vllm_deploy` variant in this run failed with a 404 on `automodel-deploy:pipe.49532337` — that's an unrelated container-registry infra issue, not from this PR)

🤖 Generated with Claude Code