Add Cosmos3-Nano GPU smoke tests + GPU CI; self-prep regression inputs by lfengad · Pull Request #15 · NVIDIA/cosmos-framework

lfengad · 2026-06-03T14:05:13Z

- tests/nano_inference_smoke_test.py: 8-GPU Cosmos3-Nano t2vs (text2video + sound) smoke. Asserts a vision.mp4 is produced, then decodes its muxed audio track (PyAV) and checks it is real sound: finite, non-empty, non-constant, above the silence floor.
- tests/nano_training_smoke_test.py: 8-GPU Vision SFT 1-iter smoke. Downloads the bridge subset + Wan VAE, converts Cosmos3-Nano -> DCP, runs the 1-iter launcher, and asserts training finishes with a finite loss + a written checkpoint. All run output goes under the pytest tmp dir.
- tests/launch_regression_test.py: prepare inputs in-test via the new h100_inputs fixture (download + convert, honoring pre-set env vars, cleaned on teardown) instead of requiring tests/_stage_h100_inputs.sh env vars; re-captured h100 goldens at transformers==4.57.6; map H200 to the h100 goldens key (the GPU CI runs on 8xH200).
- tests/{launch_sft_vision_nano_1iter.sh,vision_sft_nano_1iter.toml}: 1-iter SFT recipe fixtures (moved from examples/; the launcher reuses the shared examples/_sft_launcher_common.sh).
- .github/workflows/gpu-tests.yml: on push/PR to main, run the 8-GPU smoke tests and the 4-GPU SFT regression on a self-hosted 8xH200 runner.
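The "real sound" check in the inference smoke test can be pictured roughly as follows. This is only a minimal sketch, not the code from the PR: the function name `looks_like_real_sound` and the `SILENCE_FLOOR_RMS` value are hypothetical, and the PyAV decoding step that would produce the sample list is omitted.

```python
import math
from typing import Sequence

# Hypothetical threshold; the PR does not state the actual silence floor.
SILENCE_FLOOR_RMS = 1e-4


def looks_like_real_sound(samples: Sequence[float],
                          floor: float = SILENCE_FLOOR_RMS) -> bool:
    """Return True if decoded PCM samples pass the four checks listed above."""
    if len(samples) == 0:                            # non-empty
        return False
    if not all(math.isfinite(s) for s in samples):   # finite (no NaN/inf)
        return False
    if max(samples) == min(samples):                 # non-constant
        return False
    # Root-mean-square level as a simple loudness measure.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > floor                               # above the silence floor
```

A one-second 440 Hz tone passes all four checks, while an all-zero track fails the non-constant check before the RMS is even computed.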

All GPU tests are gated by the gpus()/level() markers + --num-gpus/--levels, so the no-GPU pre-commit CI is unaffected. Verified on 8xH100: nano smoke 2 passed, SFT regression 2 passed.

Replace the single gpu-tests.yml (one job, two test steps) with two workflows so the 8-GPU nano smoke tests and the 4-GPU SFT regression run and report independently:

- .github/workflows/gpu-smoke-tests.yml: nano t2vs + 1-iter SFT smoke (--num-gpus=8 --levels=2).
- .github/workflows/gpu-regression.yml: SFT loss/grad-norm regression (TEST_MAX_GPUS=4, --num-gpus=4 --levels=2).

Both run on [self-hosted, gpu, h200] for push/PR to main with pytest -v -s (live logs) and an if: always() cleanup; distinct concurrency groups so they don't cancel each other.

The smoke + regression tests used hardcoded --master_port values (50012/50022/50023, 29560/50112), which raise `DistNetworkError: ... EADDRINUSE ... port: 50022` when a port is held by a lingering process, is still in TIME_WAIT, or is in use by a concurrent run. Each test now binds an OS-assigned free port (_free_port) right before launching torchrun and passes it as --master_port / the launcher MASTER_PORT. Dropped the now-unused LaunchSpec.master_port field. Verified on 8xH100: nano training smoke 1 passed, no EADDRINUSE.
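For readers unfamiliar with the technique, binding to port 0 is the usual way to get an OS-assigned free port. The sketch below illustrates that idea only; `free_port` and `torchrun_command` are illustrative names, not the `_free_port` helper from the PR, whose body is not shown here.

```python
import socket


def free_port() -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))       # port 0 lets the kernel choose
        return sock.getsockname()[1]      # the port the kernel picked


def torchrun_command(nproc: int, script: str) -> list[str]:
    """Build a torchrun argv with a freshly chosen rendezvous port."""
    # The port is chosen as late as possible to shorten the window in which
    # another process could grab it after the probe socket is closed.
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={free_port()}",
        script,
    ]
```

Because the probe socket is closed before torchrun starts, a small race remains; picking the port immediately before launch keeps that window short but does not remove it.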

…n failure

- Replace gpu-regression.yml with two single-spec workflows so the generator (VFM, vision_sft_nano) and reasoner (VLM, llava_ov_datapacker) regressions run and report independently, each via `pytest -k <spec>`:
  .github/workflows/gpu-regression-generator.yml
  .github/workflows/gpu-regression-reasoner.yml
- launch_regression_test.py: on a goldens/parse mismatch, include the run-log tail and the got-vs-expected series in the failure message (the log also streams live under `pytest -s`), so failures carry the run detail.
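A failure message of the kind described in that last item might be assembled along these lines. This is a hedged sketch: `mismatch_message`, its parameters, and the output layout are all hypothetical and are not taken from launch_regression_test.py.

```python
from typing import Sequence


def mismatch_message(spec: str,
                     got: Sequence[float],
                     expected: Sequence[float],
                     log_text: str,
                     tail_lines: int = 20) -> str:
    """Build an assertion message carrying both series and the run-log tail."""
    # One row per step so the diverging step is easy to spot.
    rows = [
        f"  step {i}: got={g!r} expected={e!r}"
        for i, (g, e) in enumerate(zip(got, expected))
    ]
    if len(got) != len(expected):
        rows.append(
            f"  length mismatch: got {len(got)} values, expected {len(expected)}"
        )
    # Keep only the last few log lines to bound the message size.
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    return "\n".join([
        f"{spec}: series does not match goldens",
        *rows,
        f"--- last {tail_lines} log lines ---",
        tail,
    ])
```

Passing the result as the assertion message puts the run detail in the pytest report itself, which matters when the live log has already scrolled away.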

Dinghow

Overall LGTM.

Dinghow · 2026-06-04T02:33:52Z

nit: shall we take some GB200 for CI?

…r) and split smoke CI

- tests/nano_inference_smoke_test.py: one inference call over three modalities (t2vs text2video+sound, action policy, action forward_dynamics); validates each sample's vision.mp4 (PyAV decode), the t2vs audio (not-noise), and the policy action array.
- tests/nano_training_smoke_test.py: convert -> train 5 -> export -> t2i-from-export pipeline with per-step checks: DCP + exported-model completeness (file/shard + index counts + tensor-manifest self-consistency, no tensor load), loss-degrades (min(loss)<first), and a valid output image.
- tests/{vision_sft_nano_5iter.toml,launch_sft_vision_nano_5iter.sh}: 5-step smoke recipe (replaces the 1-iter fixtures).
- CI: split gpu-smoke-tests.yml into gpu-smoke-inference.yml and gpu-smoke-training.yml (training timeout 90 min). Each if: always() cleanup clears run output (pytest tmp; + examples/checkpoints for training), keeping examples/data + the HF cache.

Verified on 8xH100: inference smoke and training pipeline both pass.
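The loss-degrades condition, min(loss) < first, can be written as a small predicate. The sketch below assumes the per-step losses have already been parsed from the training log into a list; the name `loss_degrades` and the extra finiteness and length guards are assumptions, not details confirmed by the PR.

```python
import math
from typing import Sequence


def loss_degrades(losses: Sequence[float]) -> bool:
    """True if some later step reaches a loss strictly below the first step."""
    if len(losses) < 2:                               # need a later step to compare
        return False
    if not all(math.isfinite(x) for x in losses):     # reject NaN/inf runs
        return False
    # min() includes the first element, so this holds only when a later
    # step is strictly lower than the starting loss.
    return min(losses) < losses[0]
```

Comparing against the minimum rather than the last step tolerates the noise of a 5-step run, where the final loss can sit slightly above an earlier one.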

lfengad force-pushed the feat/cosmos3-super-task-checkpoints branch from c039357 to 2b8e75b Compare June 3, 2026 14:18

lfengad added 3 commits June 3, 2026 07:47

Dinghow previously approved these changes Jun 4, 2026

lfengad dismissed Dinghow’s stale review via ad450b6 June 4, 2026 03:36

Lint

b31ff92

lfengad force-pushed the feat/cosmos3-super-task-checkpoints branch from ad450b6 to b31ff92 Compare June 4, 2026 03:42

lfengad wants to merge 6 commits into main from feat/cosmos3-super-task-checkpoints
