Add Cosmos3-Nano GPU smoke tests + GPU CI; self-prep regression inputs #15

Open
lfengad wants to merge 6 commits into main from feat/cosmos3-super-task-checkpoints

Collaborator

@lfengad lfengad commented Jun 3, 2026

  • tests/nano_inference_smoke_test.py: 8-GPU Cosmos3-Nano t2vs (text2video + sound) smoke. Asserts a vision.mp4 is produced, then decodes its muxed audio track (PyAV) and checks it is real sound: finite, non-empty, non-constant, above the silence floor.
  • tests/nano_training_smoke_test.py: 8-GPU Vision SFT 1-iter smoke. Downloads the bridge subset + Wan VAE, converts Cosmos3-Nano -> DCP, runs the 1-iter launcher, and asserts training finishes with a finite loss + a written checkpoint. All run output goes under the pytest tmp dir.
  • tests/launch_regression_test.py: prepare inputs in-test via the new h100_inputs fixture (download + convert, honoring pre-set env vars, cleaned on teardown) instead of requiring tests/_stage_h100_inputs.sh env vars; re-captured h100 goldens at transformers==4.57.6; map H200 to the h100 goldens key (the GPU CI runs on 8xH200).
  • tests/{launch_sft_vision_nano_1iter.sh,vision_sft_nano_1iter.toml}: 1-iter SFT recipe fixtures (moved from examples/; the launcher reuses the shared examples/_sft_launcher_common.sh).
  • .github/workflows/gpu-tests.yml: on push/PR to main, run the 8-GPU smoke tests and the 4-GPU SFT regression on a self-hosted 8xH200 runner.
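
The four audio conditions listed for the inference smoke test (finite, non-empty, non-constant, above the silence floor) can be sketched as below. This is only an illustration, not the test's actual code: the function name `is_real_sound`, the RMS-based definition of the silence floor, and the `1e-4` threshold are assumptions, and the samples are taken to be already decoded into a flat sequence of floats (the PR uses PyAV for that step, which is omitted here).

```python
import math


def is_real_sound(samples, silence_floor=1e-4):
    """Return True if decoded audio samples look like real sound.

    Hypothetical helper: the name, the RMS criterion, and the default
    threshold are assumptions, not taken from the PR.
    """
    # Non-empty: a muxed track with zero samples is not sound.
    if len(samples) == 0:
        return False
    # Finite: reject NaN/inf produced by a broken decode or model output.
    if not all(math.isfinite(s) for s in samples):
        return False
    # Non-constant: a flat signal (all zeros, or a DC offset) carries no audio.
    if max(samples) == min(samples):
        return False
    # Above the silence floor: compare RMS energy with a small threshold.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > silence_floor
```

A constant non-zero signal fails the non-constant check even though its RMS is large, which is why the two conditions are tested separately.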

All GPU tests are gated by the gpus()/level() markers + --num-gpus/--levels, so the no-GPU pre-commit CI is unaffected. Verified on 8xH100: nano smoke 2 passed, SFT regression 2 passed.
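
The gating described above could reduce to a decision like the following. This is a sketch under stated assumptions: the PR does not show how `gpus()`/`level()` interact with `--num-gpus`/`--levels`, so the function name `should_run`, the "level at most `--levels`" rule, and the zero defaults are all hypothetical.

```python
def should_run(required_gpus, required_level, num_gpus=0, levels=0):
    """Decide whether a marked test runs under the given CLI options.

    Assumed semantics (not confirmed by the PR):
      - gpus(n) means the test needs at least n GPUs,
      - level(k) means the test is selected when k <= --levels,
      - both options default to 0, so GPU tests are skipped unless asked for.
    """
    enough_gpus = required_gpus <= num_gpus
    level_selected = required_level <= levels
    return enough_gpus and level_selected
```

With these defaults a run without `--num-gpus` skips every test marked with a non-zero GPU count, which matches the claim that the no-GPU pre-commit CI is unaffected. In a real suite this check would typically live in a `pytest_collection_modifyitems` hook that adds a skip marker to deselected items.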

@lfengad lfengad force-pushed the feat/cosmos3-super-task-checkpoints branch from c039357 to 2b8e75b Compare June 3, 2026 14:18
lfengad added 3 commits June 3, 2026 07:47
Replace the single gpu-tests.yml (one job, two test steps) with two
workflows so the 8-GPU nano smoke tests and the 4-GPU SFT regression run
and report independently:

- .github/workflows/gpu-smoke-tests.yml: nano t2vs + 1-iter SFT smoke
  (--num-gpus=8 --levels=2).
- .github/workflows/gpu-regression.yml: SFT loss/grad-norm regression
  (TEST_MAX_GPUS=4, --num-gpus=4 --levels=2).

Both run on [self-hosted, gpu, h200] on push/PR to main with pytest -v -s
(live logs) and an if: always() cleanup, and use distinct concurrency groups
so they don't cancel each other.

The smoke + regression tests used hardcoded --master_port values (50012/
50022/50023, 29560/50112), which fail with
`DistNetworkError: ... EADDRINUSE ... port: 50022` when a port is held by a
lingering process, is still in TIME_WAIT, or is taken by a concurrent run.
Each test now binds an OS-assigned free port (_free_port) right before
launching torchrun and passes it as --master_port / the launcher MASTER_PORT.
Dropped the now-unused LaunchSpec.master_port field.

Verified on 8xH100: nano training smoke 1 passed, no EADDRINUSE.
…n failure

- Replace gpu-regression.yml with two single-spec workflows so the generator
  (VFM, vision_sft_nano) and reasoner (VLM, llava_ov_datapacker) regressions
  run and report independently, each via `pytest -k <spec>`:
    .github/workflows/gpu-regression-generator.yml
    .github/workflows/gpu-regression-reasoner.yml
- launch_regression_test.py: on a goldens/parse mismatch, include the run-log
  tail and the got-vs-expected series in the failure message (the log also
  streams live under `pytest -s`), so failures carry the run detail.
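
A failure message that carries both the got-vs-expected series and the run-log tail might be assembled like this. It is a minimal sketch: the function name `format_mismatch`, its parameters, and the 20-line default tail are hypothetical and not taken from launch_regression_test.py.

```python
def format_mismatch(spec, got, expected, log_text, tail_lines=20):
    """Build an assertion message for a goldens mismatch (hypothetical helper).

    spec       name of the regression spec, e.g. "vision_sft_nano"
    got        series parsed from the run (e.g. per-step loss)
    expected   golden series for the same spec
    log_text   full text of the run log
    """
    # Keep only the last few log lines so the message stays readable.
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    return (
        f"{spec}: series does not match goldens\n"
        f"  got:      {got}\n"
        f"  expected: {expected}\n"
        f"--- last {tail_lines} log lines ---\n"
        f"{tail}"
    )
```

It would be used as the message of the failing assertion, for example `assert got == expected, format_mismatch(spec, got, expected, log_text)`.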
Dinghow previously approved these changes Jun 4, 2026
Collaborator

@Dinghow Dinghow left a comment

Overall LGTM.

Collaborator

Dinghow commented Jun 4, 2026

nit: shall we take some GB200 for CI?

…r) and split smoke CI

- tests/nano_inference_smoke_test.py: one inference call over three modalities
  (t2vs text2video+sound, action policy, action forward_dynamics); validates
  each sample's vision.mp4 (PyAV decode), the t2vs audio (not-noise), and the
  policy action array.
- tests/nano_training_smoke_test.py: convert -> train 5 steps -> export ->
  t2i-from-export pipeline with per-step checks: DCP + exported-model
  completeness (file/shard + index counts + tensor-manifest self-consistency,
  no tensor load), a loss-decrease check (min(loss) < first loss), and a valid
  output image.
- tests/{vision_sft_nano_5iter.toml,launch_sft_vision_nano_5iter.sh}: 5-step
  smoke recipe (replaces the 1-iter fixtures).
- CI: split gpu-smoke-tests.yml into gpu-smoke-inference.yml and
  gpu-smoke-training.yml (training timeout 90 min). Each if: always() cleanup
  clears run output (pytest tmp; + examples/checkpoints for training), keeping
  examples/data + the HF cache.
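
An exported-model completeness check that reads only the index, without loading tensors, could look like the sketch below. It assumes the export uses the Hugging Face sharded layout (a `model.safetensors.index.json` whose `weight_map` maps tensor names to shard files); the PR does not state the export format, and the function name `check_export_complete` is hypothetical.

```python
import json
from pathlib import Path


def check_export_complete(export_dir, index_name="model.safetensors.index.json"):
    """Return a list of problems found in a sharded export (empty if none).

    Assumes a Hugging Face style index file; only file names and the JSON
    index are inspected, no tensor data is read.
    """
    export_dir = Path(export_dir)
    index = json.loads((export_dir / index_name).read_text())
    weight_map = index.get("weight_map", {})

    referenced = set(weight_map.values())  # shards the index points at
    on_disk = {p.name for p in export_dir.glob("*.safetensors")}

    problems = []
    if not weight_map:
        problems.append("index lists no tensors")
    for name in sorted(referenced - on_disk):
        problems.append(f"missing shard: {name}")
    for name in sorted(on_disk - referenced):
        problems.append(f"unreferenced shard: {name}")
    return problems
```

Returning a list of problems rather than a boolean lets the test report every missing or stray shard in a single failure.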

Verified on 8xH100: inference smoke and training pipeline both pass.
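
The loss check in the training pipeline (min(loss) < first) is small enough to sketch directly. The function name `loss_decreased` is hypothetical, and the finiteness guard is an addition carried over from the earlier 1-iter smoke test's "finite loss" assertion rather than something this commit states.

```python
import math


def loss_decreased(losses):
    """True if some step's loss is strictly below the first step's loss."""
    # Guard (assumed): an empty or non-finite series is treated as a failure.
    if not losses or not all(math.isfinite(x) for x in losses):
        return False
    # min(loss) < first: the run only has to dip below its starting loss once.
    return min(losses) < losses[0]
```

Comparing against the minimum rather than the last step tolerates noisy 5-step runs where the final loss happens to tick back up.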
@lfengad lfengad force-pushed the feat/cosmos3-super-task-checkpoints branch from ad450b6 to b31ff92 Compare June 4, 2026 03:42