Add Cosmos3-Nano GPU smoke tests + GPU CI; self-prep regression inputs #15
Open
lfengad wants to merge 6 commits into
- tests/nano_inference_smoke_test.py: 8-GPU Cosmos3-Nano t2vs (text2video +
sound) smoke. Asserts a vision.mp4 is produced, then decodes its muxed audio
track (PyAV) and checks it is real sound: finite, non-empty, non-constant,
above the silence floor.
- tests/nano_training_smoke_test.py: 8-GPU Vision SFT 1-iter smoke. Downloads
the bridge subset + Wan VAE, converts Cosmos3-Nano -> DCP, runs the 1-iter
launcher, and asserts training finishes with a finite loss + a written
checkpoint. All run output goes under the pytest tmp dir.
- tests/launch_regression_test.py: prepare inputs in-test via the new
h100_inputs fixture (download + convert, honoring pre-set env vars, cleaned
on teardown) instead of requiring tests/_stage_h100_inputs.sh env vars;
re-captured h100 goldens at transformers==4.57.6; map H200 to the h100
goldens key (the GPU CI runs on 8xH200).
- tests/{launch_sft_vision_nano_1iter.sh,vision_sft_nano_1iter.toml}: 1-iter
SFT recipe fixtures (moved from examples/; the launcher reuses the shared
examples/_sft_launcher_common.sh).
- .github/workflows/gpu-tests.yml: on push/PR to main, run the 8-GPU smoke
tests and the 4-GPU SFT regression on a self-hosted 8xH200 runner.
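For reference, the "real sound" check on the decoded t2vs audio could look roughly like the sketch below. It assumes the samples have already been decoded (PyAV in the test) into a float NumPy array. The function name and the `SILENCE_FLOOR` value are illustrative assumptions, not the values used in tests/nano_inference_smoke_test.py.

```python
import numpy as np

# Assumed RMS threshold for "above the silence floor"; the real test's
# threshold is not shown in this PR description.
SILENCE_FLOOR = 1e-4


def looks_like_real_audio(samples, silence_floor=SILENCE_FLOOR):
    """Return True if samples are non-empty, finite, non-constant and audible."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.size == 0:                      # non-empty
        return False
    if not np.all(np.isfinite(samples)):       # finite (no NaN/Inf)
        return False
    if samples.max() == samples.min():         # non-constant
        return False
    rms = float(np.sqrt(np.mean(samples ** 2)))
    return rms > silence_floor                 # above the silence floor
```

Each condition maps to one of the four properties listed for the audio track, so a failure can be attributed to a specific property if the checks are split into separate asserts.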
All GPU tests are gated by the gpus()/level() markers + --num-gpus/--levels, so
the no-GPU pre-commit CI is unaffected. Verified on 8xH100: nano smoke 2 passed,
SFT regression 2 passed.
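The gating described above can be pictured as a small pure function that a conftest collection hook would call for each test. This is only a sketch under assumptions: the function name is hypothetical, and it assumes `--levels` acts as an upper bound on `level()`, which this description does not spell out.

```python
def skip_reason(required_gpus, required_level, num_gpus, max_level):
    """Return why a test should be skipped, or None if it may run.

    required_gpus / required_level: values from the test's gpus()/level() markers.
    num_gpus / max_level: values passed via --num-gpus / --levels
    (assumed semantics: a test runs only when both requirements are met).
    """
    if required_gpus > num_gpus:
        return f"needs {required_gpus} GPUs, --num-gpus={num_gpus}"
    if required_level > max_level:
        return f"needs level {required_level}, --levels={max_level}"
    return None
```

With `--num-gpus` defaulting to 0 (again an assumption), every `gpus(N)` test with N > 0 gets a skip reason, which is why a no-GPU run would be unaffected.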
c039357 to
2b8e75b
Replace the single gpu-tests.yml (one job, two test steps) with two workflows so
the 8-GPU nano smoke tests and the 4-GPU SFT regression run and report
independently:
- .github/workflows/gpu-smoke-tests.yml: nano t2vs + 1-iter SFT smoke
(--num-gpus=8 --levels=2).
- .github/workflows/gpu-regression.yml: SFT loss/grad-norm regression
(TEST_MAX_GPUS=4, --num-gpus=4 --levels=2).
Both run on [self-hosted, gpu, h200] for push/PR to main with pytest -v -s (live
logs) and an if: always() cleanup; they use distinct concurrency groups so they
don't cancel each other.
The smoke + regression tests used hardcoded --master_port values (50012/50022/50023, 29560/50112), which raise `DistNetworkError: ... EADDRINUSE ... port: 50022` when a port is held by a lingering process, is still in TIME_WAIT, or is taken by a concurrent run. Each test now binds an OS-assigned free port (_free_port) right before launching torchrun and passes it as --master_port / the launcher's MASTER_PORT. Dropped the now-unused LaunchSpec.master_port field. Verified on 8xH100: nano training smoke 1 passed, no EADDRINUSE.
…n failure
- Replace gpu-regression.yml with two single-spec workflows so the generator
(VFM, vision_sft_nano) and reasoner (VLM, llava_ov_datapacker) regressions
run and report independently, each via `pytest -k <spec>`:
.github/workflows/gpu-regression-generator.yml
.github/workflows/gpu-regression-reasoner.yml
- launch_regression_test.py: on a goldens/parse mismatch, include the run-log
tail and the got-vs-expected series in the failure message (the log also
streams live under `pytest -s`), so failures carry the run detail.
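As an illustration of the richer failure message, a helper along these lines could assemble the log tail and the two series. The name `format_mismatch`, its parameters, and the 20-line default are hypothetical and not taken from launch_regression_test.py.

```python
def format_mismatch(spec, metric, got, expected, log_text, tail_lines=20):
    """Build a failure message with got-vs-expected values and the run-log tail."""
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    return (
        f"[{spec}] {metric} does not match goldens\n"
        f"  got:      {list(got)}\n"
        f"  expected: {list(expected)}\n"
        f"--- last {tail_lines} log lines ---\n"
        f"{tail}"
    )
```

A test would then pass the result as the assertion message, e.g. `assert got == expected, format_mismatch(...)`, so the detail shows up in the CI report even without `-s`.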
Collaborator
nit: shall we take some GB200 for CI?
…r) and split smoke CI
- tests/nano_inference_smoke_test.py: one inference call over three modalities
(t2vs text2video+sound, action policy, action forward_dynamics); validates
each sample's vision.mp4 (PyAV decode), the t2vs audio (not-noise), and the
policy action array.
- tests/nano_training_smoke_test.py: convert -> train 5 -> export -> t2i-from-
export pipeline with per-step checks: DCP + exported-model completeness
(file/shard + index counts + tensor-manifest self-consistency, no tensor
load), loss-degrades (min(loss)<first), and a valid output image.
- tests/{vision_sft_nano_5iter.toml,launch_sft_vision_nano_5iter.sh}: 5-step
smoke recipe (replaces the 1-iter fixtures).
- CI: split gpu-smoke-tests.yml into gpu-smoke-inference.yml and
gpu-smoke-training.yml (training timeout 90 min). Each if: always() cleanup
clears run output (pytest tmp; + examples/checkpoints for training), keeping
examples/data + the HF cache.
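The "no tensor load" completeness check on the exported model can be sketched as follows, assuming the export uses the Hugging Face sharded safetensors layout (an index JSON with a `weight_map`, plus shard files whose header lists their tensors). That layout, the index file name, and the function names are assumptions here; the actual test may check different files.

```python
import json
import struct
from pathlib import Path


def read_safetensors_names(path):
    """Read tensor names from a safetensors header without loading tensor data."""
    with Path(path).open("rb") as fh:
        # The first 8 bytes hold the little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", fh.read(8))
        header = json.loads(fh.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return set(header)


def check_export(export_dir, index_name="model.safetensors.index.json"):
    """Return a list of problems; an empty list means index and shards agree."""
    export_dir = Path(export_dir)
    weight_map = json.loads((export_dir / index_name).read_text())["weight_map"]
    by_shard = {}
    for tensor, shard in weight_map.items():
        by_shard.setdefault(shard, set()).add(tensor)

    problems = []
    for shard, expected in sorted(by_shard.items()):
        shard_path = export_dir / shard
        if not shard_path.is_file():
            problems.append(f"missing shard {shard}")
            continue
        actual = read_safetensors_names(shard_path)
        if actual != expected:
            problems.append(
                f"{shard}: index lists {len(expected)} tensors, header has {len(actual)}"
            )
    return problems
```

Only the JSON headers are read, so the check stays cheap even for multi-GB shards.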
Verified on 8xH100: inference smoke and training pipeline both pass.
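For clarity, the loss check in the training pipeline (min(loss) < first) amounts to something like the sketch below. The function name is illustrative, and the finiteness check is an added assumption carried over from the earlier 1-iter smoke.

```python
import math


def loss_decreased(losses):
    """True if all losses are finite and some step is lower than the first."""
    if not losses or not all(math.isfinite(x) for x in losses):
        return False
    return min(losses) < losses[0]
```

Comparing against the minimum rather than the last step tolerates a noisy final iteration in a 5-step run.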
ad450b6 to
b31ff92