ci: parallelize unit and integration tests across 4 GPU instances #1185

Merged: garrett4wade merged 4 commits into main from mzy/parallelize-ci on Apr 16, 2026

Conversation

nuzant (Collaborator) commented Apr 15, 2026

Description

Split the CI matrix from 2 jobs (sglang, vllm) to 4 by adding a test_type dimension (unit, integration) to the existing variant matrix. Each variant now provisions two separate GCP runners that execute in parallel, reducing overall wall-clock time for the CI pipeline.

Also fixes unit test failures introduced by previous PRs; the individual fixes are described in the commit messages.

Related Issue

N/A

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

The 2x2 matrix produces these 4 parallel jobs:

  1. sglang unit -- unit tests only
  2. sglang integration -- SFT + GRPO integration tests
  3. vllm unit -- unit tests only
  4. vllm integration -- SFT + GRPO integration tests
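
As a rough illustration of how the two matrix dimensions expand into the four jobs listed above, here is a minimal Python sketch. It is not the workflow itself (that lives in `.github/workflows/test-areal.yml`); the `runner_label` naming scheme is a hypothetical example of including `test_type` in a label, and the real labels may differ.

```python
from itertools import product

# Matrix dimensions named in the PR description.
VARIANTS = ["sglang", "vllm"]
TEST_TYPES = ["unit", "integration"]


def expand_matrix(variants, test_types):
    """Return one job dict per (variant, test_type) combination."""
    return [
        {"variant": v, "test_type": t}
        for v, t in product(variants, test_types)
    ]


def steps_for(job):
    """Pick the test steps a job runs, based on its test_type."""
    if job["test_type"] == "unit":
        return ["unit tests"]
    return ["SFT integration test", "GRPO integration test"]


def runner_label(job):
    # Hypothetical naming scheme; the real workflow's labels may differ.
    return f"areal-{job['variant']}-{job['test_type']}"


jobs = expand_matrix(VARIANTS, TEST_TYPES)
for job in jobs:
    print(runner_label(job), "->", ", ".join(steps_for(job)))
```

Because every label contains both dimensions, no two jobs share a runner name, which is the isolation property the PR relies on.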

Files changed:

  • .github/workflows/test-areal.yml

@nuzant nuzant requested a review from garrett4wade as a code owner April 15, 2026 06:51
@gemini-code-assist (Contributor)

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@nuzant nuzant added the safe-to-test Ready to run unit-tests in a PR. label Apr 15, 2026
@nuzant nuzant requested a review from rchardx as a code owner April 15, 2026 07:58
@nuzant nuzant added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 15, 2026
@nuzant nuzant temporarily deployed to AReaL-unittests April 15, 2026 08:29 — with GitHub Actions Inactive
@garrett4wade garrett4wade added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 16, 2026
nuzant added 2 commits April 16, 2026 11:10
Split the CI matrix from 2 jobs (sglang, vllm) to 4 by adding a
test_type dimension (unit, integration). Each variant now provisions
two separate GCP runners that execute in parallel, reducing overall
wall-clock time.

Key changes:
- Add test_type matrix dimension to provision-runner, unit-tests, cleanup jobs
- Add conditional steps: unit tests run only on test_type=unit, SFT+GRPO on integration
- Update runner labels and instance names to include test_type for isolation

The tests sent application/octet-stream but the body was JSON-serialized
data. FastAPI passed raw bytes instead of parsing JSON, causing
'bytes' object has no attribute 'nbytes' in rtensor._store_local.

Key changes:
- Change Content-Type from application/octet-stream to application/json
- Clear _storage_stats in test fixture to prevent cross-test leaks
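
The Content-Type mismatch described in this commit can be sketched in plain Python. The `decode_body` function below is a hypothetical stand-in for how a web framework might hand a request body to a handler (parsed for JSON, raw bytes otherwise); it is not FastAPI's implementation, and the `shards` field is invented for illustration.

```python
import json


def decode_body(content_type: str, raw: bytes):
    """Hypothetical stand-in for a framework's request body handling.

    JSON bodies are parsed; anything else is passed through as raw bytes.
    """
    if content_type == "application/json":
        return json.loads(raw)
    return raw


# A JSON-serialized body, as the tests were sending.
body = json.dumps({"shards": [1, 2, 3]}).encode("utf-8")

as_bytes = decode_body("application/octet-stream", body)
as_json = decode_body("application/json", body)

print(type(as_bytes).__name__)
print(type(as_json).__name__)

# A bytes object has no `nbytes` attribute, so code that expects a
# tensor-like value fails when it receives the unparsed body.
print(hasattr(as_bytes, "nbytes"))
```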
@nuzant nuzant force-pushed the mzy/parallelize-ci branch from 9159038 to 47b248e Compare April 16, 2026 03:11
… model tests

Add missing @torch.no_grad() to test_dcp_save_load_weights to prevent
in-place param.zero_() error on grad-requiring leaf tensors, matching
the existing decorator on test_hf_save_load_weights.

Mock dist.get_rank in test_create_device_model_applies_use_kernels to
handle the new rank check added by f34bea8 (memory_efficient_load meta
device optimization) when distributed is not initialized.
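
The mocking approach in this commit can be sketched with `unittest.mock` from the standard library. Everything below is a hypothetical stand-in: `dist` is a plain namespace rather than the real distributed module, and `create_model` only imitates a rank check of the kind the commit message mentions.

```python
from types import SimpleNamespace
from unittest import mock


def _get_rank_uninitialized():
    # Stand-in for querying the rank before a process group exists.
    raise RuntimeError("process group is not initialized")


# Hypothetical stand-in for a distributed module; not the real library.
dist = SimpleNamespace(get_rank=_get_rank_uninitialized)


def create_model():
    """Hypothetical code under test that branches on the rank."""
    return "load weights" if dist.get_rank() == 0 else "meta device"


def run_with_mocked_rank():
    # Patch get_rank only for the duration of the call, as a test would.
    with mock.patch.object(dist, "get_rank", return_value=0):
        return create_model()


result = run_with_mocked_rank()
print(result)
```

Once the `with` block exits, the original `get_rank` is restored, so the patch does not leak into other tests.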
@nuzant nuzant added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 16, 2026
Validate that raw_payload is a dict before unpacking into
BatchShardRequest. Non-dict JSON bodies (e.g. arrays) raised
TypeError instead of ValidationError, bypassing the 400
handler and falling through to the 500 catch-all.
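
A minimal sketch of the failure mode and the fix, using only the standard library. `ValidationError`, the `shard_ids` field, and the status-code handlers are hypothetical stand-ins; the real `BatchShardRequest` model and error handlers are not shown in this PR page.

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Hypothetical stand-in for the error type the 400 handler catches."""


@dataclass
class BatchShardRequest:
    # Hypothetical field; the real model's fields are not shown here.
    shard_ids: list


def handle(raw_payload):
    """Return an HTTP-like status code for a decoded JSON body."""
    try:
        # The fix: reject non-dict bodies before unpacking with **.
        if not isinstance(raw_payload, dict):
            raise ValidationError("request body must be a JSON object")
        BatchShardRequest(**raw_payload)
        return 200
    except ValidationError:
        return 400
    except Exception:
        return 500


def handle_unchecked(raw_payload):
    """Same flow without the dict check, for comparison."""
    try:
        # Unpacking a list with ** raises TypeError, not ValidationError.
        BatchShardRequest(**raw_payload)
        return 200
    except ValidationError:
        return 400
    except Exception:
        return 500


print(handle({"shard_ids": [0, 1]}))
print(handle([0, 1]))
print(handle_unchecked([0, 1]))
```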
@nuzant nuzant added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 16, 2026
@nuzant nuzant temporarily deployed to AReaL-unittests April 16, 2026 04:47 — with GitHub Actions Inactive
@nuzant nuzant deployed to AReaL-unittests April 16, 2026 04:47 — with GitHub Actions Active
@garrett4wade garrett4wade merged commit 4eb423c into main Apr 16, 2026
19 checks passed
@garrett4wade garrett4wade deleted the mzy/parallelize-ci branch April 16, 2026 07:13

Labels

safe-to-test Ready to run unit-tests in a PR.

2 participants