ci: parallelize unit and integration tests across 4 GPU instances by nuzant · Pull Request #1185 · inclusionAI/AReaL

nuzant · 2026-04-15T06:51:06Z

Description

Split the CI matrix from 2 jobs (sglang, vllm) to 4 by adding a test_type dimension (unit, integration) to the existing variant matrix. Each variant now provisions two separate GCP runners that execute in parallel, reducing overall wall-clock time for the CI pipeline.

Fixed unit test bugs introduced by previous PRs.

Related Issue

N/A

Type of Change

Checklist

I have read the Contributing Guide
Pre-commit hooks pass (pre-commit run --all-files)
Relevant tests pass; new tests added for new functionality
Documentation updated (if applicable; built with ./docs/build_all.sh)
Branch is up to date with main
Self-reviewed via /review-pr command
This PR was created by a coding agent via /create-pr
This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

The 2x2 matrix produces these 4 parallel jobs:

sglang unit -- unit tests only
sglang integration -- SFT + GRPO integration tests
vllm unit -- unit tests only
vllm integration -- SFT + GRPO integration tests
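
As a rough illustration of how the two matrix dimensions expand into these four jobs, here is a minimal Python sketch. The key names `variant` and `test_type` follow the description above. The `runner_label` naming scheme and the step names are hypothetical and are not taken from `.github/workflows/test-areal.yml`.

```python
from itertools import product

VARIANTS = ["sglang", "vllm"]
TEST_TYPES = ["unit", "integration"]


def expand_matrix(variants, test_types):
    """Return one job dict per (variant, test_type) combination."""
    return [{"variant": v, "test_type": t} for v, t in product(variants, test_types)]


def runner_label(job):
    """Hypothetical label scheme: including test_type keeps runners distinct."""
    return f"areal-{job['variant']}-{job['test_type']}"


def steps_for(job):
    """Mirror the conditional steps: unit tests on 'unit', SFT + GRPO on 'integration'."""
    if job["test_type"] == "unit":
        return ["unit tests"]
    return ["SFT integration test", "GRPO integration test"]


jobs = expand_matrix(VARIANTS, TEST_TYPES)
for job in jobs:
    print(runner_label(job), "->", steps_for(job))
```

Because each label includes both dimensions, no two jobs share a runner, which is what allows all four to run at the same time.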

Files changed:

.github/workflows/test-areal.yml


gemini-code-assist · 2026-04-15T06:51:12Z

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

Split the CI matrix from 2 jobs (sglang, vllm) to 4 by adding a test_type dimension (unit, integration). Each variant now provisions two separate GCP runners that execute in parallel, reducing overall wall-clock time.

Key changes:
- Add test_type matrix dimension to provision-runner, unit-tests, cleanup jobs
- Add conditional steps: unit tests run only on test_type=unit, SFT+GRPO on integration
- Update runner labels and instance names to include test_type for isolation

The tests sent application/octet-stream but the body was JSON-serialized data. FastAPI passed raw bytes instead of parsing JSON, causing 'bytes' object has no attribute 'nbytes' in rtensor._store_local.

Key changes:
- Change Content-Type from application/octet-stream to application/json
- Clear _storage_stats in test fixture to prevent cross-test leaks
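
The failure mode described in this commit can be reproduced in isolation with a toy model. The `decode_body` function below is a hypothetical stand-in for content-type dispatch and is not FastAPI's actual implementation; the payload fields are also invented for illustration.

```python
import json


def decode_body(content_type: str, body: bytes):
    """Toy stand-in: parse JSON only when the header says the body is JSON."""
    if content_type == "application/json":
        return json.loads(body)
    # Any other content type is treated as opaque and passed through as raw bytes.
    return body


# Hypothetical JSON-serialized payload, as the tests sent it.
payload = json.dumps({"shape": [2, 3], "dtype": "float32"}).encode()

as_bytes = decode_body("application/octet-stream", payload)
as_json = decode_body("application/json", payload)

# Raw bytes have no `nbytes` attribute, so code expecting a tensor-like object fails.
print(type(as_bytes).__name__, hasattr(as_bytes, "nbytes"))
print(type(as_json).__name__)
```

With the mismatched header the handler receives a `bytes` object, and any later access to `.nbytes` raises `AttributeError`. With `application/json` it receives the parsed dictionary.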

… model tests

Add missing @torch.no_grad() to test_dcp_save_load_weights to prevent in-place param.zero_() error on grad-requiring leaf tensors, matching the existing decorator on test_hf_save_load_weights.

Mock dist.get_rank in test_create_device_model_applies_use_kernels to handle the new rank check added by f34bea8 (memory_efficient_load meta device optimization) when distributed is not initialized.

Validate that raw_payload is a dict before unpacking into BatchShardRequest. Non-dict JSON bodies (e.g. arrays) raised TypeError instead of ValidationError, bypassing the 400 handler and falling through to the 500 catch-all.
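
The following sketch shows why the type check matters, using only the standard library. `BatchShardRequest` here is a hypothetical dataclass with an invented `shard_ids` field, and `ValidationError` is a local stand-in for the real validation exception. Only the names `raw_payload` and `BatchShardRequest` come from the commit message.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Local stand-in for the validation error the 400 handler catches."""


@dataclass
class BatchShardRequest:
    shard_ids: list  # hypothetical field, for illustration only


def handle(raw_payload, validate_type=True):
    """Return the HTTP status a simplified handler would produce."""
    try:
        if validate_type and not isinstance(raw_payload, dict):
            raise ValidationError(
                f"expected a JSON object, got {type(raw_payload).__name__}"
            )
        # Unpacking a non-mapping with ** raises TypeError, not ValidationError.
        BatchShardRequest(**raw_payload)
        return 200
    except ValidationError:
        return 400
    except Exception:
        return 500  # catch-all


print(handle([1, 2], validate_type=False))  # array body, no guard
print(handle([1, 2]))                       # array body, guarded
print(handle({"shard_ids": [0, 1]}))        # valid object body
```

Without the guard, an array body reaches the `**` unpacking and the resulting `TypeError` falls through to the catch-all. With the guard, the same body is rejected as a client error.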

nuzant requested a review from garrett4wade as a code owner April 15, 2026 06:51

nuzant added the safe-to-test Ready to run unit-tests in a PR. label Apr 15, 2026

nuzant had a problem deploying to AReaL-unittests April 15, 2026 07:03 — with GitHub Actions Failure

nuzant had a problem deploying to AReaL-unittests April 15, 2026 07:03 — with GitHub Actions Error

nuzant requested a review from rchardx as a code owner April 15, 2026 07:58

nuzant added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 15, 2026

nuzant temporarily deployed to AReaL-unittests April 15, 2026 08:29 — with GitHub Actions Inactive

garrett4wade added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 16, 2026

garrett4wade had a problem deploying to AReaL-unittests April 16, 2026 02:55 — with GitHub Actions Error

nuzant added 2 commits April 16, 2026 11:10

nuzant force-pushed the mzy/parallelize-ci branch from 9159038 to 47b248e Compare April 16, 2026 03:11

nuzant added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 16, 2026

nuzant had a problem deploying to AReaL-unittests April 16, 2026 03:34 — with GitHub Actions Error

nuzant had a problem deploying to AReaL-unittests April 16, 2026 03:34 — with GitHub Actions Failure

nuzant had a problem deploying to AReaL-unittests April 16, 2026 03:34 — with GitHub Actions Error

nuzant had a problem deploying to AReaL-unittests April 16, 2026 03:34 — with GitHub Actions Failure

nuzant added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Apr 16, 2026

nuzant temporarily deployed to AReaL-unittests April 16, 2026 04:47 — with GitHub Actions Inactive

nuzant deployed to AReaL-unittests April 16, 2026 04:47 — with GitHub Actions Active

garrett4wade approved these changes Apr 16, 2026

garrett4wade merged commit 4eb423c into main Apr 16, 2026
19 checks passed

garrett4wade deleted the mzy/parallelize-ci branch April 16, 2026 07:13

garrett4wade merged 4 commits into main from mzy/parallelize-ci
