
[pull] main from inclusionAI:main #38

Merged
pull[bot] merged 4 commits into axistore80-coder:main from inclusionAI:main
Apr 16, 2026

Conversation


@pull pull bot commented Apr 16, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

asif07hossain and others added 4 commits April 16, 2026 10:42
… (#1181)

* fix: update weights on disk for fsdp, megatron, and archon engine

---------

Co-authored-by: Mohammad Asiful Hossain (A) <m00802411@china.huawei.com>
…blueprint (#1179)

* refactor: mount data blueprint via ASGI and implement engine Pydantic models

* fix: correct typo in set_env attribute name

* test: add RPC engine validation tests and apply NonEmptyStr constraints
…load (#1182)

When memory_efficient_load=True, all ranks previously created the full
model on CPU via from_config(), causing CPU OOM on nodes with limited
RAM (e.g. 8 workers × 64GB = 512GB on a 256GB node).

Now only rank-0 loads on CPU; other ranks use meta device (zero memory
cost). Weights are broadcast from rank-0 after FSDP sharding via
fsdp2_load_full_state_dict with broadcast_from_rank0=True.

Also defer AutoProcessor import in saver.py and recover.py to avoid
importing torchvision eagerly (which can crash when torchvision version
mismatches torch).

Key changes:
- fsdp_engine.py: non-rank-0 uses "meta" device in memory_efficient_load
- fsdp_utils/__init__.py: use to_empty() for meta→device conversion
- saver.py, recover.py: move AutoProcessor to TYPE_CHECKING block
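The rank-0 / meta-device split can be sketched as follows; this is a minimal single-process illustration assuming plain PyTorch, whereas the real fsdp_engine.py code gets the rank from torch.distributed and broadcasts weights after sharding:

```python
import torch
import torch.nn as nn

def create_model(rank: int) -> nn.Module:
    if rank == 0:
        # Rank 0 materializes real weights on CPU via the normal constructor.
        return nn.Linear(4, 4)
    # Other ranks build the module on the meta device: shapes and dtypes
    # only, no storage, so the per-rank CPU memory cost is essentially zero.
    with torch.device("meta"):
        model = nn.Linear(4, 4)
    # to_empty() swaps the meta tensors for real, uninitialized storage on
    # the target device, ready to receive the weights broadcast from rank 0.
    return model.to_empty(device="cpu")

m0 = create_model(0)
m1 = create_model(1)
print(m0.weight.is_meta, m1.weight.is_meta)  # False False
```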

* ci: parallelize unit and integration tests across 4 GPU instances

Split the CI matrix from 2 jobs (sglang, vllm) to 4 by adding a
test_type dimension (unit, integration). Each variant now provisions
two separate GCP runners that execute in parallel, reducing overall
wall-clock time.

Key changes:
- Add test_type matrix dimension to provision-runner, unit-tests, cleanup jobs
- Add conditional steps: unit tests run only on test_type=unit, SFT+GRPO on integration
- Update runner labels and instance names to include test_type for isolation
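The resulting 2×2 matrix could look roughly like this hypothetical workflow fragment; the actual job, label, and step names in the repository's workflow may differ:

```yaml
jobs:
  tests:
    strategy:
      matrix:
        backend: [sglang, vllm]
        test_type: [unit, integration]
    # Runner label includes test_type so the four instances stay isolated.
    runs-on: gpu-runner-${{ matrix.backend }}-${{ matrix.test_type }}
    steps:
      - name: Run unit tests
        if: matrix.test_type == 'unit'
        run: make unit
      - name: Run SFT + GRPO integration tests
        if: matrix.test_type == 'integration'
        run: make integration
```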

* fix(test): use correct Content-Type in data proxy rtensor tests

The tests sent application/octet-stream but the body was JSON-serialized
data. FastAPI passed raw bytes instead of parsing JSON, causing
'bytes' object has no attribute 'nbytes' in rtensor._store_local.

Key changes:
- Change Content-Type from application/octet-stream to application/json
- Clear _storage_stats in test fixture to prevent cross-test leaks

* fix(test): fix test_train_engine failures in dcp save/load and device model tests

Add missing @torch.no_grad() to test_dcp_save_load_weights to prevent
in-place param.zero_() error on grad-requiring leaf tensors, matching
the existing decorator on test_hf_save_load_weights.

Mock dist.get_rank in test_create_device_model_applies_use_kernels to
handle the new rank check added by f34bea8 (memory_efficient_load meta
device optimization) when distributed is not initialized.
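The no_grad fix rests on a standard autograd rule, sketched here with a toy module rather than the actual engine model:

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2)  # parameters require grad by default

# In-place mutation of a grad-requiring leaf tensor is an autograd error,
# which is what the undecorated test tripped over.
try:
    model.weight.zero_()
except RuntimeError:
    print("in-place zero_ blocked by autograd")

# Wrapping the mutation in no_grad (here as a decorator, matching the
# pattern on test_hf_save_load_weights) makes it legal.
@torch.no_grad()
def zero_params(m: nn.Module) -> None:
    for p in m.parameters():
        p.zero_()

zero_params(model)
print(model.weight.abs().sum().item())  # 0.0
```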

* fix(infra): return 400 for non-dict JSON in /data/batch endpoint

Validate that raw_payload is a dict before unpacking into
BatchShardRequest. Non-dict JSON bodies (e.g. arrays) raised
TypeError instead of ValidationError, bypassing the 400
handler and falling through to the 500 catch-all.
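The failure mode can be sketched without the web framework; BatchShardRequest below is a simplified stand-in with hypothetical fields, not the real model:

```python
from dataclasses import dataclass

@dataclass
class BatchShardRequest:
    batch_id: str
    shard: int

def parse_batch(raw_payload):
    # A JSON array would reach the **-unpacking below and raise TypeError,
    # bypassing the 400 handler; reject non-dicts explicitly instead so
    # the caller can translate the error into an HTTP 400 response.
    if not isinstance(raw_payload, dict):
        raise ValueError("request body must be a JSON object")
    return BatchShardRequest(**raw_payload)

print(parse_batch({"batch_id": "b1", "shard": 0}))
try:
    parse_batch([1, 2, 3])  # non-dict JSON body
except ValueError as exc:
    print("400:", exc)
```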
@pull pull bot locked and limited conversation to collaborators Apr 16, 2026
@pull pull bot added the ⤵️ pull label Apr 16, 2026
@pull pull bot merged commit 4eb423c into axistore80-coder:main Apr 16, 2026


4 participants