[pull] main from inclusionAI:main#32

Merged
pull[bot] merged 1 commit into axistore80-coder:main from inclusionAI:main
Apr 10, 2026
@pull pull bot commented Apr 10, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

…#1157)

* feat(infra): allow colocation with offloading and disk weight updates

Enable scheduler-level colocation to run actor/critic and engine processes
on shared GPU allocations while preserving async update correctness.

Key changes:
- Add colocate/offload/disk-update config fields and scheduler plumbing
- Harden RPC server and engine blueprint coordination for colocated roles
- Extend trainer paths and tests for colocated evaluation dispatch behavior

* fix(infra): restore default train-engine RPC broadcast

Keep initialized TrainEngine RPC calls backward compatible across the guard and Ray servers so non-head model-parallel ranks continue to receive controller payloads without every call site opting in manually.
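As a rough illustration of the backward-compatible default described above, here is a minimal Python sketch. The names `dispatch_rpc`, `send`, and the `broadcast` flag are hypothetical and do not come from this repository. The only assumption taken from the commit message is that delivery to all model-parallel ranks is the default, so call sites do not have to opt in.

```python
from typing import Any, Callable, List


def dispatch_rpc(
    ranks: List[int],
    payload: Any,
    send: Callable[[int, Any], None],
    broadcast: bool = True,
) -> List[int]:
    """Deliver a controller payload to model-parallel ranks (hypothetical helper).

    ranks[0] is treated as the head rank. With the default broadcast=True,
    every rank receives the payload; broadcast=False restricts delivery
    to the head rank only.
    """
    if not ranks:
        return []
    targets = list(ranks) if broadcast else [ranks[0]]
    for rank in targets:
        send(rank, payload)
    return targets
```

Keeping `broadcast=True` as the default means existing call sites keep reaching non-head ranks unchanged, while a caller that wants head-only delivery has to ask for it explicitly.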

* fix: enforce offload prerequisites for colocated training

Fail fast when colocated or explicit train-engine offload would run without TMS support, and provision Ray workers with the same offload environment as the local and Slurm schedulers.
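The fail-fast behavior described above could look roughly like the following Python sketch. The class `ColocationConfig`, its field names, and the `tms_available` argument are hypothetical stand-ins; the actual config fields and the way TMS support is detected are not shown in this commit message.

```python
from dataclasses import dataclass


@dataclass
class ColocationConfig:
    # Hypothetical field names, chosen for illustration only.
    colocate: bool = False
    offload_train_engine: bool = False


def validate_offload_prerequisites(cfg: ColocationConfig, tms_available: bool) -> None:
    """Raise before any worker starts if offload is required but TMS is missing."""
    needs_offload = cfg.colocate or cfg.offload_train_engine
    if needs_offload and not tms_available:
        raise RuntimeError(
            "Colocated training or explicit train-engine offload requires TMS "
            "support, which is not available in this environment."
        )
```

Running this check once at startup, before any scheduler provisions workers, surfaces a misconfiguration immediately instead of partway through a training run.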

---------

Co-authored-by: Wentai Zhang <zhangwentai.zwt@antgroup.com>
@pull pull bot locked and limited conversation to collaborators Apr 10, 2026
@pull pull bot added the ⤵️ pull label Apr 10, 2026
@pull pull bot merged commit c4f22f2 into axistore80-coder:main Apr 10, 2026