[pull] main from inclusionAI:main#20

Merged
pull[bot] merged 8 commits into axistore80-coder:main from inclusionAI:main
Mar 30, 2026
@pull pull bot commented Mar 30, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

TaoZex and others added 5 commits March 30, 2026 13:12
* feat: support model training in IPv6-only environment

---------

Co-authored-by: bingyechen <bingyechen@bytedance.com>
Co-authored-by: root <root@dc05-p13-t0-n028.byted.org>
Co-authored-by: truongnp5 <v.truongnp5@vinsmartfuture.tech>
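The commit message above does not say how IPv6-only support was implemented. As an illustration of one issue that commonly surfaces in such environments, here is a minimal sketch of building a `host:port` endpoint string that brackets IPv6 literals. The helper name `format_endpoint` is hypothetical and not taken from the repository.

```python
import ipaddress


def format_endpoint(host: str, port: int) -> str:
    """Return 'host:port', wrapping IPv6 literals in brackets.

    Hypothetical helper for illustration only. Without brackets, an
    address such as ::1 is ambiguous once ':port' is appended.
    """
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal (for example a DNS name): leave it unchanged.
        is_v6 = False
    return f"[{host}]:{port}" if is_v6 else f"{host}:{port}"
```

For example, `format_endpoint("::1", 29500)` yields `[::1]:29500`, while IPv4 literals and hostnames pass through unchanged.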
* feat: megatron-bridge-adaptation and dependency conflicts resolution

- tested megatron-bridge integration with TP, PP > 1, keeping backward compatibility with mbridge
- darwin on x86_64 needs special handling because torch > 2.9.1 stops supporting it
- some package conflicts caused by megatron-bridge are overridden to previous versions

* chore: added docs for the megatron-bridge feature

* fix: handle the case where load/save in megatron-bridge does not support critic

---------

Co-authored-by: Wei Fu <36355462+garrett4wade@users.noreply.github.com>
@pull pull bot added the ⤵️ pull label Mar 30, 2026
rchardx and others added 3 commits March 30, 2026 16:01
* fix(archon): add missing POST /data/batch endpoint to data proxy

PR #1077 added batch RTensor fetching via POST /data/batch but only
implemented the endpoint on the Flask RPC server (rpc_server.py),
missing the FastAPI data proxy. This caused RTensor.localize() to
fail with HTTP 405 in integration tests that use the data proxy.

Refs: #1077

* fix(archon): harden data proxy batch endpoint with Flask-parity error handling

Align POST /data/batch error responses, JSON parsing, and exception
handling with the Flask rpc_server.py counterpart to ensure identical
behavior across both servers.

Key changes:
- Replace HTTPException with JSONResponse for Flask-compatible error bodies
- Add outer try/except with traceback logging matching Flask pattern
- Normalise falsy/non-dict JSON payloads via or {} + isinstance guard
- Add 12 unit tests for all RTensor data proxy endpoints (no GPU)

Refs: #1077, #1105
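The `or {}` + `isinstance` guard mentioned in the key changes above can be sketched roughly as follows. This is not the repository's code: the function names, the `"ids"` field, the status codes, and the choice to treat non-dict bodies as empty are all assumptions made for illustration. It is written framework-free so the same logic could back either server.

```python
from typing import Any


def normalize_payload(raw: Any) -> dict:
    """Coerce a decoded JSON body into a dict."""
    payload = raw or {}  # None, "", [], 0 all become {}
    if not isinstance(payload, dict):
        payload = {}  # assumption: non-dict bodies are treated as empty
    return payload


def handle_batch(raw: Any, store: dict) -> tuple[int, dict]:
    """Return (status_code, json_body) for a hypothetical batch fetch."""
    payload = normalize_payload(raw)
    ids = payload.get("ids")  # "ids" is a hypothetical field name
    if (
        not isinstance(ids, list)
        or not ids
        or not all(isinstance(i, str) for i in ids)
    ):
        return 400, {"error": "expected a non-empty list of string ids"}
    missing = [i for i in ids if i not in store]
    if missing:
        return 404, {"error": f"unknown ids: {missing}"}
    return 200, {"data": [store[i] for i in ids]}
```

Returning a plain `(status, body)` pair keeps error bodies identical regardless of which web framework wraps the handler, which is the parity goal the commit describes.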
…rOptimWrapper (#1108)

Replace hardcoded torch.cuda.Stream, torch.cuda.Event, torch.cuda.stream(),
torch.cuda.current_stream(), and torch.cuda.empty_cache() with
current_platform equivalents to support non-CUDA accelerators.

Resolves two TODO comments about platform abstraction.
- Split weight update into async bucket start + explicit wait
- Add _PendingWeightUpdateBucket dataclass for async tracking
- Overlap bucket N-1 broadcast with bucket N all-gather
- Keep training ranks aligned before entering next collective
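The overlap pattern in the bullets above (start a bucket asynchronously, wait on it explicitly, and gather bucket N while bucket N-1 is still being broadcast) can be sketched in plain Python. This uses a thread pool in place of real collectives; `PendingBucket`, `update_weights`, and the `all_gather`/`broadcast` callables are hypothetical stand-ins, not the repository's `_PendingWeightUpdateBucket` or its actual communication calls.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PendingBucket:
    """Hypothetical handle for a bucket whose broadcast is in flight."""
    index: int
    future: Future

    def wait(self) -> None:
        # Blocks until the broadcast finishes; re-raises its exception if any.
        self.future.result()


def update_weights(
    buckets: Sequence,
    all_gather: Callable,
    broadcast: Callable,
) -> int:
    """Broadcast all buckets, overlapping gather N with broadcast N-1."""
    pending: Optional[PendingBucket] = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for index, bucket in enumerate(buckets):
            # Runs while the previous bucket's broadcast is still in flight.
            gathered = all_gather(bucket)
            if pending is not None:
                pending.wait()  # explicit wait before starting the next one
            pending = PendingBucket(index, pool.submit(broadcast, gathered))
        if pending is not None:
            pending.wait()  # drain the final bucket
    return len(buckets)
```

Because each broadcast is waited on before the next one is submitted, at most one bucket is in flight at a time and buckets complete in order, while the gather for the following bucket still proceeds concurrently.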
@pull pull bot merged commit 2ddd959 into axistore80-coder:main Mar 30, 2026

7 participants