[pull] main from inclusionAI:main #35

Merged
pull[bot] merged 2 commits into axistore80-coder:main from inclusionAI:main
Apr 14, 2026

Conversation

pull[bot] commented Apr 14, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

garrett4wade and others added 2 commits April 14, 2026 11:21
* chore: add project governance for PyTorch ecosystem

Add governance documentation required for PyTorch Ecosystem
application per LF Minimum Viable Governance framework.

Key changes:
- Add GOVERNANCE.md with BDFL model, maintainer table, and decision-making process
- Add CODE_OF_CONDUCT.md (Contributor Covenant v3.0)
- Add .github/CODEOWNERS for automated PR review assignment
- Update CONTRIBUTING.md with cross-links to governance docs

* chore: explain community moderators and improve decision-making process

* chore: improve clarity
…1169)

* feat(infra): add microservice-based training service (controller v2)

Add GatewayTrainController that decomposes training into five HTTP
microservices: guard (process manager), worker (engine container),
data proxy (batch dispatcher), router (service registry), and gateway
(API ingress). This enables training orchestration without requiring
the scheduler's RPC infrastructure.

Key changes:
- Add GatewayTrainController with 7-step async initialization
- Add guard /set_env endpoint for NCCL env propagation
- Add worker, router, gateway, data proxy FastAPI/Flask services
- Wire TrainEngineConfig.log_level to suppress HTTP access logs
- Add create_process_group stubs to existing engines
- Add comprehensive unit and integration tests
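
To make the division of roles more concrete, here is a minimal, dependency-free sketch of two of them: a router acting as a service registry, and the merge a guard-style `/set_env` handler might perform. It is not taken from this PR. `ServiceRegistry`, `apply_set_env`, the round-robin policy, and the `NCCL_` prefix filter are all hypothetical names and assumptions chosen for illustration; the actual services are described as FastAPI/Flask applications and may behave differently.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """Toy stand-in for the 'router' role: maps a role name to addresses.

    Hypothetical illustration only, not the PR's implementation.
    """

    workers: dict[str, list[str]] = field(default_factory=dict)
    cursor: dict[str, int] = field(default_factory=dict)

    def register(self, role: str, address: str) -> None:
        # Registering the same address twice is a no-op.
        addrs = self.workers.setdefault(role, [])
        if address not in addrs:
            addrs.append(address)

    def resolve(self, role: str) -> str:
        # Assumed policy: rotate through registered addresses round-robin.
        addrs = self.workers.get(role)
        if not addrs:
            raise LookupError(f"no service registered for role {role!r}")
        i = self.cursor.get(role, 0) % len(addrs)
        self.cursor[role] = (i + 1) % len(addrs)
        return addrs[i]


def apply_set_env(current: dict[str, str], payload: dict[str, str]) -> dict[str, str]:
    """Merge only NCCL_* variables from a request payload (assumed filter)."""
    allowed = {k: v for k, v in payload.items() if k.startswith("NCCL_")}
    return {**current, **allowed}


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("worker", "http://127.0.0.1:8001")
    registry.register("worker", "http://127.0.0.1:8002")
    print(registry.resolve("worker"))
    print(apply_set_env({"PATH": "/usr/bin"}, {"NCCL_DEBUG": "INFO", "HOME": "/root"}))
```

In a real deployment the registry would sit behind HTTP endpoints and the environment merge would run inside the guard process before it spawns workers; both are kept as plain functions here so the sketch runs without any web framework.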

* fix(infra): stabilize controller v2 training service

Consolidate the controller v2 training-service follow-up work into one atomic infra commit. This keeps startup, routing, dispatch, recovery, and health-handling changes together as the post-9c70 stabilization series while restoring a clean branch history.

* chore: add SPDX headers to training service modules

---------

Co-authored-by: Wentai Zhang <zhangwentai.zwt@antgroup.com>
pull[bot] locked and limited conversation to collaborators Apr 14, 2026
pull[bot] added the ⤵️ pull label Apr 14, 2026
pull[bot] merged commit d130b99 into axistore80-coder:main Apr 14, 2026