perf: parallelize Docker build with multi-stage BuildKit #350

Open
valarLip wants to merge 2 commits into main from feat/docker-parallel-build

@valarLip
Collaborator

Summary

Refactor the atom_image Docker build into a multi-stage BuildKit build: a shared base stage, 4 build stages that run in parallel, and a final merge stage.

Problem

The nightly Docker build compiles mori (44 min), RCCL (20 min), Triton (65 min), and aiter (60 min) serially. These four steps alone sum to 189 min, and the full build takes ~225 min, exceeding the 180 min timeout.

Solution

Split into independent stages that BuildKit runs in parallel:

base ──┬── build_mori   (44 min) ──┐
       ├── build_rccl   (20 min) ──┤
       ├── build_triton (65 min) ──┼── atom_image (merge + ATOM)
       └── build_aiter  (60 min) ──┘

Expected: ~225 min → ~70 min (max of parallel stages + merge)

Changes

  • docker/Dockerfile — split atom_image into base → 4 parallel build stages → final merge with selective COPY --from
  • docker-release.yaml — add DOCKER_BUILDKIT=1
  • OOT image unchanged
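
The stage layout described above could look roughly like the following sketch. Only the stage names (base, build_mori, build_rccl, build_triton, build_aiter, atom_image) come from this PR; the `BASE_IMAGE` argument, the `/opt/out/...` paths, and the placeholder `RUN` commands are assumptions for illustration and are not the contents of docker/Dockerfile.

```dockerfile
# syntax=docker/dockerfile:1
# Sketch only: BASE_IMAGE, the /opt/out/... paths, and the RUN commands are
# hypothetical placeholders, not the real build steps.
ARG BASE_IMAGE=ubuntu:22.04

FROM ${BASE_IMAGE} AS base
# The shared toolchain and common dependencies would be installed here.

# Each of the four stages below depends only on `base`, so BuildKit is free
# to schedule them concurrently.
FROM base AS build_mori
RUN mkdir -p /opt/out/mori      # placeholder for the mori build

FROM base AS build_rccl
RUN mkdir -p /opt/out/rccl      # placeholder for the RCCL build

FROM base AS build_triton
RUN mkdir -p /opt/out/triton    # placeholder for the Triton build

FROM base AS build_aiter
RUN mkdir -p /opt/out/aiter     # placeholder for the aiter build

# Final merge stage: copy only the artifacts each builder produced.
FROM base AS atom_image
COPY --from=build_mori   /opt/out/mori   /opt/out/mori
COPY --from=build_rccl   /opt/out/rccl   /opt/out/rccl
COPY --from=build_triton /opt/out/triton /opt/out/triton
COPY --from=build_aiter  /opt/out/aiter  /opt/out/aiter
# The ATOM install step would follow here.
```

With BuildKit enabled, building the final stage (for example `DOCKER_BUILDKIT=1 docker build --target atom_image -f docker/Dockerfile .`, where the `-f` path and build context are assumed) pulls in the four builder stages because atom_image references them through `COPY --from`. BuildKit skips any stage the target does not depend on.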

Verification

  • docker build --target atom_image succeeds
  • pip show mori aiter triton atom all present
  • Build log shows 4 stages running in parallel
  • Total build time < 90 min

Refactor the atom_image build into a shared base stage, 4 build stages
that BuildKit runs in parallel, and a final merge stage:

  base ──┬── build_mori   (44 min) ──┐
         ├── build_rccl   (20 min) ──┤
         ├── build_triton (65 min) ──┼── atom_image (merge + ATOM)
         └── build_aiter  (60 min) ──┘

Before: ~225 min (serial) — exceeds 180 min timeout
After:  ~70 min (parallel) — max(65,60,44,20) + merge

Changes:
- docker/Dockerfile: split atom_image into base → 4 parallel build
  stages → final merge stage with selective COPY --from
- docker-release.yaml: add DOCKER_BUILDKIT=1 to enable parallel stages

The OOT (vLLM plugin) image is unchanged.
Copilot AI review requested due to automatic review settings March 18, 2026 09:24
Contributor

Copilot AI left a comment

Pull request overview

Refactors the atom_image Docker build to use BuildKit-friendly multi-stage builds so independent components (mori/rccl/triton/aiter) can be built in parallel, reducing nightly CI build time and avoiding the 180-minute workflow timeout.

Changes:

  • Split atom_image into a shared base stage plus 4 independent build stages and a final merge stage.
  • In the final stage, selectively COPY --from=... artifacts from builder stages and install ATOM.
  • Enable BuildKit for the nightly docker release workflow build step.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docker/Dockerfile | Introduces parallel multi-stage build and final artifact merge for atom_image. |
| .github/workflows/docker-release.yaml | Enables BuildKit for docker build to allow parallel stage execution. |

COPY is not a shell command — 2>/dev/null and || true are treated as
file paths, causing BuildKit to fail with "not found".

- Aiter: drop egg-link COPY (pip install -e recreates them)
- Triton: replace COPY glob with RUN --mount + cp to avoid glob issues
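
A minimal sketch of the failure and of the `RUN --mount` replacement described in this commit message. The `busybox` base, the `/opt/triton-out` and `/opt/site-packages` paths, and the placeholder file name are assumptions chosen to keep the example small; they do not reflect the real paths in docker/Dockerfile.

```dockerfile
# syntax=docker/dockerfile:1
# Sketch only: the busybox base and every path below are hypothetical.
FROM busybox AS build_triton
RUN mkdir -p /opt/triton-out && touch /opt/triton-out/triton-placeholder

FROM busybox AS atom_image

# Broken form, kept as a comment: COPY is a Dockerfile instruction, not a
# shell command, so the redirect and `|| true` are parsed as extra paths.
# COPY --from=build_triton /opt/triton-out/triton* /opt/site-packages/ 2>/dev/null || true

# Working form: bind-mount the builder stage's filesystem into this RUN step
# and let a real shell handle the glob. The mount exists only for this step.
RUN --mount=type=bind,from=build_triton,source=/opt/triton-out,target=/mnt/triton \
    mkdir -p /opt/site-packages && \
    cp -a /mnt/triton/triton* /opt/site-packages/
```

Because the copy now runs in a shell, shell constructs such as `|| true` or `2>/dev/null` behave as expected if a missing match should be tolerated.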
@valarLip requested a review from gyohuangxin on March 19, 2026 07:27