perf: parallelize Docker build with multi-stage BuildKit #350
Open
Refactor the atom_image build into a shared base stage, 4 independent
build stages that BuildKit runs in parallel, and a final merge stage:
base ──┬── build_mori   (44 min) ──┐
       ├── build_rccl   (20 min) ──┤
       ├── build_triton (65 min) ──┼── atom_image (merge + ATOM)
       └── build_aiter  (60 min) ──┘
Before: ~225 min (serial) — exceeds 180 min timeout
After: ~70 min (parallel) — max(65,60,44,20) + merge
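The Before/After estimates come from comparing a sum with a maximum. Below is a minimal shell sketch of that arithmetic, using only the four stage durations listed in the diagram. Those durations sum to 189 min, so the ~225 min and ~70 min figures presumably also include base and merge steps that are not itemized (an assumption, not stated in the PR).

```shell
# Stage durations in minutes, taken from the diagram above.
mori=44; rccl=20; triton=65; aiter=60

# Serial build: the four component builds run back to back.
serial=$((mori + rccl + triton + aiter))

# Parallel build: the slowest stage bounds the wall-clock time.
parallel=0
for t in "$mori" "$rccl" "$triton" "$aiter"; do
  if [ "$t" -gt "$parallel" ]; then
    parallel=$t
  fi
done

echo "component builds, serial:   ${serial} min"
echo "component builds, parallel: ${parallel} min"
```

The parallel figure is bounded by build_triton (65 min). Further speedups would have to shorten that stage, not the shorter ones.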
Changes:
- docker/Dockerfile: split atom_image into base → 4 parallel build
stages → final merge stage with selective COPY --from
- docker-release.yaml: add DOCKER_BUILDKIT=1 to enable parallel stages
The OOT (vLLM plugin) image is unchanged.
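The stage layout described above can be sketched as follows. This script writes a hypothetical Dockerfile to a scratch directory. The stage names match the PR, but the base image, build commands, and artifact paths are placeholders, not the contents of docker/Dockerfile.

```shell
# Write a hypothetical Dockerfile sketch with the stage layout from the PR.
# Base image, RUN commands, and /opt/artifacts paths are placeholders.
sketch_dir="$(mktemp -d)"
cat > "$sketch_dir/Dockerfile" <<'EOF'
# syntax=docker/dockerfile:1
FROM ubuntu:22.04 AS base
RUN mkdir -p /opt/artifacts

# The four stages below depend only on base, so BuildKit may run them concurrently.
FROM base AS build_mori
RUN echo mori > /opt/artifacts/mori.txt

FROM base AS build_rccl
RUN echo rccl > /opt/artifacts/rccl.txt

FROM base AS build_triton
RUN echo triton > /opt/artifacts/triton.txt

FROM base AS build_aiter
RUN echo aiter > /opt/artifacts/aiter.txt

# Final merge stage: copy only the needed artifacts from each builder.
FROM base AS atom_image
COPY --from=build_mori   /opt/artifacts/mori.txt   /opt/artifacts/
COPY --from=build_rccl   /opt/artifacts/rccl.txt   /opt/artifacts/
COPY --from=build_triton /opt/artifacts/triton.txt /opt/artifacts/
COPY --from=build_aiter  /opt/artifacts/aiter.txt  /opt/artifacts/
EOF

stage_count="$(grep -c '^FROM ' "$sketch_dir/Dockerfile")"
echo "wrote $sketch_dir/Dockerfile with $stage_count stages"

# To build the sketch (requires Docker with BuildKit):
#   DOCKER_BUILDKIT=1 docker build --target atom_image "$sketch_dir"
```

Because no build_* stage references another, BuildKit's dependency graph places them on separate branches, and only the final stage waits for all four.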
Pull request overview
Refactors the atom_image Docker build to use BuildKit-friendly multi-stage builds so independent components (mori/rccl/triton/aiter) can be built in parallel, reducing nightly CI build time and avoiding the 180-minute workflow timeout.
Changes:
- Split `atom_image` into a shared `base` stage plus 4 independent build stages and a final merge stage.
- In the final stage, selectively `COPY --from=...` artifacts from builder stages and install ATOM.
- Enable BuildKit for the nightly docker release workflow build step.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docker/Dockerfile | Introduces parallel multi-stage build and final artifact merge for atom_image. |
| .github/workflows/docker-release.yaml | Enables BuildKit for docker build to allow parallel stage execution. |
COPY is not a shell command: 2>/dev/null and || true are treated as file paths, causing BuildKit to fail with "not found".
- Aiter: drop egg-link COPY (pip install -e recreates them)
- Triton: replace COPY glob with RUN --mount + cp to avoid glob issues
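A sketch of the kind of change this commit describes, written to a scratch file for inspection. The stage name build_triton comes from the PR. The source and target paths, and the choice to keep `|| true` in the fixed version, are assumptions made for illustration.

```shell
# Write a hypothetical fragment for the final stage to a scratch file.
# All paths are placeholders, not those used in docker/Dockerfile.
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
# Broken: COPY is parsed by the Dockerfile frontend, not a shell, so
# "2>/dev/null", "||" and "true" are read as additional paths.
# COPY --from=build_triton /opt/triton/dist/*.whl /tmp/wheels/ 2>/dev/null || true

# Possible fix: bind-mount the builder stage and copy inside RUN, where
# redirection and "|| true" are ordinary shell syntax.
RUN --mount=type=bind,from=build_triton,source=/opt/triton/dist,target=/mnt/triton \
    mkdir -p /tmp/wheels && cp /mnt/triton/*.whl /tmp/wheels/ 2>/dev/null || true
EOF
cat "$fragment"
```

Keeping `|| true` makes the copy tolerant of a missing wheel but also hides genuine failures. Dropping it makes the build fail fast if the builder stage produced nothing.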
Summary
Refactor the `atom_image` Docker build into a multi-stage BuildKit build: a shared base stage, 4 parallel build stages, and a final merge stage.
Problem
The nightly Docker build compiles mori (44m), RCCL (20m), Triton (65m), and aiter (60m) serially. These four steps alone sum to 189 min, and the full build takes ~225 min, exceeding the 180 min timeout.
Solution
Split into independent stages that BuildKit runs in parallel: base feeds build_mori, build_rccl, build_triton, and build_aiter, which are merged in atom_image.
Expected: ~225 min → ~70 min (max of parallel stages + merge)
Changes
- `docker/Dockerfile`: split atom_image into base → 4 parallel build stages → final merge with selective `COPY --from`
- `docker-release.yaml`: add `DOCKER_BUILDKIT=1`
Verification
- `docker build --target atom_image` succeeds
- `pip show mori aiter triton atom` all present