ci: free more disk in docker-test-gpu/lb runners #96
Merged
The multi-Python rollout in #89 added a ~7 GB torch reinstall to Dockerfile and Dockerfile-lb when PYTHON_VERSION != 3.12. On the ubuntu-latest runner this pushed the build over the available disk budget; recent main runs fail with `OSError: [Errno 28] No space left on device` while pip downloads torch-2.9.1+cu128 for cp311. Reclaim ~25 GB more by also removing the Android SDK, CodeQL toolcache, .ghcup, and the Microsoft/Google language toolchains in the existing "Clear space" step. sudo is required because the preinstalled toolchains are owned by root.
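The cleanup described above could be sketched roughly as follows. This is a minimal illustration, not the workflow diff from this PR: the `clear_space` helper, the `RM_PREFIX` and `CLEAR_SPACE_RUN` variables, and the specific directory paths are assumptions about where the ubuntu-latest image keeps the named toolchains.

```shell
#!/usr/bin/env bash
# Sketch of a "Clear space" step body. The paths further down are assumptions
# about the ubuntu-latest image layout, not copied from this PR's workflow.
set -euo pipefail

# Hypothetical helper: remove every path that exists and log each removal.
# RM_PREFIX can be set to "sudo" for root-owned directories.
clear_space() {
  local path
  for path in "$@"; do
    if [ -e "$path" ]; then
      echo "removing $path"
      ${RM_PREFIX:-} rm -rf -- "$path"
    fi
  done
}

# Guard (hypothetical variable) so the real runner paths are only touched
# when the step explicitly opts in.
if [ "${CLEAR_SPACE_RUN:-0}" = "1" ]; then
  df -h /   # free space before cleanup
  RM_PREFIX=sudo clear_space \
    /usr/local/lib/android \
    /opt/hostedtoolcache/CodeQL \
    /usr/local/.ghcup \
    /opt/microsoft \
    /opt/google
  df -h /   # free space after cleanup
fi
```

Printing `df -h /` before and after makes the reclaimed space visible in the job log, which helps confirm that the removals actually succeeded rather than failing silently on protected paths.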
Pull request overview
Updates CI workflow cleanup to reclaim significantly more disk space on GitHub-hosted runners before building the heavyweight GPU/LB Docker images, addressing “No space left on device” failures during large PyTorch wheel installs in the multi-Python matrix.
Changes:
- Expand the “Clear space” step in `docker-test-gpu` and `docker-test-lb` to remove additional preinstalled toolchains/caches (Android, CodeQL, Microsoft/Google toolchains, etc.).
- Run the cleanup with `sudo` so removal works on root-owned directories.
KAJdev
approved these changes
Apr 28, 2026
jhcipar
approved these changes
Apr 28, 2026
Summary
- The multi-Python rollout in #89 added a ~7 GB torch reinstall to `Dockerfile` and `Dockerfile-lb` when `PYTHON_VERSION != 3.12`. The Dockerfile header openly notes the overhead.
- Recent `main` CI runs (e.g. 25080308087, 25079769003) fail in `docker-test-lb (3.11)` and `docker-test-gpu (3.11)` with `OSError: [Errno 28] No space left on device` while pip downloads `torch-2.9.1+cu128-cp311` (~901 MB wheel) on top of the base image's already-installed cp312 torch. The trailing `System.IO.IOException` in the run summary is collateral: the runner can't even flush its diagnostic log.
- The existing `Clear space` step only reclaims ~13 GB. Adding the Android SDK, CodeQL toolcache, `.ghcup`, and the Microsoft/Google language toolchains reclaims roughly 25 GB more, comfortably covering the ~7 GB torch install plus GHA cache headroom.
- `sudo` is required because the preinstalled toolchains are owned by root; without it the existing line was already silently failing on protected paths.

The same change is applied to both the `docker-test-gpu` and `docker-test-lb` jobs (the only two that build the heavyweight pytorch base image).

Test plan
- `docker-test-lb (3.10/3.11/3.12)` all green on this PR
- `docker-test-gpu (3.10/3.11/3.12)` all green on this PR
- `docker-validation` passes (gates the release pipeline that's been stuck since feat: multi-Python worker images with startup version check (AE-2827) #89)
- `make quality-check` passes (281 unit + 14 handler tests, 81% coverage) ✅