
ci: free more disk in docker-test-gpu/lb runners #96

Merged
deanq merged 1 commit into main from fix/ci-disk-space-multi-python on Apr 29, 2026


deanq commented Apr 28, 2026

Summary

  • The multi-Python rollout in #89 (feat: multi-Python worker images with startup version check, AE-2827) added a ~7 GB torch reinstall to Dockerfile and Dockerfile-lb when PYTHON_VERSION != 3.12. The Dockerfile header already notes this overhead.
  • Recent main CI runs (e.g. 25080308087, 25079769003) fail in docker-test-lb (3.11) and docker-test-gpu (3.11) with OSError: [Errno 28] No space left on device while pip downloads torch-2.9.1+cu128-cp311 (~901 MB wheel) on top of the base image's already-installed cp312 torch. The trailing System.IO.IOException in the run summary is collateral — the runner can't even flush its diagnostic log.
  • The existing Clear space step only reclaims ~13 GB. Also removing the Android SDK, CodeQL toolcache, .ghcup, and the Microsoft/Google language toolchains reclaims roughly 25 GB more, comfortably covering the ~7 GB torch install plus GHA cache headroom.
  • sudo is required because the preinstalled toolchains are owned by root; without it the existing line was already silently failing on protected paths.

Same change applied to both docker-test-gpu and docker-test-lb jobs (the only two that build the heavyweight pytorch base image).
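
For illustration, the expanded cleanup could look roughly like the sketch below. This is not the PR's actual diff: the function name `clear_space`, the `SUDO` override, and every directory path are assumptions, based on where ubuntu-latest images commonly keep these tools. The mapping of "Microsoft/Google language toolchains" to `opt/microsoft` and `opt/google` in particular is a guess.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper, not the PR's actual step. The directory list is an
# assumption about where the runner image keeps these preinstalled tools.
clear_space() {
  local root="${1:-/}"
  root="${root%/}"               # "/" becomes "", so "$root/opt" is "/opt"
  local sudo_cmd="${SUDO-sudo}"  # set SUDO="" to run without sudo (dry runs)

  local targets=(
    usr/local/lib/android        # Android SDK
    opt/hostedtoolcache/CodeQL   # CodeQL toolcache
    usr/local/.ghcup             # Haskell toolchain manager
    opt/microsoft                # assumed location of Microsoft tooling
    opt/google                   # assumed location of Google tooling
  )

  df -h "${root:-/}"             # free space before cleanup
  local t
  for t in "${targets[@]}"; do
    # rm -rf exits 0 for missing paths, so absent directories are not an error
    $sudo_cmd rm -rf -- "$root/$t"
  done
  df -h "${root:-/}"             # free space after cleanup
}
```

On a runner this would be invoked as `clear_space /`. Passing a scratch directory instead, with `SUDO=""`, exercises the same logic without touching the real filesystem.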

Test plan

The multi-Python rollout in #89 added a ~7 GB torch reinstall to
Dockerfile and Dockerfile-lb when PYTHON_VERSION != 3.12. On the
ubuntu-latest runner this pushed the build over the available disk
budget; recent main runs fail with `OSError: [Errno 28] No space
left on device` while pip downloads torch-2.9.1+cu128 for cp311.

Reclaim ~25 GB more by also removing the Android SDK, CodeQL
toolcache, .ghcup, and the Microsoft/Google language toolchains
in the existing "Clear space" step. sudo is required because the
preinstalled toolchains are owned by root.
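
As a complement to the cleanup, a pre-flight guard can turn a late `No space left on device` failure into an early, readable one. The sketch below is an assumption and not part of this PR: `require_free_gb` is a hypothetical helper, and the threshold a job would pass is illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical pre-flight guard: fail fast when the filesystem holding PATH
# has fewer than NEED_GB gigabytes available.
require_free_gb() {
  local need_gb="$1" path="${2:-/}"
  local avail_kb avail_gb
  # df -P prints a single data line; with -k, column 4 is the available
  # space in 1024-byte blocks.
  avail_kb="$(df -Pk "$path" | awk 'NR==2 {print $4}')"
  avail_gb=$(( avail_kb / 1024 / 1024 ))
  if (( avail_gb < need_gb )); then
    echo "only ${avail_gb} GB free on ${path}, need ${need_gb} GB" >&2
    return 1
  fi
  echo "${avail_gb} GB free on ${path}"
}
```

A job might call `require_free_gb 10 /` right after the cleanup step, where 10 is an illustrative figure chosen to cover the ~7 GB torch install with some headroom.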

Copilot AI left a comment


Pull request overview

Updates CI workflow cleanup to reclaim significantly more disk space on GitHub-hosted runners before building the heavyweight GPU/LB Docker images, addressing “No space left on device” failures during large PyTorch wheel installs in the multi-Python matrix.

Changes:

  • Expand the “Clear space” step in docker-test-gpu and docker-test-lb to remove additional preinstalled toolchains/caches (Android, CodeQL, Microsoft/Google toolchains, etc.).
  • Run the cleanup with sudo so removal works on root-owned directories.


deanq merged commit 2654cf4 into main on Apr 29, 2026
25 checks passed
deanq deleted the fix/ci-disk-space-multi-python branch on April 29, 2026 01:47
