Fix torch autocast deprecation warning in gradient checkpointing #1010

Draft
Copilot wants to merge 6 commits into master from copilot/fix-autocast-deprecation-warning

Contributor

Copilot AI commented Apr 24, 2026

  • Confirmed torch 2.6.0 patches the torch.load weights_only=True RCE vulnerability
  • Confirmed torch 2.6.0+cu121 does not exist; switched CUDA backend from cu121 → cu124
  • Updated pyproject.toml: torch ^2.5 → ^2.6, source URL cu121 → cu124
  • Updated poetry.lock:
    • torch: 2.5.1+cu121 → 2.6.0+cu124
    • All NVIDIA CUDA packages: 12.1.x → 12.4.x versions
    • nvidia-cusparselt-cu12: new dependency 0.6.2 (required by torch 2.6.0)
    • triton: 3.1.0 → 3.2.0
    • sympy: 1.13.3 → 1.13.1 (pinned exactly by torch 2.6.0+cu124)
    • Updated content-hash to match new pyproject.toml
  • Added explanatory comment for use_reentrant=False in gradient_checkpointing_kwargs
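The version bump above pairs a release number with a CUDA build tag (`2.5.1+cu121` → `2.6.0+cu124`). As a minimal sketch, not code from this PR, the following shows one way to check both parts of such a version string. The helper names `split_torch_version` and `meets_minimum` are hypothetical, and the pattern only handles the plain `X.Y.Z+tag` shape seen in the lock file.

```python
import re
from typing import Optional, Tuple

# PEP 440 allows a "local version label" after "+"; the PyTorch wheels in
# this PR use it for the CUDA build tag (for example "cu124").
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\+(\w+))?$")


def split_torch_version(version: str) -> Tuple[Tuple[int, int, int], Optional[str]]:
    """Split a version such as "2.6.0+cu124" into ((2, 6, 0), "cu124")."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, build_tag = match.groups()
    return (int(major), int(minor), int(patch)), build_tag


def meets_minimum(version: str, minimum: Tuple[int, int, int], build_tag: str) -> bool:
    """Return True if the release is >= minimum and the build tag matches."""
    release, tag = split_torch_version(version)
    return release >= minimum and tag == build_tag


old = "2.5.1+cu121"
new = "2.6.0+cu124"
print(meets_minimum(old, (2, 6, 0), "cu124"))  # False
print(meets_minimum(new, (2, 6, 0), "cu124"))  # True
```

In an environment where torch is installed, `str(torch.__version__)` could be passed in place of the literal strings.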

Copilot AI linked an issue Apr 24, 2026 that may be closed by this pull request
Copilot AI changed the title [WIP] Fix Torch autocast deprecation warning during training Fix torch autocast deprecation warning in gradient checkpointing Apr 24, 2026
Copilot AI requested a review from benjaminking April 24, 2026 18:21
Collaborator

@benjaminking benjaminking left a comment

This is probably a good change to make. However, when I tested it with use_reentrant=False, I still got the autocast warning, this time from a different line:

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning:
`torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
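Since the warning moved to a different line, it can help to record exactly where each FutureWarning originates. The following is a minimal sketch using only the standard `warnings` module. `collect_future_warnings` is a hypothetical helper, and `legacy_call` is a stand-in for the library code that emits the warning, not the actual torch code.

```python
import warnings
from typing import Callable, List, Tuple


def collect_future_warnings(fn: Callable[[], object]) -> List[Tuple[str, int, str]]:
    """Run fn and return (filename, lineno, message) for each FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" disables the once-per-location dedup so nothing is hidden.
        warnings.simplefilter("always")
        fn()
    return [
        (w.filename, w.lineno, str(w.message))
        for w in caught
        if issubclass(w.category, FutureWarning)
    ]


def legacy_call() -> None:
    # Stand-in for the library code that emits the deprecation warning.
    warnings.warn(
        "`torch.cpu.amp.autocast(args...)` is deprecated. "
        "Please use `torch.amp.autocast('cpu', args...)` instead.",
        FutureWarning,
    )


for filename, lineno, message in collect_future_warnings(legacy_call):
    print(f"{filename}:{lineno}: {message}")
```

Wrapping a single training step in `collect_future_warnings` would list every file and line that still triggers the deprecated call, which makes it easier to tell whether the source is project code or the installed torch package.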

@benjaminking reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on copilot[bot]).

…cpu.amp.autocast

Agent-Logs-Url: https://github.com/sillsdev/silnlp/sessions/4592cc6c-17bc-498d-9320-6da2e1b8729b

Co-authored-by: benjaminking <1214233+benjaminking@users.noreply.github.com>
Copilot AI and others added 2 commits April 24, 2026 20:16
@benjaminking
Collaborator

This warning was being produced by Torch code itself, and PyTorch's policy appears to be to stop releasing patches for a minor version once the next minor version is out. So the way to fix it was to upgrade Torch from 2.4 to 2.6, which also meant moving the PyTorch wheel from CUDA 12.1 to 12.4. I'm sure we'll eventually want to upgrade our PyTorch and CUDA versions, but we should discuss whether now is the right time.

On jobs_backlog, the installed CUDA version is 12.4, and on cheetah_47gb, it is 13.0. Oddly, though, this branch works fine in an interactive session on jobs_backlog but fails with a missing library file (libcudnn.so.9) when a task is sent to jobs_backlog.

@mshannon-sil
Collaborator

Strange. Thinking through this, we know that the silnlp container already has its pip requirements installed, based on the poetry.lock file at the time the Docker image was built. When running remote execution with this image, ClearML looks at the poetry.lock file and goes through the installation process again, though most of the time everything is already installed, so it takes little time. In an interactive session, this installation does not happen, and you are just using the already-installed packages unless you manually install more. I'm guessing this difference is the cause of the different results you're seeing. I'd think that if you create a new Docker image with the updated requirements, this bug would go away, but I'm not 100% positive.
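One way to test this guess is to compare what the lock file pins against what is actually installed in the environment where the job runs. The sketch below uses only `importlib.metadata` from the standard library. `find_drift` is a hypothetical helper, and the `pinned` dictionary is copied by hand from the lock file changes listed in this PR rather than parsed from poetry.lock.

```python
from importlib import metadata
from typing import Callable, Dict, List, Optional, Tuple


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_drift(
    pinned: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> List[Tuple[str, str, Optional[str]]]:
    """List (name, pinned, installed) for every package that does not match."""
    drift = []
    for name, wanted in sorted(pinned.items()):
        found = lookup(name)
        if found != wanted:
            drift.append((name, wanted, found))
    return drift


# Hypothetical pins copied by hand from the lock file changes in this PR.
pinned = {"torch": "2.6.0+cu124", "triton": "3.2.0", "sympy": "1.13.1"}
for name, wanted, found in find_drift(pinned):
    print(f"{name}: lock file pins {wanted}, environment has {found}")
```

Running this once in an interactive session and once inside a remote task would show whether the two environments really resolve to different package sets.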

Collaborator

@benjaminking benjaminking left a comment

I figured out what was going on and put in a temporary fix for it. I'll include the long version below, but the short version is that, in the long term, we need to have the Nvidia CUDA packages installed on the Docker image to avoid the issue we were seeing. The temporary fix is to have Poetry reinstall all of the packages in the venv.

And now for the long version. Our current Docker image supports CUDA 12.4, which is the version this PR upgrades to. The cuDNN package is already present on the Docker image, while the other Nvidia packages need to be installed by Poetry. Poetry has a setting to use packages already installed on the Docker image if it can. But there is a phenomenon called "shadowing", where the package on the image is hidden if any package from the same namespace is installed in the venv with Poetry.

Essentially, our two options with the Nvidia packages are to either have all of them pre-installed on the Docker image or install all of them with Poetry. I've temporarily implemented the latter, but updating the Docker image is the long-term fix. I have verified that this change successfully removes the autocast warning and that the experiment pipeline runs successfully.
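To see which location, if any, actually provides a library such as libcudnn.so.9, a small search over the import path can help. This is a diagnostic sketch only. It assumes the Nvidia pip wheels unpack their shared libraries under `nvidia/<component>/lib/` inside a site-packages directory, and `find_nvidia_library` is a hypothetical helper. It does not reproduce how torch or the dynamic loader resolves libraries.

```python
import glob
import os
import sys
from typing import Iterable, List, Optional


def find_nvidia_library(lib_name: str, roots: Optional[Iterable[str]] = None) -> List[str]:
    """Return every <root>/nvidia/<component>/lib/<lib_name> that exists.

    roots defaults to the current sys.path, so a venv's site-packages and the
    image's system site-packages are both searched when both are on the path.
    """
    if roots is None:
        roots = list(sys.path)
    matches = []
    for root in roots:
        # Assumed layout: site-packages/nvidia/<component>/lib/<lib_name>
        pattern = os.path.join(root, "nvidia", "*", "lib", lib_name)
        matches.extend(sorted(glob.glob(pattern)))
    return matches


hits = find_nvidia_library("libcudnn.so.9")
print(hits if hits else "libcudnn.so.9 not found under any sys.path entry")
```

If the venv contains an `nvidia` directory without a `cudnn` component while the system site-packages has one, that would be consistent with the shadowing described above.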

@benjaminking made 1 comment.
Reviewable status: 0 of 3 files reviewed, all discussions resolved (waiting on mshannon-sil).

@mshannon-sil
Collaborator

That makes sense. I'd vote for going ahead and updating the docker container to have the nvidia packages preinstalled. Maybe this is something I should do, since I can test that the new dockerfile creates an image that runs successfully on my local GPU, before pushing a new version of the silnlp image publicly.

@benjaminking
Collaborator

We will plan to tie this upgrade to the upgrade of the Python and Ubuntu version.

Development

Successfully merging this pull request may close these issues.

Torch autocast deprecation warning

3 participants