Fix torch autocast deprecation warning in gradient checkpointing #1010
Copilot wants to merge 6 commits
benjaminking
left a comment
This is probably a good change to make, but when I tested it with use_reentrant=False I still got the autocast warning, this time from a different line:
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning:
`torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
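One way to pin down which line raises a warning like this is to record it with Python's standard `warnings` module and read off the file and line number. The sketch below is illustrative only: `legacy_autocast_stub` is a hypothetical stand-in that emits the same message, not actual PyTorch code.

```python
import warnings


def legacy_autocast_stub() -> None:
    """Hypothetical stand-in for library code that emits the deprecation."""
    warnings.warn(
        "`torch.cpu.amp.autocast(args...)` is deprecated. "
        "Please use `torch.amp.autocast('cpu', args...)` instead.",
        FutureWarning,
    )


def capture_future_warnings(fn):
    """Run fn and return (filename, lineno, message) for each FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones that would normally be shown only once.
        warnings.simplefilter("always")
        fn()
    return [
        (w.filename, w.lineno, str(w.message))
        for w in caught
        if issubclass(w.category, FutureWarning)
    ]


if __name__ == "__main__":
    for filename, lineno, message in capture_future_warnings(legacy_autocast_stub):
        print(f"{filename}:{lineno}: {message}")
```

In a real run, the callable passed in would be whatever step triggers the warning (for example, a single training step), so the reported path shows whether the warning comes from project code or from inside the installed Torch package.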
…cpu.amp.autocast Agent-Logs-Url: https://github.com/sillsdev/silnlp/sessions/4592cc6c-17bc-498d-9320-6da2e1b8729b Co-authored-by: benjaminking <1214233+benjaminking@users.noreply.github.com>
…he source Agent-Logs-Url: https://github.com/sillsdev/silnlp/sessions/3eeb9ac4-ca51-42c0-a741-15f1ccea7d80 Co-authored-by: benjaminking <1214233+benjaminking@users.noreply.github.com>
This warning was being produced by Torch code, and PyTorch's policy appears to be to stop releasing patches for a minor version once the next minor version is out. So the fix was to upgrade Torch from 2.4 to 2.6, which also meant moving the PyTorch wheel from CUDA 12.1 to 12.4. I'm sure we'll eventually want to upgrade the PyTorch and CUDA versions, but we should discuss whether now is the right time.
Strange. Thinking this through: the silnlp container already has the pip requirements installed, based on the poetry.lock file at the time the Docker image was built. When running remote execution with this image, ClearML reads the poetry.lock file and runs the installation again, though usually everything is already installed, so it takes little time. In an interactive session this installation does not happen, and you are just using the preinstalled packages unless you manually install more. I'm guessing this difference explains the different results you're seeing. I'd expect the bug to go away if you build a new Docker image with the updated requirements, but I'm not 100% positive.
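To compare the two environments, one option is to print the installed versions of the relevant packages in both the interactive session and the remote run. This is a rough sketch using only the standard library; the package names passed in are assumptions chosen for illustration, not a list taken from the project's lock file.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Assumed package names, for illustration only.
    for name, ver in installed_versions(["torch", "nvidia-cudnn-cu12"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running the same snippet in both contexts and diffing the output would show whether the interactive session is picking up different package versions than the remote execution.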
benjaminking
left a comment
I figured out what was going on and put in a temporary fix for it. I'll include the long version below, but the short version is that, in the long term, we need to have the Nvidia CUDA packages installed on the Docker image to avoid the issue we were seeing. The temporary fix is to have Poetry reinstall all of the packages in the venv.
And now for the long version. Our current Docker image supports CUDA 12.4, which is the version this PR upgrades to. The cuDNN package is already present on the Docker image, while the other Nvidia packages need to be installed by Poetry. Poetry has a setting to use packages already installed on the Docker image when it can. But there is a phenomenon called "shadowing", where the package on the image is hidden if any package from the same namespace is installed in the venv with Poetry.
Essentially, our two options with the Nvidia packages are to either have all of them pre-installed on the Docker image or install all of them with Poetry. I've temporarily implemented the latter, but updating the Docker image is the long-term fix. I have verified that this change successfully removes the autocast warning and that the experiment pipeline runs successfully.
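The shadowing described above can be reproduced with plain Python imports. This is a minimal sketch under assumptions: `fakevendor` is a hypothetical stand-in for a shared top-level namespace such as `nvidia`, and the directory names are invented. It shows only one way shadowing can arise in Python's import system (a regular package with `__init__.py` earlier on `sys.path` hiding subpackages that live elsewhere); the exact mechanism in the silnlp image may differ.

```python
import importlib
import sys
import tempfile
from pathlib import Path

TOP = "fakevendor"  # hypothetical stand-in for a shared namespace such as `nvidia`


def build_tree(root: Path, subpackage: str, with_top_init: bool) -> Path:
    """Create root/fakevendor/<subpackage>/__init__.py and return root."""
    pkg_dir = root / TOP / subpackage
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(f"NAME = {subpackage!r}\n")
    if with_top_init:
        # An __init__.py makes the top level a regular (non-namespace) package.
        (root / TOP / "__init__.py").write_text("")
    return root


def forget_top_level() -> None:
    """Drop cached fakevendor modules so each attempt starts clean."""
    for name in [n for n in sys.modules if n == TOP or n.startswith(TOP + ".")]:
        del sys.modules[name]


def can_import(search_path, module: str) -> bool:
    """Try importing `module` with `search_path` placed first on sys.path."""
    saved = list(sys.path)
    forget_top_level()
    sys.path[:0] = search_path
    importlib.invalidate_caches()
    try:
        importlib.import_module(module)
        return True
    except ModuleNotFoundError:
        return False
    finally:
        sys.path[:] = saved
        forget_top_level()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # "image": the preinstalled location that holds the cudnn subpackage.
        image = build_tree(base / "image_site", "cudnn", with_top_init=False)
        # "venv": holds a different subpackage from the same namespace.
        venv_ns = build_tree(base / "venv_ns", "cublas", with_top_init=False)
        venv_reg = build_tree(base / "venv_reg", "cublas", with_top_init=True)
        return {
            # Both are namespace portions, so the two locations are merged.
            "merged": can_import([str(venv_ns), str(image)], f"{TOP}.cudnn"),
            # The venv copy is a regular package, so the image copy is hidden.
            "shadowed": can_import([str(venv_reg), str(image)], f"{TOP}.cudnn"),
        }


if __name__ == "__main__":
    print(demo())
```

Either remedy from the comment avoids the split: keeping every package of the namespace in one location means no lookup depends on merging two directories.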
That makes sense. I'd vote for going ahead and updating the Docker container to have the Nvidia packages preinstalled. Maybe this is something I should do, since I can test that the new Dockerfile creates an image that runs successfully on my local GPU before pushing a new version of the silnlp image publicly.
We will plan to tie this upgrade to the upgrade of the Python and Ubuntu version.
- torch.load weights_only=True RCE vulnerability
- pyproject.toml: torch ^2.5 → ^2.6, source URL cu121 → cu124
- poetry.lock: content-hash to match new pyproject.toml
- use_reentrant=False in gradient_checkpointing_kwargs