Conversation
|
Is the rocm model harness necessary? I was hoping that ROCM is just a backend swap from cuda, i.e. tensor.cuda() would be directly translated to the rocm device if the code is run on rocm. If that's not the case, we need to make sure that the ddp support will be compatible with both cuda and rocm. |
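For reference, a minimal sketch of how to check which backend sits behind `tensor.cuda()`. ROCm builds of PyTorch reuse the `torch.cuda` namespace, and `torch.version.hip` is set on those builds and `None` on CUDA builds. The helper name `describe_backend` is hypothetical and not part of this PR; the import guard is only there so the snippet also runs on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable where PyTorch is not installed
    torch = None


def describe_backend() -> str:
    """Return 'rocm', 'cuda', 'cpu', or 'no-torch' for the current install."""
    if torch is None:
        return "no-torch"
    if not torch.cuda.is_available():
        # No usable GPU device, regardless of how PyTorch was built.
        return "cpu"
    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    return "rocm" if getattr(torch.version, "hip", None) else "cuda"


print(describe_backend())
```

If this reports `rocm`, calls such as `tensor.cuda()` and `torch.device("cuda")` target the AMD GPU without any code changes.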
|
We can remove the test related to model harness and jvp update. The rocm installation tests should be enough. All I would want to add is some tests that cover the movement of data from CPU to GPU memory, and vice versa. We can also add a ddp test with ROCM. I'm sure the latest versions of ROCM are more stable and we can assume the ROCM backend works out of the box via .cuda, but I still think it's a good idea to test the installation. |
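A minimal sketch of what such a data-movement test could look like, assuming the same `.cuda()` / `.cpu()` calls are used on both CUDA and ROCm. The function name `gpu_roundtrip_ok` is hypothetical, and this is not the test that was added in the PR. It returns `None` when there is no PyTorch install or no GPU, so it can be skipped on login nodes.

```python
try:
    import torch
except ImportError:  # allows the sketch to load on machines without PyTorch
    torch = None


def gpu_roundtrip_ok(shape=(4, 4)):
    """Copy a tensor CPU -> GPU -> CPU.

    Returns True or False for the round trip, or None if it cannot be tested.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    cpu_tensor = torch.arange(shape[0] * shape[1], dtype=torch.float32).reshape(shape)
    gpu_tensor = cpu_tensor.cuda()  # host-to-device copy (HIP device on ROCm builds)
    if not gpu_tensor.is_cuda:
        return False
    back = gpu_tensor.cpu()  # device-to-host copy
    # Values must survive the round trip unchanged.
    return back.device.type == "cpu" and torch.equal(cpu_tensor, back)


print(gpu_roundtrip_ok())
```

In a pytest suite, the `None` case would map naturally onto `pytest.skip`.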
|
Streamlined the testing of ROCM support and updated the deployment README. Like in #78, we can wait to merge this PR until after we test ddp. |
…Perlmutter w/ additional ROCM dependencies.
|
Updated the install script and job submission to have parity with deployment on Perlmutter.
|
anagainaru
left a comment
This looks good to me. Is this complete, @rz4, or do you still need to add anything?
|
This PR is complete. Will make a separate PR for deployment with DDP training. |
Summary
Running framework on Frontier. Setup guide documented in deployment README.
Motivation & Context
Quick way to start running experiments on Frontier with ROCM support.
Approach
I hit some errors with MIOpen (AMD's cuDNN equivalent) when caching the conv kernels. Resolved these in the SLURM scripts by setting the cache path to a location in the scratch directory. The cause was IO errors on the home directory (the default location).
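A rough Python equivalent of the cache redirect, for cases where the job is launched without the SLURM scripts. This is a sketch under assumptions: `MIOPEN_USER_DB_PATH` and `MIOPEN_CUSTOM_CACHE_DIR` are MIOpen's environment variables for the user database and kernel cache, but which of them the SLURM scripts actually set is not stated here. The helper name `configure_miopen_cache` and the `miopen/` subdirectory layout are hypothetical.

```python
import os
from pathlib import Path


def configure_miopen_cache(scratch_root: str) -> dict:
    """Point MIOpen's user DB and kernel cache at directories under scratch_root.

    Call this before the first convolution runs (ideally before importing torch),
    since MIOpen reads these variables when it initializes.
    """
    base = Path(scratch_root) / "miopen"  # hypothetical layout
    paths = {
        "MIOPEN_USER_DB_PATH": base / "user_db",
        "MIOPEN_CUSTOM_CACHE_DIR": base / "cache",
    }
    resolved = {}
    for var, directory in paths.items():
        directory.mkdir(parents=True, exist_ok=True)  # create on the scratch filesystem
        os.environ[var] = str(directory)
        resolved[var] = str(directory)
    return resolved
```

Usage would be something like `configure_miopen_cache(os.environ["SCRATCH"])`, assuming the site exports a scratch variable under that name.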
Performance on the MNIST example is worse than on my MacBook with default settings. Since it's a small model, you only see compute benefits by increasing the batch size.
Currently, I have it working with ROCM 6.4.2. This version allows torch and torchvision versions within the range specified in pyproject.toml.
There's ROCM 7, but things seem stable with ROCM 6.4.2 for now.
Screenshots / Logs (optional)
Testing ROCm installation:
API / CLI Changes
Breaking Changes
Performance (optional)
Security & Privacy
N/A
Dependencies
Testing Plan
Described in updated README.
Documentation
Checklist
ruff format --check
ruff check .
mypy src
pytest -q
Risk & Rollback Plan
Probably not needed in the beginning
Notes for Reviewers