
Running MNIST example on Frontier #67

Merged
anagainaru merged 13 commits into main from deploy_frontier
Mar 24, 2026

Conversation

@rz4
Collaborator

@rz4 rz4 commented Feb 9, 2026

Summary

Running framework on Frontier. Setup guide documented in deployment README.

Motivation & Context

Provides a quick way to start running experiments on Frontier with ROCm support.

Approach

  1. cd to the scratch directory.
  2. Clone the repo, create a Python virtual environment, and install Poetry.
  3. Install dependencies and torch with ROCm support.
  4. Test the torch+ROCm libraries.
  5. Test the model harness and JVP update on a ROCm device.
  6. Submit the MNIST example job.

I hit errors with MIOpen (AMD's cuDNN equivalent) when it cached the convolution kernels. The cause was I/O errors on the home directory, which is the default cache location. I resolved them in the SLURM scripts by setting the cache path to the scratch directory.
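
A minimal sketch of the same redirect done from Python rather than from the SLURM scripts. It assumes the two environment variables MIOpen documents for its user database and kernel cache; the exact variables used in this PR's scripts are not shown in this thread, and the `miopen_cache` subdirectory name is hypothetical.

```python
import os
from pathlib import Path


def miopen_cache_env(scratch_dir: str) -> dict:
    """Return env vars that point MIOpen's caches at a scratch directory."""
    cache_root = Path(scratch_dir) / "miopen_cache"  # hypothetical subdirectory name
    return {
        # Where MIOpen keeps its user performance database.
        "MIOPEN_USER_DB_PATH": str(cache_root),
        # Where MIOpen keeps compiled kernel binaries.
        "MIOPEN_CUSTOM_CACHE_DIR": str(cache_root),
    }


def apply_miopen_cache_env(scratch_dir: str) -> None:
    """Create the cache directory and export the variables for this process."""
    env = miopen_cache_env(scratch_dir)
    Path(env["MIOPEN_USER_DB_PATH"]).mkdir(parents=True, exist_ok=True)
    os.environ.update(env)
```

If this route is used, `apply_miopen_cache_env` would need to run before the first convolution is executed, so that MIOpen picks up the new paths.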

Performance on the MNIST example is worse than on my MacBook with default settings. Since it's a small model, you only see compute benefits by increasing the batch size.
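
One way to picture the batch-size effect is a toy cost model in which every training step pays a fixed overhead (kernel launches, host-to-device copies) plus a per-sample compute cost. The function and the constants below are illustrative placeholders, not measurements from Frontier or from this PR.

```python
def samples_per_second(batch_size: int, overhead_s: float, per_sample_s: float) -> float:
    """Throughput under a simple model: step time = fixed overhead + per-sample cost."""
    step_time = overhead_s + batch_size * per_sample_s
    return batch_size / step_time


if __name__ == "__main__":
    # Hypothetical constants, chosen only to show the trend.
    overhead_s = 1e-3
    per_sample_s = 1e-6
    for batch_size in (64, 512, 4096):
        rate = samples_per_second(batch_size, overhead_s, per_sample_s)
        print(f"batch_size={batch_size}: {rate:.0f} samples/s")
```

Under this model the fixed overhead dominates at small batch sizes, which is consistent with a small model showing no GPU benefit until the batch size grows.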

Currently, I have it working with ROCm 6.4.2. This version allows torch and torchvision versions within the range specified in pyproject.toml.

ROCm 7 is available, but things seem stable with ROCm 6.4.2 for now.

Screenshots / Logs (optional)

Testing ROCm installation:

==============================================
ROCm Installation & Harness Test
==============================================
Date: Mon Feb  9 12:36:31 PM EST 2026
Hostname: frontier05758
ROCM_PATH: /opt/rocm-6.4.2
==============================================
============================= test session starts ==============================
platform linux -- Python 3.13.0, pytest-9.0.1, pluggy-1.6.0 -- /lustre/orion/lrn097/scratch/rzamora/ModCon/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /lustre/orion/lrn097/scratch/rzamora/ModCon/temp/BaseSim_Framework
configfile: pyproject.toml
plugins: Faker-38.2.0, anyio-4.11.0
collecting ... collected 7 items

tests/deployment/frontier/test_rocm_install.py::test_torch_import PASSED [ 14%]
tests/deployment/frontier/test_rocm_install.py::test_torchvision_import PASSED [ 28%]
tests/deployment/frontier/test_rocm_install.py::test_rocm_available PASSED [ 42%]
tests/deployment/frontier/test_rocm_install.py::test_gpu_count PASSED    [ 57%]
tests/deployment/frontier/test_rocm_install.py::test_gpu_properties PASSED [ 71%]
tests/deployment/frontier/test_rocm_install.py::test_tensor_on_gpu PASSED [ 85%]
tests/deployment/frontier/test_rocm_install.py::test_torch_rocm_build PASSED [100%]

============================== 7 passed in 19.78s ==============================
============================= test session starts ==============================
platform linux -- Python 3.13.0, pytest-9.0.1, pluggy-1.6.0 -- /lustre/orion/lrn097/scratch/rzamora/ModCon/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /lustre/orion/lrn097/scratch/rzamora/ModCon/temp/BaseSim_Framework
configfile: pyproject.toml
plugins: Faker-38.2.0, anyio-4.11.0
collecting ... collected 14 items

tests/deployment/frontier/test_model_harness_rocm.py::TestModelLoading::test_model_on_gpu PASSED [  7%]
tests/deployment/frontier/test_model_harness_rocm.py::TestModelLoading::test_model_device_matches_config PASSED [ 14%]
tests/deployment/frontier/test_model_harness_rocm.py::TestDataLoader::test_data_loaders_created PASSED [ 21%]
tests/deployment/frontier/test_model_harness_rocm.py::TestDataLoader::test_data_loader_batch_shape PASSED [ 28%]
tests/deployment/frontier/test_model_harness_rocm.py::TestDataLoader::test_data_moves_to_gpu PASSED [ 35%]
tests/deployment/frontier/test_model_harness_rocm.py::TestForwardPass::test_forward_pass_runs PASSED [ 42%]
tests/deployment/frontier/test_model_harness_rocm.py::TestForwardPass::test_forward_pass_output_shape PASSED [ 50%]
tests/deployment/frontier/test_model_harness_rocm.py::TestForwardPass::test_forward_pass_output_on_gpu PASSED [ 57%]
tests/deployment/frontier/test_model_harness_rocm.py::TestEval::test_eval_runs PASSED [ 64%]
tests/deployment/frontier/test_model_harness_rocm.py::TestEval::test_eval_returns_metrics PASSED [ 71%]
tests/deployment/frontier/test_model_harness_rocm.py::TestEval::test_eval_metrics_are_valid PASSED [ 78%]
tests/deployment/frontier/test_model_harness_rocm.py::TestTrainingStep::test_training_step PASSED [ 85%]
tests/deployment/frontier/test_model_harness_rocm.py::TestTrainingStep::test_gradients_computed PASSED [ 92%]
tests/deployment/frontier/test_model_harness_rocm.py::TestTrainingStep::test_weights_updated PASSED [100%]

============================= 14 passed in 39.74s ==============================
============================= test session starts ==============================
platform linux -- Python 3.13.0, pytest-9.0.1, pluggy-1.6.0 -- /lustre/orion/lrn097/scratch/rzamora/ModCon/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /lustre/orion/lrn097/scratch/rzamora/ModCon/temp/BaseSim_Framework
configfile: pyproject.toml
plugins: Faker-38.2.0, anyio-4.11.0
collecting ... collected 6 items

tests/deployment/frontier/test_jvp_update_rocm.py::TestJVPRegUpdater::test_jvp_updater_creation PASSED [ 16%]
tests/deployment/frontier/test_jvp_update_rocm.py::TestJVPRegUpdater::test_jvp_updater_forward_backward PASSED [ 33%]
tests/deployment/frontier/test_jvp_update_rocm.py::TestJVPRegUpdater::test_jvp_gradients_on_gpu PASSED [ 50%]
tests/deployment/frontier/test_jvp_update_rocm.py::TestJVPUpdateStep::test_jvp_step_runs PASSED [ 66%]
tests/deployment/frontier/test_jvp_update_rocm.py::TestJVPUpdateStep::test_jvp_step_updates_weights PASSED [ 83%]
tests/deployment/frontier/test_jvp_update_rocm.py::TestJVPUpdateStep::test_jvp_step_multiple_iterations PASSED [100%]

============================== 6 passed in 22.89s ==============================
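
For readers without access to the test files, the kind of checks exercised by the installation tests above can be sketched as follows. This is an assumed reconstruction based only on the test names in the log, not the contents of `test_rocm_install.py`; `rocm_report` is a hypothetical helper.

```python
import importlib.util


def rocm_report() -> dict:
    """Collect basic facts about the local torch build and the visible GPUs."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch  # imported lazily so the function also runs where torch is absent

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    report["hip_version"] = torch.version.hip
    # ROCm builds expose AMD GPUs through the torch.cuda namespace.
    report["gpu_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    if report["gpu_available"]:
        report["gpu_name"] = torch.cuda.get_device_properties(0).name
    return report


if __name__ == "__main__":
    for key, value in rocm_report().items():
        print(f"{key}: {value}")
```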

API / CLI Changes

Breaking Changes

  • Installing ROCm-compatible torch requires a pip install after poetry install.
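
The extra pip step could look roughly like the sketch below. The wheel index URL and the helper name are assumptions and are not taken from the deployment README; the pinned versions are the ones listed under Dependencies in this PR.

```python
import sys

# Assumption: PyTorch's wheel index for ROCm 6.4 builds.
ROCM_WHEEL_INDEX = "https://download.pytorch.org/whl/rocm6.4"


def rocm_pip_command(torch_version: str = "2.9.1", torchvision_version: str = "0.24.1") -> list:
    """Build the pip command that swaps in ROCm wheels after `poetry install`."""
    return [
        sys.executable, "-m", "pip", "install",
        f"torch=={torch_version}+rocm6.4",
        f"torchvision=={torchvision_version}+rocm6.4",
        "--index-url", ROCM_WHEEL_INDEX,
    ]


if __name__ == "__main__":
    # Print only; to execute, pass the list to subprocess.run(..., check=True).
    print(" ".join(rocm_pip_command()))
```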

Performance (optional)

  • Works on a single GPU. No performance benefits on MNIST.toml unless you increase the batch size.

Security & Privacy

N/A

Dependencies

  • rocm==6.4.2
  • torch==2.9.1+rocm6.4
  • torchvision==0.24.1+rocm6.4
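
A small sketch of how the `+rocm6.4` local version tag on these pins can be checked against the ROCm release in use. Both helper names are hypothetical, and the check only compares the major.minor part of the ROCm version.

```python
from typing import Optional


def rocm_tag(version: str) -> Optional[str]:
    """Return the ROCm local-version tag of a wheel version, or None if absent."""
    _, sep, local = version.partition("+")
    if sep and local.startswith("rocm"):
        return local
    return None


def matches_rocm(version: str, rocm_version: str) -> bool:
    """Check that a wheel's tag matches the major.minor of the ROCm release."""
    major_minor = ".".join(rocm_version.split(".")[:2])
    return rocm_tag(version) == f"rocm{major_minor}"
```

In practice the first argument would be `torch.__version__` or `torchvision.__version__` from the installed environment.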

Testing Plan

Described in updated README.

Documentation

  • Docstrings updated
  • User docs / README updated
  • CHANGELOG entry

Checklist

  • Code formatted (Ruff) → ruff format --check
  • Lint passes (Ruff) → ruff check .
  • Types pass (mypy/pyright) → mypy src
  • Tests pass (pytest) → pytest -q
  • Backward compatibility considered
  • Adequate comments for tricky parts
  • CI green

Risk & Rollback Plan

Probably not needed in the beginning

Notes for Reviewers

@ScSteffen
Collaborator

Is the ROCm model harness necessary? I was hoping that ROCm is just a backend swap from CUDA, i.e. tensor.cuda() would be directly translated to the ROCm device if the code is run on ROCm.

If that's not the case, we need to make sure that the DDP support will be compatible with both CUDA and ROCm.

@rz4
Collaborator Author

rz4 commented Feb 18, 2026

We can remove the tests related to the model harness and JVP update.

The ROCm installation tests should be enough. All I would want to add is some tests that cover the movement of data from CPU to GPU memory, and vice versa. We can also add a DDP test with ROCm.

I'm sure the latest versions of ROCm are more stable and we can assume the ROCm backend works out of the box via .cuda, but I still think it's a good idea to test the installation.
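
A CPU-to-GPU round-trip check of the kind mentioned above might look like this sketch. It is not the test that was eventually added in this PR; `round_trip_ok` is a hypothetical helper, and it returns None instead of failing when torch or a GPU is unavailable.

```python
import importlib.util
from typing import Optional


def round_trip_ok() -> Optional[bool]:
    """Copy a tensor CPU -> GPU -> CPU and report whether the values survive.

    Returns None when torch is not installed or no GPU is visible.
    """
    if importlib.util.find_spec("torch") is None:
        return None

    import torch

    if not torch.cuda.is_available():
        return None

    x = torch.arange(16, dtype=torch.float32)
    # The device string stays "cuda" on ROCm builds of torch.
    y = x.to("cuda")
    if y.device.type != "cuda":
        return False
    return bool(torch.equal(y.cpu(), x))


if __name__ == "__main__":
    print(f"round trip: {round_trip_ok()}")
```

Because the same `"cuda"` device string is used on both backends, a check written this way does not need a ROCm-specific branch.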

@rz4
Collaborator Author

rz4 commented Feb 18, 2026

Streamlined the testing of ROCm support and updated the deployment README.

As in #78, we can hold off on merging this PR until we have tested DDP.

@rz4 added the Deployment label (Issues and PRs related to the deployment of the model back in the system) Mar 9, 2026
Rafael Zamora-Resendiz added 2 commits March 17, 2026 13:17
@rz4
Collaborator Author

rz4 commented Mar 17, 2026

Updated the install script and job submission to have parity with deployment on Perlmutter.

  • The user clones the repo in the scratch directory.
  • Install dependencies into a virtual environment with src/deployment/frontier/install_venv.sh.
  • Test ROCm support with tests/test_rocm.py.
  • Download the MNIST data, and submit src/deployment/frontier/mnist_example.sbatch with -A account_id.
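
The submission step above can be sketched as a small helper that assembles the `sbatch` call. The helper name is hypothetical and `abc123` is a placeholder account id; `-A` is SLURM's short form of `--account`.

```python
import shlex

MNIST_JOB_SCRIPT = "src/deployment/frontier/mnist_example.sbatch"


def sbatch_command(account_id: str, script: str = MNIST_JOB_SCRIPT) -> list:
    """Build the sbatch invocation that charges the job to the given account."""
    if not account_id:
        raise ValueError("account_id must be a non-empty string")
    return ["sbatch", "-A", account_id, script]


if __name__ == "__main__":
    # Print only; to submit, pass the list to subprocess.run(..., check=True)
    # from the repository root on a Frontier login node.
    print(shlex.join(sbatch_command("abc123")))  # placeholder account id
```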

Collaborator

@anagainaru anagainaru left a comment


This looks good to me. Is this complete, @rz4, or do you still need to add anything?

@rz4
Collaborator Author

rz4 commented Mar 24, 2026

This PR is complete. Will make a separate PR for deployment with DDP training.

@anagainaru anagainaru merged commit 1d140fd into main Mar 24, 2026
3 checks passed
@anagainaru anagainaru deleted the deploy_frontier branch March 24, 2026 17:20

Labels

Deployment: Issues and PRs related to the deployment of the model back in the system


3 participants