CI: add CUDA-aware MPI opt-in to GPU Nsight profiling #47
Open
cenamiller wants to merge 3 commits into master from
Conversation
The Nsight workflow always ran with host-staged MPI, even though the
MPAS-A OpenACC build ships with `acc host_data use_device(...)` around
halo exchanges. Without an explicit opt-in, MPI silently `cudaMemcpy`s
device buffers through host memory, so the profile shows a lot of
H<->D copy traffic and the MPI calls themselves are missing from the
timeline (nsys was not tracing MPI either).
Two changes:
1. Add `cuda_aware_mpi` workflow input (default false, preserves prior
behaviour). When true, run-nsys-profile.sh sets the right env vars
per MPI implementation:
- MPICH: MPICH_GPU_SUPPORT_ENABLED=1
- OpenMPI: OMPI_MCA_pml=ucx, OMPI_MCA_osc=ucx, UCX_TLS includes
cuda transports; also passes `--mca pml ucx --mca osc ucx`
on the mpirun line.
These do nothing useful unless the container's MPI is built with GPU
support, but the failure mode is loud (MPI abort) rather than silent.
2. Add `mpi` to the nsys trace targets so halo exchanges show up in the
timeline regardless of the cuda-aware setting. This lets us compare
host-staged vs cuda-aware runs by dispatching the workflow twice.
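For illustration, a minimal sketch of what the per-implementation branch in run-nsys-profile.sh can look like; the `CUDA_AWARE_MPI`, `MPI_IMPL`, and `MPIRUN_EXTRA_ARGS` names here are placeholders, not the script's actual variables:

```bash
# Sketch only: variable names are assumptions, not the script's identifiers.
if [ "${CUDA_AWARE_MPI:-false}" = "true" ]; then
  case "${MPI_IMPL:-mpich}" in
    mpich)
      # MPICH: enable the GPU-resident buffer path
      export MPICH_GPU_SUPPORT_ENABLED=1
      ;;
    openmpi)
      # OpenMPI: route pt2pt and one-sided traffic through UCX with CUDA transports
      export OMPI_MCA_pml=ucx
      export OMPI_MCA_osc=ucx
      export UCX_TLS=cuda,cuda_copy,cuda_ipc,sm,self
      # also passed on the mpirun line
      MPIRUN_EXTRA_ARGS="--mca pml ucx --mca osc ucx"
      ;;
  esac
fi
```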
Made-with: Cursor
Force-pushed from ad8aeac to bdafcb3
Both motivated by debugging cuda-aware MPI on PR #47. Without per-rank pinning, all ranks default to GPU 0 on the CIRRUS-4x8-gpu node, which masks any cuda-aware MPI win because there is no cross-GPU traffic.

- pin-gpu.sh: tiny shim that sets CUDA_VISIBLE_DEVICES to (local_rank % visible_gpu_count) using whichever local-rank var the MPI runtime exports (MPI_LOCALRANKID / OMPI_COMM_WORLD_LOCAL_RANK / PMI_LOCAL_RANK / SLURM_LOCALID). No-op if no GPUs are detected, so safe when the container has no GPU mapping. Round-robin only; not yet wired into _test-gpu / _test-bfb (those can adopt it next pass).
- run-nsys-profile.sh: launch the model through pin-gpu.sh inside mpirun so the pin happens per child process, not once for the whole job. A comment now flags why the wrapper is there.
- profile-gpu-nsight.yml: replace the old single nvidia-smi probe (which fails inside containers shipping the GDK stub libnvidia-ml.so) with a structured diagnostic block that lists /dev/nvidia*, runs nvidia-smi -L and --query-gpu (both bypass libnvidia-ml), and dumps CUDA / OpenACC / MPI env vars. Goes into the workflow log and step summary so we can tell whether the runner exposed >1 GPU and what the model actually launched with.

Made-with: Cursor
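A hypothetical sketch of the pin-gpu.sh shim described above, assuming a plain POSIX-sh wrapper (the real script may differ in detail):

```sh
#!/bin/sh
# Pick whichever local-rank variable this MPI runtime exports; fall back to 0.
rank="${MPI_LOCALRANKID:-${OMPI_COMM_WORLD_LOCAL_RANK:-${PMI_LOCAL_RANK:-${SLURM_LOCALID:-0}}}}"
# Count visible GPUs; if nvidia-smi fails or lists none, leave pinning alone (no-op).
ngpus=$(nvidia-smi -L 2>/dev/null | wc -l)
if [ "${ngpus:-0}" -gt 0 ]; then
  # Round-robin ranks across the visible GPUs.
  export CUDA_VISIBLE_DEVICES=$(( rank % ngpus ))
fi
# Run the wrapped command (the model binary) under the per-rank pin.
exec "$@"
```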
The MPI-side env vars are only half the cuda-aware path. MPAS host-stages halo exchanges itself unless config_gpu_aware_mpi=.true. is set in &development; without it the MPI library never sees a device pointer and MPICH_GPU_SUPPORT_ENABLED has nothing to do.

Make cuda_aware_mpi a single combined knob: when true, it now both sets the MPI env vars (existing run-nsys-profile.sh logic) and patches nsight-case/namelist.atmosphere so MPAS uses acc host_data use_device around halo sends.

Idempotent: appends a &development block if missing, inserts the key if the block exists without it, or flips an existing .false. to .true.

Made-with: Cursor
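Roughly, the idempotent patch can be expressed with grep/sed along these lines (a sketch assuming GNU sed; the actual script logic may differ):

```bash
# Sketch: make the namelist edit idempotent across the three cases above.
nml=nsight-case/namelist.atmosphere
if ! grep -q '^&development' "$nml"; then
  # No &development block at all: append one with the key set.
  printf '&development\n    config_gpu_aware_mpi = .true.\n/\n' >> "$nml"
elif ! grep -q 'config_gpu_aware_mpi' "$nml"; then
  # Block exists but the key is missing: insert it after the block header.
  sed -i '/^&development/a config_gpu_aware_mpi = .true.' "$nml"
else
  # Key exists: flip .false. to .true. (no-op if already .true.).
  sed -i 's/\(config_gpu_aware_mpi *= *\)\.false\./\1.true./' "$nml"
fi
```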
Summary
Adds an opt-in `cuda_aware_mpi` input to the GPU Nsight profiling workflow and traces MPI in the Nsight session so we can actually see halo exchanges in the timeline.

Why

The Nsight workflow always ran with host-staged MPI even though the MPAS-A OpenACC build wraps halo exchanges with `acc host_data use_device(...)`. Without an explicit MPI-side opt-in, the MPI library silently `cudaMemcpy`s device buffers through host memory; the trace shows lots of H↔D copy traffic and the MPI calls themselves are missing because nsys was not tracing MPI.

Changes
- profile-gpu-nsight.yml: new `cuda_aware_mpi` boolean input (default `false`, preserves prior behaviour).
- run-nsys-profile.sh: when the input is true, sets the right env vars per MPI implementation:
  - MPICH: `MPICH_GPU_SUPPORT_ENABLED=1`
  - OpenMPI: `OMPI_MCA_pml=ucx`, `OMPI_MCA_osc=ucx`, `UCX_TLS=cuda,cuda_copy,cuda_ipc,sm,self`, plus `--mca pml ucx --mca osc ucx` on the mpirun line.
  These do nothing unless the container's MPI was built with GPU support; if not, MPI aborts loudly rather than silently degrading.
- run-nsys-profile.sh: nsys trace targets now `cuda,nvtx,osrt,mpi`. MPI calls now show up in the timeline regardless of the cuda-aware setting.
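Illustrative invocation with the expanded trace set; the rank count, binary name, and output path are placeholders, not the script's exact command line:

```bash
# Sketch only: shows the trace targets, not the real launch line.
nsys profile --trace=cuda,nvtx,osrt,mpi -o nsys-profile \
  mpirun -np 4 ./atmosphere_model
```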
How to use

Dispatch the workflow twice (e.g. with mpich + 240km), once with the input default and once with `cuda_aware_mpi: true`. Compare the resulting `nsys-profile.nsys-rep` artifacts:
- Host-staged run: `cudaMemcpyAsync` calls bracketing each MPI send/recv.
- CUDA-aware run (container MPI built `--with-cuda`): MPI calls go directly between device buffers.
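One way to dispatch the two comparison runs is with the GitHub CLI; the `mpi` and `resolution` input names below are illustrative guesses, only `cuda_aware_mpi` is the input added in this PR:

```bash
# Baseline (host-staged MPI), then the cuda-aware run for comparison.
gh workflow run profile-gpu-nsight.yml -f mpi=mpich -f resolution=240km
gh workflow run profile-gpu-nsight.yml -f mpi=mpich -f resolution=240km -f cuda_aware_mpi=true
```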
Test plan

- profile-gpu-nsight.yml with default input (regression baseline)
- profile-gpu-nsight.yml with `cuda_aware_mpi: true`, and confirm it either runs cleanly or aborts loudly (so we know whether the container's MPI is GPU-built)