
Using MPICH on Aurora@ALCF

Rob Latham edited this page Feb 20, 2026 · 20 revisions

This page describes how to build and use MPICH on the 'Aurora' machine at Argonne. Aurora uses Intel CPUs and GPUs with the Cray Slingshot interconnect. Support for Slingshot is provided via the system libfabric.

Prerequisite

As of 03/01/2024, it is best to build the MPICH git main branch for use on Aurora.

Build MPICH

Build ZE-enabled MPICH for use with the Cray PALS launcher

Presumably, since you are on Aurora, you want DAOS support as well, so the configure line below enables it.

./autogen.sh
./configure --prefix=/home/raffenet/proj/mpich/i \
    --with-device=ch4:ofi --with-libfabric=$(pkg-config libfabric --variable=prefix) \
    --with-ze --enable-ze-native=pvc \
    --with-pm=no --with-pmi=pmix --with-pmix=/usr \
    --with-daos=/usr \
    --with-cart=/usr \
    --with-file-system=daos+lustre+ufs \
    CC=icx CXX=icpx FC=ifx
    
make -j16 install

A correctly configured MPICH build should print the following message in configure output.

*****************************************************
***
*** device      : ch4:ofi
*** shm feature : auto
*** gpu support : ZE
***
*****************************************************
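
One way to check for this summary without scrolling back through the output is to save the configure output to a file and grep it. The following is only a minimal sketch: the file name configure.out and the helper check_mpich_summary are hypothetical, and the patterns simply mirror the banner shown above.

```shell
# Hypothetical helper: succeed only if the saved configure output reports
# the ch4:ofi device and ZE GPU support, as in the banner above.
check_mpich_summary() {
    log=$1
    grep -Eq '^\*\*\* device +: ch4:ofi' "$log" &&
    grep -Eq '^\*\*\* gpu support +: ZE' "$log"
}

# Assumes the output was saved first, e.g.:
#   ./configure <options> 2>&1 | tee configure.out
if [ -f configure.out ]; then
    if check_mpich_summary configure.out; then
        echo "configure summary looks right"
    else
        echo "configure summary is missing ch4:ofi or ZE" >&2
    fi
fi
```

If the check fails, the usual suspects are a missing --with-ze or a libfabric prefix that configure could not find.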

Running MPI Applications

On Aurora, MPICH is configured without a process manager (--with-pm=no). Instead, use the mpiexec that ships as part of Cray's PALS package. PALS_PMI=pmix must be set in the execution environment so that processes can query the runtime for job information. When launching processes on a single host, you must also set MPIR_CVAR_SINGLE_HOST_ENABLED=0 in the environment.

If you use the -ppn flag to mpiexec, you might get "shepherd died from signal 11". Hui says that's because PALS can't handle a PMIx_Fence with a non-world scope. To work around this, set MPIR_CVAR_PMI_DISABLE_GROUP=1. Or skip -ppn: if you ask for more processes than there are available nodes, the launcher will assign processes round-robin. The aurora_test branch already sets MPIR_CVAR_PMI_DISABLE_GROUP, so you'll only need to set it yourself if building from HEAD.

PALS_PMI=pmix must be set as well. The mpich module on Aurora sets it, but unloading that module may remove it from your environment. If it is not set, your job will be unable to launch.
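
Putting the environment settings from this section together, a launch wrapper might look roughly like the sketch below. NNODES, NPROCS, PPN and the application path ./my_mpi_app are hypothetical placeholders, not values taken from Aurora. The script only echoes the command when mpiexec or the application is not available.

```shell
# Hypothetical job parameters; override them from your job script.
NNODES=${NNODES:-1}
NPROCS=${NPROCS:-4}
PPN=${PPN:-4}
APP=${APP:-./my_mpi_app}

# Required so processes can query the PALS runtime through PMIx.
export PALS_PMI=pmix
# Works around "shepherd died from signal 11" when -ppn is used.
export MPIR_CVAR_PMI_DISABLE_GROUP=1
# Required when all processes are launched on a single host.
if [ "$NNODES" -eq 1 ]; then
    export MPIR_CVAR_SINGLE_HOST_ENABLED=0
fi

if command -v mpiexec >/dev/null 2>&1 && [ -x "$APP" ]; then
    mpiexec -n "$NPROCS" -ppn "$PPN" "$APP"
else
    echo "would run: mpiexec -n $NPROCS -ppn $PPN $APP"
fi
```

Setting the variables in the wrapper rather than relying on the mpich module keeps the launch working even after the module has been unloaded.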

Debugging

  • gdb: available, but not under the name gdb; the binary is gdb-oneapi
  • gdb4hpc: a pretty nice parallel wrapper around gdb, though it will probably time out if you have more than a few dozen processes

Common Issues

  • It's possible (maybe only with large counts) to trigger an unaligned memory error in the ZE code:
.../modules/yaksa/src/backend/ze/hooks/yaksuri_zei_type_hooks.c:101:30: runtime error: load of misaligned address 0xfbc50007672a3559 for type 'struct _ze_module_handle_t *', which requires 8 byte alignment
0xfbc50007672a3559: note: pointer points here
<memory cannot be printed>
Segmentation fault

If you don't need GPU processing, you can work around this by configuring MPICH with --without-ze.

Building libfabric for use on Aurora

The CXI provider is open source, but development happens in more than one repository. As of April 2025, libfabric has been successfully built and tested on Aurora using the v1.22.x-ss branch from the https://github.com/HewlettPackard/shs-libfabric repository.

NOTE: You must unload the system libfabric module or else prepend your installation to LD_LIBRARY_PATH to ensure the correct library is linked. We have seen issues where module unload libfabric does not actually modify LD_LIBRARY_PATH, so care is needed to ensure you link the correct library at runtime.

module unload libfabric
git clone -b v1.22.x-ss https://github.com/HewlettPackard/shs-libfabric
cd shs-libfabric
# a git checkout has no configure script until autogen.sh generates it
./autogen.sh
# disable verbs and efa to avoid picking up unnecessary dependencies
./configure --enable-cxi --disable-verbs --disable-efa --with-ze --prefix=<path/to/install>
make -j16 install
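
Because module unload libfabric may leave LD_LIBRARY_PATH untouched, it can help to prepend your install explicitly and then check which directory would supply libfabric first. The sketch below is one way to do that. LIBFABRIC_PREFIX and the helper first_libfabric_dir are hypothetical names, and it assumes the library was installed into the default lib directory under your prefix.

```shell
# Hypothetical install prefix; use the value you passed to --prefix.
LIBFABRIC_PREFIX=${LIBFABRIC_PREFIX:-$HOME/libfabric-install}
export LD_LIBRARY_PATH="$LIBFABRIC_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Print the first directory in a colon-separated list (default:
# LD_LIBRARY_PATH) that contains a libfabric.so* file; fail if none does.
# The body runs in a subshell so the IFS and globbing changes do not leak.
first_libfabric_dir() (
    paths=${1-$LD_LIBRARY_PATH}
    IFS=:
    set -f                      # no globbing while splitting the list
    for dir in $paths; do
        [ -n "$dir" ] || continue
        set +f                  # globbing back on for the file match
        for f in "$dir"/libfabric.so*; do
            if [ -e "$f" ]; then
                printf '%s\n' "$dir"
                exit 0
            fi
        done
        set -f
    done
    exit 1
)

found=$(first_libfabric_dir) || found="(none)"
if [ "$found" = "$LIBFABRIC_PREFIX/lib" ]; then
    echo "libfabric will be picked up from $found"
else
    echo "warning: first libfabric on LD_LIBRARY_PATH is in $found" >&2
fi
```

This only inspects LD_LIBRARY_PATH; running ldd on your application binary is the more direct check of what the loader will actually resolve.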
