mpir/pmi: dynamic world_id for singleton initialisations#7750

Open
nmnobre wants to merge 5 commits into pmodels:main from nmnobre:singleton

@nmnobre nmnobre commented Mar 18, 2026

The issue

Refs #7374.

Two unrelated singleton applications can interfere with each other if MPICH is configured with shared memory support. In the current code, the world_id of any two singleton apps is the same, so MPL_initshm_open() tries to open the same shared memory object for both. This is especially troublesome because one user can affect another user's runs (which is how I found this, after much digging).
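
To illustrate the collision described above, here is a minimal sketch. It assumes a plain 32-bit FNV-1a hash as a stand-in for HASH_FNV and a made-up name format; fnv1a32(), shm_name_for_world() and the "/example_shm_" prefix are hypothetical and are not taken from MPICH.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for HASH_FNV: plain 32-bit FNV-1a over a C string. */
static uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char) *s;       /* mix in one byte */
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Hypothetical naming scheme: the name format actually used by MPICH
 * may differ. The point is only that the name is derived from world_id. */
static void shm_name_for_world(char *buf, size_t len, uint32_t world_id)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "/example_shm_%08x", (unsigned) world_id);
}
```

Under these assumptions, every process whose KVS name is "singinit" computes the same world_id and therefore the same shared memory name, regardless of which user started it.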

Consider a user who runs a singleton app and then a) ctrl-c's it, b) forgets MPI_Finalize(), or c) launches a very long-running app. That user effectively takes the shared memory location hostage, and a second user cannot run their own singleton application (in cases a) and b), probably until the system is rebooted). All the second user gets is MPIDI_POSIX_comm_bootstrap(288).......: Out of memory.

The suggested solution

At first, my patch was in MPIDI_POSIX_comm_bootstrap() because that is closer to the code where the problem shows up. But then I thought it makes more sense to change the world_id from the get-go, as that more closely mirrors the non-singleton approach, which already uses a dynamic id.

This was quite hard to debug, but also quite fun, so I'd like to see this through, even if the current solution isn't satisfactory. :)

Cheers,
-Nuno

    if (pmi_kvs_name) {
        HASH_FNV(pmi_kvs_name, strlen(pmi_kvs_name), world_id);
        if (strcmp(pmi_kvs_name, "singinit") == 0) {
            world_id = getpid();

Contributor

I think the solution is acceptable for now. Let's add a comment to explain the reason:

/* world_id is used to create the shared memory name. We need to make sure
 * it is unique in case of multiple instances of singleton-init */

Contributor Author

I've expanded on the existing docstring for world_id instead, so all notes are in the same place. :)
I've also fixed a number of typos in the docs.

Contributor

Please consider making this solution compatible with PMIx, i.e., also compare pmi_kvs_name to "0" to detect the PMIx singleton case here. :)

Contributor Author

Definitely, thanks!
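
For reference, the combined check discussed in this thread could look roughly like the sketch below. It is only an illustration: is_singleton_kvs() and choose_world_id() are hypothetical helpers, not functions from the patch, and the sketch assumes the hashed id has already been computed by the caller. The "singinit" string comes from the diff above and "0" from the PMIx suggestion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: nonzero if the KVS name marks a singleton init.
 * "singinit" is the name checked in the patch; "0" is the PMIx
 * singleton case mentioned in the review. */
static int is_singleton_kvs(const char *pmi_kvs_name)
{
    assert(pmi_kvs_name != NULL);
    return strcmp(pmi_kvs_name, "singinit") == 0 ||
           strcmp(pmi_kvs_name, "0") == 0;
}

/* Hypothetical helper: use the process id for singletons, otherwise keep
 * the id already hashed from the KVS name. */
static uint32_t choose_world_id(const char *pmi_kvs_name, uint32_t hashed_id)
{
    if (is_singleton_kvs(pmi_kvs_name))
        return (uint32_t) getpid();
    return hashed_id;
}
```

A process id is unique only among processes alive at the same time on one host, so this relies on singleton runs not outliving their pid; that matches the "acceptable for now" assessment above rather than a guarantee of uniqueness.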
