…her_release The number of received bytes in release_gather_release is incorrectly cast between int and MPI_Aint. On most architectures this is not an issue, but on big-endian 64-bit architectures (s390x) the actual value is lost because only the 4 most significant bytes are copied. Fix the issue by writing the whole MPI_Aint into shm_buf instead of just an int. Signed-off-by: Nicolas Morey <nmorey@suse.com>
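The truncation described above can be reproduced on any machine by simulating the big-endian byte layout. This is a minimal sketch, not the MPICH code: the helpers `store_be64`, `read_first4_be`, and `read_all8_be` are hypothetical names, and `uint64_t` stands in for a 64-bit MPI_Aint.

```c
#include <stdint.h>

/* Lay out a 64-bit value in big-endian byte order, the way an
 * s390x-like machine stores it (simulated, so this runs anywhere). */
static void store_be64(unsigned char buf[8], uint64_t value)
{
    for (int i = 0; i < 8; i++) {
        buf[i] = (unsigned char) (value >> (56 - 8 * i));
    }
}

/* Buggy pattern: only the first 4 bytes are copied, as if the
 * 8-byte value were a 4-byte int. On big-endian these are the
 * most significant bytes. */
static uint32_t read_first4_be(const unsigned char buf[8])
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        v = (v << 8) | buf[i];
    }
    return v;
}

/* Fixed pattern: read back the whole 8-byte value. */
static uint64_t read_all8_be(const unsigned char buf[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | buf[i];
    }
    return v;
}
```

For a small byte count such as 1024, the first four bytes of the big-endian layout are all zero, so the truncated read loses the value entirely while the full read recovers it.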
In anysrc receive, we create both an NM request and a SHM request. The SHM request is the user-visible one. When the NM request matches (or is in progress), we cancel the SHM partner. But since the SHM request is user-visible, we can't touch its cancel bits. Instead, we use another mechanism to mark the SHM request as cancelled: resetting its anysrc_partner fields.
Don't assume the complex types are always available in CXX.
When we changed the signature of the launch_procs files, we neglected to update the header pbs.h.
In the multi-NIC case, a provider may supply invalid "null" PCI info, which causes hwloc to fail to obtain the topology. Rather than dealing with this invalid case in the topology code, guard against it and handle it in the higher layer. In the OFI multi-NIC case, we simply treat all NICs as equally close and distribute them equally among the ranks.
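The fallback described above could look roughly like the following. This is a sketch under stated assumptions: `struct nic_info`, its fields, and `choose_nic` are hypothetical, and round-robin by local rank is only one way to "equally distribute" NICs; the actual MPICH policy may differ.

```c
#include <stdbool.h>

/* Hypothetical per-NIC record; not an MPICH or libfabric type. */
struct nic_info {
    bool has_valid_pci;   /* false when the provider reported "null" PCI info */
    int distance;         /* topology distance, meaningful only with valid PCI info */
};

/* Pick a NIC index for a rank. If any NIC lacks valid PCI info,
 * skip the topology lookup and spread ranks round-robin (assumed policy).
 * Returns -1 when there are no NICs. */
static int choose_nic(const struct nic_info *nics, int num_nics, int local_rank)
{
    if (num_nics <= 0) {
        return -1;
    }
    for (int i = 0; i < num_nics; i++) {
        if (!nics[i].has_valid_pci) {
            /* Treat all NICs as equally close. */
            return local_rank % num_nics;
        }
    }
    /* All PCI info is valid: choose the closest NIC. */
    int best = 0;
    for (int i = 1; i < num_nics; i++) {
        if (nics[i].distance < nics[best].distance) {
            best = i;
        }
    }
    return best;
}
```

The guard lives in the caller rather than in the topology code, which matches the intent of handling the invalid case in the higher layer.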
mpipr.h is used to replace MPI calls with PMPI calls inside romio to prevent spurious hits in the profiling tools. This patch adds the large count _c APIs that we previously neglected. Add additional missed functions including MPI_Type_free_keyval, MPI_Comm_get_info, MPI_Type_create_hindexed etc. Also add MPIX_Type_iov and MPIX_Type_iov_len. Co-Authored-by: Lisandro Dalcin <dalcinl@gmail.com>
Some MPI File APIs are implemented outside romio.
Avoid calling MPI_ functions internally; replace them with internal impl functions.
The device_state->subdevices buffer allocated during initialization was not being freed in finalize_hook, causing a memory leak detected by AddressSanitizer. Fixes pmodels#7724.
When we successfully get topology information from PMIx, we should destroy it at finalize time to avoid memory leaks.
Since we now generate bindings for MPI-IO functions directly, we no longer need the hack in all_romio_symbols to force pulling romio functions (such as MPI_File_open) into libmpi. Including all_romio_symbols.c results in libpmpi.so referring to MPI_File_xxx symbols when profiling libraries are built.
Do not neglect the error returns from RNDV callbacks. If it is not appropriate to return an error during progress, the RNDV callbacks should take measures to return MPI_SUCCESS instead.
A missing error check in MPIDI_Reduce_intra_composition_alpha caused errors in MPIC_Recv to go undetected. The issue is triggered by a source to device intranode send going to the CMA path.
MPIDI_IPCI_TYPE__SKIP refers to the case where the buffer resides on a device but GPU IPC is unavailable. Since it is a device buffer, XPMEM and CMA cannot be used either. This logic applies to both the send side and the receive side when the receiver must decide whether to use IPC write. Return MPIDI_IPCI_TYPE__SKIP rather than MPIDI_IPCI_TYPE__NONE so the receive side can make the correct decision.
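The distinction between the two return values can be sketched as follows. Everything here is an assumption made for illustration: the enum, `classify_buffer`, and `may_try_host_copy` are hypothetical stand-ins and do not reproduce the MPIDI_IPCI_TYPE__* definitions or the real decision code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the IPC type values discussed above. */
enum ipc_type {
    IPC_TYPE_NONE,   /* nothing decided; host mechanisms remain candidates */
    IPC_TYPE_SKIP,   /* device buffer without GPU IPC: no IPC path is usable */
    IPC_TYPE_GPU     /* device buffer with GPU IPC available */
};

/* Classify a buffer. The same rule is applied on the send side and on
 * the receive side when deciding about IPC write (per the text above). */
static enum ipc_type classify_buffer(bool on_device, bool gpu_ipc_available)
{
    if (on_device) {
        return gpu_ipc_available ? IPC_TYPE_GPU : IPC_TYPE_SKIP;
    }
    return IPC_TYPE_NONE;
}

/* XPMEM- or CMA-style host copies are only attempted when the buffer
 * was not marked as an unusable device buffer (assumed check). */
static bool may_try_host_copy(enum ipc_type type)
{
    return type == IPC_TYPE_NONE;
}
```

With this shape, returning the "skip" value for a device buffer keeps the receiver from mistaking it for a host buffer that XPMEM or CMA could handle.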
The CMA path calls the kernel for the remote copy, but the kernel cannot handle device buffers. This is tricky when the recv buffer is on the device but we can't fall back to GPU IPC write, for example due to non-contiguous messages. Allocate a host bounce buffer to make it work.
In commit 214c6bc we switched to using the dev.anysrc_partner field to flag whether a partner request has been cancelled. However, there is a gap in anysource_irecv: before calling MPIDI_NM_mpi_irecv, it only sets anysrc_partner for the netmod request. MPIDI_NM_mpi_irecv may match and complete immediately, but it may miss cancelling the shm partner because the shm request's anysrc_partner is not set yet. To fix this, we need to make sure anysrc_partner is set for both the shm and nm requests at the same time. Move the macro MPIDI_REQUEST_SET_LOCAL to mpidpost.h and convert it into an inline function.
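One reading of the fix is that both partner pointers are linked before the netmod receive is posted, so an immediate match always finds a partner to cancel. The sketch below illustrates that ordering only; `struct request`, `link_partners`, `cancel_partner_on_match`, and `is_cancelled` are hypothetical and are not the MPICH request structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request with only the field relevant here. */
struct request {
    struct request *anysrc_partner;
};

/* Link both requests in one step, before either can be matched. */
static inline void link_partners(struct request *shm, struct request *nm)
{
    shm->anysrc_partner = nm;
    nm->anysrc_partner = shm;
}

/* Called when the netmod request matches: mark the shm partner as
 * cancelled by resetting the partner fields on both sides. */
static inline void cancel_partner_on_match(struct request *nm)
{
    struct request *shm = nm->anysrc_partner;
    if (shm != NULL) {
        shm->anysrc_partner = NULL;
        nm->anysrc_partner = NULL;
    }
}

/* In this sketch a cleared partner field means "cancelled". */
static inline bool is_cancelled(const struct request *shm)
{
    return shm->anysrc_partner == NULL;
}
```

If the shm side were linked only after posting the netmod receive, an immediate match would clear a field that is then overwritten, and the cancellation mark would be lost.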
Fortunately this is caught by the ubsan test large_type_sendrecv:
ofi_rndv_read.c:348:18: runtime error: load of value 182, which is not a valid value for type '_Bool'
test:mpich/ch4/most
dynamic_sendrecv is used in MPI_Intercomm_create. Mismatches between threads are prevented by the user-provided tag, so it is okay to yield during the blocking progress. Without the yield, MPI_Intercomm_create may block another thread's progress when the remote processes are not present (blocked by other communications). In the dynamic process accept/connect path, we force peer_comm's context id to 0. This is okay because the leader exchange is established with a specific pair of addresses and there are no other communications yet during leader_exchange.
The context_id is not reflected in get_dynamic_match_bits because MPIDI_UCX_DYNPROC_MASK masks it off. Stepping back, we can't safely rely on MPIDI_UCX_DYNPROC_MASK since we didn't set a protocol bit for dynamic exchanges. This commit defines MPIDI_UCX_DYNPROC and uses it to separate dynamic exchanges from other messages.
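The idea of a dedicated protocol bit can be sketched with a made-up 64-bit tag layout. The bit positions, macro names, and `make_match_bits` below are assumptions for illustration and do not reproduce the MPIDI_UCX tag layout.

```c
#include <stdint.h>

/* Hypothetical layout: [63] dynamic-exchange bit, [47:32] context id,
 * [31:0] user tag. */
#define SKETCH_PROTO_DYNPROC (UINT64_C(1) << 63)
#define SKETCH_CTX_SHIFT     32
#define SKETCH_CTX_MASK      (UINT64_C(0xFFFF) << SKETCH_CTX_SHIFT)
#define SKETCH_TAG_MASK      UINT64_C(0xFFFFFFFF)

/* Build match bits. The context id is always encoded; dynamic
 * exchanges are distinguished by the protocol bit, not by a mask
 * that hides the context id. */
static uint64_t make_match_bits(uint16_t context_id, uint32_t tag, int is_dynamic)
{
    uint64_t bits = ((uint64_t) context_id << SKETCH_CTX_SHIFT) | tag;
    if (is_dynamic) {
        bits |= SKETCH_PROTO_DYNPROC;
    }
    return bits;
}
```

Because the protocol bit is part of the match bits, a dynamic exchange can never match a regular message with the same context id and tag, and two dynamic exchanges with different context ids stay distinct.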
test:mpich/ch4/most
Pull Request Description
A batch of bug fixes and improvements intended for a 5.0.1 release.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.