Parallel merger mpimpi2prv hangs in a deadlock #141

@valentin-seitz

When merging a trace with the parallel merger of Extrae 4.2.12 (and probably other versions, both newer and older), e.g. mpimpi2prv -f TRACE.mpits with 308 processors, the merge gets stuck.

I used MUST to confirm the observed deadlock and to get some hints on where it's happening:

mpi2prv: Error! Found unmatched communication! Continuing...
mpi2prv: Progress ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Error! Found 2874 unmatched communications. Resulting tracefile may be inconsistent.
mpi2prv: Error! Found 94610 pending communications. Resulting tracefile may be inconsistent.
[MUST-RUNTIME] ============MUST===============
[MUST-RUNTIME] ERROR: MUST detected a deadlock, writing output.===============================
[MUST-RUNTIME] ============MUST===============
[MUST-RUNTIME] ERROR: MUST detected a deadlock, detailed information is available in the MUST output file. You should either investigate details with a debugger or abort, the operation of MUST will stop from now.
[MUST-RUNTIME] ===============================
[MUST-RUNTIME] ----Deadlock detection timing ----
[MUST-RUNTIME] syncTime=3500366
[MUST-RUNTIME] wfgGatherTme=323
[MUST-RUNTIME] preparationTime=1448
[MUST-RUNTIME] wfgCheckTime=1797
[MUST-RUNTIME] outputTime=25416
[MUST-RUNTIME] dotTime=0
The offending MPI_Recv is located in

res = MPI_Recv (&tmp, 1, MPI_INT, my_master, ASK_MERGE_REMOTE_BLOCK_TAG, MPI_COMM_WORLD, &s);

In this execution, process 26 appears to have been waiting in that MPI_Recv for a message from process 0, which was itself waiting in a barrier:

res = MPI_Barrier (MPI_COMM_WORLD);
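
To make the wait-for relationship above concrete, here is a toy model in C. It is only a sketch under simplifying assumptions, not Extrae code and not the algorithm MUST uses: every name in it (proc_state, has_deadlock, and so on) is hypothetical, each process is either running, blocked in a receive from one source, or blocked in a barrier, and recv_from is assumed to be a valid index. A receive can complete only if its source can still make progress, and a barrier can complete only if every process that has not yet reached it can still make progress.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model for illustration; not Extrae or MUST code. */
typedef enum { RUNNING, IN_RECV, IN_BARRIER } state_kind;

typedef struct {
    state_kind kind;
    int recv_from;  /* source rank, only meaningful when kind == IN_RECV */
} proc_state;

/*
 * Computes, as a fixpoint, which processes can still make progress.
 * can_progress must point to n writable entries.
 * Returns true if at least one process can never make progress.
 */
bool has_deadlock(const proc_state *p, size_t n, bool *can_progress)
{
    /* Processes that are not blocked can trivially make progress. */
    for (size_t i = 0; i < n; i++)
        can_progress[i] = (p[i].kind == RUNNING);

    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            if (can_progress[i])
                continue;

            bool ok = true;
            if (p[i].kind == IN_RECV) {
                /* The receive completes only if the sender can still send. */
                ok = can_progress[p[i].recv_from];
            } else {
                /* IN_BARRIER: every process not yet at the barrier
                 * must still be able to reach it. */
                for (size_t j = 0; j < n; j++)
                    if (p[j].kind != IN_BARRIER && !can_progress[j])
                        ok = false;
            }

            if (ok) {
                can_progress[i] = true;
                changed = true;
            }
        }
    }

    for (size_t i = 0; i < n; i++)
        if (!can_progress[i])
            return true;
    return false;
}
```

With one process in IN_RECV waiting on rank 0 and rank 0 in IN_BARRIER, neither can ever be marked as able to progress, so has_deadlock returns true. This is the same shape as the situation MUST reported for ranks 26 and 0.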

I attached the whole MUST output in case you need it to debug the case :)

extrae-merger-hangs.zip
