When merging a trace with the parallel merger of Extrae 4.2.12 (and probably newer and older versions), e.g. mpimpi2prv -f TRACE.mpits with 308 processors, the merge gets stuck.
I used MUST to confirm the observed deadlock and to get some hints about where it is happening:
mpi2prv: Error! Found unmatched communication! Continuing...
mpi2prv: Progress ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Error! Found 2874 unmatched communications. Resulting tracefile may be inconsistent.
mpi2prv: Error! Found 94610 pending communications. Resulting tracefile may be inconsistent.
[MUST-RUNTIME] ============MUST===============
[MUST-RUNTIME] ERROR: MUST detected a deadlock, writing output.===============================
[MUST-RUNTIME] ============MUST===============
[MUST-RUNTIME] ERROR: MUST detected a deadlock, detailed information is available in the MUST output file. You should either investigate details with a debugger or abort, the operation of MUST will stop from now.
[MUST-RUNTIME] ===============================
[MUST-RUNTIME] ----Deadlock detection timing ----
[MUST-RUNTIME] syncTime=3500366
[MUST-RUNTIME] wfgGatherTme=323
[MUST-RUNTIME] preparationTime=1448
[MUST-RUNTIME] wfgCheckTime=1797
[MUST-RUNTIME] outputTime=25416
[MUST-RUNTIME] dotTime=0
The offending MPI_Recv is located in extrae/src/merger/paraver/paraver_generator.c, line 1240 in b98693d:
res = MPI_Recv (&tmp, 1, MPI_INT, my_master, ASK_MERGE_REMOTE_BLOCK_TAG, MPI_COMM_WORLD, &s);
In this execution, process 26 appears to have been waiting in that MPI_Recv for a message from process 0, which in turn was waiting in a barrier:
extrae/src/merger/paraver/paraver_generator.c, line 1397 in b98693d:
res = MPI_Barrier (MPI_COMM_WORLD);
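To illustrate why this state cannot resolve on its own, here is a small sketch in plain C (no MPI needed) that models the snapshot as a wait-for relation. It is not Extrae or MUST code: the types, the function names and the fixpoint check are all hypothetical. It also assumes that every rank other than 26 had already reached the barrier, which the report above does not confirm.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 1024   /* arbitrary upper bound for this sketch */
#define NUM_RANKS 308    /* number of processes used in the report */

typedef enum { RUNNING, IN_RECV, IN_BARRIER } wait_kind;

typedef struct {
    wait_kind kind;
    int recv_source;     /* rank waited on; used only when kind == IN_RECV */
} rank_state;

/* Count the ranks that can never leave their blocking call in this snapshot.
 * A rank in a receive is released only if its source can still make progress.
 * A rank in the barrier is released only if every rank outside the barrier
 * can still make progress (and so may eventually reach the barrier). */
int count_stuck_ranks(const rank_state *st, int n)
{
    bool can_progress[MAX_RANKS];
    bool changed = true;
    int stuck = 0;

    assert(n > 0 && n <= MAX_RANKS);

    for (int r = 0; r < n; r++)
        can_progress[r] = (st[r].kind == RUNNING);

    /* Propagate "can make progress" until nothing changes any more. */
    while (changed) {
        changed = false;

        bool barrier_ok = true;
        for (int r = 0; r < n; r++)
            if (st[r].kind != IN_BARRIER && !can_progress[r])
                barrier_ok = false;

        for (int r = 0; r < n; r++) {
            bool ok = false;
            if (can_progress[r])
                continue;
            if (st[r].kind == IN_RECV) {
                assert(st[r].recv_source >= 0 && st[r].recv_source < n);
                ok = can_progress[st[r].recv_source];
            } else if (st[r].kind == IN_BARRIER) {
                ok = barrier_ok;
            }
            if (ok) {
                can_progress[r] = true;
                changed = true;
            }
        }
    }

    for (int r = 0; r < n; r++)
        if (!can_progress[r])
            stuck++;
    return stuck;
}

/* Build the snapshot from the report: rank 26 blocked in a receive from
 * rank 0, all other ranks (assumed) in the barrier. If master_in_barrier is
 * false, rank 0 is modelled as still running and able to send. */
int stuck_in_reported_scenario(bool master_in_barrier)
{
    rank_state st[NUM_RANKS];

    for (int r = 0; r < NUM_RANKS; r++) {
        st[r].kind = IN_BARRIER;
        st[r].recv_source = -1;
    }
    st[26].kind = IN_RECV;
    st[26].recv_source = 0;
    if (!master_in_barrier)
        st[0].kind = RUNNING;

    return count_stuck_ranks(st, NUM_RANKS);
}
```

Under these assumptions, stuck_in_reported_scenario(true) reports all 308 ranks as stuck, because rank 26 waits on rank 0 while the barrier waits on rank 26. With rank 0 still running, the same check reports no stuck ranks. This suggests the fix has to ensure rank 0 answers the pending request before it enters the barrier.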
I attached the whole MUST output in case you need it to debug the case :)