Fix MPI deadlocks

When using MPI, the following happens at application start:

  • Rank 0 initializes its services and calls WaitForExit, where it runs the while(!rpl.empty()..) loop.
  • Satellites initialize their services and then call WaitForExit, which blocks on ProcessRMIs.

When a new servermanager message arrives, rank 0 receives it (via, e.g., RequestObservables) and broadcasts it to all satellites. The message breaks the satellites out of ProcessRMIs and fills the run_loop on the main thread wherever the engine coordinate is used. Each satellite's main thread then returns to ProcessRMIs.

However, if any message processed on a satellite carries the run_on_main_thread tag or uses RunAsBlocking, it blocks forever when it is handled while the satellite's main thread is waiting in ProcessRMIs, since the main thread is not servicing the run_loop at that point.
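The hazard can be reproduced outside ParaView with a small standard-library sketch. This is not the project's code: `BlockingCallCompletes`, `runLoop`, and the `mainThreadRunsLoop` flag are hypothetical names chosen for illustration, and the timeout exists only so the demo terminates (the real blocking call would wait forever). The calling thread plays the satellite's main thread, and the worker plays a message handler that posts work to the main thread and then waits for it.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical model, not ParaView code. Returns true if a task posted to the
// "main thread" finishes before the timeout, false if it would have hung.
bool BlockingCallCompletes(bool mainThreadRunsLoop,
  std::chrono::milliseconds timeout = std::chrono::milliseconds(300))
{
  assert(timeout.count() > 0);
  std::mutex mutex;
  std::condition_variable cv;
  std::queue<std::packaged_task<void()>> runLoop; // stand-in for run_loop
  bool workerDone = false;
  bool completed = false;

  // Stand-in for a message handler that needs the main thread and blocks
  // until the main thread has run the task.
  std::thread worker([&] {
    std::packaged_task<void()> task([] {});
    std::future<void> done = task.get_future();
    {
      std::lock_guard<std::mutex> lock(mutex);
      runLoop.push(std::move(task));
    }
    cv.notify_all();
    // Assumption: a timed wait replaces the real unbounded wait.
    completed = done.wait_for(timeout) == std::future_status::ready;
    {
      std::lock_guard<std::mutex> lock(mutex);
      workerDone = true;
    }
    cv.notify_all();
  });

  // The calling thread acts as the satellite's main thread.
  std::unique_lock<std::mutex> lock(mutex);
  while (!workerDone)
  {
    if (mainThreadRunsLoop && !runLoop.empty())
    {
      std::packaged_task<void()> task = std::move(runLoop.front());
      runLoop.pop();
      lock.unlock();
      task(); // run the posted work on the "main thread"
      lock.lock();
      continue;
    }
    // With mainThreadRunsLoop == false this models being parked in
    // ProcessRMIs: the thread is awake-able but never drains runLoop.
    cv.wait(lock);
  }
  lock.unlock();
  worker.join();
  return completed;
}
```

Calling `BlockingCallCompletes(true)` models a main thread that drains its queue and lets the posted task finish, while `BlockingCallCompletes(false)` models a main thread parked in a blocking wait, where the posted task never runs and only the artificial timeout lets the caller return.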

Edited by Christos Tsolakis
