Pipeline MPI failed tests not reported in summary
See this pipeline job: https://gitlab.kitware.com/mennodeij1/paraview/-/jobs/6657337
There are many failed tests due to MPI memory issues, but these failed tests are not listed in the summary at the end, which makes it very hard to see what is going on. To start with, it seems the MPI tests are running into memory limitations on the (virtual?) machines they run on?
Take for example a test I'm interested in:
```
220/227 Test #129: ParaView::VTKExtensionsPCGNSWriterCxx-MPI-TestPartitionedDataSetCollection ........................***Failed    0.24 sec
[1646213770.863287] [runner-juxqemm-project-375-concurrent-0:14076:0] mm_posix.c:164 UCX ERROR Not enough memory to write total of 4292720 bytes. Please check that /dev/shm or the directory you specified has more available memory.
[1646213770.863288] [runner-juxqemm-project-375-concurrent-0:14078:0] mm_posix.c:164 UCX ERROR Not enough memory to write total of 4292720 bytes. Please check that /dev/shm or the directory you specified has more available memory.
[1646213770.863606] [runner-juxqemm-project-375-concurrent-0:14078:0] uct_mem.c:149 UCX ERROR failed to allocate 4292720 bytes using md posix for mm_recv_desc: Out of memory
[1646213770.863621] [runner-juxqemm-project-375-concurrent-0:14078:0] mpool.c:193 UCX ERROR Failed to allocate memory pool (name=mm_recv_desc) chunk: Out of memory
[1646213770.863626] [runner-juxqemm-project-375-concurrent-0:14078:0] mm_iface.c:673 UCX ERROR failed to get the first receive descriptor
[1646213770.863639] [runner-juxqemm-project-375-concurrent-0:14076:0] uct_mem.c:149 UCX ERROR failed to allocate 4292720 bytes using md posix for mm_recv_desc: Out of memory
[1646213770.863644] [runner-juxqemm-project-375-concurrent-0:14076:0] mpool.c:193 UCX ERROR Failed to allocate memory pool (name=mm_recv_desc) chunk: Out of memory
[1646213770.863648] [runner-juxqemm-project-375-concurrent-0:14076:0] mm_iface.c:673 UCX ERROR failed to get the first receive descriptor
Abort(1615503) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1139)..............:
MPIDI_OFI_mpi_init_hook(1626):
create_endpoint(2284)........: OFI endpoint open failed (ofi_init.c:2284:create_endpoint:Invalid argument)
Abort(1615503) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1139)..............:
MPIDI_OFI_mpi_init_hook(1626):
create_endpoint(2284)........: OFI endpoint open failed (ofi_init.c:2284:create_endpoint:Invalid argument)
```
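The UCX errors point at an exhausted `/dev/shm`. As a quick sanity check on the runner, here is a minimal sketch, assuming the runners are Linux containers (Docker defaults `/dev/shm` to 64 MB, which a few MPI ranks can exhaust when UCX allocates its `mm_recv_desc` pools):

```python
import shutil

# Report the size of the POSIX shared-memory mount that UCX allocates from.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")

# Possible workarounds (whether they fit this runner setup is an assumption):
#   - start the container with a larger segment: docker run --shm-size=1g ...
#   - or steer UCX away from shared-memory transports: export UCX_TLS=tcp,self
```

If the reported total is around 64 MiB, that would be consistent with the "Not enough memory to write total of 4292720 bytes" messages above.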
This is the list of failed tests:
```
The following tests FAILED:
	 16 - ParaView::RemotingApplicationPython-GenerateIdScalarsBackwardsCompatibility (Failed)
	 24 - ParaView::RemotingApplicationPython-ProgrammableFilterProperties (Failed)
	 40 - ParaView::RemotingApplicationPython-RepresentationTypeHint (Failed)
	 42 - ParaView::RemotingApplicationPython-SaveScreenshot (Failed)
	 43 - ParaView::RemotingApplicationPython-ScalarBarActorBackwardsCompatibility (Failed)
	 77 - ParaView::RemotingApplicationPython-MPI-Batch-SymmetricTestFetchData (Failed)
	 89 - ParaView::RemotingApplicationPython::Module::pvbatch::paraview.tests (Failed)
	106 - SurfaceLIC-OfficeVSlice-Batch (Failed)
	167 - Catalyst::MPI::WaveletMiniApp.package_test_zip (Failed)
	173 - Catalyst::FileDriverMiniApp::Wavelet.ValidateChangingTime (Failed)
```
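Until the end-of-run summary includes the MPI failures, the failed tests can be pulled straight from the raw CTest log. A minimal sketch, with the line format taken from the excerpt above (the sample log string is just for illustration):

```python
import re

# Matches CTest result lines such as:
#   220/227 Test #129: Some::TestName ........***Failed    0.24 sec
FAILED_LINE = re.compile(r"Test #\d+: (\S+)\s+\.+\*{3}Failed")

def failed_tests(ctest_log: str) -> list[str]:
    """Return the names of all tests that CTest marked ***Failed."""
    return [m.group(1) for m in FAILED_LINE.finditer(ctest_log)]

log = (
    "219/227 Test #128: SomeOtherTest ........   Passed    0.10 sec\n"
    "220/227 Test #129: ParaView::VTKExtensionsPCGNSWriterCxx-MPI-"
    "TestPartitionedDataSetCollection ........***Failed    0.24 sec\n"
)
print(failed_tests(log))
# -> ['ParaView::VTKExtensionsPCGNSWriterCxx-MPI-TestPartitionedDataSetCollection']
```

This only works around the reporting gap; the summary itself should still list these failures.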