Ghost Cell Generator causes core dump in certain circumstances
With ParaView 5.5.0, I have been unable to run a ParaView Catalyst script with a LANL code on tens of thousands of nodes when a ghost cell generator is in the pipeline. I have replicated the problem with a 21 million cell partitioned dataset (a pvtu with 64 vtus). It tars up to 274 MB with zlib (.tar.gz).

The problem exposes itself on 4 nodes with 8 procs per node (total=32) on the Trinitite supercomputer, using the ParaView 5.5.0 client from pkg and a pvserver built with:

intel/17.0.4 PrgEnv-intel/6.0.4 cray-mpich/7.7.0 python/2.7-anaconda-4.1.1 paraview/5.5.0-osmesa

The problem does not expose itself on the same computer using 2 nodes with 16 procs per node. This is possibly a system problem ...

@wascott @boonth

testme4.py

The attached file is used to test on Haswell:

1 node, 64 cores: works
2 nodes, 64 cores: doesn't
` projects/gcg_test> srun -N 2 -n 64 pvbatch ./testme4.py
Generic Warning: In /tmp/pv-5.5.0-again/SRC/paraview/VTK/Parallel/MPI/vtkMPICommunicator.cxx, line 71
MPI had an error
Message truncated, error stack:
PMPI_Test(178)...................: MPI_Test(request=0x22d0f30, flag=0x7fffffff282c, status=0x7fffffff2818) failed
MPIR_Test_impl(67)...............:
MPID_nem_gni_lmt_start_recv(1667): Message from rank 30 and tag 9001 truncated; 1611360 bytes received but buffer size is 185760
Rank 32 [Wed May 16 19:07:00 2018] [c0-0c2s4n3] application called MPI_Abort(comm=0x84000007, 942310158) - process 32
Generic Warning: In /tmp/pv-5.5.0-again/SRC/paraview/VTK/Parallel/MPI/vtkMPICommunicator.cxx, line 71
MPI had an error
Message truncated, error stack:
PMPI_Test(178)...................: MPI_Test(request=0x27ba9a0, flag=0x7fffffff282c, status=0x7fffffff2818) failed
MPIR_Test_impl(67)...............:
MPID_nem_gni_lmt_start_recv(1667): Message from rank 32 and tag 9001 truncated; 1006776 bytes received but buffer size is 1006152
Rank 30 [Wed May 16 19:07:00 2018] [c0-0c2s4n2] application called MPI_Abort(comm=0x84000004, 203064078) - process 30
srun: error: nid00147: task 32: Aborted
srun: Terminating job step 333504.13
slurmstepd: error: *** STEP 333504.13 ON nid00146 CANCELLED AT 2018-05-16T19:07:01 ***
srun: error: nid00146: task 30: Aborted
srun: error: nid00146: tasks 0,4-6,8-9,11-16,18,20,22-23,25,27,31: Terminated
srun: error: nid00147: tasks 34-54,56-58,60-63: Terminated
srun: error: nid00147: tasks 33,55: Terminated
srun: error: nid00146: tasks 1-3,7,10,17,19,21,24,26,28-29: Terminated
srun: error: nid00147: task 59: Terminated
srun: Force Terminated job step 333504.13
projects/gcg_test> srun -N 1 -n 64 pvbatch ./testme4.py
SWR detected AVX2 instruction support (using: libswrAVX2.so).
SWR detected AVX2 instruction support (using: libswrAVX2.so).
projects/gcg_test> `
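In the failing run, ranks 30 and 32 each report a truncated message from the other on tag 9001, and the two ranks sit on different nodes (nid00146 and nid00147). To compare the size mismatches across runs, the truncation lines can be pulled out of a saved log. The following is only a sketch of a log-parsing helper: the names `parse_truncations` and `SAMPLE_LOG` are hypothetical, and the regular expression assumes the message wording shown in the log above.

```python
import re

# Matches the "Message truncated" detail line as it appears in the log above.
# The wording is an assumption taken from this one log; other MPI builds may
# format the message differently.
TRUNCATION_RE = re.compile(
    r"Message from rank (?P<source>\d+) and tag (?P<tag>\d+) truncated; "
    r"(?P<received>\d+) bytes received but buffer size is (?P<buffer>\d+)"
)


def parse_truncations(log_text):
    """Return one dict per truncation message found in log_text."""
    records = []
    for match in TRUNCATION_RE.finditer(log_text):
        record = {key: int(value) for key, value in match.groupdict().items()}
        # Bytes that did not fit into the posted receive buffer.
        record["overflow"] = record["received"] - record["buffer"]
        records.append(record)
    return records


# The two truncation lines copied from the failing 2-node run.
SAMPLE_LOG = """\
MPID_nem_gni_lmt_start_recv(1667): Message from rank 30 and tag 9001 truncated; 1611360 bytes received but buffer size is 185760
MPID_nem_gni_lmt_start_recv(1667): Message from rank 32 and tag 9001 truncated; 1006776 bytes received but buffer size is 1006152
"""

if __name__ == "__main__":
    for rec in parse_truncations(SAMPLE_LOG):
        print(rec)
```

The sender is larger than the receiver's buffer in both directions, which suggests the two ranks disagree about the size of the exchange rather than one side simply running out of memory; that reading is a guess from the log, not a confirmed diagnosis.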
Sample data set here: one.tgz

It will break with four nodes and 8 or more ranks, using the testme4.py script linked above:

srun -N4 -n8 pvbatch testme4.py