CTH reads file 0 for all processes
This issue was created automatically from an original Mantis Issue. Further discussion may take place here.
We suspect that large Cray clusters serialize access to a file when multiple pvservers try to open that same file. As we scale into the thousands of pvservers, we believe this is becoming fatal.
ParaView 3.12.0, remote server (I am using 8 processes), Linux client. Although I am sure you can replicate this with any CTH dataset, I am doing the following:
- Make soft links (ln -s) to files spcth.0, spcth.1, spcth.2 and spcth.3 of Dave's big CTH AMR dataset (256 files in total). This gives a 4-file subset of the dataset.
- Run the pvservers under strace by prefixing the usual launch command with: strace -o $HOME/pvserver.strace -tt -f -ff -e trace=open,close,read,write
- This will create a separate trace file for each process. Do an ls -ls on these files. The smaller ones are not of interest; the larger ones are from lib/paraview3.12/pvserver, and those are the ones we care about.
- Note that 4 of the pvserver traces are slightly larger than the rest. These are the ones to look at.
Open each of these files in turn and search for spcth. Notice that each trace shows the process opening file 0 four times, and then opening its real file two times.
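The manual search can also be scripted. Below is a minimal Python sketch, not part of ParaView, that counts how often each spcth.N file is opened in every per-process trace. It assumes the traces are named pvserver.strace.<pid> (what strace -ff produces for the -o value used above) and that open calls appear in the usual strace form, for example open("/path/spcth.0", O_RDONLY) = 7. The function names and the regular expression are illustrative only, and failed opens are counted as well.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the quoted path argument of an open()/openat() call in strace output,
# e.g.  12:00:00.000001 open("/data/spcth.0", O_RDONLY) = 7
# Assumption: the CTH files are named spcth.<index>.
OPEN_RE = re.compile(r'\bopen(?:at)?\([^"]*"([^"]*spcth\.(\d+))"')


def count_spcth_opens(lines):
    """Return a Counter mapping CTH file index -> number of open calls seen."""
    counts = Counter()
    for line in lines:
        match = OPEN_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


def summarize(trace_dir, prefix="pvserver.strace"):
    """Count spcth opens in every per-process trace file under trace_dir."""
    result = {}
    for path in sorted(Path(trace_dir).glob(prefix + ".*")):
        with open(path, errors="replace") as handle:
            result[path.name] = count_spcth_opens(handle)
    return result
```

Calling summarize(Path.home()) returns one Counter per trace file. If the behavior described above holds, the trace of a server assigned to file 2 would report index 0 four times and index 2 twice.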
As stated, we believe that these 4 opens of file 0 by every process are fatal for Cielo and possibly other Cray systems.
This is a showstopper bug for Cielo going into production with datasets of the expected size.
I will send the log files from my run to Utkarsh and Robert. I am marking this as a crash, although technically it is a hang (or a glacier - take your pick).