ParaView HPC Scalability Guide
Please add the HPC Scalability Guide that was created by Mathieu to the ParaView Reference manual. We don't want to lose this really spectacular document.
Document is here: https://www.kitware.com/paraview-hpc-scalability-guide/
My edit notes, from a year ago, for whatever they are worth, were:
Just because I love bullets, let me do this as bullets.
- I really like this document. We have needed it for a very long time. Well done.
- Taking the title literally, I would strongly argue that you increase the width of this topic. In my experience, performance issues (i.e., can something run well or not) are – in order – 1) Do you have enough memory? How can you tell? (Memory inspector, see below). How do you get your data visualized? (More nodes, fewer variables, smaller dataset. Time count doesn’t matter. Structured over unstructured.). If it won’t fit in memory, cpu performance doesn’t matter. 2) Disk read speed. You covered this. Basically, faster disk, more nodes reading it. I will say that there is one subtopic here – Is the reader doing dumb stuff? For instance, the Exodus reader is HORRIBLE on an uncached disk. It reads a byte at a time, each of which brings in a disk sector. 3) Rendering speeds. Note that speeds don’t need to be anything near what they are for games. 8 or so FPS is plenty. Stefan, a user that visualizes fire, puts up with a frame every 5 or 10 seconds. (He also does insane volume rendering of hero size datasets.). 4) Algorithm speeds. You covered this very well.
- What IS big data, and what is huge? Memory usage (watch the Memory Inspector; never go over 70%). Number of variables (over a hundred is huge; only load what you need). Number of blocks (over 100 is huge; only load what you need). Number of sets (over 100 is huge; only turn on what you need). Number of files (over 10,000 is huge, and I have seen millions; DON'T turn on the advanced information in the Open dialog). Number of timesteps (hundreds are big, thousands are huge). (A sketch of loading only what you need appears after these bullets.)
- You did say this, but for someone who doesn't know what distributed data is, elaborate. This is one of the biggest issues I have: users decide that 1 node (16 or 32 cores) isn't enough to read one file, so they bring in 8 nodes. (A sketch of a distributed pvserver session appears after these bullets.)
- There is a fairly large disagreement within the community with regard to software rendering vs. GPUs. I believe LLNL has dedicated clusters with GPUs for vis servers. LANL has dedicated clusters; I'm not sure about GPUs. Sandia goes a different direction that has been extremely successful: ParaView is so incredibly good at reducing rendering load, and so incredibly good at distributed server rendering, that we just ask for a small slice of the big clusters. Think of it this way: if I have 5 million dollars for viz, do I want a dedicated small cluster (with GPUs), or a bigger primary cluster with more nodes and memory? Don't forget this gives us extreme flexibility to expand or contract our node count at low cost. All back-end pvserver rendering at Sandia uses OSMesa, and we get frame rates around 50 per second, which is plenty good. Having said all this, we did once implement the back end on a GPU cluster (Doom), and for the worst cases frame rates did double.
- Your write-up is somewhat incorrect and confusing when a user has one or more remote, distributed pvservers. I get caught by this all the time in testing: if you are running a small dataset (can.exo, disk_out_ref.exo), ParaView does not render the data remotely; it ships the extracted surface to the client and renders it there. If the data is bigger, it renders the surface on the server. The "switch" is of course Settings > Rendering > Remote Rendering Threshold.
- If a user has a slow or long connection (such as when I am in Albuquerque and connected to a cluster in Los Alamos), I can, if needed, decrease the number of pixels moved back and forth: Edit > Settings > Advanced > Rendering > Image Reduction Factor. We haven't needed to tweak this in years. (A hedged Python sketch of both this and the remote-rendering threshold appears after these bullets.)
- For advanced users, there are three ways I try to figure out "hot spots". First, build a debug build and run ParaView in a debugger; stop a lot, and if you keep seeing the same spot over and over, that's the problem. Second, profilers, though doing this on a remote server is a challenge. Last, the good old, ancient print statements in the code. This paragraph is probably too much information; better may be "contact Kitware, we know stuff".
- The progress bar is useless. I think we really should get rid of it and just add a macOS-type spinning wheel of death. The reasons are fairly complex. The time you really need a progress bar is with a remote, distributed server. Ten years ago we were passing progress information back to rank 0 so often that the progress bar was significantly slowing ParaView down. Now we just use rank zero and assume that's good enough, which it isn't. Many filters go from 0 to 100% done with nothing in between. Many pipelines just flash back and forth like crazy ants running around. Again, the progress bar is useless.
- You mentioned the Timer Log (excellent), but forgot to mention Utkarsh's more granular and selectable Log Viewer. The Log Viewer is an even more useful tool, especially once you figure out how to use it and start embedding it in the code with better granularity.
- A really good way to check rendering speed is with a dataset that is known and that everyone has, i.e., Sources > Fast Uniform Grid, Sphere, or Unstructured Cell Types. Make LOTS of cells and spin. This is a way to push a huge number of vertices, or pixels, down the pipe. (A minimal script for this appears after these bullets.)
- There are minor grammatical mistakes in the document. The meaning is perfect, but they are slightly distracting. If you give me a text copy, I can go through it and edit appropriately.
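
A few of the points above lend themselves to small, concrete sketches. First, "only load what you need": a minimal pvpython example of restricting an Exodus read to a few arrays and blocks. The file, array, and block names are hypothetical, and the property names shown are those of the legacy ExodusIIReader; the newer IOSS reader exposes different properties.

```python
# Hedged sketch: open an Exodus file but load only the needed arrays/blocks.
# File, array, and block names below are hypothetical examples.
from paraview.simple import *

reader = ExodusIIReader(FileName=['/path/to/results.exo'])

# Restrict what gets read instead of accepting the defaults.
reader.PointVariables = ['DISPL', 'VEL']   # nodal arrays you actually use
reader.ElementVariables = ['EQPS']         # element arrays you actually use
reader.ElementBlocks = ['block_1']         # only the blocks you need

reader.UpdatePipeline()
print(reader.GetDataInformation().GetNumberOfCells())
```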
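Second, for readers new to distributed data, a minimal sketch of a multi-node pvserver session. The host name, port, rank count, and file path are hypothetical; the key point is that extra nodes only help when the reader actually partitions the file across ranks.

```python
# Hedged sketch: connect a client to a distributed pvserver.
# On the cluster (hypothetical rank count; one MPI rank per core):
#   mpiexec -np 32 pvserver --server-port=11111
#
# On the client (pvpython, or the Python Shell in the ParaView GUI):
from paraview.simple import *

Connect('cluster-login-node', 11111)   # hypothetical host and port

# Each pvserver rank reads and holds only its piece of the data -- but only
# if the reader supports distributed reading; otherwise rank 0 reads it all
# and the extra nodes mostly sit idle.
reader = OpenDataFile('/scratch/example/results.exo')   # hypothetical path
reader.UpdatePipeline()
print(reader.GetDataInformation().GetNumberOfCells())
```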
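Third, the remote-rendering threshold and image reduction factor can also be set from Python when scripting tests. This is a hedged sketch only: the property and settings-proxy names are assumptions taken from the GUI labels and may differ between ParaView versions, hence the defensive checks.

```python
# Hedged sketch: tune remote rendering from Python. Property names are
# assumptions based on the GUI labels and may vary by ParaView version.
from paraview.simple import *

view = GetRenderView()

# Geometry below this size (MB) is shipped to the client and rendered there;
# anything larger is rendered on the server. 0 forces server-side rendering.
if hasattr(view, 'RemoteRenderThreshold'):
    view.RemoteRenderThreshold = 20

# On a slow or long connection, shrink the images sent back to the client.
settings = GetSettingsProxy('RenderViewSettings')
if settings is not None and hasattr(settings, 'ImageReductionFactor'):
    settings.ImageReductionFactor = 2   # e.g. send half-resolution images

Render()
```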
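Last, the "known dataset, lots of cells, spin" rendering test as a minimal pvpython script, here using a high-resolution Sphere (Fast Uniform Grid or Unstructured Cell Types work the same way); the resolutions and frame count are arbitrary.

```python
# Hedged sketch: generate lots of geometry, spin the camera, estimate FPS.
import time
from paraview.simple import *

sphere = Sphere(ThetaResolution=2000, PhiResolution=2000)  # millions of cells
Show(sphere)
ResetCamera()
Render()

camera = GetActiveCamera()
frames = 36
start = time.time()
for _ in range(frames):
    camera.Azimuth(10)   # rotate 10 degrees per frame
    Render()
elapsed = time.time() - start
print('%.1f frames/second' % (frames / elapsed))
```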
That’s about everything I found. It really is a good job, don’t take my thoughts as too negative. I just see performance issues frequently from a different viewpoint.
Here is the "link" that Ben gave me to this original email: Message-Id: C87AD652-5912-4DA5-B1E4-8CD9CF603E35@sandia.gov