VTK-m needs a way to express a max number of parallel CUDA tests per GPU
When testing in parallel it is possible that too many executables that use CUDA are running at once. This then causes tests to fail with the following error:
```
2019-01-17 09:27:39.016 ( 0.005s) [main thread ] loguru.hpp:1969 Info| arguments: UnitTestStreamLineUniformGrid --device=Cuda
2019-01-17 09:27:39.017 ( 0.005s) [main thread ] loguru.hpp:1972 Info| Current dir: /home/kitware/buildslave/root/vtk-m-adora-linux-static-release_cuda_host_gcc_5_cuda_native_examples_gcc_logging/build/vtkm/worklet/testing
2019-01-17 09:27:39.017 ( 0.005s) [main thread ] loguru.hpp:1974 Info| stderr verbosity: 0
2019-01-17 09:27:39.017 ( 0.005s) [main thread ] loguru.hpp:1975 Info| -----------------------------------
2019-01-17 09:27:39.017 ( 0.005s) [main thread ] Logging.cxx:138 Info| Logging initialized.
2019-01-17 09:27:48.313 ( 9.301s) [main thread ]RuntimeDeviceTracker.cx:159 Info| Setting device 'Cuda' to 0
2019-01-17 09:27:48.313 ( 9.301s) [main thread ]RuntimeDeviceTracker.cx:159 Info| Setting device 'TBB' to 0
2019-01-17 09:27:48.313 ( 9.301s) [main thread ]RuntimeDeviceTracker.cx:159 Info| Setting device 'OpenMP' to 0
2019-01-17 09:27:48.313 ( 9.301s) [main thread ]RuntimeDeviceTracker.cx:159 Info| Setting device 'Serial' to 1
2019-01-17 09:27:48.313 ( 9.302s) [main thread ] Initialize.cxx:73 ERR| Unavailable device specificed after option '--device': 'Cuda'.
Valid devices are: "Any" "Serial"
```
To help alleviate this we have done the following:

- We have made `DeviceAdapterRuntimeDetectorCuda` handle hardware that is at max capacity gracefully. Previously it would permanently disable use of that GPU; now it ignores that the hardware is at capacity and presumes that the device will be ready by the first kernel launch. ( !1533 (merged) )
- We have set up rules in our buildbot infrastructure to re-run in serial any test that fails in parallel. This works around the max-capacity issue, but increases the test turn-around time.
This is not only a CUDA issue. The OpenMP tests see inverse scaling when run in parallel, which is why they are currently all run serially.
What needs to be done:

- We need support for passing OpenMP and TBB tests the number of threads to use on the command line.
- We need support for telling CUDA tests which GPU device to execute on.
- We need to watch the work that other people are doing to help CTest fix this issue: