Benchmarking and Performance Regression Testing
As VTK-m grows, we need a way to monitor the performance of key algorithms and identify performance issues as they arise. Ideally, performance monitoring would be integrated with the CDash application as part of the regular nightly/CI testing process.
Implementing Benchmarks
While VTK-m currently uses a custom set of benchmarking utilities, it may be worth exploring other options to reduce complexity and maintenance requirements. The Google Benchmark library is feature-rich and compatible with VTK-m's compiler requirements. It offers a number of features that would be useful for the project:
- Automatically vary and compare workload sizes for each benchmark.
- Utilities to prevent the optimizer from removing benchmarked code (e.g. dead code elimination when a result goes unused; see the sketch after this list).
- Templated benchmarks that make it easy to vary types (value types, device adapters, etc.).
- Manual timing for tracking execution on non-CPU devices.
- Comparison scripts for checking a benchmark against a baseline.
- Runtime selection of benchmarks.
- Results can be written to file in a variety of formats (human-readable, CSV, JSON).
- More...
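As an illustration of the second point, Google Benchmark provides `benchmark::DoNotOptimize` (and the related `benchmark::ClobberMemory`) to keep the compiler from deleting the code under test. A minimal sketch, using an illustrative benchmark rather than a VTK-m one:

```cpp
#include <benchmark/benchmark.h>

#include <vector>

// Sums a vector; without intervention, the optimizer could see that the
// result is never used and remove the entire loop.
static void BM_VectorSum(benchmark::State& state)
{
  std::vector<double> data(static_cast<std::size_t>(state.range(0)), 1.0);
  for (auto _ : state)
  {
    double sum = 0.0;
    for (double x : data)
    {
      sum += x;
    }
    // Forces the compiler to treat `sum` as used so the loop survives:
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(BM_VectorSum)->Range(1 << 10, 1 << 20);

BENCHMARK_MAIN();
```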
The library is provided under the Apache 2.0 license, making it easy to include and modify to suit VTK-m's needs. The benchmark implementations themselves are simpler than those in VTK-m's current framework -- rather than writing a class per benchmark, each benchmark is a single function that encapsulates setup, timing, and teardown:
Example of a current VTK-m benchmark:
```cpp
template <typename Value, typename DeviceAdapter>
struct BenchLowerBounds
{
  using ValueArrayHandle = vtkm::cont::ArrayHandle<Value, StorageTag>;

  const vtkm::Id N_VALS;
  const vtkm::Id PERCENT_VALUES;
  ValueArrayHandle InputHandle, ValueHandle;
  IdArrayHandle OutHandle;

  VTKM_CONT
  BenchLowerBounds(vtkm::Id value_percent)
    : N_VALS((Config.ComputeSize<Value>() * value_percent) / 100)
    , PERCENT_VALUES(value_percent)
  {
    vtkm::Id arraySize = Config.ComputeSize<Value>();

    // Fill the input and search-value arrays on the device:
    auto iHPortal = InputHandle.PrepareForOutput(arraySize, DeviceAdapter());
    Algorithm::Schedule(FillTestValueKernel<Value, decltype(iHPortal)>(iHPortal), arraySize);

    auto vHPortal = ValueHandle.PrepareForOutput(N_VALS, DeviceAdapter());
    Algorithm::Schedule(FillScaledTestValueKernel<Value, decltype(vHPortal)>(2, vHPortal),
                        N_VALS);
  }

  VTKM_CONT
  vtkm::Float64 operator()()
  {
    Timer timer{ DeviceAdapter() };
    timer.Start();
    Algorithm::LowerBounds(InputHandle, ValueHandle, OutHandle);
    return timer.GetElapsedTime();
  }

  VTKM_CONT
  std::string Description() const
  {
    vtkm::Id arraySize = Config.ComputeSize<Value>();
    std::stringstream description;
    description << "LowerBounds on " << arraySize << " input values ("
                << vtkm::cont::GetHumanReadableSize(static_cast<vtkm::UInt64>(arraySize) *
                                                    sizeof(Value))
                << ") (" << PERCENT_VALUES << "% configuration)";
    return description.str();
  }
};

VTKM_MAKE_BENCHMARK(LowerBounds5, BenchLowerBounds, 5);
VTKM_MAKE_BENCHMARK(LowerBounds10, BenchLowerBounds, 10);
VTKM_MAKE_BENCHMARK(LowerBounds15, BenchLowerBounds, 15);
VTKM_MAKE_BENCHMARK(LowerBounds20, BenchLowerBounds, 20);
VTKM_MAKE_BENCHMARK(LowerBounds25, BenchLowerBounds, 25);
VTKM_MAKE_BENCHMARK(LowerBounds30, BenchLowerBounds, 30);
VTKM_MAKE_BENCHMARK(LowerBounds35, BenchLowerBounds, 35);
VTKM_MAKE_BENCHMARK(LowerBounds40, BenchLowerBounds, 40);
VTKM_MAKE_BENCHMARK(LowerBounds45, BenchLowerBounds, 45);
VTKM_MAKE_BENCHMARK(LowerBounds50, BenchLowerBounds, 50);
VTKM_MAKE_BENCHMARK(LowerBounds75, BenchLowerBounds, 75);
VTKM_MAKE_BENCHMARK(LowerBounds100, BenchLowerBounds, 100);
```
The same benchmark under Google Benchmark:
```cpp
template <typename ValueType, typename DeviceAdapter>
void BenchLowerBounds(benchmark::State& state)
{
  using ValueArrayHandle = vtkm::cont::ArrayHandle<ValueType, StorageTag>;

  const vtkm::Id arraySize = Config.ComputeSize<ValueType>();
  const vtkm::Id percentValues = static_cast<vtkm::Id>(state.range(0));
  const vtkm::Id numVals = (arraySize * percentValues) / 100;

  // Setup runs once, before the timed loop:
  ValueArrayHandle inputHandle;
  {
    auto iHPortal = inputHandle.PrepareForOutput(arraySize, DeviceAdapter());
    Algorithm::Schedule(FillTestValueKernel<ValueType, decltype(iHPortal)>(iHPortal), arraySize);
  }

  ValueArrayHandle valueHandle;
  {
    auto vHPortal = valueHandle.PrepareForOutput(numVals, DeviceAdapter());
    Algorithm::Schedule(FillScaledTestValueKernel<ValueType, decltype(vHPortal)>(2, vHPortal),
                        numVals);
  }

  IdArrayHandle outHandle;

  for (auto _ : state)
  {
    Timer timer{ DeviceAdapter() };
    timer.Start();
    Algorithm::LowerBounds(inputHandle, valueHandle, outHandle);
    // Manually record the iteration time so that devices which execute
    // asynchronously (e.g. CUDA) are timed correctly:
    state.SetIterationTime(timer.GetElapsedTime());
  }
}
// UseManualTime() tells Google Benchmark to use the times reported through
// SetIterationTime() rather than its own wall-clock measurements:
BENCHMARK_TEMPLATE(BenchLowerBounds, vtkm::Int32, vtkm::cont::DeviceAdapterTagTBB)
  ->DenseRange(0, 100, 5) // Varies state.range(0) from 0 to 100 in increments of 5.
  ->UseManualTime();
```
VTK-m would likely need to add some variation of the `BENCHMARK_TEMPLATE` macro that creates a cross-product of typelists to support multiple value types and device adapters, as sketched below.
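One possible shape for this, sketched under assumptions: the `TypeName`/`DeviceName` helpers are hypothetical (VTK-m would need to provide something equivalent), while `benchmark::RegisterBenchmark` is Google Benchmark's existing runtime registration API. The sketch folds over a pack of value types per device adapter:

```cpp
#include <benchmark/benchmark.h>

#include <initializer_list>
#include <string>

template <typename ValueType, typename DeviceAdapter>
void BenchLowerBounds(benchmark::State& state); // as defined above

// Assumed helpers mapping a type to a human-readable label for the benchmark
// name (hypothetical; not part of VTK-m or Google Benchmark):
template <typename T> std::string TypeName();
template <typename DeviceAdapter> std::string DeviceName();

// Registers one instantiation of BenchLowerBounds for every value type in the
// pack, all on the given device adapter:
template <typename DeviceAdapter, typename... ValueTypes>
void RegisterLowerBoundsForDevice()
{
  (void)std::initializer_list<int>{ (
    benchmark::RegisterBenchmark(
      ("BenchLowerBounds<" + TypeName<ValueTypes>() + ", " +
       DeviceName<DeviceAdapter>() + ">").c_str(),
      &BenchLowerBounds<ValueTypes, DeviceAdapter>)
      ->DenseRange(5, 100, 5)
      ->UseManualTime(),
    0)... };
}

int main(int argc, char** argv)
{
  // The cross-product: each device adapter paired with each value type.
  RegisterLowerBoundsForDevice<vtkm::cont::DeviceAdapterTagSerial,
                               vtkm::Int32, vtkm::Int64, vtkm::Float64>();
  RegisterLowerBoundsForDevice<vtkm::cont::DeviceAdapterTagTBB,
                               vtkm::Int32, vtkm::Int64, vtkm::Float64>();

  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
}
```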
Monitoring Regressions
CDash would require some modification for this to work. At a minimum, CDash would need:
- The ability to parse timing information uploaded from `ctest` (using Google Benchmark's `json` output would be great here; a sketch of its shape appears after this list). Rather than one execution time per executable, a single executable should be able to report timing information for multiple benchmarks.
- Custom thresholds per benchmark per machine. If a benchmark runs too slow OR too fast, there should be a notification on the dashboard's web interface.
- The ability to reset the min/max thresholds per benchmark per machine via the web interface.
- A log of historic times and thresholds per benchmark per machine.
- A plot of historic times and thresholds shown in the web interface.
- A mechanism to ensure that the benchmarks have exclusive access to the machine (e.g. don't run CPU benchmarks while compiling ParaView in the background).
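For reference, Google Benchmark's JSON output (produced with `--benchmark_out=<file> --benchmark_out_format=json`) has roughly the shape below; the values are illustrative, not from an actual VTK-m run:

```json
{
  "context": {
    "date": "2018-06-01 12:00:00",
    "num_cpus": 8,
    "mhz_per_cpu": 4000,
    "cpu_scaling_enabled": false,
    "library_build_type": "release"
  },
  "benchmarks": [
    {
      "name": "BenchLowerBounds<vtkm::Int32, TBB>/5",
      "iterations": 1000,
      "real_time": 1.23e+06,
      "cpu_time": 1.20e+06,
      "time_unit": "ns"
    }
  ]
}
```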
Developer Workflow
During a performance optimization workflow, the ability to easily compare benchmarks against a baseline is essential. Google Benchmark provides a utility similar to VTK-m's custom `benchCompare.py` script that can be used to compare two sets of benchmark results:
```
$ ./compare.py benchmarks ./baseline.json ./benchmark.exe
RUNNING: ./benchmark.exe --benchmark_out=/tmp/tmprBT5nW
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:16:44
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   19101577   211.669MB/s
BM_memcpy/64           76 ns         76 ns    9412571   800.199MB/s
BM_memcpy/512          84 ns         84 ns    8249070   5.64771GB/s
BM_memcpy/1024        116 ns        116 ns    6181763   8.19505GB/s
BM_memcpy/8192        643 ns        643 ns    1062855   11.8636GB/s
BM_copy/8             222 ns        222 ns    3137987   34.3772MB/s
BM_copy/64           1608 ns       1608 ns     432758   37.9501MB/s
BM_copy/512         12589 ns      12589 ns      54806   38.7867MB/s
BM_copy/1024        25169 ns      25169 ns      27713   38.8003MB/s
BM_copy/8192       201165 ns     201112 ns       3486   38.8466MB/s
Comparing ./baseline.json to ./benchmark.exe
Benchmark              Time        CPU    Time Old    Time New    CPU Old    CPU New
-------------------------------------------------------------------------------------
BM_memcpy/8         +0.0020    +0.0020          36          36         36         36
BM_memcpy/64        -0.0468    -0.0470          76          73         76         73
BM_memcpy/512       +0.0081    +0.0083          84          85         84         85
BM_memcpy/1024      +0.0098    +0.0097         116         118        116        118
BM_memcpy/8192      +0.0200    +0.0203         643         656        643        656
BM_copy/8           +0.0046    +0.0042         222         223        222        223
BM_copy/64          +0.0020    +0.0020        1608        1611       1608       1611
BM_copy/512         +0.0027    +0.0026       12589       12622      12589      12622
BM_copy/1024        +0.0035    +0.0028       25169       25257      25169      25239
BM_copy/8192        +0.0191    +0.0194      201165      205013     201112     205010
```
Additional comparison modes are detailed in Google Benchmark's tools documentation (`docs/tools.md`), including filtering benchmarks and comparing specific benchmarks within the same run. Using these will allow developers to quickly observe the effect of their changes on an algorithm's performance.
Google Benchmark's `compare.py` script should be integrated into the VTK-m benchmarking workflow. For instance, a test could fail when it identifies cases where a parallel algorithm is being outperformed by the `Serial` backend, or cases where optimizations should occur (such as vectorized bitwise copying of POD data; see #456).
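As a sketch of how such a check might look (the executable name and filter regexes are hypothetical, while the `filters` comparison mode is part of Google Benchmark's tooling), a CI step could run the benchmark binary once and compare its `Serial` results against its `TBB` results within the same run:

```
$ ./compare.py filters ./BenchmarkDeviceAdapter 'Serial' 'TBB'
```

A wrapper script could then parse the reported differences and fail the job when the parallel backend regresses past a chosen threshold.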