Benchmarking and Performance Regression Testing
As VTK-m grows, we need a way to monitor the performance of key algorithms and identify performance issues as they arise. Ideally, performance monitoring would be integrated with the CDash application as part of the regular nightly/CI testing process.
Implementing Benchmarks
While VTK-m currently uses a custom set of benchmarking utilities, it may be worth exploring other options to reduce complexity and maintenance requirements. The Google Benchmark library is feature-rich and compatible with VTK-m's compiler requirements. It offers a number of features that would be useful for the project:
- Automatically vary and compare workload sizes for each benchmark.
- Utilities to prevent the optimizer from removing benchmarked code (e.g. dead code elimination when a result goes unused; see the sketch after this list).
- Templated benchmarks that make it easy to vary types (value types, device adapters, etc.).
- Manual timing for tracking execution on non-CPU devices.
- Comparison scripts for checking a benchmark against a baseline.
- Runtime selection of benchmarks.
- Results can be written to file in a variety of formats (human-readable, CSV, JSON).
- More...
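As an illustration of the second point, Google Benchmark provides `benchmark::DoNotOptimize` (and the related `benchmark::ClobberMemory`) to keep the compiler from deleting the code under test. A minimal sketch, using an illustrative benchmark rather than a VTK-m one:

```cpp
#include <benchmark/benchmark.h>

#include <vector>

// Sums a vector; without intervention, the optimizer could see that the
// result is never used and remove the entire loop.
static void BM_VectorSum(benchmark::State& state)
{
  std::vector<double> data(static_cast<std::size_t>(state.range(0)), 1.0);
  for (auto _ : state)
  {
    double sum = 0.0;
    for (double x : data)
    {
      sum += x;
    }
    // Forces the compiler to treat `sum` as used so the loop survives:
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(BM_VectorSum)->Range(1 << 10, 1 << 20);

BENCHMARK_MAIN();
```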
The library is provided under the Apache 2.0 license, making it easy to include and modify to suit VTK-m's needs. The benchmark implementations themselves are simpler than those in VTK-m's current framework -- rather than writing a class per benchmark, each benchmark is a single function that encapsulates setup, timing, and teardown:
Example of a current VTK-m benchmark:
```cpp
template <typename Value, typename DeviceAdapter>
struct BenchLowerBounds
{
  using ValueArrayHandle = vtkm::cont::ArrayHandle<Value, StorageTag>;

  const vtkm::Id N_VALS;
  const vtkm::Id PERCENT_VALUES;
  ValueArrayHandle InputHandle, ValueHandle;
  IdArrayHandle OutHandle;

  VTKM_CONT
  BenchLowerBounds(vtkm::Id value_percent)
    : N_VALS((Config.ComputeSize<Value>() * value_percent) / 100)
    , PERCENT_VALUES(value_percent)
  {
    vtkm::Id arraySize = Config.ComputeSize<Value>();

    // Fill the input and search-value arrays on the device:
    auto iHPortal = InputHandle.PrepareForOutput(arraySize, DeviceAdapter());
    Algorithm::Schedule(FillTestValueKernel<Value, decltype(iHPortal)>(iHPortal), arraySize);

    auto vHPortal = ValueHandle.PrepareForOutput(N_VALS, DeviceAdapter());
    Algorithm::Schedule(FillScaledTestValueKernel<Value, decltype(vHPortal)>(2, vHPortal),
                        N_VALS);
  }

  VTKM_CONT
  vtkm::Float64 operator()()
  {
    Timer timer{ DeviceAdapter() };
    timer.Start();
    Algorithm::LowerBounds(InputHandle, ValueHandle, OutHandle);
    return timer.GetElapsedTime();
  }

  VTKM_CONT
  std::string Description() const
  {
    vtkm::Id arraySize = Config.ComputeSize<Value>();
    std::stringstream description;
    description << "LowerBounds on " << arraySize << " input values ("
                << vtkm::cont::GetHumanReadableSize(static_cast<vtkm::UInt64>(arraySize) *
                                                    sizeof(Value))
                << ") (" << PERCENT_VALUES << "% configuration)";
    return description.str();
  }
};

VTKM_MAKE_BENCHMARK(LowerBounds5, BenchLowerBounds, 5);
VTKM_MAKE_BENCHMARK(LowerBounds10, BenchLowerBounds, 10);
VTKM_MAKE_BENCHMARK(LowerBounds15, BenchLowerBounds, 15);
VTKM_MAKE_BENCHMARK(LowerBounds20, BenchLowerBounds, 20);
VTKM_MAKE_BENCHMARK(LowerBounds25, BenchLowerBounds, 25);
VTKM_MAKE_BENCHMARK(LowerBounds30, BenchLowerBounds, 30);
VTKM_MAKE_BENCHMARK(LowerBounds35, BenchLowerBounds, 35);
VTKM_MAKE_BENCHMARK(LowerBounds40, BenchLowerBounds, 40);
VTKM_MAKE_BENCHMARK(LowerBounds45, BenchLowerBounds, 45);
VTKM_MAKE_BENCHMARK(LowerBounds50, BenchLowerBounds, 50);
VTKM_MAKE_BENCHMARK(LowerBounds75, BenchLowerBounds, 75);
VTKM_MAKE_BENCHMARK(LowerBounds100, BenchLowerBounds, 100);
```
The same benchmark under Google Benchmark:
```cpp
template <typename ValueType, typename DeviceAdapter>
void BenchLowerBounds(benchmark::State& state)
{
  using ValueArrayHandle = vtkm::cont::ArrayHandle<ValueType, StorageTag>;

  const vtkm::Id arraySize = Config.ComputeSize<ValueType>();
  const vtkm::Id percentValues = static_cast<vtkm::Id>(state.range(0));
  const vtkm::Id numVals = (arraySize * percentValues) / 100;

  // Setup runs once, before the timed loop:
  ValueArrayHandle inputHandle;
  {
    auto iHPortal = inputHandle.PrepareForOutput(arraySize, DeviceAdapter());
    Algorithm::Schedule(FillTestValueKernel<ValueType, decltype(iHPortal)>(iHPortal), arraySize);
  }

  ValueArrayHandle valueHandle;
  {
    auto vHPortal = valueHandle.PrepareForOutput(numVals, DeviceAdapter());
    Algorithm::Schedule(FillScaledTestValueKernel<ValueType, decltype(vHPortal)>(2, vHPortal),
                        numVals);
  }

  IdArrayHandle outHandle;

  for (auto _ : state)
  {
    Timer timer{ DeviceAdapter() };
    timer.Start();
    Algorithm::LowerBounds(inputHandle, valueHandle, outHandle);
    // Manually record the iteration time so that devices which execute
    // asynchronously (e.g. CUDA) are timed correctly:
    state.SetIterationTime(timer.GetElapsedTime());
  }
}
// UseManualTime() tells Google Benchmark to use the times reported through
// SetIterationTime() rather than its own wall-clock measurements:
BENCHMARK_TEMPLATE(BenchLowerBounds, vtkm::Int32, vtkm::cont::DeviceAdapterTagTBB)
  ->DenseRange(0, 100, 5) // Varies state.range(0) from 0 to 100 in increments of 5.
  ->UseManualTime();
```
VTK-m would likely need to add some variation of the `BENCHMARK_TEMPLATE` macro that creates a cross-product of typelists to support multiple value types and device adapters, as sketched below.
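One possible shape for this, sketched under assumptions: the `TypeName`/`DeviceName` helpers are hypothetical (VTK-m would need to provide something equivalent), while `benchmark::RegisterBenchmark` is Google Benchmark's existing runtime registration API. The sketch folds over a pack of value types per device adapter:

```cpp
#include <benchmark/benchmark.h>

#include <initializer_list>
#include <string>

template <typename ValueType, typename DeviceAdapter>
void BenchLowerBounds(benchmark::State& state); // as defined above

// Assumed helpers mapping a type to a human-readable label for the benchmark
// name (hypothetical; not part of VTK-m or Google Benchmark):
template <typename T> std::string TypeName();
template <typename DeviceAdapter> std::string DeviceName();

// Registers one instantiation of BenchLowerBounds for every value type in the
// pack, all on the given device adapter:
template <typename DeviceAdapter, typename... ValueTypes>
void RegisterLowerBoundsForDevice()
{
  (void)std::initializer_list<int>{ (
    benchmark::RegisterBenchmark(
      ("BenchLowerBounds<" + TypeName<ValueTypes>() + ", " +
       DeviceName<DeviceAdapter>() + ">").c_str(),
      &BenchLowerBounds<ValueTypes, DeviceAdapter>)
      ->DenseRange(5, 100, 5)
      ->UseManualTime(),
    0)... };
}

int main(int argc, char** argv)
{
  // The cross-product: each device adapter paired with each value type.
  RegisterLowerBoundsForDevice<vtkm::cont::DeviceAdapterTagSerial,
                               vtkm::Int32, vtkm::Int64, vtkm::Float64>();
  RegisterLowerBoundsForDevice<vtkm::cont::DeviceAdapterTagTBB,
                               vtkm::Int32, vtkm::Int64, vtkm::Float64>();

  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
}
```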
Monitoring Regressions
CDash would require some modification for this to work. At a minimum, CDash would need:
- The ability to parse timing information uploaded from `ctest` (using Google Benchmark's `json` output would be great here; a sketch of its shape appears after this list). Rather than one execution time per executable, a single executable should be able to report timing information for multiple benchmarks.
- Custom thresholds per benchmark per machine. If a benchmark runs too slow OR too fast, there should be a notification on the dashboard's web interface.
- The ability to reset the min/max thresholds per benchmark per machine via the web interface.
- A log of historic times and thresholds per benchmark per machine.
- A plot of historic times and thresholds shown in the web interface.
- A mechanism to ensure that the benchmarks have exclusive access to the machine (e.g. don't run CPU benchmarks while compiling ParaView in the background).
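For reference, Google Benchmark's JSON output (produced with `--benchmark_out=<file> --benchmark_out_format=json`) has roughly the shape below; the values are illustrative, not from an actual VTK-m run:

```json
{
  "context": {
    "date": "2018-06-01 12:00:00",
    "num_cpus": 8,
    "mhz_per_cpu": 4000,
    "cpu_scaling_enabled": false,
    "library_build_type": "release"
  },
  "benchmarks": [
    {
      "name": "BenchLowerBounds<vtkm::Int32, TBB>/5",
      "iterations": 1000,
      "real_time": 1.23e+06,
      "cpu_time": 1.20e+06,
      "time_unit": "ns"
    }
  ]
}
```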
Developer Workflow
During a performance optimization workflow, the ability to easily compare benchmarks against a baseline is essential. Google Benchmark provides a utility similar to VTK-m's custom `benchCompare.py` script that can be used to compare two sets of benchmark results:
```
$ ./compare.py benchmarks ./baseline.json ./benchmark.exe
RUNNING: ./benchmark.exe --benchmark_out=/tmp/tmprBT5nW
Run on (8 X 4000 MHz CPU s)
2017-11-07 21:16:44
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
BM_memcpy/8            36 ns         36 ns   19101577   211.669MB/s
BM_memcpy/64           76 ns         76 ns    9412571   800.199MB/s
BM_memcpy/512          84 ns         84 ns    8249070   5.64771GB/s
BM_memcpy/1024        116 ns        116 ns    6181763   8.19505GB/s
BM_memcpy/8192        643 ns        643 ns    1062855   11.8636GB/s
BM_copy/8             222 ns        222 ns    3137987   34.3772MB/s
BM_copy/64           1608 ns       1608 ns     432758   37.9501MB/s
BM_copy/512         12589 ns      12589 ns      54806   38.7867MB/s
BM_copy/1024        25169 ns      25169 ns      27713   38.8003MB/s
BM_copy/8192       201165 ns     201112 ns       3486   38.8466MB/s
Comparing ./baseline.json to ./benchmark.exe
Benchmark              Time        CPU    Time Old    Time New    CPU Old    CPU New
-------------------------------------------------------------------------------------
BM_memcpy/8         +0.0020    +0.0020          36          36         36         36
BM_memcpy/64        -0.0468    -0.0470          76          73         76         73
BM_memcpy/512       +0.0081    +0.0083          84          85         84         85
BM_memcpy/1024      +0.0098    +0.0097         116         118        116        118
BM_memcpy/8192      +0.0200    +0.0203         643         656        643        656
BM_copy/8           +0.0046    +0.0042         222         223        222        223
BM_copy/64          +0.0020    +0.0020        1608        1611       1608       1611
BM_copy/512         +0.0027    +0.0026       12589       12622      12589      12622
BM_copy/1024        +0.0035    +0.0028       25169       25257      25169      25239
BM_copy/8192        +0.0191    +0.0194      201165      205013     201112     205010
```
Additional comparison modes are detailed in Google Benchmark's tools documentation (`docs/tools.md`), including filtering benchmarks and comparing specific benchmarks within the same run. Using these will allow developers to quickly observe the effect of their changes on an algorithm's performance.
Google Benchmark's `compare.py` script should be integrated into the VTK-m benchmarking workflow. For instance, a test could fail when it identifies cases where a parallel algorithm is being outperformed by the `Serial` backend, or cases where optimizations should occur (such as vectorized bitwise copying of POD data; see #456).
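As a sketch of how such a check might look (the executable name and filter regexes are hypothetical, while the `filters` comparison mode is part of Google Benchmark's tooling), a CI step could run the benchmark binary once and compare its `Serial` results against its `TBB` results within the same run:

```
$ ./compare.py filters ./BenchmarkDeviceAdapter 'Serial' 'TBB'
```

A wrapper script could then parse the reported differences and fail the job when the parallel backend regresses past a chosen threshold.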