
TBB Copy slower than serial

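In !967 (merged), we discovered that TBB's copy is slower than the serial backend for most scalar value types on multicore, single-socket processors (speedup is serial time divided by TBB time, so values below 1 mean TBB is slower):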

| Speedup | serial | TBB | Benchmark (Type) |
|---------|--------|-----|------------------|
| 0.738 | 0.001166 +- 0.000049 | 0.001580 +- 0.000048 | Copy 2097152 values (vtkm::Float32) |
| 0.741 | 0.002305 +- 0.000146 | 0.003113 +- 0.000053 | Copy 2097152 values (vtkm::Float64) |
| 0.735 | 0.001137 +- 0.000093 | 0.001547 +- 0.000047 | Copy 2097152 values (vtkm::Int32) |
| 0.769 | 0.002402 +- 0.000101 | 0.003124 +- 0.000057 | Copy 2097152 values (vtkm::Int64) |
| 0.755 | 0.001201 +- 0.000078 | 0.001590 +- 0.000062 | Copy 2097152 values (vtkm::UInt32) |
| 1.051 | 0.000339 +- 0.000029 | 0.000322 +- 0.000021 | Copy 2097152 values (vtkm::UInt8) |
| 1.072 | 0.006688 +- 0.000291 | 0.006239 +- 0.000101 | Copy 2097152 values (vtkm::Vec< vtkm::Float32, 4 >) |
| 1.124 | 0.010427 +- 0.000617 | 0.009280 +- 0.000126 | Copy 2097152 values (vtkm::Vec< vtkm::Float64, 3 >) |
| 1.078 | 0.003398 +- 0.000142 | 0.003153 +- 0.000057 | Copy 2097152 values (vtkm::Vec< vtkm::Int32, 2 >) |
| 1.165 | 0.001856 +- 0.000098 | 0.001593 +- 0.000041 | Copy 2097152 values (vtkm::Vec< vtkm::UInt8, 4 >) |

We should investigate launching a single pinned thread per socket / LLC (last-level cache) instead of one thread per logical core, since memory bandwidth and cache contention seem to be causing the parallel slowdown.
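
To try the idea outside of TBB first, here is a minimal standard-library sketch that copies an array with one worker per entry in a caller-supplied CPU list and pins each worker to that CPU. It is not TBB's or VTK-m's implementation: the names `PinToCpu`, `ChunkedCopy`, and `TimeSeconds` are hypothetical, pinning is Linux-only and best-effort, and choosing which CPU indices represent one core per socket / LLC is left to the caller (for example, from the machine's topology).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

#if defined(__linux__)
#include <pthread.h>
#include <sched.h>
#endif

// Best-effort pinning of a running thread to one logical CPU.
// Linux-only; a no-op elsewhere. Failures (e.g. the CPU is not in the
// process's allowed set) are deliberately ignored in this sketch.
inline void PinToCpu(std::thread& worker, unsigned cpu)
{
#if defined(__linux__)
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  (void)pthread_setaffinity_np(worker.native_handle(), sizeof(set), &set);
#else
  (void)worker;
  (void)cpu;
#endif
}

// Copy src into dst using one worker thread per entry in `cpus`.
// Each worker copies one contiguous chunk and is pinned to its CPU.
// Passing a single CPU approximates "one thread per socket / LLC" on a
// single-socket machine; passing every logical core mimics the default.
template <typename T>
void ChunkedCopy(const std::vector<T>& src,
                 std::vector<T>& dst,
                 const std::vector<unsigned>& cpus)
{
  assert(!cpus.empty());
  dst.resize(src.size());

  const std::size_t numWorkers = cpus.size();
  const std::size_t chunk = (src.size() + numWorkers - 1) / numWorkers;

  std::vector<std::thread> workers;
  workers.reserve(numWorkers);
  for (std::size_t i = 0; i < numWorkers; ++i)
  {
    const std::size_t begin = std::min(i * chunk, src.size());
    const std::size_t end = std::min(begin + chunk, src.size());
    workers.emplace_back([&src, &dst, begin, end]() {
      std::copy(src.begin() + begin, src.begin() + end, dst.begin() + begin);
    });
    // The thread may already have started elsewhere; the affinity
    // change takes effect the next time it is scheduled.
    PinToCpu(workers.back(), cpus[i]);
  }
  for (auto& w : workers)
  {
    w.join();
  }
}

// Wall-clock seconds taken by a callable, for rough comparisons only.
template <typename F>
double TimeSeconds(F&& f)
{
  const auto start = std::chrono::steady_clock::now();
  f();
  const std::chrono::duration<double> elapsed =
    std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Timing `ChunkedCopy` with a one-entry CPU list against a list containing every logical core, for the same 2097152-value arrays as in the benchmark, would give a rough indication of whether fewer pinned threads relieve the bandwidth and cache contention. Thread creation cost is included in the measurement here, so the numbers are only comparable with each other, not with the TBB results above.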

Related: https://software.intel.com/en-us/blogs/2010/12/28/tbb-30-and-processor-affinity

Edited by Allison Vacanti