TBB Copy slower than serial
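In !967 (merged), we discovered that on multicore, single slot processors TBB's copy is slower than the serial backend for the 4- and 8-byte scalar types (speedup of roughly 0.74 to 0.77), while it is only slightly faster (roughly 1.05 to 1.17) for UInt8 and the Vec types: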
Speedup (serial / TBB) | serial | TBB | Benchmark (Type) |
---|---|---|---|
0.738 | 0.001166 +- 0.000049 | 0.001580 +- 0.000048 | Copy 2097152 values (vtkm::Float32) |
0.741 | 0.002305 +- 0.000146 | 0.003113 +- 0.000053 | Copy 2097152 values (vtkm::Float64) |
0.735 | 0.001137 +- 0.000093 | 0.001547 +- 0.000047 | Copy 2097152 values (vtkm::Int32) |
0.769 | 0.002402 +- 0.000101 | 0.003124 +- 0.000057 | Copy 2097152 values (vtkm::Int64) |
0.755 | 0.001201 +- 0.000078 | 0.001590 +- 0.000062 | Copy 2097152 values (vtkm::UInt32) |
1.051 | 0.000339 +- 0.000029 | 0.000322 +- 0.000021 | Copy 2097152 values (vtkm::UInt8) |
1.072 | 0.006688 +- 0.000291 | 0.006239 +- 0.000101 | Copy 2097152 values (vtkm::Vec< vtkm::Float32, 4 >) |
1.124 | 0.010427 +- 0.000617 | 0.009280 +- 0.000126 | Copy 2097152 values (vtkm::Vec< vtkm::Float64, 3 >) |
1.078 | 0.003398 +- 0.000142 | 0.003153 +- 0.000057 | Copy 2097152 values (vtkm::Vec< vtkm::Int32, 2 >) |
1.165 | 0.001856 +- 0.000098 | 0.001593 +- 0.000041 | Copy 2097152 values (vtkm::Vec< vtkm::UInt8, 4 >) |
We should investigate launching a single pinned thread per slot / LLC instead of per logical core, since memory bandwidth and cache contention seem to be causing the parallel slowdown.
Related: https://software.intel.com/en-us/blogs/2010/12/28/tbb-30-and-processor-affinity
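
As a starting point for that investigation, here is a minimal sketch of a copy whose worker-thread count is an explicit parameter, so that one thread per slot / LLC can be compared against one thread per logical core. It uses plain `std::thread` rather than TBB, so it does not reproduce TBB's scheduling, and it does not pin threads (pinning is platform specific, e.g. `pthread_setaffinity_np` on Linux). `ChunkedCopy` and `TimeCopy` are hypothetical helper names, not part of VTK-m or TBB.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not a VTK-m or TBB API): copy `src` into `dst` using
// at most `numThreads` worker threads, one contiguous chunk per thread.
template <typename T>
void ChunkedCopy(const std::vector<T>& src, std::vector<T>& dst, std::size_t numThreads)
{
  dst.resize(src.size());
  numThreads = std::max<std::size_t>(1, numThreads);
  // Round up so that every element falls into exactly one chunk.
  const std::size_t chunk = (src.size() + numThreads - 1) / numThreads;

  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < numThreads; ++t)
  {
    const std::size_t begin = std::min(src.size(), t * chunk);
    const std::size_t end = std::min(src.size(), begin + chunk);
    if (begin == end)
    {
      break; // nothing left to copy
    }
    const T* in = src.data();
    T* out = dst.data();
    workers.emplace_back([in, out, begin, end]() {
      std::copy(in + begin, in + end, out + begin);
    });
  }
  for (auto& worker : workers)
  {
    worker.join();
  }
}

// Hypothetical helper: mean wall-clock seconds per copy over `repeats` runs.
// The measurement includes thread creation and join, which a pooled
// scheduler such as TBB would largely avoid.
template <typename T>
double TimeCopy(const std::vector<T>& src, std::size_t numThreads, int repeats)
{
  std::vector<T> dst(src.size());
  repeats = std::max(1, repeats);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i)
  {
    ChunkedCopy(src, dst, numThreads);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count() / repeats;
}
```

Calling `TimeCopy(src, 1, 10)` and `TimeCopy(src, std::thread::hardware_concurrency(), 10)` on a 2097152-element vector would give a rough indication of whether fewer threads help on a given machine. A fair comparison against the TBB backend would still need pinned threads and a persistent thread pool.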
Edited by Allison Vacanti