Performance issue in TBB's Copy

Copying an ArrayHandle of Vec3f_32 takes up to 47% longer than copying an array of Float32s with the same memory size (Size:XXXX is the number of bytes in the array). Only the Size:32768 case shows no slowdown.

$ python2 ~/code/src/benchmark/tools/compare.py filters tbb-devAlgo.json "BenchCopy<F32>" "BenchCopy<Vec3f_32>"
Comparing BenchCopy<F32> to BenchCopy<Vec3f_32> (from tbb-devAlgo.json)
Benchmark                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------------------
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:4096/manual_time                     +0.3065         +0.2341           513           671           568           701
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:32768/manual_time                    -0.0274         -0.1244          3857          3751          4012          3513
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:262144/manual_time                   +0.2907         +0.2670          9091         11734          8924         11306
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:2097152/manual_time                  +0.4420         +0.3723         50114         72263         48895         67101
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:16777216/manual_time                 +0.2205         +0.2885       1694294       2067908       1438704       1853814
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:134217728/manual_time                +0.4726         +0.2202      12033028      17720059      10506466      12820513

Both cases should be using optimized bitwise copies, in which case arrays of equal byte size would take about the same time to copy, but they evidently aren't.
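For reference, a minimal sketch of the kind of bitwise copy expected here. This is not VTK-m's or TBB's actual implementation: `Vec3f`, `CopyImpl`, and `CopyElements` are hypothetical names made up for illustration, and the sketch is serial. It only shows that when the element type is trivially copyable, the copy can reduce to a single `memcpy` whose cost depends on the byte count rather than on the element type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a 3-component float vector (not VTK-m's type).
struct Vec3f
{
  float v[3];
};

static_assert(std::is_trivially_copyable<Vec3f>::value,
              "Vec3f must be bitwise copyable");
static_assert(sizeof(Vec3f) == 3 * sizeof(float),
              "Vec3f is expected to have no padding");

// Trivially copyable types: one memcpy over the whole byte range.
template <typename T>
void CopyImpl(const T* src, T* dst, std::size_t count, std::true_type)
{
  std::memcpy(dst, src, count * sizeof(T));
}

// Everything else: fall back to element-wise copy assignment.
template <typename T>
void CopyImpl(const T* src, T* dst, std::size_t count, std::false_type)
{
  std::copy(src, src + count, dst);
}

// Hypothetical entry point: picks the copy strategy at compile time.
// The source and destination ranges must not overlap.
template <typename T>
void CopyElements(const T* src, T* dst, std::size_t count)
{
  if (count == 0)
  {
    return;
  }
  CopyImpl(src, dst, count, std::is_trivially_copyable<T>{});
}
```

With a dispatch like this, `CopyElements<float>` over N bytes and `CopyElements<Vec3f>` over the same N bytes both end up in the same `memcpy` call, which is why similar timings would be expected for equal Size values.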

The benchmark was run on TBB compiled with MSVC 2015:

Running bin\BenchmarkDeviceAdapter.exe
Run on (8 X 2592 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
VTK-m Device State:
 - Device 1 (Serial): Enabled=0
 - Device 2 (Cuda): Enabled=0
 - Device 3 (TBB): Enabled=1
 - Device 4 (OpenMP): Enabled=0
 - Device 5 (InvalidDeviceId): Enabled=0
 - Device 6 (InvalidDeviceId): Enabled=0
 - Device 7 (InvalidDeviceId): Enabled=0

Edited Dec 31, 2019 by Allison Vacanti