Performance issue in TBB's Copy
Copying an ArrayHandle of Vec3f_32 takes up to 47% longer than copying an array of Float32s with the same memory size (Size:XXXX is the number of bytes in the array).
$ python2 ~/code/src/benchmark/tools/compare.py filters tbb-devAlgo.json "BenchCopy<F32>" "BenchCopy<Vec3f_32>"
Comparing BenchCopy<F32> to BenchCopy<Vec3f_32> (from tbb-devAlgo.json)
Benchmark                                                                Time       CPU   Time Old   Time New    CPU Old    CPU New
-----------------------------------------------------------------------------------------------------------------------------------
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:4096/manual_time        +0.3065   +0.2341        513        671        568        701
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:32768/manual_time       -0.0274   -0.1244       3857       3751       4012       3513
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:262144/manual_time      +0.2907   +0.2670       9091      11734       8924      11306
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:2097152/manual_time     +0.4420   +0.3723      50114      72263      48895      67101
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:16777216/manual_time    +0.2205   +0.2885    1694294    2067908    1438704    1853814
[BenchCopy<F32> vs. BenchCopy<Vec3f_32>]/Size:134217728/manual_time   +0.4726   +0.2202   12033028   17720059   10506466   12820513
These should both be using optimized bit-wise copies, but aren't.
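To illustrate what an "optimized bit-wise copy" could look like for both element types, here is a minimal sketch that dispatches to a single std::memcpy whenever the element type is trivially copyable. The names Vec3f, CopyImpl, CopyElements and CopiesMatch are hypothetical and exist only for this example; this is not VTK-m's Vec3f_32 and not the actual TBB Copy implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a 3-component float vector (not VTK-m's Vec3f_32).
struct Vec3f
{
  float v[3];
};

// The bit-wise path is only valid if these hold for the element type.
static_assert(std::is_trivially_copyable<Vec3f>::value, "Vec3f must be trivially copyable");
static_assert(sizeof(Vec3f) == 3 * sizeof(float), "Vec3f must not contain padding");

// Trivially copyable types: copy the whole range as raw bytes.
template <typename T>
void CopyImpl(const T* src, T* dst, std::size_t n, std::true_type)
{
  assert(n == 0 || (src != nullptr && dst != nullptr));
  if (n > 0)
  {
    std::memcpy(dst, src, n * sizeof(T));
  }
}

// Everything else: fall back to element-wise assignment.
template <typename T>
void CopyImpl(const T* src, T* dst, std::size_t n, std::false_type)
{
  for (std::size_t i = 0; i < n; ++i)
  {
    dst[i] = src[i];
  }
}

// Picks the copy strategy at compile time based on the element type.
template <typename T>
void CopyElements(const T* src, T* dst, std::size_t n)
{
  CopyImpl(src, dst, n, std::is_trivially_copyable<T>{});
}

// Small self-check: copy an array of Vec3f and compare the bytes.
bool CopiesMatch()
{
  std::vector<Vec3f> src(1024);
  for (std::size_t i = 0; i < src.size(); ++i)
  {
    const float f = static_cast<float>(i);
    src[i] = Vec3f{ { f, f + 0.25f, f + 0.5f } };
  }
  std::vector<Vec3f> dst(src.size());
  CopyElements(src.data(), dst.data(), src.size());
  return std::memcmp(src.data(), dst.data(), src.size() * sizeof(Vec3f)) == 0;
}
```

With a dispatch of this kind, float and Vec3f arrays of the same byte size take the same memcpy path, so their copy times would be expected to be close.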
The benchmark was run on TBB compiled with MSVC2015:
Running bin\BenchmarkDeviceAdapter.exe
Run on (8 X 2592 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
VTK-m Device State:
- Device 1 (Serial): Enabled=0
- Device 2 (Cuda): Enabled=0
- Device 3 (TBB): Enabled=1
- Device 4 (OpenMP): Enabled=0
- Device 5 (InvalidDeviceId): Enabled=0
- Device 6 (InvalidDeviceId): Enabled=0
- Device 7 (InvalidDeviceId): Enabled=0
Edited by Allison Vacanti