Performance issue in Fill algorithms: could be faster for small POD types
The comparison below shows the performance difference for `FillArrayHandle` with POD data of different sizes: a single-byte `UInt8` and a 16-byte `Pair<Int32, Float64>`. The `Size:XXXXXX` parameter is the memory size of the array in bytes.
The large type fills much faster. The smaller types could be sped up by a factor of 2-10x by stamping the smaller word repeatedly into a larger buffer and then filling using that larger buffer type. For example, with `UInt8`, we could treat the array as an array of `Vec<UInt8, 16>` and fill it using this larger word size. The size of the vector should be benchmarked to find the ideal word size on a modern HPC system, and of course we must be careful not to write past the end of the array.
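A minimal sketch of the stamping idea, written in plain C++ rather than VTK-m types (the function name, the 16-byte block width, and the use of `memcpy` for the bulk copy are all assumptions to be tuned):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: build one 16-byte block holding the repeated value
// (analogous to Vec<UInt8, 16>), fill the bulk of the array block-by-block,
// then copy only the leftover bytes for the tail so we never write past the
// end of the array.
void FillByStamping(std::uint8_t* data, std::size_t numBytes, std::uint8_t value)
{
  constexpr std::size_t BlockSize = 16; // block width is an assumption; benchmark it
  std::array<std::uint8_t, BlockSize> block;
  block.fill(value);

  const std::size_t numBlocks = numBytes / BlockSize;
  for (std::size_t i = 0; i < numBlocks; ++i)
  {
    std::memcpy(data + i * BlockSize, block.data(), BlockSize);
  }
  // Tail: fewer than BlockSize bytes remain; copy only that many.
  std::memcpy(data + numBlocks * BlockSize, block.data(), numBytes % BlockSize);
}
```

In a real implementation the per-block loop is what would be handed to the parallel device adapter, with the tail handled separately.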
```
$ python2 ~/code/src/benchmark/tools/compare.py filters tbb-devAlgo.json "BenchFillArrayHandle<Pair<I32, F64>>" "BenchFillArrayHandle<UI8>"
Comparing BenchFillArrayHandle<Pair<I32, F64>> to BenchFillArrayHandle<UI8> (from tbb-devAlgo.json)
Benchmark                                                                                        Time     CPU    Time Old   Time New   CPU Old   CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[BenchFillArrayHandle<Pair<I32, F64>> vs. BenchFillArrayHandle<UI8>]/Size:4096/manual_time      +7.6745 +7.0111       559       4850       612      4903
[BenchFillArrayHandle<Pair<I32, F64>> vs. BenchFillArrayHandle<UI8>]/Size:32768/manual_time     +2.7490 +2.6774      3287      12322      3376     12414
[BenchFillArrayHandle<Pair<I32, F64>> vs. BenchFillArrayHandle<UI8>]/Size:262144/manual_time    +6.3244 +6.1977      8800      64456      8993     64729
[BenchFillArrayHandle<Pair<I32, F64>> vs. BenchFillArrayHandle<UI8>]/Size:2097152/manual_time   +12.8550 +12.4815   26993     373985     27248    367346
[BenchFillArrayHandle<Pair<I32, F64>> vs. BenchFillArrayHandle<UI8>]/Size:16777216/manual_time  +1.7629 +1.7277   1103783    3049640   1101577   3004808
[BenchFillArrayHandle<Pair<I32, F64>> vs. BenchFillArrayHandle<UI8>]/Size:134217728/manual_time +1.2829 +1.2868  10333167   23589721   9895833  22629310
```
The `FillBitField` algorithms already do this, increasing the word size to 4 bytes, and show more consistent performance:
```
$ python2 ~/code/src/benchmark/tools/compare.py filters tbb-devAlgo.json "BenchFillBitFieldMask<UI32>" "BenchFillBitFieldMask<UI8>"
Comparing BenchFillBitFieldMask<UI32> to BenchFillBitFieldMask<UI8> (from tbb-devAlgo.json)
Benchmark                                                                               Time     CPU    Time Old   Time New   CPU Old   CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[BenchFillBitFieldMask<UI32> vs. BenchFillBitFieldMask<UI8>]/Size:4096/manual_time      -0.1936 -0.1985      1202        969      1259      1009
[BenchFillBitFieldMask<UI32> vs. BenchFillBitFieldMask<UI8>]/Size:32768/manual_time     -0.1026 -0.1114      5956       5345      6073      5396
[BenchFillBitFieldMask<UI32> vs. BenchFillBitFieldMask<UI8>]/Size:262144/manual_time    -0.0293 -0.0060     15946      15479     15712     15618
[BenchFillBitFieldMask<UI32> vs. BenchFillBitFieldMask<UI8>]/Size:2097152/manual_time   +0.0143 +0.0116     79599      80741     80828     81766
[BenchFillBitFieldMask<UI32> vs. BenchFillBitFieldMask<UI8>]/Size:16777216/manual_time  +0.0142 +0.0454   1097174    1112777   1073074   1121795
[BenchFillBitFieldMask<UI32> vs. BenchFillBitFieldMask<UI8>]/Size:134217728/manual_time +0.0104 -0.0252  10152305   10258041   9672619   9428879
```
However, this word size should also be benchmarked, as the results below show it is better still with an 8-byte fill. Even larger words should be tested to try to take advantage of vector fill instructions.
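As a sketch of how a mask byte could be widened for an 8-byte fill (the helper name is hypothetical, not VTK-m API; the multiply-by-`0x0101…01` broadcast is a standard bit trick):

```cpp
#include <cstdint>

// Hypothetical helper: replicate an 8-bit mask pattern into a 64-bit word so
// a bit field can be filled with 8-byte stores instead of 1-byte stores.
inline std::uint64_t ReplicateByte(std::uint8_t pattern)
{
  // 0x01 in every byte lane; multiplying broadcasts the pattern to all lanes.
  return static_cast<std::uint64_t>(pattern) * 0x0101010101010101ULL;
}
```

The same broadcast generalizes to wider SIMD words if larger fill types prove beneficial.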
```
$ python2 ~/code/src/benchmark/tools/compare.py filters tbb-devAlgo.json "BenchFillBitFieldMask<UI64>" "BenchFillBitFieldMask<UI8>"
Comparing BenchFillBitFieldMask<UI64> to BenchFillBitFieldMask<UI8> (from tbb-devAlgo.json)
Benchmark                                                                               Time     CPU    Time Old   Time New   CPU Old   CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[BenchFillBitFieldMask<UI64> vs. BenchFillBitFieldMask<UI8>]/Size:4096/manual_time      +0.1362 +0.1322       853        969       891      1009
[BenchFillBitFieldMask<UI64> vs. BenchFillBitFieldMask<UI8>]/Size:32768/manual_time     +0.2191 +0.2033      4384       5345      4485      5396
[BenchFillBitFieldMask<UI64> vs. BenchFillBitFieldMask<UI8>]/Size:262144/manual_time    +0.4177 +0.4399     10919      15479     10847     15618
[BenchFillBitFieldMask<UI64> vs. BenchFillBitFieldMask<UI8>]/Size:2097152/manual_time   +0.8145 +0.8051     44498      80741     45296     81766
[BenchFillBitFieldMask<UI64> vs. BenchFillBitFieldMask<UI8>]/Size:16777216/manual_time  +0.0125 +0.0154   1099031    1112777   1104798   1121795
[BenchFillBitFieldMask<UI64> vs. BenchFillBitFieldMask<UI8>]/Size:134217728/manual_time -0.0213 -0.0661  10481803   10258041  10096154   9428879
```
The benchmarks were run on the TBB device, compiled with MSVC 2015:
```
Running bin\BenchmarkDeviceAdapter.exe
Run on (8 X 2592 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
VTK-m Device State:
  - Device 1 (Serial): Enabled=0
  - Device 2 (Cuda): Enabled=0
  - Device 3 (TBB): Enabled=1
  - Device 4 (OpenMP): Enabled=0
  - Device 5 (InvalidDeviceId): Enabled=0
  - Device 6 (InvalidDeviceId): Enabled=0
  - Device 7 (InvalidDeviceId): Enabled=0
```