Performance issue: StableSortIndices is extremely slow
Comparing DeviceAdapterAlgorithm<>::Sort
with vtkm::Worklet::StableSortIndices
on TBB, the performance scales very badly with element size, taking 10x longer to sort 32Ki Float32
s and 40x longer to sort 128Mi UInt8
s.
Odds are that this could be replaced with a SortByKey
and yield better performance, since the current implementation is effectively a two-hop pointer chase per element read.
$ python2 ~/code/src/benchmark/tools/compare.py filters tbb-devAlgo.json "BenchSort<" "BenchStableSortIndices<"
Comparing BenchSort< to BenchStableSortIndices< (from tbb-devAlgo.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:32768/manual_time +12.6116 +7.0359 160620 2186295 174124 1399254
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:262144/manual_time +11.5261 +6.2327 1258224 15760625 1304348 9433962
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:2097152/manual_time +16.9703 +9.0833 5626771 101114562 5810950 58593750
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:16777216/manual_time +29.3860 +28.3250 30333887 921726200 27173913 796875000
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:134217728/manual_time +40.7724 +36.4000 233458133 9752103000 234375000 8765625000
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:32768/manual_time -0.5083 -0.5597 620399 305078 640244 281894
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:262144/manual_time -0.6129 -0.7102 5281213 2044308 5326705 1543787
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:2097152/manual_time +2.0725 +1.7206 6055471 18605412 6250000 17003676
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:16777216/manual_time +12.5738 +3.7943 28133908 381885133 28245192 135416667
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:134217728/manual_time +13.1490 +10.6341 206996600 2928794700 213541667 2484375000
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:32768/manual_time +1.3390 +0.7862 74949 175307 82237 146892
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:262144/manual_time +0.8180 +0.5150 796639 1448258 818850 1240548
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:2097152/manual_time +0.5299 +0.4690 6147862 9405792 6279206 9224398
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:16777216/manual_time +2.2245 +2.3000 32480025 104730667 26041667 85937500
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:134217728/manual_time +4.9546 +4.3478 238657767 1421118800 239583333 1281250000
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:32768/manual_time -0.4857 -0.5767 302017 155314 310244 131335
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:262144/manual_time -0.5750 -0.6158 2577249 1095259 2644231 1015867
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:2097152/manual_time +0.4192 +0.3603 6538764 9279660 6414474 8725649
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:16777216/manual_time +2.8239 +1.9167 31445957 120247050 31250000 91145833
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:134217728/manual_time +4.3792 +3.5333 246448967 1325700200 234375000 1062500000
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:32768/manual_time +1.8708 +1.7321 68788 197479 71779 196109
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:262144/manual_time +1.4075 +1.2739 642044 1545708 645552 1467902
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:2097152/manual_time +1.5675 +1.3780 4802319 12329804 4841549 11513158
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:16777216/manual_time +2.5281 +2.1106 41281665 145644040 43198529 134375000
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:134217728/manual_time +3.4223 +3.2222 344951900 1525493700 351562500 1484375000
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:32768/manual_time +0.1645 +0.1555 52150 60728 54629 63126
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:262144/manual_time +0.3834 +0.3505 348924 482704 357001 482118
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:2097152/manual_time +0.4472 +0.4355 2595829 3756711 2633427 3780242
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:16777216/manual_time +1.1373 +0.9539 23601279 50442883 25323276 49479167
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:134217728/manual_time +2.1148 +2.0000 209363667 652128100 218750000 656250000
Benchmark was run on TBB compiled with MSVC2015
Running bin\BenchmarkDeviceAdapter.exe
Run on (8 X 2592 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
VTK-m Device State:
- Device 1 (Serial): Enabled=0
- Device 2 (Cuda): Enabled=0
- Device 3 (TBB): Enabled=1
- Device 4 (OpenMP): Enabled=0
- Device 5 (InvalidDeviceId): Enabled=0
- Device 6 (InvalidDeviceId): Enabled=0
- Device 7 (InvalidDeviceId): Enabled=0