Skip to content

Performance issue: StableSortIndices is extremely slow

Comparing DeviceAdapterAlgorithm<>::Sort with vtkm::Worklet::StableSortIndices on TBB, the performance scales very badly with element size, taking 10x longer to sort 32Ki Float32s and 40x longer to sort 128Mi UInt8s.

Odds are that this could be replaced with a SortByKey and yield better performance, since the current implementation is effectively a two-hop pointer chase per element read.

$ python2 ~/code/src/benchmark/tools/compare.py filters tbb-devAlgo.json "BenchSort<" "BenchStableSortIndices<"
Comparing BenchSort< to BenchStableSortIndices< (from tbb-devAlgo.json)
Benchmark                                                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:32768/manual_time                              +12.6116         +7.0359        160620       2186295        174124       1399254
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:262144/manual_time                             +11.5261         +6.2327       1258224      15760625       1304348       9433962
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:2097152/manual_time                            +16.9703         +9.0833       5626771     101114562       5810950      58593750
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:16777216/manual_time                           +29.3860        +28.3250      30333887     921726200      27173913     796875000
[BenchSort< vs. BenchStableSortIndices<]UI8>/Size:134217728/manual_time                          +40.7724        +36.4000     233458133    9752103000     234375000    8765625000
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:32768/manual_time                               -0.5083         -0.5597        620399        305078        640244        281894
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:262144/manual_time                              -0.6129         -0.7102       5281213       2044308       5326705       1543787
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:2097152/manual_time                             +2.0725         +1.7206       6055471      18605412       6250000      17003676
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:16777216/manual_time                           +12.5738         +3.7943      28133908     381885133      28245192     135416667
[BenchSort< vs. BenchStableSortIndices<]F32>/Size:134217728/manual_time                          +13.1490        +10.6341     206996600    2928794700     213541667    2484375000
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:32768/manual_time                               +1.3390         +0.7862         74949        175307         82237        146892
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:262144/manual_time                              +0.8180         +0.5150        796639       1448258        818850       1240548
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:2097152/manual_time                             +0.5299         +0.4690       6147862       9405792       6279206       9224398
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:16777216/manual_time                            +2.2245         +2.3000      32480025     104730667      26041667      85937500
[BenchSort< vs. BenchStableSortIndices<]I64>/Size:134217728/manual_time                           +4.9546         +4.3478     238657767    1421118800     239583333    1281250000
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:32768/manual_time                               -0.4857         -0.5767        302017        155314        310244        131335
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:262144/manual_time                              -0.5750         -0.6158       2577249       1095259       2644231       1015867
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:2097152/manual_time                             +0.4192         +0.3603       6538764       9279660       6414474       8725649
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:16777216/manual_time                            +2.8239         +1.9167      31445957     120247050      31250000      91145833
[BenchSort< vs. BenchStableSortIndices<]F64>/Size:134217728/manual_time                           +4.3792         +3.5333     246448967    1325700200     234375000    1062500000
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:32768/manual_time                          +1.8708         +1.7321         68788        197479         71779        196109
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:262144/manual_time                         +1.4075         +1.2739        642044       1545708        645552       1467902
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:2097152/manual_time                        +1.5675         +1.3780       4802319      12329804       4841549      11513158
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:16777216/manual_time                       +2.5281         +2.1106      41281665     145644040      43198529     134375000
[BenchSort< vs. BenchStableSortIndices<]Vec3f_32>/Size:134217728/manual_time                      +3.4223         +3.2222     344951900    1525493700     351562500    1484375000
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:32768/manual_time                    +0.1645         +0.1555         52150         60728         54629         63126
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:262144/manual_time                   +0.3834         +0.3505        348924        482704        357001        482118
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:2097152/manual_time                  +0.4472         +0.4355       2595829       3756711       2633427       3780242
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:16777216/manual_time                 +1.1373         +0.9539      23601279      50442883      25323276      49479167
[BenchSort< vs. BenchStableSortIndices<]Pair<I32, F64>>/Size:134217728/manual_time                +2.1148         +2.0000     209363667     652128100     218750000     656250000

Benchmark was run on TBB compiled with MSVC2015

Running bin\BenchmarkDeviceAdapter.exe
Run on (8 X 2592 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
VTK-m Device State:
 - Device 1 (Serial): Enabled=0
 - Device 2 (Cuda): Enabled=0
 - Device 3 (TBB): Enabled=1
 - Device 4 (OpenMP): Enabled=0
 - Device 5 (InvalidDeviceId): Enabled=0
 - Device 6 (InvalidDeviceId): Enabled=0
 - Device 7 (InvalidDeviceId): Enabled=0