# Investigate ReduceByKey scaling on TBB
The TBB implementation of `ReduceByKey` (RBK) performs about the same as the serial version and has proven especially difficult to improve. This is a very commonly used algorithm, and it would be great to find a way to make it scale well. This issue exists to collect information as we test different approaches.
## Performance Measurements
The RBK algorithm was benchmarked using the `BenchmarkDeviceAdapter` executables built by VTKm. The results were compared using the `devAlgoBenchSummary.py` script, located under `Utilities/Scripts`.
The comparison of serial to TBB algorithms can be found in [1].
## Memory bandwidth
It seems probable that we are saturating memory bandwidth, even in the serial implementation (the current TBB specialization essentially distributes the input data across threads that each run the same basic algorithm as the serial version). If memory bandwidth limits both implementations, that would explain why the algorithm does not scale.
## Comparison with Unique
The `Unique` algorithm is very similar to the `ReduceByKey` implementation, but what's noteworthy here is that `Unique` does seem to scale in most cases, albeit poorly (see [1]). This further supports the idea that memory bandwidth is the limiting factor. `Unique` scales ok-ish on `renar` until we start dealing with `ValueType`s >= 128 bits, at which point it slows down below serial performance.
Since `ReduceByKey` iterates through two arrays (instead of just one, as in `Unique`), it will require more bandwidth for most `ValueType`s. The RBK `KeyType` is always a 64-bit integer in these tests, and we see scaling behavior similar to `Unique` when the combined key + value size is considered.
## Bandwidth Analysis
Running a bandwidth benchmark tool [2] on `renar` produces the following graph:
### Test Machine Architecture
Tests are performed on the `renar` dashboard machine, which has dual 6-core + HT processors, for a total of 12 physical and 24 logical cores. Each physical core has its own 32 kB L1 and 256 kB L2 cache, and each processor has a 1.5 MB L3 cache that is shared between all 6 of its physical cores. All caches use 64-byte cache lines.
### Peak Bandwidth
The benchmark provides us with roughly these peak bandwidth limits for various working set sizes:
Cache | Working Set Size | Peak Read Bandwidth (64-bit sequential) | Peak Write Bandwidth (64-bit sequential) | Peak Write Bandwidth (64-bit random) |
---|---|---|---|---|
L1 | < 32 kB | ~45 GB/s | ~23 GB/s | ~23 GB/s |
L2 | 32 kB - 256 kB | ~40 GB/s | ~20 GB/s | ~8 GB/s |
L3 | 256 kB - 16 MB | ~28 GB/s | ~18 GB/s | ~2 GB/s |
(None) | > 16 MB | ~11 GB/s | ~8 GB/s | < 1 GB/s |
### In-Situ Bandwidth
The following table estimates the volume of data read/written during a benchmark using 64-bit keys and values in TBB's RBK implementation. This is used to estimate the average bandwidth achieved by the benchmarks. Note that the table below only considers the reads/writes that occur during the reduction operations -- the copies performed while joining the partitioned outputs are not included.
Benchmark | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Num inputs | 20971520 | 20971520 | 20971520 | 20971520 | 20971520 | 20971520 |
Num outputs | 1048576 | 2097152 | 3145728 | 4194304 | 5242880 | 6291456 |
Key size (bytes) | 8 | 8 | 8 | 8 | 8 | 8 |
Value size (bytes) | 8 | 8 | 8 | 8 | 8 | 8 |
Total Read (MB) | 320 | 320 | 320 | 320 | 320 | 320 |
Total Written (MB) | 16 | 32 | 48 | 64 | 80 | 96 |
Serial Read Bandwidth (GB/s) | 8.95 | 9.92 | 9.52 | 10.32 | 6.86 | 6.16 |
Parallel Read Bandwidth (GB/s) | 14.41 | 10.50 | 6.81 | 5.68 | 4.71 | 4.05 |
Serial Write Bandwidth (GB/s) | 0.45 | 0.99 | 1.43 | 2.06 | 1.72 | 1.85 |
Parallel Write Bandwidth (GB/s) | 0.72 | 1.05 | 1.02 | 1.14 | 1.18 | 1.22 |
Serial Walltime (s) | 0.034912 | 0.031505 | 0.032826 | 0.030277 | 0.045547 | 0.050726 |
Parallel Walltime (s) | 0.021685 | 0.029759 | 0.045871 | 0.055052 | 0.066393 | 0.077100 |
TBB speedup | 1.61 | 1.06 | 0.72 | 0.55 | 0.69 | 0.66 |
Some observations:
- All of the benchmarks exceed the 16 MB working set size that can be efficiently served by the L3 cache.
- In some cases, the parallel read bandwidth exceeds the expected peak bandwidth. This may be due to the dual processor configuration effectively doubling the L3 cache size (The peak performance benchmark only considers a single thread).
- Care has been taken to avoid false sharing of reduction objects.
- Per-thread input/output buffers are contiguous memory blocks of at least 1024 8-byte elements -- there should not be cache contention, or even much page overlap, between concurrent threads.
- Prefetching input data will be very efficient, as reads are frequent and clearly sequential.
- Writing the reduced keys/values seems to be limiting performance:
  - The read bandwidth decreases as the number of outputs increases.
  - The write bandwidth is mostly constant, at around 1 GB/s.
  - This value is consistent with the benchmarked peak write speeds:
    - The write behavior is mostly sequential, but there are still far more reads than writes, which will cause (at least) L1 eviction of output cache lines between writes.
    - Prefetching output cache lines will be difficult for the processor, since writes are relatively rare. Further, it takes 8 writes to fill a cache line, so the prefetch-triggering cache misses will be infrequent enough that they may not be detected.
    - Thus we can expect the write behavior to fall somewhere between sequential and random; 1 GB/s sounds about right.
## Profiling Results
Running the benchmarks through the `perf` utility provides us with some more insights:
Serial:

```
 24,798,952,930      cycle_activity.stalls_ldm_pending
 67,297,066,543      cycles
 26,667,854,013      L1-dcache-loads
  3,630,530,716      L1-dcache-load-misses    # 13.61% of all L1-dcache hits
  1,893,052,538      LLC-loads
    250,913,161      LLC-load-misses          # 13.25% of all LL-cache hits
```

TBB:

```
Performance counter stats for 'bin/BenchmarkDeviceAdapter_TBB profile':

402,632,398,634      cycle_activity.stalls_ldm_pending
583,907,867,363      cycles
 52,700,267,798      L1-dcache-loads
  5,914,061,176      L1-dcache-load-misses    # 11.22% of all L1-dcache hits
  2,726,463,006      LLC-loads
    551,400,590      LLC-load-misses          # 20.22% of all LL-cache hits
```
- `cycle_activity.stalls_ldm_pending` is the number of cycles lost waiting on memory accesses (in particular, L3-cache misses).
  - Serial wastes ~36% of all cycles waiting on data from main memory, while TBB wastes ~69%.
- `L1-dcache-load-misses` is the number of times the L1 data cache misses.
  - Serial misses 14% of L1 dcache loads, while TBB misses 11%.
- `LLC-load-misses` is the number of times the unified L3 cache misses.
  - This nearly doubles from 13% to 20% when the algorithm is parallelized.
- Per-instruction profiling shows that the cache misses and stalled cycles occur where expected: reading/writing during reduction, and during the join copying.
## Open questions
- Benchmark using implicit arrays to remove bandwidth concerns -- does scaling improve?
- Can explicit prefetching improve cache performance for writes? Can we bring write bandwidth closer to the sequential peak?
- The written values are never used again until the join occurs. Can we use non-temporal streaming writes to improve performance by freeing cache?
- Is there a way to 'pin' the output cache lines in L1 to prevent recurring misses as input data evicts them?