# Investigate ReduceByKey scaling on TBB
The TBB implementation of `ReduceByKey` (RBK) performs about the same as the serial version and has proven especially difficult to improve. This is a very commonly used algorithm, and it would be great to find a way to make it scale well. This issue exists to collect information as we test different approaches.
## Performance Measurements
The RBK algorithm was benchmarked using the `BenchmarkDeviceAdapter` executables built by VTKm. The results were compared using the `devAlgoBenchSummary.py` script, located under `Utilities/Scripts`.
The comparison of serial to TBB algorithms can be found in [1].
## Memory bandwidth
It seems probable that we are saturating memory bandwidth, even in the serial implementation (the current TBB specialization essentially distributes the input data across threads that each run the same basic algorithm as the serial version). If memory bandwidth limits both implementations, that would explain why the algorithm does not scale.
## Comparison with Unique
The `Unique` algorithm is very similar to the `ReduceByKey` implementation, but what's noteworthy here is that `Unique` does seem to scale in most cases, albeit poorly (see [1]). This further supports the idea that memory bandwidth is the limiting factor. `Unique` scales ok-ish on `renar` until we start dealing with `ValueType`s >= 128 bits, at which point it slows down below serial performance.
Since `ReduceByKey` iterates through two arrays (instead of just one, as in `Unique`), it will require more bandwidth for most `ValueType`s. The RBK `KeyType` is always a 64-bit integer in these tests, and we see scaling behavior similar to `Unique` when the combined key + value size is considered.
## Bandwidth Analysis
Running a bandwidth benchmark tool [2] on `renar` produces the following graph:
### Test Machine Architecture
Tests are performed on the `renar` dashboard machine, which has dual 6-core + HT processors, for a total of 12 physical and 24 logical cores. Each physical core has its own 32 kB L1 and 256 kB L2 cache, and each processor has a 1.5 MB L3 cache that is shared between all 6 of its physical cores. All caches use 64-byte cache lines.
### Peak Bandwidth
The benchmark provides us with roughly these peak bandwidth limits for various working set sizes:
Cache | Working Set Size | Peak Read Bandwidth (64-bit sequential) | Peak Write Bandwidth (64-bit sequential) | Peak Write Bandwidth (64-bit random) |
---|---|---|---|---|
L1 | < 32 kB | ~45 GB/s | ~23 GB/s | ~23 GB/s |
L2 | 32 kB - 256 kB | ~40 GB/s | ~20 GB/s | ~8 GB/s |
L3 | 256 kB - 16 MB | ~28 GB/s | ~18 GB/s | ~2 GB/s |
(None) | > 16 MB | ~11 GB/s | ~8 GB/s | < 1 GB/s |
### In-Situ Bandwidth
The following table estimates the volume of data read/written during a benchmark using 64-bit keys and values in TBB's RBK implementation. This is used to estimate the average bandwidth achieved by the benchmarks. Note that the table below only considers the reads/writes that occur during the reduction operations -- the copies performed while joining the partitioned outputs are not included.
Benchmark | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Num inputs | 20971520 | 20971520 | 20971520 | 20971520 | 20971520 | 20971520 |
Num outputs | 1048576 | 2097152 | 3145728 | 4194304 | 5242880 | 6291456 |
Key size (bytes) | 8 | 8 | 8 | 8 | 8 | 8 |
Value size (bytes) | 8 | 8 | 8 | 8 | 8 | 8 |
Total Read (MB) | 320 | 320 | 320 | 320 | 320 | 320 |
Total Written (MB) | 16 | 32 | 48 | 64 | 80 | 96 |
Serial Read Bandwidth (GB/s) | 8.95 | 9.92 | 9.52 | 10.32 | 6.86 | 6.16 |
Parallel Read Bandwidth (GB/s) | 14.41 | 10.50 | 6.81 | 5.68 | 4.71 | 4.05 |
Serial Write Bandwidth (GB/s) | 0.45 | 0.99 | 1.43 | 2.06 | 1.72 | 1.85 |
Parallel Write Bandwidth (GB/s) | 0.72 | 1.05 | 1.02 | 1.14 | 1.18 | 1.22 |
Serial Walltime (s) | 0.034912 | 0.031505 | 0.032826 | 0.030277 | 0.045547 | 0.050726 |
Parallel Walltime (s) | 0.021685 | 0.029759 | 0.045871 | 0.055052 | 0.066393 | 0.077100 |
TBB speedup | 1.61 | 1.06 | 0.72 | 0.55 | 0.69 | 0.66 |
Some observations:
- All of the benchmarks exceed the 16 MB working set size that can be efficiently served by the L3 cache.
- In some cases, the parallel read bandwidth exceeds the expected peak bandwidth. This may be due to the dual processor configuration effectively doubling the L3 cache size (The peak performance benchmark only considers a single thread).
- Care has been taken to avoid false sharing of reduction objects.
- Per-thread input/output buffers are contiguous memory blocks of at least 1024 8-byte elements -- there should not be cache contention, or even much page overlap, between concurrent threads.
- Prefetching input data will be very efficient, as reads are frequent and clearly sequential.
- Writing the reduced keys/values seems to be limiting performance:
  - The read bandwidth decreases as the number of outputs increases.
  - The write bandwidth is mostly constant, at around 1 GB/s.
  - This value is consistent with the benchmarked peak write speeds:
    - The write behavior is mostly sequential, but there are still far more reads than writes, which will cause (at least) L1 eviction of output cache lines between writes.
    - Prefetching output cache lines will be difficult for the processor, since writes are relatively rare. Further, it takes 8 writes to fill a cache line, so the prefetch-triggering cache misses will be infrequent enough that they may not be detected.
    - Thus we can expect the write behavior to fall somewhere between sequential and random; 1 GB/s sounds about right.
## Profiling Results
Running the benchmarks through the `perf` utility provides us with some more insights:
Serial:

```
 24,798,952,930      cycle_activity.stalls_ldm_pending
 67,297,066,543      cycles
 26,667,854,013      L1-dcache-loads
  3,630,530,716      L1-dcache-load-misses    # 13.61% of all L1-dcache hits
  1,893,052,538      LLC-loads
    250,913,161      LLC-load-misses          # 13.25% of all LL-cache hits
```

TBB:

```
Performance counter stats for 'bin/BenchmarkDeviceAdapter_TBB profile':

402,632,398,634      cycle_activity.stalls_ldm_pending
583,907,867,363      cycles
 52,700,267,798      L1-dcache-loads
  5,914,061,176      L1-dcache-load-misses    # 11.22% of all L1-dcache hits
  2,726,463,006      LLC-loads
    551,400,590      LLC-load-misses          # 20.22% of all LL-cache hits
```
- `cycle_activity.stalls_ldm_pending` is the number of cycles lost waiting on memory accesses (in particular, L3-cache misses).
  - Serial wastes ~36% of all cycles waiting on data from main memory, while TBB wastes ~69%.
- `L1-dcache-load-misses` is the number of times the L1 data cache misses.
  - Serial misses 14% of L1 dcache loads, while TBB misses 11%.
- `LLC-load-misses` is the number of times the unified L3 cache misses.
  - This nearly doubles from 13% to 20% when the algorithm is parallelized.
- Per-instruction profiling shows that the cache misses and stalled cycles occur where expected: reading/writing during reduction, and during the join copying.
## Open questions
- Benchmark using implicit arrays to remove bandwidth concerns -- does scaling improve?
- Can explicit prefetching improve cache performance for writes? Can we bring write bandwidth closer to the sequential peak?
- The written values are never used again until the join occurs. Can we use non-temporal streaming writes to improve performance by freeing cache?
- Is there a way to 'pin' the output cache lines in L1 to prevent recurring misses as input data evicts them?