Cpu parallel radix sort
Parallel radix sorting will be invoked in DeviceAdapterAlgorthmTBB.h when the input is ArrayHandle<T, vtkm::cont::StorageTagBasic> where T is one of the following basic C++ types:
unsigned int unsigned short int unsigned long int unsigned long long int unsigned char char16_t char32_t wchar_t char short int long long signed char float double
If a comparison operator is provided, it must be type std::less or std::greater.
Radix sort implementation is Satish parallel radix sort as documented in the following citation:
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. In Proc. SIGMOD, pages 351–362, 2010
Implementation is based on Takuya Akiba's GitHub source code with the following changes:
- Changed parallel threading from OpenMP to TBB tasks
- Removed pair sorting
- Added minimum threshold for parallel, will instead invoke serial radix sort (kxsort)
- Added std::greater and std::less to interface for descending order sorts
- Added can_use_parallel_radix_sort<T, F>() function to determine if parallel radix sorting is possible for type T and compare function F (fallback is std::sort() if not possible)
- Added linear scaling of threads used by the algorithm for more stable performance on machines with lots of available threads (KNL and Haswell)
Added kxsort (serial MSD radix sort by Dinghua Li via GitHub) implementation without modification.