ArrayHandleZip performance implications

Many of the ByKey methods in DeviceAdapterAlgorithm implementations use a 'building block' approach, combining inputs with keys into an ArrayHandleZip and calling a simpler version of the algorithm.
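For readers unfamiliar with the pattern, here is a minimal sketch of the 'building block' idea using only the C++ standard library. It is not VTK-m code: `SortByKeySketch` is a hypothetical name, and unlike ArrayHandleZip (which is a lazy view over the two arrays) this sketch copies the keys and values into a temporary vector of pairs to keep the example short. It assumes both inputs have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration only: sort-by-key expressed as
// "zip the inputs, call the plain sort, unzip the result".
// Assumes keys.size() == values.size().
template <typename K, typename V>
void SortByKeySketch(std::vector<K>& keys, std::vector<V>& values)
{
  // Zip: combine each key with its value into one pair.
  std::vector<std::pair<K, V>> zipped;
  zipped.reserve(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i)
  {
    zipped.emplace_back(keys[i], values[i]);
  }

  // Simpler algorithm: an ordinary sort whose comparator looks only at the key.
  // The relative order of entries with equal keys is unspecified.
  std::sort(zipped.begin(), zipped.end(),
            [](const std::pair<K, V>& a, const std::pair<K, V>& b) {
              return a.first < b.first;
            });

  // Unzip: write the sorted pairs back into the two original arrays.
  for (std::size_t i = 0; i < zipped.size(); ++i)
  {
    keys[i] = zipped[i].first;
    values[i] = zipped[i].second;
  }
}
```

Calling `SortByKeySketch(keys, values)` on two equally sized vectors reorders both so that the keys are ascending and each value stays with its key.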

We've noticed significant overhead in these methods, possibly caused by the temporary vtkm::Pair objects created in ArrayPortalZip::Get and in the related WrappedBinaryOperator/ArrayPortalValueReference::Swap code paths.
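To make the suspected cost concrete, the sketch below shows why a zip-style portal has to build temporaries. It is a simplified stand-in written against the C++ standard library, not the actual ArrayPortalZip implementation: `ZipPortalSketch` and `SwapThroughPortal` are hypothetical names, and `std::pair` stands in for vtkm::Pair.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a zip portal: a non-owning view over two
// separate arrays that presents them as a single array of pairs.
template <typename K, typename V>
struct ZipPortalSketch
{
  std::vector<K>* Keys;
  std::vector<V>* Values;

  // No pair is stored anywhere, so Get cannot return a reference.
  // It must build a temporary pair from two separate loads.
  std::pair<K, V> Get(std::size_t index) const
  {
    return std::pair<K, V>((*Keys)[index], (*Values)[index]);
  }

  // Set takes the pair apart again and performs two separate stores.
  void Set(std::size_t index, const std::pair<K, V>& value) const
  {
    (*Keys)[index] = value.first;
    (*Values)[index] = value.second;
  }
};

// Swapping two elements through the portal costs two Gets and two Sets,
// each of which copies both the key and the value.
template <typename K, typename V>
void SwapThroughPortal(const ZipPortalSketch<K, V>& portal,
                       std::size_t a,
                       std::size_t b)
{
  const std::pair<K, V> tmpA = portal.Get(a);
  const std::pair<K, V> tmpB = portal.Get(b);
  portal.Set(a, tmpB);
  portal.Set(b, tmpA);
}
```

A sort performs many such swaps and comparisons, so if the compiler cannot elide these temporaries, the per-element cost is paid on every one of them. Whether this is the actual bottleneck would need to be confirmed by profiling.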

The following benchmark highlights this issue: BenchmarkZipArraySort.cxx

Test	Serial	TBB	CUDA
Sort Basic Array	1.864s	0.638s	0.457s
Sort Zip Array	2.884s	1.229s	2.390s
Sort Zip Array w/ KeyComparator	2.513s	1.167s	XXXXXX

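In these runs, sorting the zip array takes roughly 1.5x (Serial), 1.9x (TBB), and 5.2x (CUDA) as long as sorting the basic array. It may be worthwhile to look at either optimizing the zip portals or refactoring the algorithms to avoid them.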