Particle Advection performance limited by vcvtsi2ss

The dominant contribution to the runtime of the particle advection filter is now calls to vtkm::exec::CellLocatorUniformGrid::FindCell. The assembly reveals that this function is itself dominated by vcvtsi2ss instructions (scalar integer-to-float conversions):

Admittedly, I don't know whether much can be done about this, and it may simply indicate that I've reached the point of diminishing returns for optimizing RK4Integrator. In the past, though, I've managed to speed up this sort of operation with workarounds such as floating-point counters.
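To illustrate what I mean by a floating-point counter, here is a minimal sketch. It is not the FindCell code; the function names and the sampling loop are made up for the example. The idea is to carry a float alongside the integer loop index so the loop body does not need an integer-to-float conversion on every iteration. The float counter stays exact only while the index is below 2^24, so this would need care for very large grids.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical baseline: the integer index is converted to float on every
// iteration. On x86 a scalar conversion like this is typically emitted as
// cvtsi2ss / vcvtsi2ss.
std::vector<float> SampleWithConversion(float origin, float spacing, int n)
{
  assert(n >= 0);
  std::vector<float> out;
  out.reserve(static_cast<std::size_t>(n));
  for (int i = 0; i < n; ++i)
  {
    out.push_back(origin + spacing * static_cast<float>(i));
  }
  return out;
}

// Hypothetical workaround: keep a float counter in step with the integer
// one, replacing the per-iteration conversion with a float add.
// The counter is exact only while i < 2^24 (the float mantissa limit).
std::vector<float> SampleWithFloatCounter(float origin, float spacing, int n)
{
  assert(n >= 0);
  std::vector<float> out;
  out.reserve(static_cast<std::size_t>(n));
  float fi = 0.0f;
  for (int i = 0; i < n; ++i, fi += 1.0f)
  {
    out.push_back(origin + spacing * fi);
  }
  return out;
}
```

Whether anything like this applies inside FindCell depends on where the conversions actually come from, and I haven't measured it there. The float add also introduces a loop-carried dependency, so it is not guaranteed to be a win.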

Any ideas are welcome; I believe this could have broader impact throughout the library. Otherwise feel free to close.