CUDA kernels now execute for multiple values in a strided manner.
Instead of launching a large number of kernel instances, we re-use each kernel instance for multiple values using a stride iteration pattern.
The compelling reason for this change is that it allows us to re-use the worklet infrastructure that is created for each thread, which reduces the constant overhead costs.
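The stride iteration pattern can be illustrated with a minimal, self-contained sketch. This is not the actual scheduling code; `ScaleWorklet`, `StridedKernel`, `RunStrided`, and the launch configuration of 2 blocks of 64 threads are all hypothetical names and values chosen for illustration. The point it shows is that the per-thread state is set up once and then re-used for every index the thread visits.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for a worklet: state that is set up once per thread.
struct ScaleWorklet
{
  float factor;
  __device__ float operator()(float x) const { return factor * x; }
};

// Each thread builds its worklet once, then walks the input with a stride
// equal to the total number of threads in the grid.
__global__ void StridedKernel(const float* in, float* out, int n, float factor)
{
  ScaleWorklet worklet{ factor }; // constant setup cost, paid once per thread

  int start = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  for (int i = start; i < n; i += stride)
  {
    out[i] = worklet(in[i]); // same worklet instance re-used for many values
  }
}

// Hypothetical host helper: copies the input to the device, launches a small
// fixed-size grid regardless of n, and copies the result back.
std::vector<float> RunStrided(const std::vector<float>& input, float factor)
{
  int n = static_cast<int>(input.size());
  std::vector<float> output(input.size(), 0.0f);
  if (n == 0)
  {
    return output;
  }

  float* dIn = nullptr;
  float* dOut = nullptr;
  cudaMalloc(&dIn, n * sizeof(float));
  cudaMalloc(&dOut, n * sizeof(float));
  cudaMemcpy(dIn, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  // 128 threads in total; with n > 128 each thread handles several values.
  StridedKernel<<<2, 64>>>(dIn, dOut, n, factor);

  // A blocking copy on the default stream waits for the kernel to finish.
  cudaMemcpy(output.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dOut);
  return output;
}

int main()
{
  std::vector<float> input(1000, 1.5f);
  std::vector<float> result = RunStrided(input, 2.0f);

  int mismatches = 0;
  for (float v : result)
  {
    if (v != 3.0f)
    {
      ++mismatches;
    }
  }
  std::printf("mismatches: %d\n", mismatches);
  return mismatches == 0 ? 0 : 1;
}
```

In this sketch the grid size is fixed and independent of the input length, so the number of worklet setups stays constant while the work per thread grows with the input. Error checking on the CUDA calls is omitted for brevity.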