CUDA kernels now execute for multiple values in a strided manner.

Instead of launching a large number of kernel instances, we now re-use each
kernel instance for multiple values using a strided iteration pattern.

The compelling reason for this change is that it allows us to re-use the worklet
infrastructure that is created for each thread, thereby reducing constant
overhead costs.
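
As a rough illustration of the pattern described above, the following is a
minimal sketch of a strided kernel, not the actual implementation. The names
`Worklet`, `strided_scale` and `run_strided_scale`, as well as the launch
configuration of 4 blocks of 64 threads, are assumptions made up for this
example. Each thread builds its per-thread state once and then re-uses it for
every value it visits.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the per-thread worklet infrastructure.
struct Worklet
{
  float factor;
  __device__ float operator()(float x) const { return factor * x; }
};

__global__ void strided_scale(const float* in, float* out, int n, float factor)
{
  // Constructed once per thread, then re-used for every value below.
  Worklet worklet{factor};

  const int start = blockIdx.x * blockDim.x + threadIdx.x;
  const int stride = gridDim.x * blockDim.x; // total threads in the launch

  // Each thread handles start, start + stride, start + 2 * stride, ...
  for (int i = start; i < n; i += stride)
  {
    out[i] = worklet(in[i]);
  }
}

// Host helper (assumed for illustration; error checking omitted for brevity).
std::vector<float> run_strided_scale(const std::vector<float>& host_in, float factor)
{
  const int n = static_cast<int>(host_in.size());
  std::vector<float> host_out(host_in.size());
  if (n == 0)
  {
    return host_out;
  }

  const size_t bytes = host_in.size() * sizeof(float);
  float* in = nullptr;
  float* out = nullptr;
  cudaMalloc(&in, bytes);
  cudaMalloc(&out, bytes);
  cudaMemcpy(in, host_in.data(), bytes, cudaMemcpyHostToDevice);

  // Fixed-size launch: 256 threads in total, regardless of n.
  strided_scale<<<4, 64>>>(in, out, n, factor);

  // Blocking copy; waits for the kernel on the default stream to finish.
  cudaMemcpy(host_out.data(), out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(in);
  cudaFree(out);
  return host_out;
}
```

With this launch configuration, an input of 1000 values is covered by 256
threads, each of which processes three or four values while constructing its
`Worklet` only once.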