CUDA kernels now execute for multiple values in a strided manner.

Instead of launching a large number of kernel instances, we re-use each kernel instance across multiple values using a strided iteration pattern.

The compelling reason for this change is that it lets us re-use the worklet infrastructure that is created for each thread, and therefore reduces constant overhead costs.
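For illustration, the strided pattern described above can be sketched as a grid-stride loop. This is a minimal sketch and not the code from this merge request: `WorkletState`, `StridedKernel`, `RunStrided`, and the fixed launch size of 4 blocks of 128 threads are all hypothetical names and values chosen for the example. The point it shows is that the per-thread setup happens once, while the loop body runs once per value assigned to that thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the per-thread worklet infrastructure.
struct WorkletState
{
  float scale;
};

// Each thread handles indices start, start + stride, start + 2*stride, ...
__global__ void StridedKernel(const float* in, float* out, int numValues)
{
  // Per-thread setup: paid once per thread, not once per value.
  WorkletState state{ 2.0f };

  const int start = blockIdx.x * blockDim.x + threadIdx.x;
  const int stride = gridDim.x * blockDim.x; // total threads in the launch

  for (int i = start; i < numValues; i += stride)
  {
    out[i] = state.scale * in[i];
  }
}

// Host helper (hypothetical): runs the kernel with a fixed-size grid,
// independent of the number of values. Error checking is omitted for brevity.
std::vector<float> RunStrided(const std::vector<float>& input)
{
  const int n = static_cast<int>(input.size());
  const size_t bytes = input.size() * sizeof(float);
  std::vector<float> output(input.size());

  float* dIn = nullptr;
  float* dOut = nullptr;
  cudaMalloc(&dIn, bytes);
  cudaMalloc(&dOut, bytes);
  cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice);

  // 4 blocks of 128 threads = 512 threads, however many values there are.
  StridedKernel<<<4, 128>>>(dIn, dOut, n);

  // This blocking copy runs after the kernel on the default stream.
  cudaMemcpy(output.data(), dOut, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dOut);
  return output;
}
```

With 10,000 input values and the 512 threads assumed here, each thread processes 19 or 20 values, so the setup at the top of the kernel is amortized over that many iterations instead of being repeated for every value.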
