Don't require CUDA_LAUNCH_BLOCKING for Kokkos Cuda backend
- The code now works without CUDA_LAUNCH_BLOCKING set; explicit synchronization is used where required.
- The code has also been modified to use thread-specific memory spaces, which for Kokkos' Cuda backend means per-thread streams.