Speed up detecting a CUDA 3+ GPU running CUDA 2 code.
The original implementation tried to run 2^31 kernels and checked for a launch failure to detect this case. The problem with this approach is that on a CUDA 3+ GPU this takes multiple seconds and causes the GPU to terminate the kernel when OpenGL is also loaded.
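
For illustration only, here is a minimal sketch of one way to make this kind of check without launching any work. It is not necessarily what this change does. It assumes "CUDA 3+" and "CUDA 2" refer to compute capability 3.x and 2.x, and that the code was built with PTX for the older target. The names probe_kernel and running_legacy_code_on_newer_gpu are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe kernel. It is never launched, only inspected.
__global__ void probe_kernel() {}

// Returns 1 if the current device has compute capability 3+ while the
// kernel's PTX targets an older architecture, 0 if not, and -1 if a
// runtime query fails. (Assumption: this is the case being detected.)
int running_legacy_code_on_newer_gpu() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, probe_kernel) != cudaSuccess) return -1;

    // ptxVersion is encoded as 10 * major + minor (for example, 20 for 2.0).
    const bool device_is_3_plus = prop.major >= 3;
    const bool code_is_pre_3 = attr.ptxVersion < 30;
    return (device_is_3_plus && code_is_pre_3) ? 1 : 0;
}

int main() {
    const int result = running_legacy_code_on_newer_gpu();
    if (result < 0) {
        std::printf("CUDA runtime query failed\n");
        return 1;
    }
    std::printf("legacy code on newer GPU: %s\n", result ? "yes" : "no");
    return 0;
}
```

Both queries only read metadata, so nothing is queued on the GPU and a display watchdog has no kernel to terminate.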