CUDA: Support CUDA_SEPARABLE_COMPILATION on Clang
- [ ] Refactor the handling of the relocatable device code flag so that Clang can use `-fgpu-rdc` instead of the NVCC-specific `-dc`.
- [ ] Implement device linking. See the [Bazel implementation](https://github.com/tensorflow/tensorflow/blob/ed371aa5d266222c799a7192e438cdd8c00464fe/third_party/nccl/build_defs.bzl.tpl) in TensorFlow for reference.
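
For context, here is a minimal sketch of the kind of project this should make work. The target and file names (`kernels`, `app`, `kernels.cu`, `main.cu`) are hypothetical, and it assumes CMake is configured with Clang as the CUDA compiler, for example via `-DCMAKE_CUDA_COMPILER=clang++`. Until both items above are done, this setup is not expected to build with Clang.

```cmake
cmake_minimum_required(VERSION 3.18)
project(rdc_example LANGUAGES CXX CUDA)

# Hypothetical library whose device functions are called from another translation unit.
add_library(kernels STATIC kernels.cu)
set_target_properties(kernels PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

# Hypothetical executable that calls device code defined in `kernels`,
# so it needs relocatable device code and a device link step.
add_executable(app main.cu)
set_target_properties(app PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(app PRIVATE kernels)
```

With NVCC, the property above results in `-dc` on the compile lines. The first task would have Clang receive `-fgpu-rdc` in the same place, and the second would add the device link step that resolves cross-object device symbols.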