CUDA: Support for Relocatable Device Code Across Shared Libraries

As demonstrated in https://github.com/celeritas-project/CudaRdcUtils, it is possible (albeit in an unusual way) to extend add_library, target_link_libraries, etc. to properly support a set of interconnected libraries containing CUDA Relocatable Device Code.

The main challenge in using (and thus supporting) this feature is the particular behavior/implementation of nvlink. Calling nvlink ... -dlink ${object_files} -o device_link.o is an essential step that generates a set of stub functions to bootstrap the device code. However, due to implementation details of nvlink -dlink (it involves weak symbols that are not guaranteed to be unique), a process can load exactly one result (output) of nvlink -dlink.

When trying to reuse (shared) libraries containing RDC code, one cannot 'just' bundle the output of nvlink -dlink for that library inside the library. When the result of nvlink -dlink is bundled inside the library, that library can be used as-is by an executable that does not have its own RDC code and does not use any other libraries containing RDC code (in that case the executable does not need to run an nvlink step). However, if one tries to mix this library, with its bundled nvlink output, with any other libraries containing RDC code or with an executable that has its own RDC code, the attempt will fail because of clashes between the several nvlink outputs loaded in the same process.
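For context, this "bundled" variant corresponds to what stock CMake produces today when device symbol resolution is enabled on a shared library; a minimal example (assuming a project() with the CUDA language enabled):

```cmake
# Stock CMake: the nvlink -dlink output is bundled into the shared library.
# This is safe only if no other dlink output ends up in the same process.
add_library(foo SHARED foo.cu)
set_target_properties(foo PROPERTIES
  CUDA_SEPARABLE_COMPILATION  ON   # compile with -dc (relocatable device code)
  CUDA_RESOLVE_DEVICE_SYMBOLS ON)  # run the device link step and embed its output
```

Load two such libraries into one process, or combine one with an executable that performs its own device link, and the duplicated nvlink stubs clash as described above.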

The solution that we came up with is to extend add_library and target_link_libraries to split a shared library that contains RDC code into 3 separate files (and 3 separate but connected CMake targets):

  • A shared library containing just the regular object files (this is what the user would want to use at run time, to allow memory reuse and to speed up loading)
  • A static library containing those same object files (this is passed to nvlink for dependent libraries and executables; nvlink requires all the object files and does not accept the shared library for this purpose)
  • A "final" shared library containing both the regular objects and the output of nvlink -dlink; this library can be used directly in cases where the dependent/using code does not provide any other RDC code.

In order to hide this complexity, we extended add_library to detect the case of a library containing RDC code, to then generate all 3 targets, and to add cross references in the targets' properties so that, given one of them, the others can be found.
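To make the split concrete, here is a minimal sketch in plain CMake of the three targets such a wrapper could generate for a library foo built from foo.cu. The target names and the CUDA_RDC_* properties are illustrative, not the actual CudaRdcUtils implementation:

```cmake
# Compile the RDC sources once into an object library.
add_library(foo_objects OBJECT foo.cu)
set_target_properties(foo_objects PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON    # compile with -dc
  POSITION_INDEPENDENT_CODE  ON)

# 1. Shared library with just the regular object files (run-time use).
add_library(foo SHARED $<TARGET_OBJECTS:foo_objects>)

# 2. Static library with the same objects (input for dependents' nvlink step).
add_library(foo_static STATIC $<TARGET_OBJECTS:foo_objects>)

# 3. "Final" shared library that also bundles the nvlink -dlink output.
add_library(foo_final SHARED $<TARGET_OBJECTS:foo_objects>)
set_target_properties(foo_final PROPERTIES
  CUDA_RESOLVE_DEVICE_SYMBOLS ON)  # ask CMake to run the device link step

# Cross references so each target can be found from the others
# (CUDA_RDC_* are made-up custom property names).
set_target_properties(foo PROPERTIES
  CUDA_RDC_STATIC_TARGET foo_static
  CUDA_RDC_FINAL_TARGET  foo_final)
```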

target_link_libraries was extended to determine which of the 3 libraries to actually connect to the target, and how (see the sketch after this list):

  • If the target does not contain RDC code:
    • If there is only one RDC library among the arguments, then use the "final library" version.
    • If more than one RDC library is added, directly or indirectly, to the link libraries of the target, then transform the target into an RDC library (or executable) that needs to run its own nvlink -dlink, which must be passed the static libraries corresponding to all the RDC libraries in the dependency tree.
  • If the target contains RDC code:
    • Add the static libraries corresponding to all the RDC libraries directly or indirectly in the dependency tree to the device link step (to be passed to nvlink -dlink).
  • Add regular libraries as usual.
  • Also properly propagate the RDC object libraries (without the nvlink output) as needed.
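As a rough illustration of that decision logic, here is a sketch that ignores PUBLIC/PRIVATE keywords, transitive dependencies, and several corner cases the real script has to handle; all cuda_rdc_* and CUDA_RDC_* names are hypothetical:

```cmake
function(cuda_rdc_target_link_libraries target)
  set(rdc_deps "")
  foreach(dep IN LISTS ARGN)
    # An RDC library is recognized by the cross-reference property set earlier.
    if(TARGET ${dep})
      get_target_property(static_tgt ${dep} CUDA_RDC_STATIC_TARGET)
    else()
      set(static_tgt FALSE)
    endif()
    if(static_tgt)
      list(APPEND rdc_deps ${dep})
    else()
      target_link_libraries(${target} PUBLIC ${dep})  # regular library: as usual
    endif()
  endforeach()

  get_target_property(target_is_rdc ${target} CUDA_RDC_STATIC_TARGET)
  list(LENGTH rdc_deps n_rdc)
  if(NOT target_is_rdc AND n_rdc EQUAL 1)
    # Exactly one RDC dependency and no RDC code of our own:
    # the pre-device-linked "final" library can be used directly.
    get_target_property(final_tgt ${rdc_deps} CUDA_RDC_FINAL_TARGET)
    target_link_libraries(${target} PUBLIC ${final_tgt})
  elseif(n_rdc GREATER 0)
    # The target must run its own device link, fed by the static
    # variants of every RDC library in its dependency tree.
    set_target_properties(${target} PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
    foreach(dep IN LISTS rdc_deps)
      get_target_property(static_tgt ${dep} CUDA_RDC_STATIC_TARGET)
      target_link_libraries(${target} PRIVATE ${static_tgt})
    endforeach()
  endif()
endfunction()
```

The real script additionally walks indirect dependencies and propagates the RDC object libraries without the nvlink output, which this sketch glosses over.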

Other utility functions, for example target_include_directories, were also extended to automatically apply the operation to the 3 targets as needed.
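For instance, a wrapper for target_include_directories can simply forward to whichever of the three sibling targets exist (again with hypothetical naming):

```cmake
function(cuda_rdc_target_include_directories target)
  # Apply the same include directories to all three sibling targets.
  foreach(tgt IN ITEMS ${target} ${target}_static ${target}_final)
    if(TARGET ${tgt})
      target_include_directories(${tgt} ${ARGN})
    endif()
  endforeach()
endfunction()
```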

The script at https://github.com/celeritas-project/CudaRdcUtils is working well enough for a set of medium-size projects (4 separate packages producing half a dozen libraries and many executables, e.g. tests).

However, besides awkwardly requiring users to replace add_library and co. with other routines, it has some hard limitations (see https://discourse.cmake.org/t/detecting-if-a-target-has-cuda-sources-with-separable-compilation/13767/6) that most likely require the support to be embedded (or at least also handled) in CMake's generation phase.

It would be fantastic if this support could be added to CMake proper 😄

Thanks, Philippe.
