Idea to reduce CUDA compile times
The main issue with CUDA compile times is that a single compiler thread expands all the templates for a given filter. Despite all the available parallelism on a node (say 36 cores), compiling a filter composed of many worklets is a huge serial bottleneck. If we can separate the worklets into their own cpp files, the build system can compile them in parallel, reducing overall build time.
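One way this split could look is with explicit template instantiation. This is only a rough sketch: `RunWorklet`, `SquareWorklet`, `NegateWorklet`, and the file names in the comments are hypothetical stand-ins, not actual library code, and everything is shown in one listing for brevity. In a real build each marked section would live in its own file so it becomes its own translation unit.

```cpp
#include <cstddef>
#include <vector>

// --- run_worklet.h (hypothetical shared header) ---------------------------
// Stand-in for the heavily templated dispatch code that is expensive to expand.
template <typename Worklet>
std::vector<float> RunWorklet(const std::vector<float>& in)
{
  std::vector<float> out(in.size());
  Worklet worklet;
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    out[i] = worklet(in[i]);
  }
  return out;
}

// Two hypothetical worklets used by the same filter.
struct SquareWorklet
{
  float operator()(float x) const { return x * x; }
};

struct NegateWorklet
{
  float operator()(float x) const { return -x; }
};

// Explicit instantiation declarations: any file that includes the header
// relies on an instantiation provided elsewhere instead of expanding the
// template itself.
extern template std::vector<float> RunWorklet<SquareWorklet>(const std::vector<float>&);
extern template std::vector<float> RunWorklet<NegateWorklet>(const std::vector<float>&);

// --- square_worklet.cpp (hypothetical, compiled on its own) ---------------
// Explicit instantiation definition: the template is expanded here only.
template std::vector<float> RunWorklet<SquareWorklet>(const std::vector<float>&);

// --- negate_worklet.cpp (hypothetical, compiled on its own) ---------------
template std::vector<float> RunWorklet<NegateWorklet>(const std::vector<float>&);
```

With one instantiation per source file, a parallel build (for example `make -j` or ninja) can compile the worklets concurrently, and the filter's own file only has to link against them. For the CUDA backend the per-worklet files would presumably need to be compiled as CUDA sources; that part is not shown here.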
I am writing this because I am currently tracking down CUDA-only linking warnings, and each iteration takes 10+ minutes for a single filter. I know others feel this pain too.