Ascent build fails randomly
The following discussion from !2974 (merged) should be addressed:
- @kmorel started a discussion: (+3 comments)
  I find that when I make changes to core classes that force a rebuild of the CUDA code on ascent, as yours does, it takes about 3 restarts to complete the compile.
That's possible. Last week I was working on two MRs that both required long compiles on ascent. I got the impression that they were interfering with each other, and they were probably interfering with yours as well.
This is possible. We can have up to 2 jobs running concurrently, since the scheduler on Ascent only lets each user submit two LSF jobs at a time.
My theory was that, due to the high number of jobs, the build node was probably running out of memory, which was manifesting as issues with the distributed file system.
Most probably, the problems started a couple of months ago when I increased the concurrency level to use all of the cores on the Ascent machine. I will follow up with a merge request that reduces the parallelism level.
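
To make the memory theory concrete, here is a minimal sketch of how a parallelism level could be chosen so that it is bounded by memory as well as by core count. This is not the actual CI configuration: the function name `pick_build_jobs`, the 4 GiB per compile job figure, and the default of two concurrent builds (taken from the two-LSF-jobs-per-user limit mentioned above) are assumptions for illustration only.

```python
import os


def pick_build_jobs(cores, avail_mem_gib, mem_per_job_gib=4.0, concurrent_builds=2):
    """Return a parallel job count bounded by both CPU and memory.

    mem_per_job_gib is a hypothetical estimate of the peak memory used by
    one CUDA compile job; it should be measured, not trusted.
    """
    # Split the node between builds that may run at the same time.
    cpu_limit = cores // concurrent_builds
    mem_limit = int(avail_mem_gib / concurrent_builds // mem_per_job_gib)
    # Always allow at least one job so the build can make progress.
    return max(1, min(cpu_limit, mem_limit))


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # Total physical memory as reported by the OS (Linux sysconf names).
    mem_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    print(pick_build_jobs(cores, mem_gib))
```

The result could then be passed as the `-j` value to the build tool in place of the full core count. If the memory bound turns out to be the smaller of the two, that would support the theory that using every core was exhausting memory on the build node.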