CI Debugging
Kitware's CI setup has a number of moving parts and some of them can cause failures at inopportune times. This wiki aims to document the various failures that can occur and how to resolve them.
Windows
A number of issues have been seen on Windows.
OpenGL not working
If OpenGL contexts are not being created at all, this means that OpenGL access is being blocked because gitlab-runner runs as a Session 0 service. This has been worked around by downgrading the NVIDIA driver to 11.0.3_451.82_win10.
Unable to remove build/… files
Windows refuses to remove files that are opened by other processes. Unfortunately, gitlab-runner and ctest are not the best at making sure that process trees actually exit when told to stop a test or CI job. These files can be left open. The fix is to either:
- kill the processes that are holding the file(s) open (not easy to discover); or
- restart the machine (reliable).
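Before resorting to a reboot, it can help to know exactly which files Windows refuses to delete. The following is a minimal sketch, not part of the CI scripts: the function name is hypothetical, and it only reports which paths fail to delete, not which process holds them open. Note that it really does delete everything it can under the given directory.

```python
import os


def remove_tree_reporting_failures(root):
    """Try to delete everything under root.

    Returns a list of (path, error message) pairs for entries that could
    not be removed. A directory that still contains an undeletable file
    will also show up in the list, because it cannot be removed either.
    """
    failures = []
    # Walk bottom-up so that directories are visited after their contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.remove(path)
            except OSError as exc:
                failures.append((path, str(exc)))
        for name in dirnames:
            path = os.path.join(dirpath, name)
            try:
                os.rmdir(path)
            except OSError as exc:
                failures.append((path, str(exc)))
    # Finally try the top-level directory itself.
    try:
        os.rmdir(root)
    except OSError as exc:
        failures.append((root, str(exc)))
    return failures
```

An empty result means the tree was removed completely; any file entries in the result are the ones still held open (or otherwise undeletable), which narrows down what to look for when hunting the offending process.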
Unable to remove .gitlab/qt/… files
Qt's default packages come with some odd permissions that git clean does not know how to deal with. This is normally fixed in the .gitlab/ci/download_qt.cmake file after extracting everything, but if a job is canceled in the middle of this script, the files with the odd permissions may linger.
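If such files linger, one possible manual cleanup is to make the tree writable again before removing it. The sketch below assumes that the "odd permissions" amount to a missing write bit (the read-only attribute on Windows); this page does not say so, and it does not show what download_qt.cmake actually does, so treat this as an illustration only. The function name is hypothetical.

```python
import os
import stat


def make_tree_writable(root):
    """Add the owner-write bit to every file and directory under root.

    Returns the number of entries whose mode was changed. Assumption: the
    problematic permissions are simply a missing write bit.
    """
    changed = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Leave symbolic links alone; chmod would follow them.
                continue
            mode = os.stat(path).st_mode
            if not mode & stat.S_IWUSR:
                os.chmod(path, mode | stat.S_IWUSR)
                changed += 1
    return changed
```

After running this on the affected directory, a normal cleanup should have a better chance of succeeding, provided the assumption about the permissions holds.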
macOS
Memory issues
Some builds (mainly VTK and ParaView) will sometimes fail in CI and CDash will show only a "warning":
*** WARNING non-zero return value in ctest from: /.../.gitlab/cmake/CMake.app/Contents/bin/cmake
This almost certainly means that a compilation rule ended up being killed because it used up too much memory. This can be verified by downloading the compile_output.log from the job's artifacts and looking for compiler runs being "killed" by the kernel. Restart these jobs and hope the memory usage isn't so great next time.
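Searching a large log by hand is tedious, so the check can be scripted. This is a minimal sketch that assumes the relevant lines contain the word "killed" in some capitalization; the exact message format depends on the compiler and platform and is not specified here. The function name is hypothetical.

```python
import re

# Assumption: lines of interest contain the standalone word "killed".
_KILLED = re.compile(r"\bkilled\b", re.IGNORECASE)


def find_killed_lines(log_text):
    """Return (line number, stripped line) pairs for lines mentioning "killed"."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if _KILLED.search(line):
            hits.append((number, line.strip()))
    return hits
```

To use it, read the downloaded compile_output.log into a string (for example with `open(path, errors="replace").read()`) and pass it to `find_killed_lines`. An empty result suggests the failure has another cause.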
Linux
Memory issues
See the macOS section.
Docker error messages
Failure to cleanup volumes
Sometimes a job will fail with a "Failed to cleanup volumes" message at the end of the output. This is not a fatal error and can be safely ignored.
Could not contact docker.sock
Usually seen on dovim. Jobs which fail with this message should just be restarted. Once dovim is reinstalled to match the setup used on other machines, it should no longer occur.
X failures
Sometimes the X server can fall over and needs to be restarted. The host machine should just be restarted in this case. Usually this is seen as a failure to connect to X at all.
No AMD device
AMDGPU machines require that specific devices be injected into the Docker containers. This should be done at machine setup time, but may need to be added to older AMDGPU-using machines.
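A quick way to tell whether a container is missing its devices is to check for the device nodes from inside it. The sketch below assumes the nodes are /dev/kfd and /dev/dri, which are the ones commonly passed through for AMD GPUs; this page does not name the devices, so adjust the list to match the actual machine setup. The names used here are hypothetical.

```python
import os

# Assumption: these are the device nodes the containers need.
# The page itself does not list them.
ASSUMED_AMD_DEVICES = ("/dev/kfd", "/dev/dri")


def missing_devices(paths=ASSUMED_AMD_DEVICES):
    """Return the paths from the given list that do not exist here."""
    return [path for path in paths if not os.path.exists(path)]
```

Running `missing_devices()` inside a job's container and getting a non-empty list would indicate that the runner on that machine is not injecting the devices.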