CI Debugging

Kitware's CI setup has a number of moving parts and some of them can cause failures at inopportune times. This wiki aims to document the various failures that can occur and how to resolve them.

Windows

Windows has a number of issues that have been seen.

OpenGL not working

If OpenGL contexts are not being created at all, context creation is being blocked because gitlab-runner runs as a Session 0 service. This has been worked around by downgrading the NVIDIA driver to 11.0.3_451.82_win10.

Unable to remove build/… files

Windows refuses to remove files that are held open by other processes. Unfortunately, gitlab-runner and ctest do not reliably ensure that process trees actually exit when a test or CI job is told to stop, so lingering processes can leave these files open. The fix is to either:

  • kill the processes that are holding the file(s) open (not easy to discover); or
  • restart the machine (reliable).

Unable to remove .gitlab/qt/… files

Qt's default packages come with some odd permissions that git clean does not know how to deal with. This is normally fixed in the .gitlab/ci/download_qt.cmake file after extracting everything, but if a job is canceled in the middle of this script, the files with the odd permissions may linger.

macOS

Memory issues

Some builds (mainly VTK and ParaView) will sometimes fail in CI and CDash will show only a "warning":

*** WARNING non-zero return value in ctest from: /.../.gitlab/cmake/CMake.app/Contents/bin/cmake

This almost certainly means that a compilation rule ended up being killed because it used up too much memory. This can be verified by downloading the compile_output.log from the job's artifacts and looking for compiler runs being "killed" by the kernel. Restart these jobs and hope the memory usage isn't so great next time.
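The log check described above can be sketched as a small script. This is a minimal sketch, not part of the CI setup: the function names are hypothetical, and it assumes that a killed compiler run leaves the word "killed" (in any capitalization) somewhere on a line of compile_output.log. The exact message depends on the compiler and the OS.

```python
import re
from pathlib import Path

# Assumption: a compiler run killed by the kernel shows up as a log line
# containing the word "killed" (any capitalization). The exact wording
# varies by compiler and platform.
KILLED = re.compile(r"\bkilled\b", re.IGNORECASE)


def find_killed_lines(log_text):
    """Return (line_number, line) pairs for lines that mention "killed"."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if KILLED.search(line)
    ]


def scan_log(path="compile_output.log"):
    """Read a downloaded log file and return its "killed" lines."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    text = Path(path).read_text(errors="replace")
    return find_killed_lines(text)
```

After downloading the artifact, calling `scan_log("compile_output.log")` returns the matching lines with their line numbers. An empty result suggests the failure had a different cause.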

Linux

Memory issues

See the macOS section.

Docker error messages

Failure to cleanup volumes

Sometimes a job will fail with a "Failed to cleanup volumes" message at the end of the output. This is not a fatal error and can be safely ignored.

Could not contact docker.sock

Usually seen on dovim. Jobs which fail with this message should just be restarted. Once dovim is reinstalled to match the setup used on other machines, the error should no longer occur.

X failures

Sometimes the X server falls over, which usually shows up as a failure to connect to X at all. In this case, the host machine should just be restarted.

No AMD device

AMDGPU machines require that specific devices be injected into the Docker containers. This should be done at machine setup time, but may need to be added to older AMDGPU-using machines.
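One way to confirm whether the devices made it into a container is to check for the device nodes from inside a job. The sketch below is illustrative only. This page does not list which devices are required, so the paths `/dev/kfd` and `/dev/dri` are an assumption based on what AMD GPU containers commonly need, and `missing_devices` is a hypothetical helper name.

```python
import os

# Assumption: these are the device nodes commonly passed to containers
# for AMD GPUs. The wiki does not say which devices are required, so
# adjust this list to match the actual machine setup.
ASSUMED_AMD_DEVICES = ("/dev/kfd", "/dev/dri")


def missing_devices(paths=ASSUMED_AMD_DEVICES):
    """Return the subset of paths that do not exist in this environment."""
    return [path for path in paths if not os.path.exists(path)]
```

Running `missing_devices()` inside the container returns an empty list when all of the assumed nodes are present. Any path it returns is a candidate for the missing device injection.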