Big questions
See https://liw.fi/40/#index8h1
For whom are you building the software? Whose opinions about it matter?
- CI machine maintainers.
- Developers of projects using the CI cluster.
Why are you building the software?
To gain insight into the state of our CI cluster.
What should the software do, in broad strokes? Also, what should it not do?
It should collect information about the CI cluster and present it in a digestible form.
How should the software work, in broad strokes?
Three main components:
- Data collection. Collect information about CI from various sources and store it in a central location.
  - Machines
  - GitLab API
  - GitLab CI logs
  - CDash
- Dashboards. An up-to-date view of the system's status. Three main viewpoints:
  - Runner-focused
  - Project-focused
  - Schedule-focused (status of scheduled jobs)
- Analyze realtime status
  - Classify failed jobs based on the cause (e.g., machine problem, code problem, network issues, etc.)
  - For transient/machine problems, report, possibly mitigate if confidence is high enough (reboot machine, restart job)
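
The analysis component could be sketched roughly as follows. This is only an illustration, not a settled design: the rule table, the log patterns, the confidence values, the 0.85 threshold, and the mapping from cause to mitigation are all assumptions made up for the example, as are the names `classify` and `respond`.

```python
import re
from dataclasses import dataclass

# Hypothetical rules: (cause, log pattern, confidence). Real patterns and
# confidence values would have to come from observed CI logs.
RULES = [
    ("machine", re.compile(r"no space left on device|out of memory", re.I), 0.9),
    ("network", re.compile(r"connection (timed out|reset)|could not resolve host", re.I), 0.8),
    ("code", re.compile(r"error:|tests? failed", re.I), 0.6),
]

# Causes treated as transient/machine problems (assumption).
TRANSIENT = {"machine", "network"}

# Assumed mapping from cause to the mitigation named in the notes.
MITIGATION = {"machine": "reboot_machine", "network": "restart_job"}


@dataclass
class Classification:
    cause: str
    confidence: float


def classify(log_text: str) -> Classification:
    """Return the first matching cause; rules are checked in order."""
    for cause, pattern, confidence in RULES:
        if pattern.search(log_text):
            return Classification(cause, confidence)
    return Classification("unknown", 0.0)


def respond(result: Classification, threshold: float = 0.85) -> list[str]:
    """Report transient/machine problems; mitigate only above the threshold."""
    if result.cause not in TRANSIENT:
        return []
    actions = ["report"]
    if result.confidence >= threshold:
        actions.append(MITIGATION[result.cause])
    return actions
```

For example, `respond(classify("write error: No space left on device"))` yields a report plus a reboot under these assumed rules, while a DNS failure is only reported because its assumed confidence is below the threshold. Checking machine rules before code rules keeps a generic `error:` line from masking a machine problem.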
What's important and what is just nice to have?
Dashboards are critical. Analysis and response are nice-to-have.