Description
Why
Our CI auto builds sometimes fail for known reasons that are not related to the PRs we are trying to merge.
Most of the time, these errors are hard to understand and fix (or can't be fixed at all), which lowers the success rate of the auto builds for several weeks or months.
The impact is that we lose days of parallel compute time and hours of maintainers' time, since maintainers need to analyze the error message and reschedule the PRs in the merge queue.
Feature
We want to list the stderr output of the known issues we are aware of in the config.toml file that bootstrap uses. These patterns can be expressed as regexes.
We want bootstrap to retry cargo invocations up to two times if stderr matches one of the listed patterns.
This would help reduce the failure rate of our CI because it would significantly reduce the percentage of jobs failing due to spurious errors.
The error messages need to be precise enough to avoid retrying cargo invocations over genuine problems.
Known error patterns can be found here. Not all of them can be listed.
As a start, we could have just one stderr string in the list (this one doesn't need to be a regex):
ranlib.exe: could not create temporary file whilst writing archive: no more archived files
which is discussed in "CI: spurious error building LLVM on mingw: ranlib.exe: could not create temporary file whilst writing archive: no more archived files" (#108227).
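To make the proposed behavior concrete, here is a minimal sketch of the retry loop, assuming plain substring matching (which is enough for the ranlib string above) rather than regexes. The names `Outcome`, `is_known_spurious`, and `run_with_retries` are hypothetical and are not existing bootstrap APIs; the closure stands in for a real cargo invocation.

```rust
/// Hypothetical result of one command invocation: exit status plus captured stderr.
pub struct Outcome {
    pub success: bool,
    pub stderr: String,
}

/// Returns true if `stderr` contains any of the known spurious-failure patterns.
/// Substring matching only; regex support would need an extra crate.
pub fn is_known_spurious(stderr: &str, patterns: &[&str]) -> bool {
    patterns.iter().any(|p| stderr.contains(p))
}

/// Runs `invoke` once, then retries up to `max_retries` more times, but only
/// while the failure's stderr matches a known pattern.
/// Returns the final outcome and the total number of attempts made.
pub fn run_with_retries<F>(mut invoke: F, patterns: &[&str], max_retries: u32) -> (Outcome, u32)
where
    F: FnMut() -> Outcome,
{
    let mut attempts = 0;
    loop {
        let outcome = invoke();
        attempts += 1;
        let retries_used = attempts - 1;
        // Stop on success, when the retry budget is spent, or when the
        // failure is not a known spurious one (a genuine problem).
        if outcome.success
            || retries_used >= max_retries
            || !is_known_spurious(&outcome.stderr, patterns)
        {
            return (outcome, attempts);
        }
    }
}
```

With `max_retries` set to 2, a command that keeps failing with the ranlib message is attempted three times in total, while a failure whose stderr matches no pattern is reported after the first attempt.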
Questions
- Is config.toml the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file: retry-patterns.toml. I'll leave it to the bootstrap team to decide.
- Which format do we use to write the stderr patterns in the config.toml file? For example, it can be an array of strings. It could also be an "object" if we want to customize how many times to retry per error message. I'll leave it to the bootstrap team to decide.
- How do we make sure these patterns are present in the config.toml used for CI? I'm not familiar with how the config.toml for the CI is generated.
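
For the format question, the two options (a plain string, or an "object" with its own retry count) could be modeled in memory roughly as follows. This is only a sketch under assumptions: `RetryPattern`, `DEFAULT_MAX_RETRIES`, and `retries_for` are hypothetical names, matching is by substring rather than regex, and how the entries would be parsed from the config file is left open.

```rust
/// Hypothetical in-memory form of one configured entry, covering both
/// formats mentioned in the question above.
pub enum RetryPattern {
    /// Plain string entry: uses the default retry count.
    Simple(String),
    /// "Object" entry: carries its own retry count.
    Detailed { pattern: String, max_retries: u32 },
}

/// Default retry count, taken from the "up to two times" in the proposal.
pub const DEFAULT_MAX_RETRIES: u32 = 2;

/// Returns how many retries are allowed for this stderr output, or `None`
/// if no configured pattern matches (a genuine failure, so no retry).
/// The first matching entry wins.
pub fn retries_for(stderr: &str, patterns: &[RetryPattern]) -> Option<u32> {
    patterns.iter().find_map(|entry| match entry {
        RetryPattern::Simple(p) if stderr.contains(p.as_str()) => Some(DEFAULT_MAX_RETRIES),
        RetryPattern::Detailed { pattern, max_retries } if stderr.contains(pattern.as_str()) => {
            Some(*max_retries)
        }
        _ => None,
    })
}
```

An array of strings maps onto `Simple` entries only; the "object" form costs a little more configuration syntax but allows a tighter retry budget for errors that rarely recover.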