I have been reviewing the errors that are making builds fail in CI. To do this, I went through all the builds that failed in the last 12 hours and analyzed why each one failed.
Interestingly, most of the errors are caused by connection/timeout issues that are not related to snapd itself. I also found several other issues, which I am working on.
Here is a breakdown of the errors I saw in the last 12 hours: https://paste.ubuntu.com/24637373/
My concern is whether we could do something else, apart from fixing the test failures that can be reproduced, to improve productivity by reducing the time we spend reviewing Travis logs only to conclude that there is no bug and retrigger the tests.
So while I/we fix errors in the snapd tests, I would like to look at different alternatives for dealing with this kind of issue. Here are some options that came up:
- Leave it as it is now, with the dev re-executing the build when the failure is caused by a third-party error
- Automatically retry the failed tests and only count as errors the tests that have failed x consecutive times
- Make a wrapper in the tests to retry the most common operations that give us problems, e.g. downloading a snap
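To make the wrapper option more concrete, here is a rough sketch of what a retry helper could look like. It is written in Python purely for illustration (the real tests might want a shell helper instead), and the name `retry`, its parameters and its defaults are all hypothetical, not something that exists in snapd today.

```python
import subprocess
import time


def retry(cmd, attempts=3, delay=5.0, run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0 or the attempts are used up.

    Returns 0 on success, otherwise the exit code of the last attempt.
    `run` and `sleep` are parameters only so the helper can be tested
    without touching the network or actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return 0
        # Wait before the next try, but not after the final failure.
        if attempt < attempts:
            sleep(delay)
    return result.returncode
```

A test would then call something like `retry(["snap", "download", "hello-world"], attempts=3, delay=10)` instead of running the command directly, and only fail if every attempt failed. The same loop is essentially the second option too, just applied to a single operation rather than to a whole test.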
I would like to hear other opinions on this and work out which approach is the best way to deal with it.