Dealing with flaky tests

I reviewed the errors that are making builds fail in CI: I went through all the builds that failed in the last 12 hours and analyzed why each of them failed.
Something interesting I noticed is that most of the errors are caused by connection/timeout issues which are not related to snapd itself. I also detected several other issues, which I am working on.

Here is a detailed breakdown of the errors I saw in the last 12 hours: https://paste.ubuntu.com/24637373/

My concern is whether we could do something else, apart from fixing the tests that can be reproduced, to improve productivity by reducing the time we spend reviewing the Travis logs only to find out there is no bug and retrigger the tests.

So while I/we fix errors in the snapd tests, I would like to analyze different alternatives for dealing with this kind of issue. Here are some options that came up:

  1. Leave it as it is now, and the developer re-executes the build when the failure is caused by a third-party error.
  2. Automatically retry failed tests, and only consider a test an error once it has failed x consecutive times.
  3. Add a wrapper in the tests to retry the most common operations that give us problems, e.g. downloading a snap (see the sketch after this list).
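To make option 3 concrete, here is a minimal sketch of what such a wrapper could look like in Go. All of the names here (`runWithRetry` and its parameters) are hypothetical illustrations, not snapd's actual test helpers:

```go
// A hypothetical retry wrapper for flaky test operations (option 3).
// Names and signatures are made up for illustration.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runWithRetry runs a command up to attempts times, waiting delay
// between tries, and only reports an error once every attempt failed.
func runWithRetry(attempts int, delay time.Duration, name string, args ...string) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = exec.Command(name, args...).Run(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("%s failed after %d attempts: %v", name, attempts, err)
}

func main() {
	// e.g. retry a snap download, the operation mentioned above.
	if err := runWithRetry(3, 5*time.Second, "snap", "download", "core"); err != nil {
		fmt.Println(err)
	}
}
```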

I would like to hear other opinions on this and see which approach is the best way to deal with it.

Thanks

Those are integration tests, so failures will often not be related to snapd proper. That said, any piece of that overall stack that fails is a real failure, and we want to know about those as well, because the point of the suite is not just whether snapd works, but whether the whole thing works.

For that same reason, we cannot just retry blindly, because flakiness is often related to a very real bug that needs fixing. If we did that, we'd skip over errors, which are the whole point of having the suite in the first place.

That said, retrying specific operations is fine as long as we understand the exact reason why the error is happening, and implement the retrying for that particular circumstance. In fact, many months ago we fixed the vast majority of failure cases we had in the suite by retrying: it turns out the CDN we use provides error responses relatively often, and we changed snapd itself to retry in a controlled fashion in several cases. That fixed not only the tests, but also greatly improved the overall reliability of download operations.
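As an illustration of what "controlled" retrying means here, the sketch below retries only one known failure mode (transient 5xx responses from a CDN) and fails fast on everything else, so real bugs stay visible. It is a hedged example, not snapd's actual download code:

```go
// A sketch of controlled retrying: the retry is scoped to one
// understood failure (transient 5xx responses), not applied blindly.
// This is illustrative only, not snapd's download implementation.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetch retries only on server-side (5xx) responses, the kind of
// transient CDN failure described above. Any other error is returned
// immediately rather than masked by a retry.
func fetch(url string, attempts int) ([]byte, error) {
	var lastStatus int
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err // network-level errors are not retried here
		}
		if resp.StatusCode < 500 {
			data, err := io.ReadAll(resp.Body)
			resp.Body.Close()
			return data, err
		}
		lastStatus = resp.StatusCode
		resp.Body.Close()
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return nil, fmt.Errorf("giving up after %d attempts, last status %d", attempts, lastStatus)
}

func main() {
	data, err := fetch("https://example.com/some-snap", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("downloaded %d bytes\n", len(data))
}
```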

That’s the sort of thing we can do.

As we discussed online today, I’d recommend taking one test with one case that fails often, and trying to address it. Small but solid steps forward.

Good, I have already started applying the retry approach under specific conditions to the failing tests.