As some of you know, I’ve been chasing unreliable tests that cause us a lot of grief on both Travis and autopackage. There are diverse causes, as one might expect. Off the top of my head I can think of:
connectivity errors to GitHub or other sites we obtain code from
connectivity problems to the snap store
tests that rely on timing and run on a more busy machine
tests that rely on ordering of elements in a map
tests that rely on special behavior only true while testing (e.g. configuration) that is removed by another test by accident
spread fails to allocate machines on time and we run over the 49 minutes that Travis permits
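For the map-ordering case, the usual fix is to stop depending on Go's map iteration order, which is deliberately unspecified, and to sort the keys before comparing or printing. A minimal sketch, assuming a plain `map[string]int`; the `sortedKeys` helper and the sample data are hypothetical and not taken from snapd:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys is a hypothetical helper: it returns the keys of m in
// ascending order so callers can iterate deterministically.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Illustrative data only.
	m := map[string]int{"core": 1, "snapd": 2, "kernel": 3}
	// Ranging over m directly may yield a different order on each run;
	// ranging over the sorted keys does not.
	for _, k := range sortedKeys(m) {
		fmt.Println(k, m[k])
	}
}
```

A test that asserts on the result of `sortedKeys` (or on a sorted slice built from the map) no longer depends on iteration order.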
I will be trying to address those separately and you are invited to join the effort. I will follow up with some initial findings.
One of the more common failures is TestEnsureLoopPrune in the overlord package. I tried fixing it with a naive approach (giving it 10x more time), but it still fails each time. You can find my attempts here: https://github.com/snapcore/snapd/pull/3140. I will try to get an idea of what can be done about this test. Can we mock the time system somehow so that it no longer relies on real system time?
If you look there, each time we go through the loop in Overlord.Loop (whether triggered by pruneInterval or ensureInterval), we call the Ensure method of each manager, so we end up calling waitForPrune.
I also wanted to mention a really elusive flaky test that hits us every now and then in snapd’s suite: sometimes tests/main/refresh-delta-from-core fails, as in this execution: Ubuntu Pastebin
According to this part of the log maybe a retry is involved:
There were a bunch of “read: connection refused” errors in a recent execution after the publication of the core snap to edge (http://paste.ubuntu.com/24427104/), for instance:
Thanks @pedronis, we could move this one to the nightly suite and execute it by setting a kernel channel different from the default one (so that the kernel is sideloaded), wdyt? However, I’m not sure how to extract only this variant without repeating all the logic; maybe move the logic to a library?
Aha, this one is interesting, if you look at the failure message:
+ grep -q 'Retrying.*download-snap/.*\.snap, attempt 1' snap-download.log
grep: + su -c '/usr/bin/env SNAPD_DEBUG=1 snap download core 2>snap-download.log' test
snap-download.log: No such file or directory
It seems to suggest that snap-download.log does not yet exist by the time grep runs. The test uses `command &` to run stuff in the background; I wonder if that also means the opening of the log file happens in the background. I’ll have a look.