Chasing unreliable tests

zyga-snapd · April 5, 2017, 9:26am

As some of you know I’ve been chasing unreliable tests that cause us a lot of grief on both Travis and autopackage. There are diverse causes as one might expect. From the top of my head I can think of:

connectivity errors to GitHub or other sites we obtain code from
connectivity problems to the snap store
tests that rely on timing and run on a more busy machine
tests that rely on ordering of elements in a map
tests that rely on special beahavior only true while testing (e.g. configuration) that is removed by another test by accident
spread fails to allocate machines on time and we run over 49 minutes that Travis permits.

I will be trying to address those speparately and you are invited to join the effort. I will follow up with some initial findings.

zyga-snapd · April 5, 2017, 9:40am

One of the more common failures is TestEnsureLoopPrune in the overlord package. I tried fixing it with a naive approach (give it x10 more time) but that simply fails each time. You can find my attempts here https://github.com/snapcore/snapd/pull/3140 I will try to get an idea of what can be done about this test, can we mock the time system somehow so that it is no longer reliant on real system time?

pedronis · April 5, 2017, 10:05am

@zyga-snapd something like this takes also into account whether we have gone through the loop cycles:

need a bit more complicated variant for the 2nd test

zyga-snapd · April 5, 2017, 10:10am

Thanks, I’ll study that to understand the state engine better. I’ll update the PR to include your changes.

pedronis · April 5, 2017, 10:16am

if you look there, each time we go through the loop in Overlord.Loop (either because of the pruneInterval or the ensureInterval), we call the Ensure method of each manager, so we will call waitForPrune

pedronis · April 5, 2017, 1:51pm

@mvo this test seems to fail often timing out at least in the autopkgtests:

autopkgtest:ubuntu-16.10-amd64:tests/main/refresh-core-with-hanging-configure-hook

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-yakkety-snappy-dev-image/yakkety/amd64/s/snapd/20170404_174712_9cd68@/log.gz

niemeyer · April 6, 2017, 1:36am

That’s a great initiative, @zyga-snapd! Thanks for putting it forward.

fgimenez · April 7, 2017, 7:43am

@zyga-snapd thanks a lot for putting this together!

WRT to spread-cron https://travis-ci.org/snapcore/spread-cron/branches we have currently 2 scenarios consistently failing:

snapd-reexec-2.21-vs-edge, the test is triggered after each core publication to edge and sets up the suite to download snapd 2.21, install an unmodified core snap from edge, this is done here https://github.com/snapcore/spread-cron/blob/snapd-reexec-2.21-vs-edge/run-checks and this is the last execution https://travis-ci.org/snapcore/spread-cron/builds/219549331
snapd-nightly-suite: this scenario runs the nightly suite in from snapd project, is executed daily this is the last failed execution https://travis-ci.org/snapcore/spread-cron/builds/219525601 and will be hopefully fixed once https://github.com/snapcore/snapd/pull/3104 lands.

I also wanted to mention a really elusive, flaky test that hits us eventually in snapd’s suite, sometimes tests/main/refresh-delta-from-core fails like in this execution: Ubuntu Pastebin

According to this part of the log maybe a retry is involved:

Apr 05 09:11:26 ubuntu /usr/lib/snapd/snapd[25130]: daemon.go:176: DEBUG: uid=0;@ GET /v2/snaps?snaps=test-snapd-delta-refresh 788.216Âµs 200
Apr 05 09:11:26 ubuntu /usr/lib/snapd/snapd[25130]: retry.go:47: DEBUG: Retrying https://assertions.ubuntu.com/v1/assertions/snap-declaration/16/99T7MUlRhtI3U0QFgl5mXXESAiSwt776?max-format=2, attempt 1, elapsed time=12.365Âµs
Apr 05 09:11:27 ubuntu /usr/lib/snapd/snapd[25130]: daemon.go:176: DEBUG: uid=0;@ GET /v2/snaps?snaps=test-snapd-delta-refresh 685.31Âµs 200

Cheers!

zyga-snapd · April 10, 2017, 11:15am

We just found one more reason why tests may fail. The tab completion test times out after five seconds. In log file https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-snappy-dev-image/xenial/amd64/s/snapd/20170406_130441_94744@/log.gz we can see this fragment:

Apr 06 12:44:13 autopkgtest snapd[5022]: 2017/04/06 12:44:13.124517 retry.go:59: DEBUG: The retry loop for https://search.apps.ubuntu.com/api/v1/snaps/search?confinement=strict%2Cclassic&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement&name=test- finished after 1 retries, elapsed time=5.353650942s, status: 200

This tells us that the store took over five seconds to do a search. This has caused tests/main/completion to fail.

Can we somehow trace what happened at the store at that particular time to see why this occurred?

fgimenez · April 21, 2017, 2:47pm

A bunch of “read: connection refused” on a recent execution after the publication of the core snap to edge http://paste.ubuntu.com/24427104/, for instance:

+ snap install test-snapd-tools --channel=edge
error: cannot install "test-snapd-tools": Get https://search.apps.ubuntu.com/api/v1/snaps/details/test-snapd-tools?channel=edge&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement: dial tcp: lookup search.apps.ubuntu.com on [::1]:53: read udp [::1]:33216->[::1]:53: read: connection refused
-----

fgimenez · April 24, 2017, 7:10am

10 more read: connection refused errors this morning after the amd64 core snap publication to edge: https://travis-ci.org/snapcore/spread-cron/builds/225067795

mvo · April 24, 2017, 7:56am

This is caused by https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1659195, this is now reverted again and tests should be back to normal with the next core image build.

pedronis · April 24, 2017, 8:29pm

@fgimenez this test (new?) seems to be flaky

linode:ubuntu-core-16-64:tests/main/failover:emptyinitrd

I’m seeing it failing often on my unrelated PRs

pedronis · April 24, 2017, 8:39pm

@fgimenez it’s failing with:

ERROR cannot replace signed kernel snap with an unasserted one

it’s probably because of the landing of the parametrizing branch, we don’t let install sideloaded kernels over non-sideloaded ones,

we need to disable this again until we think what to do

pedronis · April 24, 2017, 8:44pm

I’m disabling it in one of my branches that are to be merged

fgimenez · April 25, 2017, 7:17am

Thanks @pedronis, we could move this one to the nightly suite and execute by setting a kernel channel different from the default one (so that kernel is sideloaded), wdyt? However not sure how to extract only this variant without repeating all the logic, maybe move the logic to a library?

pedronis · April 26, 2017, 9:39am

this one seems also flaky:

Error executing autopkgtest:ubuntu-17.04-amd64:tests/main/econnreset

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-zesty-snappy-dev-image/zesty/amd64/s/snapd/20170426_092408_9dd14@/log.gz

zyga-snapd · April 26, 2017, 11:23am

Aha, this one is interesting, if you look at the failure message:

+ grep -q 'Retrying.*download-snap/.*\.snap, attempt 1' snap-download.log
grep: + su -c '/usr/bin/env SNAPD_DEBUG=1 snap download core 2>snap-download.log' test
snap-download.log: No such file or directory

It seems to suggest that snap-download.log does not yet exit by the time grep runs. The test uses command & to run stuff in the background, I wonder if that also means the opening of the log file happens in the background. I’ll have a look.