Chasing unreliable tests

As some of you know I’ve been chasing unreliable tests that cause us a lot of grief on both Travis and autopackage. There are diverse causes as one might expect. From the top of my head I can think of:

  • connectivity errors to GitHub or other sites we obtain code from
  • connectivity problems to the snap store
  • tests that rely on timing and run on a more busy machine
  • tests that rely on ordering of elements in a map
  • tests that rely on special beahavior only true while testing (e.g. configuration) that is removed by another test by accident
  • spread fails to allocate machines on time and we run over 49 minutes that Travis permits.

I will be trying to address those speparately and you are invited to join the effort. I will follow up with some initial findings.

2 Likes

One of the more common failures is TestEnsureLoopPrune in the overlord package. I tried fixing it with a naive approach (give it x10 more time) but that simply fails each time. You can find my attempts here https://github.com/snapcore/snapd/pull/3140 I will try to get an idea of what can be done about this test, can we mock the time system somehow so that it is no longer reliant on real system time?

@zyga-snapd something like this takes also into account whether we have gone through the loop cycles:

need a bit more complicated variant for the 2nd test

1 Like

Thanks, I’ll study that to understand the state engine better. I’ll update the PR to include your changes.

if you look there, each time we go through the loop in Overlord.Loop (either because of the pruneInterval or the ensureInterval), we call the Ensure method of each manager, so we will call waitForPrune

@mvo this test seems to fail often timing out at least in the autopkgtests:

autopkgtest:ubuntu-16.10-amd64:tests/main/refresh-core-with-hanging-configure-hook

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-yakkety-snappy-dev-image/yakkety/amd64/s/snapd/20170404_174712_9cd68@/log.gz

That’s a great initiative, @zyga-snapd! Thanks for putting it forward.

@zyga-snapd thanks a lot for putting this together! :heart:

WRT to spread-cron https://travis-ci.org/snapcore/spread-cron/branches we have currently 2 scenarios consistently failing:

I also wanted to mention a really elusive, flaky test that hits us eventually in snapd’s suite, sometimes tests/main/refresh-delta-from-core fails like in this execution: Ubuntu Pastebin

According to this part of the log maybe a retry is involved:

Apr 05 09:11:26 ubuntu /usr/lib/snapd/snapd[25130]: daemon.go:176: DEBUG: uid=0;@ GET /v2/snaps?snaps=test-snapd-delta-refresh 788.216µs 200
Apr 05 09:11:26 ubuntu /usr/lib/snapd/snapd[25130]: retry.go:47: DEBUG: Retrying https://assertions.ubuntu.com/v1/assertions/snap-declaration/16/99T7MUlRhtI3U0QFgl5mXXESAiSwt776?max-format=2, attempt 1, elapsed time=12.365µs
Apr 05 09:11:27 ubuntu /usr/lib/snapd/snapd[25130]: daemon.go:176: DEBUG: uid=0;@ GET /v2/snaps?snaps=test-snapd-delta-refresh 685.31µs 200

Cheers!

We just found one more reason why tests may fail. The tab completion test times out after five seconds. In log file https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-snappy-dev-image/xenial/amd64/s/snapd/20170406_130441_94744@/log.gz we can see this fragment:

Apr 06 12:44:13 autopkgtest snapd[5022]: 2017/04/06 12:44:13.124517 retry.go:59: DEBUG: The retry loop for https://search.apps.ubuntu.com/api/v1/snaps/search?confinement=strict%2Cclassic&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement&name=test- finished after 1 retries, elapsed time=5.353650942s, status: 200

This tells us that the store took over five seconds to do a search. This has caused tests/main/completion to fail.

Can we somehow trace what happened at the store at that particular time to see why this occurred?

A bunch of “read: connection refused” on a recent execution after the publication of the core snap to edge http://paste.ubuntu.com/24427104/, for instance:

+ snap install test-snapd-tools --channel=edge
error: cannot install "test-snapd-tools": Get https://search.apps.ubuntu.com/api/v1/snaps/details/test-snapd-tools?channel=edge&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement: dial tcp: lookup search.apps.ubuntu.com on [::1]:53: read udp [::1]:33216->[::1]:53: read: connection refused
-----

10 more read: connection refused errors this morning after the amd64 core snap publication to edge: https://travis-ci.org/snapcore/spread-cron/builds/225067795

This is caused by https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1659195, this is now reverted again and tests should be back to normal with the next core image build.

1 Like

@fgimenez this test (new?) seems to be flaky

  • linode:ubuntu-core-16-64:tests/main/failover:emptyinitrd

I’m seeing it failing often on my unrelated PRs

@fgimenez it’s failing with:

ERROR cannot replace signed kernel snap with an unasserted one

it’s probably because of the landing of the parametrizing branch, we don’t let install sideloaded kernels over non-sideloaded ones,

we need to disable this again until we think what to do

I’m disabling it in one of my branches that are to be merged

Thanks @pedronis, we could move this one to the nightly suite and execute by setting a kernel channel different from the default one (so that kernel is sideloaded), wdyt? However not sure how to extract only this variant without repeating all the logic, maybe move the logic to a library?

this one seems also flaky:

Error executing autopkgtest:ubuntu-17.04-amd64:tests/main/econnreset

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-zesty-snappy-dev-image/zesty/amd64/s/snapd/20170426_092408_9dd14@/log.gz

Aha, this one is interesting, if you look at the failure message:

+ grep -q 'Retrying.*download-snap/.*\.snap, attempt 1' snap-download.log
grep: + su -c '/usr/bin/env SNAPD_DEBUG=1 snap download core 2>snap-download.log' test
snap-download.log: No such file or directory

It seems to suggest that snap-download.log does not yet exit by the time grep runs. The test uses command & to run stuff in the background, I wonder if that also means the opening of the log file happens in the background. I’ll have a look.