Connection errors on spread tests

I have seen similar errors in different tests, related to connection timeouts and connection refused. I would like to hear some ideas about how to deal with this kind of error.

These errors are sporadic and they happen in different tests, so they are really hard to reproduce.

Here are some examples taken from the CI logs:

snap install test-snapd-tools
error: cannot perform the following tasks:
- Download snap "test-snapd-tools" (6) from channel "stable" (Get https://public.apps.ubuntu.com/anon/download-snap/eFe8BTR5L5V9F7yHeMAPxkEr2NdUXMtw_6.snap: dial tcp 162.213.33.92:443: i/o timeout)

snap refresh --revision=x1 config-versions
error: cannot refresh "config-versions": cannot refresh snap-declaration for
"core": Get
  https://assertions.ubuntu.com/v1/assertions/snap-declaration/16/99T7MUlRhtI3U0QFgl5mXXESAiSwt776?max-format=2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Reboot of linode:ubuntu-core-16-64 (Spread-24) is taking a while...
Error preparing linode:ubuntu-core-16-64:tests/main/ : kill-timeout reached, cannot reconnect to linode:ubuntu-core-16-64 (Spread-24) after reboot: dial tcp 45.79.186.250:22: i/o timeout

+ echo 'Install multiple snaps from the store'
Install multiple snaps from the store
+ snap install test-snapd-tools test-snapd-control-consumer
error: cannot install ["test-snapd-tools" "test-snapd-control-consumer"]: Get
      https://search.apps.ubuntu.com/api/v1/snaps/details/test-snapd-tools?channel=stable&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement%2Cchannel_maps_list:
      dial tcp: lookup search.apps.ubuntu.com on [::1]:53: read udp
      [::1]:41138->[::1]:53: read: connection refused

error: cannot install "jq": Get
      https://search.apps.ubuntu.com/api/v1/snaps/details/jq?channel=stable&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement%2Cchannel_maps_list:
      dial tcp: lookup search.apps.ubuntu.com on [::1]:53: read udp
      [::1]:36948->[::1]:53: read: connection refused

Let’s pick one to dig into:

error: cannot install "jq": Get
      https://search.apps.ubuntu.com/api/v1/snaps/details/jq?channel=stable&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement%2Cchannel_maps_list:
      dial tcp: lookup search.apps.ubuntu.com on [::1]:53: read udp
      [::1]:36948->[::1]:53: read: connection refused

That looks like a DNS error while attempting to look for details of a given snap. We should be retrying on those already, but I’m not sure about that particular error.

Every time we get a failure, we log to point out how many times we retried. What do the logs say around this error?
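To make the "retry and log the attempt count" idea concrete, here is a rough sketch in Python. It is only an illustration, not snapd's actual retry code: the names `TRANSIENT_MARKERS`, `is_transient` and `run_with_retries` are hypothetical, and the list of retryable messages is an assumption built from the error strings quoted in this thread.

```python
import logging
import time

# Substrings copied from the CI logs quoted in this thread. Treating these
# (and only these) as retryable is an assumption made for illustration.
TRANSIENT_MARKERS = (
    "i/o timeout",
    "connection refused",
    "Client.Timeout exceeded",
    "unexpected EOF",
)


def is_transient(message):
    """Return True if the error text looks like a transient network failure."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


def run_with_retries(operation, attempts=3, delay=1.0,
                     log=logging.getLogger("retry")):
    """Call operation(), retrying transient failures and logging each attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            retryable = is_transient(str(err))
            # Log the attempt number so the logs show how often we retried.
            log.warning("attempt %d/%d failed (retryable=%s): %s",
                        attempt, attempts, retryable, err)
            if not retryable or attempt == attempts:
                raise
            time.sleep(delay)
```

With something like this, a DNS failure such as the `read: connection refused` above would show up in the logs as a numbered attempt, which answers the question of whether a retry happened at all.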

Well, I lost the reference to that issue, but I am looking at a timeout that happens in the test linode:debian-unstable-64:tests/main/interfaces-openvswitch, taken from https://travis-ci.org/snapcore/snapd/builds/235118401

The test tries to download a snap of about 8MB.
Here is the debug info: https://paste.ubuntu.com/24647332/

The problem I see is that the download starts, and after 7.x minutes it gets an unexpected EOF and retries because of that. During the second attempt it hits the timeout, 5 minutes after that attempt started.

Is it a problem related to the CDN? Perhaps it would be useful to have network traffic information, or something else, to detect connectivity issues on the nodes.

That’s why I’ve been repeating: please don’t try to address every single problem at once. It won’t work. We need to take one single issue and dig deep into it until it’s solved. Small but solid steps forward.

If it’s downloading for 7 minutes, that’s a real timeout which we cannot work around. Yes, it might be a problem in the CDN, or anywhere else in the network. It shouldn’t take that much time. If we have more similar evidence, let’s investigate further.

To make this task easier and to be able to reproduce errors, I created a pull request for test re-executions on spread.

https://github.com/snapcore/spread/pull/29

I saw some econnreset test failures in the logs, and I think they are caused by a real issue.

I executed this one and here is the log: https://paste.ubuntu.com/24670490. This is the task for that log: https://paste.ubuntu.com/24670497

So the snap download command is not retrying under some conditions, and as a result it leaves no debug information in the file snap-download.log. I also checked the syslog and there is no information about a second attempt (see log https://paste.ubuntu.com/24670567/).

I am not sure what the real cause of this problem is, but it happens approximately once every 50 runs.
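As a back-of-the-envelope estimate of how many re-executions it takes to catch a failure like this, here is a small Python calculation. It assumes each run fails independently with probability 1/50, which is only an approximation of the rate observed above; the helper names `chance_of_repro` and `runs_needed` are made up for this example.

```python
import math


def chance_of_repro(failure_rate, runs):
    """Probability of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of runs whose chance of at least one failure reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    rate = 1 / 50  # roughly one failure every 50 runs, as observed above
    print("chance of a failure in 50 runs:", round(chance_of_repro(rate, 50), 2))
    print("runs needed for a 95% chance:", runs_needed(rate, 0.95))
```

Under that assumption, 50 re-executions reproduce the failure only about 64% of the time, and it takes about 149 runs to get a 95% chance of seeing it at least once.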