During the past month I've been adding robustness to the download logic we have for the core snap, since failures there affect our snapcraft CI a lot. Today I added more logging to the retry logic, including where we've been redirected to. This is what we generally see with a 5-retry loop and a sleep(1) in between (a sketch of that loop follows the log):
Starting new HTTPS connection (1): search.apps.ubuntu.com
"GET /api/v1/snaps/details/core?channel=stable&fields=status%2Canon_download_url%2Cdownload_url%2Cdownload_sha3_384%2Cdownload_sha512%2Csnap_id%2Crevision%2Crelease HTTP/1.1" 200 573
Downloading core
Starting new HTTPS connection (1): public.apps.ubuntu.com
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Starting new HTTPS connection (1): 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:16Z&h=c1cda6c141defcbf04819467b55117e2a6863cff HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:16Z&h=c1cda6c141defcbf04819467b55117e2a6863cff
Error while downloading: ChunkedEncodingError(ProtocolError("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')),). Retries left to download: 5.
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Resetting dropped connection: 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:24Z&h=bf88e0517fb1390ed5671c7b0ebf0612b18a9d3e HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:24Z&h=bf88e0517fb1390ed5671c7b0ebf0612b18a9d3e
Error while downloading: ChunkedEncodingError(ProtocolError("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')),). Retries left to download: 4.
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Resetting dropped connection: 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:26Z&h=cbd0753f009f7d9ca8a7a4f18b1fd6032d2f04ae HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:26Z&h=cbd0753f009f7d9ca8a7a4f18b1fd6032d2f04ae
Error while downloading: ChunkedEncodingError(ProtocolError("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')),). Retries left to download: 3.
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Resetting dropped connection: 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:28Z&h=81eb231649cba42085e9a5ab0ded3b756f0c9881 HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:28Z&h=81eb231649cba42085e9a5ab0ded3b756f0c9881
error: cannot install "rocketchat-server": Get
https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=stable&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement%2Cchannel_maps_list:
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting
headers)
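For reference, here is a minimal sketch of the kind of retry loop described at the top of this report. It is not the actual snapcraft code; the function name, retry count and logging wording are illustrative, and it assumes the standard requests library:

import logging
import time

import requests

logger = logging.getLogger(__name__)


def download_with_retries(url, path, retries=5):
    # Illustrative retry loop: re-request the store URL, log where we were
    # redirected to, stream the body to disk, and retry on any failure.
    for attempt in range(retries, 0, -1):
        try:
            response = requests.get(url, stream=True)
            if response.history:
                redirects = ", ".join(
                    r.headers.get("Location", "?") for r in response.history)
                logger.debug("Redirections for %r: %s", url, redirects)
            response.raise_for_status()
            with open(path, "wb") as f:
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
            return
        except requests.exceptions.RequestException as error:
            logger.warning(
                "Error while downloading: %r. Retries left to download: %d.",
                error, attempt)
            time.sleep(1)
    raise RuntimeError("Download failed after {} retries".format(retries))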
Sergio - we definitely need to get to the bottom of this, but it will require a controlled repro where we can get a full tcpdump and gather more data on the network path, which CDN PoPs are involved, etc. I'll see if we can get someone to do some focused investigation soon.
Meanwhile, it looks like snapcraft doesn’t resume downloads but is starting fresh with each retry. Improving that would certainly help, especially for larger snaps. It also shouldn’t need to go back to the public.apps.u.c endpoint, but just retry the redirect URL.
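As a rough sketch of what resuming could look like, using an HTTP Range header with requests (illustrative only, not snapcraft's actual code, and it assumes the CDN honours Range requests):

import os

import requests


def download_resumable(url, path):
    # Resume from however many bytes are already on disk instead of
    # starting the whole download again on each retry.
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": "bytes={}-".format(offset)} if offset else {}
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()
    # 206 Partial Content means the range was honoured; a plain 200 means
    # the server sent the whole file again, so overwrite from the start.
    mode = "ab" if response.status_code == 206 else "wb"
    with open(path, mode) as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)

On a retry this would only re-fetch the missing tail, and since response.url holds the final redirect target, the retry could also go straight to the CDN URL instead of back through the store endpoint.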
If you can't reproduce it locally, it seems pretty easy to reproduce from Travis. We can make a loop that runs in a xenial Docker container, and then you can add all the debugging you want in there.
The fact that we have two completely independent code bases (even written in different languages) suffering from the same kind of instability probably highlights that something is amiss on the server end of these calls.
I set up a simple test using @cprov's script and then got it running in a Docker container to be more true to the snapcraft test env, but so far it's working cleanly for me, see:
@elopio or @kyrofa, do you see any major differences? Next I guess I'll use a Python requests script instead of curl, but for this to be a problem on the CDN side I'd expect curl to be failing too.
I can reliably reproduce this now, but only with a number of constraints: Travis + Docker + Python requests + CDN URLs. Leave out one of those and it runs smoothly. I'm a bit perplexed but continuing to debug.
Quick update: I’ve ruled out various details (HTTP headers, SSL or not, URL signing or not).
I’ve also run the exact same tests on CircleCI successfully without any resets (Circle + docker + python + CDN urls). So this is still leading toward something odd with Travis + Docker, but I haven’t been able to pinpoint it beyond that yet.
"sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1" will work around this on Travis (just make sure to run it outside the Docker container), so you should be able to have reliable CI again.
We’re still investigating the precise interactions between requests, OpenSSL, Internap and Linux conntrack that expose the problem, but our experiments show that a windowing issue is causing conntrack to consider the packets invalid (which is why it was only showing up in Docker).
William has now managed to cleanly reproduce this and has done an analysis of the tcpdumps:
The Internap CDN doesn't seem to respect receive windows. The provided Python client easily bottlenecks, causing the receive window to eventually drop to zero, but Internap continues to transmit at full speed.

Linux conntrack, as used by the iptables MASQUERADE target, considers packets that lie entirely outside the receive window to be invalid, and the kernel rejects (not just drops) them; this behaviour can be avoided by setting net.netfilter.nf_conntrack_tcp_be_liberal=1. This doesn't show up on EC2 because EC2's firewall seems to drop the packets before they get to the instance, while GCE lets them through to be rejected by Linux.
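For anyone who wants to reproduce the condition described above, a deliberately slow consumer along these lines should be enough to drive the advertised receive window to zero (an illustrative sketch, not William's actual client; the URL is a placeholder for a large file on the affected CDN):

import time

import requests

# Placeholder for a large file served from the affected CDN.
URL = "https://<cdn-host>/download-snap/<snap-file>.snap"

response = requests.get(URL, stream=True)
response.raise_for_status()

# Read slowly in small chunks so the kernel receive buffer fills up, the
# advertised window shrinks to zero, and the sender keeps transmitting.
for chunk in response.iter_content(chunk_size=4096):
    time.sleep(0.1)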
So the summary is that Internap is doing a slightly bad thing to improve performance in the common case, but with the right set of circumstances on the client end this causes the resets. We recommend that you use the "nf_conntrack_tcp_be_liberal" workaround for CI jobs, and we will discuss the case with Internap.