101 connection resets

During the past month I’ve been adding robustness to the download logic we have for the core snap since this really affects our snapcraft CI a lot, today I added more logging to the retry logic which includes where we’ve been redirected to, this is what we generally see in a 5 retry loop with a sleep(1) in between

Starting new HTTPS connection (1): search.apps.ubuntu.com
"GET /api/v1/snaps/details/core?channel=stable&fields=status%2Canon_download_url%2Cdownload_url%2Cdownload_sha3_384%2Cdownload_sha512%2Csnap_id%2Crevision%2Crelease HTTP/1.1" 200 573
Downloading core
Starting new HTTPS connection (1): public.apps.ubuntu.com
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Starting new HTTPS connection (1): 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:16Z&h=c1cda6c141defcbf04819467b55117e2a6863cff HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:16Z&h=c1cda6c141defcbf04819467b55117e2a6863cff
Error while downloading: ChunkedEncodingError(ProtocolError("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')),). Retries left to download: 5.
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Resetting dropped connection: 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:24Z&h=bf88e0517fb1390ed5671c7b0ebf0612b18a9d3e HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:24Z&h=bf88e0517fb1390ed5671c7b0ebf0612b18a9d3e
Error while downloading: ChunkedEncodingError(ProtocolError("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')),). Retries left to download: 4.
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Resetting dropped connection: 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:26Z&h=cbd0753f009f7d9ca8a7a4f18b1fd6032d2f04ae HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:26Z&h=cbd0753f009f7d9ca8a7a4f18b1fd6032d2f04ae
Error while downloading: ChunkedEncodingError(ProtocolError("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')),). Retries left to download: 3.
"GET /anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap HTTP/1.1" 302 0
Resetting dropped connection: 068ed04f23.site.internapcdn.net
"GET /download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:28Z&h=81eb231649cba42085e9a5ab0ded3b756f0c9881 HTTP/1.1" 200 83349504
Redirections for 'https://public.apps.ubuntu.com/anon/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap': https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-25T22:36:28Z&h=81eb231649cba42085e9a5ab0ded3b756f0c9881

Which what we continously see in our travis runs: https://travis-ci.org/snapcore/snapcraft/jobs/235832611

Any ideas on how to move this forward?
I have a PR up here: https://github.com/snapcore/snapcraft/pull/1330

1 Like

Just thought i’d chime in. We’ve had a couple of users get this error and similar error in that last few days as well. The last one in question: https://demo.rocket.chat/channel/ubuntu-snap?msg=inHLxKzzi45GCLvJW

error: cannot install "rocketchat-server": Get
       https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=stable&fields=anon_download_url%2Carchitecture%2Cchannel%2Cdownload_sha3_384%2Csummary%2Cdescription%2Cdeltas%2Cbinary_filesize%2Cdownload_url%2Cepoch%2Cicon_url%2Clast_updated%2Cpackage_name%2Cprices%2Cpublisher%2Cratings_average%2Crevision%2Cscreenshot_urls%2Csnap_id%2Csupport_url%2Ccontact%2Ctitle%2Ccontent%2Cversion%2Corigin%2Cdeveloper_id%2Cprivate%2Cconfinement%2Cchannel_maps_list:
       net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting
       headers)
1 Like

Sergio - we definitely need to get to the bottom of this, but it will require a controlled repro where we can get a full tcpdump and gather more data on the network path, which CDN PoPs, etc. I’ll see if we can get someone to do some focused investigation soon.

Meanwhile, it looks like snapcraft doesn’t resume downloads but is starting fresh with each retry. Improving that would certainly help, especially for larger snaps. It also shouldn’t need to go back to the public.apps.u.c endpoint, but just retry the redirect URL.

If you can’t reproduce it locally, it seems pretty easy to reproduce from travis. We can make a loop that runs in a xenial docker container, and then you add in there all the debugging stuff you want.

I can reproduce locally, I just have to loop enough

The fact we have two completely independent code bases (even written in different languages) suffering from the same kind of instability probably highlights something is amiss on the server end of these calls.

1 Like

Right, we can instrument a travis run with something simple like:

$ for i in `seq 10`; do curl -w '%{http_code}: %{url_effective} %{size_download} %{time_total}s\n' -LsS $(curl -s -H 'X-Ubuntu-Series: 16' https://search.apps.ubuntu.com/api/v1/snaps/details/core | jq '.anon_download_url' -r) -o /dev/null; done
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:40:46Z&h=33be7f56f57720987f0f1ac59b88c0d3b7005041 83349504 46.374s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:41:35Z&h=f4ac903c274a05af5a617ebed779f8625513e03e 83349504 26.953s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:42:04Z&h=62ac923eb1a418651f6d48b17188edb420581788 83349504 29.058s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:42:36Z&h=fd2770e3e79812d6c58db8581aa67832936cf0e2 83349504 34.684s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:43:13Z&h=d68c248ec7d74f4aab9eb947b7abeed992aabc4a 83349504 37.834s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:43:53Z&h=ae132c4f7145fa3ca83f0539b8b6281e06d33ff2 83349504 37.624s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:44:34Z&h=66fb55f00bbd9270503328c52c3de968108f3860 83349504 47.076s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:45:23Z&h=1d5d7aa3f0b5f36b71d402edfed9e77229c3f6b5 83349504 28.049s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:45:54Z&h=5c0a6951c931422ec079312c6cdc7db987edcb21 83349504 30.958s
200: https://068ed04f23.site.internapcdn.net/download-snap/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_1689.snap?t=2017-05-26T17:46:27Z&h=94a93727582e91fbbd7f716033e45f519c17a163 83349504 21.608s

I will try to add a background tcpdump to obtain more information in case of failures.

@elopio, here is an extended version with a tcpdump running in background and capturing any TCP reset:

requires sudo, extra caution advised, it’s a debugging script, do not run this in production-like environments

#!/bin/bash

ATTEMPTS=${1:-10}
echo "Trying $ATTEMPTS core snap download(s) ..."

nohup sudo tcpdump -fnv -w resets.pcap dst port 443 and 'tcp[tcpflags] & (tcp-rst) != 0' \
      > tcpdump.out 2>&1 & echo $! > tcpdump.pid

for i in `seq $ATTEMPTS`;
    do curl -w '%{http_code}: %{url_effective} %{size_download} %{time_total}s\n' -LsS \
	   $(curl -s -H 'X-Ubuntu-Series: 16' https://search.apps.ubuntu.com/api/v1/snaps/details/core | jq '.anon_download_url' -r) -o /dev/null;
done

sudo kill `cat tcpdump.pid` 

Please provide resets.pcap if it’s not empty and/or stdout if you get anything else than '200’s.

2 Likes

Ok, we now resolve the redirect once and continue the download where it left off and still seeing issues: http://pastebin.ubuntu.com/24727287/

This is really brutal. We usually have to retry tests at least twice before we can land a PR.

Was it possible to use Celso’s script in Travis to get more concrete tcpdump information to diagnose this?

Sorry, I am stuck taking care of the release. And with Sergio on holidays we are all a little short on time.

@roadmr can somebody from your team take care of debugging the timeouts? You know more about TCP anyway :smiley:

I setup a simple test using @cprov’s script and then got it running in a docker container to be more true to the snapcraft test env, but it’s so far working cleanly for me, see:

@elopio or @kyrofa you see any major differences? next thing is I guess I’ll use a python requests script instead of curl, but for this to be a problem on the CDN side I’d expect curl to be failing too.

I can reliably reproduce this now, but only with a number of constraints: travis + docker + python requests + CDN urls. Leave out one of those and it runs smoothly. I’m a bit perplexed but continuing to debug.

2 Likes

Quick update: I’ve ruled out various details (HTTP headers, SSL or not, URL signing or not).

I’ve also run the exact same tests on CircleCI successfully without any resets (Circle + docker + python + CDN urls). So this is still leading toward something odd with Travis + Docker, but I haven’t been able to pinpoint it beyond that yet.

Seeing this in a PR.

“sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1” will work around this on Travis (just make sure to run it outside the Docker container), so you should be able to have reliable CI again.

We’re still investigating the precise interactions between requests, OpenSSL, Internap and Linux conntrack that expose the problem, but our experiments show that a windowing issue is causing conntrack to consider the packets invalid (which is why it was only showing up in Docker).

Maybe they are different issues, but @niemeyer mentioned above that they see it in snapd as well (I personally see it too when using snapd).

William has now managed to cleanly reproduce this and has done an analysis of the tcpdumps:

The Internap CDN doesn’t seem to respect receive windows. The provided Python client easily bottlenecks, causing the receive window to eventually drop to zero, but Internap continues to transmit at full speed.

Linux conntrack, as used by the iptables MASQUERADE target, considers packets that lie entirely outside the receive window to be invalid, and the kernel rejects (not just drops) them – this behaviour can be avoided by setting net.netfilter.nf_conntrack_tcp_be_liberal=1. This doesn’t show up on EC2 because EC2’s firewall seems to drop the packets before they get to the instance, while GCE lets them through to be rejected by Linux.

So the summary is that Internap is doing a slightly bad thing to improve performance in the common case but with the right set of circumstances on the client end this is causing the resets. We recommend that you use the “nf_conntrack_tcp_be_liberal” workaround for CI jobs, and we will discuss the case with Internap.

4 Likes

Thanks @wgrant for the thorough analysis!