101 connection resets

@elopio, here is an extended version with tcpdump running in the background and capturing any TCP resets:

It requires sudo, so extra caution is advised; it's a debugging script, do not run it in production-like environments.

#!/bin/bash

ATTEMPTS=${1:-10}
echo "Trying $ATTEMPTS core snap download(s) ..."

# Capture any TCP reset on port 443 in the background, writing to resets.pcap;
# save the PID so the capture can be stopped at the end.
nohup sudo tcpdump -fnv -w resets.pcap 'dst port 443 and tcp[tcpflags] & (tcp-rst) != 0' \
      > tcpdump.out 2>&1 & echo $! > tcpdump.pid

# Resolve the core snap's anonymous download URL from the store API and fetch it
# repeatedly, printing status code, effective URL, size and timing for each attempt.
for i in $(seq "$ATTEMPTS"); do
    curl -w '%{http_code}: %{url_effective} %{size_download} %{time_total}s\n' -LsS \
        "$(curl -s -H 'X-Ubuntu-Series: 16' https://search.apps.ubuntu.com/api/v1/snaps/details/core | jq -r '.anon_download_url')" \
        -o /dev/null
done

sudo kill "$(cat tcpdump.pid)"

Please provide resets.pcap if it's not empty, and/or the stdout if you get anything other than 200s.
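
To see whether the capture actually caught anything before you share it, you can read the file back with tcpdump:

# Read the capture back; no output means no resets were captured.
sudo tcpdump -nr resets.pcap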

Ok, we now resolve the redirect once and continue the download where it left off, and we are still seeing issues: http://pastebin.ubuntu.com/24727287/
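
For reference, a rough curl equivalent of that behaviour, just to illustrate (this is not snapd's actual code; URL handling and error checking are simplified, and the output file name is made up):

# Resolve the store's anonymous download URL, then follow its redirect once to
# get the CDN URL, and resume the download with curl's -C - until it completes.
URL=$(curl -s -H 'X-Ubuntu-Series: 16' \
    https://search.apps.ubuntu.com/api/v1/snaps/details/core | jq -r '.anon_download_url')
REAL_URL=$(curl -sI -o /dev/null -w '%{redirect_url}' "$URL")
until curl -C - -fsS -o core.snap "${REAL_URL:-$URL}"; do
    echo "download interrupted, resuming ..."
    sleep 1
done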

This is really brutal. We usually have to retry tests at least twice before we can land a PR.

Would it be possible to use Celso's script in Travis to get more concrete tcpdump information to diagnose this?

Sorry, I am stuck taking care of the release. And with Sergio on holidays we are all a little short on time.

@roadmr can somebody from your team take care of debugging the timeouts? You know more about TCP anyway :smiley:

I set up a simple test using @cprov's script and then got it running in a Docker container to be more true to the snapcraft test env, but so far it's working cleanly for me, see:

@elopio or @kyrofa, do you see any major differences? Next I guess I'll use a Python requests script instead of curl, but for this to be a problem on the CDN side I'd expect curl to be failing too.
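
Roughly, the container setup looks like this (a sketch only; the image tag, package list and script name are assumptions, not the exact setup I used):

# check-resets.sh is the script from above saved to a file (hypothetical name).
# NET_ADMIN is added so tcpdump can capture inside the container; sudo is
# installed so the script's sudo calls work even though we're already root.
docker run --rm -it --cap-add NET_ADMIN -v "$PWD:/work" -w /work ubuntu:16.04 \
    bash -c 'apt-get update -qq && apt-get install -y -qq curl jq tcpdump sudo && ./check-resets.sh 20'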

I can reliably reproduce this now, but only with a specific combination of constraints: Travis + Docker + Python requests + CDN URLs. Leave out any one of those and it runs smoothly. I'm a bit perplexed, but I'm continuing to debug.

Quick update: I’ve ruled out various details (HTTP headers, SSL or not, URL signing or not).

I’ve also run the exact same tests on CircleCI successfully without any resets (Circle + docker + python + CDN urls). So this is still leading toward something odd with Travis + Docker, but I haven’t been able to pinpoint it beyond that yet.

Seeing this in a PR.

“sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1” will work around this on Travis (just make sure to run it outside the Docker container), so you should be able to have reliable CI again.
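
For example (the sysctl needs root, and on Travis it can go in a before_install step so it runs on the host before any docker run):

# Run on the Travis host, outside the Docker container:
sudo sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
# Verify it took effect:
sysctl net.netfilter.nf_conntrack_tcp_be_liberal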

We’re still investigating the precise interactions between requests, OpenSSL, Internap and Linux conntrack that expose the problem, but our experiments show that a windowing issue is causing conntrack to consider the packets invalid (which is why it was only showing up in Docker).

Maybe they are different issues, but @niemeyer mentioned above that they see it in snapd as well (I personally see it too when using snapd).

William has now managed to cleanly reproduce this and has done an analysis of the tcpdumps:

The Internap CDN doesn't seem to respect receive windows. The provided Python client easily bottlenecks, causing the receive window to eventually drop to zero, but Internap continues to transmit at full speed.

Linux conntrack, as used by the iptables MASQUERADE target, considers packets that lie entirely outside the receive window to be invalid, and the kernel rejects (not just drops) them; this behaviour can be avoided by setting net.netfilter.nf_conntrack_tcp_be_liberal=1. This doesn't show up on EC2 because EC2's firewall seems to drop the packets before they get to the instance, while GCE lets them through to be rejected by Linux.

So the summary is that Internap is doing a slightly bad thing to improve performance in the common case, but with the right set of circumstances on the client end this causes the resets. We recommend using the “nf_conntrack_tcp_be_liberal” workaround for CI jobs, and we will discuss the case with Internap.
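
If you want to check whether an affected host is hitting this, the conntrack statistics expose an invalid-packet counter (a sketch, assuming the conntrack-tools package is installed; the counters and sysctl below are standard kernel/netfilter interfaces):

# Per-CPU conntrack statistics; the "invalid" counter grows when packets fall
# outside the tracked window.
sudo conntrack -S
# Current value of the workaround sysctl (0 means strict window checking):
sysctl net.netfilter.nf_conntrack_tcp_be_liberal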

Thanks @wgrant for the thorough analysis!

Internap support has stated this is due to a legacy kernel module they are in the process of phasing out; the target is the week of July 10 for the rest of their upgrades. The workaround above stands for now, and we will retest once we've been notified that the upgrade is complete.

Wow, great work debugging this. Thanks a lot @wgrant and @noise!

I applied the suggested workaround here:
https://github.com/snapcore/snapcraft/pull/1386

I’ll keep hitting the re-run button over the weekend, to give it a try with a few executions.

I got confirmation today that the Internap fix was rolled out globally. Calling this fixed!

@noise should I remove the workaround, or leave it just in case? What do you say?

It should not be necessary anymore, but won’t hurt to leave it.

I agree with Bret in that it’s not harmful (if anything, maybe slows things down a tiny bit). I’d consider removing it and if you happen to see the same problem again, please report it (regression!) and the workaround can be re-enabled, which would be a quick action on your side.

My rationale is to avoid accumulating cruft, particularly since at some point it'll be like "what does this do?" "oh, it doesn't do anything anymore, but it's always been there".

Git will take care of keeping the history for us :slight_smile:

I’m with Daniel, mostly because I’m the one maintaining the travis.yml file and after a few months it always gets crazy and we need to go back and clean it :slight_smile:

https://github.com/snapcore/snapcraft/pull/1410