I set up a simple test using @cprov’s script and then got it running in a docker container to be more true to the snapcraft test env, but so far it’s working cleanly for me, see:
@elopio or @kyrofa you see any major differences? next thing is I guess I’ll use a python requests script instead of curl, but for this to be a problem on the CDN side I’d expect curl to be failing too.
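For reference, a minimal sketch of what the requests-based replacement for curl might look like (assuming `requests` is installed; the local throwaway server is just a stand-in for the real CDN URLs, which would be passed to `fetch` instead):

```python
import http.server
import threading

import requests  # third-party; the client library under discussion


def fetch(url, chunk_size=8192):
    """Stream a URL the way the CI test would, returning total bytes read."""
    total = 0
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=chunk_size):
            total += len(chunk)
    return total


# Smoke-test against a throwaway local server; a real run would point at
# the CDN URLs from the test script instead.
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

n = fetch("http://127.0.0.1:%d/" % port)
server.shutdown()
assert n > 0  # the directory listing response is non-empty
```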
I can reliably reproduce this now, but only with a number of constraints: travis + docker + python requests + CDN urls. Leave out one of those and it runs smoothly. I’m a bit perplexed but continuing to debug.
Quick update: I’ve ruled out various details (HTTP headers, SSL or not, URL signing or not).
I’ve also run the exact same tests on CircleCI successfully without any resets (Circle + docker + python + CDN urls). So this is still leading toward something odd with Travis + Docker, but I haven’t been able to pinpoint it beyond that yet.
`sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1` will work around this on Travis (just make sure to run it outside the Docker container), so you should be able to have reliable CI again.
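For example, a sketch of how this might look in a `.travis.yml` (assuming a sudo-enabled environment; everything here besides the sysctl line is a placeholder):

```yaml
sudo: required
services:
  - docker

before_install:
  # Work around conntrack marking out-of-window CDN packets invalid.
  # This must run on the host, outside the Docker container.
  - sudo sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1
```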
We’re still investigating the precise interactions between requests, OpenSSL, Internap and Linux conntrack that expose the problem, but our experiments show that a windowing issue is causing conntrack to consider the packets invalid (which is why it was only showing up in Docker).
William has now managed to cleanly reproduce this and has done an analysis of the tcpdumps:
The Internap CDN doesn’t seem to respect receive windows. The provided Python client easily bottlenecks, causing the receive window to eventually drop to zero, but Internap continues to transmit at full speed.

Linux conntrack, as used by the iptables MASQUERADE target, considers packets that lie entirely outside the receive window to be invalid, and the kernel rejects (not just drops) them; this behaviour can be avoided by setting net.netfilter.nf_conntrack_tcp_be_liberal=1. This doesn’t show up on EC2 because EC2’s firewall seems to drop the packets before they get to the instance, while GCE lets them through to be rejected by Linux.
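To make the first half of that analysis concrete, here is a minimal local sketch (buffer sizes and payload are arbitrary, and this is an analogue, not the actual CDN traffic): a slow client with a tiny receive buffer lets its advertised window close, and a well-behaved sender's `sendall` blocks until the window reopens — which is exactly the back-pressure Internap appears to ignore.

```python
import socket
import threading
import time

PAYLOAD = b"x" * (1 << 20)  # 1 MiB of data to stream


def run_server(srv):
    conn, _ = srv.accept()
    # A well-behaved sender: sendall blocks whenever the client's
    # receive window closes, rather than transmitting at full speed.
    conn.sendall(PAYLOAD)
    conn.close()


srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
t = threading.Thread(target=run_server, args=(srv,))
t.start()

cli = socket.socket()
# Shrink the client's receive buffer (before connect) so the window
# fills quickly.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
cli.connect(("127.0.0.1", port))

received = 0
while True:
    chunk = cli.recv(4096)
    if not chunk:
        break
    received += len(chunk)
    time.sleep(0.001)  # a deliberately slow consumer, like the bottlenecked test

cli.close()
t.join()
srv.close()
assert received == len(PAYLOAD)  # everything still arrives, just slowly
```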
So the summary is that Internap is doing a slightly bad thing to improve performance in the common case, but with the right set of circumstances on the client end this causes the resets. We recommend that you use the `nf_conntrack_tcp_be_liberal` workaround for CI jobs, and we will discuss the case with Internap.
Internap support has stated this is due to some legacy kernel module they are in the process of phasing out. The target is the week of July 10 for the rest of their upgrades. The workaround above stands for now; we will retest once we’ve been notified that the upgrade is complete.
I agree with Bret in that it’s not harmful (if anything, maybe slows things down a tiny bit). I’d consider removing it and if you happen to see the same problem again, please report it (regression!) and the workaround can be re-enabled, which would be a quick action on your side.
My rationale is to avoid accumulating cruft; particularly since at this point it’ll be like “what does this do?” “oh, it doesn’t do anything anymore but it’s always been there”.
I’m with Daniel, mostly because I’m the one maintaining the .travis.yml file, and after a few months it always gets crazy and we need to go back and clean it up.