Snap download failures with PROTOCOL_ERROR


#1

The problem is visible in spread runs on Travis. The problem happens when downloading snaps from the store.
The sequence of events is as follows:

  • snapd sends GET HTTP/1.1 request to the store
  • store responds with HTTP/1.1 302 FOUND and passes Location: <fastly CDN>
  • snapd attempts GET HTTP/1.1 to the provided url
  • CDN responds with 200 OK HTTP/2.0 (protocol got upgraded?)
  • Golang’s net/http raises http.http2StreamError{StreamID:0x1, Code:0x1, Cause:error(nil)}

It’s hard to catch the requests as the CDN URL carries a token that seems to expire after some time.

Example log:

Aug 07 06:03:20 aug070543-458654 snapd[2466]: store.go:1382: DEBUG: Starting download of "/var/lib/snapd/snaps/test-snapd-go-webserver_13.snap.partial".
Aug 07 06:03:20 aug070543-458654 snapd[2466]: retry.go:49: DEBUG: Retrying https://api.snapcraft.io/api/v1/snaps/download/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap, attempt 1, elapsed time=12.903µs
Aug 07 06:03:20 aug070543-458654 snapd[2466]: logger.go:67: DEBUG: > "GET /api/v1/snaps/download/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap HTTP/1.1\r\nHost: api.snapcraft.io\r\nUser-Agent: snapd/1337.2.40 (series 16; classic; testing) ubuntu/16.04 (i386) linux/4.4.0-157-generic\r\nAccept: \r\nSnap-Cdn: cloud-name=\"gce\" region=\"us-east1\" availability-zone=\"us-east1-b\"\r\nX-Device-Authorization: Macaroon root=\"MDAxZWxvY2F0aW9uIGFwaS5zbmFwY3JhZnQuaW8KMDAxZWlkZW50aWZpZXIgZGV2aWNlLXNlc3Npb24KMDA0MGNpZCBhcGkuc25hcGNyYWZ0LmlvfHZhbGlkX3NpbmNlfDIwMTktMDgtMDdUMDU6NDk6NDcuNTkwOTAyCjAwNDZjaWQgYXBpLnNuYXBjcmFmdC5pb3xzZXNzaW9ufDRjYWYxZGVjLTY0YzYtNDUwMi05MGUwLTM4MmM5NTczYjY1YwowMDliY2lkIGFwaS5zbmFwY3JhZnQuaW98ZGV2aWNlfHsibW9kZWwiOiAiZ2VuZXJpYy1jbGFzc2ljIiwgInNlcmlhbCI6ICI1OGJjMWY0Yi0zZTQ3LTQ4M2QtYjJjYi1kYTczY2E0OWExNTkiLCAiYnJhbmQiOiAiZ2VuZXJpYyIsICJhdXRob3JpdHkiOiAiZ2VuZXJpYyJ9CjAwMjBjaWQgYXBpLnNuYXBjcmFmdC5pb3xzdG9yZXwKMDAyZnNpZ25hdHVyZSDgX_uK2dJieqLCL5WabHQvt9i1ECG5GUu9D5I7ffvYdAo\"\r\nX-Ubuntu-Architecture: i386\r\nX-Ubuntu-Classic: true\r\nX-Ubuntu-Series: 16\r\nX-Ubuntu-Wire-Protocol: 1\r\nAccept-Encoding: gzip\r\n\r\n"
Aug 07 06:03:20 aug070543-458654 snapd[2466]: logger.go:74: DEBUG: < "HTTP/1.1 302 FOUND\r\nContent-Length: 513\r\nContent-Type: text/html; charset=utf-8\r\nDate: Wed, 07 Aug 2019 06:03:20 GMT\r\nLocation: https://fastly.cdn.snapcraft.io/download-origin/fastly/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap?token=1565172000_65c986ad70720045b74b36859cd98ad98d39f92b\r\nServer: gunicorn/19.7.1\r\nSnap-Store-Version: 15\r\nX-Request-Id: 23E344FEAE2E0A325C8501BB5D4A69A83DF153A6\r\nX-Vcs-Revision: c0e0e44\r\nX-View-Name: snapdevicegw.webapi_download.snap_download\r\n\r\n"
Aug 07 06:03:20 aug070543-458654 snapd[2466]: logger.go:67: DEBUG: > "GET /download-origin/fastly/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap?token=1565172000_65c986ad70720045b74b36859cd98ad98d39f92b HTTP/1.1\r\nHost: fastly.cdn.snapcraft.io\r\nUser-Agent: snapd/1337.2.40 (series 16; classic; testing) ubuntu/16.04 (i386) linux/4.4.0-157-generic\r\nAccept: \r\nReferer: https://api.snapcraft.io/api/v1/snaps/download/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap\r\nSnap-Cdn: cloud-name=\"gce\" region=\"us-east1\" availability-zone=\"us-east1-b\"\r\nX-Device-Authorization: Macaroon root=\"MDAxZWxvY2F0aW9uIGFwaS5zbmFwY3JhZnQuaW8KMDAxZWlkZW50aWZpZXIgZGV2aWNlLXNlc3Npb24KMDA0MGNpZCBhcGkuc25hcGNyYWZ0LmlvfHZhbGlkX3NpbmNlfDIwMTktMDgtMDdUMDU6NDk6NDcuNTkwOTAyCjAwNDZjaWQgYXBpLnNuYXBjcmFmdC5pb3xzZXNzaW9ufDRjYWYxZGVjLTY0YzYtNDUwMi05MGUwLTM4MmM5NTczYjY1YwowMDliY2lkIGFwaS5zbmFwY3JhZnQuaW98ZGV2aWNlfHsibW9kZWwiOiAiZ2VuZXJpYy1jbGFzc2ljIiwgInNlcmlhbCI6ICI1OGJjMWY0Yi0zZTQ3LTQ4M2QtYjJjYi1kYTczY2E0OWExNTkiLCAiYnJhbmQiOiAiZ2VuZXJpYyIsICJhdXRob3JpdHkiOiAiZ2VuZXJpYyJ9CjAwMjBjaWQgYXBpLnNuYXBjcmFmdC5pb3xzdG9yZXwKMDAyZnNpZ25hdHVyZSDgX_uK2dJieqLCL5WabHQvt9i1ECG5GUu9D5I7ffvYdAo\"\r\nX-Ubuntu-Architecture: i386\r\nX-Ubuntu-Classic: true\r\nX-Ubuntu-Series: 16\r\nX-Ubuntu-Wire-Protocol: 1\r\nAccept-Encoding: gzip\r\n\r\n"
Aug 07 06:03:21 aug070543-458654 snapd[2466]: logger.go:74: DEBUG: < "HTTP/2.0 200 OK\r\nContent-Length: 1744896\r\nAccept-Ranges: bytes\r\nAge: 0\r\nCache-Control: max-age=86400\r\nContent-Disposition: attachment; filename=DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap\r\nContent-Type: application/octet-stream\r\nDate: Wed, 07 Aug 2019 06:03:21 GMT\r\nEtag: 59ddab9f-ae17-4d7c-9fed-003b18b631a4\r\nLast-Modified: Wed, 07 Aug 2019 04:44:48 GMT\r\nServer: TwistedWeb/14.0.2\r\nStrict-Transport-Security: max-age=2592000\r\nVary: Accept-Encoding\r\nVia: 1.1 varnish\r\nVia: 1.1 varnish\r\nX-Bzr-Revision-Number: 7464\r\nX-Cache: MISS, MISS\r\nX-Cache-Hits: 0, 0\r\nX-Request-Id: XUpXQH8AAQEAAClijP8AAABO1\r\nX-Served-By: cache-lcy19245-LCY, cache-bwi5037-BWI\r\nX-Timer: S1565157801.925351,VS0,VE115\r\n\r\n"
Aug 07 06:03:21 aug070543-458654 snapd[2466]: retry.go:136: DEBUG: Not retrying: http.http2StreamError{StreamID:0x1, Code:0x1, Cause:error(nil)}
Aug 07 06:03:21 aug070543-458654 snapd[2466]: store.go:1398: DEBUG: download of "https://api.snapcraft.io/api/v1/snaps/download/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap" failed: http.http2StreamError{StreamID:0x1, Code:0x1, Cause:error(nil)}
Aug 07 06:03:21 aug070543-458654 snapd[2466]: task.go:337: DEBUG: 2019-08-07T06:03:21Z ERROR stream error: stream ID 1; PROTOCOL_ERROR

Since the job failed recently, I tried using curl with to retry the request, but curl did not raise anything suspicious. Full log for those interested:

maciek@galeon:~ curl -D - --compressed --output /dev/null -v -H 'Referrer: https://api.snapcraft.io/api/v1/snaps/download/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap' 'https://fastly.cdn.snapcraft.io/download-origin/
fastly/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap?token=1565172000_65c986ad70720045b74b36859cd98ad98d39f92b'  
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 151.101.14.217:443...
* TCP_NODELAY set
* Connected to fastly.cdn.snapcraft.io (151.101.14.217) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [106 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [5224 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [300 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [37 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Fastly, Inc.; CN=t2.shared.global.fastly.net
*  start date: Aug  6 22:43:28 2019 GMT
*  expire date: Feb 16 16:37:16 2020 GMT
*  subjectAltName: host "fastly.cdn.snapcraft.io" matched cert's "fastly.cdn.snapcraft.io"
*  issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign CloudSSL CA - SHA256 - G3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
} [5 bytes data]
* Using Stream ID: 1 (easy handle 0x5607fbc70f30)
} [5 bytes data]
> GET /download-origin/fastly/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap?token=1565172000_65c986ad70720045b74b36859cd98ad98d39f92b HTTP/2
> Host: fastly.cdn.snapcraft.io
> User-Agent: curl/7.65.3
> Accept: */*
> Accept-Encoding: deflate, gzip
> Referrer: https://api.snapcraft.io/api/v1/snaps/download/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap
> 
{ [5 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
} [5 bytes data]
< HTTP/2 200 
HTTP/2 200 
< server: TwistedWeb/14.0.2
server: TwistedWeb/14.0.2
< content-disposition: attachment; filename=DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap
content-disposition: attachment; filename=DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap
< last-modified: Wed, 07 Aug 2019 06:41:26 GMT
last-modified: Wed, 07 Aug 2019 06:41:26 GMT
< etag: 59ddab9f-ae17-4d7c-9fed-003b18b631a4
etag: 59ddab9f-ae17-4d7c-9fed-003b18b631a4
< x-bzr-revision-number: 7464
x-bzr-revision-number: 7464
< cache-control: max-age=86400
cache-control: max-age=86400
< content-type: application/octet-stream
content-type: application/octet-stream
< strict-transport-security: max-age=2592000
strict-transport-security: max-age=2592000
< x-request-id: XUpyln8AAQEAACJluvQAAABz1
x-request-id: XUpyln8AAQEAACJluvQAAABz1
< via: 1.1 varnish
via: 1.1 varnish
< accept-ranges: bytes
accept-ranges: bytes
< date: Wed, 07 Aug 2019 07:13:44 GMT
date: Wed, 07 Aug 2019 07:13:44 GMT
< via: 1.1 varnish
via: 1.1 varnish
< age: 0
age: 0
< x-served-by: cache-lcy19240-LCY, cache-fra19128-FRA
x-served-by: cache-lcy19240-LCY, cache-fra19128-FRA
< x-cache: HIT, MISS
x-cache: HIT, MISS
< x-cache-hits: 1, 0
x-cache-hits: 1, 0
< x-timer: S1565162024.910748,VS0,VE181
x-timer: S1565162024.910748,VS0,VE181
< vary: Accept-Encoding
vary: Accept-Encoding
< content-length: 1744896
content-length: 1744896

< 
{ [5 bytes data]
100 1704k  100 1704k    0     0   748k      0  0:00:02  0:00:02 --:--:--  748k
* Connection #0 to host fastly.cdn.snapcraft.io left intact

One thing that makes me wonder is that curl negotiates h2 (HTTP/2.0) during TLS ALPN. Then curl sends GET like this:

GET /download-origin/fastly/DVvhXhpa9oJjcm0rnxfxftH1oo5vTW1M_13.snap?token=1565172000_65c986ad70720045b74b36859cd98ad98d39f92b HTTP/2

While when the request is done from snapd, it looks like the initial GET is with HTTP/1.1, but the response comes with HTTP/2. This would suggest that h2 was not negotiated by ALPN, but rather Golang’s net/http client passed an Upgrade: header in the request. Unfortunately, this is not included in the debug log from snapd.
I wasn’t able to convince curl to not expose h2 to ALPN but rather include an upgrade header in the request.

cc @cachio @chipaca @pedronis

FYI, the logs are from the last Travis run on master: https://travis-ci.org/snapcore/snapd/jobs/568682345


#2

@mborzecki, Thanks for this topic. I would like to add a conversation abou tthis problem which could help also to understand it:

ago 02 17:12:15 verterok - cachio: from that log, looks like fastly is sending the http/2 response
ago 02 17:12:43 cachio - verterok, yes
ago 02 17:12:47 cachio - after a redirect
ago 02 17:14:40 verterok - cachio: the redirect is common, api.snapcraft.io replies a 302 with the CDN URL, in this case fastly
ago 02 17:15:08 verterok - s/is common/is normal/
ago 02 17:22:02 verterok - cachio: so, it’s ok to switch to http/2 for the download. here being fastly the one doing HTTP/2
ago 02 17:22:22 verterok - cachio: I see a MISS in the cache, so it’s fastly streaming the snap from our updown service
ago 02 17:23:15 verterok - something is failing there that causes, the protocol error…no idea what, as the log is not clear about what happenend in the protocol :slight_smile:
ago 02 17:24:21 cachio - verterok, I am comparing the responses which make that fail
ago 02 17:24:26 cachio - with the others
ago 02 17:24:33 cachio - to see the diff
ago 02 17:24:40 cachio - both are 2.0
ago 02 17:27:33 cachio - verterok, the only diff I see is -> X-Cache: HIT, MISS
ago 02 17:27:46 cachio - vs -> X-Cache: MISS, MISS
ago 02 17:28:03 verterok - cachio: yeah, a MISS, MISS means it will stream the snap from our servers
ago 02 17:28:16 verterok - all is updown fault
ago 02 17:28:57 cachio - verterok, that could cause the protocol error?
ago 02 17:29:49 cachio - the error is reproduced when X-Cache: HIT, MISS
ago 02 17:29:49 verterok - cachio: no idea what causes a protocol error, but the diff there is that one is served directly by the CDN and the other streamed from our servers: updown -> fastly -> client
ago 02 17:29:57 verterok - cachio: oh
ago 02 17:30:04 verterok - I thought it was in the MISS, MISS case
ago 02 17:30:11 cachio - no
ago 02 17:30:30 verterok - then it’s faslty serving the content
ago 02 17:30:35 verterok - *fastly
ago 02 17:30:45 cachio - verterok, https://paste.ubuntu.com/p/2xdRRbsXFD/

Aug 02 18:37:38 aug021813-410046 snapd[26575]: logger.go:74: DEBUG: < "HTTP/2.0 200 OK
Content-Length: 54185984
Accept-Ranges: bytes
Accept-Ranges: bytes
Age: 0
Cache-Control: max-age=86400
Content-Disposition: attachment; filename=CSO04Jhav2yK0uz97cr0ipQRyqg0qQL6_1073.snap
Content-Type: application/octet-stream
Date: Fri, 02 Aug 2019 18:37:38 GMT
Etag: b275918f-1889-41e5-9ccc-019c242fdc22
Last-Modified: Fri, 02 Aug 2019 18:13:06 GMT
Server: TwistedWeb/14.0.2
Strict-Transport-Security: max-age=2592000
Vary: Accept-Encoding
Via: 1.1 varnish
Via: 1.1 varnish
X-Bzr-Revision-Number: 7464
X-Cache: HIT, MISS
X-Cache-Hits: 0, 0
X-Request-Id: XUR9Mn8AAQEAAHqP0JMAAAA41
X-Served-By: cache-lcy19233-LCY, cache-bwi5039-BWI
X-Timer: S1564771058.332371,VS0,VE111"    

ago 02 17:30:52 cachio - this is the response
ago 02 17:31:45 cachio - after this I see http.http2StreamError
ago 02 17:32:03 cachio -it is like the the sream from fastly is not correct?
ago 02 17:33:27 verterok -yeah, looks like that
ago 02 17:34:08 cachio - verterok, is it possible to validate in the cdn logs?
ago 02 17:34:09 verterok - cachio: what’s the MISS MISS output?
ago 02 17:34:29 cachio - verterok, https://paste.ubuntu.com/p/KFN2rFZ2fX/

Aug 02 20:04:41 aug021957-423320 snapd[27762]: logger.go:74: DEBUG: < "HTTP/2.0 200 OK
Content-Length: 93229056
Accept-Ranges: bytes
Accept-Ranges: bytes
Age: 0
Cache-Control: max-age=86400
Content-Disposition: attachment; filename=99T7MUlRhtI3U0QFgl5mXXESAiSwt776_7529.snap
Content-Type: application/octet-stream
Date: Fri, 02 Aug 2019 20:04:41 GMT
Etag: 797323e3-df0e-41cd-bf00-c6b72f96785b
Last-Modified: Fri, 02 Aug 2019 20:04:41 GMT
Server: TwistedWeb/14.0.2
Strict-Transport-Security: max-age=2592000
Vary: Accept-Encoding
Via: 1.1 varnish
Via: 1.1 varnish
X-Bzr-Revision-Number: 7464
X-Cache: MISS, MISS
X-Cache-Hits: 0, 0
X-Request-Id: XUSXWX8AAQEAACxuVtYAAABv1
X-Served-By: cache-lcy19244-LCY, cache-dca17722-DCA
X-Timer: S1564776282.811358,VS0,VE180"

ago 02 17:34:31 verterok - cachio: we only have log for errors in the CDN (5xx codes)|
ago 02 17:35:51 verterok - cachio: https://docs.fastly.com/guides/performance-tuning/understanding-cache-hit-and-miss-headers-with-shielded-services#the-x-cache-header
ago 02 17:36:42 verterok - HIT, MISS -> “The header returned a MISS on the edge and a HIT on the shield. This shows that the edge datacenter did not have the object, but when the request was forwarded to the shield datacenter, the object was found.”


#3

This is a second conversation about that:

ago 05 14:30:54 verterok - cachio: hi, no news.
ago 05 14:31:07 verterok - cachio: any chance you can run those tests with http2debug=1 or http2debug=2?
ago 05 14:31:39 cachio - verterok, In our code I see that it is failing when it tries to download but we just print the error which comes
ago 05 14:31:51 verterok - cachio: also found: https://github.com/golang/go/issues/16847
ago 05 14:32:21 verterok - cachio: http2debug should make it super verbose, like the output linked in the above bug
ago 05 14:32:54 cachio - verterok, sure, I can create a branch with that change and force to run in travis
ago 05 14:33:29 verterok - cachio: if you are going to ask fastly support, better to provide some trace data too
ago 05 14:33:53 cachio - verterok, sure
ago 05 14:33:57 verterok - noise, roadmr-lunch: ^ do we have a way to raise issues to fastly?
ago 05 14:35:20 noise - verterok: should be able to submit support tickets
ago 05 14:35:28 verterok - k, cool
ago 05 14:37:11 verterok - looks like it’s also happening in the wild, not only in the tests: https://www.google.com/search?q=“snap+install”+“stream+error%3A+stream+ID+1%3B+PROTOCOL_ERROR”


#4

Hi,
Filed a support ticket to Fastly, including the logs we have from the tests. The reply is they can’t reproduce and requested more logs.
Would be great if we can provide a minimal reproducer script so they can debug this on their side.


#5

Quoting comments @mvo left under https://github.com/snapcore/snapd/pull/7227

The log has the protocol error for frame 0xc4204fbce0, I grepped for all logs with this frame and got the following:

$ grep  0xc4204fbce0 log.txt 
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=869
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: read SETTINGS len=6, settings: MAX_CONCURRENT_STREAMS=100
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: wrote SETTINGS flags=ACK len=0
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: read WINDOW_UPDATE len=4 (conn) incr=16711681
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: read SETTINGS flags=ACK len=0
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: read HEADERS flags=END_HEADERS stream=1 len=378
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: read RST_STREAM stream=1 len=4 ErrCode=PROTOCOL_ERROR
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Framer 0xc4204fbce0: wrote RST_STREAM stream=1 len=4 ErrCode=CANCEL

the last few lines look a lot like https://github.com/golang/go/issues/29125 - this bug is stale upstream and it seems the workaround might be to “just” retry but this is of course not great, would be nice to know why we get the RST_STREAM (which seems to come from the store end AFAICT?).

Looking a bit more it seems like we get the PROCOTOL_ERROR from the remote side which appears to be fastly in this case:

Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Transport failed to get client conn for fastly.cdn.snapcraft.io:443: http2: no cached connection was available
Aug 09 20:03:16 aug091949-909109 snapd[7255]: 2019/08/09 20:03:16 http2: Transport creating client conn 0xc4200baa80 to 151.101.250.217:443
...

#6

When googling for this error it seems like this is also affecting our users, not only our tests. I think short term we should simply retry on this error to unblock the tests and our users.

Once that is done we should probably try to write a reproducer. I played a bit and just wrote a trivial http go client that fetches the failing snap from fastly but ~1000 times did not yield the error on my machine/network. So it might be something that is easier to trigger in the test infrastructure. Fwiw, my very naive test was:

package main

import (
	"io/ioutil"
	"net/http"
)

func main() {
	resp, err := http.Get("https://api.snapcraft.io/api/v1/snaps/download/QJMh93s02nAXHvBJlHqR2K9Qg0eJys2u_2.snap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	b, err = ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
        println(len(b))
}

and I checked via GODEBUG=http2debug=2 that it uses http2 and goes to fastly. But maybe running this (or something similar) in our CI yield something different?


#7

This is what I researched on friday:

cachio -> verterok, donna try with GODEBUG=http2debug=2 now
verterok -> thanks
verterok -> I see a duplicated “Accept-Ranges: bytes\r\nAccept-Ranges: bytes” in the failed case…but not sure that is the cause
verterok -> is the only difference I was able to spot
roadmr -> verterok: duplicated in the request or the response?
verterok -> roadmr: response
verterok -> according to the HTTP/1.1 RFC, it’s wrong
roadmr -> verterok: :zap:
roadmr -> ah, but this is HTTP/2.0, right?
roadmr -> verterok: it does look fishy
cachio -> verterok, error using debug=2 https://paste.ubuntu.com/p/PRv4DdH38p/
verterok -> that was fast!
cachio -> the error is at the end
cachio -> verterok, it failed on the first attempt :slight_smile:
cachio -> verterok, some details about the error https://http2.github.io/http2-spec/#RST_STREAM
cachio -> RST_STREAM frames MUST NOT be sent for a stream in the “idle” state. If a RST_STREAM frame identifying an idle stream is received, the recipient MUST treat this as a connection error (Section 5.4.1) of type PROTOCOL_ERROR.
cachio -> RST_STREAM frames MUST be associated with a stream. If a RST_STREAM frame is received with a stream identifier of 0x0, the recipient MUST treat this as a connection error (Section 5.4.1) of type PROTOCOL_ERROR.
roadmr -> so fastly is sending RST_STREAM?
cachio -> roadmr, yes
verterok -> will follow the ticket with fastly and will try to get the server logs they want


#8

I created a test http2 server based on nghttp to inject arbitrary RST frames into the stream to see if we can make the error handling better. It seems like the http2 error handling on the go side is not very nice, it looks like its all a net.Error with strings. I think its worthwhile to report a bug to go and ask for something that is a bit more structured.


#9

We’ve started seeing the problem once again, the logs are here: https://paste.ubuntu.com/p/bFgCsXWvzX/