Run-checks failure in lxd container (new in 2.30)

jdstrand · December 5, 2017, 8:25pm

In master and release/2.30, but not in release/2.29, when I run ./run-checks in a xenial lxd container on an artful host, I see the following failure:

=== RUN   TestDeviceManager

----------------------------------------------------------------------
FAIL: devicestate_test.go:766: deviceMgrSuite.TestDoRequestSerialErrorsOnNoHost

devicestate_test.go:808:
    c.Check(chg.Status(), Equals, state.ErrorStatus)
... obtained state.Status = 3 ("Doing")
... expected state.Status = 9 ("Error")

OOPS: 45 passed, 1 skipped, 1 FAILED
--- FAIL: TestDeviceManager (3.49s)
FAIL
coverage: 84.6% of statements
exit status 1
FAIL	github.com/snapcore/snapd/overlord/devicestate	3.575s

Crushing failure and despair.

pedronis · December 5, 2017, 9:09pm

What does host nowhere.invalid give you inside that container? Do you have some kind of HTTP proxying set up?

jdstrand · December 5, 2017, 9:32pm

Host isn’t found for nowhere.invalid in or outside of the container:

$ host nowhere.invalid
Host nowhere.invalid not found: 2(SERVFAIL)

I did not configure an http proxy (http_proxy and https_proxy are unset). From inside the container (outside is the same):

$ printenv|grep -i proxy
$

LXD apparently can set up an http/https proxy, but it isn’t configured that way here:

$ lxc config get core.proxy_http
$ lxc config get core.proxy_https
$

pedronis · December 5, 2017, 10:19pm

I naively would have expected an NXDOMAIN error. I wonder if that’s the issue, and where the SERVFAIL comes from.

jdstrand · December 5, 2017, 11:16pm

It is a 17.10 host with systemd-resolved. The host returns the same thing:

$ host nowhere.invalid
Host nowhere.invalid not found: 2(SERVFAIL)

The container is 16.04 and has the typical setup: its /etc/resolv.conf points to the gateway’s dnsmasq running on the host, which uses the host’s /etc/resolv.conf, which in turn points at the host’s systemd-resolved. I.e., there should be nothing out of the ordinary for a 17.10 host running 16.04 containers.

mvo · December 6, 2017, 8:06am

I can reproduce the failure on my artful system. It looks like the combination of dnsmasq and systemd-resolved is the culprit. systemd-resolved has special code to handle the .invalid TLD in order to comply with RFC 6761. It returns no-servers in this case, which gets translated to SERVFAIL, which Go considers a temporary error. So our code retries instead of failing. A simple workaround is pushed to https://github.com/snapcore/snapd/pull/4361

pedronis · December 6, 2017, 8:56am

That’s a bit of an interesting interpretation of the RFC; the section about .invalid mostly mentions NXDOMAIN.