Run-checks failure in lxd container (new in 2.30)

In master and release/2.30, but not in release/2.29, when I run ./run-checks in a xenial lxd container on an artful host, I see the following failure:

=== RUN   TestDeviceManager

----------------------------------------------------------------------
FAIL: devicestate_test.go:766: deviceMgrSuite.TestDoRequestSerialErrorsOnNoHost

devicestate_test.go:808:
    c.Check(chg.Status(), Equals, state.ErrorStatus)
... obtained state.Status = 3 ("Doing")
... expected state.Status = 9 ("Error")

OOPS: 45 passed, 1 skipped, 1 FAILED
--- FAIL: TestDeviceManager (3.49s)
FAIL
coverage: 84.6% of statements
exit status 1
FAIL	github.com/snapcore/snapd/overlord/devicestate	3.575s

Crushing failure and despair.

What does host nowhere.invalid give you inside that container? Do you have some kind of http proxying set up?

Host isn’t found for nowhere.invalid in or outside of the container:

$ host nowhere.invalid
Host nowhere.invalid not found: 2(SERVFAIL)

I did not configure an http proxy (http_proxy and https_proxy are unset). From inside the container (outside is the same):

$ printenv|grep -i proxy
$

LXD apparently can set up an http/https proxy, but it isn’t configured that way here:

$ lxc config get core.proxy_http
$ lxc config get core.proxy_https
$

I would naively have expected an NXDOMAIN error. I wonder if the SERVFAIL is the issue, and where it comes from.

It is a 17.10 host with systemd-resolved. The host returns the same thing:

$ host nowhere.invalid
Host nowhere.invalid not found: 2(SERVFAIL)

The container is 16.04 and it has the typical setup of having /etc/resolv.conf point to the gateway’s dnsmasq that is running on the host, which uses the host’s /etc/resolv.conf, which points at the host’s systemd-resolved. I.e., there should be nothing out of the ordinary for a 17.10 host running 16.04 containers.

I can reproduce the failure on my artful system. It looks like the combination of dnsmasq and systemd-resolved is the culprit. systemd-resolved has special code to handle the .invalid TLD to comply with RFC 6761. It returns no-servers in this case, which gets translated to SERVFAIL, which Go considers a temporary error. So our code retries instead of failing. A simple workaround is pushed to https://github.com/snapcore/snapd/pull/4361
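
To illustrate why SERVFAIL leads to a retry while NXDOMAIN would not, here is a minimal sketch. It is not the snapd code: shouldRetry, servfailErr and nxdomainErr are hypothetical helpers, and the error values are built by hand (with illustrative message strings) to approximate what Go's resolver hands back, so nothing here depends on the local DNS setup. It assumes Go 1.13 or later for errors.As and the IsNotFound field.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// shouldRetry is a hypothetical helper (not taken from snapd): it reports
// whether an error looks transient according to net.DNSError.Temporary().
func shouldRetry(err error) bool {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		// Temporary() is true when IsTemporary or IsTimeout is set.
		return dnsErr.Temporary()
	}
	return false
}

// servfailErr builds an error shaped like a SERVFAIL result: the resolver
// could not get an answer, so the failure is flagged as temporary.
func servfailErr() error {
	return &net.DNSError{
		Err:         "server misbehaving", // illustrative message
		Name:        "nowhere.invalid",
		IsTemporary: true,
	}
}

// nxdomainErr builds an error shaped like an NXDOMAIN result: the name
// definitively does not exist, so retrying cannot help.
func nxdomainErr() error {
	return &net.DNSError{
		Err:        "no such host", // illustrative message
		Name:       "nowhere.invalid",
		IsNotFound: true,
	}
}

func main() {
	fmt.Println("SERVFAIL retry:", shouldRetry(servfailErr()))
	fmt.Println("NXDOMAIN retry:", shouldRetry(nxdomainErr()))
}
```

With a retry strategy keyed on this kind of check, a test that expects an immediate error for nowhere.invalid would stay in "Doing" whenever the resolver answers SERVFAIL, which matches the failure above.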

That’s a bit of an interesting interpretation of the RFC; the section about .invalid mostly mentions NXDOMAIN.