Unit test failure in master ~ July 2019

We’re seeing a number of unit test failures in master on travis. In all cases travis doesn’t show enough information (test dies after 10 minutes). @mborzecki managed to get one panic with traceback on his system. The log is available in https://paste.ubuntu.com/p/yhGfmdRvj7/

I tried to reproduce the issue and proceeded to run GOMAXPROCS=2 GOTRACEBACK=1 go test -count=100 in cmd/snap. On my system the failure was different as I exhausted the number of open file descriptors (1024 on my system).

I looked around this area and I can only see the rename patches from https://github.com/snapcore/snapd/pull/7162 - perhaps any init() functions in the affected packages are relevant?

I opened https://github.com/snapcore/snapd/pull/7190 to see if the traceback from Maciek matches what is going on on Travis.

Running more tests I also saw this failure:

FAIL: session_agent_test.go:65: sessionAgentSuite.TestStartStop

session_agent_test.go:84:
    c.Check(agent.Stop(), IsNil)
... value *errors.errorString = &errors.errorString{s:"context canceled"} ("context canceled")

session_agent_test.go:85:
    ...
session_agent_test.go:70:
    defer func() { c.Check(agent.Stop(), IsNil) }()
... value *errors.errorString = &errors.errorString{s:"context canceled"} ("context canceled")

OOPS: 2 passed, 1 FAILED
--- FAIL: Test (0.01s)
FAIL
FAIL	github.com/snapcore/snapd/usersession/agent	0.031s

Any ideas?

Might not be the last fix for this, but we’re definitely leaking file descriptors. https://github.com/snapcore/snapd/pull/7195

This actually does look related to the “context canceled” error.

In SessionAgent.Stop, we’re calling http.Server.Shutdown with a context set to expire in 5 seconds, then passing the return value to the tomb. If we’re seeing the context.Canceled error, then graceful shutdown did not complete within that time.

If the response body has not been fully consumed on the client side, perhaps it is holding the connection in the active state?

More fixes were pushed to #7195. In order to avoid such problems in the future we’re passing -timeout 5m to tests now.

Anyways, the problem still occurs randomly. Relevant backtrace:

ok  	github.com/snapcore/snapd/cmd/cmdutil	0.063s
=== RUN   Test
2019/07/30 12:48:41.809817 cmd_run.go:779: cannot extract runtime data: cannot parse end of exec profile: EOF
panic: test timed out after 5m0s

goroutine 2449 [running]:
testing.startAlarm.func1()
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:1145 +0xf9
created by time.goFunc
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/time/sleep.go:170 +0x44

goroutine 1 [chan receive, 5 minutes]:
testing.(*T).Run(0xc4200c8000, 0xc7d924, 0x4, 0xccdaf8, 0x485b66)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:790 +0x2fc
testing.runTests.func1(0xc4200c8000)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:1004 +0x64
testing.tRunner(0xc4200c8000, 0xc4204b7de0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:746 +0xd0
testing.runTests(0xc42000c180, 0x10b04b0, 0x1, 0x1, 0xc0020488200)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:1002 +0x2d8
testing.(*M).Run(0xc4203cbf18, 0xc4204b7f70)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:921 +0x111
main.main()
	github.com/snapcore/snapd/cmd/snap/_test/_testmain.go:46 +0xdb

goroutine 5 [syscall, 5 minutes]:
os/signal.signal_recv(0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/runtime/sigqueue.go:131 +0xa6
os/signal.loop()
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/os/signal/signal_unix.go:28 +0x41

goroutine 20 [chan receive, 4 minutes]:
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.(*suiteRunner).runTest(0xc42022d780, 0xc420250540, 0x0)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:819 +0x55
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.(*suiteRunner).run(0xc42022d780, 0xc4200e8000)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:624 +0xfc
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.Run(0xc4e940, 0xc4200e8000, 0xc420010780, 0xc4200e87e0)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/run.go:92 +0x4d
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.RunAll(0xc420010780, 0x0)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/run.go:84 +0xa4
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.TestingT(0xc4200c80f0)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/run.go:72 +0x386
github.com/snapcore/snapd/cmd/snap_test.Test(0xc4200c80f0)
	/home/travis/gopath/src/github.com/snapcore/snapd/cmd/snap/main_test.go:50 +0x2b
testing.tRunner(0xc4200c80f0, 0xccdaf8)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:746 +0xd0
created by testing.(*T).Run
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/testing/testing.go:789 +0x2de

goroutine 2442 [chan receive, 4 minutes]:
github.com/snapcore/snapd/vendor/github.com/godbus/dbus.(*Conn).outWorker(0xc4200be300)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/github.com/godbus/dbus/conn.go:427 +0x63
created by github.com/snapcore/snapd/vendor/github.com/godbus/dbus.(*Conn).Auth
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/github.com/godbus/dbus/auth.go:119 +0x701

goroutine 2439 [select, 4 minutes]:
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.(*resultTracker)._loopRoutine(0xc42052a000)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:470 +0xff
created by github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.(*resultTracker).start
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:450 +0x3f

goroutine 513 [select, 4 minutes, locked to thread]:
runtime.gopark(0xcce8a8, 0x0, 0xc7ff57, 0x6, 0x18, 0x1)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/runtime/proc.go:287 +0x12c
runtime.selectgo(0xc42010b750, 0xc420358780)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/runtime/select.go:395 +0x1149
runtime.ensureSigM.func1()
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/runtime/signal_unix.go:511 +0x220
runtime.goexit()
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/runtime/asm_amd64.s:2337 +0x1

goroutine 2433 [select, 4 minutes]:
github.com/snapcore/snapd/cmd/snap.(*cmdUserd).runAgent(0xc42015ba48, 0x0, 0x0)
	/home/travis/gopath/src/github.com/snapcore/snapd/cmd/snap/cmd_userd.go:111 +0x250
github.com/snapcore/snapd/cmd/snap.(*cmdUserd).Execute(0xc42015ba48, 0xc42048bbe0, 0x0, 0x2, 0xc42015ba48, 0x1)
	/home/travis/gopath/src/github.com/snapcore/snapd/cmd/snap/cmd_userd.go:72 +0x4c
github.com/snapcore/snapd/vendor/github.com/jessevdk/go-flags.(*Parser).ParseArgs(0xc420320690, 0xc42048bb40, 0x2, 0x2, 0xc420049ab0, 0x4b39a3, 0xc4e940, 0xc4202498c0, 0x10)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/github.com/jessevdk/go-flags/parser.go:333 +0x849
github.com/snapcore/snapd/cmd/snap_test.(*userdSuite).TestSessionAgentSocket(0xc4200e8000, 0xc4200c82d0)
	/home/travis/gopath/src/github.com/snapcore/snapd/cmd/snap/cmd_userd_test.go:148 +0xea
reflect.Value.call(0xc4e940, 0xc4200e8000, 0x2e13, 0xc7d728, 0x4, 0xc420049f70, 0x1, 0x1, 0x10e5b00, 0xc6fca0, ...)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc4e940, 0xc4200e8000, 0x2e13, 0xc420037f70, 0x1, 0x1, 0xc4200c83c0, 0xc420037f07, 0xc4203d6100)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/reflect/value.go:302 +0xa4
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc4200c82d0)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:781 +0x588
github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1(0xc42022d780, 0xc4200c82d0, 0xc42000cc20)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:675 +0x7c
created by github.com/snapcore/snapd/vendor/gopkg.in/check%2ev1.(*suiteRunner).forkCall
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:672 +0x218

goroutine 2452 [IO wait, 4 minutes]:
internal/poll.runtime_pollWait(0x7f68e7ef3ea0, 0x72, 0xffffffffffffffff)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420365f98, 0x72, 0xc42059bc00, 0x0, 0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc420365f98, 0xffffffffffffff00, 0x0, 0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc420365f80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/internal/poll/fd_unix.go:335 +0x1e2
net.(*netFD).accept(0xc420365f80, 0xc42024a120, 0xb9ad00, 0xc42059be50)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/net/fd_unix.go:238 +0x42
net.(*UnixListener).accept(0xc4206a3950, 0x7ee3f8, 0x45ba60, 0xc42059be98)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc4206a3950, 0xcce1d8, 0xc42024a0a0, 0x107ed40, 0xc4203dc390)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/net/unixsock.go:241 +0x49
net/http.(*Server).Serve(0xc420241110, 0x107de80, 0xc4206a3950, 0x0, 0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/net/http/server.go:2695 +0x1b2
github.com/snapcore/snapd/usersession/agent.(*SessionAgent).Start.func1(0xc4200780f0, 0x0)
	/home/travis/gopath/src/github.com/snapcore/snapd/usersession/agent/session_agent.go:107 +0x49
github.com/snapcore/snapd/vendor/gopkg.in/tomb%2ev2.(*Tomb).run(0xc420365da8, 0xc420051f70)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/tomb.v2/tomb.go:163 +0x2b
created by github.com/snapcore/snapd/vendor/gopkg.in/tomb%2ev2.(*Tomb).Go
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/gopkg.in/tomb.v2/tomb.go:159 +0xc0

goroutine 2441 [IO wait, 4 minutes]:
internal/poll.runtime_pollWait(0x7f68e7efa038, 0x72, 0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc42022d818, 0x72, 0xc420649300, 0x1000, 0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc42022d818, 0xc4202bc300, 0x10, 0x10)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).ReadMsg(0xc42022d800, 0xc4202bc380, 0x10, 0x10, 0xc420649320, 0x1000, 0x1000, 0x0, 0x0, 0x0, ...)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/internal/poll/fd_unix.go:194 +0x222
net.(*netFD).readMsg(0xc42022d800, 0xc4202bc380, 0x10, 0x10, 0xc420649320, 0x1000, 0x1000, 0x0, 0xb76900, 0xc4204382a0, ...)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/net/fd_unix.go:214 +0x90
net.(*UnixConn).readMsg(0xc42000e610, 0xc4202bc380, 0x10, 0x10, 0xc420649320, 0x1000, 0x1000, 0x9, 0xc4202bc288, 0x1080fe0, ...)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/net/unixsock_posix.go:115 +0x91
net.(*UnixConn).ReadMsgUnix(0xc42000e610, 0xc4202bc380, 0x10, 0x10, 0xc420649320, 0x1000, 0x1000, 0x4125c6, 0x7f68e7f469a8, 0x1, ...)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/net/unixsock.go:137 +0xaa
github.com/snapcore/snapd/vendor/github.com/godbus/dbus.(*oobReader).Read(0xc420649300, 0xc4202bc380, 0x10, 0x10, 0xc420649300, 0x10c4340, 0x0)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/github.com/godbus/dbus/transport_unix.go:21 +0x8f
io.ReadAtLeast(0x1074340, 0xc420649300, 0xc4202bc380, 0x10, 0x10, 0x10, 0xbffa20, 0xb63d01, 0xc420649300)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/io/io.go:309 +0x86
io.ReadFull(0x1074340, 0xc420649300, 0xc4202bc380, 0x10, 0x10, 0x9, 0x0, 0x0)
	/home/travis/.gimme/versions/go1.9.7.linux.amd64/src/io/io.go:327 +0x58
github.com/snapcore/snapd/vendor/github.com/godbus/dbus.(*unixTransport).ReadMessage(0xc420050970, 0xccdce8, 0xc4200be300, 0xc4203dc000)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/github.com/godbus/dbus/transport_unix.go:85 +0x115
github.com/snapcore/snapd/vendor/github.com/godbus/dbus.(*Conn).inWorker(0xc4200be300)
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/github.com/godbus/dbus/conn.go:285 +0x4b
created by github.com/snapcore/snapd/vendor/github.com/godbus/dbus.(*Conn).Auth
	/home/travis/gopath/src/github.com/snapcore/snapd/vendor/github.com/godbus/dbus/auth.go:118 +0x6dc
FAIL	github.com/snapcore/snapd/cmd/snap	300.133s
=== RUN   Test
OK: 22 passed

Userd and session-agent use the same pattern, they start, and set up channel to receive signals on. In the tests, the signal is sent in a goroutine to own pid (as obtained from os.Getpid()). Looking at the backtrace above, the tests are waiting for session-agent to stop. The SIGUSR1 must have been sent, since the goroutine that does it is already gone. However, it seems as if the signal was never received. Investigation continues.

Edit: cannot reproduce the problem locally with Go 1.9.x either

Perhaps we should not rely on signals but instead use channels for all test notification?

The tests in question are actually trying to test signal behaviour though. The commands in question are written to exit cleanly on the INT, TERM, and USR1 signals. The first two are the ones that matter in real life (e.g. systemd trying to shut down a service), but the test uses USR1 since the default handler won’t terminate the test suite itself If the code malfunctions.

I’ve refactored signal handling in userd to allow better mocking. Since new wrapper was added, I’ve added a test to make sure it is covered as well. The branch has landed and we’re updating open PRs.