Unexpected reboot during snapd test execution on Ubuntu Core 16

While running beta validation for 2.32.5, I have seen the system reboot unexpectedly during random tests.

The failures happen in different tests; I am still trying to find a pattern. I am attaching some logs for the different errors.

Syslogs:

Thanks for sharing this! These reboots are concerning. I just tried to reproduce them with a hand-made image (following the tests/external-backend.md docs) using 2.32.5 from beta. In this morning's run (https://paste.ubuntu.com/p/pFDh9Yqmxv/) I did not see a reboot and the run finished successfully. I used the release/2.32 branch for the spread tests and also added the patch from #5059 to detect pending shutdowns during the tests. I will rerun the tests to see if I can reproduce it.
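For anyone who wants to check for a pending shutdown by hand, here is a rough sketch. It assumes that systemd-logind records a scheduled shutdown as KEY=VALUE lines in /run/systemd/shutdown/scheduled; that path and the helper names are my assumptions, and this is not necessarily how the patch in #5059 does it.

```python
from pathlib import Path

# Assumption: logind writes scheduled shutdowns to this file as KEY=VALUE lines.
SCHEDULED_PATH = "/run/systemd/shutdown/scheduled"


def parse_scheduled(text):
    """Parse KEY=VALUE lines into a dict, ignoring lines without '='."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pending_shutdown(path=SCHEDULED_PATH):
    """Return the parsed fields if a shutdown is scheduled, else None."""
    try:
        text = Path(path).read_text()
    except FileNotFoundError:
        # No file means nothing is scheduled (under the assumption above).
        return None
    return parse_scheduled(text) or None
```

Calling pending_shutdown() between tests and failing the run when it returns something other than None would point at the test that scheduled the reboot.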

I just did a second run and this time I got a kernel error “linux-kernel-bde (14371): DMA allocation failed” (full details in https://paste.ubuntu.com/p/Kypt2bx9yN/). This was with the kernel from the beta channel (released yesterday).

I ran a couple of tests this morning and ran into some issues:

Runaway network-bind-consumer process

I had a runaway “python3 network-bind-consumer” process that made “journalctl --sync” hang (presumably because it kept generating output, which prevented journalctl --sync from settling). I pushed https://github.com/snapcore/snapd/pull/5061 for this. The problem is the ordering: we call journalctl --sync first and only then remove the snaps, so a runaway process that keeps writing to the journal can prevent journalctl --sync from ever finishing.
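Independent of the ordering fix, a hang like this is easier to spot if the sync is run with a timeout. The sketch below is only an illustration of that idea (the helper name and the timeout are made up, and it is not what the PR does):

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it did not finish in time.

    subprocess.run() kills the child process when the timeout expires
    and then raises TimeoutExpired.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s, check=False).returncode
    except subprocess.TimeoutExpired:
        return None
```

For example, run_with_timeout(["journalctl", "--sync"], 30) returning None would mean the sync did not settle within 30 seconds, which is a hint that something is still flooding the journal.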

Kernel oops

I had kernel oopses during the run and added https://github.com/snapcore/snapd/pull/5060 to detect those. I saw them with the pc-kernel from the “beta” channel; it is worth checking whether this also happens with the stable kernel.
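To illustrate what such a check can look like, here is a minimal sketch that scans kernel log lines for common failure markers. The marker list and function name are my own choices for illustration and are not taken from the PR.

```python
import re

# Assumption: an illustrative, non-exhaustive set of markers that the
# kernel prints for oopses, BUGs and failed page allocations.
KERNEL_PROBLEM = re.compile(r"Oops:|kernel BUG at|BUG: |page allocation failure")


def find_kernel_problems(lines):
    """Return the log lines that contain one of the known failure markers."""
    return [line for line in lines if KERNEL_PROBLEM.search(line)]
```

Feeding it the output of "dmesg" (or "journalctl -k") after each test and failing on a non-empty result would tie an oops to the test that was running at the time.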

Reboot during the tests

The interfaces-content test triggers the reboot for me. The reason is that this test removes state.json, which triggers a reseeding and installs the kernel from beta again. I had refreshed to the stable kernel in my testing to avoid being hit by kernel oopses, so reinstalling the beta kernel made snapd trigger a reboot. I pushed https://github.com/snapcore/snapd/pull/5062 for this.

Just did another run with the PRs from above applied:

...
2018-04-17 11:01:20 Restoring external:ubuntu-core-16-64:tests/regression/...
2018-04-17 11:01:22 Restoring external:ubuntu-core-16-64...
2018-04-17 11:01:23 Successful tasks: 198
2018-04-17 11:01:23 Aborted tasks: 0

A second run (with the stable kernel) triggered an Oops: https://paste.ubuntu.com/p/5kz2qHXNyQ/

With some help from the kernel team (thanks Andy!) we found that the oops is triggered by the linux-kernel-bde module. It asks for a huge chunk of contiguous kernel memory (page allocation failure: order:7), which may no longer be available once the system has run for a bit. A possible workaround is to run the test early, so that the module gets loaded when the system is fresh and has lots of contiguous memory. This is a bug in the driver. The following PR https://github.com/snapcore/snapd/pull/5063 adds a workaround. I am running the tests again with it in place.
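For a sense of scale: an order-N allocation asks for 2^N physically contiguous pages, so order:7 is 128 pages. The sketch below does that arithmetic and counts how many free blocks of at least a given order /proc/buddyinfo reports. It assumes 4 KiB pages and the usual buddyinfo layout (one line per zone, one column per order starting at 0); the function names are just for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on typical x86-64 systems


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one contiguous block of the given order."""
    return (1 << order) * page_size


def free_blocks_at_least(buddyinfo_text, order):
    """Sum the free blocks of at least `order` over all zones.

    Expects lines shaped like: Node 0, zone Normal c0 c1 c2 ...
    where cN is the number of free blocks of order N.
    """
    total = 0
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue  # skip anything that is not a zone line
        counts = [int(c) for c in parts[4:]]
        total += sum(counts[order:])
    return total
```

With 4 KiB pages, order_to_bytes(7) is 512 KiB. Running free_blocks_at_least(open("/proc/buddyinfo").read(), 7) on a fresh system and again after a long test run should show how fragmentation eats into the large blocks.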

After applying the above PRs I got three successful runs in a row without hiccups.