How to figure out why Ubuntu Core keeps restarting

Hi,
I am looking at the logs of device that keeps restarting. The only relevant thing I can find in the logs is this:

2018/02/09 14:37:17.801552 handlers.go:306: Reported install problem for "core" as 3043e93c-1be3-11e8-bae9-fa163e192766 OOPSID
2018-02-27T17:26:08Z INFO Waiting for restart...
snap "core" has "refresh-snap"

There are a bunch of messages like this from the last hour. Any idea what I should look at, or whether any of these messages indicate an issue?

Thanks for your message! I checked the error tracker and the error reported there is: ERROR cannot finish core installation, there was a rollback across reboot. The full syslog might indicate why the device reverts back to the old version on reboot. The output of snap changes may also give us some hints (it should show the failed change); then the output of snap change <failed-change-nr> would be great.

This is a node a customer is using, so I can't SSH in. I have a Python service that has the snapd-control interface connected. Can I use that to get the snap changes information?
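For what it's worth, something along these lines is what I had in mind from the Python service. This is only a rough sketch, assuming the snapd-control plug gives the service access to the snapd socket at /run/snapd.socket and the standard GET /v2/changes endpoint:

import http.client
import json
import socket


class SnapdConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to the snapd UNIX socket instead of TCP."""

    def __init__(self):
        super().__init__("localhost")

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect("/run/snapd.socket")


def get_changes():
    conn = SnapdConnection()
    # select=all also returns finished changes, not just in-progress ones
    conn.request("GET", "/v2/changes?select=all")
    response = json.loads(conn.getresponse().read())
    return response.get("result", [])


if __name__ == "__main__":
    for change in get_changes():
        print(change["id"], change["kind"], change["status"], change.get("summary", ""))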

I will try to get more of the system journal logs tomorrow. I am shipping them all to a logging service; I just need to filter out our own services' logs from there.

This hastebin has all the logs from the system journal across one or more power cycles: https://hastebin.com/unerabagik.coffeescript

I tried to filter out most of our logs from there, but it's still quite a lot. Hopefully this can help identify the issue. Just a note about the snapd version: the image was created on 20th February from the stable channel, so it's quite new.

OK, I am being told that that device node has now died.

And this issue keeps happening on different nodes. We have similar reports from different customers, so it seems to be a pretty critical bug.

If you have access to a device that is failing, there are things we might try (for one, enabling debug logs on snapd).

Looking at the logs, the only thing that seems strange (beyond the OOPS itself) is:

var-lib-sudo.mount: Directory /var/lib/sudo to mount over is not empty, mounting anyway.
var-lib-extrausers.mount: Directory /var/lib/extrausers to mount over is not empty, mounting anyway.
var-log.mount: Directory /var/log to mount over is not empty, mounting anyway.
var-lib-snapd.mount: Directory /var/lib/snapd to mount over is not empty, mounting anyway.
snap.mount: Directory /snap to mount over is not empty, mounting anyway.
var-lib-initramfs\x2dtools.mount: Directory /var/lib/initramfs-tools to mount over is not empty, mounting anyway.
var-lib-apparmor.mount: Directory /var/lib/apparmor to mount over is not empty, mounting anyway.

This isn't my area of expertise.
@ogra, does our initramfs cope with having those mounts? Or is it working ‘under’ the mount?

I don't have one, unfortunately. This is happening on customer nodes. We had one device die and made a copy of its SD card (at that point it already fails to boot).

Yes, this sounds very critical. If you have a copy of the SD card from the failing node, could you please do an "ls -l" of the root of the SD card, and maybe of the first level of subdirectories? Also, does fsck show anything? Anything in /lost+found?

I was able to reproduce this with a custom image provided by @sborovkov (thanks for this). It seems this is only reproducible when a USB stick is inserted into the device before the core refresh. The udisks2 snap must also be installed, and potentially the application from @sborovkov (that is not clear yet). While this is reproducible, the root cause has not been found yet. The error message is cannot finish core installation, there was a rollback across reboot, but looking at the /snap/core/current symlink before the reboot and from uboot, everything looks good.


The reason for the cannot finish core installation, there was a rollback across reboot error is that the /boot/uboot/uboot.env file appears twice on the FAT filesystem mounted at /boot/uboot.

One version of that file (uboot.env) still has snap_mode=try, and with that value snapd does not consider the boot into the new core successful (that only happens once snap_mode=trying is reached). This is the file that snapd reads.

However, the second file (UBOOT.ENV) has snap_mode=trying. This is what uboot set and what snapd expects, except that it is in the "wrong" file. fatcat shows that the two files have different cases, but the regular Linux vfat module does not seem to be able to show this.

What is interesting is that the fresh image starts with UBOOT.ENV, and after the refresh we end up with one uppercase and one lowercase version of this file. It looks like uboot and Linux do not agree about the casing. It also seems to be specific to the image for some reason. I tried this with a manually created image (using the ubuntu-image 1.3+18.04ubuntu2 bionic deb with udisks2 in the default install) and no mixed casing happened.
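In case anyone wants to double check this on their own image, here is a rough Python sketch for reading snap_mode out of such an environment file. It assumes the simple, non-redundant U-Boot environment layout (a 4-byte CRC32 followed by NUL-separated key=value entries), and both copies would first need to be extracted from the FAT partition with something like fatcat, since the Linux vfat driver only shows one of them:

import binascii
import sys


def read_uboot_env(path):
    # Rough parser for a non-redundant U-Boot environment blob:
    # 4-byte CRC32 (little-endian assumed here), then NUL-separated key=value entries.
    with open(path, "rb") as f:
        blob = f.read()
    stored_crc = int.from_bytes(blob[:4], "little")
    data = blob[4:]
    if binascii.crc32(data) != stored_crc:
        print(f"{path}: warning, CRC mismatch", file=sys.stderr)
    env = {}
    for entry in data.split(b"\x00"):
        if not entry or entry[0] == 0xFF:  # end of entries or 0xff padding
            break
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "snap_mode =", read_uboot_env(path).get("snap_mode"))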


Thank you very much for the careful inspection and for the quality of the details, @mvo.

@sborovkov It looks like many of those details are actually specific to your image. If that’s indeed the case, do you have enough details to continue fixing the situation on your end? Feel free to let us know if there are changes that you think are necessary in snapd to better support your use case.

@mvo, @niemeyer, thanks a lot! I found that this may be happening because udisks2 reports that mmcblk0p1 (which is mounted as /boot/, I think) is MediaRemovable, and we try to mount/unmount it.
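In case it helps confirm that, one quick (untested) check is to read the removable flag the kernel exposes in sysfs for the whole device, which I believe is one of the hints udisks2 looks at:

from pathlib import Path

# Untested sketch: "mmcblk0" is assumed here, adjust for the device in question.
dev = "mmcblk0"
removable = Path(f"/sys/block/{dev}/removable").read_text().strip()
print(f"/sys/block/{dev}/removable = {removable}")  # "1" means the kernel reports it as removable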