We have over the last month or so seen a fairly substantial number of bricked devices that all end up in the following state (as in multiple reports per day).
It's hard to reproduce, and our theory is that it happens when snapd is upgrading between certain versions.
If you connect a serial console to the device, this is the output you receive:
U-Boot 2017.05-dirty (Mar 12 2018 - 09:55:39 +0000)
DRAM: 752 MiB
RPI 3 Model B (0xa22082)
MMC: bcm2835_sdhci: 0
reading uboot.env
In: serial
Out: serial
Err: lcd
Net: Net Initialization Skipped
No ethernet found.
Hit any key to stop autoboot: 0
switch to partitions #0, OK
mmc0 is current device
LOADBOOTENV
reading uEnv.txt
** Unable to read file uEnv.txt **
Running uenvcmd ...
ENVCMD
Saving Environment to FAT...
writing uboot.env
FAT: Misaligned buffer address (2eb00560)
FAT: Misaligned buffer address (2eb00760)
done
reading pi2-kernel_52.snap/kernel.img
5724256 bytes read in 919 ms (5.9 MiB/s)
reading pi2-kernel_52.snap/initrd.img
3059751 bytes read in 509 ms (5.7 MiB/s)
reading psplash.img
FAT: Misaligned buffer address (023eb027)
1343508 bytes read in 6139 ms (212.9 KiB/s)
Kernel image @ 0x200000 [ 0x000000 - 0x575860 ]
## Flattened Device Tree blob at 02000000
Booting using the fdt blob at 0x2000000
Loading Ramdisk to 2e6ec000, end 2eb1f03b ... OK
Loading Device Tree to 2e6e4000, end 2e6eb350 ... OK
Starting kernel ...
After the last line, it is unresponsive (both over the serial console and with a keyboard).
Thanks for the problem report! Just to double check: what you see on the serial console is this specific kernel booting and then nothing, and at the same time on the monitor you see the mount failure and it stops. What happens if you reboot the device? Does it go into the same mount failure state?
We need to get the state.json and the /boot/uboot/uboot.env file; we will get in touch with @ondra for this.
Yeah, we noticed the duplicate files too (which looked suspicious).
That said, @ondra said he found something just now and he’s looking into it.
As you correctly point out, this is identical to the previously reported issue. At that point in time, we thought it was related to our udisk tool and we spent a fair amount of time refactoring it. It does, however, seem that this was not the root cause and that we are dealing with the same issue.
Thank you for the update. We are very keen to get to the bottom of this (and make it more robust).
Maybe you can (via PM if you prefer) tell me/us if/how the vfat partition is touched/modified when you customize the image.
It seems that for some reason there is a mismatch between the kernel vfat and the uboot vfat, which results in this failure. We have not observed this kind of duplication ourselves yet, but your installed base is much higher than our testing machines, so if it's a race of some sort the chances you hit it are of course much higher.
Is there any pattern to this failure? Like power failure or unclean shutdowns or anything like this?
Fwiw, the complete failure is fully explained by the different file names. The Linux side writes "snap_mode=try". Then the system boots and the uboot vfat writes the file with "snap_mode=trying". However, at this point it is a different file. So on boot the Linux side does not see "trying", assumes that things went wrong and reverts the core (removes the new core). On the next reboot the bootloader still loads the "snap_mode=try" file, so it tries (again) to load the new kernel. But this core was removed in the previous revert. So the failure boils down to the mismatch of the two files. We need to figure out why this is happening; it might be a bug in the fat driver of uboot or it might be some subtle corruption on the vfat partition (or something else entirely).
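To make that sequence easier to follow, here is a minimal sketch of the try/trying handshake as described above; the names (bootEnv, beforeReboot, afterBoot) are illustrative only and not snapd's actual API:

// Minimal sketch of the try/trying handshake; names are illustrative only.
package main

import "fmt"

// bootEnv stands in for the key/value pairs stored in uboot.env.
type bootEnv map[string]string

// beforeReboot: snapd asks the bootloader to try the new core once.
func beforeReboot(env bootEnv, newCore string) {
	env["snap_try_core"] = newCore
	env["snap_mode"] = "try"
}

// bootloaderFlip: on the next boot u-boot flips "try" to "trying".
func bootloaderFlip(env bootEnv) {
	if env["snap_mode"] == "try" {
		env["snap_mode"] = "trying"
	}
}

// afterBoot: snapd only commits the new core if it sees "trying";
// anything else is treated as a failed try and the new core is removed.
func afterBoot(env bootEnv) string {
	if env["snap_mode"] == "trying" {
		env["snap_core"] = env["snap_try_core"]
		env["snap_mode"], env["snap_try_core"] = "", ""
		return "commit new core"
	}
	return "revert (remove new core)"
}

func main() {
	env := bootEnv{"snap_core": "4409"}
	beforeReboot(env, "4489")

	// Healthy device: u-boot flips try -> trying in the same file.
	// Affected device: the flip lands in a different (corrupted) uboot.env,
	// so from Linux's point of view the mode is still "try".
	flipIsLost := true
	if !flipIsLost {
		bootloaderFlip(env)
	}

	fmt.Println(afterBoot(env)) // prints "revert (remove new core)" here
}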
It always happens during reboots (but that's a given), and I have seen it myself happen even during clean reboots.
Also, @renat2017 will provide you with a diff between our gadget snap and the upstream gadget snap. I’m almost 100% confident that we can rule out our changes there as a root cause of this.
I wonder if it would help if we always mounted /boot with case=lower. While we don’t understand the cause of the issue it is clear that UBOOT.ENV is the one written from linux (at the very least, because it has a timestamp that’s not bogus). If Linux was overwriting the same file we would not be in this situation.
while this is a great workaround, it will only hide the issue after all … finding the reason why it actually creates an uppercase file at all would be nice …
Interestingly I think (since I saw this issue before) that this is valid. The lower case one looks bogus. FAT has two file names for everything, one 8.3 and one long one (where the 8.3 name has a specific content). AFAIK the 8.3 names are all upper case unless you use long file name support.
We could indeed default to an uppercase file (needs u-boot code changes though, it hardcodes the filename at build time) … but interestingly this does not seem to happen for anyone else (I personally have Pi installs running over here that were originally installed when we started core-16 and still run and upgrade happily without causing such an issue). This seems to only happen to these specific images.
Just to chime in here - it doesn't happen to all our installs. We have 100s of devices running happily that receive updates; then in other cases (like this week) we had a customer where it happened twice on the same RPi in 48h.
I wish I could provide a better way to narrow down the exact condition where it happens, but it appears to be rather random unfortunately.
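If it helps narrow things down, here is a rough Go sketch that checks a device for case variants of uboot.env. The /boot/uboot mount point is assumed from the paths mentioned earlier, and a normal vfat mount may hide one of the two entries, so running fatcat against the raw device remains the more reliable check:

// Rough check for case variants of uboot.env on the boot partition.
// The /boot/uboot mount point is an assumption based on this thread.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	entries, err := os.ReadDir("/boot/uboot")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		if strings.EqualFold(e.Name(), "uboot.env") {
			// A healthy image shows exactly one entry here; seeing both a
			// lower-case and an upper-case variant matches the corruption
			// discussed in this thread.
			fmt.Println("boot env entry:", e.Name())
		}
	}
}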
Here is a summary of what I think is happening on the affected devices:

1. The system boots, uboot changes "snap_mode" to "snap_mode=trying" but writes this to a different (corrupted) uboot.env. This shows up with "fatcat" as a lower-case file instead of the all upper case (UBOOT.ENV) that linux writes.

2. The system boots and snapd runs devicemgr.go:ensureBootOk, which will run partition.MarkBootSuccessful. This checks if the boot env is in trying mode and, if not, does nothing, so the boot config from step (1) is still in place (a simplified sketch of this gate follows below the list).

3. After the reboot, ifacestate/handlers.go:doSetupProfiles() will run, notice we are in "corePhase2" and query the current core revision. This will get "snap_core=4409" (and not 4489) because the partition.MarkBootSuccessful step, which would normally update snap_try_core -> snap_core, is only run if we are in snap_mode=trying (which is not the case due to the above corruption).

4. So the current core in the state is "4489" but snapd assumes we booted with "4409", and this mismatch causes the setup-security (phase2) task for the core install to fail.

5. The failure in step (4) undoes the core install lane, which also means that the new core 4489 is removed. This also triggers snapstate/handlers.go:undoUnlinkCurrentSnap, which calls boot/kernel_os.go:SetNextBoot().

6. kernel_os.go:SetNextBoot() would normally set "snap_try_core=4409" and "snap_mode=try". However, "snap_core=4409" is already set, so there is no need to set "snap_try_core=4409,snap_mode=try" and the bootenv is left unmodified. This means the next boot will still use the configuration from step (1) and never the version the bootloader wrote that sets snap_mode=trying. If it would read the snap_mode=trying one, things would revert fine.

7. The undo triggered by the failure in step (4) will also cause a reboot via maybeReboot in snapstate/handlers.go:undoUnlinkCurrentSnap.

8. The next boot will fail because core_4489 is no longer available. Subsequent boots keep failing because uboot reads the uboot.env file from step (1) but writes a corrupted one.

9. fsck.vfat will not report any errors on the partition.
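As referenced in step (2), here is a simplified sketch of the gate that keeps the commit from happening; this is illustrative only and not the actual partition package code:

// Simplified sketch of the "mark boot successful" gate from steps (2)/(3).
package main

import "fmt"

type bootEnv map[string]string

// markBootSuccessful only commits snap_try_core -> snap_core when the
// bootloader has flipped snap_mode to "trying".
func markBootSuccessful(env bootEnv) {
	if env["snap_mode"] != "trying" {
		// Not in "trying" mode: nothing to commit, the env is left alone.
		return
	}
	env["snap_core"] = env["snap_try_core"]
	env["snap_try_core"] = ""
	env["snap_mode"] = ""
}

func main() {
	// What the Linux side sees on an affected device: the bootloader's flip
	// to "trying" went into the corrupted file, so the mode is still "try".
	env := bootEnv{
		"snap_core":     "4409",
		"snap_try_core": "4489",
		"snap_mode":     "try",
	}
	markBootSuccessful(env)
	// snap_core is still 4409 while the state already records 4489, which is
	// the mismatch that makes the setup-security task fail.
	fmt.Println(env["snap_core"])
}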
Root cause
The root of the problem is that "snap_mode=try" is something that is changed on reboot by the uboot/grub bootloader to either snap_mode=trying or snap_mode="". Our code does not deal with the "impossible" condition of still finding snap_mode=try after a boot.
Mitigation
One workaround to make things more robust would be to ensure that in kernel_os.go:SetNextBoot() we clear "snap_mode" if it is in "try" mode.
This means that in step 8 the boot will no longer fail and the system is no longer bricked; it will boot from core_4409. It does mean that systems are stuck at this specific core revision until we fix the underlying bug. However, if the bug is a random FAT write failure in uboot (say, every 100th write), then the next refresh would fix things. It's unclear if that is the case though; it seems like once the fat partition is in this state with two uboot files, it is permanent.
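To make that concrete, here is a rough sketch of what clearing the stale mode in SetNextBoot() could look like; the function and field names are illustrative and not the actual kernel_os.go code:

// Rough sketch of the mitigation: have SetNextBoot clear a stale
// "snap_mode=try" instead of leaving the boot env untouched when the
// requested revision is already the current snap_core.
package main

import "fmt"

type bootEnv map[string]string

func setNextBoot(env bootEnv, nextCore string) {
	if env["snap_core"] == nextCore {
		// Previously: nothing to do, env left unmodified (step 6 above).
		// Mitigation: clear a leftover "try" so the bootloader stops trying
		// the revision that the undo has just removed.
		if env["snap_mode"] == "try" {
			env["snap_mode"] = ""
			env["snap_try_core"] = ""
		}
		return
	}
	env["snap_try_core"] = nextCore
	env["snap_mode"] = "try"
}

func main() {
	// State after the failed install's undo on an affected device:
	env := bootEnv{"snap_core": "4409", "snap_try_core": "4489", "snap_mode": "try"}
	setNextBoot(env, "4409")
	fmt.Println(env) // snap_mode and snap_try_core cleared -> next boot uses 4409
}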