We have over the last month or so seen a fairly substantial number of bricked devices that all end up in the following state (as in multiple reports per day).
It's hard to reproduce, and our theory is that it happens when snapd is upgrading between certain versions.
If you connect a serial console to the device, this is the output you receive:
U-Boot 2017.05-dirty (Mar 12 2018 - 09:55:39 +0000)
DRAM: 752 MiB
RPI 3 Model B (0xa22082)
MMC: bcm2835_sdhci: 0
reading uboot.env
In: serial
Out: serial
Err: lcd
Net: Net Initialization Skipped
No ethernet found.
Hit any key to stop autoboot: 0
switch to partitions #0, OK
mmc0 is current device
LOADBOOTENV
reading uEnv.txt
** Unable to read file uEnv.txt **
Running uenvcmd ...
ENVCMD
Saving Environment to FAT...
writing uboot.env
FAT: Misaligned buffer address (2eb00560)
FAT: Misaligned buffer address (2eb00760)
done
reading pi2-kernel_52.snap/kernel.img
5724256 bytes read in 919 ms (5.9 MiB/s)
reading pi2-kernel_52.snap/initrd.img
3059751 bytes read in 509 ms (5.7 MiB/s)
reading psplash.img
FAT: Misaligned buffer address (023eb027)
1343508 bytes read in 6139 ms (212.9 KiB/s)
Kernel image @ 0x200000 [ 0x000000 - 0x575860 ]
## Flattened Device Tree blob at 02000000
Booting using the fdt blob at 0x2000000
Loading Ramdisk to 2e6ec000, end 2eb1f03b ... OK
Loading Device Tree to 2e6e4000, end 2e6eb350 ... OK
Starting kernel ...
After the last line, it is unresponsive (both over the serial console and with a keyboard).
Thanks for the problem report! Just to double check: what you see on the serial console is this specific kernel booting and then nothing, and at the same time on the monitor you see the mount failure and it stops. What happens if you reboot the device? Does it go into the same mount failure state?
We need to get the state.json and the /boot/uboot/uboot.env file; we will get in touch with @ondra for this.
Yeah, we noticed the duplicate files too (which looked suspicious).
That said, @ondra said he found something just now and he’s looking into it.
As you correctly point out, this is identical to the previously reported issue. At that point in time, we thought it was related to our udisk tool and we spent a fair amount of time refactoring it. It does, however, seem that this was not the root cause and that we are dealing with the same issue.
Thank you for the update. We are very keen to get to the bottom of this (and make it more robust).
Maybe you can (via PM if you prefer) tell me/us if/how the vfat partition is touched/modified when you customize the image.
It seems that for some reason there is a mismatch between the kernel vfat and the uboot vfat, which results in this failure. We have not observed this kind of duplication ourselves yet, but your installed base is much higher than our testing machines, so if it's a race of some sort the chances you hit it are of course much higher.
Is there any pattern to this failure? Like power failure or unclean shutdowns or anything like this?
Fwiw, the complete failure is fully explained by the different file names. The Linux side writes "snap_mode=try". Then the system boots and the uboot vfat writes the file with "snap_mode=trying". However, at this point it is a different file. So on boot the Linux side does not see "trying", assumes that things went wrong and reverts the core (removes the new core). On the next reboot the bootloader still loads the "snap_mode=try" file, so it tries (again) to load the new kernel. But this core was removed in the previous revert. So the failure boils down to the mismatch of the two files. We need to figure out why this is happening; it might be a bug in the fat driver of uboot or it might be some subtle corruption on the vfat partition (or something else entirely).
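To make that sequence easier to follow, here is a minimal sketch of the try/trying handshake as described above; the names (bootEnv, beforeReboot, afterBoot) are illustrative only and not snapd's actual API:

// Minimal sketch of the try/trying handshake; names are illustrative only.
package main

import "fmt"

// bootEnv stands in for the key/value pairs stored in uboot.env.
type bootEnv map[string]string

// beforeReboot: snapd asks the bootloader to try the new core once.
func beforeReboot(env bootEnv, newCore string) {
	env["snap_try_core"] = newCore
	env["snap_mode"] = "try"
}

// bootloaderFlip: on the next boot u-boot flips "try" to "trying".
func bootloaderFlip(env bootEnv) {
	if env["snap_mode"] == "try" {
		env["snap_mode"] = "trying"
	}
}

// afterBoot: snapd only commits the new core if it sees "trying";
// anything else is treated as a failed try and the new core is removed.
func afterBoot(env bootEnv) string {
	if env["snap_mode"] == "trying" {
		env["snap_core"] = env["snap_try_core"]
		env["snap_mode"], env["snap_try_core"] = "", ""
		return "commit new core"
	}
	return "revert (remove new core)"
}

func main() {
	env := bootEnv{"snap_core": "4409"}
	beforeReboot(env, "4489")

	// Healthy device: u-boot flips try -> trying in the same file.
	// Affected device: the flip lands in a different (corrupted) uboot.env,
	// so from Linux's point of view the mode is still "try".
	flipIsLost := true
	if !flipIsLost {
		bootloaderFlip(env)
	}

	fmt.Println(afterBoot(env)) // prints "revert (remove new core)" here
}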
It always happens during reboots (but that's a given), and I have seen it myself happen even during clean reboots.
Also, @renat2017 will provide you with a diff between our gadget snap and the upstream gadget snap. I’m almost 100% confident that we can rule out our changes there as a root cause of this.
I wonder if it would help if we always mounted /boot with case=lower. While we don’t understand the cause of the issue it is clear that UBOOT.ENV is the one written from linux (at the very least, because it has a timestamp that’s not bogus). If Linux was overwriting the same file we would not be in this situation.
while this is a great workaround, it will only hide the issue after all … finding the reason why it actually creates an uppercase file at all would be nice …
Interestingly I think (since I saw this issue before) that this is valid. The lower case one looks bogus. FAT has two file names for everything, one 8.3 and one long one (where the 8.3 name has a specific content). AFAIK the 8.3 names are all upper case unless you use long file name support.
We could indeed default to an uppercase file (needs u-boot code changes though, it hardcodes the filename at build time) … but interestingly this does not seem to happen for anyone else (I personally have Pi installs running over here that were originally installed when we started core-16 and still run and upgrade happily without causing such an issue). This seems to only happen to these specific images.
Just to chime in here - it doesn't happen to all our installs. We have 100s of devices running happily that receive updates; then in other cases (like this week) we had a customer where it happened twice on the same RPi in 48h.
I wish I could provide a better way to narrow down the exact condition where it happens, but it appears to be rather random unfortunately.
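If it helps narrow things down, here is a rough Go sketch that checks a device for case variants of uboot.env. The /boot/uboot mount point is assumed from the paths mentioned earlier, and a normal vfat mount may hide one of the two entries, so running fatcat against the raw device remains the more reliable check:

// Rough check for case variants of uboot.env on the boot partition.
// The /boot/uboot mount point is an assumption based on this thread.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	entries, err := os.ReadDir("/boot/uboot")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		if strings.EqualFold(e.Name(), "uboot.env") {
			// A healthy image shows exactly one entry here; seeing both a
			// lower-case and an upper-case variant matches the corruption
			// discussed in this thread.
			fmt.Println("boot env entry:", e.Name())
		}
	}
}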
Here is a summary of what I think is happening on the affected devices:

1. The system boots, uboot changes "snap_mode" to "snap_mode=trying" but writes this to a different (corrupted) uboot.env. This shows up with "fatcat" as a lower-case file instead of the all upper case (UBOOT.ENV) that linux writes.

2. The system boots and snapd runs devicemgr.go:ensureBootOk, which will run partition.MarkBootSuccessful. This checks if the boot env is in trying mode and, if not, does nothing, so the boot config from step (1) is still in place (a simplified sketch of this gate follows below the list).

3. After the reboot, ifacestate/handlers.go:doSetupProfiles() will run, notice we are in "corePhase2" and query the current core revision. This will get "snap_core=4409" (and not 4489) because the partition.MarkBootSuccessful step, which would normally update snap_try_core -> snap_core, is only run if we are in snap_mode=trying (which is not the case due to the above corruption).

4. So the current core in the state is "4489" but snapd assumes we booted with "4409", and this mismatch causes the setup-security (phase2) task for the core install to fail.

5. The failure in step (4) undoes the core install lane, which also means that the new core 4489 is removed. This also triggers snapstate/handlers.go:undoUnlinkCurrentSnap, which calls boot/kernel_os.go:SetNextBoot().

6. kernel_os.go:SetNextBoot() would normally set "snap_try_core=4409" and "snap_mode=try". However, "snap_core=4409" is already set, so there is no need to set "snap_try_core=4409,snap_mode=try" and the bootenv is left unmodified. This means the next boot will still use the configuration from step (1) and never the version the bootloader wrote that sets snap_mode=trying. If it would read the snap_mode=trying one, things would revert fine.

7. The undo triggered by the failure in step (4) will also cause a reboot via maybeReboot in snapstate/handlers.go:undoUnlinkCurrentSnap.

8. The next boot will fail because core_4489 is no longer available. Subsequent boots keep failing because uboot reads the uboot.env file from step (1) but writes a corrupted one.

9. fsck.vfat will not report any errors on the partition.
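As referenced in step (2), here is a simplified sketch of the gate that keeps the commit from happening; this is illustrative only and not the actual partition package code:

// Simplified sketch of the "mark boot successful" gate from steps (2)/(3).
package main

import "fmt"

type bootEnv map[string]string

// markBootSuccessful only commits snap_try_core -> snap_core when the
// bootloader has flipped snap_mode to "trying".
func markBootSuccessful(env bootEnv) {
	if env["snap_mode"] != "trying" {
		// Not in "trying" mode: nothing to commit, the env is left alone.
		return
	}
	env["snap_core"] = env["snap_try_core"]
	env["snap_try_core"] = ""
	env["snap_mode"] = ""
}

func main() {
	// What the Linux side sees on an affected device: the bootloader's flip
	// to "trying" went into the corrupted file, so the mode is still "try".
	env := bootEnv{
		"snap_core":     "4409",
		"snap_try_core": "4489",
		"snap_mode":     "try",
	}
	markBootSuccessful(env)
	// snap_core is still 4409 while the state already records 4489, which is
	// the mismatch that makes the setup-security task fail.
	fmt.Println(env["snap_core"])
}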
Root cause
The root of the problem is that "snap_mode=try" is something that is changed on reboot by the uboot/grub bootloader to either snap_mode=trying or snap_mode="". Our code does not deal with the "impossible" condition of still finding snap_mode=try after a boot.
Mitigation
One workaround to make things more robust would be to ensure that in kernel_os.go:SetNextBoot() we clear "snap_mode" if it is in "try" mode.
This means that in step 8 the boot will no longer fail and the system is no longer bricked; it will boot from core_4409. It does mean that systems are stuck at this specific core revision until we fix the underlying bug. However, if the bug is a random FAT write failure in uboot (say, every 100th write), then the next refresh would fix things. It's unclear if that is the case though; it seems like once the fat partition is in this state with two uboot files, it is permanent.
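To make that concrete, here is a rough sketch of what clearing the stale mode in SetNextBoot() could look like; the function and field names are illustrative and not the actual kernel_os.go code:

// Rough sketch of the mitigation: have SetNextBoot clear a stale
// "snap_mode=try" instead of leaving the boot env untouched when the
// requested revision is already the current snap_core.
package main

import "fmt"

type bootEnv map[string]string

func setNextBoot(env bootEnv, nextCore string) {
	if env["snap_core"] == nextCore {
		// Previously: nothing to do, env left unmodified (step 6 above).
		// Mitigation: clear a leftover "try" so the bootloader stops trying
		// the revision that the undo has just removed.
		if env["snap_mode"] == "try" {
			env["snap_mode"] = ""
			env["snap_try_core"] = ""
		}
		return
	}
	env["snap_try_core"] = nextCore
	env["snap_mode"] = "try"
}

func main() {
	// State after the failed install's undo on an affected device:
	env := bootEnv{"snap_core": "4409", "snap_try_core": "4489", "snap_mode": "try"}
	setNextBoot(env, "4409")
	fmt.Println(env) // snap_mode and snap_try_core cleared -> next boot uses 4409
}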