One more interesting observation: if the incorrect file (lower-case uboot.env) gets removed, subsequent uboot writes go into the correct “UBOOT.ENV” file, i.e. the UBOOT.ENV file is now in “snap_mode=trying”, which means that in the worst case a power cycle will fix the device. However, I have not (yet) found a way to remove the incorrect file from Linux; I only managed it by removing both files and copying the original one back.
I.e. it looks like once the system is in this strange state things keep being bad, but once the file is gone the bug does not happen again immediately (though it may of course happen again at some future point).
due to the u-boot design, the uboot.env filename needs to be hardcoded in the build config…
it might help if we simply use the uppercase name here ?
(since the linux side doesn't seem to care about capitalization)
Looking at the uboot source code, I don’t think that will help. Uboot’s FAT implementation uses a lower-case lookup function, so how the file is named should have no impact.
I modified go-fat to work with fat32 and poked at the image. It turns out that with this mode there is just a single uboot.env and a FSCK0000.000 that contains the uboot.env: snap_mode=trying bits. This is also observable when mounting the image with -t msdos instead of the default vfat.
Here is the current status of the investigation around the double uboot.env mystery:
Something ran an fsck on the system-boot partition
A FAT16 file FSCK0000.000 got created, but this file has the “lfn” (long-file-name) name “uboot.env” (lower-case), and that name is stored (as specified) directly before the FSCK0000.000 entry.
Our system also has UBOOT.ENV (FAT16, no lfn entry for this file)
A vfat-based file system will ignore the FAT16 name and just display the “lfn” name, which happens to be uboot.env. As FAT is case-insensitive, both files will be displayed as uboot.env.
An msdos-based file system will ignore the lfn entries and just pick the FAT16 name. This is why we see the different files here.
uboot is confused, but only slightly: it looks for a uboot.env file and uses the lfn filename, which is (arguably) not incorrect. The real problem is that fsck created a file FSCK0000.000 with a very strange lfn entry in the directory entry next to it.
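The "two files with the same name" symptom described above can be checked mechanically. What follows is only a minimal sketch, not part of any fix discussed in this thread: it prints the names in a directory that collide once case is ignored. The function name `find_case_collisions` is made up for illustration, and the sketch assumes file names without newlines.

```shell
#!/bin/sh
# Hypothetical helper: print every name in directory $1 that occurs more
# than once when upper and lower case are treated as equal.
find_case_collisions() {
    dir=$1
    # List entries (including dotfiles), fold them to lower case, then
    # keep only the names that appear more than once.
    ls -1A "$dir" | tr '[:upper:]' '[:lower:]' | sort | uniq -d
}
```

Called on the mounted boot partition, e.g. `find_case_collisions /boot/uboot`, it prints nothing when every name is unique and prints the folded name for each collision.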
f "BCM271~1.DTB" "bcm2710-rpi-cm3.dtb"
f "START_X.ELF" "START_X.ELF"
f "PSPLASH.IMG" "PSPLASH.IMG"
f "BCM270~1.DTB" "bcm2709-rpi-2-b.dtb"
f "CONFIG.TXT" "CONFIG.TXT"
f "BOOTCODE.BIN" "BOOTCODE.BIN"
f "COPYIN~1.LIN" "COPYING.linux"
f "START_CD.ELF" "START_CD.ELF"
f "LICENC~1.BRO" "LICENCE.broadcom"
f "UBOOT.BIN" "UBOOT.BIN"
f "START.ELF" "START.ELF"
d OVERLAYS
f "FIXUP_X.DAT" "FIXUP_X.DAT"
f "BCM271~2.DTB" "bcm2710-rpi-3-b.dtb"
f "FIXUP_DB.DAT" "FIXUP_DB.DAT"
f "FIXUP.DAT" "FIXUP.DAT"
f "FIXUP_CD.DAT" "FIXUP_CD.DAT"
f "UBOOT.ENV" "UBOOT.ENV"
----- "k\xc03f\x00allargs=setenv bootargs \"${args} r...
d pi2-kernel_51.snap
f "CMDLINE.TXT" "CMDLINE.TXT"
f "START_DB.ELF" "START_DB.ELF"
d pi2-kernel_52.snap
f "FSCK0000.000" "uboot.env"
----- "\x0e\x8f\xba\xac\x00allargs=setenv bootargs \"${args}
When dumping the full output the FSCK0000.000 file is the uboot.env with the snap_mode=trying.
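To see which copy carries which snap_mode without reading the whole dump, the variable can be pulled out of an environment file directly. This is a rough sketch under an assumption taken from the dump above: the file starts with a 5-byte header (four checksum bytes plus one flag byte), followed by NUL-separated key=value entries. The function name `env_get` is invented here and is not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print the value of variable $2 stored in the
# u-boot environment file $1.
# Assumption: a 5-byte header precedes the NUL-separated "key=value"
# entries, as the dump above suggests; adjust the offset if the
# environment was built without the flag byte.
env_get() {
    file=$1
    key=$2
    # Skip the header, turn NUL separators into newlines, then print the
    # text after "key=" for the first matching entry.
    tail -c +6 "$file" | tr '\000' '\n' |
        LC_ALL=C sed -n "s/^${key}=//p" | head -n 1
}
```

For example, `env_get "$mnt/FSCK0000.000" snap_mode` on an msdos mount (with `$mnt` standing for wherever the partition is mounted) should show the value stored in that copy.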
Next steps
make snapd itself more robust against this failure scenario (as outlined above)
figure out what wrote the FSCK0000.000 file - fsck.vfat from dosfstools creates fsck0000.rec files, the creation of FSCK0000.000 files is strange.
add code to the system to ensure we clean FSCK0000 files when we see them.
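The last step could start with a simple detection pass. The sketch below is an assumption about how such a check might look, not the code that went into snapd: it lists files whose names match the two patterns mentioned above (FSCK0000.000 and fsck0000.rec, generalised to any four-digit number), ignoring case. The function name is made up, and `-maxdepth` and `-iname` are extensions provided by GNU, BusyBox and BSD find rather than POSIX. It matches the names the mounted file system shows, so on a vfat mount an entry that carries a long name is listed under that long name and will not match.

```shell
#!/bin/sh
# Hypothetical helper: list files directly inside directory $1 that look
# like fsck recovery leftovers. The patterns generalise the two names
# mentioned above (FSCK0000.000 and fsck0000.rec); that is an assumption.
find_fsck_leftovers() {
    dir=$1
    find "$dir" -maxdepth 1 -type f \
        \( -iname 'fsck[0-9][0-9][0-9][0-9].[0-9][0-9][0-9]' \
           -o -iname 'fsck[0-9][0-9][0-9][0-9].rec' \) | LC_ALL=C sort
}
```

The function only reports; whether and when to delete what it finds is a separate decision.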
it would also work if you mounted it with “sync” in the mount options and removed all traces of fsck.vfat from the system, but that's as bad as making everything 8+3
@mvo Thanks a lot for digging into this issue, Michael.
@vpetersson As Michael mentioned, this is a top priority for us, so we will have a fix out as soon as we can code and test it appropriately. Our goal is to make snapd more robust to the nature of the bug, instead of simply fixing its cause. In other words, if a similar situation happens in the future due to a bug elsewhere, the fix we’re putting in place will prevent any relevant damage from happening.
We apologize for the trouble caused by this issue.
It turns out the FSCK0000.000:uboot.env file is annoyingly hard to fix. Mounting using -t msdos and removing fsck0000.000 means we leave the l-f-n for uboot.env around as an (orphaned) directory entry. This seems to confuse uboot, which will happily (re)create a fsck0000.000:uboot.env file, and we are back at square one. Even running fsck.vfat after deleting fsck0000.000 in msdos mode does not help: fsck will claim that it removed the orphaned entry, but uboot will pick it up again and write the file as before.
So far only the following has worked, and it requires us to add the nls_ascii.ko module to the initramfs (which is not too bad, it's small):
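# check=s asks vfat for strict, case-sensitive name checking, so that
# uboot.env and UBOOT.ENV show up as two distinct names.
mount -t vfat -o rw,check=s,shortname=win95,iocharset=ascii "$boot_partition" "$tmpboot_mnt"
# If both names exist, move UBOOT.ENV over uboot.env so only one file remains.
if stat "$tmpboot_mnt/uboot.env" && stat "$tmpboot_mnt/UBOOT.ENV"; then
    mv "$tmpboot_mnt/UBOOT.ENV" "$tmpboot_mnt/uboot.env"
fi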
uboot starts, loads the uboot.env and writes it with snap_mode=trying
system boots, /boot/uboot is set to fs_passno=2 and systemd checks it before it mounts it
the file is corrupted, most likely by the uboot write (but there is a chance the fs is in a strange state already and uboot just makes it worse)
the fsck in (6) renames the file to fsck0000.000 but leaves the l-f-n entries intact (probably a bug in dosfstools, possible fix https://github.com/dosfstools/dosfstools/pull/83)
7.a here we can mitigate: fix dosfstools
7.b here we can mitigate: add snapd.core-fixup code
snapd is now confused because snap_mode=try instead of snap_mode=trying and the vfat system shows two uboot.env files
8.a here we can mitigate and make snapd robust against this situation (by cleaning snap_mode=try at the right time).
snapd reverts to 4409 and reboots
from now on uboot writes to the wrong (fsck0000.000:uboot.env) file
Once those are in, the devices should recover gracefully. I have a system based on the corrupted fs-image: when I ran the core-fixup code that will be part of each boot, the system recovered and eventually refreshed to the right update.