Snapd causes corruption on upgrade


#21

One more interesting observation. If the incorrect file (lower-case uboot.env) gets removed, subsequent uboot writes go into the correct “UBOOT.ENV” file, i.e. the UBOOT.ENV file is now in “snap_mode=trying”, which means that in the worst case a power cycle will fix the device. However, I have not (yet) found a way to remove the incorrect file from Linux; I only managed it by removing both files and copying the original one back.

I.e. it looks like once the system is in this strange state things keep being bad but once the file is gone the bug does not happen again immediately (but it may happen again at some future point of course).


#22

due to the u-boot design, the uboot.env filename needs to be hardcoded in the build config…
it might help if we simply use the uppercase name here ?
(since the linux side doesn't seem to care about capitalization)


#23

Looking at the uboot source code, I don’t think that will help. Uboot’s FAT implementation uses a lower-case lookup function, so how the file is named should have no impact.


#24

the question is, does it also use lower-case write() functions :wink:


#25

And another interesting fact:

I modified go-fat to work with fat32 and poked at the image. It turns out that with this mode there is just a single uboot.env and a FSCK0000.000 that contains the uboot.env: snap_mode=trying bits. This is also observable when mounting the image with -t msdos instead of the default vfat.
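For anyone who wants to check which snap_mode a given copy of the environment carries, here is a minimal Python sketch. It assumes the layout snapd appears to use for uboot.env (a 4-byte little-endian CRC32, one flags byte, then NUL-terminated key=value entries); `parse_uboot_env` and `make_uboot_env` are hypothetical helper names, and the `0xff` padding in the builder is my assumption as well.

```python
import struct
import zlib

HEADER_SIZE = 5  # assumption: 4-byte CRC32 followed by 1 flags byte


def parse_uboot_env(blob: bytes) -> dict:
    """Parse an environment blob into a dict, verifying the CRC first."""
    if len(blob) < HEADER_SIZE:
        raise ValueError("blob too short")
    (stored_crc,) = struct.unpack_from("<I", blob, 0)
    payload = blob[HEADER_SIZE:]
    if zlib.crc32(payload) & 0xFFFFFFFF != stored_crc:
        raise ValueError("CRC mismatch")
    env = {}
    # Entries are NUL-terminated; an empty entry marks the end of the list.
    for entry in payload.split(b"\x00"):
        if not entry:
            break
        key, _, value = entry.decode("ascii", "replace").partition("=")
        env[key] = value
    return env


def make_uboot_env(env: dict, size: int = 128) -> bytes:
    """Build a blob in the same assumed layout (only used to demo the parser)."""
    body = b"".join(f"{k}={v}".encode("ascii") + b"\x00" for k, v in env.items())
    payload = (body + b"\x00").ljust(size - HEADER_SIZE, b"\xff")
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("<I", crc) + b"\x00" + payload


blob = make_uboot_env({"snap_mode": "trying"})
print(parse_uboot_env(blob)["snap_mode"])
```

Running the parser against both copies pulled out of an image would show directly which one holds snap_mode=trying.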


#26

I think so, it looks like it writes the lower-case file name when it writes to a file.


#27

So is the solution to mount things as msdos from linux?


#28

Why the duplicated uboot.env

Here is the current status of the investigation around the double uboot.env mystery:

  • Something ran an fsck on the system-boot partition
  • A FAT16 file FSCK0000.000 got created, but this file has the “lfn” (long-file-name) name “uboot.env” (lower-case), and that name is stored (as specified) directly before the FSCK0000.000 entry.
  • Our system also has UBOOT.ENV (FAT16, no lfn entry for this file)
  • A vfat based file system will ignore the FAT16 name and just display the “lfn” name, which happens to be uboot.env. As FAT is case-insensitive, both files will be displayed as uboot.env
  • An msdos based filesystem will ignore the lfn entries and just pick the FAT16 name. This is why we see the different files here.
  • uboot is confused, but only slightly: it looks for a uboot.env file and uses the lfn filename, which is (arguably) not incorrect. The real problem is that fsck created a file FSCK0000.000 with a very strange lfn entry in the directory entry next to it.

The following gist contains code that validates this theory https://gist.github.com/mvo5/8e51b0efbbf5d12963192ba79d580b85

When I run it I see:

f "BCM271~1.DTB" "bcm2710-rpi-cm3.dtb" 
f "START_X.ELF" "START_X.ELF" 
f "PSPLASH.IMG" "PSPLASH.IMG" 
f "BCM270~1.DTB" "bcm2709-rpi-2-b.dtb" 
f "CONFIG.TXT" "CONFIG.TXT" 
f "BOOTCODE.BIN" "BOOTCODE.BIN" 
f "COPYIN~1.LIN" "COPYING.linux" 
f "START_CD.ELF" "START_CD.ELF" 
f "LICENC~1.BRO" "LICENCE.broadcom" 
f "UBOOT.BIN" "UBOOT.BIN" 
f "START.ELF" "START.ELF" 
d OVERLAYS 
f "FIXUP_X.DAT" "FIXUP_X.DAT" 
f "BCM271~2.DTB" "bcm2710-rpi-3-b.dtb" 
f "FIXUP_DB.DAT" "FIXUP_DB.DAT" 
f "FIXUP.DAT" "FIXUP.DAT" 
f "FIXUP_CD.DAT" "FIXUP_CD.DAT" 
f "UBOOT.ENV" "UBOOT.ENV" 
----- "k\xc03f\x00allargs=setenv bootargs \"${args} r...
d pi2-kernel_51.snap 
f "CMDLINE.TXT" "CMDLINE.TXT" 
f "START_DB.ELF" "START_DB.ELF" 
d pi2-kernel_52.snap 
f "FSCK0000.000" "uboot.env" 
----- "\x0e\x8f\xba\xac\x00allargs=setenv bootargs \"${args}

When dumping the full output the FSCK0000.000 file is the uboot.env with the snap_mode=trying.
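To make the short-name/lfn pairing concrete without running the gist, here is a small Python sketch that walks raw 32-byte FAT directory entries and prints each file's 8.3 name next to its long name. This is not the gist's code: `make_short_entry`, `make_lfn_entry` and `list_directory` are hypothetical helpers, the sample directory is built in memory, and the sketch deliberately ignores the lfn checksum and multi-entry long names.

```python
ATTR_LFN = 0x0F  # attribute value that marks a long-file-name entry


def make_short_entry(name83: str, attr: int = 0x20) -> bytes:
    """Build a 32-byte 8.3 directory entry (only name and attribute filled in)."""
    base, _, ext = name83.partition(".")
    raw = base.ljust(8).encode("ascii") + ext.ljust(3).encode("ascii")
    return raw + bytes([attr]) + bytes(20)


def make_lfn_entry(name: str, seq: int = 0x41) -> bytes:
    """Build a single lfn entry (names up to 13 characters, checksum left 0)."""
    units = (name.encode("utf-16-le") + b"\x00\x00").ljust(26, b"\xff")[:26]
    return (bytes([seq]) + units[0:10] + bytes([ATTR_LFN, 0, 0])
            + units[10:22] + bytes(2) + units[22:26])


def list_directory(raw: bytes):
    """Yield (short_name, long_name) for every file entry in a raw directory."""
    lfn_parts = []
    for off in range(0, len(raw) - 31, 32):
        entry = raw[off:off + 32]
        if entry[0] == 0x00:          # end of directory
            break
        if entry[0] == 0xE5:          # deleted entry
            lfn_parts = []
            continue
        if entry[11] == ATTR_LFN:     # lfn entries precede their short entry
            lfn_parts.insert(0, entry[1:11] + entry[14:26] + entry[28:32])
            continue
        base = entry[0:8].decode("ascii", "replace").rstrip()
        ext = entry[8:11].decode("ascii", "replace").rstrip()
        short = f"{base}.{ext}" if ext else base
        if lfn_parts:
            text = b"".join(lfn_parts).decode("utf-16-le", "replace")
            long_name = text.split("\x00", 1)[0]
        else:
            long_name = short         # no lfn: only the 8.3 name exists
        lfn_parts = []
        yield short, long_name


raw = (make_short_entry("UBOOT.ENV")
       + make_lfn_entry("uboot.env") + make_short_entry("FSCK0000.000")
       + bytes(32))
for short, long_name in list_directory(raw):
    print(f'f "{short}" "{long_name}"')
```

A vfat-style reader reports the second element of each pair and an msdos-style reader the first, which is how one directory can appear to hold two uboot.env files.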

Next steps

  • make snapd itself more robust against this failure scenario (as outlined above)
  • figure out what wrote the FSCK0000.000 file - fsck.vfat from dosfstools creates fsck0000.rec files, the creation of FSCK0000.000 files is strange.
  • add code to the system to ensure we clean FSCK0000 files when we see them.
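As a rough illustration of the last point, the check could start from (short name, long name) pairs like the ones in the listing above. The Python sketch below is only that, a sketch: the `FSCK` + four digits + `.` + three digits pattern is my generalisation from the single FSCK0000.000 name we have seen, and `find_shadowing_leftovers` is a hypothetical helper, not code from any of the PRs.

```python
import re

# Assumption: fsck-generated names all look like the one FSCK0000.000 seen here.
FSCK_SHORT_NAME = re.compile(r"^FSCK\d{4}\.\d{3}$", re.IGNORECASE)


def find_shadowing_leftovers(entries):
    """Return (fsck_short, long_name, real_short) for every fsck-named entry
    whose long name collides, case-insensitively, with a regular file."""
    regular = {}
    for short, long_name in entries:
        if not FSCK_SHORT_NAME.match(short):
            regular[long_name.lower()] = short
    hits = []
    for short, long_name in entries:
        if FSCK_SHORT_NAME.match(short) and long_name.lower() in regular:
            hits.append((short, long_name, regular[long_name.lower()]))
    return hits


entries = [
    ("UBOOT.ENV", "UBOOT.ENV"),
    ("CMDLINE.TXT", "CMDLINE.TXT"),
    ("FSCK0000.000", "uboot.env"),
]
print(find_shadowing_leftovers(entries))
```

This only detects the situation; how to remove the leftover safely is a separate problem.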

#29

Unfortunately that won’t work: we use long filenames like pi2-kernel_52.snap (whose base name is longer than 8 chars), which would look like pi2-ke~1.sna :confused:
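For reference, a simplified Python sketch of how such an 8.3 alias is derived from a long name. It is an approximation of the usual scheme (upper-case, keep the first six characters of the base, append `~N`, keep three extension characters), not the exact algorithm of any driver; `short_alias` is a hypothetical helper and the numeric tail is passed in rather than chosen by collision checks.

```python
def _clean(part: str) -> str:
    """Upper-case a name part, dropping spaces/dots and replacing odd characters."""
    out = []
    for ch in part.upper():
        if ch in " .":
            continue
        out.append(ch if ch.isalnum() or ch in "-_~" else "_")
    return "".join(out)


def short_alias(long_name: str, index: int = 1) -> str:
    """Return an 8.3-style alias for long_name (simplified)."""
    base, dot, ext = long_name.rpartition(".")
    if not dot:                       # no extension at all
        base, ext = long_name, ""
    base, ext = _clean(base), _clean(ext)
    plain = f"{base}.{ext}" if ext else base
    if len(base) <= 8 and len(ext) <= 3 and plain == long_name.upper():
        return plain                  # already fits into 8.3
    tail = f"~{index}"
    alias = base[:8 - len(tail)] + tail
    return f"{alias}.{ext[:3]}" if ext else alias


print(short_alias("pi2-kernel_52.snap"))
```

The results line up with the short names in the listing from post #28, e.g. bcm2710-rpi-cm3.dtb maps to BCM271~1.DTB.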


#30

Too bad, it would work if we called it with an 8.3 name :slight_smile:


#31

it would also work if you mounted it with “sync” in the mount options and removed all traces of fsck.vfat from the system :stuck_out_tongue: but that's as bad as making everything 8+3 :slight_smile:

(and would make kernel snap updates really slow)


#32

Do we have a rough ETA for a fix?


#33

We hope to have something by next week, we are currently exploring multiple angles to make it more robust overall. The following PR https://github.com/snapcore/core-build/pull/29 should provide a workaround. I think the underlying bug is in dosfstools and (from code inspection) I suspect https://github.com/dosfstools/dosfstools/pull/83 will help.


#34

Thanks, mvo. We would really appreciate it if we could get a hotfix out this week (even if it is just a temporary workaround).


#35

@mvo Thanks a lot for digging into this issue, Michael.

@vpetersson As Michael mentioned, this is a top priority for us, so we will have a fix out as soon as we can code and test it appropriately. Our goal is to make snapd more robust to the nature of the bug, instead of simply fixing its cause. In other words, if a similar situation happens in the future due to a bug elsewhere, the fix we’re putting in place will prevent any relevant damage from happening.

We apologize for the trouble caused by this issue.


#36

It turns out the FSCK0000.000:uboot.env file is annoyingly hard to fix. Mounting using -t msdos and removing fsck0000.000 means we leave the l-f-n for uboot.env around as an (orphaned) directory entry. This seems to confuse uboot, which will happily (re)create a fsck0000.000:uboot.env file, and we are back at square one. Even running fsck.vfat after deleting fsck0000.000 in msdos mode does not help. Fsck will claim that it removed the orphaned entry, but uboot will pick it up again and write the file as before.

So far only the following has worked, and it requires us to add the nls_ascii.ko module to the initramfs (which is not too bad, it's small):

mount -t vfat -o rw,check=s,shortname=win95,iocharset=ascii "$boot_partition" "$tmpboot_mnt"
if stat "$tmpboot_mnt/uboot.env" && stat "$tmpboot_mnt/UBOOT.ENV"; then
     mv "$tmpboot_mnt/UBOOT.ENV" "$tmpboot_mnt/uboot.env"
fi

#37

Here is the sequence of events:

  1. snapd refreshes
  2. writes new uboot.env with snap_mode=try
  3. snapd reboots
  4. uboot starts, loads the uboot.env and writes it with snap_mode=trying
  5. system boots, /boot/uboot is set to fs_passno=2 and systemd checks it before it mounts it
  6. the file is corrupted, most likely by the uboot write (but there is a chance the fs is in a strange state already and uboot just makes it worse)
  7. the fsck in (5) renames the file to fsck0000.000 but leaves the l-f-n entries intact (probably a bug in dosfstools, possible fix https://github.com/dosfstools/dosfstools/pull/83)
    7.a here we can mitigate: fix dosfstools
    7.b here we can mitigate: add snapd.core-fixup code
  8. snapd is now confused because snap_mode=try instead of snap_mode=trying and the vfat system shows two uboot.env files
    8.a here we can mitigate and make snapd robust against this situation (by cleaning snap_mode=try at the right time).
  9. snapd reverts to 4409 and reboots
  10. from now on uboot writes to the wrong (fsck0000.000:uboot.env) file
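The try/trying handshake behind steps 2 to 8 can be modelled in a few lines of Python. This is a simplified model under my own assumptions (in particular that the bootloader falls back when it finds snap_mode=trying at boot), not snapd's or uboot's actual code; `bootloader_boot` and `snapd_view_after_boot` are hypothetical names.

```python
def bootloader_boot(env: dict) -> str:
    """Model of the bootloader step: pick what to boot and update snap_mode."""
    mode = env.get("snap_mode", "")
    if mode == "try":
        env["snap_mode"] = "trying"   # record that a try boot is in progress
        return "try-snap"
    if mode == "trying":
        env["snap_mode"] = ""         # previous try boot was never confirmed
        return "known-good"
    return "known-good"


def snapd_view_after_boot(env_seen_by_linux: dict) -> str:
    """Model of what snapd concludes from the env file Linux shows it."""
    mode = env_seen_by_linux.get("snap_mode", "")
    if mode == "trying":
        return "boot ok: commit the new snap"
    if mode == "try":
        return "inconsistent: bootloader never recorded the try boot"
    return "nothing pending"


# Healthy case: bootloader and snapd look at the same file.
env = {"snap_mode": "try"}
bootloader_boot(env)
print(snapd_view_after_boot(env))

# Broken case: the copy uboot wrote ends up as FSCK0000.000:uboot.env,
# while Linux keeps reading a stale copy that still says "try".
stale_copy = {"snap_mode": "try"}
uboot_copy = dict(stale_copy)
bootloader_boot(uboot_copy)
print(snapd_view_after_boot(stale_copy))
```

In the broken case the two sides disagree about the same boot, which is the confusion described in step 8.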

#38

The relevant PRs to mitigate this problem are:

Once those are in, the devices should recover gracefully. I have a system based on the corrupted fs-image. When I ran the core-fixup code that will be part of each boot, the system recovered and eventually refreshed to the right update.


#39

You the man, @mvo! Will this be rolled out as a hotfix release when they get landed?


#40

Thanks Michael!

The critical snappy bug is still unassigned, though, is that correct? https://bugs.launchpad.net/snapd/+bug/1769669