Thanks, mvo. We would really appreciate it if we could get a hotfix out this week (even if it is just a temporary workaround).
@mvo Thanks a lot for digging into this issue, Michael.
@vpetersson As Michael mentioned, this is a top priority for us, so we will have a fix out as soon as we can code and test it appropriately. Our goal is to make snapd more robust to the nature of the bug, instead of simply fixing its cause. In other words, if a similar situation happens in the future due to a bug elsewhere, the fix we’re putting in place will prevent any relevant damage from happening.
We apologize for the trouble caused by this issue.
It turns out the FSCK0000.000:uboot.env file is annoyingly hard to fix. Mounting with -t msdos and removing fsck0000.000 leaves the l-f-n for uboot.env around as an (orphaned) directory entry. This seems to confuse uboot, which will happily (re)create a fsck0000.000:uboot.env file, and we are back at square one. Even running fsck.vfat after deleting fsck0000.000 in msdos mode does not help: fsck claims that it removed the orphaned entry, but uboot picks it up again and writes the file as before.
So far only the following has worked, and it requires us to add the nls_ascii.ko module to the initramfs (which is not too bad, it's small):

    mount -t vfat -o rw,check=s,shortname=win95,iocharset=ascii "$boot_partition" "$tmpboot_mnt"
    if stat "$tmpboot_mnt/uboot.env" && stat "$tmpboot_mnt/UBOOT.ENV"; then
        mv "$tmpboot_mnt/UBOOT.ENV" "$tmpboot_mnt/uboot.env"
    fi

Here is the sequence of events:
1. snapd refreshes
2. snapd writes a new uboot.env with snap_mode=try
3. snapd reboots
4. uboot starts, loads uboot.env and writes it back with snap_mode=trying
5. the system boots; /boot/uboot is set to fs_passno=2, so systemd checks it before mounting it
6. the file is corrupted, most likely by the uboot write (but there is a chance the fs is in a strange state already and uboot just makes it worse)
7. the fsck in (5) renames the file to fsck0000.000 but leaves the l-f-n entries intact (probably a bug in dosfstools, possible fix https://github.com/dosfstools/dosfstools/pull/83)
   7.a here we can mitigate: fix dosfstools
   7.b here we can mitigate: add snapd.core-fixup code
8. snapd is now confused because it sees snap_mode=try instead of snap_mode=trying and the vfat system shows two uboot.env files
   8.a here we can mitigate and make snapd robust against this situation (by cleaning snap_mode=try at the right time)
9. snapd reverts to 4409 and reboots
10. from now on uboot writes to the wrong (fsck0000.000:uboot.env) file
The relevant PRs to mitigate this problem are:
Once those are in, the devices should recover gracefully. I have a system based on the corrupted fs-image. When I ran the core-fixup code that will be part of each boot, the system recovered and eventually refreshed to the right update.
You the man, @mvo! Will this be rolled out as a hotfix release when they get landed?
The critical snappy bug is still unassigned, though, is that correct? https://bugs.launchpad.net/snapd/+bug/1769669
A new core release 2.32.8 with the workaround is now in the beta channel.
@kyleN We have a workaround now and we think we are good there, but the underlying bug (I suspect in uboot) is still not fully understood. We will discuss in the next team meeting who will look at it. The situation should be stable now with the fix, i.e. it should stop the catastrophic failures we are currently seeing, and machines will continue to refresh.
I’ve created a small tool that can parse FAT and edit directory entries, both regular and LFN ones. For now it works with FAT12/FAT16. I have done only limited testing with FAT32 and I’d guess it does not work yet. All in all, because of the legacy stuff, FAT is super weird to parse, with many corner cases. Here’s the code:
So far I have been able to generate an image, then forcefully switch the LFN of one entry so that it conflicts with another entry that has an identical short name. Running fsck on such an image:

    $ /snap/core/current/sbin/fsck.vfat -av img
    fsck.fat 3.0.28 (2015-05-16)
    Checking we can access the last sector of the filesystem
    Boot sector contents:
    System ID "mkfs.fat"
    Media byte 0xf8 (hard disk)
           512 bytes per logical sector
          2048 bytes per cluster
             1 reserved sector
    First FAT starts at byte 512 (sector 1)
             2 FATs, 12 bit entries
          1024 bytes per FAT (= 2 sectors)
    Root directory starts at byte 2560 (sector 5)
           512 root directory entries
    Data area starts at byte 18944 (sector 37)
           502 data clusters (1028096 bytes)
    32 sectors/track, 64 heads
             0 hidden sectors
          2048 sectors total
    Checksum in long filename part wrong (48 vs. expected 9a).
      Not auto-correcting this.
    Wrong checksum for long file name "uboot.env".
      (Short name UBOOT.ENV may have changed without updating the long name)
      Not auto-correcting this.
    /UBOOT.ENV
      Duplicate directory entry.
      First  Size 8 bytes, date 16:22:00 maj 14 2018
      Second Size 8 bytes, date 16:22:02 maj 14 2018
      Auto-renaming second.
      Renamed to FSCK0000.000
    Reclaiming unconnected clusters.
    Performing changes.
    img: 2 files, 2/502 clusters

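For anyone wondering where the checksum values in that output come from: every long-file-name entry carries a one-byte checksum of the 11-byte short name it belongs to, which is how fsck notices that an LFN no longer matches its short entry. Below is a minimal sketch of that calculation in POSIX shell, assuming the rotate-right-and-add rule as I understand it from the FAT specification. The function name lfn_checksum is made up for illustration and is not part of dosfstools or snapd.

```shell
# lfn_checksum NAME11 (hypothetical helper): print, as two hex digits, the
# LFN checksum of an 11-character, space-padded 8.3 short name such as
# "UBOOT   ENV" (8 name characters followed by 3 extension characters).
lfn_checksum() {
    name=$1
    if [ "${#name}" -ne 11 ]; then
        echo "short name must be exactly 11 characters" >&2
        return 1
    fi
    sum=0
    # od prints the byte values of the name as unsigned decimals
    for c in $(printf '%s' "$name" | od -An -v -tu1); do
        # rotate the 8-bit sum right by one bit, add the byte, keep 8 bits
        sum=$(( ((((sum & 1) << 7) | (sum >> 1)) + c) & 255 ))
    done
    printf '%02x\n' "$sum"
}
```

Running `lfn_checksum 'UBOOT   ENV'` prints 48, which is one of the two values fsck reports in the checksum message above.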
Listing files with mdir:

    $ mdir -i img
     Volume in drive : has no label
     Volume Serial Number is F66D-CA4C
    Directory for ::/

    uboot    env         8 2018-05-14  14:22
    FSCK0000 000         8 2018-05-14  14:22  uboot.env
            2 files                  16 bytes
                              1 024 000 bytes free

When the image is mounted, the files are identically named; only mounting with shortname=win95 makes it possible to distinguish one from the other.

    $ sudo mount -o check=s /dev/loop4 /mnt/tmp
    $ ls -l /mnt/tmp
    total 4
    -rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
    -rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
    $ sudo umount /dev/loop4
    $ sudo mount -o check=s,shortname=win95 /dev/loop4 /mnt/tmp
    $ ls -l /mnt/tmp
    total 4
    -rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
    -rwxr-xr-x 1 root root 8 05-14 16:39 UBOOT.ENV

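A quick way to spot this state from a script is to look for names that occur more than once in a directory listing once case is ignored. The following is only a sketch: the helper name duplicate_names is hypothetical, and it just inspects whatever names are fed to it on stdin.

```shell
# duplicate_names (hypothetical helper): read file names from stdin, one
# per line, and print each name that occurs more than once when compared
# case-insensitively. Names are printed in lower case.
duplicate_names() {
    tr '[:upper:]' '[:lower:]' | LC_ALL=C sort | uniq -d
}
```

With the image mounted as above, `ls -1A /mnt/tmp | duplicate_names` should report uboot.env for both mount variants, since the two entries differ at most in case.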
Thanks a lot for doing this! Fwiw, I tested the proposed fix to dosfstools https://github.com/dosfstools/dosfstools/pull/83 with your tool and it seems to DTRT, i.e. with the unpatched fsck I don’t see the new short name FSCK0000.000 but instead a confusing uboot.env. With the patched fsck the long name is gone and only FSCK0000.000 is visible.
@mvo (et al). Good job on fixing this, but do we have any timeframe for promoting this from beta? We desperately need this fix in the stable channel.
We will push this as quickly as QA permits. Our QA team is currently testing this on real devices. They are based in a US timezone, so results are not in yet, but I will post an update here as soon as I can.
The new version of the core snap with the fix is in the candidate channel. We plan to release it to stable this Monday (2018-05-21). Please help with testing; ideally we would verify that it fixes the issue for real. However, AIUI there is no way to reproduce this, right? It just happens out of the blue?
@mvo, there is a way to 100% reproduce it in our custom image.
Will try that today.
@renat2017 That is excellent, please keep me posted. Can you share (maybe privately if there are concerns about leaking information) how to reproduce it? Maybe it gives us further clues into the root cause of the issue.
I tried to reproduce it with a candidate image and the bug didn’t appear.
The test was a little bit different, though. I created an image with an outdated pi2 kernel added via the --extra-snaps argument and tried to update only the pi2 kernel snap, whereas our issue was happening when snapd was updating 4 snaps: kernel, core, gadget and our software snap.
@mvo - did this get promoted to snapd Stable today? I don’t have a device to test with here as I’m traveling.
The workaround has been in stable for a while. We have now got a reply from upstream as well, and there is a fix there too. I created https://bugs.launchpad.net/ubuntu/+source/dosfstools/+bug/1776523 so that we can SRU the fixed fsck.vfat into Ubuntu Core. Ideally we would have it in edge/beta for a while. Is that something that you could test? The upstream diff looks fine, but nothing beats real-world testing.
Generally speaking, things seem to have been resolved by the prior fix, but as far as I understand, this is the root cause fix, whereas the other one was the bandaid.