A new core release 2.32.8 with the workaround is now in the beta channel.
@kyleN We have a workaround now and we think we are good there, but the underlying bug (I suspect in uboot) is still not fully understood. We will discuss in the next team meeting who will look at it. The situation should be stable now with the fix, i.e. it should stop the catastrophic failures we are currently seeing, and machines will continue to refresh.
I’ve created a small tool that can parse FAT and edit directory entries, both regular and LFN ones. For now it works with FAT12/FAT16. I have done only limited testing with FAT32, and I’d guess it does not work yet. All in all, because of the legacy stuff, FAT is super weird to parse, with many corner cases. Here’s the code:
So far I have been able to generate an image, then forcefully switch the LFN of one entry so that it conflicts with another entry that has an identical short name. Running fsck on such an image:
$ /snap/core/current/sbin/fsck.vfat -av img
fsck.fat 3.0.28 (2015-05-16)
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "mkfs.fat"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
      2048 bytes per cluster
         1 reserved sector
First FAT starts at byte 512 (sector 1)
         2 FATs, 12 bit entries
      1024 bytes per FAT (= 2 sectors)
Root directory starts at byte 2560 (sector 5)
       512 root directory entries
Data area starts at byte 18944 (sector 37)
       502 data clusters (1028096 bytes)
32 sectors/track, 64 heads
         0 hidden sectors
      2048 sectors total
Checksum in long filename part wrong (48 vs. expected 9a).
  Not auto-correcting this.
Wrong checksum for long file name "uboot.env".
  (Short name UBOOT.ENV may have changed without updating the long name)
  Not auto-correcting this.
/UBOOT.ENV
  Duplicate directory entry.
  First  Size 8 bytes, date 16:22:00 maj 14 2018
  Second Size 8 bytes, date 16:22:02 maj 14 2018
  Auto-renaming second.
  Renamed to FSCK0000.000
Reclaiming unconnected clusters.
Performing changes.
img: 2 files, 2/502 clusters
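For readers wondering where the checksum in the fsck message comes from: each LFN entry stores a one-byte checksum of the 11-byte padded 8.3 short name it belongs to. The following is a minimal Python sketch of that calculation as I understand it from the FAT long-filename format, not code taken from the tool or from dosfstools. The helper `pad_83` is hypothetical and only handles simple ASCII names.

```python
def lfn_checksum(short_name: bytes) -> int:
    """Checksum of the 11-byte padded 8.3 name, as stored in each LFN entry."""
    if len(short_name) != 11:
        raise ValueError("expected the 11-byte padded 8.3 name")
    total = 0
    for byte in short_name:
        # Rotate the running 8-bit sum right by one bit, then add the next byte.
        total = (((total & 1) << 7) + (total >> 1) + byte) & 0xFF
    return total


def pad_83(name: str) -> bytes:
    """Hypothetical helper: 'UBOOT.ENV' -> b'UBOOT   ENV' (simple ASCII names only)."""
    base, _, ext = name.partition(".")
    base_bytes = base.upper().ljust(8).encode("ascii")[:8]
    ext_bytes = ext.upper().ljust(3).encode("ascii")[:3]
    return base_bytes + ext_bytes


# Checksums for the two short names that appear in the fsck run.
print(hex(lfn_checksum(pad_83("UBOOT.ENV"))))
print(hex(lfn_checksum(pad_83("FSCK0000.000"))))
```

For UBOOT.ENV this evaluates to 0x48, one of the two values in the fsck message. Because the checksum depends only on the short name, renaming the short entry without rewriting its LFN entries leaves the stored checksum stale.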
Listing files with mdir:
$ mdir -i img
 Volume in drive : has no label
 Volume Serial Number is F66D-CA4C
Directory for ::/

uboot    env         8 2018-05-14  14:22
FSCK0000 000         8 2018-05-14  14:22  uboot.env
        2 files                  16 bytes
                          1 024 000 bytes free
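The listing pairs a short 8.3 name with a long name because both live in the same directory table as separate 32-byte entries. As a rough illustration, here is a Python sketch that decodes a single entry of either kind. The field offsets follow the commonly documented FAT layout; the function name `decode_entry` and the returned dictionary shape are my own choices for illustration, not the actual tool's API.

```python
import struct

ATTR_LFN = 0x0F  # attribute byte value that marks a long-file-name entry


def decode_entry(raw: bytes) -> dict:
    """Decode one 32-byte FAT directory entry (short 8.3 or LFN)."""
    if len(raw) != 32:
        raise ValueError("a FAT directory entry is exactly 32 bytes")
    attr = raw[11]
    if attr == ATTR_LFN:
        # Up to 13 UTF-16LE characters, split across three fields.
        chars = raw[1:11] + raw[14:26] + raw[28:32]
        text = chars.decode("utf-16-le")
        # The name ends at the first NUL; anything after it is 0xFFFF padding.
        text = text.split("\x00", 1)[0].rstrip("\uffff")
        return {
            "kind": "lfn",
            "seq": raw[0] & 0x1F,         # position of this piece in the long name
            "last": bool(raw[0] & 0x40),  # set on the final piece of the name
            "checksum": raw[13],          # checksum of the owning short name
            "text": text,
        }
    base = raw[0:8].decode("ascii", "replace").rstrip()
    ext = raw[8:11].decode("ascii", "replace").rstrip()
    size = struct.unpack_from("<I", raw, 28)[0]
    return {
        "kind": "short",
        "name": base + ("." + ext if ext else ""),
        "attr": attr,
        "size": size,
    }


# Hand-built sample entries (assumed values, not read from a real image).
units = "uboot.env".encode("utf-16-le") + b"\x00\x00" + b"\xff\xff" * 3
lfn_raw = (bytes([0x41]) + units[0:10] + bytes([ATTR_LFN, 0x00, 0x48])
           + units[10:22] + b"\x00\x00" + units[22:26])
short_raw = b"UBOOT   ENV" + bytes([0x20]) + bytes(16) + struct.pack("<I", 8)

print(decode_entry(lfn_raw))
print(decode_entry(short_raw))
```

A real parser would also have to handle deleted entries, multi-piece long names stored in reverse order, and the lowercase flags, which is where many of the corner cases come from.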
After mounting the image, the two files appear under identical names; only mounting with shortname=win95 makes it possible to tell one from the other.
$ sudo mount -o check=s /dev/loop4 /mnt/tmp
$ ls -l /mnt/tmp
total 4
-rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
-rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
$ sudo umount /dev/loop4
$ sudo mount -o check=s,shortname=win95 /dev/loop4 /mnt/tmp
$ ls -l /mnt/tmp
total 4
-rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
-rwxr-xr-x 1 root root 8 05-14 16:39 UBOOT.ENV
Thanks a lot for doing this! Fwiw, I tested the proposed fix to dosfstools https://github.com/dosfstools/dosfstools/pull/83 with your tool and it seems to do the right thing: with the unpatched fsck I don’t see the new short name FSCK0000.000 and instead see a confusing uboot.env. With the patched fsck the long name is gone and only FSCK0000.000 is visible.
@mvo (et al). Good job on fixing this, but do we have any timeframe for promoting this from beta? We desperately need this fix in the stable channel.
We will push this as quickly as QA permits. Our QA team is currently testing this on real devices. They are based in a US timezone, so results are not in yet, but I will post an update here as soon as I can.
The new version of the core snap with the fix is in the candidate repository. We plan to release it to stable this Monday (2018-05-21). Please help with testing; ideally we would verify that it fixes the issue for real. However, AIUI there is no way to reproduce this, right? It just happens out of the blue?
@renat2017 That is excellent, please keep me posted. Can you share (maybe privately if there are concerns about leaking information) how to reproduce it? Maybe it gives us further clues into the root cause of the issue.
I tried to reproduce it with a candidate image and the bug didn’t appear.
The test was a little bit different, though. I created an image with an outdated pi2 kernel added using the --extra-snaps argument and tried to update only the pi2 kernel snap, whereas our issue happened when snapd was updating four snaps: kernel, core, gadget and our software snap.
@mvo - did this get promoted to snapd Stable today? I don’t have a device to test with here as I’m traveling.
The workaround has been in stable for a while. We have now also got a reply from upstream, and there is a fix there as well. I created https://bugs.launchpad.net/ubuntu/+source/dosfstools/+bug/1776523 so that we can SRU the fixed fsck.vfat into Ubuntu Core. Ideally we would have it in edge/beta for a while. Is that something you could test? The upstream diff looks fine, but nothing beats real-world testing.