A new core release 2.32.8 with the workaround is now in the beta channel.
@kyleN We have a workaround now and we think we are good there, but the underlying bug (I suspect in uboot) is still not fully understood. We will discuss in the next team meeting who will look at it. But the situation should be stable now with the fix, i.e. it should stop the catastrophic failures we are currently seeing, and machines will continue to refresh.
I've created a small tool that can parse FAT and edit directory entries, both regular and LFN ones. For now it works with FAT12/FAT16. I have done only limited testing with FAT32 and I'd guess it does not work yet. All in all, because of the legacy stuff, FAT is super weird to parse, with many corner cases. Here's the code:
So far I have been able to generate an image, then forcefully switch the LFN of one entry so that it conflicts with another entry that has an identical short name. Running fsck on such an image:
$ /snap/core/current/sbin/fsck.vfat -av img
fsck.fat 3.0.28 (2015-05-16)
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "mkfs.fat"
Media byte 0xf8 (hard disk)
512 bytes per logical sector
2048 bytes per cluster
1 reserved sector
First FAT starts at byte 512 (sector 1)
2 FATs, 12 bit entries
1024 bytes per FAT (= 2 sectors)
Root directory starts at byte 2560 (sector 5)
512 root directory entries
Data area starts at byte 18944 (sector 37)
502 data clusters (1028096 bytes)
32 sectors/track, 64 heads
0 hidden sectors
2048 sectors total
Checksum in long filename part wrong (48 vs. expected 9a).
Not auto-correcting this.
Wrong checksum for long file name "ļ»æuboot.env".
(Short name UBOOT.ENV may have changed without updating the long name)
Not auto-correcting this.
/UBOOT.ENV
Duplicate directory entry.
First Size 8 bytes, date 16:22:00 maj 14 2018
Second Size 8 bytes, date 16:22:02 maj 14 2018
Auto-renaming second.
Renamed to FSCK0000.000
Reclaiming unconnected clusters.
Performing changes.
img: 2 files, 2/502 clusters
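The "Checksum in long filename part wrong" message in the output above refers to the one-byte checksum that each LFN entry stores for its 8.3 short name. Here is a minimal sketch of that checksum in Python, assuming the usual rotate-right-and-add scheme from the FAT long filename format. The function names are made up for illustration and are not taken from the tool or from dosfstools.

```python
def lfn_checksum(short_name: bytes) -> int:
    """Checksum of a padded 11-byte 8.3 name, as stored in LFN entries."""
    if len(short_name) != 11:
        raise ValueError("expected the padded 11-byte 8.3 name")
    total = 0
    for byte in short_name:
        # Rotate the running 8-bit sum right by one bit, then add the byte.
        total = (((total & 1) << 7) | (total >> 1)) + byte
        total &= 0xFF  # keep the sum within one byte
    return total


def pad_8_3(name: str) -> bytes:
    """Hypothetical helper: 'UBOOT.ENV' -> b'UBOOT   ENV' (ASCII names only)."""
    base, _, ext = name.upper().partition(".")
    return base.ljust(8)[:8].encode("ascii") + ext.ljust(3)[:3].encode("ascii")


if __name__ == "__main__":
    print(f"{lfn_checksum(pad_8_3('UBOOT.ENV')):02x}")
```

For UBOOT.ENV this yields 48, which is one of the two values fsck prints in its checksum message, so the other value (9a) is the one that does not correspond to this short name.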
Listing files with mtools:
$ mdir -i img
Volume in drive : has no label
Volume Serial Number is F66D-CA4C
Directory for ::/
uboot env 8 2018-05-14 14:22
FSCK0000 000 8 2018-05-14 14:22 ļ»æuboot.env
2 files 16 bytes
1 024 000 bytes free
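As a cross-check, the offsets that fsck reports and the free space that mdir reports can be derived from the boot sector fields with the standard FAT12/FAT16 layout arithmetic. The sketch below is illustrative only: the class and field names are hypothetical, and the values are copied from the fsck output above (2048 bytes per cluster at 512 bytes per sector gives 4 sectors per cluster).

```python
from dataclasses import dataclass

DIR_ENTRY_SIZE = 32  # every FAT directory entry (short or LFN) is 32 bytes


@dataclass(frozen=True)
class FatGeometry:
    """Boot sector fields needed for FAT12/FAT16 layout (hypothetical names)."""
    bytes_per_sector: int
    sectors_per_cluster: int
    reserved_sectors: int
    fat_count: int
    sectors_per_fat: int
    root_entries: int
    total_sectors: int

    @property
    def root_dir_sector(self) -> int:
        # The fixed root directory follows the reserved area and all FAT copies.
        return self.reserved_sectors + self.fat_count * self.sectors_per_fat

    @property
    def root_dir_sectors(self) -> int:
        root_bytes = self.root_entries * DIR_ENTRY_SIZE
        return -(-root_bytes // self.bytes_per_sector)  # ceiling division

    @property
    def data_sector(self) -> int:
        return self.root_dir_sector + self.root_dir_sectors

    @property
    def data_clusters(self) -> int:
        # Only whole clusters count; a trailing partial cluster is unusable.
        return (self.total_sectors - self.data_sector) // self.sectors_per_cluster

    @property
    def cluster_bytes(self) -> int:
        return self.bytes_per_sector * self.sectors_per_cluster


img = FatGeometry(512, 4, 1, 2, 2, 512, 2048)
used_clusters = 2  # from "img: 2 files, 2/502 clusters"

if __name__ == "__main__":
    print("root dir at byte", img.root_dir_sector * img.bytes_per_sector)
    print("data area at byte", img.data_sector * img.bytes_per_sector)
    print("data clusters", img.data_clusters)
    print("free bytes", (img.data_clusters - used_clusters) * img.cluster_bytes)
```

With these inputs the sketch gives a root directory at byte 2560, a data area at byte 18944, 502 data clusters and 1024000 free bytes, in line with what fsck and mdir report for the image.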
After mounting the image, the two files appear identically named; only the shortname=win95
mount option makes it possible to tell one from the other.
$ sudo mount -o check=s /dev/loop4 /mnt/tmp
$ ls -l /mnt/tmp
total 4
-rwxr-xr-x 1 root root 8 05-14 16:39 ļ»æuboot.env
-rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
$ sudo umount /dev/loop4
$ sudo mount -o check=s,shortname=win95 /dev/loop4 /mnt/tmp
$ ls -l /mnt/tmp
total 4
-rwxr-xr-x 1 root root 8 05-14 16:39 ļ»æuboot.env
-rwxr-xr-x 1 root root 8 05-14 16:39 UBOOT.ENV
Thanks a lot for doing this! FWIW, I tested the proposed fix to dosfstools https://github.com/dosfstools/dosfstools/pull/83 with your tool and it seems to DTRT, i.e. with the unpatched fsck I don't see the new short name FSCK0000.000 and instead see a confusing uboot.env. With the patched fsck the long name is gone and only FSCK0000.000 is visible.
@mvo (et al). Good job on fixing this, but do we have any timeframe for promoting this from beta? We desperately need this fix in the stable channel.
We will push this as quickly as QA permits. Our QA team is currently testing this on real devices; they are based in a US timezone, so results are not in yet, but I will post an update here as soon as I can.
The new version of the core snap with the fix is in the candidate channel. We plan to release it to stable this Monday (2018-05-21). Please help with testing; ideally we would verify that it fixes the issue for real. However, AIUI there is no way to reproduce this, right? It just happens out of the blue?
@renat2017 That is excellent, please keep me posted. Can you share (maybe privately if there are concerns about leaking information) how to reproduce it? Maybe it gives us further clues into the root cause of the issue.
I tried to reproduce it with a candidate image and the bug didn't appear.
The test was a little bit different, though. I created an image with an outdated pi2 kernel added via the --extra-snaps argument and tried to update only the pi2 kernel snap, whereas our issue occurred when snapd was updating four snaps: kernel, core, gadget and our software snap.
@mvo - did this get promoted to snapd Stable today? I don't have a device to test with here as I'm traveling.
The workaround has been in stable for a while. We have now also got a reply from upstream, and there is a fix there as well. I created Bug #1776523 “When repairing files with long filenames can creat...” (dosfstools package, Ubuntu) so that we can SRU the fixed fsck.vfat into Ubuntu Core. Ideally we would have it in edge/beta for a while. Is that something that you can test? The upstream diff looks fine, but nothing beats real-world testing.
Ping @renat2017.
Generally speaking, things seem to have been resolved by the prior fix, but as far as I understand, this is the root-cause fix, whereas the other one was the band-aid.