Snapd causes corruption on upgrade

mvo · May 11, 2018, 3:05pm

A new core release 2.32.8 with the workaround is now in the beta channel.

mvo · May 11, 2018, 3:07pm

@kyleN We have a workaround now and we think we are good there, but the underlying bug (I suspect in uboot) is still not fully understood. We will discuss in the next team meeting who will look at it. The situation should be stable with the fix, i.e. it should stop the catastrophic failures we are currently seeing, and machines will continue to refresh.

mborzecki · May 15, 2018, 5:29am

I’ve created a small tool that can parse FAT and edit directory entries, both regular and LFN ones. For now it works with FAT12/FAT16. I have done only limited testing with FAT32 and I’d guess it does not work yet. All in all, because of the legacy stuff, FAT is super weird to parse, with many corner cases. Here’s the code:

So far I have been able to generate an image, then forcefully switch the LFN of one entry so that it conflicts with another entry that has an identical short name. Running fsck on such an image:

$ /snap/core/current/sbin/fsck.vfat -av img
fsck.fat 3.0.28 (2015-05-16)
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "mkfs.fat"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
      2048 bytes per cluster
         1 reserved sector
First FAT starts at byte 512 (sector 1)
         2 FATs, 12 bit entries
      1024 bytes per FAT (= 2 sectors)
Root directory starts at byte 2560 (sector 5)
       512 root directory entries
Data area starts at byte 18944 (sector 37)
       502 data clusters (1028096 bytes)
32 sectors/track, 64 heads
         0 hidden sectors
      2048 sectors total
Checksum in long filename part wrong (48 vs. expected 9a).
  Not auto-correcting this.
Wrong checksum for long file name "uboot.env".
  (Short name UBOOT.ENV may have changed without updating the long name)
  Not auto-correcting this.
/UBOOT.ENV
  Duplicate directory entry.
  First    Size 8 bytes, date 16:22:00 maj 14 2018
  Second   Size 8 bytes, date 16:22:02 maj 14 2018
  Auto-renaming second.
  Renamed to FSCK0000.000
Reclaiming unconnected clusters.
Performing changes.
img: 2 files, 2/502 clusters
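
For readers wondering what the "checksum in long filename part" is: each LFN directory slot stores a one-byte checksum of the 11-byte short (8.3) name it belongs to, which is how fsck notices that a long name and a short name no longer match. Below is a minimal sketch of the commonly documented rotate-and-add algorithm. It is not taken from the tool mentioned above, and the function name `lfn_checksum` is made up for illustration.

```python
def lfn_checksum(short_name: bytes) -> int:
    """Checksum of an 11-byte, space-padded 8.3 name, as stored in LFN slots."""
    if len(short_name) != 11:
        raise ValueError("expected the 11-byte space-padded 8.3 name")
    checksum = 0
    for byte in short_name:
        # Rotate the running sum right by one bit, then add the next byte,
        # keeping only the low 8 bits.
        rotated = ((checksum & 1) << 7) | (checksum >> 1)
        checksum = (rotated + byte) & 0xFF
    return checksum


if __name__ == "__main__":
    # "UBOOT.ENV" is stored on disk as "UBOOT" padded to 8 bytes plus "ENV".
    print(f"{lfn_checksum(b'UBOOT   ENV'):02x}")  # prints 48
```

For the short name UBOOT.ENV this gives 0x48, which is one of the two values in the fsck message above.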

Listing files with mtools:

$ mdir -i img
 Volume in drive : has no label
 Volume Serial Number is F66D-CA4C
Directory for ::/

uboot    env         8 2018-05-14  14:22 
FSCK0000 000         8 2018-05-14  14:22  uboot.env
        2 files                  16 bytes
                          1 024 000 bytes free
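
As a sanity check, the offsets in the fsck boot sector dump and the free space reported by mdir can be reproduced with a few lines of arithmetic. This is only a sketch: the constants are copied from the output above, the 32-byte directory entry size is the standard FAT value, and the function name `fat_layout` is hypothetical.

```python
BYTES_PER_SECTOR = 512
BYTES_PER_CLUSTER = 2048
RESERVED_SECTORS = 1
NUM_FATS = 2
BYTES_PER_FAT = 1024
ROOT_ENTRIES = 512
DIR_ENTRY_SIZE = 32      # every FAT directory entry is 32 bytes
DATA_CLUSTERS = 502
USED_CLUSTERS = 2        # "img: 2 files, 2/502 clusters"


def fat_layout() -> dict:
    """Byte offsets of the FAT12/FAT16 regions for the image above."""
    first_fat = RESERVED_SECTORS * BYTES_PER_SECTOR
    root_dir = first_fat + NUM_FATS * BYTES_PER_FAT
    data_area = root_dir + ROOT_ENTRIES * DIR_ENTRY_SIZE
    return {
        "first_fat": first_fat,
        "root_dir": root_dir,
        "data_area": data_area,
        "data_area_sector": data_area // BYTES_PER_SECTOR,
        "data_bytes": DATA_CLUSTERS * BYTES_PER_CLUSTER,
        "free_bytes": (DATA_CLUSTERS - USED_CLUSTERS) * BYTES_PER_CLUSTER,
    }


if __name__ == "__main__":
    for name, value in fat_layout().items():
        print(f"{name}: {value}")
```

The results line up with the tools: the root directory at byte 2560, the data area at byte 18944 (sector 37), and 1 024 000 bytes free.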

After mounting the image, the two files appear identically named; only mounting with shortname=win95 makes it possible to distinguish one from the other.

$ sudo mount -o check=s /dev/loop4 /mnt/tmp
$ ls -l /mnt/tmp 
total 4
-rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
-rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
$ sudo umount /dev/loop4                   
$ sudo mount -o check=s,shortname=win95 /dev/loop4 /mnt/tmp
$ ls -l /mnt/tmp                                           
total 4
-rwxr-xr-x 1 root root 8 05-14 16:39 uboot.env
-rwxr-xr-x 1 root root 8 05-14 16:39 UBOOT.ENV
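
To illustrate the difference between regular and LFN directory entries, here is a rough sketch of how a single 32-byte entry can be decoded. It is not the tool mentioned above; `decode_entry` and the returned dictionary keys are hypothetical, and the field offsets follow the commonly documented FAT layout (attribute byte at offset 11, LFN checksum at offset 13, name characters split across three UTF-16LE runs).

```python
import struct

ATTR_LFN = 0x0F  # read-only | hidden | system | volume label marks an LFN slot


def decode_entry(raw: bytes) -> dict:
    """Decode one 32-byte FAT directory entry (short entry or LFN slot)."""
    if len(raw) != 32:
        raise ValueError("a FAT directory entry is exactly 32 bytes")
    if raw[11] == ATTR_LFN:
        # 13 UTF-16LE characters spread over three fields of the slot.
        chars = raw[1:11] + raw[14:26] + raw[28:32]
        name = chars.decode("utf-16-le").split("\x00", 1)[0]
        return {
            "kind": "lfn",
            "seq": raw[0] & 0x1F,          # position of this slot in the name
            "last": bool(raw[0] & 0x40),   # set on the final slot of a name
            "checksum": raw[13],           # checksum of the matching short name
            "name": name,
        }
    base = raw[0:8].decode("ascii", "replace").rstrip()
    ext = raw[8:11].decode("ascii", "replace").rstrip()
    size = struct.unpack_from("<I", raw, 28)[0]
    return {"kind": "short", "name": base + ("." + ext if ext else ""), "size": size}


if __name__ == "__main__":
    # Hand-built example entries, not bytes read from the image above.
    enc = ("uboot.env" + "\x00" + "\uffff" * 3).encode("utf-16-le")
    lfn = bytes([0x41]) + enc[0:10] + bytes([ATTR_LFN, 0, 0x48]) + enc[10:22] + b"\x00\x00" + enc[22:26]
    short = b"UBOOT   ENV" + bytes([0x20]) + bytes(16) + struct.pack("<I", 8)
    print(decode_entry(lfn))
    print(decode_entry(short))
```

The long name lives only in the LFN slots, while the short entry carries the size and the 8.3 name, so two short entries with the same 8.3 name can end up presenting the same name once mounted.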

mvo · May 15, 2018, 5:58am

Thanks a lot for doing this! Fwiw, I tested the proposed fix to dosfstools https://github.com/dosfstools/dosfstools/pull/83 with your tool and it seems to DTRT, i.e. with the unpatched fsck I don’t see the new short name FSCK0000.000 and instead see a confusing uboot.env. With the patched fsck the long name is gone and only FSCK0000.000 is visible.

vpetersson · May 15, 2018, 7:14am

@mvo (et al). Good job on fixing this, but do we have any timeframe for promoting this from beta? We desperately need this fix in the stable channel.

mvo · May 15, 2018, 8:49am

We will push this as quickly as QA permits. Our QA team is currently testing this on real devices. They are based in a US timezone, so results are not in yet, but I will post an update here as soon as I can.

mvo · May 16, 2018, 5:56pm

The new version of the core snap with the fix is in the candidate channel. We plan to release it to stable this Monday (2018-05-21). Please help with testing; ideally we would verify that it fixes the issue for real. However, AIUI there is no way to reproduce this, right? It just happens out of the blue?

renat2017 · May 17, 2018, 6:58am

@mvo, there is a way to 100% reproduce it in our custom image.

Will try that today.

mvo · May 17, 2018, 11:36am

@renat2017 That is excellent, please keep me posted. Can you share (maybe privately if there are concerns about leaking information) how to reproduce it? Maybe it gives us further clues into the root cause of the issue.

renat2017 · May 18, 2018, 10:23am

I tried to reproduce it with a candidate image and the bug didn’t appear.

The test was a little bit different though. I created an image with an outdated pi2 kernel added using the --extra-snaps argument and tried to update only the pi2 kernel snap. Our issue, however, was happening when snapd was updating 4 snaps: kernel, core, gadget and our software snap.

vpetersson · May 21, 2018, 5:11am

@mvo - did this get promoted to snapd Stable today? I don’t have a device to test with here as I’m traveling.

mvo · June 12, 2018, 4:47pm

The workaround has been in stable for a while. We have now got a reply from upstream too, and there is a fix there as well. I created Bug #1776523 “When repairing files with long filenames can creat...” (dosfstools package, Ubuntu) so that we can SRU the fixed fsck.vfat into Ubuntu Core. Ideally we would have it in edge/beta for a while. Is that something that you would test? The upstream diff looks fine, but nothing beats real-world testing.

vpetersson · June 15, 2018, 9:17am

Ping @renat2017.

Generally speaking, things seem to have been resolved by the prior fix, but as far as I understand, this is the root cause fix, whereas the other one was the band-aid.