Squashfs is a terrible storage format


#1

[Image filenames are relative to https://k3.botanicus.net/tmp/snap/ since the forums prevent me from posting images or links!]

Hi there,

I haven’t touched Linux on the desktop much since ~2007 and was very excited to try out snaps. Ubuntu’s integration is a really wonderful experience – it’s hard to spot the rough edges because as best I can tell they don’t exist. :slight_smile: With one exception…

On a 32GB RAM 8 thread laptop with Samsung 970 Evo NVMe disk, somehow LibreOffice startup performance is perceptually on a par with spinning rust, and in the case of warm page cache, clearly significantly worse than spinning rust. On investigating how this is possible it became apparent Squashfs is the smoking gun, and after further investigations I’m left puzzled by how a filesystem intended for resource-constrained embedded systems has become deeply ‘embedded’ in Snap’s design, where it plays the role of constraining resources on much bigger machines!

During sleuthing I found Ubuntu bug #1636847 and was surprised to learn that in order to cope with pushing squashfs into a lesser-explored realm where it burns memory, Ubuntu kernels as of 18.10 (>2 years later!) are built with squashfs decompression single-threaded with respect to each filesystem. No wonder the computer feels like spinning rust while loading LibreOffice - the most expensive squashfs computation is serialized across all cores, and I expect this implies the NVMe drive’s parallelism is also being completely wasted.

Test Setup

I made some profiles of LibreOffice startup in various configurations. Each invocation uses libreoffice macro:/// to run a standard library macro that opens a Writer document, waits 500ms, then exits the program.

The cold boot case is with sync; echo 3 > /proc/sys/vm/drop_caches, while the warm boot case is the same invocation run immediately after a cold boot measurement. The scripts to recreate these measurements are included at the bottom.
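
The pairing can be sketched as a tiny wrapper (a hypothetical helper, not one of the scripts below; `drop_caches` needs root, and without it the "cold" run is merely lukewarm):

```shell
#!/bin/bash
# cold_warm.sh CMD... - time CMD once cold, then once warm.
cmd=${*:-true}                        # command under test; no-op by default
sync                                  # flush dirty pages so eviction is complete
if [ "$(id -u)" -eq 0 ]; then
  echo 3 > /proc/sys/vm/drop_caches   # evict page cache, dentries and inodes
fi
time $cmd                             # cold measurement
time $cmd                             # warm: run 1 populated the page cache
```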

Real IO counts were measured separately with iostat /dev/nvme0n1.

Test system: Ubuntu 18.10 on a Dell XPS 9550.

Cold Boot (LibreOffice Apt)

 Performance counter stats for 'system wide':

       2483.663066      task-clock (msec)         pstat-15446 #    0.768 CPUs utilized          
             5,841      context-switches          pstat-15446 #    0.002 M/sec                  
                47      cpu-migrations            pstat-15446 #    0.019 K/sec                  
            48,623      page-faults               pstat-15446 #    0.020 M/sec                  
     6,885,944,560      cycles                    pstat-15446 #    2.772 GHz                      (99.60%)
     9,685,864,191      instructions              pstat-15446 #    1.41  insn per cycle           (99.60%)
     2,044,567,580      branches                  pstat-15446 #  823.207 M/sec                    (99.60%)
        32,418,805      branch-misses             pstat-15446 #    1.59% of all branches          (99.60%)

       3.232673706 seconds time elapsed

2.23user 0.35system 0:03.34elapsed 77%CPU (0avgtext+0avgdata 239556maxresident)k
554248inputs+3024outputs (1847major+47931minor)pagefaults 0swaps

This is 3.2 seconds to fault in 272MiB from disk, open a document, wait half a second, then fully close – about the performance I’d expect from an NVMe machine.

Warm Boot (LibreOffice Apt)

 Performance counter stats for 'system wide':

       2307.322024      task-clock (msec)         pstat-15527 #    0.954 CPUs utilized          
             1,599      context-switches          pstat-15527 #    0.693 K/sec                  
                48      cpu-migrations            pstat-15527 #    0.021 K/sec                  
            46,829      page-faults               pstat-15527 #    0.020 M/sec                  
     6,670,519,294      cycles                    pstat-15527 #    2.891 GHz                      (99.85%)
     9,352,629,215      instructions              pstat-15527 #    1.40  insn per cycle           (99.85%)
     1,982,135,931      branches                  pstat-15527 #  859.063 M/sec                    (99.85%)
        31,260,595      branch-misses             pstat-15527 #    1.58% of all branches          (99.85%)

       2.419333197 seconds time elapsed

2.14user 0.23system 0:02.49elapsed 95%CPU (0avgtext+0avgdata 239532maxresident)k
0inputs+2264outputs (0major+47942minor)pagefaults 0swaps

A slight improvement over the cold run – 2.4 seconds, no IO-triggering major faults, and iostat reports just 367KiB read from disk, as might be expected.

Cold Boot (LibreOffice Snap)

And now for current Snap… :frowning:

 Performance counter stats for 'system wide':

      11770.091563      task-clock (msec)         pstat-15075 #    0.996 CPUs utilized          
            19,659      context-switches          pstat-15075 #    0.002 M/sec                  
               291      cpu-migrations            pstat-15075 #    0.025 K/sec                  
            70,871      page-faults               pstat-15075 #    0.006 M/sec                  
    37,918,835,347      cycles                    pstat-15075 #    3.222 GHz                      (99.79%)
    50,906,529,193      instructions              pstat-15075 #    1.34  insn per cycle           (99.79%)
     7,504,717,974      branches                  pstat-15075 #  637.609 M/sec                    (99.79%)
       472,008,620      branch-misses             pstat-15075 #    6.29% of all branches          (99.79%)

      11.816502575 seconds time elapsed

4.68user 7.20system 0:11.89elapsed 99%CPU (0avgtext+0avgdata 289080maxresident)k
395876inputs+3000outputs (2455major+69633minor)pagefaults 0swaps

11.8 seconds to pull in 222MiB (despite the compression, barely any IO reduction over plain uncompressed ext4), and CPU use has exploded – almost 6x as many cycles as the plain filesystem. Thank goodness I’m plugged into AC!

Warm Boot (LibreOffice Snap)

But this is a cost only paid on cold cache, right?

 Performance counter stats for 'system wide':

       5017.110589      task-clock (msec)         pstat-15259 #    1.027 CPUs utilized          
             2,727      context-switches          pstat-15259 #    0.544 K/sec                  
               103      cpu-migrations            pstat-15259 #    0.021 K/sec                  
            68,444      page-faults               pstat-15259 #    0.014 M/sec                  
    15,923,433,894      cycles                    pstat-15259 #    3.174 GHz                      (99.90%)
    20,389,071,208      instructions              pstat-15259 #    1.28  insn per cycle           (99.90%)
     3,764,918,161      branches                  pstat-15259 #  750.416 M/sec                    (99.90%)
        43,101,749      branch-misses             pstat-15259 #    1.14% of all branches          (99.90%)

       4.883716615 seconds time elapsed

4.76user 0.35system 0:04.95elapsed 103%CPU (0avgtext+0avgdata 295588maxresident)k
8inputs+2432outputs (0major+69637minor)pagefaults 0swaps

We still manage to burn 4.8 seconds while pulling in only 738KiB from disk, and look at the cycles – almost 3x the cold-cache run on the regular filesystem, and still almost half the work of a cold-cache squashfs run. Even after paying the battery cost of gobs of RAM for cache in a laptop, the CPU burns brightly every time a Snap application is opened.

Squashfs Confirmation

I used perf record to try and confirm the extra runtime was spent in squashfs, but the results weren’t entirely conclusive: for example the Snap traces show much higher futex() time, but this may just be due to the extra delays introduced in the run.

Here is the Snap cold boot case recorded using prec.sh, clearly showing Squashfs responsible for almost all the runtime: snap-cold-boot-perf.png

Despite caching, Snap warm boot still shows decompression accounting for >11% of samples: snap-warm-boot-perf.png

Finally, for completeness, a cold boot run of the Apt package – note how filemap_fault is active only 1.70% of the time compared to 38.8% in the Snap run, despite the Apt run completing almost 5x faster: apt-cold-boot-perf.png

Background

According to the snap format documentation, which is the closest I could find to a design document:

This method delivers fast and extremely predictable installations, with no installation remnants, and no way for a snap’s content to be mutated or interfered with. A snap is either installed and available as originally built, or it is not available at all.

It seems the idea is that since the wire transfer format matches that of the storage format on the end system, no extraction step is necessary, and users are less likely to tinker with the opaque squashfs blob downloaded by the system.

This sounds great, except that squashfs provides no immutability guarantees, nor was it ever designed to – users can easily mksquashfs modded images, or even more easily bind- or union-mount their edits on top of snapd’s repository.

Finally, it seems very apparent that Snap is optimizing for entirely the wrong thing – software installations happen once, whereas reboots and program restarts happen constantly. Even with page cache assistance, squashfs still takes more than twice as long as straight ext4. Snap is not even amortizing the cost of some expensive operation over the first program run; it repays the same cost over and over again, making every package and invocation less responsive and burning battery in perpetuity.

Conclusions / Suggestions

Measured on a macro scale this design worsens usability for everyone - and not just while Snap is running, but at all times when any Snap-packaged software is active.

Squashfs might be a reasonable choice of transfer format, but it was never intended as a storage format in this use case. It might be possible to improve squashfs performance over time - at the expense of more cache or computation (and therefore energy) - but it is not possible to work around a more fundamental misstep:

Compression has existed in desktop operating systems for more than 3 decades, but since the mid-90s, as storage densities and transfer rates exploded, it has never been the case that compression was a mandatory feature – and for good reason: it almost always hurts performance, as demonstrated here. There are few examples of compressed containers helping runtime on a macro scale, and where they do, it is for bizarre reasons (one example is Python’s ZIP importer, which speeds up some scenarios by avoiding system calls!).

Compression is better left as a choice to the end user, where they can decide if the processing/energy/battery/latency tradeoff is worth it in their particular scenario. Linux has good support for this already at least in the form of btrfs.

Logical immutability is a noble design goal, but it can only ever be a logical abstraction. A squashfs blob is a low-hanging-fruit form of obfuscation at best – it is specifically no better (and IMHO, for this application, much worse) than ext4 chattr +i immutable attributes, or a read-only bind mount of /snap over itself, schemes which are much easier to implement and perfectly match the design intention of those VFS/filesystem features.
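
For illustration, the bind-mount scheme can be demonstrated on a scratch directory (swap in /snap for the real thing; the chattr +i variant is analogous). Note the read-only bind is a two-step idiom – `-o ro` passed with the initial bind is silently ignored by many kernels:

```shell
#!/bin/bash
# Sketch: lock a directory read-only with plain VFS features, no squashfs.
dir=$(mktemp -d)
if [ "$(id -u)" -eq 0 ] && mount --bind "$dir" "$dir" 2>/dev/null; then
  mount -o remount,bind,ro "$dir"     # second step makes the bind read-only
  touch "$dir/x" 2>/dev/null && status=writable || status=locked
  umount "$dir"
else
  status=needs-root                   # skipped without CAP_SYS_ADMIN
fi
echo "$status"
```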

Strong integrity checking may become a real requirement in future, but the kernel has dedicated subsystems for that, developed by some very smart people – for example dm-verity/fs-verity as employed on Android.

One suggestion would be to trade the complexity of managing many filesystem mounts for a streamy installer – one that extracts e.g. the squashfs image as it arrives off the network, hiding UI latency just as amortizing the decompression cost over every program invocation does today. Even a mid-90s disk could keep up with streamy extraction of a filesystem at typical WAN link speeds that exist today. A similar trick can be played during uninstall – make a journal entry, and asynchronously delete while the user continues working.
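
The streamy idea, simulated locally (a real installer would read off the network rather than from `cat`, and all paths here are made up):

```shell
#!/bin/bash
# Extraction overlaps transfer: bytes are unpacked as they come off the pipe,
# so the extraction latency hides inside download time.
set -e
src=$(mktemp -d); dst=$(mktemp -d)
echo "payload" > "$src/app.bin"
tar -C "$src" -czf "$src/pkg.tgz" app.bin     # what the store would serve
cat "$src/pkg.tgz" | tar -C "$dst" -xzf -     # extract straight off the "wire"
cat "$dst/app.bin"                            # prints "payload"
```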

A halfway solution is possible too – by making the snapd backend configurable, so that users who care about performance can disable the currently mandatory and extraordinarily wasteful compression.

Thanks!

pstat.sh

Isolate the script process in a new perf_event cgroup, then run perf stat selecting processes in that cgroup.

#!/bin/bash
set -ex
cgname=pstat-$$
sudo mkdir /sys/fs/cgroup/perf_event/$cgname
echo $$ | sudo tee /sys/fs/cgroup/perf_event/$cgname/cgroup.procs
mv perf.data perf.data.prev || true
iostat /dev/ssd    # block-device counters before the run...
/usr/bin/time perf stat -ae task-clock,context-switches,cpu-migrations,page-faults,major-faults,minor-faults,cycles,instructions,branches,branch-misses -G $cgname "$@"
iostat /dev/ssd    # ...and after; the delta is the real IO

prec.sh

As above, but record kernel stacks for use with perf report

#!/bin/bash
set -ex
cgname=pstat-$$
sudo mkdir /sys/fs/cgroup/perf_event/$cgname
echo $$ | sudo tee /sys/fs/cgroup/perf_event/$cgname/cgroup.procs
mv perf.data perf.data.prev || true
/usr/bin/time perf record --all-kernel -ae cycles -G $cgname "$@"

LibreOffice macro

Saved to the standard macro library of both LibreOffice installations

Sub Main
    url = ConvertToURL("/home/dmw/snap/libreoffice/current/test.docx")
    oDocument = StarDesktop.LoadComponentFromURL(url, "_blank", 0, Array())
    Wait 500
    StarDesktop.Terminate
End Sub

LibreOffice Snap command line

./pstat.sh snap run libreoffice macro:///Standard.Module1.Main
./prec.sh snap run libreoffice macro:///Standard.Module1.Main

Version:

Name               Version                 Rev   Tracking  Publisher   Notes
libreoffice        6.1.4.2                 100   stable    canonical✓  -

LibreOffice Apt command line

./pstat.sh libreoffice macro:///Standard.Module1.Main
./prec.sh libreoffice macro:///Standard.Module1.Main

Version:

ii  libreoffice                                   1:6.1.3-0ubuntu0.18.10.2                      amd64        office productivity suite (metapackage)

#2

Just wanted to point out your linked reference documents have 2 broken links on the perf.data.* files.

Other than that, this seems like a really well-thought-out critique with excellent data. I think I’ve seen aspects of these performance issues myself and will try to find time to dig in more deeply as well.


#3

Hi Alan :slight_smile: Those files are probably useless anyway without the exact kernel and binaries I have locally, dumping to a web server was just because Discourse considered the post spam.

Noticed one mistake since posting – the ‘raw IO’ numbers include more than just the squashfs, so squashfs’s IO reduction may be better than reported (other uncompressed data may be pulled in from /usr/lib, /usr/bin/time, etc.).

I couldn’t think of a quick way to test for the presence/absence of IO parallelism, but it should be possible, perhaps even just by using iostat / block device stats as a proxy.


#4

I wonder how much effect the choice of compression algorithm has here? If you repacked the LO snap using simple deflate compression rather than xz, it might remove the CPU bottleneck.
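
A repack along those lines might look like this (sketch only: a synthetic tree stands in for the real snap, since its on-disk path varies, and the whole thing is skipped when squashfs-tools isn’t installed):

```shell
#!/bin/bash
# Round-trip an xz squashfs through mksquashfs with gzip (deflate) instead.
if command -v mksquashfs >/dev/null && command -v unsquashfs >/dev/null; then
  mkdir -p tree && seq 200000 > tree/data         # stand-in for the LO payload
  mksquashfs tree orig.snap -comp xz -noappend >/dev/null
  unsquashfs -d unpacked orig.snap >/dev/null     # unpack the original image
  mksquashfs unpacked regz.snap -comp gzip -noappend >/dev/null
  echo "xz: $(stat -c %s orig.snap) B, gzip: $(stat -c %s regz.snap) B"
  status=done
else
  status=skipped                                  # squashfs-tools not installed
fi
```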


#5

It will certainly help! But note any improvement here is likely to have an adverse effect on image size, making it less suitable as a transfer format (AIUI no compression is applied to the squashfs on the wire?).

This is a classic time-space tradeoff… app runtime wants fast random seeks, compression wants large chunks to share a dictionary. Faster seeks = smaller compressed blocks = bigger images.

I noticed the 477MB squashfs for LibreOffice on my machine drops to almost 300MB when compressed directly with tar+xz (where one large dictionary covers the entire stream). So the existing solution is not quite ideal for transfer either – that is quite a big difference even for reasonably modern WAN connections (my home is 3.5MB/sec, translating to a ~40 second difference).

I don’t believe there is any good solution that does not trade runtime performance for transfer efficiency, but would love to be proven incorrect :slight_smile:


#6

Another avenue that might be worth exploring, if the format allows it, is arranging the image to have better locality of access, e.g. recording access order using blktrace or similar, and packing the image to minimize blocks requiring decompression to open the files/extents used in a ‘normal’ run. Naturally the order could be config dependent (e.g. due to i18n), but an 80% solution might help a lot
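
mksquashfs already has a hook in this direction: a `-sort` file of `path priority` lines (priority −32768..32767, higher packed first). A sketch with made-up file names and priorities – depending on the mksquashfs version, the paths may need the source-directory prefix:

```shell
#!/bin/bash
# Pack 'hot' files first so a normal run touches a contiguous compressed span.
if command -v mksquashfs >/dev/null; then
  mkdir -p tree && echo hot > tree/soffice.bin && echo cold > tree/help.dat
  printf 'soffice.bin 32767\nhelp.dat -100\n' > order.sort
  mksquashfs tree sorted.snap -sort order.sort -noappend >/dev/null \
    && status=packed
else
  status=skipped                      # squashfs-tools not installed
fi
echo "${status:-failed}"
```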


#7

Thanks for this very detailed analysis! (I wouldn’t quite agree with the title though, it’s a bit too strong for my taste.)

Looking at the performance: we did some work recently on how much the compression influences performance (and also parameters like dictionary size etc.), and came to the conclusion that choosing a different compression (like lzo or zstd) will help with start-up performance quite a bit. But there is a cost in size if we do it. So it is indeed a classic trade-off. I hope we can publish some actual numbers/scripts soon (the scripts need a bit of cleanup first :slight_smile:).

So thanks for bringing this up, and we will look into improving the performance.


#8

It’s possible that, as a new user, this isn’t obvious to me, but Squashfs seems superfluous to Snap’s value proposition of painless distro-agnostic app packaging, introducing hard problems that other systems never needed to solve. The aforementioned tradeoff seems unavoidable while the storage and transfer formats are identical – one of download experience or runtime performance will always be subpar. With today’s LibreOffice both are subpar – cold boot is ~10x slower despite a transfer size already 25% larger than a traditional archive.

When fully trading transfer efficiency for performance – say by storing 4KiB LZO-compressed sectors everywhere (and shipping an almost 2GiB LibreOffice image) – current-gen commodity PCIe storage is so efficient that it will likely still be possible to encounter real-world scenarios where compression overhead remains measurable, perhaps not least in the complex packing/encoding scheme used for directories in Squashfs.

This is a completely optional tradeoff that need not be made, and for that reason I’d like to reaffirm the choice of ‘terrible’ :slight_smile: On-demand decompression has not been widely deployed since the mid-90s due to this differential, storage advancements in the past decade have made it even less practical than at any time in the past, and future lower latency pmem-based storage generations (like Optane) do little to help the matter.

I promise to not labour this point any more, but it is one that is important enough that it’s worth making twice.

Thanks again


#9

Upstream kernels since Nov 2017 support zstd, but 18.10’s mksquashfs doesn’t seem to be built with it enabled. Disk usage would worsen, but decompression may be far better:

| Method         | Ratio | Compression MB/s | Decompression MB/s |
|----------------|-------|------------------|--------------------|
| xz             |  3.43 |              5.5 |                 35 |
| xz 256 KB      |  3.53 |              5.4 |                 40 |
| zstd 10        |  3.01 |               41 |                225 |
| zstd 15        |  3.13 |             11.4 |                224 |
| zstd 16 256 KB |  3.24 |              8.1 |                210 |
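
Whether a given mksquashfs build has zstd can be probed from its usage text, and if present the invocation would presumably look like the following (the `-Xcompression-level` flag and the tree here are assumptions for illustration):

```shell
#!/bin/bash
# Probe for zstd support in this mksquashfs build, then make a test image.
if mksquashfs 2>&1 | grep -qi zstd; then
  mkdir -p tree && echo demo > tree/f
  mksquashfs tree zstd.snap -comp zstd -Xcompression-level 15 -noappend >/dev/null
  status=built
else
  status=unsupported        # e.g. the 18.10 build described above
fi
echo "$status"
```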

Some of the most impressive compressors are throughput-oriented, they look amazing in benchmarks but completely fall apart on initialization time, which is again useless for random seeks. I wonder if zstd suffers from this.


#10

Is there any news on this?